Search Results
Showing 1 - 9 of 9
Publication: İlişkisel veri tabanlarında mükerrer kayıtların makine öğrenmesiyle tespiti [Detecting duplicate records in relational databases with machine learning] (Institute of Electrical and Electronics Engineers Inc., 2018-07-05)
Bayrak, Ahmet Tuğrul; Yılmaz, Aykut İnan; Yılmaz, Kemal Burak; Düzağaç, Remzi; Yıldız, Olcay Taner
As the volume of data grows, the number of duplicate records in relational databases grows with it. These duplicates can cause inconsistencies in the reports and analyses that use them. To minimize this problem, our study aims to find duplicate records with machine learning algorithms, using as features the pairwise similarities between records and weights determined by domain expertise. The process detected 28,412 duplicate pairs in 9,301,467 rows of data. Removing these duplicates from the data source makes the data more consistent.

Publication: Regularizing soft decision trees (Springer, 2013)
Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim Ethem
Recently, we proposed a new decision tree family called soft decision trees, in which a node chooses both its left and right children with different probabilities given by a gating function, unlike a hard decision node, which chooses one of the two. In this paper, we extend the original algorithm by introducing local dimension reduction via L-1 and L-2 regularization for feature selection and smoother fitting. We compare our novel approach with the standard decision tree algorithms over 27 classification data sets. We see that both regularized versions have similar generalization ability with less complexity in terms of number of nodes, where L-2 seems to work slightly better than L-1.

Publication: Design and analysis of classifier learning experiments in bioinformatics: survey and case studies (IEEE Computer Soc, 2012-12)
İrsoy, Ozan; Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim Ethem
In many bioinformatics applications, it is important to assess and compare the performances of algorithms trained from data, so as to draw conclusions that are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification, the basics of experiment design, and statistical tests. We then give the results of our survey over 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design are well understood, such as resampling instead of using a single training set and the use of different performance metrics instead of error, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios that we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology and showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.

Publication: Unsupervised morphological analysis using tries (Springer London, 2012)
Ak, Koray; Yıldız, Olcay Taner
This article presents an unsupervised morphological analysis algorithm to segment words into roots and affixes. The algorithm relies on word occurrences in a given dataset. Target languages are English, Finnish, and Turkish, but the algorithm can be used to segment any word from any language given the wordlists acquired from a corpus consisting of words and word occurrences. In each iteration, the algorithm divides words with respect to occurrences and constructs a new trie for the remaining affixes. Preliminary experimental results on three languages show that our novel algorithm performs better than most of the previous algorithms.

Publication: Parallel univariate decision trees (Elsevier B.V., 2007-05-01)
Yıldız, Olcay Taner; Dikmen, Onur
Univariate decision tree algorithms are widely used in data mining because (i) they are easy to learn and (ii) once trained, they can be expressed in a rule-based manner. In several applications, mainly in data mining, the dataset to be learned is very large. In those cases it is highly desirable to construct univariate decision trees in reasonable time. This may be accomplished by parallelizing univariate decision tree algorithms. In this paper, we first present two different univariate decision tree algorithms, C4.5 and the univariate linear discriminant tree. We show how to parallelize these algorithms in three ways: (i) feature based, (ii) node based, and (iii) data based. Experimental results show that the performance of the parallelizations depends highly on the dataset and that the node-based parallelization demonstrates good speedups.

Publication: Eigenclassifiers for combining correlated classifiers (Elsevier Science Inc, 2012-03-15)
Ulaş, Aydın; Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim Ethem
In practice, classifiers in an ensemble are not independent. This paper is the continuation of our previous work on ensemble subset selection [A. Ulas, M. Semerci, O.T. Yildiz, E. Alpaydin, Incremental construction of classifier and discriminant ensembles, Information Sciences, 179 (9) (2009) 1298-1318] and has two parts. First, we investigate the effect of four factors on correlation: (i) algorithms used for training, (ii) hyperparameters of the algorithms, (iii) resampled training sets, and (iv) input feature subsets. Simulations using 14 classifiers on 38 data sets indicate that hyperparameters and overlapping training sets have a higher effect on positive correlation than features and algorithms. Second, we propose postprocessing before fusing, using principal component analysis (PCA) to form uncorrelated eigenclassifiers from a set of correlated experts. Combining the information from all classifiers may be better than subset selection, where some base classifiers are pruned before combination, because using all allows redundancy.

Publication: Quadratic programming for class ordering in rule induction (Elsevier Science BV, 2015-03-01)
Yıldız, Olcay Taner
Separate-and-conquer type rule induction algorithms such as Ripper solve a K > 2 class problem by converting it into a sequence of K - 1 two-class problems. As a usual heuristic, the classes are fed into the algorithm in the order of increasing prior probabilities. Although the heuristic works well in practice, there is much room for improvement. In this paper, we propose a novel approach to improve this heuristic. The approach transforms the ordering search problem into a quadratic optimization problem and uses the solution of the optimization problem to extract the optimal ordering. We compared the new Ripper (guided by the ordering found with our approach) with the original Ripper (guided by the heuristic ordering) on 27 datasets. Simulation results show that our approach produces rulesets that are significantly better than those produced by the original Ripper.

Publication: Aynı oteli temsil eden farklı kayıtlar için akıllı eşleştirme [Smart matching for different records representing the same hotel] (Institute of Electrical and Electronics Engineers Inc., 2019-09)
Bayrak, Ahmet Tuğrul; Özbek, Eyüp Erkan; Kestepe, Sedat; Yıldız, Olcay Taner
In the tourism sector, where the number of hotels grows every day, intermediary companies cannot work with every hotel individually, so they work with service providers that have agreements with many hotels around the world. Hotel records obtained from different service providers may contain repeated hotel data. These repeated records may carry identical information, or they may represent the same hotel despite carrying different information. To make hotel data consistent, records that represent the same hotel must be matched. To this end, we worked on hotel records and applied machine learning algorithms that use the similarities of categorical and visual data for candidate records that had undergone address enrichment and preprocessing. As a result, 14,985 hotels in a hotel data set of 132,287 rows were matched with 99.12% accuracy.

Publication: Statistical tests using hinge/ε-sensitive loss (Springer-Verlag, 2013)
Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim Ethem
Statistical tests used in the literature to compare algorithms use the misclassification error, which is based on the 0/1 loss, for classification and the square loss for regression. Kernel-based support vector machine classifiers (regressors), however, are trained to minimize the hinge (ε-sensitive) loss, and hence they should not be assessed or compared in terms of the 0/1 (square) loss but with the loss measure they are trained to minimize. We discuss how the paired t test can use the hinge (ε-sensitive) loss and show in our experiments that, by doing so, we can detect differences that the test on error cannot detect, indicating higher power in distinguishing between the behavior of kernel-based classifiers (regressors). Such tests can be generalized to compare L > 2 algorithms.
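
As a rough illustration of the idea in the last abstract, the sketch below computes a paired t statistic from per-fold hinge losses rather than per-fold error rates. It is not the authors' implementation: the function names, the K-fold setup, and the toy numbers are assumptions made for this example, and only the standard hinge loss max(0, 1 - y f(x)) with labels in {-1, +1} is covered (the ε-sensitive case for regression is left out).

```python
import math
from statistics import mean, stdev


def hinge_loss(y_true, scores):
    """Mean hinge loss max(0, 1 - y * f(x)) over one validation fold.

    y_true holds labels in {-1, +1}; scores holds real-valued classifier outputs.
    """
    return mean(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores))


def paired_t_statistic(losses_a, losses_b):
    """Paired t statistic over per-fold loss differences.

    losses_a[i] and losses_b[i] are the losses of two algorithms on the same
    fold i. The result has K - 1 degrees of freedom for K folds.
    Assumes the differences are not all identical (otherwise stdev is zero).
    """
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    k = len(diffs)
    return mean(diffs) * math.sqrt(k) / stdev(diffs)


# Hypothetical toy data: three folds, two classifiers scored on the same folds.
folds_y = [[1, -1, 1], [1, 1, -1], [-1, -1, 1]]
scores_a = [[2.0, -0.5, 0.0], [0.2, 1.5, -0.1], [-0.3, 0.4, 0.9]]
scores_b = [[1.5, -1.2, 0.8], [0.9, 1.1, -1.0], [-1.4, -0.2, 1.3]]

losses_a = [hinge_loss(y, s) for y, s in zip(folds_y, scores_a)]
losses_b = [hinge_loss(y, s) for y, s in zip(folds_y, scores_b)]
t_value = paired_t_statistic(losses_a, losses_b)
print(f"t statistic on hinge loss: {t_value:.3f} (df = {len(folds_y) - 1})")
```

The statistic would then be compared against a t distribution with K - 1 degrees of freedom, in the same way as a paired t test on misclassification error; only the per-fold quantity changes.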












