Search Results

Showing 1 - 6 of 6
  • Publication
    Effective semi-supervised learning strategies for automatic sentence segmentation
    (Elsevier Science BV, 2018-04-01) Dalva, Doğan; Güz, Ümit; Gürkan, Hakan
    The primary objective of the sentence segmentation process is to determine the sentence boundaries in a stream of words output by automatic speech recognizers. Statistical methods developed for sentence segmentation require a significant amount of labeled data, which is time-consuming, labor-intensive, and expensive to prepare. In this work, we propose new multi-view semi-supervised learning strategies for the sentence boundary classification problem using lexical, prosodic, and morphological information. The aim is to find effective semi-supervised machine learning strategies when only small sets of sentence boundary labeled data are available. We primarily investigate two semi-supervised learning approaches, called self-training and co-training. Different example selection strategies were also used for co-training, namely agreement, disagreement, and self-combined. Furthermore, we propose three-view and committee-based algorithms that incorporate the agreement, disagreement, and self-combined strategies using three disjoint feature sets. We present comparative results of different learning strategies on the sentence segmentation task. The experimental results show that sentence segmentation performance can be substantially improved using the proposed multi-view learning strategies, since the data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies improve the average baseline F-measure from 67.66% to 75.15% for Turkish and from 64.84% to 66.32% for English spoken language when only a small set of manually labeled data is available.
  • Publication
    Extension of conventional co-training learning strategies to three-view and committee-based learning strategies for effective automatic sentence segmentation
    (IEEE, 2018) Dalva, Doğan; Güz, Ümit; Gürkan, Hakan
    The objective of this work is to develop effective multi-view semi-supervised machine learning strategies for the sentence boundary classification problem when only small sets of sentence boundary labeled data are available. We propose three-view and committee-based learning strategies that incorporate co-training algorithms with agreement, disagreement, and self-combined learning strategies using prosodic, lexical, and morphological information. We compare the experimental results of the proposed three-view and committee-based learning strategies to other semi-supervised learning strategies in the literature, namely self-training and co-training with agreement, disagreement, and self-combined strategies. The experimental results show that sentence segmentation performance can be substantially improved using the proposed multi-view learning strategies, since the data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies improve the average performance for both Turkish and English spoken language when only a small set of manually labeled data is available.
  • Publication
    Automatic speech recognition system for Turkish spoken language
    (Işık Üniversitesi, 2012-06-21) Dalva, Doğan; Güz, Ümit; Işık Üniversitesi, Fen Bilimleri Enstitüsü, Elektronik Mühendisliği Yüksek Lisans Programı
    The transmission and storage of speech sounds has been possible for decades. In addition, by using signal processing techniques, it is also possible to process speech signals. By using time and frequency analysis of the speech signal and several machine learning algorithms, it is possible to build a system that recognizes spoken words. Such systems are called Automatic Speech Recognition (ASR) systems. In our work, we have used the Automatic Speech Recognition system for Turkish spoken language that was built by the BUSIM speech group. However, the output of the recognizer is only a list of spoken words. Even for humans, it is very hard to understand a text without punctuation symbols. Hence, to build a more complex recognizer whose goal is to perform topic segmentation and topic summarization, the output of the ASR must first be divided into sentences. Our goal is to build a system that performs this sentence segmentation. In our work, we have used the ASR system to obtain word-level and phoneme-level time marks, and by using those time marks together with the audio files, we have extracted prosodic features. The prosodic properties of speech contain information about the punctuation of the text, which is not available at the output of the ASR system.
  • Publication
    Co-training using prosodic, lexical and morphological information for automatic sentence segmentation of Turkish spoken language
    (Işık Üniversitesi, 2018-01-15) Dalva, Doğan; Güz, Ümit; Işık Üniversitesi, Fen Bilimleri Enstitüsü, Elektronik Mühendisliği Doktora Programı
    Sentence segmentation of speech aims at detecting sentence boundaries in a stream of words output by the speech recognizer. Sentence segmentation is a preliminary step toward speech understanding. It is of particular importance for speech-related applications, as most of the further processing steps, such as parsing, machine translation, and information extraction, assume the presence of sentence boundaries. Typically, statistical methods require a huge amount of manually labeled data, which is time-consuming and labor-intensive to prepare. In this work, novel multi-view semi-supervised learning strategies for the sentence segmentation problem are proposed. The aim of this work is to find effective semi-supervised machine learning strategies when only a small set of sentence boundary labeled data is available. This work proposes three-view co-training and committee-based strategies that incorporate agreement, disagreement, and self-combined strategies using lexical, morphological, and prosodic information, and investigates the performance of the proposed learning strategies against baseline, self-training, and co-training. The experimental results show that the proposed learning strategies substantially improve sentence segmentation performance, since the data sets can be represented by three redundantly sufficient and disjoint feature sets.
  • Publication
    Extraction and comparison of various prosodic feature sets on sentence segmentation task for Turkish broadcast news data
    (IEEE, 2014) Dalva, Doğan; Revidi, İzel D.; Güz, Ümit; Gürkan, Hakan
    In this work, prosodic features of the Turkish Broadcast News (BN) data are extracted using an open-source prosodic feature extraction tool based on Praat. The profiles and effectiveness of these features are also investigated for the sentence segmentation task on the Turkish BN data. We not only used combinations of the feature sets but also collected some of them into a single prosodic feature model in order to achieve one of the best performances. The results of the experiments show that some combinations of the prosodic feature sets are very useful for the automatic sentence segmentation task on the Turkish BN data.
  • Publication
    Extraction of prosodic information for Turkish broadcast news data and its use in sentence segmentation
    (IEEE, 2014-04-23) Dalva, Doğan; Revidi, İzel D.; Güz, Ümit; Gürkan, Hakan
    In this work, prosodic features of Turkish broadcast news data are extracted using open-source software, and the performance of the prosodic feature groups in sentence segmentation is compared on the text obtained from the output of an Automatic Speech Recognition system. In particular, a prosodic feature set with a quite high success rate for the sentence segmentation task is obtained.
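
Several of the abstracts above mention agreement and disagreement example-selection strategies for co-training but do not define them. The sketch below illustrates one common reading, stated here as an assumption rather than the authors' exact method: under agreement, an unlabeled example is selected when both views predict the same label with high confidence; under disagreement, it is selected when the views predict different labels and exactly one of them is confident, in which case the confident view's label is used. The function name `select_examples`, the `(label, confidence)` representation, and the 0.8 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

# A view's classifier is represented only by its output on the unlabeled
# pool: one (predicted label, confidence) pair per example. (Hypothetical
# representation chosen for this sketch.)
Prediction = Tuple[int, float]


def select_examples(
    view_a: List[Prediction],
    view_b: List[Prediction],
    strategy: str,
    threshold: float = 0.8,  # assumed confidence cut-off, not from the source
) -> Dict[int, int]:
    """Return {example index: label} for examples to add to the labeled set."""
    selected: Dict[int, int] = {}
    pairs = zip(view_a, view_b)
    for i, ((label_a, conf_a), (label_b, conf_b)) in enumerate(pairs):
        a_sure = conf_a >= threshold
        b_sure = conf_b >= threshold
        if strategy == "agreement":
            # Both views are confident and predict the same label.
            if label_a == label_b and a_sure and b_sure:
                selected[i] = label_a
        elif strategy == "disagreement":
            # The views predict different labels and exactly one view is
            # confident; the confident view supplies the label.
            if label_a != label_b and a_sure != b_sure:
                selected[i] = label_a if a_sure else label_b
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return selected
```

In a full co-training loop, the selected examples would be moved from the unlabeled pool to the labeled set and both view classifiers retrained, repeating until no further examples qualify; that loop and the self-combined, three-view, and committee-based variants are not shown here.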