2011 Volume 25 Issue 4 Published: 15 August 2011
  

    Review
  • Review
    Muhetaer·Shadike1, 2, LI Xiao1, Buheliqiguli·Wasili3
    2011, 25(4): 3-11.
    This paper presents the design and implementation of an HMM-based sensitive-word spotting system for Uyghur broadcast news. The system exploits three advantages. First, the speech corpus can be very small because the set of sensitive words is finite. Second, words in broadcast news are pronounced clearly and at a regular speaking rate, which benefits recognition. Third, the system uses whole-word models as the basic speech unit, since a whole word has a well-defined acoustic representation, which facilitates locating the beginning and end of each word.
    Key words: Uyghur; broadcast news; keyword spotting; HMM; MATLAB
  • Review
    YIN Jianmin1, DAO Fuxiang2, TANG Jinbao1, YU Kanglong2
    2011, 25(4): 11-16.
    This paper introduces the research content and key technologies of a news site and digital newspaper for Tai Lue. It studies the encoding and presentational character sets, the input method, and the EOT font for Tai Lue. It also discusses page digitization, website publishing, multi-channel news information collection, multimedia resource sharing technologies, and the Chinese news information standards.
    Key words: Tai Lue; news site; digital newspaper; encoding/presentational character sets; input method
  • Review
    ZHOU Guoqiang, CUI Rongyi
    2011, 25(4): 16-20.
    This paper studies Korean text categorization based on a naive Bayesian classifier. First, features are selected by the category selection method and their weights are estimated with TF-IDF. Second, the naive Bayesian classifier is built. Finally, the classifier is applied to Korean text categorization. The experimental results show that the method performs well on Korean text classification and provides a basis for classifying text that mixes Korean and Chinese.
    Key words: Korean; naive Bayesian; text categorization; TF-IDF
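The pipeline in the abstract above (TF-IDF weighting followed by a naive Bayesian classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature selection step is omitted because the abstract does not specify it, the smoothed IDF formula and add-one smoothing are assumptions, and the function names `train` and `classify` are hypothetical. Documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Fit a naive Bayes model on TF-IDF weighted term counts.

    docs: list of token lists; labels: parallel list of class names.
    """
    n_docs = len(docs)
    # Document frequency of each term, used for the IDF part of the weight.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; the abstract gives no formula).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    class_docs = Counter(labels)
    weight = defaultdict(Counter)  # weight[label][term] = summed TF-IDF mass
    for doc, label in zip(docs, labels):
        for term, tf in Counter(doc).items():
            weight[label][term] += tf * idf[term]

    vocab = sorted(idf)
    model = {"prior": {}, "cond": {}, "vocab": set(vocab)}
    for label, count in class_docs.items():
        model["prior"][label] = math.log(count / n_docs)
        # Add-one smoothing over the vocabulary.
        total = sum(weight[label].values()) + len(vocab)
        model["cond"][label] = {
            t: math.log((weight[label][t] + 1.0) / total) for t in vocab
        }
    return model


def classify(model, doc):
    """Return the class with the highest log posterior for a token list."""
    scores = {}
    for label, prior in model["prior"].items():
        cond = model["cond"][label]
        # Terms unseen in training are ignored.
        scores[label] = prior + sum(cond[t] for t in doc if t in model["vocab"])
    return max(scores, key=scores.get)
```

A call such as `classify(train(docs, labels), tokens)` returns the predicted class name. For Korean input, a tokenizer or morphological analyzer would have to run first.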
  • Review
    XU Guixian1,2, XIANG Chuncheng1, WENG Yu1,2, ZHAO Xiaobing1,2, YANG Guosheng1
    2011, 25(4): 20-24.
    This paper presents a simple and fast classification approach for Tibetan web pages. It exploits the class characteristics of terms in web page columns and combines them with web page text extraction to classify Tibetan web pages into predefined classes. The experiments show that the approach achieves high classification accuracy on Tibetan web pages. It is helpful for constructing a high-quality, multi-class Tibetan corpus.
    Key words: Tibetan information processing; text classification; classification of Tibetan pages
  • Review
    LIU Kai1, Wuriliga2, Siqintu2, JIANG Wenbin1, LIU Qun1
    2011, 25(4): 24-30.
    Syntactic parsing plays an important role in natural language processing. However, there has been little research on syntactic parsing of minority languages, including Mongolian, which creates difficulties for related research. This paper presents an unsupervised dependency parsing method for Mongolian based on bilingual constraints. The method can parse Mongolian without a Mongolian treebank or knowledge of Mongolian syntactic characteristics. It achieves 73.3% (undirected) and 66.2% (directed) on the test set, and it is applicable in practice.
    Key words: Mongolian; unsupervised syntax parsing; dependency parsing; bilingual constraint
  • Review
    JIANG Wenbin1,WU Jinxing1,2,Wuriliga1,2,Nasan-urt2, LIU Qun1
    2011, 25(4): 30-35.
    In Mongolian lexical analysis, the directed-graph-based model achieves high performance. This model uses a directed-graph architecture to describe the probabilistic relationships between stems and affixes, and thereby determines the best segmented and tagged candidate for each word according to its context. It is therefore essential for a directed-graph-based analyzer to enumerate all legal segmented and tagged candidates for each word. This paper proposes a novel stem-affix segmentation model for Mongolian lexical analysis based on a discriminative classification method. Compared with the enumeration strategy based on stem and affix sets, this method generalizes better to words with unknown stems. Using the third-level annotated corpus of about 200 000 words as training data, the directed-graph-based lexical analyzer with the discriminative stem-affix segmentation module achieves a further 7% improvement in F1 measure (with unknown stems considered).
    Key words: Mongolian; lexical analysis; POS tagging; stemming; directed graph; discriminative
  • Review
    GONG Zheng, GUAN Gaowa
    2011, 25(4): 35-39.
    In this paper, we first build an initial list of Mongolian stop words with the union entropy (UE) algorithm, and then remove Mongolian entity nouns and homographs from it. Finally, we compare the parts of speech of English and Mongolian stop words to determine the final Mongolian stopword list. We test the Mongolian stopword list and the English stopword list on a document retrieval task. The results show that the stopword list determined by this method yields higher accuracy in Mongolian document retrieval than a list obtained by simply translating English stop words into Mongolian.
    Key words: Mongolian stop word; Mongolian information retrieval; English stop word
  • Review
    LI Xiang1, CAI Zangtai2, JIANG Wenbin1, LV Yajuan1, LIU Qun1
    2011, 25(4): 39-45.
    Sentence boundary identification is a fundamental task in Tibetan information processing. This paper proposes an approach that combines maximum entropy with rules to identify Tibetan sentence boundaries. First, a detector based on a Tibetan boundary vocabulary identifies the ambiguous sentence boundaries. Second, a detector based on a maximum entropy model identifies the ambiguous sentence boundaries that the first detector cannot identify. By applying Tibetan sentence boundary rules, the approach further reduces the number of incorrect sentence boundaries that the maximum entropy model produces because of the sparse, low-quality training corpus. The experiments show that the approach achieves an F1-measure of 97.78%.
    Key words: maximum entropy; sentence boundary identification; Tibetan information processing
  • Review
    QIU Lirong1,2,WENG Yu1,2,ZHAO Xiaobing 1,2
    2011, 25(4): 45-50.
    A semantic ontology is a formal, explicit specification of a shared conceptualization. It provides a shared vocabulary for modeling a domain, that is, the types of objects and/or concepts that exist, together with their properties and relations. Hyponymy is a basic semantic relationship between concepts, and hyponymy patterns are used to acquire concepts for an ontology automatically. In this paper, a hyponymy pattern is represented as a pair that includes a meaning frame defining the information to be extracted in Tibetan. The concept acquisition algorithm and the method for obtaining pattern-matching sentences are then presented. Research on hyponymy relationship patterns can assist concept enrichment in an ontology, which reduces the cost of the ontology engineering process.
    Key words: knowledge acquisition; semantic ontology; concepts acquisition; hyponymic relation
  • Review
    Kahaerjiang·Abiderexiti1, Tuergen·Yibulayin2, YAO Tianfang1, Aishan·Wumaier2, Aishan·Maoliniyazi2
    2011, 25(4): 50-54.
    Uyghur sentence similarity computation plays an important role in Example-Based Machine Translation (EBMT). Because Uyghur is an agglutinative language, words must be stemmed first. This paper presents a method that computes Uyghur sentence similarity after stemming and combines it with a naive sentence structure similarity measure. A small-scale experiment shows that the results are close to human evaluation.
    Key words: Uyghur sentence similarity; EBMT; sentence structure similarity
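The idea of combining a stem-level word similarity with a simple structural similarity can be sketched as follows. The abstract does not define either measure, so everything here is an assumption for illustration: word similarity is a Dice coefficient over stem sets, structure similarity is a normalized longest common subsequence of the stem sequences, the mixing weight `alpha` is arbitrary, and the stemmer is passed in as a caller-supplied function.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sentence_similarity(s1, s2, stem, alpha=0.7):
    """Combine stem overlap and word-order similarity into one score in [0, 1].

    stem: function mapping a surface word to its stem (hypothetical here).
    alpha: weight of the word-level score (assumed value).
    """
    a = [stem(w) for w in s1.split()]
    b = [stem(w) for w in s2.split()]
    if not a or not b:
        return 0.0
    # Word-level similarity: Dice coefficient over the sets of stems.
    sa, sb = set(a), set(b)
    word_sim = 2 * len(sa & sb) / (len(sa) + len(sb))
    # Structure-level similarity: shared word order, normalized by length.
    struct_sim = 2 * lcs_length(a, b) / (len(a) + len(b))
    return alpha * word_sim + (1 - alpha) * struct_sim
```

With an identity stemmer, inflected forms of the same word count as different tokens. Passing a real stemmer makes them match, which is the effect the abstract attributes to stemming.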
  • Review
    SHI Xiaodong1, LU Yajun2
    2011, 25(4): 54-57.
    This paper describes the porting of a Chinese segmentation system to handle Tibetan. The F-measure of the new Yangjin system is above 91% on a test corpus, although the training corpus is relatively small. The paper also describes additional processing, motivated by error analysis, that led to further improvement.
    Key words: Tibetan segmentation; natural language processing; HMM
  • Review
    YU Xin1,2, WU Jian1, HONG Jinling1
    2011, 25(4): 57-63.
    Sentence-level alignment is a basic task in constructing a bilingual parallel corpus. Considering the specific characteristics of the Tibetan language, this paper proposes a dictionary-based method for automatic Chinese-Tibetan sentence alignment. It builds a bilingual dictionary for alignment and evaluates its word coverage. To address the difference in granularity between Chinese and Tibetan word segmentation, the method further looks up the remaining large-granularity Tibetan words in a Tibetan-Chinese dictionary and then matches them against the original Chinese sentence, which increases precision. Experiments show an average precision of 81.11% for this approach.
    Key words: Chinese-Tibetan sentence alignment; dictionary; word segmentation granularity; parallel corpus; Tibetan information processing
  • Review
    ZHOU Maoxian, Toudancairang, CAI Rangjia
    2011, 25(4): 63-66.
    To meet the practical demand for a Tibetan-Chinese-English online dictionary, the Key Laboratory of Tibetan Information Processing of the Ministry of Education designed and implemented such a dictionary using WAMP as the platform. This paper gives the specific design method and the main code of the thesaurus database and the query page. Experiments show that the online dictionary instantly returns the correct trilingual entries (Tibetan, Chinese, and English) for the query entered by the user. The dictionary is based on the B/S (browser/server) structure, and its implementation supports communication and study across Tibetan, Chinese, and English.
    Key words: Tibetan; online dictionary; WAMP; B/S structure; database
  • Review
    Corpus Tashigyal1, GAO Dingguo2
    2011, 25(4): 66-71.
    Large-scale processing of real text has become a hot topic in language information processing. Annotating Tibetan corpora is very important for research on Chinese-Tibetan machine translation, information retrieval, text data mining, and dictionary compilation. To facilitate data exchange and sharing, this paper studies the adoption of TEI encoding for Tibetan corpus annotation, covering both text attribute information and structure information.
    Key words: Tibetan; corpus; TEI mark
  • Review
    ZHAO Xiaobing1,2, QIU Lirong1,2, ZHAO Tiejun3
    2011, 25(4): 71-75.
    A semantic ontology is a formal, explicit specification of a shared conceptualization. It provides a shared vocabulary for modeling a domain, that is, the types of objects and/or concepts that exist, together with their properties and relations. Constructing a semantic ontology knowledge base is a vital phase in language processing, with applications in information retrieval, information extraction, and automatic translation. This paper focuses on the idea of constructing a multi-language ontology with unified criteria and interfaces. It also describes the key problems, such as how to describe the basic rules of the languages based on concept semantic similarity in a multi-language ontology.
    Key words: knowledge base; semantic ontology; dictionary enrichment; ontology learning
  • Review
    WANG Zhiyang1,2, LV Yajuan1, LIU Qun1
    2011, 25(4): 75-82.
    Morphologically rich languages, characterized by complex morphological changes, have huge vocabularies and suffer from serious data sparseness, which poses a great challenge to machine translation. In this paper, we first analyze such languages, represent them at different granularities, and translate each representation separately. Because different granularities capture features of the language at different levels, we integrate the translation hypotheses from the different granularities with a system combination approach to generate better results. Experimental results on Uyghur-Chinese and Mongolian-Chinese translation tasks show that system combination with multiple granularities improves translation performance, with gains of +1.41% and +2.03% in BLEU over the best single system.
    Key words: morphologically rich language; multiple granularities; system combination
  • Review
    LI Jiazheng1, LIU Kai1, Mairehaba·Aili1,2, LV Yajuan1, LIU Qun1, Tuergen·Yibulayin1,2
    2011, 25(4): 82-88.
    Name translation in minority languages is still in its infancy. This paper presents a method for recognizing and translating Chinese names in Uighur text. In addition to the traditional rule-based approach, we use Uighur and Chinese language models to recognize and translate Chinese names in Uighur. On this basis, we add rules and algorithms to handle names with noun affixes and cases the rules do not cover, which improves both translation accuracy and recall. We test the translation system on 1 000 random sentences containing Chinese names. The results show that the accuracy reaches 75.2% and the recall reaches 91.5%.
    Key words: language model; noun affixes; spelling rules; recognition and translation of names
  • Review
    Wangsiriguleng1, Siqintu2, Nasan-urt3
    2011, 25(4): 88-93.
    In phrase-based Chinese-Mongolian statistical machine translation, the translation results contain substantial word order errors. This paper compares the word order of Chinese and Mongolian sentences and proposes a method that reorders Chinese sentences according to Mongolian word order. It then introduces the design of the reordering rules and the reordering algorithm. Finally, the experimental results show that this method improves the performance of the Chinese-Mongolian machine translation system.
    Key words: Chinese-Mongolian statistical machine translation system; reordering; rule
  • Review
    BAO Guilan1, HU He2
    2011, 25(4): 93-101.
    Building on previous experimental studies, this paper discusses co-articulation between adjacent segments in consonant combinations of standard Mongolian by examining spectrograms, electropalatograms, and LCV-graphs. The speech analysis equipment and tools used in this study include KAY's EPG (Model 6300), the 3700 Multi-Speech, and the Mini Speech Lab of Nankai University.
    Key words: standard Mongolian; consonants combination; co-articulation
  • Review
    Monghjaya1, Shandan2
    2011, 25(4): 101-105.
    Mongolian has many words with the same written form but different pronunciations, which poses a challenge for Mongolian information processing. It is therefore essential to resolve such words and determine their codes. This paper mainly discusses how to perform Latin transformation and syllable segmentation for such words in Mongolian.
    Key words: Mongolian; Latin transformation; syllable segmentation
  • Review
    ComponentsI·Dawa 1,2,3, Wushour Slam1,2,Yoshinori Sagisaka3
    2011, 25(4): 105-110.
    This paper describes a free calling speaker recognition system based on a GMM (Gaussian Mixture Model) that uses LPC and pitch information from spontaneous speech. The baseline system in our test uses a GMM with 8 Gaussian mixtures and diagonal covariance matrices, with LPC cepstrum coefficients as the acoustic feature vector. In the proposed approach, the fundamental frequency (f0) is added to the LPC cepstrum for the voiced parts of the speech signal. On speech data from 50 speakers, the speaker recognition rate is 76.97% for the baseline and 80.29% for the proposed approach. This result is close to the 82.34% recognition rate obtained with manually segmented speech data.
    Key words: phone calling speech; speaker recognition system (SRS); LPC cepstrum; voiced speech; GMM
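How a diagonal-covariance GMM is used to score a speaker can be shown with a short sketch. It is illustrative only: it covers scoring, not EM training, and it assumes the feature vectors (for example LPC cepstrum coefficients, optionally with f0 appended) have already been extracted. The function names and the `(weight, mean, var)` model layout are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of feature frames under a GMM.

    gmm: list of (weight, mean, var) tuples, one per mixture component.
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        # Log-sum-exp over components for numerical stability.
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)


def identify(frames, speaker_models):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    return max(
        speaker_models,
        key=lambda s: gmm_log_likelihood(frames, speaker_models[s]),
    )
```

In a system like the one described, each speaker model would have 8 components. Adding f0 simply extends each voiced frame's feature vector by one dimension.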
  • Review
    2011, 25(4): 110-115.
    This paper proposes a method for computing the similarity of Tibetan sentences. The method uses a reverse index over a hashed vocabulary, together with a coarse-selection algorithm based on sentence length, to extract candidate sentences from the corpus rapidly. The reverse index of the hashed vocabulary speeds up the search effectively. A multi-strategy fine-selection algorithm then combines word-shape-based similarity with continuous-word-sequence-based similarity to assess how similar two Tibetan sentences are. The method is validated by experiments.
    Key words: natural language processing; corpus; continuous word series; Tibetan language; sentence similarity
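The two-stage retrieval described above (coarse selection through an index and a length filter, then a finer similarity score) can be sketched as follows. The details are assumptions rather than the paper's algorithm: the length filter uses a hypothetical ratio threshold, exact token overlap stands in for the word-shape similarity, the continuous-sequence score is a normalized longest common run of tokens, and `beta` is an arbitrary mixing weight. Sentences are assumed to be already segmented into tokens.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each word to the ids of the sentences containing it.

    corpus: list of token lists. A Python dict is itself a hash table.
    """
    index = defaultdict(set)
    for sid, tokens in enumerate(corpus):
        for tok in tokens:
            index[tok].add(sid)
    return index


def coarse_select(query, corpus, index, max_len_ratio=2.0, min_shared=1):
    """Return ids of sentences sharing words with the query and of similar length."""
    counts = defaultdict(int)
    for tok in set(query):
        for sid in index.get(tok, ()):
            counts[sid] += 1
    selected = []
    for sid, shared in counts.items():
        n, m = len(corpus[sid]), len(query)
        if shared >= min_shared and max(n, m) <= max_len_ratio * min(n, m):
            selected.append(sid)
    return sorted(selected)


def longest_common_run(a, b):
    """Length of the longest contiguous token sequence shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def fine_score(query, cand, beta=0.5):
    """Mix token overlap with the continuous-sequence score (both in [0, 1])."""
    q, c = set(query), set(cand)
    overlap = 2 * len(q & c) / (len(q) + len(c))
    run = 2 * longest_common_run(query, cand) / (len(query) + len(cand))
    return beta * overlap + (1 - beta) * run
```

Only the sentences returned by `coarse_select` need to be passed to `fine_score`, which is where the speed gain of the index comes from.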
  • Review
    KANG Xuzhen, LI Ru, LI Shuanghong
    2011, 25(4): 115-122.
    The Frame Kernel Dependency Graph, based on the Chinese FrameNet, is adopted to convey a deep semantic understanding of a Chinese sentence. The graph is obtained by extracting the semantic core words of Frame Elements. The identification of these semantic core words is investigated with Conditional Random Fields (CRF), Maximum Entropy, and Support Vector Machine models. Various feature sets for the three models are analyzed, and different feature template settings are compared to select the optimal template and model. Experimental results show that the CRF model performs best, and that further improving its feature template raises the results to some extent. The average precision reaches 97.34% and 94.03% for Frame Elements of simple and complex phrase types, respectively.
    Key words: frame elements; frame kernel dependency graph; conditional random fields; maximum entropy model; support vector machine model
  • Review
    LI Wen 1,2, LI Miao1, LIANG Qing3, ZHU Hai1,2, YING Yulong1,2, Wudabala1
    2011, 25(4): 122-129.
    This paper presents a Mongolian morphological segmentation approach that combines a statistical machine translation method with a minimum constituent-context cost model. Phrase-based statistical machine translation handles in-vocabulary morphological segmentation, and the minimum constituent-context cost model handles out-of-vocabulary segmentation. Three features commonly used in phrase-based statistical machine translation are selected for segmentation: the phrase translation probability, the lexical translation probability, and the language model score. The minimum constituent-context cost model considers the unigram morpheme context and the N-gram suffix context. Experiments show that the morphological segmentation system achieves a precision of 96.94% and that the translation results of the statistical machine translation system are noticeably improved.
    Key words: morphology; morphological segmentation; machine translation; statistical model