2012 Volume 26 Issue 1 Published: 15 February 2012
  

    Review
  • Review
    ZHANG Yangsen,GUO Jiang
    2012, 26(1): 3-9.
    Word sense disambiguation (WSD) has long been a popular but difficult problem in natural language processing. Ensemble methods are considered one of the four major trends in machine learning research. After surveying the machine learning methods applied to Chinese word sense disambiguation, we introduce ensemble classifiers from pattern recognition to this problem and propose a classifier ensemble based on dynamic weight adaptation. Experimental results show that the proposed classifier significantly improves Chinese WSD accuracy.
    Key words: word sense disambiguation; classifier; ensemble classifier; context features
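
The abstract does not say how the dynamic weights are computed. As a minimal sketch of the general idea, the following assumes each base classifier outputs a probability distribution over senses and that its weight on a given instance is its own confidence (its highest sense probability). The function name `dynamic_weighted_vote` and the sense labels are hypothetical, not taken from the paper.

```python
def dynamic_weighted_vote(distributions):
    """Combine per-classifier sense distributions with per-instance weights.

    Assumption (not from the paper): a classifier's weight on this instance
    is its own confidence, i.e. its highest sense probability, so the weights
    change from one instance to the next.
    """
    weights = [max(dist.values()) for dist in distributions]
    total = sum(weights)
    combined = {}
    for weight, dist in zip(weights, distributions):
        for sense, prob in dist.items():
            combined[sense] = combined.get(sense, 0.0) + (weight / total) * prob
    best = max(combined, key=combined.get)
    return best, combined


# Three hypothetical base classifiers scoring one ambiguous word.
outputs = [
    {"bank/finance": 0.90, "bank/river": 0.10},
    {"bank/finance": 0.40, "bank/river": 0.60},
    {"bank/finance": 0.45, "bank/river": 0.55},
]
best_sense, scores = dynamic_weighted_vote(outputs)
print(best_sense, scores)
```

In this toy case a plain majority vote would pick `bank/river` (two of three classifiers), while the confidence-weighted combination favours `bank/finance` because the first classifier is much more certain.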
  • Review
    CHEN Gong, LUO Senlin, CHEN Kaijiang, FENG Yang, PAN Limin
    2012, 26(1): 9-16.
    To address the deficiency of probabilistic context-free grammar (PCFG) in exploiting sentence information during parsing, subsidiary context and lexical information are introduced to propose two structural disambiguation methods based on PCFG. The proposed layered parsing strategy improves both accuracy and comprehensiveness at the cost of efficiency. Experimental results show that the Chinese syntactic parsing model based on subsidiary context and lexical information, which uses more sentence information, disambiguates better than PCFG.
    Key words: Chinese syntactic parsing; probabilistic context-free grammar; subsidiary context; lexical information; layered parsing
  • Review
    ZHANG Ruixia1, YANG Guozeng2, WU Huixin1
    2012, 26(1): 16-22.
    A new HowNet-based measure is proposed for computing the semantic similarity between unknown Chinese words. First, a semantic matching function is constructed according to the YiHeNet; second, nodes in the concept graphs of unknown Chinese words are classified according to their different roles in expressing semantic information; then, the three notions of arcs, node pairs and node pair sets are classified according to the matching functions; finally, similarity measures are designed to compute the similarities of unknown Chinese words, of different node pairs and of different node pair sets. The new measure helps classify the semantic information of unknown words and apply it in the computation, and experiments demonstrate its effectiveness.
    Key words: HowNet; semantic similarity; unknown words; concept graphs
  • Review
    ZHAO Hongmei, LV Yajuan, BEN Guosheng, HUANG Yun, LIU Qun
    2012, 26(1): 22-31.
    The 7th China Workshop on Machine Translation (CWMT2011) Evaluation continues the ongoing series of machine translation technology evaluations in China. This paper presents an overview of the CWMT2011 evaluation, which focuses on translation from other languages into Chinese, especially from ethnic minority languages (Mongolian, Tibetan, Uyghur, Kazakh and Kirghiz). A total of 165 systems from 19 participants at home and abroad took part in the evaluation. The paper introduces the evaluation tasks, the evaluation data, the evaluation procedure and the participants, and discusses the evaluation results in detail. Examples from this evaluation show that the results depend on the following factors: the similarity between the source and target languages, the breadth of the domain covered by the evaluation task, the similarity between the test data and the training/development data, the size of the training data, and the technology and maturity of the participating systems.
    Key words: machine translation; machine translation evaluation; BLEU-SBP; WoodPecker evaluation
  • Review
    FENG Yang1,ZHANG Dongdong2,LIU Qun1
    2012, 26(1): 31-37.
    The relative order of syntactic constituents usually differs across languages, especially for prepositional phrases. Proper treatment of these syntactic order differences is therefore of vital importance to translation quality. The hierarchical phrase-based model learns a formal syntax from a parallel corpus and is capable of long-distance reordering, but it cannot discriminate between syntactic constituents when selecting the correct translation rule. In this paper, we introduce linguistic information into the hierarchical phrase-based model in the form of prepositional phrases so as to better capture their reordering. The method first identifies prepositional phrases with conditional random fields and extracts rules that include prepositional phrases. Under the SCFG defined by these rules, it searches for the best derivation and produces the translation simultaneously. Experiments show that, compared with the hierarchical phrase-based model, our method achieves an absolute improvement of 0.8 BLEU points on our in-house English-Chinese test set and 0.5 BLEU points on the NIST 2008 English-Chinese test set.
    Key words: statistical machine translation; hierarchical phrase-based translation; prepositional phrase reordering; conditional random field
  • Review
    XIAO Xinyan1,2, LIU Yang1, LIU Qun1, LIN Shouxun1
    2012, 26(1): 37-42.
    Lexical information plays an important role in phrase reordering. However, reordering in the hierarchical phrase-based (HPB) model does not consider the lexical information within phrases, which leads to reordering ambiguity. To alleviate this, we propose a lexicalized reordering method for HPB translation. We distinguish two orientations of a variable relative to its adjacent words, and use the boundary words covered by the variable to guide reordering choices. On a large-scale Chinese-English translation evaluation task, the proposed method improves translation performance by 0.6 to 1.2 BLEU on the NIST 2003-2005 test sets.
    Key words: statistical machine translation; hierarchical phrase-based; lexical reordering
  • Review
    JIANG Mengjin, ZHOU Yaqian, HUANG Xuanjing
    2012, 26(1): 42-51.
    Information de-duplication is an important task in information extraction. This paper focuses on multi-field information de-duplication. Previous work usually treats every information field equally. We separate the information fields into several categories, generalize the similarity computation for each single field, and use those similarities as features in a machine learning method to identify duplicate information pairs. For the most difficult field, named entities, we expand co-reference pairs using the other, more easily predicted fields, and use the expanded knowledge to improve de-duplication performance.
    Key words: information extraction; information de-duplication; named entity
  • Review
    CHANG Peng1,2, FENG Nan1
    2012, 26(1): 51-58.
    This paper presents a novel co-occurrence-term-based vector space model (CTVSM) for automatic document indexing, inspired by the vector space model (VSM). In contrast to the traditional VSM, which represents a document as a bag of words regardless of the positions of those words in the text, the proposed technique uses co-occurring terms instead of single terms. First, pairs of strongly co-occurring terms are extracted from the document set by association rules; the similarity between documents is then defined over these pairs. Experiments indicate substantial and consistent improvements of the CTVSM over the standard VSM.
    Key words: document model; co-occurrence; document similarity; text mining
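
To illustrate the difference between indexing by single terms and by co-occurring term pairs, here is a rough sketch. It assumes that co-occurrence means two terms appearing in the same sentence and that document similarity is the cosine of the pair-count vectors; it skips the association-rule filtering mentioned in the abstract. The helper names `cooccurrence_pairs` and `cosine` are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(sentences):
    """Count unordered term pairs that co-occur within a sentence."""
    counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))  # sort so each pair has one canonical form
        counts.update(combinations(terms, 2))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, each a list of tokenized sentences.
doc1 = cooccurrence_pairs([["word", "sense", "disambiguation"]])
doc2 = cooccurrence_pairs([["word", "sense", "tagging"]])
doc3 = cooccurrence_pairs([["speech", "synthesis"]])

print(cosine(doc1, doc2))  # the two documents share the pair ("sense", "word")
print(cosine(doc1, doc3))  # no shared pairs
```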
  • Review
    HAN Yongfeng, XU Xuyang, LI Bicheng, ZHU Wubin, CHEN Gang
    2012, 26(1): 58-67.
    State-of-the-art automatic summarization relies on text segment clustering to avoid the redundancy of traditional approaches. However, some text segments in web news are irrelevant to the subject, which affects the clustering result and harms the conciseness of the summary. This paper introduces event extraction technology and proposes an event-extraction-based multi-document summarization method for web news. First, the method separates events from non-events in the news with a binary classifier. Then, the physical division of the original documents into paragraphs or sentences is transformed, through clustering, into a logical division of content based on events. Finally, the summary is produced by extracting, ordering and polishing the major events. Experimental results demonstrate the effectiveness of the proposed method, which significantly improves summarization quality.
    Key words: event extraction; Chinese information processing; classification; news text; clustering; automatic summarization
  • Review
    SUN Yan, ZHOU Xueguang
    2012, 26(1): 67-72.
    Treating webpage filtering as a classification task, a new method based on rough sets and Bayesian decision theory is proposed. Attribute reduction for webpage classification is obtained from the discernibility matrix and discernibility function according to rough set theory. The webpage is then classified and filtered by Bayesian decision theory. Simulation experiments show the effectiveness of the proposed method.
    Key words: information security; webpage filtering; rough set; discernibility matrix; Bayesian decision
  • Review
    XIE Lixing1, ZHOU Ming 2, SUN Maosong1
    2012, 26(1): 73-84.
    With the development of Web 2.0, micro-blogging has drawn substantial attention from both academia and industry. This paper uses the micro-blog API from Sina to carry out sentiment analysis on Chinese micro-blogs. We compare the performance of three methods, based respectively on emoticons, a sentiment lexicon, and a hybrid approach over a hierarchical structure using SVM. The experiments show that the SVM-based hybrid approach achieves the best performance. Furthermore, we analyze the contribution of various features in this model, including target-independent and target-dependent features. Experimental results show that the SVM-based method achieves an accuracy of 66.467% with target-independent features, and an improved accuracy of 67.283% when target-dependent features are added.
    Key words: Sina micro-blog; sentiment analysis; SVM
  • Review
    YANG Liang, LIN Yuan, LIN Hongfei
    2012, 26(1): 84-91.
    Micro-blogging (Twitter) is becoming a major source for producing and spreading hot events on the internet, because it provides a short and convenient way for users to express and share their attitudes instantly. To detect hot events on a micro-blog platform, this paper finds that the emergence of a hot event increases the number of emotion words and changes their distribution. Accordingly, an emotion distribution language model is proposed that analyzes the differences between adjacent time intervals to find hot events. Experimental results show that the proposed method detects hot events on a micro-blog platform effectively, facilitating their management and monitoring.
    Key words: micro-blog; hot events; emotion distribution language model
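
The abstract does not specify how adjacent time intervals are compared. One plausible reading, sketched below purely as an assumption, is to build a smoothed unigram distribution over emotion words for each interval and measure the change with KL divergence. The lexicon, tokens and function names are hypothetical.

```python
import math
from collections import Counter


def emotion_distribution(tokens, emotion_lexicon, alpha=1.0):
    """Add-alpha smoothed unigram distribution over the emotion lexicon."""
    counts = Counter(t for t in tokens if t in emotion_lexicon)
    total = sum(counts.values()) + alpha * len(emotion_lexicon)
    return {word: (counts[word] + alpha) / total for word in emotion_lexicon}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


lexicon = {"happy", "angry", "sad"}
previous = emotion_distribution(["happy", "news", "happy", "sad"], lexicon)
current = emotion_distribution(["angry", "angry", "angry", "quake", "sad"], lexicon)

# A large divergence between adjacent intervals would flag a candidate hot event.
print(kl_divergence(current, previous))
```

Smoothing keeps every probability non-zero, so the divergence stays finite even when an emotion word is absent from one interval.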
  • Review
    Mairehaba·aili1,2,JIANG Wenbin1,Tuergen·yibulayin2
    2012, 26(1): 91-97.
    We propose an automatic lemmatization model for Uyghur inflection. In contrast to previous methods, we generalize Uyghur inflection conceptually and treat lemmatization as a sequence tagging problem. Using the "Uyghur million word Part-of-Speech tagging corpus" as training data, the proposed method improves the F value of lemmatization from 84.1% to 91.4%, and in particular attains an F value of 88.6% for Uyghur verbs, which are complex and rich in suffixes.
    Key words: Uyghur language; morphological analysis; Uyghur inflection
  • Review
    LIU Huidan1,2, NUO Minghua1,2, ZHAO Weina3,4, WU Jian1, HE Yeping1
    2012, 26(1): 97-104.
    This paper designs and implements a Tibetan word segmentation system named “SegT”. It identifies critical words with a fast trie-based algorithm while segmenting each Tibetan sentence into blocks at case-auxiliary words. It then identifies abbreviated words while segmenting each block into words by maximum matching. Finally, it detects ambiguities by bidirectional segmentation and resolves them by word frequency. Experiments show that block segmentation based on case-auxiliary words improves segmentation speed by about 15%, but does not significantly increase or decrease precision. The precision of the system reaches 96.98%, which shows that it is a practical system.
    Key words: Tibetan word segmentation; case-auxiliary words; critical word detection; word frequency statistics; Tibetan information processing; Chinese information processing
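
The last two steps (maximum matching, then bidirectional segmentation with frequency-based disambiguation) can be sketched as follows. This is not SegT's implementation: it omits the trie, the case-auxiliary block segmentation and the abbreviated-word handling, works on placeholder syllables rather than Tibetan text, and assumes that a disagreement is resolved by picking the segmentation with the higher total word frequency.

```python
def max_match(syllables, lexicon, max_len, reverse=False):
    """Greedy maximum matching over a syllable sequence.

    With reverse=True the scan runs from the end of the sequence
    (backward maximum matching).
    """
    seq = syllables[::-1] if reverse else syllables
    words, i = [], 0
    while i < len(seq):
        for n in range(min(max_len, len(seq) - i), 0, -1):
            chunk = seq[i:i + n]
            word = tuple(chunk[::-1] if reverse else chunk)
            # Fall back to a single syllable when nothing longer matches.
            if n == 1 or word in lexicon:
                words.append(word)
                i += n
                break
    return words[::-1] if reverse else words


def segment(syllables, freq, max_len=3):
    """Bidirectional segmentation; disagreements are settled by word frequency."""
    forward = max_match(syllables, freq, max_len)
    backward = max_match(syllables, freq, max_len, reverse=True)
    if forward == backward:
        return forward  # no ambiguity detected

    def score(words):
        # Assumed scoring rule: total frequency of the words in a segmentation.
        return sum(freq.get(w, 0) for w in words)

    return forward if score(forward) >= score(backward) else backward


# Hypothetical lexicon: words are tuples of syllables mapped to corpus counts.
freq = {("a", "b"): 5, ("b", "c"): 50}
print(segment(["a", "b", "c"], freq))
```

Here forward matching yields `ab | c` and backward matching yields `a | bc`; the two disagree, and the backward result wins because `bc` is the more frequent word.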
  • Review
    CHEN Bin, ZHANG Lianhai, NIU Tong, WANG Bo
    2012, 26(1): 104-110.
    A Chinese nasal detection method based on energy distribution and formant structure characteristics is presented. The energy distribution and formant structure features are first extracted from Seneff's auditory spectrum, and an SVM classifier is then applied to detect candidate nasals. Finally, post-processing removes insertion errors using parameters such as segment duration, energy of the preceding vowel, the energy difference between high and low frequencies, and the energy ratio of middle to low frequencies. Experimental results show that the accuracy is 90.4% for clean speech and above 84.4% for noisy speech at an SNR of 10 dB.
    Key words: nasal detection; energy distribution; formant structure; Seneff auditory model
  • Review
    Askar·Hamdulla
    2012, 26(1): 110-119.
    To improve the naturalness of speech synthesis systems, this paper presents a systematic empirical study of nasals in Uyghur as a contribution to the study of Uyghur prosody. The experimental analysis is carried out on an acoustic database with prosodic measurements of words containing nasals, recorded by one male and one female speaker. The investigation concentrates on an acoustic analysis of the prosodic features of Uyghur nasals, including formant, length, intensity and duration. To the best of our knowledge, this is the first study of this kind for Uyghur, and it is also instructive for the study of the entire Altaic language family.
    Key words: Uyghur language; nasal; formant; length; intensity; speech; variant
  • Review
    CHEN Mo
    2012, 26(1): 119-128.
    Acquiring Mandarin tone categories is very difficult for second language learners. To capture the acquisition mechanism of Mandarin tone categories, we constructed a growing tree-structured self-organizing feature map model to simulate this dynamic acquisition process. The model is well suited to simulating Mandarin tone category acquisition because of its good topological mapping and dynamic node expansion. The simulation results are consistent with the experimental results, revealing the dynamic developmental process of Mandarin tone categories. The research also shows that self-organization is the most important mechanism in the emergence of tone categories. The study provides meaningful implications for the teaching of Mandarin tones.
    Key words: Mandarin tone categories; computer simulation; growing tree-structured self-organizing feature map