2009 Volume 23 Issue 5 Published: 19 October 2009
  

  • Review
LUO Yanyan, HUANG Degen
    2009, 23(5): 3-9.
The method of treating word segmentation as a sequence tagging problem and solving it with CRFs has been widely applied recently. However, the CRF tagger inevitably produces some wrong tags. To reduce their number, we propose a new method for Chinese word segmentation based on the marginal probabilities generated by CRFs. Firstly, candidate words with high marginal probabilities are extracted from the tagging results. Then, the candidate words with low marginal probabilities are recombined. Finally, a reward mechanism built on FMM is introduced to complement the sub-strings produced by the recombination procedure. Evaluated on the closed track of the SXU and NCC corpora in the fourth SIGHAN Chinese Word Segmentation Bakeoff, this method achieves F-scores of 96.41% and 94.30%, respectively.
    Key words computer application; Chinese information processing; Chinese word segmentation; Conditional Random Fields(CRFs); Marginal probability; Forward Maximum Matching(FMM); global feature
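The extraction of high-confidence candidate words described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the B/M/E/S tag scheme is the common convention for CRF-based segmentation, and the 0.9 confidence threshold is an assumption for demonstration.

```python
# Sketch of filtering CRF segmentation output by per-character marginal
# probability; low-confidence spans would be re-segmented (e.g. by FMM)
# in a later step.

def segment_from_tags(chars, tags):
    """Turn a B/M/E/S tag sequence into a word list."""
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

def reliable_words(chars, tags, margins, threshold=0.9):
    """Pair each word with a flag saying whether every character tag in it
    has a marginal probability above the threshold."""
    words = segment_from_tags(chars, tags)
    out, i = [], 0
    for w in words:
        span = margins[i:i + len(w)]
        out.append((w, min(span) >= threshold))
        i += len(w)
    return out
```

Words flagged `False` mark the low-confidence regions that the paper's recombination and FMM-based reward mechanism would revisit.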
  • Review
    ZHAO Chunli,SHI Dingxu
    2009, 23(5): 9-19.
Based on an analysis of the disadvantages of traditional approaches to Adjective-Noun constructions, this paper first puts forward a basic semantic model of "thing, attribute value and attribute domain" for understanding this construction. Then, according to the basic semantic model and a corpus investigation, it suggests classifying nouns into 5 sub-groups (subject, event, object, space & time, and logical nouns) and adjectives into 5 sub-groups (subjective, eventual, objective, spatial & temporal, and evaluative). Finally, the semantic construction model between the sub-classes of nouns and adjectives is established in light of the computational linguistic scheme and the theoretical principles of semantic grammar. The results prove this model effective in revealing the combination rules of adjective-noun constructions.
Key words computer application; Chinese information processing; adjective-noun constructions; attribute domain; semantic construction model; semantic grammar
  • Review
    WEN Miaomiao, WU Yunfang
    2009, 23(5): 19-25.
Owing to the high frequency of prepositional structures, this paper systematically explores prepositional phrase boundary identification, which plays an important role in Chinese parsing as well as in application systems such as text-to-speech. We apply a support vector machine model to identify phrase boundaries, selecting the boundary word by the output probability rather than the binary classification result. We also investigate different kinds of features and employ rich features such as semantic classes. In total, 68 frequently used prepositions are covered in our experiments, and the method achieves a precision of 90.95% in five-fold cross validation.
    Key words computer application; Chinese information processing; prepositional phrase identification; SVM; semantic class
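The probability-based selection idea above can be shown in miniature: rather than accepting every position the classifier labels positive, pick the single candidate with the highest output probability. The positions and scores below are hypothetical stand-ins for SVM probability outputs.

```python
# Sketch of choosing one phrase boundary by maximum classifier probability
# instead of by independent binary decisions (which may accept several
# positions or none).

def pick_boundary(candidates):
    """candidates: list of (position, probability) pairs for possible
    right boundaries of one prepositional phrase."""
    return max(candidates, key=lambda c: c[1])
```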
  • Review
    XU Qian, ZHOU Junsheng, CHEN Jiajun
    2009, 23(5): 25-33.
The Dirichlet process is a well-known nonparametric Bayesian model with the attractive property that the number of components is flexibly determined by the model and the data. It is an active area of research both within machine learning and in the natural language processing community. This paper introduces the origin and development of the Dirichlet process and the methods for its inference, demonstrates how to use the model to solve natural language processing tasks, and finally discusses future research directions and development trends.
Key words computer application; Chinese information processing; nonparametric Bayesian model; Dirichlet process; Dirichlet process mixture model; Markov chain Monte Carlo
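The "flexible number of components" property can be illustrated with the Chinese restaurant process, a standard constructive view of the Dirichlet process: each new item joins an existing cluster with probability proportional to the cluster's size, or opens a new cluster with probability proportional to the concentration parameter alpha. This is a generic textbook simulation, not code from the paper.

```python
import random

# Toy Chinese-restaurant-process simulation: the number of clusters is not
# fixed in advance but grows with the data, governed by alpha.

def crp(n_items, alpha, seed=0):
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items in cluster k
    assignments = []  # cluster index chosen for each item
    for _ in range(n_items):
        # One weight per existing cluster, plus one slot for a new cluster.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

Larger alpha yields more clusters on average; the first item always opens cluster 0.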
  • Review
    WANG Haidong, HU Naiquan, KONG Fang, ZHOU Guodong
    2009, 23(5): 33-40.
This paper proposes a tree kernel-based approach to English pronoun resolution. In our method, the convolution kernel of SVM is first used to capture structured syntactic information, and the resulting syntactic feature is combined with other basic features from the literature. A systematic analysis of the impact of training-instance filtering and of different pruning strategies on the results is conducted, and the pronoun resolution performance with respect to sentence distance is further examined. Evaluation on the ACE2004 NWIRE benchmark corpus shows that the tree kernel improves performance significantly, especially for pronoun resolution within a sentence.
    Key words computer application; Chinese information processing; coreference resolution; structured syntax; tree kernel; pruning strategy
  • Review
    LIU Peng, ZONG Chengqing
    2009, 23(5): 40-47.
In recent years, the phrase-based statistical machine translation model has attracted much attention for its good translation performance. However, the model uses a precise-matching strategy in decoding, which makes data sparseness a serious problem: on the one hand, some phrases become "unknown phrases" because they cannot be matched precisely in the phrase table; on the other hand, most of the phrases in the phrase table are never used in the translation process. Therefore, we propose a novel translation approach based on phrase fuzzy matching and sentence expansion. For a phrase out of the phrase table, i.e. an unknown phrase, we find its similar phrases in the phrase table through fuzzy matching. The sentence is then expanded by replacing the original phrase with the similar ones before being translated into the target language. Finally, a multi-classifier combination is employed to select the best translation. Experimental results show that this approach significantly improves translation quality.
Key words artificial intelligence; machine translation; phrase-based statistical machine translation; fuzzy matching; combination classifier
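The fuzzy-matching step can be sketched as a similarity search over the phrase table. The similarity measure (Python's `difflib` ratio) and the threshold are assumptions chosen for illustration; the paper's own similarity function may differ.

```python
from difflib import SequenceMatcher

# Sketch: for a phrase missing from the phrase table ("unknown phrase"),
# find the most similar phrase that is in the table, or None if nothing
# clears the similarity threshold.

def fuzzy_match(phrase, phrase_table, threshold=0.5):
    best, best_score = None, threshold
    for cand in phrase_table:
        score = SequenceMatcher(None, phrase, cand).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best
```

In the paper's pipeline, the sentence would then be expanded by substituting each matched similar phrase before decoding.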
  • Review
    GUO Jianyi, XUE Zhengshan, YU Zhengtao, ZHANG Zhikun, ZHANG Yihao, YAO Xianming
    2009, 23(5): 47-53.
This paper presents a method for named entity recognition in the tourism domain based on cascaded conditional random fields. The method consists of two steps. The first step identifies simple tourism named entities, using Chinese characters as units together with a dictionary of common characters and suffixes in tourist attraction names, a dictionary of common characters in location names, and other dictionaries. The results of the first step are then passed to the second step, in which nested tourist attractions, special snacks and location names are recognized using words as units and other complex features. The results of six experiments indicate that in open testing, the proposed method improves the F-score by 8% over the single-layer model, and by 15% (with 8% in precision and 22% in recall, respectively) over the HMM model.
    Key words computer application; Chinese information processing; tourism domain; named entity recognition; cascaded conditional random fields; feature template
  • Review
    DING Weiwei, CHANG Baobao
    2009, 23(5): 53-62.
In recent years, Chinese SRL (semantic role labeling) has attracted intensive attention. Many SRL systems are built on parse trees, in which the constituents of the sentence are first identified and then classified. In contrast, this paper establishes a semantic chunking based method that changes the SRL task from the traditional "parsing - semantic role identification - semantic role classification" process into a simple "semantic chunk identification - semantic chunk classification" pipeline. Semantic chunking, named after syntactic chunking, is used to identify the semantic chunks, namely the arguments of the verbs. Based on the semantic chunking results, Chinese SRL can be cast as a sequence labeling problem instead of a classification problem. We apply conditional random fields to the problem and obtain better performance. With the removal of the parsing stage, the SRL task avoids the dependence on parsing, which has always been the bottleneck in both speed and precision. Experiments show that our approach outperforms the previously best-reported methods on Chinese SRL with an impressive time reduction. We also show that the proposed method works much better on gold word segmentation and POS tagging than on automatic results.
    Key words computer application; Chinese information processing; semantic role labeling; semantic chunking; conditional random fields; sequence labeling
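The reduction of SRL to sequence labeling can be made concrete with BIO-style chunk tags: once a tagger (the paper uses CRFs) emits tags such as B-A0 or I-A1 per token, recovering labeled argument spans is a simple decoding pass. The sentence, tags and role names below are invented for illustration.

```python
# Sketch: decode a BIO semantic-chunk tag sequence into (role, phrase)
# argument spans, the "semantic chunk identification + classification" view.

def decode_chunks(tokens, tags):
    chunks, cur_role, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_role:
                chunks.append((cur_role, "".join(cur_toks)))
            cur_role, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_role == tag[2:]:
            cur_toks.append(tok)
        else:  # "O" or an inconsistent I- tag closes the current chunk
            if cur_role:
                chunks.append((cur_role, "".join(cur_toks)))
            cur_role, cur_toks = None, []
    if cur_role:
        chunks.append((cur_role, "".join(cur_toks)))
    return chunks
```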
  • Review
    LIN Jing, YUAN Chunfa
    2009, 23(5): 62-68.
Temporal relation is inherent in most time and event concepts and serves as an evident natural clue for information organization. This paper studies temporal relations between time and time, event and time, as well as event and event in Chinese, on the basis of information extraction and temporal annotation. Temporal relations in Chinese texts are extracted with the help of syntactic and semantic knowledge, and context-dependent rules are defined to compute temporal relations between information from different texts. This study contributes to organizing, accumulating and sharing information extraction results, and is also useful in advanced NLP tasks such as event tracking and multi-document summarization.
    Key words computer application; Chinese information processing; temporal relation extraction; temporal relation computation; information organization
  • Review
WANG Su-ge, LI De-yu, WEI Ying-jie, SONG Xiao-lei
    2009, 23(5): 68-75.
The sentiment orientation of a word directly influences the sentiment orientation of higher-level linguistic units such as the phrase, the sentence, the paragraph and the text. This paper proposes a paradigm word selection method based on the category distinguishing ability of a word and a sentiment word table. Considering that a word usually shares the same sentiment orientation with its synonyms, we propose a method for discriminating word sentiment orientation based on synonyms, which avoids the data sparseness issue to a certain extent. Experimental results indicate that the proposed method is superior to the method based only on the object word and paradigm words.
    Key words computer application; Chinese information processing; word sentiment orientation; paradigm word; relation intensity; synonym
  • Review
LI Chengwei, PENG Qinke, XU Tao
    2009, 23(5): 75-80.
Automatic classification of online comments by sentiment orientation is a complicated task with promising applications in many domains, such as enterprise intelligence systems and the administration of public emergencies. Based on the Hyperspace Analogue to Language (HAL) space and information inference, we propose a new model for classifying online comments according to the implicit or explicit sentiment they express. We first extract phrases matching pre-defined patterns from the sentences of a comment; then, based on the HAL space, a concept combination algorithm blends the words of each extracted phrase into one concept, whose sentiment orientation is calculated by the proposed information inference model. Finally, the sum of the information inference degrees of the individual phrases indicates the sentiment polarity of the comment. Experimental results show that, compared to SVM and term-counting algorithms based on valence shifters and sentiment word tables, our model performs better and achieves an accuracy as high as 89%.
    Key words computer application; Chinese information processing; information inference; sentiment classification; HAL; semantic orientation
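The HAL space underlying the model can be sketched as a co-occurrence matrix built by sliding a window over the token stream, with nearer words contributing more weight. This is a simplified, symmetric variant of the standard HAL construction (HAL proper keeps separate preceding/following vectors); the window size and linear weighting are the usual textbook choices, not necessarily the paper's.

```python
from collections import defaultdict

# Sketch: build a (simplified, symmetric) HAL co-occurrence space where a
# co-occurrence at distance d inside the window gets weight window + 1 - d.

def build_hal(tokens, window=5):
    vecs = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                weight = window + 1 - d  # closer words weigh more
                vecs[w][tokens[i + d]] += weight
                vecs[tokens[i + d]][w] += weight
    return vecs
```

Each word's vector of weighted co-occurrences is then the representation on which concept combination and information inference operate.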
  • Review
ZHOU Jiaying, ZHU Zhenmin, GAO Xiaofang
    2009, 23(5): 80-86.
This paper presents a new method for content extraction from Web pages based on statistics and content features. The method not only inherits the merits of the traditional statistic-based method, but can also extract multi-body documents, which the purely statistic-based method cannot handle. Exploiting the fact that a multi-body document corresponds to multiple subtrees with similar characteristics in the DOM tree of the Web page, we first obtain a content path using the statistic-based method. Then the content region and a trunk subtree are modeled by the important features of the path, which are applied to retrieve the complete body content. Our experimental results show an extraction precision of 94% for single-body documents and 91% for multi-body documents.
    Key words computer application; Chinese information processing; content extraction;single-body documents;multi-body documents
  • Review
    WANG Ye, ZHANG Honggang, FANG Xu, GUO Jun
    2009, 23(5): 86-92.
Low-resolution Chinese character recognition for vehicle license plates is a challenging issue in character recognition. With the development of intelligent traffic systems, traditional approaches based on binary images cannot meet practical requirements. This paper applies a gray-scale image based recognition approach, avoiding the structural information loss caused by the traditional binarization process. We introduce Local Binary Patterns (LBP) into Chinese character recognition for the first time and achieve good results. In addition, we present an improved and efficient Advanced LBP (ALBP) operator for feature extraction, which further improves the processing speed. Experiments show that our approach is robust against low-quality characters and outperforms the conventional approach in both precision (from 74.25% to 98.80%) and recognition speed.
    Key words artificial intelligence; pattern recognition; Chinese character recognition; ALBP; precision; recognition speed
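The basic LBP operator the paper builds on can be shown directly: each pixel of a gray-scale image is encoded by thresholding its 8 neighbours against the centre value. This is the textbook 3x3 LBP, not the paper's ALBP variant; the image is a list of lists of intensities.

```python
# Sketch: 8-bit LBP code for one interior pixel. Neighbours are read
# clockwise from the top-left; a neighbour >= centre sets its bit.

def lbp_code(img, y, x):
    c = img[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code
```

A histogram of such codes over image cells is the usual LBP feature vector fed to the classifier.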
  • Review
    LI Rong
    2009, 23(5): 92-98.
This paper describes a spelling check system for the OCR output of Chinese text. A large training corpus is used to build an error-pattern database. First, each correct sentence and the corresponding sentence with errors are aligned, and the characters that differ between them are extracted. Then an error-word extraction algorithm is executed to obtain error patterns in the form of (correction-word, error-word, count). In the resulting error-pattern database, every error pattern can be regarded as a rule for correcting errors. The error-pattern database is applied according to the length of the error word: direct application if the length is larger than two characters; otherwise a verification algorithm is applied. The above method lies at the core of the THOCR spelling check system, and experimental results are provided.
    Key words computer application; Chinese information processing; spelling check; training corpus; learning algorithm
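The length-conditioned rule application described above can be sketched as follows. The pattern data is invented for illustration, and the verification step for short error words is only stubbed out, since its algorithm is not detailed in the abstract.

```python
# Sketch: apply (error-word -> correction-word) patterns as correction
# rules. Error words longer than two characters are applied directly;
# shorter ones are left to a context verification step (omitted here).

def apply_patterns(text, patterns):
    """patterns: dict mapping error-word -> correction-word."""
    # Longest error words first, so longer matches take precedence.
    for err in sorted(patterns, key=len, reverse=True):
        if len(err) > 2:  # direct application
            text = text.replace(err, patterns[err])
        # len(err) <= 2: would require the verification algorithm
    return text
```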
  • Review
    ZHAO Hui, LIN Chenglong, TANG Chaojing
    2009, 23(5): 98-104.
This paper proposes a method for constructing a Chinese bimodal corpus, which is vital to data-driven visual speech synthesis and bimodal speech recognition. According to the visual features of the lips during pronunciation in the video, the fuzzy c-means clustering method is used to cluster triphone models and establish visual triphone models. Based on the visual triphone models, an evaluation function is used to score the sentences in the original corpus, and the corpus is then selected automatically. Compared with other bimodal corpora, the proposed method substantially improves the Chinese bimodal corpus in coverage rate, coverage efficiency and high-frequency word distribution, revealing the bimodal phenomena of Mandarin Chinese more faithfully.
    Key words computer application; Chinese information processing; visual speech synthesis; bimodal speech recognition; bimodal corpus; visual triphone; evaluation function
  • Review
    Zulpiye·Aman, Askar·Hemdulla
    2009, 23(5): 104-108.
In order to improve the naturalness of a speech synthesis system, this paper conducts an acoustic analysis of the prosodic features of disyllabic words in the Uyghur language, based on an acoustic database containing prosodic measurements of 969 words from 1 male and 1 female speaker. To our knowledge, this is the first systematic empirical work on word stress in Uyghur, establishing a basis for the study of Uyghur prosody. The study is also of high research value for investigating the prosody of the entire Altaic language family.
Key words computer application; Chinese information processing; speech synthesis; Uyghur language; prosodic features; acoustic analysis
  • Review
    ZHAO Li, CUI Duwu
    2009, 23(5): 108-114.
A text watermarking algorithm based on the tones of Chinese characters and a genetic algorithm is proposed. The watermark is embedded, by transformations that modify character values, into the regions indicated by marker codes. The marker codes are determined by the characteristics of the document, and the watermarking capacity, decided by the number of marker codes, is therefore flexible. The whole article can be divided into many parts, each with its watermark embedded and extracted independently, which reduces the computational complexity significantly.
    Key words computer application; Chinese information processing; text watermarking; robustness; watermarking capacity; genetic algorithm
  • Review
    Zilikam Kasim, Nasirjan Tursun, Wushour Silamu
    2009, 23(5): 114-119.
This paper conducts an acoustic analysis of word-initial syllable vowels in the Uyghur language, based on an acoustic database containing measurements from 500 words recorded by 1 male and 1 female speaker. Among the various ways of plotting vowel acoustics, the paper chooses the JOOS acoustic vowel chart, with F1 on the vertical axis and F2 on the horizontal axis, which corresponds well with vowel tongue positions. The initial syllable vowels of Standard Uyghur are [y, i, e, , u, o, , ], with [u, o, ] as the back vowels and [y, i, e, , ] as the front vowels. As for the distribution of tongue height, [y, i, u] are the high vowels, [e, , o] the secondary high vowels, [] the secondary low vowel and [] the low vowel.
    Key words computer application; Chinese information processing; Uyghur; initial syllabic vowel; acoustic analysis
  • Review
CHENG Xinfang, Wushouer Silamu, ZHANG Yongcai
    2009, 23(5): 119-123.
An Uyghur IME for the remote control of IP Set-Top Boxes is presented, with which Uyghur characters can be input using the numeric keys 2~9 and five control keys on the remote control. The paper first analyzes the IP Set-Top Box and the characteristics of Uyghur letters, and then discusses the framework of the remote-control Uyghur IME, its function description, keyboard layout, and the handling and porting procedure. This mixed multi-language input and display technique uses the IP Set-Top Box as the video decoding terminal and the household TV as the main display terminal, and directly employs the broadband network infrastructure. The method has been successfully applied to the interactive services of two-way cable networks.
Key words computer application; Chinese information processing; embedded system; IP Set-Top Box; Uyghur; IME
  • Review
    DIAO Hongjun, LI Peifeng, QIAN Peide
    2009, 23(5): 123-128.
This paper analyzes state-of-the-art IME models for intelligent mobile phones and decomposes the input state into three different states for inputting Chinese, English and numerals respectively. Based on the state design pattern given by the GoF, the paper proposes an automaton-based IME model for intelligent mobile phones, compatible with many intelligent mobile systems such as Windows Mobile and Symbian S60. The model not only simplifies development effort but also provides much convenience in the maintenance and updating of intelligent mobile IMEs.
    Key words computer application; Chinese information processing; finite-state automata; IME; intelligent mobile; design pattern
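The state-pattern automaton described above can be sketched in miniature: the IME delegates each key press to its current state object (Chinese, English or numeral), and a mode key cycles between states. The class names, transition order and per-state behavior here are simplified stand-ins for the model in the paper.

```python
# Sketch of a state-pattern IME automaton with three input states.

class ChineseState:
    name = "chinese"
    def handle(self, key):
        return f"pinyin:{key}"  # would feed a pinyin-to-hanzi converter

class EnglishState:
    name = "english"
    def handle(self, key):
        return key              # keys pass through as letters

class NumeralState:
    name = "numeral"
    def handle(self, key):
        return key if key.isdigit() else ""  # accept digits only

class IME:
    def __init__(self):
        self.states = [ChineseState(), EnglishState(), NumeralState()]
        self.current = 0
    def switch(self):
        # The mode key cycles Chinese -> English -> numeral -> Chinese.
        self.current = (self.current + 1) % len(self.states)
        return self.states[self.current].name
    def press(self, key):
        # Delegate to the current state object (the state design pattern).
        return self.states[self.current].handle(key)
```

Adding a new input mode means adding one state class, without touching the dispatch logic, which is the maintenance benefit the abstract refers to.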