2009 Volume 23 Issue 6 Published: 15 December 2009
  

  • Review
    SHAO Yanqiu, SUI Zhifang, WU Yunfang
    2009, 23(6): 3-11.
    Besides syntactic structure, lexical semantic features are also closely related to semantic roles, and can therefore help solve problems that syntactic features alone cannot handle well. In this paper, lexical semantic features such as the valency number and the semantic classes of the subject and object are introduced according to the Peking University semantic dictionary CSD. The 10-fold cross validation results show that, by applying the semantic dictionary, the overall F-score increases by 1.11%, and the F-scores of Arg0 and Arg1 reach 93.85% and 90.60% respectively, which are 1.10% and 1.26% higher than the results relying on syntactic features only.
    Key words artificial intelligence; natural language processing; semantic analysis; semantic role labeling; syntax analysis; semantic dictionary; lexical semantic feature
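To illustrate the kind of lexical-semantic features the abstract describes, here is a minimal Python sketch that looks up valency and semantic class in a toy dictionary. The entries and the `lexical_features` helper are hypothetical; the real CSD schema is richer.

```python
# Hypothetical lexical-semantic feature extraction for SRL, assuming a
# CSD-style dictionary mapping each word to a valency number and a
# semantic class. Toy entries only.
CSD = {
    "吃": {"valency": 2, "class": "ingest"},
    "苹果": {"valency": 0, "class": "food"},
}

def lexical_features(predicate, head_word):
    """Extra features to append to the usual syntactic feature set."""
    p = CSD.get(predicate, {})
    h = CSD.get(head_word, {})
    return {
        "pred_valency": p.get("valency", -1),
        "pred_class": p.get("class", "UNK"),
        "head_class": h.get("class", "UNK"),
    }

print(lexical_features("吃", "苹果"))
```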
  • Review
    LI Junhui, WANG Hongling, ZHOU Guodong, ZHU Qiaoming, QIAN Peide
    2009, 23(6): 11-19.
    A feature-based semantic role labeling system operating on a single syntactic parse is constructed. The system is divided into three sequential tasks: (1) filtering out constituents that with high probability represent no semantic arguments, (2) classifying candidate constituents into the specific argument categories (including a NULL class), and (3) resolving overlapping arguments and constituents all labeled as core arguments in a post-processing step. Besides combining and optimizing features presented in prior work, the paper extracts new features from knowledge of grammar, patterns and collocations. The experiments show the effectiveness and robustness of the newly extracted features, with which the final SRL system achieves F1 values of 77.54% and 78.75% on the development set and the WSJ test set, respectively. As far as we know, this is the best result based on a single syntactic parser on the CoNLL-2005 Shared Task.
    Key words artificial intelligence; natural language processing; semantic role labeling; grammar-driven feature; pattern feature; collocation feature
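The three-stage pipeline can be sketched as follows. The pruning threshold, scoring lambdas, and index-set representation of constituents are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the three sequential SRL stages; constituents are modeled as
# frozensets of token indices so that overlap is a simple set intersection.
def prune(constituents, p_null):
    # Stage 1: drop constituents that are almost surely not arguments.
    return [c for c in constituents if p_null(c) < 0.95]

def classify(candidates, model):
    # Stage 2: assign each candidate a (role, score); role may be "NULL".
    return [(c, model(c)) for c in candidates]

def postprocess(labeled):
    # Stage 3: resolve overlapping arguments, keeping higher-scoring spans.
    labeled.sort(key=lambda cl: cl[1][1], reverse=True)
    kept, taken = [], []
    for c, (role, score) in labeled:
        if role != "NULL" and not any(c & t for t in taken):
            kept.append((c, role))
            taken.append(c)
    return kept

spans = [frozenset({0, 1}), frozenset({1, 2, 3}), frozenset({5})]
labeled = classify(prune(spans, lambda c: 0.5), lambda c: ("Arg1", len(c)))
print(postprocess(labeled))  # the longer of the two overlapping spans wins
```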
  • Review
    ZHU Hong, LIU Yang, YU Shiwen
    2009, 23(6): 19-26.
    Lexical knowledge acquisition is the bottleneck for many tasks such as word sense disambiguation and lexical knowledge base construction. This paper introduces an automatic word sense discrimination method for Chinese mid-to-high-frequency adjectives. We employ the EM algorithm and exploit Chinese character features, contextual bag-of-words features and host-attribute pairs instead of the less reliable syntactic information. We further optimize the feature selection by utilizing HowNet. The experimental results show that the word sense discrimination results differ from those in Chinese lexicons and could be used for lexicon revision and expansion, even for other types of Chinese words.
    Key words computer application; Chinese information processing; knowledge acquisition; word sense discrimination; feature selection; EM algorithm
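For a flavor of EM-based sense discrimination, the sketch below clusters the contexts of one ambiguous adjective with a Gaussian mixture fit by EM. The toy contexts and character-level features only hint at the paper's richer feature set (characters, bag-of-words, host-attribute pairs, HowNet).

```python
# Minimal EM-style sense discrimination: cluster contexts of an ambiguous
# adjective ("甜") with a Gaussian mixture (fit by the EM algorithm).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.mixture import GaussianMixture

contexts = [
    "这个 苹果 很 甜",   # sweet (taste)
    "西瓜 特别 甜",
    "她 的 笑容 很 甜",  # sweet (pleasant)
    "声音 甜 得 腻人",
]
X = CountVectorizer(analyzer="char").fit_transform(contexts).toarray()
senses = GaussianMixture(n_components=2, covariance_type="diag",
                         random_state=0).fit_predict(X)
print(senses)  # cluster id per context = induced sense
```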
  • Review
    LIANG Yinghong, ZHANG Wenjing, ZHOU Defu
    2009, 23(6): 26-31.
    For term recognition, the current precision on double-word terms has reached 90.36%, while the precision on multi-word terms is only 66.63%. To address multi-word term recognition, this paper proposes a higher-precision method that integrates the advantage of the NC-value parameter in long term recognition with mutual information. The experimental results show that the precision, recall and F-measure of this method reach 88.5%, 76.6% and 82.2%, respectively.
    Key words computer application; Chinese information processing; term recognition; NC-value; mutual information
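The two ingredients the abstract names can be written down directly. Below is an illustrative C-value computation (the core of NC-value, before context weighting) and a pointwise-mutual-information helper; the abstract does not give the paper's exact combination formula, so none is asserted here.

```python
# C-value for a term candidate a: log2(|a|) * f(a), discounted by the mean
# frequency of longer candidates that contain a (Frantzi-style C-value).
import math
from collections import Counter

def c_value(term, freq, nested_in):
    """freq: candidate -> count; nested_in: candidate -> longer candidates."""
    longer = nested_in.get(term, [])
    f = freq[term]
    if longer:
        f -= sum(freq[t] for t in longer) / len(longer)
    return math.log2(len(term.split())) * f  # |a| = length in words

def pmi(x, y, uni, bi, n):
    """Pointwise mutual information of a word pair under corpus size n."""
    return math.log2((bi[(x, y)] / n) / ((uni[x] / n) * (uni[y] / n)))

freq = Counter({"信息 处理": 50, "中文 信息 处理": 30})
print(c_value("信息 处理", freq, {"信息 处理": ["中文 信息 处理"]}))
print(pmi("中文", "信息", Counter({"中文": 40, "信息": 30}),
          Counter({("中文", "信息"): 20}), 1000))
```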
  • Review
    CHE Chao, TENG Hongfei
    2009, 23(6): 31-39.
    The corpus-based method for word sense disambiguation (WSD) suffers from the “knowledge acquisition bottleneck” problem. Automatic lexical sample acquisition based on equivalent pseudo-words (EPs) is an effective way to solve this problem. However, some pseudo-samples collected via EPs are of low quality, and EPs cannot be acquired when the ambiguous word has few monosemous synonyms. This paper proposes a WSD method combining pseudo-samples and manually acquired samples. The method computes sentence similarity against the context of the ambiguous word to remove low-quality pseudo-samples. Moreover, it utilizes the manually tagged corpus to obtain the sense distribution probability and to provide samples for ambiguous words that have few monosemous synonyms. Our method achieves an average F-measure of 0.79 in WSD experiments on the Senseval-3 Chinese lexical sample task.
    Key words computer application; Chinese information processing; word sense disambiguation; HowNet; equivalent pseudo-words; Bayesian classifier; automatic sample acquisition
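A minimal version of the pseudo-sample filter could look like this, assuming TF-IDF cosine similarity over whitespace-segmented text and a hypothetical threshold; the paper's actual sentence-similarity measure is not specified in the abstract.

```python
# Filter low-quality pseudo-samples by similarity to the ambiguous word's
# context; threshold 0.1 is an assumption for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_pseudo_samples(samples, context, threshold=0.1):
    vec = TfidfVectorizer(analyzer=str.split)  # pre-segmented Chinese
    m = vec.fit_transform(samples + [context])
    sims = cosine_similarity(m[:-1], m[-1]).ravel()
    return [s for s, sim in zip(samples, sims) if sim >= threshold]

samples = ["他 的 话 很 甜", "机器 翻译 模型"]
print(filter_pseudo_samples(samples, "她 的 笑容 很 甜"))  # keeps the first
```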
  • Review
    WU Xiaofeng, ZONG Chengqing
    2009, 23(6): 39-46.
    In recent years, Latent Dirichlet Allocation (LDA) has been widely applied to document clustering, text classification, text segmentation, and even unsupervised query-based multi-document summarization. LDA is recognized for its power in modeling a document in a semantic way. In this paper we propose a new supervised method for extraction-based single-document summarization by adding the LDA representation of the document as new features into a CRF summarization system. We study the power of LDA and analyze its effects under different numbers of topics. Our experiments show that adding LDA features markedly improves the results of the traditional CRF summarization system.
    Key words computer application; Chinese information processing; natural language processing; automatic document summarization; latent Dirichlet allocation; conditional random field
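One plausible way to derive per-sentence LDA features for a CRF summarizer is sketched below with gensim; the toy corpus and the `topic_features` template are assumptions, as the abstract does not spell out the system's exact feature design.

```python
# Per-sentence LDA topic distributions as dense feature vectors.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

sentences = [["股市", "上涨"], ["球队", "夺冠"], ["股市", "下跌"]]
d = Dictionary(sentences)
corpus = [d.doc2bow(s) for s in sentences]
lda = LdaModel(corpus, id2word=d, num_topics=2, random_state=0)

def topic_features(sentence):
    """Topic-probability vector usable as CRF node features."""
    bow = d.doc2bow(sentence)
    probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [probs[t] for t in range(lda.num_topics)]

print(topic_features(["股市", "上涨"]))
```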
  • Review
    JIA Yuxiang, YU Shiwen, ZHU Xuefeng
    2009, 23(6): 46-56.
    Metaphor is pervasive in human language and must be handled for natural language understanding. Firstly, this paper discusses the nature of metaphor and the manifestation of metaphorical expressions in language. Automatic metaphor processing is then divided into three subtasks: metaphor recognition, metaphor understanding and metaphor generation. This paper makes an extensive survey of research on automatic metaphor processing over the last three decades, emphasizing achievements in recent years. Research on metaphor knowledge bases, which are indispensable for metaphor processing, is also introduced, and the applications of metaphor processing to natural language processing tasks are discussed. Finally, this paper puts forward some suggestions for future research on automatic Chinese metaphor processing.
    Key words artificial intelligence; machine translation; automatic metaphor processing; natural language processing; machine learning; knowledge acquisition
  • Review
    ZHONG Zhaoman, LIU Zongtian, ZHOU Wen, FU Jianfeng
    2009, 23(6): 56-61.
    The representation of event relations and event reasoning are key to event-based knowledge processing. The paper presents the event influence factor to depict the strength of interaction between events, and introduces a method for computing it. An ERM (Event Relation Map) is then constructed to describe domain event relations. Event reasoning can be carried out from both event relations and event elements; we focus on reasoning based on event relations. Finally, an event reasoning experiment has been carried out, and the results show that the proposed models and algorithms are consistent with human judgments.
    Key words artificial intelligence; natural language processing; event representation; event relation; event influence factor; event reasoning
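A toy version of an ERM and path-based reasoning is sketched below. Nodes are events, weighted edges hold influence factors, and a path is scored by the product of factors along it; the edge values and the max-product rule are illustrative assumptions, not the paper's formulas.

```python
# Toy Event Relation Map: event -> {successor event: influence factor}.
# All edge weights are invented for illustration.
ERM = {"地震": {"海啸": 0.7, "停电": 0.5}, "海啸": {"撤离": 0.8}}

def influence(src, dst, seen=()):
    """Max product of influence factors over acyclic paths src -> dst."""
    if src == dst:
        return 1.0
    best = 0.0
    for nxt, w in ERM.get(src, {}).items():
        if nxt not in seen:
            best = max(best, w * influence(nxt, dst, seen + (src,)))
    return best

print(influence("地震", "撤离"))  # 0.7 * 0.8 = 0.56
```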
  • Review
    ZHANG Yanping, SHI Ke, XU Qingpeng, XIE Fei
    2009, 23(6): 61-67.
    The aim of spam filtering is to distinguish spam from ham. Traditional methods use the vector space model and feature selection approaches to extract features representing the contents of emails; however, they do not take the semantic information among words into account. In this paper, a new method is proposed to extract email features by combining the vector space model with term co-occurrence. The covering algorithm is then employed to classify emails. Experiments show that the proposed method significantly improves filtering performance compared with traditional ones, and the features selected by the term co-occurrence model are more representative than those chosen by the vector space model alone.
    Key words computer application; Chinese information processing; vector space model; spam filter; term co-occurrence model; covering algorithm
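One simple reading of "combining the vector space model with term co-occurrence" is to expand a selected feature set with strongly co-occurring terms. The sketch below does exactly that; the counting window (whole email) and threshold are assumptions.

```python
# Expand vector-space features with terms that co-occur with them.
from collections import Counter
from itertools import combinations

emails = [["免费", "中奖", "点击"], ["会议", "纪要", "附件"]]
cooc = Counter()
for mail in emails:
    cooc.update(combinations(sorted(set(mail)), 2))

def expand(features, min_count=1):
    extra = {w for f in features for pair in cooc
             for w in pair if f in pair and cooc[pair] >= min_count}
    return set(features) | extra

print(expand({"免费"}))  # pulls in "中奖" and "点击"
```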
  • Review
    ZHANG Hongtao, LONG Chong, ZHU Xiaoyan, SUN Jun
    2009, 23(6): 67-72.
    In Chinese OCR post-processing, high-order Chinese n-gram language models, such as word-based trigram and four-gram models, remain challenging because of data sparseness and the large memory cost incurred by the model size. In this paper, we focus on the post-processing of printed Chinese character recognition and propose a byte-based language model. By choosing the byte as the representation unit of the language model, we achieve a remarkable reduction in model size, which overcomes the sparseness problem to a great extent. The experimental results show that the byte-based language model performs well, with higher accuracy and the lowest time and space costs. For the test set with segmentation errors, the recognition accuracy increases from 88.67% to 98.32%, an 85.18% error reduction. Compared with the system using a traditional word-based trigram, the new system saves 95% of the time cost and nearly 98% of the memory cost at almost no loss in accuracy.
    Key words computer application; Chinese information processing; Chinese character recognition; OCR; language model; post-processing
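The core idea is easy to sketch: representing text as a byte sequence shrinks the vocabulary to 256 symbols, which directly eases sparseness and memory cost. The encoding (UTF-8 here) and add-one smoothing are assumptions; the abstract does not state the paper's encoding or smoothing scheme.

```python
# Byte-based trigram language model: count trigrams over byte sequences.
import math
from collections import Counter

def train(texts):
    tri, bi = Counter(), Counter()
    for t in texts:
        b = t.encode("utf-8")
        for i in range(len(b) - 2):
            tri[b[i:i + 3]] += 1
            bi[b[i:i + 2]] += 1
    return tri, bi

def logprob(text, tri, bi):
    b, lp = text.encode("utf-8"), 0.0
    for i in range(len(b) - 2):
        lp += math.log((tri[b[i:i + 3]] + 1) / (bi[b[i:i + 2]] + 256))
    return lp

tri, bi = train(["中文信息处理", "中文字符识别"])
print(logprob("中文信息", tri, bi))
```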
  • Review
    LI Sujian, SONG Tao, GAO Jie, YAO Pengyue, LI Wenjie
    2009, 23(6): 72-79.
    The representation of domain knowledge usually centers on domain lexicons, making domain analysis of terms or term components a natural task. In this paper, we propose a novel domain analysis method based on the discrepancy of lexical usage. Based on word segmentation results, we introduce a link analysis method to compute the usage degree of each word over several typical domain corpora. Then, by analyzing the discrepancy of word usage across domains, we can acquire the domain term components with large usage discrepancies. The method is tested on several domains such as military and entertainment, achieving better results than the commonly used tf×idf method and the Bootstrapping method.
    Key words artificial intelligence; natural language processing; domain analysis; domain term; domain term component; link analysis; usage discrepancy
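One common instantiation of "link analysis over a word graph" is PageRank on a co-occurrence graph, computed per domain corpus; words whose scores differ most across domains would be taken as domain term components. The abstract does not name the specific link-analysis algorithm, so this PageRank sketch is an assumption.

```python
# Usage degree via PageRank (power iteration) over a word co-occurrence
# graph: word -> list of co-occurring words. Dangling nodes not handled.
def pagerank(graph, d=0.85, iters=50):
    nodes = list(graph)
    r = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        r = {n: (1 - d) / len(nodes) +
                d * sum(r[m] / len(graph[m]) for m in nodes if n in graph[m])
             for n in nodes}
    return r

military = {"导弹": ["部队"], "部队": ["导弹", "演习"], "演习": ["部队"]}
print(pagerank(military))  # per-domain usage degrees; compare across domains
```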
  • Review
    ZHANG Shunchang, SUN Le
    2009, 23(6): 79-86.
    Pinyin-to-character conversion is an important task in Chinese information processing, with wide applications in Chinese speech recognition, Chinese pinyin input methods, and so on. This paper investigates pinyin-to-character conversion together with the segmentation of the pinyin stream, and proposes a method using a language model to improve the pinyin stream segmentation model. This method achieves about a 3% improvement in first-character precision compared with the traditional hierarchical model.
    Key words artificial intelligence; natural language processing; pinyin-to-character conversion; hidden Markov model; Chinese information processing; segmentation ambiguity
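For intuition, here is a toy pinyin-to-character decoder over a two-entry lexicon and bigram probabilities. It enumerates full paths rather than running true Viterbi over an HMM lattice, and the lexicon and probabilities are invented; the paper's hierarchical segmentation model is considerably more involved.

```python
# Toy decoder: pick the character sequence maximizing bigram probability.
LEX = {"zhong": ["中", "种"], "wen": ["文", "问"]}
BIGRAM = {("中", "文"): 0.9, ("中", "问"): 0.05,
          ("种", "文"): 0.04, ("种", "问"): 0.01}

def decode(pinyins):
    # paths: candidate character tuple -> probability (exhaustive, toy scale)
    paths = {(c,): 1.0 for c in LEX[pinyins[0]]}
    for py in pinyins[1:]:
        paths = {p + (c,): prob * BIGRAM.get((p[-1], c), 1e-6)
                 for p, prob in paths.items() for c in LEX[py]}
    return "".join(max(paths, key=paths.get))

print(decode(["zhong", "wen"]))  # -> 中文
```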
  • Review
    WANG Lei, LIU Jia
    2009, 23(6): 86-91.
    Fundamental frequency (or pitch), usually denoted F0, is the vibration frequency of the vocal cords during the production of voiced sounds. In a syllable or a continuous stretch of speech, F0 changes over time and yields the fundamental frequency (pitch) contour. Describing and investigating the F0 contour is particularly important because it usually reflects prosodic information such as tone and stress. This paper first proposes a new method to describe the F0 contour, derivative-domain codes, and then discusses the role of this coding method in assessing rhythm for pronunciation evaluation. Experimental results show that the method can be used to evaluate English prosody: the correlation coefficient between the subjective and objective scores of the pitch extreme difference improves from 0.38 to 0.49.
    Key words artificial intelligence; pattern recognition; pitch; derivative; codes; application
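A minimal sketch of derivative-domain coding: differentiate the F0 contour and quantize the slope into rise/flat/fall codes. The flatness threshold and three-symbol alphabet are assumptions; the paper's code set is not given in the abstract.

```python
# Quantize the first derivative of an F0 contour into coarse codes.
import numpy as np

def derivative_codes(f0, flat_eps=2.0):
    d = np.diff(f0)  # frame-to-frame F0 change in Hz
    return ["R" if x > flat_eps else "F" if x < -flat_eps else "-" for x in d]

f0 = np.array([120.0, 125.0, 131.0, 130.5, 118.0])  # Hz, toy contour
print(derivative_codes(f0))  # ['R', 'R', '-', 'F']
```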
  • Review
    MENG Sha, LIU Jia
    2009, 23(6): 91-98.
    While the Out-of-Vocabulary (OOV) problem remains a challenge for English spoken term detection tasks, it is underestimated for Chinese. This is because a Chinese OOV query term can still be matched as a sequence of Chinese characters, with each character itself being a word in the vocabulary. However, our experiments show that search accuracy levels differ significantly when a query is or is not in the vocabulary. We examine this problem with a word-lattice-based spoken term detection task. We propose a two-stage method by first locating candidates by partial phonetic matching and then refining the matching score with word lattice rescoring. Experiments show that the proposed method achieves a 24.1% relative improvement for OOV queries on a large-scale Chinese spoken term detection task.
    Key words computer application; Chinese information processing; Chinese spoken term detection; out-of-vocabulary; lattice; large-vocabulary continuous speech recognition
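The first stage, partial phonetic matching, can be approximated with a similarity score over phone strings, as in the sketch below; the phone index, ratio threshold, and use of `difflib` are assumptions, and the second-stage lattice rescoring is only stubbed.

```python
# Stage 1 of a two-stage spoken term detection sketch: locate candidate
# utterances whose phone strings partially match the query's phones.
from difflib import SequenceMatcher

def phonetic_candidates(query_phones, index, min_ratio=0.7):
    """index: utterance id -> phone string; returns (id, score) matches."""
    hits = []
    for uid, phones in index.items():
        r = SequenceMatcher(None, query_phones, phones).ratio()
        if r >= min_ratio:
            hits.append((uid, r))
    return hits  # stage 2 would rescore these against the word lattice

index = {"utt1": "zh ong g uo", "utt2": "m ei g uo"}
print(phonetic_candidates("zh ong g uo", index))  # utt1 matches exactly
```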
  • Review
    LUO Kai, LI Miao, Wudabala, YANG Pan, ZHU Hai
    2009, 23(6): 98-105.
    We present a dependency-informed Chinese-to-Mongolian translation model with morphological information to reduce word-form errors. We obtain this information by adding per-word dependency information on the source side and morphological information on the target side, and then construct a LOP-factored translation model. Experimental results demonstrate significant improvements in translation quality in terms of BLEU over the baseline phrase-based system. By balancing the grammatical structure between source and target, this method is particularly suitable for translating from morphologically poor into morphologically rich languages.
    Key wordsartificial intelligence; machine translation; dependency grammar; morphological information; Chinese-to-Mongolian translation model; LOP-Factored model; statistical machine translation
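Attaching per-word factors is mostly a data-preparation step: each source word is annotated with its dependency label in the Moses-style "surface|factor" format. The dependency labels below are illustrative; the LOP combination itself happens inside the decoder and is not reproduced here.

```python
# Prepare factored input lines: word|dependency-label per token.
def factored_line(words, deps):
    return " ".join(f"{w}|{d}" for w, d in zip(words, deps))

print(factored_line(["我", "去", "学校"], ["SBV", "ROOT", "VOB"]))
# -> 我|SBV 去|ROOT 学校|VOB
```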
  • Review
    LIU Zhiwen, HOU Hongxu, LI Saragul, LIU Lin
    2009, 23(6): 105-110.
    A long-distance Mongolian language model adopts statistical methods to model dependencies beyond adjacent words. This paper introduces three approaches to building long-distance Mongolian language models based on trigger pairs and compares their performance within a Chinese-Mongolian machine translation system. An analysis based on the experiments is also conducted. This comparative study can guide the use of trigger-pair-based long-distance Mongolian language models.
    Key words artificial intelligence; natural language processing; trigger pair; Mongolian; language model
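Trigger pairs are commonly selected by a mutual-information-style score over long-distance windows, as sketched below; the score here is an unnormalized PMI used only for ranking, and the window size is an assumption. How the selected triggers enter the language model is where the paper's three variants would differ.

```python
# Rank candidate trigger pairs (a -> b within a window) by a PMI-like score.
import math
from collections import Counter

def trigger_pairs(sents, window=10, top=5):
    uni, pair, n = Counter(), Counter(), 0
    for s in sents:
        for i, w in enumerate(s):
            uni[w] += 1
            n += 1
            for v in s[i + 1:i + 1 + window]:
                pair[(w, v)] += 1
    score = {p: math.log2(pair[p] * n / (uni[p[0]] * uni[p[1]]))
             for p in pair}
    return sorted(score, key=score.get, reverse=True)[:top]

sents = [["股市", "上涨", "很多", "投资者", "获利"],
         ["股市", "下跌", "许多", "投资者", "亏损"]]
print(trigger_pairs(sents, window=4, top=3))
```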
  • Review
    S·Loglo
    2009, 23(6): 110-116.
    This paper first analyzes the error types in Mongolian text, the causes of these errors, and the methods commonly used for spell-checking and error correction in Mongolian text proofreading. Then, according to the characteristics of Mongolian encoding and writing rules, an automatic proofreading algorithm based on nondeterministic finite automata is introduced. The algorithm greatly improves the speed of spell-checking and error correction by using nondeterministic finite automata in its knowledge dictionary.
    Key words artificial intelligence; natural language processing; Mongolian; proofreading; automata; morphological analysis
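The flavor of automaton-based spell-checking can be shown with a trie searched nondeterministically, allowing one edit (substitution, insertion, or deletion) along the way. This Latin-alphabet toy only illustrates the idea; the paper's automaton over Mongolian codes is more elaborate.

```python
# Dictionary lookup tolerating one edit, via nondeterministic trie search.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def near_match(node, word, edits=1):
    if not word:
        # exact end, or one trailing char missing from the input
        return "$" in node or (edits > 0 and any(
            near_match(child, word, edits - 1)
            for ch, child in node.items() if ch != "$"))
    ch, rest = word[0], word[1:]
    if ch in node and near_match(node[ch], rest, edits):
        return True
    if edits == 0:
        return False
    # nondeterministic branches: substitute ch (consume edge + input char),
    # insert a dictionary char (consume edge only), delete ch (skip input)
    return any(near_match(child, rest, edits - 1) or
               near_match(child, word, edits - 1)
               for c2, child in node.items() if c2 != "$") or \
           near_match(node, rest, edits - 1)

trie = build_trie(["mongol", "morin"])
print(near_match(trie, "mongol"), near_match(trie, "mangol"))  # True True
```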
  • Review
    Zaokere·Kadeer, Aishan·Wumaier, Tuergen·Yibulayin, Askar·Hamudula
    2009, 23(6): 116-122.
    In this paper, we present the generation of a Uyghur noun inflectional suffix DFA for the purpose of morphological analysis. Because Uyghur is an agglutinative language, morphological analysis is an essential task for Uyghur language processing. In Uyghur, inflectional suffixes are affixed to the stem according to certain ordering rules, which paves the way for modelling the morphological structure of Uyghur inflectional suffixes with finite state machines (FSMs). We introduce the word formation structure of Uyghur nouns, then construct a right-to-left ordered FSM using the inflectional suffix concatenation rules, and finally reverse it and convert it into a DFA.
    Key words artificial intelligence; natural language processing; Uyghur; agglutinative; inflectional suffix; DFA; vowel harmony; stemming
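A suffix-ordering DFA is just a transition table over suffix classes. The sketch below accepts suffix sequences that obey toy ordering rules; the class labels (PL, CASE) and the allowed orders are purely illustrative and do not reflect the real Uyghur suffix inventory or its ordering.

```python
# DFA over suffix classes: a word is accepted if every adjacent pair of
# suffix classes (starting from STEM) is a licensed transition.
RULES = {("STEM", "PL"), ("STEM", "CASE"), ("PL", "CASE"),
         ("STEM", "END"), ("PL", "END"), ("CASE", "END")}

def accepts(suffixes):
    state = "STEM"
    for s in suffixes:
        if (state, s) not in RULES:
            return False
        state = s
    return (state, "END") in RULES

print(accepts(["PL", "CASE"]), accepts(["CASE", "PL"]))  # True False
```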