2010 Volume 24 Issue 5 Published: 18 October 2010
  

  • Review
    WU Xiaofeng, ZONG Chengqing
    2010, 24(5): 3-10.
    Paraphrase recognition can be regarded as a sub-problem of textual entailment recognition. The task is difficult because methods that rely only on term frequency or syntactic information are prone to misjudgment: the same set of words can form sentences with entirely different meanings, and sentences with similar parse trees may likewise differ in meaning. In this paper we present a new approach to paraphrase recognition based on Semantic Role Labeling (SRL). We first label sentences with semantic roles and then derive features that partially represent sentence meaning, also taking the particular characteristics of news sentences into account. Our experiments demonstrate the effectiveness of the approach (a rough illustrative sketch follows the key words below).
    Key words: natural language processing; semantic role labeling; paraphrase recognition
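As a rough illustration of the kind of SRL-derived features the abstract describes, the sketch below compares the predicates and argument fillers of two SRL-annotated sentences. The input format and the two overlap features are assumptions for illustration, not the authors' actual feature set.

```python
# Illustrative sketch: comparing SRL-labeled sentences for paraphrase features.
# The predicate/argument dictionaries below are assumed SRL output, not the
# authors' actual representation.

def role_overlap_features(srl_a, srl_b):
    """Compute simple overlap features between two SRL analyses.

    Each analysis maps a predicate lemma to a dict of role -> filler words,
    e.g. {"buy": {"A0": {"google"}, "A1": {"startup"}}}.
    """
    preds_a, preds_b = set(srl_a), set(srl_b)
    shared_preds = preds_a & preds_b
    pred_overlap = len(shared_preds) / max(len(preds_a | preds_b), 1)

    role_matches, role_total = 0, 0
    for p in shared_preds:
        roles = set(srl_a[p]) | set(srl_b[p])
        for r in roles:
            role_total += 1
            if srl_a[p].get(r, set()) & srl_b[p].get(r, set()):
                role_matches += 1
    role_overlap = role_matches / max(role_total, 1)
    return {"pred_overlap": pred_overlap, "role_overlap": role_overlap}


if __name__ == "__main__":
    s1 = {"acquire": {"A0": {"google"}, "A1": {"startup"}}}
    s2 = {"acquire": {"A0": {"startup"}, "A1": {"google"}}}  # same words, roles swapped
    print(role_overlap_features(s1, s2))  # low role_overlap despite identical words
```
Features of this kind let two sentences that share a word set but assign the words to different roles receive a low paraphrase score.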
  • Review
    ZAN Hongying1, ZHANG Junhui1, ZHU Xuefeng2, YU Shiwen2
    2010, 24(5): 10-17.
    Focusing on the knowledge base of contemporary Chinese function words, the authors have preliminarily completed a triune knowledge base (usage dictionary, usage rules, and usage corpus) of contemporary Chinese function words, covering adverbs, prepositions, conjunctions, auxiliaries, and modal words. This paper examines the usages of the adverb JIU in the segmented and part-of-speech-tagged corpus of the People's Daily (January 1998). It provides a formal description of the usages of the adverb JIU and their recognition through rule description, automatic tagging, manual analysis, and machine learning modeling, together with detailed experimental results.
    Key words: Chinese function word; usage rule; Conditional Random Fields; Maximum Entropy; Support Vector Machine; automatic identification
  • Review
    YU Huanhuan, QIAN Longhua, ZHOU Guodong, ZHU Qiaoming
    2010, 24(5): 17-24.
    This paper proposes a convolution tree kernel-based approach to Chinese semantic relation extraction. It constructs a unified syntactic and entity semantic tree by incorporating entity semantic information, such as entity type, entity subtype, and mention type, into the structural information of a relation instance. The motivation is to capture both structural and entity semantic information in a unified way so as to boost the predictive performance of relation extraction. Evaluation on the ACE RDC 2005 Chinese benchmark corpus shows that our method significantly improves Chinese semantic relation extraction, achieving a best F-measure of 67.0 on top-level relation extraction, and demonstrates that structural syntactic information and entity semantic information are complementary for this task.
    Key words: Chinese semantic relation extraction; convolution tree kernel; entity semantic information
  • Review
    CHEN Jiuchang, KONG Fang, ZHU Qiaoming, ZHOU Guodong
    2010, 24(5): 24-31.
    This paper presents an automatic approach using an SVM with a composite kernel to determine whether "it" in a text refers to a preceding noun phrase or is instead non-referential, within a feature-based English pronoun coreference resolution platform. We extract structural information and flat features for "it" in order to construct an anaphoricity filter, and we examine the filter's performance by introducing it into the pronoun coreference resolution task. Evaluation on the ACE 2003 benchmark corpus shows that the filter achieves its best performance with the composite kernel and that pronoun coreference resolution is improved by employing the filter.
    Key words: anaphoricity determination; composite kernel; coreference resolution
  • Review
    ZHAO Wei, HOU Hongxu, CONG Wei, SONG Meina
    2010, 24(5): 31-36.
    Stems (etyma) and morphological affixes are the components of Mongolian words and carry a great deal of grammatical information; exploiting this information helps to process Mongolian effectively. Because Mongolian words appear as undivided wholes, detecting the stem and each morphological affix is necessary to capture this grammatical information. By analyzing the morphological structure of Mongolian words, this paper proposes an effective labeling scheme for Mongolian words and builds a practical Mongolian word segmentation system based on a conditional random fields model (a generic sketch of such labeling follows the key words below). Experiments show that segmentation accuracy improves significantly over the existing system, reaching 0.992.
    Key words: Mongolian; word segmentation; etyma; morphological affix; conditional random fields; statistical language model
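The abstract does not give the tag set or features, so the following sketch only illustrates the general recipe of casting segmentation as character-level sequence labeling with a CRF. The B/I stem-and-suffix tags, the features, and the Latin placeholder word are invented, and the sketch assumes the sklearn-crfsuite package is installed.

```python
# Minimal sketch of character-level CRF labeling for stem/affix segmentation.
# Tag set and features are illustrative assumptions; requires sklearn-crfsuite.
import sklearn_crfsuite

def char_features(word, i):
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
        "position": str(i),
        "is_last": str(i == len(word) - 1),
    }

def word_to_features(word):
    return [char_features(word, i) for i in range(len(word))]

# Toy training pair: a placeholder transliterated word with per-character
# stem/suffix tags (not real Mongolian data).
train = [
    ("unidag", ["B-STEM", "I-STEM", "I-STEM", "I-STEM", "B-SUF", "I-SUF"]),
]
X_train = [word_to_features(w) for w, _ in train]
y_train = [tags for _, tags in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([word_to_features("unidag")]))
```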
  • Review
    Duojiezhuoma
    2010, 24(5): 36-41.
    This paper introduces frame-based knowledge representation to describe Tibetan letters. After analyzing the construction process of frame knowledge with examples, it investigates several issues in frame knowledge representation for Tibetan letters, including the composition of the knowledge frame, structural description, internal organization, and internal representation. This yields a frame-structured description of Tibetan letters and a preliminary Tibetan letter frame system, which is fundamental to further research on the frame knowledge of Tibetan vocabulary, phrases, and, eventually, the Tibetan language as a whole.
    Key words: Tibetan; frame; frame knowledge; knowledge representation
  • Review
    Zhaxijia, Dunzhuciren
    2010, 24(5): 41-46.
    The article analyzes the grammatical information, semantic information, and functional structure of Tibetan auxiliary words, and supplies detailed parameters for establishing a grammatical property database of Tibetan auxiliary words. This is of great significance for analyzing and describing sentences and for detecting sentence ambiguity.
    Key words: Tibetan auxiliary; grammatical information
  • Review
    CAIzhijie, CAIrangzhuoma
    2010, 24(5): 46-50.
    Corpus processing is a complicated language engineering project, in which segmentation and tagging are the fundamental tasks. The part-of-speech tagging dictionary is an essential component of the tagging process, directly affecting the speed and efficiency of tagging. Based on the design of the "Ban Zhi Da Tibetan Auto-tagging System" project supported by the National Language Committee, this paper presents the construction of the tagging dictionary and its index and search algorithm (a generic sketch of such a dictionary follows the key words below). Experiments on an 850 KB Tibetan corpus show that word segmentation accuracy reaches 99% and tagging accuracy reaches 97%.
    Key words: corpus of Tibetan; segmentation; tagging; dictionary; index
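The abstract does not spell out the index or search algorithm, so the sketch below shows one conventional design that fits the description: a trie-indexed tagging dictionary with longest-match lookup. The data structure, the entries, and the Latin placeholder strings are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a tagging dictionary with a trie index and
# longest-match lookup; an assumed, generic design.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.tags = None  # POS tags stored where a dictionary word ends

class TaggingDictionary:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, tags):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.tags = tags

    def longest_match(self, text, start):
        """Return (word, tags) for the longest dictionary entry starting at `start`."""
        node, best = self.root, (text[start], None)  # fall back to a single character
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.tags is not None:
                best = (text[start:i + 1], node.tags)
        return best

# Toy usage with placeholder entries (not real Tibetan data).
d = TaggingDictionary()
d.add("ab", ["n"])
d.add("abc", ["v"])
print(d.longest_match("abcd", 0))  # ('abc', ['v'])
```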
  • Review
    XIAO Lei, CHEN Xiaohe
    2010, 24(5): 50-56.
    An automatic approach to finding version differences among ancient Chinese texts is proposed. First, we find the sentence beads with the highest similarity by comparing the bigrams of each pair of sentences. Second, we iteratively remove the longest common substrings from each pair of differing sentences and output the remaining differences (a sketch of both steps follows the key words below). We take three versions of the CHUNQIU (Spring and Autumn Annals) as a running case. The experimental results indicate that our approach succeeds in finding all sentence beads and in identifying all textual differences defined as version differences in this paper.
    Key words: ancient Chinese text; version difference; sentence bead; similarity
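A compact sketch of the two steps described above, under assumptions: character bigrams with Jaccard overlap for bead alignment, and difflib's longest-match routine for the iterative removal of common substrings. The toy English sentences and the length threshold stand in for the real Chinese versions and whatever cutoff the paper uses.

```python
# Rough sketch: (1) pair sentences across versions by bigram overlap,
# (2) iteratively strip longest common substrings to expose the differences.
from difflib import SequenceMatcher

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / max(len(ba | bb), 1)

def align(version_a, version_b):
    """Greedy sentence-bead alignment by highest bigram similarity."""
    beads = []
    for sa in version_a:
        best = max(version_b, key=lambda sb: similarity(sa, sb))
        beads.append((sa, best, similarity(sa, best)))
    return beads

def differences(a, b, min_len=2):
    """Iteratively remove longest common substrings; return the residue."""
    while True:
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        if m.size < min_len:
            break
        a = a[:m.a] + a[m.a + m.size:]
        b = b[:m.b] + b[m.b + m.size:]
    return a, b

if __name__ == "__main__":
    v1 = ["the duke entered the city in spring"]
    v2 = ["the duke entered the town in spring"]
    for sa, sb, score in align(v1, v2):
        print(round(score, 2), differences(sa, sb))  # residue: ('city', 'town')
```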
  • Review
    LI Shoushan, HUANG ChuRen
    2010, 24(5): 56-62.
    Sentiment-based text categorization (sentiment classification for short) is the task of classifying text according to the subjective information it contains. It has been widely studied in natural language processing (NLP) because of its many practical applications, and many supervised machine learning approaches have been applied to it. In this paper, we investigate four classification approaches and propose a new stacking-based method for combining them (a generic sketch of stacking follows the key words below). Experimental results show that the combination achieves better performance than the best single approach, so it avoids the need to select a suitable classification approach for each domain.
    Key words: computer application; natural language processing; sentiment classification; multiple classifier combination
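A minimal sketch of stacking-based classifier combination using scikit-learn. The four base classifiers, the meta-learner, and the toy data are placeholders and are not claimed to be the four approaches evaluated in the paper.

```python
# Stacking sketch: four placeholder base classifiers combined by a
# logistic-regression meta-learner over TF-IDF features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Toy sentiment data; 1 = positive, 0 = negative.
texts = ["great movie, loved it", "wonderful acting and story", "a joy to watch",
         "terrible plot, waste of time", "boring and dull", "awful dialogue"]
labels = [1, 1, 1, 0, 0, 0]

base = [
    ("nb", MultinomialNB()),
    ("svm", LinearSVC()),
    ("me", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
]
model = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(estimators=base, final_estimator=LogisticRegression(), cv=2),
)
model.fit(texts, labels)
print(model.predict(["what a great film", "what a waste of time"]))
```
The meta-learner is trained on cross-validated predictions of the base classifiers, which is the key idea behind stacking as opposed to simple voting.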
  • Review
    WU Shiyong, WANG Mingwen
    2010, 24(5): 62-70.
    Traditional search engine evaluation methods require manual annotation of correct answers for a set of queries, which is costly and time-consuming. In this paper, we present an automatic search engine performance evaluation method based on clustering analysis. The method has three steps: first, computing an information coverage score for the search results of each query; second, clustering the search results by coverage score; and finally, evaluating retrieval performance using intra-cluster cohesion and inter-cluster separation (a generic sketch of these measures follows the key words below). Experimental results show that the automatic method yields evaluation results similar to those of traditional assessor-based methods.
    Key words: information retrieval; performance evaluation; clustering analysis
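The following sketch illustrates generic intra-cluster cohesion and inter-cluster separation measures over clustered result vectors. The random toy vectors, the use of k-means, and these particular distance-based definitions are assumptions, since the abstract does not give the exact formulas.

```python
# Generic cohesion/separation sketch for clustered result vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy "coverage score" feature vectors for retrieved documents.
X = np.vstack([rng.normal(0.2, 0.05, (20, 3)), rng.normal(0.8, 0.05, (20, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids, labels = km.cluster_centers_, km.labels_

# Cohesion: mean distance of points to their own centroid (lower = tighter clusters).
cohesion = np.mean([np.linalg.norm(x - centroids[l]) for x, l in zip(X, labels)])
# Separation: mean pairwise distance between centroids (higher = better separated).
pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
separation = np.mean([np.linalg.norm(centroids[i] - centroids[j]) for i, j in pairs])

print(f"cohesion={cohesion:.3f}  separation={separation:.3f}")
```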
  • Review
    ZHANG Hui, ZHOU Jingmin, WANG Liang, ZHAO Liping
    2010, 24(5): 70-77.
    Topic tracking (TT), which grew out of the Topic Detection and Tracking (TDT) tasks, is a technology for intelligently acquiring information about the dynamic development of events. Its aim is to automatically track the subsequent news stories of known events in the news media information stream. By analyzing the shortcomings of the traditional document vector space model and the characteristics of news reports, this paper presents a new three-dimensional document vector model that emphasizes the theme and the entities of news stories. We then propose a topic model suited to the characteristics of news reports, which adjusts itself to the development of events during tracking by means of self-learning. Combining this topic model, we design a complete adaptive KNN topic tracking model for Chinese topic tracking (a simplified sketch follows the key words below). Experimental results show that the proposed approach accurately describes news topics, effectively avoids topic drift, and achieves good performance on Chinese topic tracking.
    Key words: topic tracking; topic model; 3-dimensional document vector model; adaptive KNN
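The sketch below is a heavily simplified, assumed rendering of adaptive KNN tracking: an incoming story is scored against its K most similar stories already assigned to the topic, and accepted stories are absorbed so the topic representation keeps adapting. Flat vectors and a fixed threshold replace the paper's three-dimensional document model and self-learning scheme.

```python
# Simplified adaptive KNN topic tracking sketch; vectors and threshold are
# placeholders, not the paper's 3-dimensional document model.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class AdaptiveKNNTracker:
    def __init__(self, seed_vectors, k=3, threshold=0.5):
        self.topic = list(seed_vectors)  # stories known to be on-topic
        self.k, self.threshold = k, threshold

    def track(self, story_vec):
        sims = sorted((cosine(story_vec, v) for v in self.topic), reverse=True)
        score = float(np.mean(sims[: self.k]))
        on_topic = score >= self.threshold
        if on_topic:                      # self-learning step: absorb the new story
            self.topic.append(story_vec)
        return on_topic, score

rng = np.random.default_rng(1)
seed = [rng.random(10) for _ in range(3)]
tracker = AdaptiveKNNTracker(seed)
print(tracker.track(seed[0] + 0.01 * rng.random(10)))
```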
  • Review
    LIU Maofu1, LI Wenjie2, JI Donghong3
    2010, 24(5): 77-85.
    Event-based extractive summarization extracts sentences and re-organizes them into a summary according to the important events that the sentences describe. In this paper, we define an event as event terms and their associated entities, and we focus on event term semantic relations derived from an external linguistic resource. First, a graph based on the event term semantic relations is constructed, and the event terms in the graph are grouped into clusters using a revised DBSCAN clustering algorithm (a sketch of the clustering step follows the key words below). Then, one event term is selected as the representative term of each cluster, or one cluster is selected to represent the main topic of the documents. Finally, the summary is generated by extracting the sentences that contain more informative representative terms. Evaluation on the DUC 2001 document sets shows that it is necessary to take the semantic relations among event terms into consideration and that our summarization approach based on clustering the event term semantic relation graph is effective.
    Key words: event-based summarization; event semantic relation graph; DBSCAN clustering algorithm
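To make the clustering step concrete, the sketch below runs scikit-learn's stock DBSCAN over a precomputed distance matrix derived from pairwise term relatedness. The terms, relatedness values, and eps/min_samples settings are invented for illustration, and the paper's revised DBSCAN may differ from the stock algorithm.

```python
# Clustering event terms by semantic relatedness with DBSCAN on a precomputed
# distance matrix; relatedness scores are invented placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

terms = ["attack", "bomb", "explode", "vote", "elect"]
# Assumed pairwise semantic relatedness in [0, 1]; distance = 1 - relatedness.
rel = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.1],
    [0.8, 1.0, 0.9, 0.1, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
])
dist = 1.0 - rel

clustering = DBSCAN(eps=0.35, min_samples=2, metric="precomputed").fit(dist)
for term, label in zip(terms, clustering.labels_):
    print(term, label)   # terms sharing a cluster label form one event group
```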
  • Review
    LIN Zheng, LV Yajuan, LIU Qun, MA Xirong
    2010, 24(5): 85-92.
    Bilingual parallel corpora are used in many NLP applications, but acquiring large-scale corpora automatically is not easy. This paper proposes an effective solution for mining high-quality bilingual parallel corpora from web pages and analyses the key techniques of obtaining candidate mixed-language web pages and of sentence alignment. We extracted 1.67 million parallel sentence pairs with an average accuracy of 93.75%, and an accuracy of 96% for the first 1 million pairs. This paper also proposes a sentence re-ranking method and a domain information retrieval method for applying the web data to SMT model training. Experiments conducted on the IWSLT tasks show gains of 2 to 5 BLEU points over the baseline.
    Key words: Web mining; parallel corpora; sentence alignment; statistical machine translation
  • Review
    Wangsiriguleng1, Siqintu2, Nasun-urtu3
    2010, 24(5): 92-96.
    In studying phrase-based statistical machine translation from Chinese to Mongolian, we noticed errors in the translation of quantifiers. This paper compares Chinese and Mongolian quantifier translation methods and summarizes the one-to-one, many-to-one, one-to-zero, and one-to-many translation relations between Chinese and Mongolian quantifier words. Experiments show that this method improves the performance of the Chinese-Mongolian machine translation system.
    Key words: Chinese-Mongolian machine translation system; Chinese quantifier; Mongolian quantifier
  • Review
    CHANG Weiling1, FANG Binxing1,2, YUN Xiaochun2, WANG Shupeng2, YU Xiangzhan1
    2010, 24(5): 96-106.
    After surveying existing proposals for compressing Chinese text, we present a universal compression algorithm for Chinese text, CRecode, built on an accurate understanding of the properties of ANSI-coded Chinese text. CRecode highlights the importance of pre-processing for Chinese: it collects the Chinese characters, sorts them by frequency, and then recodes them into 8-bit, 16-bit, or 24-bit codes (a simplified sketch follows the key words below). CRecode can act as a pre-processing tool for ANSI-coded Chinese text for all popular compression utilities, improving their compression ratios by 4% to 30%.
    Key words: CRecode; data compression; Huffman; compression algorithm
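The sketch below illustrates the general idea of frequency-ordered recoding into 1-, 2-, and 3-byte codes. The bucket boundaries, the byte layout, and the GBK size comparison are assumptions made for the sketch and do not reproduce CRecode's actual code assignment.

```python
# Simplified, assumed illustration of frequency-ordered recoding: the most
# frequent characters get 1-byte codes, the next most frequent 2-byte codes,
# and the rest 3-byte codes (prefix-free by reserving lead-byte ranges).
from collections import Counter

ONE = 200        # byte values 0..199: complete 1-byte codes
TWO_LEADS = 50   # byte values 200..249: lead bytes of 2-byte codes
                 # byte values 250..255: lead bytes of 3-byte codes

def build_codebook(text):
    ranked = [ch for ch, _ in Counter(text).most_common()]
    codebook = {}
    for rank, ch in enumerate(ranked):
        if rank < ONE:
            codebook[ch] = bytes([rank])
        elif rank < ONE + TWO_LEADS * 256:
            r = rank - ONE
            codebook[ch] = bytes([ONE + r // 256, r % 256])
        else:
            r = rank - ONE - TWO_LEADS * 256
            codebook[ch] = bytes([ONE + TWO_LEADS + r // 65536,
                                  (r // 256) % 256, r % 256])
    return codebook

def encode(text, codebook):
    return b"".join(codebook[ch] for ch in text)

sample = "汉语文本压缩汉语汉语"          # toy input
cb = build_codebook(sample)
packed = encode(sample, cb)
print(len(sample.encode("gbk")), "->", len(packed), "bytes")
```
Because frequent characters shrink from two bytes to one, the recoded stream is both shorter and more regular, which is what lets a downstream general-purpose compressor do better.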
  • Review
    XIAO Yunpeng, YE Weiping
    2010, 24(5): 106-117.
    The performance of current automatic speech recognition (ASR) systems often deteriorates radically when the input speech is corrupted by various kinds of noise. Such degradation is mainly caused by the mismatch between training and recognition environments, and quite a few techniques have been proposed over the past several years to reduce it. Some of these techniques, such as feature-based normalization, are simple yet powerful means of providing robustness against several forms of signal degradation, so normalization strategies are often the preferred approach to robust speech recognition. They work by normalizing the statistical moments, the cumulative density function, or the power spectral density (PSD) of the feature vectors to compensate for environmental mismatch. This paper reviews the most commonly used feature normalization methods, such as cepstral moment normalization, histogram equalization (HEQ), and modulation spectrum normalization.
    Key words: robust speech recognition; cepstral mean normalization; high order cepstral moment normalization; histogram equalization; cepstral shape normalization
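As a concrete instance of the simplest member of this family, the sketch below applies per-utterance cepstral mean and variance normalization (CMVN) with NumPy; the feature dimensions and the random data are placeholders for real MFCC features.

```python
# Minimal CMVN sketch: each cepstral dimension of an utterance is shifted to
# zero mean and scaled to unit variance, reducing channel/environment mismatch.
import numpy as np

def cmvn(features):
    """features: (num_frames, num_ceps) array of cepstral coefficients."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8     # avoid division by zero
    return (features - mean) / std

utterance = np.random.randn(300, 13) * 2.5 + 1.0   # fake MFCC-like features
normalized = cmvn(utterance)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```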
  • Review
    Dilmurar Tursun, Askar Hamdulla
    2010, 24(5): 117-124.
    This paper presents the first systematic empirical work on vowel devoicing in Uyghur based on text analysis, with the purpose of enhancing the naturalness of Uyghur speech synthesis and the performance of speech recognition systems. The experimental analysis is based on an acoustic database containing phonetic measurements of 460 words with the vowel /i/, 280 words with the vowel /u/, and 150 words with the vowel /ü/. The investigation concentrates on an acoustic analysis of the duration, pitch, and intensity of the vowels /i/, /u/, and /ü/ in Uyghur, both when they are devoiced and when they retain their original voiced features. We believe the study is of high value both for the study of the Uyghur language itself and for the study of the Altaic language family as a whole.
    Key words: natural language processing; Uyghur language; speech; vowel; devoicing
  • Review
    TIAN Wei, WANG Jiangqing, ZHU Zongxiao, LIU Sai, CHENG Li
    2010, 24(5): 124-127.
    At present, research on the women's script (Nüshu) has been based mostly on handwriting, so it has developed slowly, and the informatization of the women's script has become a pressing need. Based on the theory of keyboard arrangement and the writing rules of the women's script, this paper analyzes the rationality of a keyboard arrangement for the women's script and designs a radical-based input method for it. Finally, the input method was tested on sample texts, and the results show a marked improvement in input efficiency. This input method can be widely used in women's script printing and information processing.
    Key words: women script; keyboard arrangement; radical input method; etymon