2007 Volume 21 Issue 6 Published: 17 December 2007
  

  • Review
    YU Shi-wen
    2007, 21(6): 3-12.
    After more than two decades of sustained effort, the Comprehensive Language Knowledge-base (CLKB), a major research achievement of the Institute of Computational Linguistics at Peking University (ICL/PKU), passed the Technical Appraisal organized by the Ministry of Education in February 2007. The appraisal concluded that the scale, depth, quality and application results of CLKB are unprecedented in China's language engineering practice, and that this achievement is the most comprehensive and important research outcome in the building of multi-language knowledge-bases with Chinese as the center, generally reaching world-class level. This paper briefly describes the scale, composition, quality and development of CLKB, which is based on the Grammatical Knowledge-base of Contemporary Chinese (GKB), and then emphasizes the rationale behind its construction, in the hope of sharing with readers the knowledge and experience gained from studying the cross-disciplinary fields of Computational Linguistics and Natural Language Processing. The author also reviews the application practice of this achievement and assesses its application potential, aiming to pave the way, or send out a trial balloon, for the development of multi-language information processing techniques with Chinese as the center.
  • Review
    FENG Su-qin, CHEN Hui-ming
    2007, 21(6): 13-16.
    Combinational ambiguity is a challenging issue in Chinese word segmentation, since its disambiguation depends on contextual information. This paper collects contextual statistics for combinationally ambiguous words and establishes a context model using the log-likelihood ratio. A weight calculation formula is designed that takes into account the window size, location and frequency of contextual information. On this basis, two disambiguation methods are investigated: one uses the maximum log-likelihood ratio in the contextual information; the other uses the maximum sum of log-likelihood ratios between the combined and separated readings in the contextual information. Tested on 14 high-frequency ambiguous words, the average accuracy of the former method reaches 84.93%, and that of the latter reaches 95.60%. The experimental results show that combining contextual information is effective for disambiguation.
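The abstract does not give the paper's exact weighting formula, but the log-likelihood ratio it builds on is conventionally Dunning's statistic over a 2x2 contingency table of context-word and reading co-occurrence counts. A minimal sketch (the variable names and count layout are illustrative, not from the paper):

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table, e.g.:
    k11: context word seen with the 'combined' reading
    k12: context word seen with the 'separated' reading
    k21/k22: all other context words with each reading."""
    def h(*ks):  # sum of k * log(k/N) terms; zero counts contribute nothing
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row marginals
                - h(k11 + k21, k12 + k22))   # column marginals
```

Independent counts give a score near zero, while a context word strongly associated with one reading scores high, which is what makes the statistic usable as a per-context-word disambiguation weight.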
  • Review
    SHEN Jia-yi, LI Fang, XU Fei-yu, Hans Uszkoreit
    2007, 21(6): 17-21.
    This paper proposes a rule-based method for recognizing Chinese organization names and their abbreviations. The right boundary of an organization name is identified with the help of an organization suffix lexicon; the left boundary is recognized by optimal rules based on a Bayesian probability model. After identifying an organization name, candidate abbreviations are generated according to abbreviation rules. In the open test, name recognition achieves a recall of 85.19%, a precision of 83.03% and an F-measure of 84.10%, while abbreviation recognition achieves a recall of 67.18% and a precision of 74.14%. The method has been applied in a Chinese relation identification system.
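The suffix-lexicon step for right-boundary detection can be sketched as a scan for known organization suffixes; the suffix set below is a toy example, not the paper's lexicon:

```python
# Toy organization-suffix lexicon (the paper's actual lexicon is far larger).
ORG_SUFFIXES = {"公司", "大学", "银行", "研究所", "集团"}

def org_right_boundaries(sentence):
    """Return candidate right boundaries (end indices, exclusive) of
    organization names: positions where a known suffix ends."""
    ends = []
    for i in range(len(sentence)):
        for suf in ORG_SUFFIXES:
            if sentence.startswith(suf, i):
                ends.append(i + len(suf))
    return sorted(set(ends))
```

Each returned position then seeds the left-boundary search, which in the paper is driven by Bayesian-scored rules rather than the simple lookup shown here.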
  • Review
    FENG Yuan-yong, SUN Le, DONG Jing, LI Wen-bo
    2007, 21(6): 22-28.
    As a typical phenomenon in language, coreference demands careful treatment in natural language processing. We describe a novel algorithm that integrates globally evaluated classification confidence to ensure that pairs with high confidence take high priority in the clustering procedure. Experiments under a supervised learning framework, both isolated and joint, show significant gains for the coreference resolution system.
  • Review
    HAO Xiao-yan, LI Ji-hong, YOU Li-ping, LIU Kai-ying
    2007, 21(6): 29-35.
    A Question Answering System for Reading Comprehension (QARC) automatically analyzes a passage of natural language text and generates an answer to each question based on information in the passage. The reading comprehension task can serve as a valuable tool for evaluating the performance of a natural language understanding system. Unfortunately, the lack of a Chinese Reading Comprehension Corpus (CRCC) is the main obstacle to the research and development of Chinese QARC. This paper describes in detail the process of building such a corpus, including material selection, question compilation, answer labeling, corpus processing and evaluation methods. In particular, texts are annotated on three layers, namely frame element, phrase type and syntactic function, based on the knowledge base of Chinese FrameNet (CFN).
  • Review
    Wumaierjiang.Kuerban, Alifu.Kuerban
    2007, 21(6): 36-42.
    This article elaborates the construction of a Uyghur FrameNet based on valence descriptions and an actual corpus, which has a wide range of applications and growing prospects in fields such as the construction of Uyghur glossaries and machine-readable Uyghur semantic dictionaries. The paper argues for the significance of establishing an online corpus that describes the relationship between syntactic function and conceptual (semantic) structure and can be used in Uyghur natural language processing, and it introduces the Uyghur FrameNet into the study of Uyghur language processing. As an online corpus, the Uyghur FrameNet encodes the valence concept: the syntactic and semantic information about the respective implications of each lexeme. The paper proposes a new research method for language information processing based on the Uyghur FrameNet, such as the explanation of the syntactic and semantic characteristics of each frame element, and explores the method of constructing a Uyghur FrameNet based on valences.
  • Review
    TIAN Xuan, DU Xiao-yong, LI Hai-hua
    2007, 21(6): 43-51.
    We propose a Term-Subject-Association-based Language Model (TSA-LM) for document retrieval. Its main idea is to divide a document into two parts: one composed only of subject words (the subject block), and the other containing no subject words (the non-subject block). The query likelihood of a document is measured by combining the query likelihoods of the two blocks. For the non-subject block, we follow the classical language model; for the subject block, we use a language model smoothed by term-subject association. The term-subject association is weighted by term-subject co-occurrence and the term-document-subject labeling relationship. Experimental results on a public dataset show that TSA-LM improves retrieval effectiveness.
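The two-block combination can be sketched with a standard query-likelihood model; note the Jelinek-Mercer smoothing and the interpolation weight `alpha` below are simplifying assumptions, since the paper's subject-block model is smoothed by term-subject association rather than this generic scheme:

```python
import math
from collections import Counter

def lm_logprob(query, block_tokens, collection_tokens, lam=0.7):
    """Jelinek-Mercer smoothed unigram query log-likelihood for one block."""
    tf, cf = Counter(block_tokens), Counter(collection_tokens)
    bl, cl = max(len(block_tokens), 1), len(collection_tokens)
    score = 0.0
    for w in query:
        p = lam * tf[w] / bl + (1 - lam) * cf[w] / cl
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor unseen terms
    return score

def two_block_score(query, subject_block, nonsubject_block, collection, alpha=0.5):
    # Hypothetical linear combination of the two block scores.
    return (alpha * lm_logprob(query, subject_block, collection)
            + (1 - alpha) * lm_logprob(query, nonsubject_block, collection))
```

A document whose subject block actually contains the query terms then outscores one whose subject block does not, which is the behavior the block split is designed to exploit.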
  • Review
    LIU Zhi-yuan, SUN Mao-song
    2007, 21(6): 52-58.
    Some properties of human languages can be characterized by complex network analysis. In this paper, word co-occurrence networks for Chinese are first constructed automatically from very large manually word-segmented Chinese corpora of different sizes and styles. Systematic observations are then made on these networks from the complex network point of view. Experimental results show that these networks display two important features of complex networks: (1) the average distance between two words is 2.63-2.75, and the clustering coefficient is much greater than that of a random network with the same parameters, exhibiting a typical small-world effect; and (2) the degree distributions of these networks generally obey a power law, i.e., the scale-free property. In addition, a quantitative analysis is conducted on the kernel lexicons derived from these experiments.
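The two reported metrics, average distance and clustering coefficient, have standard definitions and can be computed directly on a small co-occurrence graph. A self-contained sketch over an adjacency-set representation (the graph here is illustrative; the paper's networks have millions of edges):

```python
from collections import deque

def clustering_coefficient(adj, v):
    """Fraction of v's neighbour pairs that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def average_distance(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        for v, d in dist.items():
            if v != src:
                total += d
                pairs += 1
    return total / pairs
```

On real word co-occurrence networks, all-pairs BFS is too slow and the average distance is typically estimated by sampling source nodes, but the definitions are the same.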
  • Review
    WANG Jing-fan, WU Xiao-jun, XIA Yun-qing, ZHENG Fang
    2007, 21(6): 59-64.
    In modern Chinese information retrieval systems, classical keyword-based string matching fails when the input string differs from the entries in the database. This paper proposes a method based on Tarhio and Ukkonen's filtering algorithm to solve the problem. Because errors in Chinese Pinyin typing usually involve Chinese characters with the same or similar pronunciations, we define a special edit distance and extend our method accordingly. The experimental results show that our algorithm improves the recall rate of retrieval systems and achieves practical sub-linear complexity.
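The "special edit distance" idea can be sketched as a Levenshtein distance whose substitution cost is reduced when two pinyin syllables sound alike; the similarity pairs and the 0.5 cost below are illustrative assumptions, not the paper's parameters:

```python
def pinyin_edit_distance(a, b, similar=None, sim_cost=0.5):
    """Levenshtein distance over syllable sequences, where substituting
    similar-sounding syllables costs sim_cost instead of 1."""
    similar = similar or set()
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)          # deletions
    for j in range(n + 1):
        d[0][j] = float(j)          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in similar or (b[j - 1], a[i - 1]) in similar:
                sub = sim_cost       # confusable pair, e.g. zh/z initials
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[m][n]
```

In a full system this distance would be embedded in the Tarhio-Ukkonen filter so that most database entries are rejected without computing the full dynamic-programming table.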
  • Review
    GU Bo, LI Ji-hong, LIU Kai-ying
    2007, 21(6): 65-70.
    Most traditional clustering algorithms treat each attribute equally. In contrast, the COSA (clustering on subsets of attributes) algorithm [1] assumes that an attribute may carry a different weight in different groups, and that objects in different groups may cluster on different subsets of attributes. A new distance definition is presented in [1], along with two COSA algorithms: COSA1, a partitioning algorithm, and COSA2, a hierarchical clustering algorithm. In this paper, the COSA distance and COSA1 are applied to Chinese documents in order to compare the COSA distance with the Euclidean distance. The results show that the COSA algorithms achieve better performance and are more robust when the number of attributes changes.
  • Review
    HONG Yu, ZHANG Yu, LIU Ting, LI Sheng
    2007, 21(6): 71-87.
    Topic detection and tracking, one of the natural language processing technologies, aims to detect unknown topics and track known topics in news media streams. Since its pilot research in 1996, several large-scale evaluation conferences have provided a good environment for evaluating technologies for recognizing, collecting and organizing topics. As topic detection and tracking shares challenges with information retrieval, data mining and information extraction in dealing with abrupt and successive data, it has become a hot research issue in the field of natural language processing. This paper introduces the background, definitions, evaluation and methods of topic detection and tracking, and explores its future development trends through an analysis of current research.
  • Review
    TANG Hui-feng, TAN Song-bo, CHENG Xue-qi
    2007, 21(6): 88-94.
    Sentiment classification is an applied technology of great significance: it can alleviate information disorder and help people locate the reviews they need on the Internet. To date, most research on sentiment classification has addressed English reviews, and little work has been done on Chinese reviews. To find an effective supervised machine learning approach to the task and to analyze the influence of term representation and term selection, this paper conducts experiments on Chinese text collections under distinct settings, including different feature representations, feature selection methods, categorization techniques, feature set sizes and training data sizes. The experimental results show that sentiment classification achieves high performance when using bigram representation, information gain feature selection, an SVM classifier, sufficient training data and plenty of features.
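Information gain, the feature selection method the experiments favor, can be sketched as follows, treating each document as a set of features (e.g. bigrams); the toy documents in the usage note are illustrative:

```python
import math
from collections import Counter

def information_gain(docs, labels, feature):
    """IG of a binary feature (presence in a document) w.r.t. class labels."""
    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    n = len(docs)
    base = entropy(list(Counter(labels).values()))
    with_f = [l for d, l in zip(docs, labels) if feature in d]
    without = [l for d, l in zip(docs, labels) if feature not in d]
    cond = sum(len(part) / n * entropy(list(Counter(part).values()))
               for part in (with_f, without) if part)
    return base - cond
```

A feature that perfectly separates positive from negative documents gets the full class entropy as its gain, while a feature present in every document gets zero, so ranking features by this score keeps the discriminative ones.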
  • Review
    XU Jun, DING Yu-xin, WANG Xiao-long
    2007, 21(6): 95-100.
    In this paper, we study how to apply machine learning techniques to sentiment classification, whose main task is to determine whether a news item or review is negative or positive. Naive Bayes and Maximum Entropy classifiers are used for the sentiment classification of Chinese news and reviews. The experimental results show that the methods perform well, with classification accuracy reaching about 90%. Moreover, we find that selecting words with polarity as features, tagging negations, and representing test documents as feature-presence vectors can improve the performance of sentiment classification. Nevertheless, sentiment classification remains a challenging problem.
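Negation tagging, one of the techniques the authors report as helpful, can be sketched as follows; the negation word list and the single-token scope are simplifying assumptions (real systems extend the scope to a clause or punctuation boundary):

```python
# Toy Chinese negation words (illustrative, not the paper's list).
NEGATIONS = {"不", "没", "没有"}

def negation_tag(tokens):
    """Prefix the token following a negation word with 'NOT_', so that
    e.g. negated '喜欢' (like) becomes a distinct feature from plain '喜欢'."""
    out, negate = [], False
    for t in tokens:
        if t in NEGATIONS:
            out.append(t)
            negate = True
        elif negate:
            out.append("NOT_" + t)
            negate = False
        else:
            out.append(t)
    return out
```

The tagged tokens are then converted to the feature-presence vectors mentioned in the abstract, so the classifier sees "NOT_喜欢" and "喜欢" as different evidence.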
  • Review
    SUN Hong-gang, LU Yu-liang, LIU Jin-hong, GONG Bi-hong
    2007, 21(6): 101-108.
    The disproportion among the dimensions of class vectors causes trouble for text categorization with the Vector Space Model (VSM). We therefore introduce the idea of vector expansion and define the Concentration of Effective Original Information (CEOInfo) to resolve the problem. Based on HowNet, a semantic dictionary, we use different expansion strategies for vectors of high and low dimensions. This method reduces the gap in CEOInfo among different classes. Experiments show that categorization precision is enhanced by HowNet-based VSM expansion under the condition of disproportionate class-vector dimensions.
  • Review
    QIU Zhi-hong, GONG Lei-guang
    2007, 21(6): 109-115.
    In traditional text clustering with the Vector Space Model, only frequency is used as the weight of a word, and context is not taken into account. In this paper, we describe a method that weights the degree to which the context supports a word using relations between words, which are captured by an ontology dictionary. The supporting degree is weighted by both the frequencies of related words and the weights of the relations. A general methodology for automatically structuring the ontology dictionary is also given. Experiments show that our method effectively reduces noise and performs better than the traditional method.
  • Review
    TANG Lin, YIN Jun-xun
    2007, 21(6): 116-124.
    Tone is an important factor in Putonghua speech and plays a crucial role in assessing Putonghua proficiency levels. An objective evaluation system for syllabic tone is an important part of the Putonghua Proficiency Test (PPT). Based on an analysis of tone features, this paper presents a novel method to eliminate the influence of speaking rate and the interaction between syllables. Five pitch ratios describing tonal features, together with normalized fundamental frequencies, are selected as tone evaluation parameters. A Gaussian Mixture Model is used to test recorded speech from 60 different speakers. Experimental results show that the system achieves 88.24% agreement with subjective evaluation results.
  • Review
    Dilmurat.Tursun, Wayit.Ablez, Turgun.Ibrahim
    2007, 21(6): 125.
    This paper designs and presents a Unicode-based smart input method (IME) system, an important tool for a Chagatai script digitization system. This module affects the efficiency of transcription between modern Uyghur and Chagatai scripts. Experimental tests validate that the smart IME tool presented in this paper is accurate, stable and easy to operate.