2000 Volume 14 Issue 6 Published: 15 December 2000
  

    Review
  • Review
    HUANG Xuan-jing,WU Li-de,Ishizaki Hiroyuki,XU Guo-wei
    2000, 14(6): 1-7.
    Text categorization is the task of assigning pre-defined category labels to new documents. This paper proposes a language-independent text categorization model based on machine learning and describes its feature extraction, classifier, and evaluation method in detail. The model has been implemented on two news corpora, one Chinese and one Japanese, and achieves satisfactory categorization effectiveness.
  • Review
    LU Song,LI Xiao-li,BAI Shuo,WANG Shi
    2000, 14(6): 8-13,20.
    Text representation is a fundamental problem in information retrieval applications such as text retrieval, automatic summarization, and search engines. tf.idf (term frequency, inverse document frequency), one of the term-weighting schemes of the Vector Space Model, is a popular text representation that gives good results in information retrieval. The proportion in which terms are distributed across the text collection is one of the most important factors in expressing the content of a text, but tf.idf cannot capture it. This paper therefore proposes an improved approach, tf.idf.IG, which remedies this defect using Information Gain from information theory. Used as one factor in the term-weighting scheme, the Information Gain of a term effectively weights its distribution across the collection. In text classification, tf.idf.IG outperforms the original tf.idf.
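A minimal sketch of the idea, not the authors' formulation: the abstract does not give the exact tf.idf.IG formula, so the code below assumes the weight is simply the product of term frequency, inverse document frequency, and the term's Information Gain over the class labels. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)."""
    classes = sorted(set(labels))
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_t.values())
    n_without = n - n_with
    h_c = entropy([labels.count(c) for c in classes])
    h_with = entropy([with_t[c] for c in classes])
    h_without = entropy([without_t[c] for c in classes])
    return h_c - (n_with / n) * h_with - (n_without / n) * h_without


def tf_idf_ig(docs, labels, doc_index, term):
    """Assumed weighting: tf * idf * IG (the product form is an assumption)."""
    tf = docs[doc_index].count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    return tf * idf * information_gain(docs, labels, term)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [["ball", "game"], ["ball", "team"], ["stock", "market"], ["stock", "game"]]
labels = ["sport", "sport", "finance", "finance"]

w_ball = tf_idf_ig(docs, labels, 0, "ball")
w_game = tf_idf_ig(docs, labels, 0, "game")
```

In this toy corpus "ball" and "game" have the same tf and idf in the first document, but "game" is spread evenly over both classes, so its Information Gain, and therefore its weight under this assumed scheme, is zero.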
  • Review
    DU Lin,ZHANG Yi-bo,SUN Yu-fang
    2000, 14(6): 14-20.
    This paper describes in detail the design and implementation of Search2000, a Web-based Chinese text retrieval system. Compared with traditional full-text retrieval systems, Web-based text retrieval systems have many new properties. Web pages are semi-structured documents connected through hyperlinks. Different Web sites and Web pages may cover different application domains, so they contain many new words and phrases, such as proper names and domain terminology, which hinder further improvement of query precision. Based on this analysis, a new search scheme has been designed for Search2000 that combines intelligent relevance analysis and scoring with efficient access to the index and knowledge databases, so as to improve query precision and reduce response time.
  • Review
    ZHOU Qiang,FENG Song-yan
    2000, 14(6): 21-27.
    In this paper, we introduce a relation network representation for how-net and its implementation method. By constructing three tables (a concept table, a feature table, and a relation table) and the bidirectional, multi-angle connections among them, all the information in how-net can be integrated into a relation network. This provides a good foundation for research on information retrieval and knowledge reasoning based on the knowledge in how-net.
  • Review
    LIU Fang,ZHAO Tie-jun,YU Hao,YANG Mu-yun,FANG Gao-lin
    2000, 14(6): 28-32,39.
    Chunk parsing is an effective way to reduce the difficulty of language parsing. This paper proposes a formal description of the characteristics of Chinese chunks. Based on this description, a statistical algorithm is implemented to recognize Chinese chunks at definite levels. Experiments show that the algorithm is robust and achieves high accuracy in shallow parsing of real Chinese texts.
  • Review
    SUN Le,JIN You-bing,DU Lin,SUN Yu-fang
    2000, 14(6): 33-39.
    This paper proposes an algorithm for automatically extracting a bilingual term lexicon from English-Chinese parallel corpora. The parallel corpora are first aligned with an improved statistical method based on character length, and each side is tagged with part-of-speech categories. The candidate term set is produced by collecting statistics on the nouns and noun phrases of both corpora. The translation probability between every English candidate term and its Chinese translation candidates is then calculated. Finally, the Chinese translation of each English term is selected using a threshold value that varies with word frequency. Experiments on term extraction from real corpora show improved performance.
  • Review
    WANG Bin
    2000, 14(6): 40-44,57.
    This paper focuses on extracting translation pairs from unaligned Chinese-English bilingual corpora. It first introduces two methods proposed by Dr. Pascale Fung, and then revises the latter to meet the needs of real texts. The experimental results show that the revised method is effective and can be applied widely in NLP applications such as phrase extraction and bilingual lexicography.
  • Review
    WEN Yang,YUAN Chun-fa,HUANG Chang-ning
    2000, 14(6): 45-50.
    This paper proposes a bidirectional hierarchical clustering algorithm that clusters words of different categories simultaneously. During clustering, the process is interactive and alternating. We construct an objective function based on Minimum Description Length (MDL). To solve the problem caused by sparse data, two concepts, modificatory degree and modificatory distance, are proposed. Applying the algorithm to the clustering of Chinese adjectives and nouns demonstrates that it is effective.
  • Review
    ZHANG Jing,ZHAO Tie-jun,YAO Jian-min,LI Sheng
    2000, 14(6): 51-57.
    To improve the translation of English complex sentences in an English-Chinese translation system, we present a new corpus-based approach to subordinate clause recognition. Using the part-of-speech tags of an English sentence, the approach integrates rule-based and statistical methods to recognize subordinate clauses. Precision and recall of subordinate clause recognition are tested on both closed and open corpora. The closed test yields 92.9% precision and 80% recall, and the open test yields 80.34% precision and 83.93% recall.
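For reference, the two measures reported above can be computed as in the following sketch. It uses the standard definitions of precision and recall; representing each recognized clause as a (start, end) token span, requiring an exact span match, and the sample spans themselves are all assumptions for illustration, since the abstract does not describe the matching criterion.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted items against gold items.

    precision = correct / number predicted
    recall    = correct / number in the gold standard
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical clause spans as (start, end) token offsets.
gold = {(0, 4), (5, 9), (10, 15)}
predicted = {(0, 4), (5, 8), (10, 15), (16, 20)}

p, r = precision_recall(predicted, gold)
```

Here two of the four predicted spans match the gold standard exactly, so precision is 2/4 and recall is 2/3.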
  • Review
    YU Shi-wen,ZHU Xue-feng,DUAN Hui-ming
    2000, 14(6): 58-64.
    The Institute of Computational Linguistics of Peking University is developing a very large-scale, segmented and richly tagged corpus of contemporary Chinese, based on its own resources, e.g. the Grammatical Knowledge-base of Contemporary Chinese. The tag set contains about 40 tags, including common part-of-speech tags, special usage tags for verbs and adjectives, proper nouns, phrase-type place names, phrase-type organization names, and so on. The corpus comprises about 27 million Chinese characters, of which the Institute has completed processing of 14 million, with very high processing quality. Obtaining a high-quality tagged corpus requires a complete corpus processing guideline. This paper introduces the principles used in drawing up the guideline and the experience gained in carrying it out.