Journal of Chinese Information Processing

Review

Language Independent Text Categorization

HUANG Xuan-jing,WU Li-de,Ishizaki Hiroyuki,XU Guo-wei

2000, 14(6): 1-7.

Text categorization is the task of assigning predefined category labels to new documents. This paper proposes a language-independent text categorization model based on machine learning and describes its feature extraction, classifier, and evaluation method in detail. The model has been implemented on two news corpora, one Chinese and one Japanese, and achieves satisfactory categorization effectiveness.

Review

An Improved Approach to Weighting Terms in Text

LU Song,LI Xiao-li,BAI Shuo,WANG Shi

2000, 14(6): 8-13,20.

Text representation is a fundamental problem in information retrieval tasks such as text retrieval, automatic summarization, and search engines. tf.idf (term frequency, inverse document frequency) is a popular term-weighting scheme in the Vector Space Model and yields good results in information retrieval. The proportion in which terms are distributed across a text collection is one of the most important factors in expressing the content of a text, but tf.idf cannot capture it. This paper therefore proposes an improved approach, tf.idf.IG, which remedies this defect using Information Gain from information theory. Used as one factor in the term-weighting scheme, the Information Gain of a term effectively reflects the proportion of its distribution. In text classification, tf.idf.IG outperforms the original tf.idf.
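
The abstract does not give the exact tf.idf.IG formula, so the following is only a minimal sketch under one plausible reading: each term's tf.idf weight is multiplied by the Information Gain of the term's presence or absence with respect to the class labels. The function names, the natural-log idf, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """IG of a term's presence/absence with respect to the class labels."""
    classes = sorted(set(labels))
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def class_entropy(subset):
        return entropy([subset.count(c) for c in classes])

    n = len(docs)
    return (class_entropy(list(labels))
            - (len(with_term) / n) * class_entropy(with_term)
            - (len(without_term) / n) * class_entropy(without_term))


def tf_idf_ig(docs, labels):
    """Assumed form of the weight: tf * idf * IG (not confirmed by the abstract)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    ig = {t: information_gain(t, docs, labels) for t in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) * ig[t] for t in tf})
    return weights


# Toy, hand-made documents (already tokenized) with class labels.
docs = [["ball", "game", "the"], ["game", "team", "the"],
        ["stock", "market", "the"], ["market", "bank", "the"]]
labels = ["sport", "sport", "finance", "finance"]
weights = tf_idf_ig(docs, labels)
```

On this toy data, plain tf.idf would rank "ball" above "game" in the first document because "ball" is rarer, whereas the IG factor favors "game", whose occurrence separates the two classes perfectly.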

Review

The Design and Implementation of WEB-Based Chinese Text Retrieval System Search2000

DU Lin,ZHANG Yi-bo,SUN Yu-fang

2000, 14(6): 14-20.

This paper describes in detail the design and implementation of Search2000, a Web-based Chinese text retrieval system. Compared with traditional full-text retrieval systems, Web-based text retrieval systems have many new properties. Web pages are semi-structured documents connected through hyperlinks. Different Web sites and pages may cover different application domains, so they contain many new words and phrases, such as proper names and domain terminology, which hinder further improvement of query precision. Based on this analysis, a new search scheme has been designed for Search2000 that combines intelligent relevance analysis and scoring with efficient access to the index and knowledge databases, so as to improve query precision and reduce response time.

Review

Build a relation network representation for How-net

ZHOU Qiang,FENG Song-yan

2000, 14(6): 21-27.

In this paper, we introduce a relation network representation for How-net and its implementation method. Through the construction of three tables (a concept table, a feature table, and a relation table) and the bidirectional, multi-angle connections among them, all the information in How-net can be integrated into a relation network. This provides a good foundation for research on information retrieval and knowledge reasoning based on the knowledge in How-net.

Review

Statistics-Based Chinese Chunk Parsing

LIU Fang,ZHAO Tie-jun,YU Hao,YANG Mu-yun,FANG Gao-lin

2000, 14(6): 28-32,39.

Chunk parsing is an effective way to reduce the difficulty of language parsing. This paper proposes a formal description of the characteristics of Chinese chunks. Based on this description, a statistical algorithm is implemented to recognize Chinese chunks at definite levels. Experiments show that the algorithm achieves high accuracy and robustness in shallow parsing of real Chinese texts.

Review

Automatic Extraction of Bilingual Term Lexicon from Parallel Corpora

SUN Le,JIN You-bing,DU Lin,SUN Yu-fang

2000, 14(6): 33-39.

This paper proposes an algorithm for the automatic extraction of a bilingual term lexicon from English-Chinese parallel corpora. The parallel corpora are first aligned by an improved statistical method based on character length, and each side is tagged with part-of-speech categories. The term candidate set is produced by collecting statistics on the nouns and noun phrases of both corpora. Then the translation probability between every English candidate term and its Chinese translation term is calculated. Finally, the Chinese translation of each English term is selected using a threshold value that varies with word frequency. Better performance is obtained in term extraction experiments on real corpora.
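
The final selection step can be illustrated with a rough sketch. The abstract does not say how the translation probability is estimated or how the threshold depends on frequency, so both are assumptions here: the probability is approximated by the relative co-occurrence frequency over aligned sentence pairs, and `threshold_for` is a hypothetical schedule that is stricter for rare terms. The alignment, tagging, and candidate extraction steps are taken as already done.

```python
from collections import Counter


def threshold_for(freq):
    """Hypothetical schedule: rarer English terms must clear a higher bar."""
    return 0.9 if freq < 3 else 0.6


def select_translations(aligned_pairs, threshold=threshold_for):
    """aligned_pairs: list of (english_terms, chinese_terms), one per aligned pair."""
    en_freq = Counter()
    cooc = Counter()
    for en_terms, zh_terms in aligned_pairs:
        for e in set(en_terms):
            en_freq[e] += 1
            for z in set(zh_terms):
                cooc[(e, z)] += 1

    lexicon = {}
    for e, freq in en_freq.items():
        # Relative co-occurrence frequency stands in for translation probability.
        scored = [(count / freq, z) for (e2, z), count in cooc.items() if e2 == e]
        if not scored:
            continue
        prob, best = max(scored)
        if prob >= threshold(freq):
            lexicon[e] = best
    return lexicon


# Toy aligned candidate terms, invented for illustration only.
pairs = [
    (["operating system"], ["操作系统"]),
    (["operating system", "kernel"], ["操作系统", "内核"]),
    (["operating system"], ["操作系统", "文件"]),
    (["kernel"], ["内核"]),
]
lexicon = select_translations(pairs)
```

A frequency-dependent threshold reflects the intuition that probability estimates for rare terms rest on little evidence, so a fixed cutoff would admit too many accidental pairings.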

Review

Translation Pairs Extraction from Unaligned Chinese-English Bilingual Corpora

WANG Bin

2000, 14(6): 40-44,57.

This paper focuses on extracting translation pairs from unaligned Chinese-English bilingual corpora. First, it introduces two methods proposed by Dr. Pascale Fung. Then, we revise the latter to meet the needs of real texts. The experimental results show that our method is effective and can be applied widely in NLP applications such as phrase extraction and bilingual lexicography.

Review

Clustering Of Chinese Adjectives-Nouns Based on Compositional Pairs

WEN Yang,YUAN Chun-fa,HUANG Chang-ning

2000, 14(6): 45-50.

This paper proposes a bidirectional hierarchical clustering algorithm that clusters words of different categories simultaneously. During clustering, the process is interactive and alternating. We construct an objective function based on Minimum Description Length (MDL). To solve the problem caused by sparse data, two concepts, modificatory degree and modificatory distance, are proposed. Application to clustering Chinese adjectives and nouns demonstrates that the algorithm is effective.

Review

Research on English Subordinate Clause Recognition : A Corpus-based Approach

ZHANG Jing,ZHAO Tie-jun,YAO Jian-min,LI Sheng

2000, 14(6): 51-57.

To improve the translation of English complex sentences in an English-Chinese translation system, we present a new corpus-based approach to subordinate clause recognition. Using the part-of-speech tagging information of an English sentence, this approach integrates rule-based and statistical methods to recognize subordinate clauses. The precision and recall of subordinate clause recognition are tested on both closed and open corpora. The closed test yields 92.9% precision and 80% recall, and the open test yields 80.34% precision and 83.93% recall.
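
For reference, precision and recall for a recognition task of this kind are commonly computed by comparing recognized spans against gold-standard spans. The sketch below assumes exact span matching, which the abstract does not specify, and the spans are invented for illustration rather than taken from the paper's data.

```python
def precision_recall(predicted, gold):
    """Precision and recall over exactly matching (start, end) clause spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical token-index spans of subordinate clauses in a few sentences.
predicted_spans = {(0, 4), (6, 9), (12, 15)}
gold_spans = {(0, 4), (6, 9), (10, 15), (17, 20)}
precision, recall = precision_recall(predicted_spans, gold_spans)
```

Here two of the three predicted spans are correct (precision 2/3) and two of the four gold spans are found (recall 1/2).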

Review

The Guideline for Segmentation and Part-Of-Speech Tagging on Very Large Scale Corpus of Contemporary Chinese

YU Shi-wen,ZHU Xue-feng,DUAN Hui-ming

2000, 14(6): 58-64.

The Institute of Computational Linguistics of Peking University is developing a very large-scale corpus of contemporary Chinese that is segmented and richly tagged, based on its own resources, e.g. the Grammatical Knowledge-base of Contemporary Chinese. The tag set contains about 40 tags, including common part-of-speech tags, special usage tags for verbs and adjectives, proper nouns, phrase-type place names, phrase-type organization names, and so on. The corpus comprises about 27 million Chinese characters. The Institute has completed processing of 14 million characters, and the processing quality is very high. A complete guideline for corpus processing is necessary to obtain a high-quality tagged corpus. This paper introduces the principles behind the guideline and the experience gained in carrying it out.

2000 Volume 14 Issue 6 Published: 15 December 2000