Journal of Chinese Information Processing

Review

Building English-Chinese Statistical Translation Models from Semi-structured Parallel Texts

NIE Jian-yun,CHEN Jiang

2001, 15(1): 1-12.

A statistical translation model tries to capture translation relationships from a set of parallel texts (or translation examples). This paper describes our attempt to train such translation models from a set of semi-structured parallel texts in Chinese and English. These texts are gathered from the Web by an automatic mining tool, PTMiner. Our work takes advantage of the HTML structure of the texts. Some special processing is necessary for Chinese. Our experiments show that the trained model achieves a translation precision of about 80%. This performance is reasonable for less critical tasks such as cross-language information retrieval. This work shows that it is possible to construct a means of query translation at a much lower cost than a machine translation system.

Review

Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing

ZHAO Tie-jun,LV Ya-juan,YU Hao,YANG Mu-yun,LIU Fang

2001, 15(1): 13-18.

Automatic word segmentation of Chinese sentences is difficult when the processing mechanism faces large-scale real texts. The two crucial issues in Chinese segmentation are the identification of unknown words and the disambiguation of segmentation strings. This paper describes a strategy based on multi-step processing that reduces these difficulties and improves segmentation accuracy. The processing comprises seven steps: disambiguation of pseudo-ambiguities, full segmentation of a sentence, determinate segmentation for some words, processing of numeral strings, processing of reduplicated words, statistical identification of unknown words, and final correction of segmentation ambiguities using part-of-speech information, which is integrated in the tagger. The output of this procedure is promising, with above 98% accuracy in an open test.

Review

Using Genetic Algorithms for Optimizing Part-of-Speech Tagset

SUN Hong-lin,LU Qin,YU Shi-wen

2001, 15(1): 19-27.

In the past, POS tagset selection was done mainly by hand by experts drawing on human knowledge, since there was no automatic or semi-automatic way to assist the selection process. This paper proposes a novel method to search for an optimal POS tagset using genetic algorithms (GA). The experiment shows that GA provides an efficient optimization of the POS tagset and allows parameters to be adjusted according to user requirements. It provides a systematic way to help people make an informed choice when selecting a tagset.

Review

Leveled Unknown Chinese Words Resolution by Dynamic Programming

LV Ya-juan,ZHAO Tie-jun,YANG Mu-yun,YU Hao,LI Sheng

2001, 15(1): 28-33.

Unknown word resolution is a difficult problem for automatic Chinese segmentation. Targeting Chinese personal names, Chinese place names, and transliterated names from other languages, this paper puts forward a leveled unknown word resolution strategy that uses dynamic programming to search for the best path. The method successfully resolves the conflicts that arise among the identification of these different kinds of unknown words. Experiments on a real corpus show that the proposed method achieves high performance.

Review

Chinese Documents Categorization Based on N-gram Information

ZHOU Shui-geng,GUAN Ji-hong,YU Hong-qi,HU Yun-fa

2001, 15(1): 34-39.

Traditional document classifiers are based on keywords in the documents, which requires dictionary support and efficient segmentation procedures. This paper explores the use of N-gram information to categorize Chinese documents, so that classifiers can shed the burden of large dictionaries and complex segmentation procedures and consequently be domain and time independent. Such a Chinese document categorization system is implemented with the kNN classification method. Experimental results show that it achieves performance comparable to other classifiers of the same type.
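
The idea of categorizing unsegmented text by N-grams with kNN can be sketched as follows. This is a minimal illustration only, not the authors' system: the character-bigram features, cosine similarity, similarity-weighted voting, the function names, and the toy training set are all assumptions introduced here.

```python
from collections import Counter
import math

def char_ngrams(text, n=2):
    """Count overlapping character n-grams; no dictionary or segmenter is needed."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(text, training, k=3, n=2):
    """Label `text` by a similarity-weighted vote of its k nearest training documents."""
    query = char_ngrams(text, n)
    scored = sorted(
        ((cosine(query, char_ngrams(doc, n)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]

# Hypothetical toy training set (not data from the paper).
training = [
    ("股票市场今日上涨", "finance"),
    ("银行利率下调", "finance"),
    ("股票基金收益", "finance"),
    ("足球比赛今晚开始", "sports"),
    ("篮球比赛结果", "sports"),
    ("足球联赛冠军", "sports"),
]

print(knn_classify("股票市场利率", training))
print(knn_classify("足球比赛冠军", training))
```

Because the features are raw character sequences, the same code runs unchanged on text from any domain; the choice of n and k would need tuning on real data.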

Review

Mandarin Name Recognition Based on Variable Frame Rate HMM Training

LIU Gang,ZHANG Hong-gang,GUO Jun

2001, 15(1): 40-45.

In speech recognition, HMMs require a large amount of training data, which is impractical in some applications. To solve this problem, this paper presents a variable frame rate (VFR) training method based on a pattern transform with a cosine function. We apply this method to a voice-controlled dialling system. System accuracy increases by 4.2% when training is performed only once. This demonstrates that the method is clearly effective when training data are scarce.

Review

An Adaptive Post-processing Method using Proofreading Information for Chinese Character Recognition

LI Yuan-xiang,LIU Chang-song,DING Xiao-qing

2001, 15(1): 46-52.

Post-processing is a key component of a Chinese character recognition system. Conventional post-processing methods, which rely to a large extent on a statistical language model, cannot track dependencies within an article. They also cannot take the dynamic idiosyncrasies of the recognizer into account. This paper presents a novel adaptive post-processing method that utilizes partly corrected texts. These texts can be used to construct an adaptive language model and to capture the idiosyncrasies of the recognizer, which helps to adjust the candidate set dynamically. The method makes the post-processing of successively recognized documents adaptive. Experiments on about 400,000 Chinese characters show that the proposed method achieves an average error reduction rate of 35.24% compared with the conventional post-processing method. The method can efficiently reduce the workload of large-scale data input and is more practical.

Review

Using ISAPI Filter To Implement the Web Pages Auto Conversion Between Simplified and Traditional Chinese Characters

ZHANG Zhen,ZHANG Zeng-ke

2001, 15(1): 53-58.

This paper puts forward a new method for converting between simplified and traditional Chinese characters. By using an ISAPI filter, Chinese-language pages on a web site can be converted automatically to support browsers on different Chinese systems, without any software or language packages installed on the clients. The idea is implemented on Internet Information Server 4.0 under Windows NT.
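
The character-level step that such a server-side filter would perform can be sketched as below. This is only an illustration under stated assumptions: the four-entry mapping table and the function names are hypothetical, the ISAPI filter itself (a server extension for IIS) is not shown, and the abstract does not say which mapping table or encodings the authors used.

```python
# Toy simplified -> traditional table (hypothetical); a real table has thousands of entries.
S2T = {"汉": "漢", "语": "語", "网": "網", "页": "頁"}

# Build translation tables once so each page is converted in a single pass.
S2T_TABLE = str.maketrans(S2T)
T2S_TABLE = str.maketrans({trad: simp for simp, trad in S2T.items()})

def to_traditional(text):
    """Replace each simplified character that has a table entry; leave the rest unchanged."""
    return text.translate(S2T_TABLE)

def to_simplified(text):
    """Inverse mapping, valid here because this toy table is one-to-one."""
    return text.translate(T2S_TABLE)

print(to_traditional("汉语网页"))
print(to_simplified("漢語網頁"))
```

A per-character table is a simplification: some simplified characters correspond to more than one traditional character (for example 发 maps to 發 or 髮 depending on the word), so a production converter needs word-level context.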

New Progress of the Grammatical Knowledge-base of Contemporary Chinese

YU Shi-wen,ZHU Xue-feng,WANG Hui

2001, 15(1): 59-65.

The Grammatical Knowledge-base of Contemporary Chinese serves as a basic linguistic knowledge-base for Chinese information processing. It passed technical appraisal in Nov. 1995. Through continuous development over the past five years, it has been extended from 50,000 to 73,000 entries, and the classification of these seventy thousand words has been completed. In addition, a new morpheme database has been developed for undefined word recognition. To date, the distinct grammatical descriptions in every class have been carefully checked and corrected, and more than 20 new attributes as well as a large number of examples have been added. As a result, the scale and quality of the whole knowledge-base have improved remarkably.

2001 Volume 15 Issue 1 Published: 15 February 2001