2003 Volume 17 Issue 3 Published: 16 June 2003
  

  • MENG Yao,LI Sheng,ZHAO Tie-jun,CAO Hai-long
    2003, 17(3): 2-9.
    Choosing the statistical model is the key problem in statistical parsing, since the model lies at the core of an NLP parser. This paper investigates four primary statistical parsing models, namely PCFG, the history-based model, the cascaded parsing model, and the head-driven parsing model, and compares their performance on a 10,000-sentence Chinese treebank. An analysis based on the experimental results is presented. This comparative study can be exploited to build a practical and effective Chinese parser.
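As context for the models compared above: a PCFG scores a parse by the product of its rule probabilities, and the best parse can be found with the CKY algorithm. A minimal sketch over a toy grammar in Chomsky normal form (all rules, words, and probabilities are invented for illustration, not taken from the paper):

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form; rules and probabilities are invented.
# Binary rules: (parent, left, right) -> prob; lexical rules: (tag, word) -> prob.
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
lexical = {("NP", "she"): 0.6, ("NP", "fish"): 0.4, ("V", "eats"): 1.0}

def cky(words):
    """Return the probability of the best parse rooted in S."""
    n = len(words)
    # best[(i, j, A)] = probability of the best A spanning words[i:j]
    best = defaultdict(float)
    for i, w in enumerate(words):
        for (tag, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, tag)] = max(best[(i, i + 1, tag)], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (a, b, c), p in binary.items():
                    cand = p * best[(i, k, b)] * best[(k, j, c)]
                    if cand > best[(i, j, a)]:
                        best[(i, j, a)] = cand
    return best[(0, n, "S")]

prob = cky(["she", "eats", "fish"])  # 1.0 * 0.6 * (1.0 * 1.0 * 0.4) = 0.24
```

The other three models in the comparison extend this basic scheme with richer conditioning context (history, cascaded levels, or lexical heads).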
  • LUO Sheng-fen,SUN Mao-song
    2003, 17(3): 10-15.
    Word extraction is one of the important tasks in text information processing. A conventional scheme for word extraction is to estimate the soundness of a candidate character string being a word by the internal associative strength among the characters involved. In this paper, the authors first test the individual performance of nine widely adopted statistical measures of this kind in Chinese word extraction, then explore the possibility of improving performance by properly combining these measures. A genetic algorithm is employed to automatically adjust the combination weights. Experiments on two-character Chinese word extraction show that mutual information is the most powerful of these measures, achieving an F-measure of 54.77%, while the effect of combination is not significant, achieving an F-measure of only 55.47%. This suggests that these measures do not complement each other well, and that the simplest and most effective approach to Chinese word extraction is to use mutual information directly.
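For reference, the mutual information measure found most effective above can be sketched as pointwise mutual information between the two characters of a candidate string; the corpus string below is invented for illustration:

```python
import math
from collections import Counter

def mutual_information(text):
    """Score each two-character string in `text` by pointwise mutual
    information: PMI(xy) = log2( p(xy) / (p(x) * p(y)) ).
    Higher PMI suggests the pair is more strongly associated,
    i.e. more likely to be a word."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for bg, f in bigrams.items():
        p_xy = f / n_bigrams
        p_x = chars[bg[0]] / n_chars
        p_y = chars[bg[1]] / n_chars
        scores[bg] = math.log2(p_xy / (p_x * p_y))
    return scores

scores = mutual_information("我们我们他们我们")
# "我们" always co-occurs, so it scores higher than the chance
# pairing "们我" formed across word boundaries
```

In practice the counts come from a large corpus, and candidates above a tuned PMI threshold are proposed as words.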
  • CHEN Yu-zhong,LI Bao-li,YU Shi-wen
    2003, 17(3): 16-21,66.
    Word segmentation for Tibetan has not yet been well studied. This paper reports a Tibetan word segmentation system that we designed and implemented. Several aspects of the system are explained, including the system architecture, knowledge bases, segmentation strategy, and algorithms. In preliminary experiments, the system demonstrates high accuracy and domain independence.
  • SUN Xue-gang,CHEN Qun-xiu,MA Liang
    2003, 17(3): 22-27.
    With the ceaseless growth and rapid change of information on the Web, it has become difficult to manage vast amounts of electronic data through traditional manual methods. Web clustering can automatically classify documents and help us discover new information. Considering the complexity of Web documents, we propose a method of feature re-selection and document re-clustering that achieves good Web clustering performance.
  • ZHANG Min,MA Qing,MA Shao-ping
    2003, 17(3): 28-34.
    In this paper, we introduce a self-organizing Chinese semantic map, then study and propose six different feature encoding approaches, which are crucial to the performance of a SOM. The approaches are based on set theory, algebra, and probability theory respectively. We conclude from the evaluation results that the method combining the frequency density approach and the TFIDF approach performs best, with 94.4% precision and 90.7% recall on semantic mapping, and that vector-space-oriented approaches are not suitable for the task. Analyses of the results are also given. Comparative experiments show that the best approach in this paper outperforms the conventional hierarchical clustering technique, and is much better than multivariate statistical analyses such as principal component analysis for dimension-reduction-based feature encoding.
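The TFIDF component of the best-performing encoding can be sketched as follows; the toy documents are invented for illustration, and the paper's exact weighting and the frequency density component may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode each document (a list of terms) as a term -> weight map,
    with weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter()                    # in how many documents each term occurs
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

vecs = tfidf_vectors([["语言", "模型"],
                      ["语言", "聚类"],
                      ["语言", "聚类", "聚类"]])
# a term appearing in every document gets weight 0 (log(N/N) = 0),
# so only discriminative terms survive into the SOM input encoding
```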
  • CHEN Bo-xing,DU Li-min
    2003, 17(3): 35-41.
    Machine translation units extracted from bilingual corpora can cover natural language text more fully. This paper describes an algorithm for obtaining machine translation units by learning, from two bilingual sentence pairs, the Similarity and Difference parts that are not all high-frequency function words, and aligning those parts using a translation lexicon and a dynamic programming approach. A bilingual chunk similarity score filter and a part-of-speech similarity score filter are then used to test whether the meaning and syntax of the source part of a machine translation unit correspond to its target part; finally, a begin-and-end stopword filter is applied to check whether the collocations of the machine translation units are correct. We obtain 86% precision and 61.34% recall. This algorithm provides a new practical approach to obtaining machine translation units.
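One simple way to picture extracting the shared "Similarity" part of two sentences is a longest common subsequence over tokens, computed by dynamic programming; this is only a rough stand-in for the paper's algorithm, and the example sentences are invented:

```python
def similarity_parts(a, b):
    """Longest common subsequence of two token lists: the shared tokens
    form the 'Similarity' part; everything outside it forms the
    'Difference' parts."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            L[i + 1][j + 1] = (L[i][j] + 1 if a[i] == b[j]
                               else max(L[i][j + 1], L[i + 1][j]))
    # backtrack to recover the shared tokens
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1] and L[i][j] == L[i - 1][j - 1] + 1:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

shared = similarity_parts("I like green tea".split(),
                          "I like black tea".split())
# → ['I', 'like', 'tea']; 'green'/'black' are the Difference parts
```

The paper additionally aligns the source- and target-side parts with a translation lexicon before applying its filters.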
  • CAO Jian-fen
    2003, 17(3): 42-47.
    Automatic prosody generation is essential in a Chinese TTS system. The main task is to segment the Chinese character sequence into proper speech units and organize them into a prosodic hierarchy. This goal can be achieved by inserting breaks of different strengths and assigning certain degrees of stress within a sentence. This paper discusses how to predict the location and strength of breaks and stresses based on syntactic and lexical information extracted from text analysis. Attention is paid to the following aspects: (1) a brief description of Chinese prosody; (2) text analysis; (3) predictive-tree building; (4) break index prediction; (5) stress index prediction.
  • LIU Yu-yu,WU Ji,WANG Zuo-ying
    2003, 17(3): 48-53.
    How well the HMM observation density describes the actual distribution has a great impact on recognition performance. To compare triphone models under different observation densities for Chinese, three models and their respective training and recognition algorithms are constructed. By comparing these three models from different aspects, a conclusion is drawn that can serve as a basis for the future selection of triphone observation densities for Chinese.
  • WU Bing-ya,ZHOU Chang-le,WU Jie-min
    2003, 17(3): 54-59.
    This paper derives the tonal template of Mandarin discourse by judging the emotional color and colloquial style of every word in the input discourse, using a kind of phraseological attribute grammar and the relevant combination algorithm. The speed and scale of the synthesized speech are adjusted according to the basic pitch and duration values of the syllables corresponding to the tonal template; the naturalness and fluency of the machine-synthesized speech are thereby enhanced, i.e., a quality improvement is realized.
  • ZHANG Xiao-heng
    2003, 17(3): 60-66.
    In Chinese character input, the form-based coding method is an indispensable complement to the Pinyin-based method. The former is preferable in cases where high-speed input is needed, where a large character set is required, where single-character words or words missing from normal dictionaries are abundant, and where unfamiliar or rarely used characters appear more frequently. This paper introduces ZYQ, a stroke-group-based Chinese character input method developed under the guidance of being Correct (with respect to the norms of language education and language application), Easy (with respect to user friendliness and convenience), and Complete (with respect to the Chinese character set covered). ZYQ has a key-selection rate of 16.4%, while the maximum and average code lengths are 5 and 4.315 respectively.