2006 Volume 20 Issue 5 Published: 16 October 2006
  

  • XU Bo,SHI Xiao-dong,LIU Qun,ZONG Cheng-qing,PANG Wei,CHEN Zhen-biao,YANG Zhen-dong,WEI Wei,DU Jin-hua,CHEN Yi-dong,LIU Yang,XIONG De-yi,HOU Hong-xu,HE Zhong-jun
    2006, 20(5): 3-11.
    The Institute of Automation and the Institute of Computing Technology of the Chinese Academy of Sciences, together with the Department of Computer Science of Xiamen University, jointly held the first Statistical Machine Translation Workshop in China from July 13th to 15th, 2005. This paper describes the systems submitted by the participating institutions and analyzes the evaluation results. The results show that although research on statistical machine translation started late in China, it is developing rapidly, and the tested systems achieved quite good results within a short period. Compared with the rule-based systems reported in the formal “863” evaluation, their performance is somewhat lower, but the gap is small. Given the state of the art and the trends in international statistical machine translation research, we believe there is still great room for improvement in statistical machine translation, supported by larger-scale data resources and more powerful hardware. In the near future, phrase-based methods incorporating syntactic and semantic information are likely to become the mainstream of statistical machine translation.
  • FU Jian-lian,CHEN Qun-xiu
    2006, 20(5): 12-18.
    As automatic summarization is an important research topic in natural language processing, this paper presents an approach to Chinese text summarization that builds on traditional methods. For text structure analysis, an algorithm is proposed for multi-topic text partitioning based on sequential paragraph similarity, which makes the abstract of a multi-topic article more comprehensive in content and more balanced in structure. Furthermore, a series of rules is applied to enhance the readability of the output abstract. Finally, a new evaluation method is put forward; preliminary tests show that its value is stable.
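The partitioning step described above can be sketched as follows. The term-frequency vectors, the cosine measure and the similarity threshold are illustrative choices, not the paper's exact formulation:

```python
import re
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def partition(paragraphs, threshold=0.2):
    """Split a paragraph sequence into topic segments wherever the
    similarity between adjacent paragraphs drops below the threshold."""
    vecs = [Counter(re.findall(r"\w+", p.lower())) for p in paragraphs]
    segments, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            segments.append(current)
            current = []
        current.append(paragraphs[i])
    segments.append(current)
    return segments
```

With four paragraphs on two topics, the split falls at the point where adjacent paragraphs share no vocabulary.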
  • LIU Hua
    2006, 20(5): 19-25.
    This paper puts forward a new method for domain new-word detection, which directly extracts the keywords labeled by specialists in web pages and stores them in a classified wordlist according to the column of the source web page. This simple approach can detect and cluster new words quickly. Using it, we extracted 229,237 words from 600 million web pages covering 15 domains, including 175,187 new words, a new-word ratio of 76.42%. Most of the new words are named entities, which have stable structure and integral meaning, and they help resolve ambiguity and unknown words in Chinese word segmentation. They are also useful for text representation tasks such as text categorization and keyword indexing.
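A minimal sketch of the extraction-and-classification idea. It assumes the specialist-labeled keywords appear in a `<meta name="keywords">` tag and that each page's column name is already known; both are assumptions, since the abstract does not specify the page format:

```python
import re
from collections import defaultdict

def extract_keywords(html: str):
    """Pull the specialist-labelled keywords out of a page's meta tag
    (assumed format; real pages would need more robust parsing)."""
    m = re.search(r'<meta\s+name="keywords"\s+content="([^"]*)"', html, re.I)
    return [k.strip() for k in m.group(1).split(",")] if m else []

def classify_by_column(pages):
    """pages: iterable of (column_name, html) pairs. Returns a classified
    wordlist mapping each column to the set of keywords found in it."""
    wordlist = defaultdict(set)
    for column, html in pages:
        wordlist[column].update(extract_keywords(html))
    return wordlist
```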
  • WANG Si-li,ZHANG Hua-ping,WANG Bin
    2006, 20(5): 26-32.
    This paper proposes an improved strategy for the Double-Array Trie algorithm: when constructing the array, the node with the most child nodes is processed first. This strategy reduces data sparseness while maintaining search efficiency. We implement a lexicon management program based on the improved Double-Array Trie and compare it with other index mechanisms. The results clearly show that the improved Double-Array Trie algorithm achieves a much higher search speed and needs less space for data storage than the other index mechanisms.
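A toy double-array trie illustrating the most-children-first ordering: in the `base`/`check` arrays, a transition from node `s` on character `c` goes to `t = base[s] + code(c)` and is valid when `check[t] == s`, so placing dense nodes first tends to fill the array tightly. The array-growth policy and the end-of-word marker are invented for this sketch:

```python
class DoubleArrayTrie:
    """Minimal double-array trie; '\\0' marks end-of-word."""

    def __init__(self, words):
        trie = {}                      # build an ordinary nested-dict trie first
        for w in words:
            node = trie
            for ch in w + "\0":
                node = node.setdefault(ch, {})
        self.base = [0] * 16
        self.check = [-1] * 16
        self.check[0] = 0              # root lives at index 0
        queue = [(0, trie)]
        while queue:
            # improved strategy: place the node with the most children first
            queue.sort(key=lambda item: -len(item[1]))
            idx, node = queue.pop(0)
            if not node:
                continue
            codes = [ord(c) + 1 for c in node]   # shift so code 0 is unused
            b = 1
            while not self._fits(b, codes):      # find a base with all slots free
                b += 1
            self.base[idx] = b
            for ch, child in node.items():
                t = b + ord(ch) + 1
                self.check[t] = idx
                queue.append((t, child))

    def _fits(self, b, codes):
        self._ensure(b + max(codes))
        return all(self.check[b + c] == -1 for c in codes)

    def _ensure(self, n):
        grow = n + 1 - len(self.check)
        if grow > 0:
            self.base.extend([0] * grow)
            self.check.extend([-1] * grow)

    def __contains__(self, word):
        idx = 0
        for ch in word + "\0":
            t = self.base[idx] + ord(ch) + 1
            if t >= len(self.check) or self.check[t] != idx:
                return False
            idx = t
        return True
```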
  • LI Jiang-bo,ZHOU Qiang,CHEN Zu-shun
    2006, 20(5): 33-41.
    The dictionary mechanism is one of the basic components of Chinese information processing systems, and its performance significantly influences theirs. In this paper, we first review algorithms for Chinese dictionary lookup, then design and implement a Chinese dictionary based on the Double-Array TRIE mechanism, and present a new Chinese dictionary based on a Double Coding mechanism. Finally, we experimentally compare their space and time complexity with the binary-seek-by-characters mechanism. The results show that the Chinese dictionary based on the Double-Array TRIE mechanism improves lookup speed markedly.
  • REN He,ZENG Jun-fang
    2006, 20(5): 42-45,92.
    Aiming to extend the dictionary for word segmentation and thereby improve its accuracy, this paper presents a high-frequency Chinese word extraction algorithm based on information entropy. We first transform noisy words and characters into separators, so that a text can be viewed as a collection of Chinese strings isolated by separators. We then compute the frequencies of all substrings of these strings. Finally, we judge whether each substring is a word by computing its information entropy. Preliminary experiments show that this simple algorithm is effective in extracting high-frequency Chinese words, with an acceptance rate of up to 91.68%.
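The three steps (separate, count, judge by entropy) might look like the sketch below. Treating every non-CJK character as a separator and judging wordhood by the entropy of left and right neighbouring characters are assumptions: the abstract says only that each substring's information entropy is computed, and the thresholds here are invented:

```python
import re
from collections import Counter
from math import log2

def entropy(counter: Counter) -> float:
    """Shannon entropy of a character-frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counter.values())

def extract_words(text, max_len=4, min_freq=3, min_entropy=1.0):
    """Treat non-Chinese characters as separators, count all substrings of
    the remaining runs, and accept a substring as a word when both its
    left- and right-neighbour entropies are high enough."""
    runs = re.findall(r"[\u4e00-\u9fff]+", text)
    freq, left, right = Counter(), {}, {}
    for run in runs:
        n = len(run)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                sub = run[i:j]
                freq[sub] += 1
                if i > 0:
                    left.setdefault(sub, Counter())[run[i - 1]] += 1
                if j < n:
                    right.setdefault(sub, Counter())[run[j]] += 1
    words = []
    for sub, f in freq.items():
        if len(sub) < 2 or f < min_freq:
            continue
        if (entropy(left.get(sub, Counter())) >= min_entropy and
                entropy(right.get(sub, Counter())) >= min_entropy):
            words.append(sub)
    return words
```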
  • LI Zhong-guo,LIU Ying
    2006, 20(5): 46-52.
    In this paper an effective algorithm for Chinese person name recognition is proposed. The words at the left and right boundaries of person names and the character frequencies within person names are extracted from a tagged corpus and used as the knowledge for recognition. First, the boundary templates are used to find possible person names. Then the recognized names are matched against the rest of the text to recover missed occurrences. Finally, the local frequencies obtained from the whole text are used to check and correct the name boundaries. The time complexity of the algorithm is linear, and the test on 1,354 news articles (with 3.04 million Chinese characters and 37,014 Chinese names in all) gives a precision of 94.52% and a recall of 98.97%, which is fairly satisfactory in comparison with other published algorithms.
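A toy version of the two-pass idea. The boundary wordlists, surname set and name-length heuristic below are invented stand-ins for the statistics the authors learn from the tagged corpus:

```python
import re

# Hypothetical resources; the paper learns these from a tagged corpus.
LEFT_BOUNDARIES = {"记者", "总理"}     # words that often precede a name
RIGHT_BOUNDARIES = {"表示", "说"}      # words that often follow a name
SURNAMES = {"王", "李", "张"}

def recognize_names(text):
    """Pass 1: boundary templates plus a surname character propose candidate
    names. Pass 2: every candidate is re-matched over the whole text to
    recover occurrences the templates missed."""
    candidates = set()
    for left in LEFT_BOUNDARIES:
        for m in re.finditer(re.escape(left), text):
            start = m.end()
            if start < len(text) and text[start] in SURNAMES:
                for length in (3, 2):          # try 3- then 2-char names
                    name = text[start:start + length]
                    if len(name) == length and any(
                            (name + r) in text for r in RIGHT_BOUNDARIES):
                        candidates.add(name)
                        break
    return {name: text.count(name) for name in candidates}
```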
  • LI Li-shuang,HUANG De-gen,CHEN Chun-rong,YANG Yuan-sheng
    2006, 20(5): 53-59.
    By analyzing the characteristics of place names in Chinese texts, a method for automatic recognition of Chinese place names is presented which combines support vector machines (SVMs) with rules. First, character-based feature vectors are extracted and converted into binary vectors; a training set is established, and the machine learning models for automatic identification of Chinese place names are obtained using polynomial kernel functions. Then, through careful error analysis, a rule base is constructed, and a post-processing step based on it is applied to overcome the low recall of the machine learning model. The results show that the method is effective for identifying Chinese place names: in the open test, the recall, precision and F-measure reach 89.57%, 93.52% and 91.50% respectively.
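A sketch of the hybrid pipeline. A perceptron stands in for the SVM so the example stays dependency-free (an assumption: any linear classifier illustrates the same data flow), and the character-window features and suffix rules are invented for illustration:

```python
def features(chars, i):
    """Binary features for position i: a one-hot character window."""
    window = ["<S>" if i + k < 0 else "</S>" if i + k >= len(chars)
              else chars[i + k] for k in (-1, 0, 1)]
    return [f"{k}:{c}" for k, c in zip((-1, 0, 1), window)]

def train(samples, epochs=10):
    """samples: (chars, i, label) triples, label 1 = part of a place name.
    Simple perceptron updates over sparse feature weights."""
    w = {}
    for _ in range(epochs):
        for chars, i, label in samples:
            score = sum(w.get(f, 0.0) for f in features(chars, i))
            if (1 if score > 0 else 0) != label:
                for f in features(chars, i):
                    w[f] = w.get(f, 0.0) + (1 if label else -1)
    return w

def predict(w, chars, i, rules=("市", "省")):
    """Model decision plus rule-based post-processing: characters matching a
    known place suffix are accepted even when the model score is low."""
    if chars[i] in rules:                       # the rule-base fallback
        return 1
    score = sum(w.get(f, 0.0) for f in features(chars, i))
    return 1 if score > 0 else 0
```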
  • ZHAO Wei,LI Chun-di,LIU Jia-feng,TANG Xiang-long
    2006, 20(5): 60-66.
    This paper introduces a handwritten Chinese character radical set established for research on continuous handwritten character sequence recognition. Based on this set, radical-based splitting and coding was carried out for 6,763 Chinese characters. The statistics show that the distribution of the number of Chinese characters with regard to the number of radicals fits a logarithmic normal distribution. Furthermore, the composing power of handwritten radicals is analyzed and discussed from the viewpoints of statistics and character recognition technique. Finally, two radical recognizers were built for testing: a radical recognition rate of 70.21% was obtained on single Chinese characters, while the rate on continuous handwritten character sequences was 58.49%. The results show that the radical set accords with the characteristics of Chinese characters and on-line handwritten text.
  • MA Rui,YANG Jing-yu
    2006, 20(5): 67-72.
    Aiming at the misrecognitions caused by overfitting in conventional elastic matching for handwritten character recognition, a deformable elastic matching approach based on high-order statistics is proposed in this paper. Exploiting the handwriting shape variations captured by high-order statistics, the intrinsic deformations within each character class are extracted from the observed deformations by independent component analysis and applied to the deformable model, so that any deformation of a class can be described as a weighted linear combination of the independent components. In this model the prototype character is deformed gradually to move closer to the input character. In experiments, higher recognition rates are obtained, with an average rate of up to 92.81%, which shows that the proposed approach is effective for handwritten character recognition.
  • GUO Wu,WU Yi-jian
    2006, 20(5): 73-78.
    Corpus-based concatenative speech synthesis has become popular for its high-quality speech. However, the quality of concatenated speech often suffers from discontinuities between acoustic units, due to contextual differences and variations in speaking style across the database, especially between voiced units. In this paper, we propose a smoothing method called time-domain unit fusion (TD-UF) to smooth the discontinuities between voiced units. In the proposed method, an appropriate fusion unit, i.e. a transition template, is obtained by periodic matching in the time domain, and the fusion between the concatenated unit and the fusion unit is then performed in the time domain by TD-PSOLA. Spectral and perceptual comparisons between smoothed and un-smoothed data show that the method has a distinct smoothing effect on speech quality and is highly efficient, since it operates entirely in the time domain.
  • GU Ming-liang,SHEN Zhao-yong
    2006, 20(5): 79-84.
    This paper first discusses the criteria for distinguishing Chinese dialects and the selection of basic features. Following these principles, a novel feature named the district differential cepstral feature is proposed. A novel dialect identification system combining a GMM tokenizer, N-gram language models and an ANN is then constructed. Compared with traditional LID systems, the new system has the following characteristics: first, it needs no tagged dialect speech database, which makes corpus building less labour-intensive; second, the GMM tokenizer is more computationally efficient; third, the system is more accurate and robust. In a Chinese dialect classification test, an average accuracy of 83.8% is achieved.
  • XIE Qian,RUI Jian-wu,WU Jian
    2006, 20(5): 85-92.
    Coded character set standards are the basis of computer text information processing. In this paper, a 3-tuple model is proposed to describe coded character sets. The existing code standards are reviewed and summarized, and ISO 2022 and its derived standards are analyzed in detail, including the limitations of ISO 2022 in multilingual environments. The necessity of founding the UCS (Universal Character Set) is presented, along with an outline analysis of the UCS. After evaluating current classification methods for coded character set standards, a new method is proposed and applied to cataloguing the existing standards. We close with a brief analysis of the important Chinese national standards on Han character sets.
  • LI Pei-feng,ZHU Qiao-ming,QIAN Pei-de
    2006, 20(5): 93-98.
    With the trend of unifying all native character encoding schemes in computers, ISO has published the international standard ISO/IEC 10646 to meet that tide. In this paper, we first analyze the limitations of existing Chinese character input methods. Almost all of them are based on ANSI codes such as GB2312, GBK and BIG-5, and they have many shortcomings, including inconvenient transfer and the lack of support for cross-lingual platforms. We then propose a model for a Chinese character input method based on ISO/IEC 10646, which consists of six parts: input method management, a code mapping table based on ISO/IEC 10646, code searching and filtering, the interface with the OS, the input method kernel, and the localization interface. Finally, we discuss the design and processing techniques for the input-code-to-Chinese-character mapping table, a key component of the proposed model.
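The code mapping table and the searching/filtering components might be sketched as follows. The tiny pinyin table and the prefix-based filtering are illustrative assumptions; a real table would cover the full ISO/IEC 10646 repertoire:

```python
# Hypothetical code mapping table: input codes -> ISO/IEC 10646 code points.
MAPPING = {
    "zhong": ["\u4e2d", "\u949f"],   # 中, 钟
    "guo":   ["\u56fd", "\u679c"],   # 国, 果
}

def lookup(prefix):
    """Code searching and filtering: return every candidate character whose
    input code starts with the prefix typed so far, in code order."""
    out = []
    for code, chars in sorted(MAPPING.items()):
        if code.startswith(prefix):
            out.extend(chars)
    return out
```

Because the table's values are Unicode code points rather than bytes in a national encoding, the same table serves any locale, which is the cross-platform advantage the paper attributes to an ISO/IEC 10646-based design.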
  • SUN Ji-shou
    2006, 20(5): 99-106.
    The merits and demerits of a character code must be evaluated scientifically; the relaxation and the speed potential of the coding rule are two key indicators. This paper is divided into four parts. The first part introduces the concepts of simplification, standardization, ease of learning and relaxation, and explains why relaxation was chosen as the topic. The second part explains the intrinsic factors that lead to relaxation or tenseness, based on concrete examples, and proposes an experimental draft for relaxation evaluation. The third part analyses the present situation of character input systems with a universal keyboard, arguing that a character code should be inspected by separating the coding level from the software level, and that the coding level should test the relaxation and the speed potential of the coding rule. The last part analyses the relationship between the average excursion and the speed potential in both practice and theory: the smaller the average excursion, the greater the speed potential. Finally, a parameter indicator reflecting the speed potential is also proposed.