Journal of Chinese Information Processing

A Method of Identifying the Predicate Head Based on the Correspondence Between the Subject and the Predicate

LI Guo-chen, MENG Jing

2005, 19(1): 2-8,42.

Identifying the predicate head plays a very important role in sentence parsing. Traditional approaches rely on the static and dynamic grammatical features of the candidate predicate heads. Building on this, the paper proposes a method that identifies the predicate head using not only these static and dynamic grammatical features but also the syntactic relations between the subject and the predicate. Experimental results show that, compared with the traditional methods, the proposed method improves the precision of predicate head identification by about 3 percent.

An Improved Cache-based Adaptive Chinese Language Model

ZHANG Jun-lin, SUN Le, SUN Yu-fang

2005, 19(1): 9-14.

Although n-gram language models have proved powerful and robust in various tasks, the Markov assumption limits their dependencies to a very short local context. Cache-based language models adapt well to cross-domain environments, but the hypothesis behind them is too simple: a word that has already been used is assumed to reappear often in the same document. We extend this model by introducing a Chinese concept lexicon, so that the cache of the extended model contains not only recently occurring words but also semantically related words. Experiments show that the adaptive model performs considerably better, with perplexity almost 40.1% lower than that of the n-gram language model.
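As a rough illustration of the idea in this abstract, the sketch below interpolates a base n-gram probability with a cache estimate built from recent words plus their semantically related words. The `related` mapping, the `related_weight` discount, and the interpolation weight `lam` are assumptions made for illustration; the abstract does not say how the concept lexicon is consulted or how the two models are combined.

```python
from collections import Counter, deque


class ConceptCache:
    """Toy cache of recent words, expanded with related words (hypothetical design)."""

    def __init__(self, related, size=200, related_weight=0.5):
        self.related = related              # word -> related words (assumed lexicon)
        self.history = deque(maxlen=size)   # keeps only the most recent words
        self.related_weight = related_weight

    def add(self, word):
        self.history.append(word)

    def prob(self, word):
        # Count each recent word fully and each related word at a discount.
        counts = Counter()
        for w in self.history:
            counts[w] += 1.0
            for r in self.related.get(w, ()):
                counts[r] += self.related_weight
        total = sum(counts.values())
        return counts[word] / total if total else 0.0


def interpolate(p_ngram, p_cache, lam=0.2):
    """Linear interpolation of the base n-gram and cache probabilities."""
    return lam * p_cache + (1.0 - lam) * p_ngram


cache = ConceptCache({"doctor": ["nurse"]})
for w in ["doctor", "hospital"]:
    cache.add(w)
p_nurse = interpolate(p_ngram=0.001, p_cache=cache.prob("nurse"))
```

Here "nurse" never occurred in the history, yet it receives cache probability because "doctor" did, which is the behaviour the extended cache is meant to provide.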

Zero Anaphora in Chinese and How to Process it in Chinese-English MT

HOU Min, SUN Jian-jun

2005, 19(1): 15-21.

Anaphora is an important means of discourse cohesion, and zero anaphora is a common form of anaphora in Chinese. From a typological viewpoint, Chinese and English differ, so zero anaphora may affect the quality of Chinese-English MT. This paper analyzes in detail the recognition and classification of zero anaphora in Chinese, as well as its causes and the conditions under which it is used. The authors point out that the problem makes the generated target-language (English) sentences ungrammatical. Some algorithms are given that operate at the level of the sentence group.

WANG Rong-bo, CHI Zhe-ru

2005, 19(1): 22-30.

Example-based machine translation (EBMT) is an important branch of machine translation that has been studied extensively for about twenty years, and researchers' efforts have yielded some progress. Sentence similarity measurement is one of the most important problems in EBMT. For Chinese-to-English EBMT, the quality of the similarity measure for Chinese sentences directly affects the final translation of an input sentence. In this paper, we propose a method for measuring the similarity of Chinese sentence structures for example-based Chinese-to-English machine translation. The algorithm performs an optimal matching between the word type sequences of the two sentences being compared. Preliminary experimental results show that the measure works well when tested on a small dataset.
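The abstract does not say which matching algorithm is used. As a stand-in, the sketch below scores two word type (part-of-speech) sequences by their longest common subsequence, which is one simple order-preserving matching; the tag names and the Dice-style normalization are assumptions, not the authors' method.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structure_similarity(tags_a, tags_b):
    """Dice-style score in [0, 1] from the best order-preserving tag alignment."""
    if not tags_a and not tags_b:
        return 1.0
    return 2.0 * lcs_length(tags_a, tags_b) / (len(tags_a) + len(tags_b))


# Hypothetical word type sequences for an input sentence and a stored example.
s1 = ["r", "v", "n"]
s2 = ["r", "d", "v", "n"]
score = structure_similarity(s1, s2)
```

In an EBMT setting, a score like this could rank stored examples so that the structurally closest one is retrieved for the input sentence.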

Chinese WSD Based on Selecting the Best Seeds from Collocations

QUAN Chang-qin, HE Ting-ting, JI Dong-hong, LIU Hui

2005, 19(1): 31-36.

The key problem of word sense disambiguation based on statistical models lies in how to acquire word sense indicators automatically. Although it is feasible to acquire a large number of collocations by learning from examples, it is hard to manually select good seeds that effectively yield new collocations. This paper presents a method that selects the best seeds by machine learning to solve this problem. The best seeds are used to acquire more new word sense indicators, and polysemous words are then disambiguated with the acquired indicators. With this method, the average accuracy over 8 polysemous words is 87.7%.

Research and Implementation of Text Classification System Based on VSP

CHEN Zhi-gang, HE Pi-lian, SUN Yue-heng, ZHENG Xiao-shen

2005, 19(1): 37-42.

Text classification is an important research task in natural language processing; it can efficiently resolve information chaos and help locate the required information. Traditional approaches to text classification commonly extract feature terms using a single test criterion, which leads to the problem of “over fitting”. This paper takes several test criteria into account together, such as frequency, distribution, and concentration, proposes a new feature extraction algorithm, and implements a text classification system with a two-level mode. Experimental results show that the two-level classification mode achieves higher precision and recall than the center classification method.

Strategy Performance Evaluation of IR Based on Cloud Model

KANG Hai-yan, LI Yan-fang, LIN Pei-guang, FAN Xiao-zhong

2005, 19(1): 43-48.

At present, the most popular methods of strategy evaluation in information retrieval systems cannot reflect stability and randomness, so these traditional methods are not comprehensive enough. This research presents a new strategy evaluation method based on the cloud model. The method reflects not only the average performance of a strategy but also its stability and randomness. It establishes a transformation between qualitative concepts and quantitative values, carried out through strict mathematical means. Experimental data show that the method is practical and that its evaluation results are more accurate and closer to the facts. The method is also suitable for strategy evaluation in text classification.
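To make the three quantities concrete, the sketch below applies the commonly described backward normal cloud generator to a list of per-query scores, estimating the expectation `Ex`, entropy `En`, and hyper-entropy `He`. Reading `Ex` as average performance, `En` as stability, and `He` as randomness is an interpretation of the abstract, and the scores are hypothetical; the abstract does not state which generator or which retrieval metric is used.

```python
import math


def backward_cloud(samples):
    """Estimate (Ex, En, He) of a normal cloud from observed scores."""
    n = len(samples)
    ex = sum(samples) / n
    # En from the first-order absolute central moment.
    en = math.sqrt(math.pi / 2.0) * sum(abs(x - ex) for x in samples) / n
    # Sample variance with an (n - 1) denominator.
    var = sum((x - ex) ** 2 for x in samples) / (n - 1)
    # He is the spread not explained by En; clamp at zero for small samples.
    he = math.sqrt(max(var - en ** 2, 0.0))
    return ex, en, he


# Hypothetical precision scores of one retrieval strategy over three queries.
ex, en, he = backward_cloud([0.6, 0.7, 0.8])
```

Two strategies with the same `Ex` can then be told apart by `En` and `He`: the one with the smaller values behaves more consistently across queries.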

The Mechanism for Information Recommendation Based on Content and Collaboration

LIN Hong-fei, YANG Zhi-hao, ZHAO Jing

2005, 19(1): 49-56.

The Internet has become an important tool and medium for knowledge acquisition, and recommending information related to users' interests is a hot topic. This paper presents a new mechanism of information recommendation based on content and collaboration. It divides users into several content classes based on related texts and, at the same time, classifies them into collaborative classes based on users' annotations. It then assumes that both the content classes and the collaborative classes can influence users' interests, and applies a discriminating approach to integrate the two kinds of classes into a similarity formula that serves as the basis of information recommendation. In addition, it can automatically adjust the system parameters to adapt to the external environment, system load, and response speed.

Natural Language Watermarking

ZHANG Yu, LIU Ting, CHEN Yi-heng, ZHAO Shi-qi, LI Sheng

2005, 19(1): 57-63,71.

This paper proposes natural language watermarking, a new technique based on natural language processing and a novel technique for information hiding. With this technique, the meaning of the original text is not changed after the hidden information (the watermark message) is embedded in it. First, the paper presents the concept, characteristics, and adversary model of natural language watermarking, and surveys related research in this field. The method is more flexible than traditional methods, and the watermark cannot be damaged under moderate attacks. Second, the design of the watermarking system is described in detail, including the theory of quadratic residues, which is the theoretical basis of the method. Finally, two marking schemes are described in detail: the syntactic watermarking approach and the semantic watermarking approach.
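For readers unfamiliar with quadratic residues, the sketch below tests residuosity with Euler's criterion and reads one bit from an integer. How the paper maps a sentence to an integer and which modulus it uses are not given in the abstract, so `bit_from_value` and the prime `p = 11` are purely illustrative assumptions.

```python
def is_quadratic_residue(a, p):
    """Euler's criterion for an odd prime p and a not divisible by p:
    a is a quadratic residue mod p iff a^((p-1)/2) is 1 mod p."""
    a %= p
    if a == 0:
        raise ValueError("a must not be divisible by p")
    return pow(a, (p - 1) // 2, p) == 1


def bit_from_value(value, p):
    """Hypothetical decoding rule: 1 if value is a residue mod p, else 0."""
    return 1 if is_quadratic_residue(value, p) else 0


p = 11  # small odd prime chosen only for the example
residues = sorted(a for a in range(1, p) if is_quadratic_residue(a, p))
```

Under a rule like this, an embedder would rewrite a sentence (syntactically or semantically) until the integer derived from it encodes the desired bit, leaving the meaning unchanged.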

Merge Information in HowNet and TongYiCi CiLin

MEI Li-jun, ZHOU Qiang, ZANG Lu, CHEN Zu-shun

2005, 19(1): 64-71.

In this paper, we study the problem of merging information in HowNet and a Chinese thesaurus, TongYiCi CiLin. To integrate both the concept descriptions of words in HowNet and the semantic categories of words in TongYiCi CiLin, we propose several merging strategies. First, we establish a DEF description for each SynSet in TongYiCi CiLin, similar to the word sense definition in HowNet. Then, we create bidirectional links for words that have only one sense in both dictionaries. Finally, we create bidirectional links for the remaining words with multiple senses, using a classification algorithm based on the salient frequency and vector distance of two sense descriptions. Experimental results show that these merging strategies are effective, with a merging accuracy of about 93%. The merged results form a new dictionary that has both the semantic categories of TongYiCi CiLin and the concept descriptions of HowNet.

A Statistically Study on the Qualities of All Modern Tibetan Character Set

GAO Ding-guo, GONG Yu-chang

2005, 19(1): 72-76.

A study of the basic qualities of the Tibetan language forms the basis for Tibetan information processing, and the study of modern Tibetan characters is an important part of developing it. The set of all modern Tibetan characters is finite, which makes it useful for researching modern Tibetan characters. This thesis is concerned with modern Tibetan characters and with how to compute, according to Tibetan grammar rules and by computer, the following: the total number of characters, character length, structural mode, quality of position, letter frequency, and the entire character set. The thesis also summarizes these figures. It uses modern Tibetan language analysis to better understand the nature of the language, thus offering a basic understanding for the study of the Tibetan language and Tibetan information processing.

The Disambiguation Strategy of Semantic Analysis in Spoken Dialogue Systems

LIU Bei, DU Li-min

2005, 19(1): 77-84.

Frame semantic analysis is one of the most commonly used semantic analysis methods in research on Chinese spoken dialogue systems. The two typical ambiguous structures encountered in semantic analysis are outer ambiguity and structural ambiguity. According to the features of these two structures, this paper puts forward a disambiguation strategy based on a semantic PCFG model to resolve structural ambiguity, and a disambiguation strategy integrated with a semantic Expectation Model (EM) to resolve outer ambiguity. Efficient algorithms for both methods are also provided. Experimental results show that using the two disambiguation methods together gives the greatest improvement in the performance of the understanding module of the baseline system. Sentence accuracy improves from 75.7% to 91.5%, and the three semantic understanding measures (correctness, recall, and precision) also improve by about 10% on average.

Polynomial Regression Model for Duration Prediction in Mandarin

SUN Lu, HU Yu, WANG Ren-hua

2005, 19(1): 85-91.

Duration information is an essential part of speech prosody and plays a critical role in improving the naturalness and understandability of synthesized speech. Duration modeling establishes a mapping between the prosodic environment and the final duration produced in natural speech. In this paper, we first study the effect of prosodic features on segmental duration by introducing a statistical concept, eta squared; we then choose the more influential prosodic features and design an algorithm to quantify the interactions among them; finally, we propose determining the duration model with a polynomial equation and obtain its coefficients through non-linear regression. Our research indicates that 5 or 6 prosodic features are by and large enough for a close and accurate mapping between the prosodic environment and perceived duration. Compared with the Wagon tree method, this method has clear merits.
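Eta squared is the share of total variance explained by a grouping factor, that is, the between-group sum of squares divided by the total sum of squares. The sketch below computes it for one prosodic feature; the feature values and durations are invented for illustration and do not come from the paper.

```python
def eta_squared(groups):
    """groups maps each feature value to a non-empty list of durations."""
    values = [d for ds in groups.values() for d in ds]
    grand = sum(values) / len(values)
    ss_total = sum((d - grand) ** 2 for d in values)
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        len(ds) * (sum(ds) / len(ds) - grand) ** 2 for ds in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical syllable durations in milliseconds, grouped by phrase position.
durations_ms = {
    "phrase_final": [200.0, 220.0],
    "phrase_medial": [100.0, 120.0],
}
effect = eta_squared(durations_ms)
```

A value near 1 means the feature accounts for most of the variation in duration, so features could be ranked by this score before fitting the regression model.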

Segmentation of Touching Chinese Character Based on Convex Hull Ratio Feature

WEI Xiang-hui, MA Shao-ping

2005, 19(1): 92-98.

The accuracy of segmenting Chinese characters, especially touching characters, is essential to the performance of a Chinese character recognition system. The paper applies a background-thinning algorithm to segment two-touching Chinese characters taken from the dataset of four vaults. A new feature called the convex hull ratio is proposed for selecting the best segmentation path, as it exploits the balance of Chinese characters' structure. Experimental results show that segmentation accuracy improved consistently with the new feature across three different classifiers, and the Gaussian mixture model achieves an accuracy of 88.6%.

Further Development of ZYQ: A Three-staged Coding Series for Chinese Character Input

ZHANG Xiao-heng

2005, 19(1): 99-105.

The ZYQ Chinese character input method has been developed into a three-staged series including the whole-character stroke order method, the whole-character stroke group method and the-21 stroke group method. The first two methods are simple and effective for Chinese character retrieval on large character sets, while the third method is more suitable for normal typing and writing at higher speed. Technically, further simplification, systemization and regularization have been applied to the selection of multi-stroke coding units, the definition of structural components and the bisegmentation of multi-component Chinese characters. In addition, the coded Chinese character set has been extended to include 1164 characters specific to the regions of Hong Kong, Taiwan and Macao.

2005 Volume 19 Issue 1 Published: 15 February 2005