Journal of Chinese Information Processing

An Approach to Processing New Words Based on HMM in Part-of-Speech Tagging

ZHANG Xiao-fei,CHEN Zhao-xiong,HUANG He-yan,CAI Zhi

2003, 17(5): 2-6.

Part-of-speech (POS) ambiguity is an important ambiguity phenomenon in natural language processing that urgently needs to be resolved, and it is especially difficult to resolve for new words. This paper converts the POS tagging problem into the problem of computing a word's emission probability and proposes a new HMM-based approach to solve it. The approach uses nothing more than a tagged corpus (e.g., no grammar dictionaries and no grammar rules), and the results show that the accuracy reaches 97% in the closed test and 92% in the open test.
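The role of the emission probability for a new word can be illustrated with a minimal Python sketch of HMM tagging. The abstract does not say how the authors estimate the emission probability of an unseen word, so the add-one smoothing used here is only a stand-in assumption, and the function names (`train`, `emission`, `viterbi`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect transition and emission counts from a tagged corpus."""
    trans = defaultdict(Counter)  # previous tag -> counts of next tag
    emit = defaultdict(Counter)   # tag -> counts of emitted words
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def emission(word, tag, emit, vocab_size):
    """P(word | tag) with add-one smoothing (an assumption, not the paper's
    estimate): a new word gets a small non-zero probability under every tag."""
    total = sum(emit[tag].values())
    return (emit[tag][word] + 1) / (total + vocab_size + 1)

def viterbi(words, trans, emit, tags, vocab_size):
    """Return the most probable tag sequence for the given words."""
    if not words:
        return []

    def log_t(prev, tag):
        # Add-one smoothed transition probability, in log space.
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    best = {
        t: (log_t("<s>", t) + math.log(emission(words[0], t, emit, vocab_size)), [t])
        for t in tags
    }
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_t(p, t), best[p][1]) for p in tags),
                key=lambda x: x[0],
            )
            new[t] = (score + math.log(emission(w, t, emit, vocab_size)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Hypothetical toy corpus; "blorf" below is a word never seen in training.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VB")],
]
trans, emit = train(corpus)
tags = sorted(emit)
vocab_size = len({w for sent in corpus for w, _ in sent})
predicted = viterbi(["the", "blorf", "runs"], trans, emit, tags, vocab_size)
```

With this smoothing the new word is equally likely under every tag, so its tag is decided by the surrounding transitions; a better emission estimate for new words is exactly what the paper sets out to provide.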

Noun Phrase Alignment in Chinese-English Bilingual Corpora

LIU Dong-ming,ZHAO Jun,YANG Er-hong

2003, 17(5): 7-13.

This paper proposes a method for automatically aligning bilingual noun phrases in a sentence-aligned Chinese-English bilingual corpus. The distinguishing feature of the method is that it handles high-frequency and low-frequency noun phrases separately, without requiring accurate recognition of Chinese noun phrases. High-frequency noun phrases in the English corpus are aligned to those in the Chinese corpus using an iterative re-evaluation algorithm based on the co-occurrence of English phrases and Chinese words in the bilingual corpus; low-frequency noun phrases are aligned using manual rules and a Dice coefficient based on an English-Chinese dictionary. The method takes alignment information into account as a whole and achieves a high coverage rate.
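The dictionary-based Dice coefficient used for low-frequency phrases can be sketched in Python as follows. The abstract does not give the exact formulation, so this uses the standard set definition 2|X ∩ Y| / (|X| + |Y|); the function names, the toy dictionary, and the phrase pair are assumptions for illustration only.

```python
def dice(a, b):
    """Standard Dice coefficient between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical toy bilingual dictionary: English word -> Chinese translations.
TOY_DICT = {
    "information": ["信息", "情报"],
    "processing": ["处理", "加工"],
}

def phrase_score(english_words, chinese_words, dictionary=TOY_DICT):
    """Score a candidate phrase pair by the Dice overlap between the dictionary
    translations of the English words and the Chinese words."""
    translations = {t for w in english_words for t in dictionary.get(w, [])}
    return dice(translations, chinese_words)

# Four candidate translations, two of which match the Chinese phrase:
# 2 * 2 / (4 + 2) = 2/3.
score = phrase_score(["information", "processing"], ["信息", "处理"])
```

A candidate pair with a higher score shares more dictionary translations, so ranking the candidates within an aligned sentence pair by this score is one plausible way to pick the alignment.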

Research on Cache-based Adaptive Chinese Language Model

QU Wei-min,ZHANG Jun-lin,SUN Le,SUN Yu-fang

2003, 17(5): 14-19,41.

Although cache-based language models adapt better to cross-domain environments, the hypothesis they rest on is too simple: a word that has appeared in an article is assumed to be likely to reappear later in the same article. This assumption ignores the influence of stop words and the interaction between different words. To address this problem, we make two improvements to the model. First, we use a TF-IDF scheme instead of simple counts. Second, we adopt an extended cache-based 2-gram model, which expands the information that the model exploits. Experiments show that the performance of the adaptive model is greatly improved.
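The first improvement can be illustrated with a minimal Python sketch of a TF-IDF-weighted unigram cache interpolated with a static model. The abstract gives no formulas, so the smoothed IDF, the interpolation weight `lam`, the function names, and the toy document frequencies are all assumptions; the extended 2-gram cache is not covered.

```python
import math
from collections import Counter

def cache_probs(history, doc_freq, n_docs):
    """Cache distribution over the words seen so far, weighted by TF-IDF
    rather than raw counts, so that stop words contribute little."""
    tf = Counter(history)
    weights = {
        # Smoothed IDF (an assumption): zero for a word found in every document.
        w: c * math.log((n_docs + 1) / (doc_freq.get(w, 0) + 1))
        for w, c in tf.items()
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    return {w: v / total for w, v in weights.items()}

def adapted_prob(word, static_prob, cache, lam=0.8):
    """Linear interpolation of the static model and the cache component."""
    return lam * static_prob + (1 - lam) * cache.get(word, 0.0)

# Hypothetical history and document frequencies over 100 documents.
history = ["的", "模型", "的", "缓存", "模型"]
doc_freq = {"的": 100, "模型": 9, "缓存": 4}
cache = cache_probs(history, doc_freq, n_docs=100)
p_model = adapted_prob("模型", static_prob=0.001, cache=cache)
```

Here the stop word 的 occurs twice in the history but receives zero cache weight, while the content word 模型 is boosted well above its static probability.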

On the SC Transfer of Action-effect Sentences from Chinese to English

ZHANG Ke-liang,HUANG Zeng-yang

2003, 17(5): 20-27.

According to the HNC conceptual network, action-effect sentences in Chinese arise directly from causative verbs and compelling verbs, and indirectly from general acting verbs, i.e., via the "de (得)" construction. In Chinese-English machine translation, action-effect sentences arising from these three conceptual types of verbs follow different SC (sentence category) and SF (sentence format) transfer rules. Different transfer frames (TransFrame) should therefore be adopted to ensure that target-language (TL) sentences are generated with proper syntactico-semantic structures. Experiments show that the SC-SF transfer rules underlying the transfer of action-effect sentences from Chinese to English have wide coverage.

The Relevance Evaluation of Celebrities' Web Pages

ZAN Hong-ying,SU Yu-mei,SUN Bin,YU Shi-wen

2003, 17(5): 28-34.

This paper introduces the design and implementation of the Tianwang Fame System. It mainly discusses the factors and algorithms that affect the matching of a named entity against Chinese web pages when evaluating their relevance to celebrities. Aiming at the shortcomings of current search engines, the project seeks to improve the quality of web information services and to enhance personalized services. Built on the Tianwang Search Engine of Peking University, the Fame System adopts new techniques in natural language processing, especially Chinese information extraction tailored to the features of web page information. The paper proposes a new method for evaluating the relevance of web pages against the attributes of named entities. This method optimizes the ordering of search results and improves the service quality of the Tianwang Fame System.

The Analysis of a Contest Result on Chinese Web Page Automatic Categorization

FENG Shi-cong,WANG Ji-min

2003, 17(5): 35-41.

A Chinese web page automatic categorization contest was held at the national symposium on Search Engine and Web Mining, and ten teams took part. After describing the contest rules, this paper analyzes the contest results in detail, giving a clear view of the current state of Chinese web page automatic categorization technology: no obvious difference is found among the classifiers that were developed, and Chinese web page categorization is more difficult than plain-text categorization. The paper also attempts to provide a standard set of Chinese web page categorization examples and to develop it, through continuous revision, into a base corpus for Chinese web page categorization.

Study of Dialog Turn-Based Decaying Cache Adaptation Model

HE Wei,LI Hong-lian,YUAN Bao-zong,LIN Bi-qin

2003, 17(5): 42-48.

The substantial investment required to develop a spoken language system for each specific task hampers the widespread use of speech technology. To support toolkits for porting a spoken language system to a new application rapidly and simply, this paper presents an improved cache model, a history-unit-based decaying cache model, for on-line language model adaptation of spoken language systems. To capture changes in dialog state, each user utterance and system response is collected and used for training. When a dialog turn finishes, the cache is updated, and the bigram counts become fractional after decaying. The cache bigram is interpolated with the generic trigram. Experiments are performed on two contrasting tasks: train travel reservation and park guide. Even when the training data amounts to only several hundred utterances, both tasks show a satisfactory reduction in character error rate for both supervised and unsupervised adaptation.
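The turn-based decay and interpolation described above can be sketched in Python as follows. The abstract does not specify the decay function or the interpolation weights, so the multiplicative `decay` factor, the weight `lam`, and the class and function names are assumptions; the generic trigram probability is passed in as a plain number.

```python
from collections import defaultdict

class DecayingBigramCache:
    """Bigram cache whose counts are decayed once per dialog turn."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.counts = defaultdict(float)   # (w1, w2) -> decayed bigram count
        self.context = defaultdict(float)  # w1 -> decayed count as left context

    def update(self, turn_tokens):
        """Call at the end of a dialog turn: decay old counts, add new ones."""
        for key in self.counts:
            self.counts[key] *= self.decay   # counts become fractional
        for key in self.context:
            self.context[key] *= self.decay
        for w1, w2 in zip(turn_tokens, turn_tokens[1:]):
            self.counts[(w1, w2)] += 1.0
            self.context[w1] += 1.0

    def prob(self, w1, w2):
        """Relative-frequency estimate of P(w2 | w1) from the decayed counts."""
        denom = self.context.get(w1, 0.0)
        return self.counts.get((w1, w2), 0.0) / denom if denom > 0 else 0.0

def interpolate(p_generic, p_cache, lam=0.7):
    """Linear interpolation of the generic model with the cache bigram."""
    return lam * p_generic + (1 - lam) * p_cache

# Two hypothetical dialog turns; the older turn carries half the weight.
cache = DecayingBigramCache(decay=0.5)
cache.update(["我", "要", "订", "票"])
cache.update(["我", "要", "退", "票"])
p_recent = cache.prob("要", "退")  # 1.0 / 1.5
p_older = cache.prob("要", "订")   # 0.5 / 1.5
```

Because older turns are down-weighted, the cache follows the most recent dialog state while still retaining some evidence from earlier turns.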

Detection Model of Prosodic Boundary Based on Prosodic Features and Syntactic Information

WU Xiao-ru,WANG Ren-hua,LIU Qing-feng

2003, 17(5): 49-55.

Automatic detection of prosodic boundaries in continuous speech is very useful for labeling corpora in TTS systems and for segmenting phrases in speech recognition. We propose an automatic break detection algorithm for Mandarin Chinese speech. The labeling model includes the following steps. First, acoustic parameters are analyzed to select those useful for the detection model. Then, the relationship between syntactic information and prosodic words is obtained by statistical methods. At the same time, the F0 value is estimated by an F0 prediction model in which every syllable boundary is assumed to be a non-prosodic boundary. Finally, the acoustic parameters, syntactic information, and estimated F0 values are all input into a decision tree to predict potential prosodic word boundaries. Experiments show that this detection model speeds up manual labeling of prosodic boundaries and has little impact on label assignment.

Character Extraction in Complex Color Document Images

CHEN You-xin,LIU Chang-song,DING Xiao-qing

2003, 17(5): 56-60.

Many documents today contain text printed on colored and/or complex backgrounds. To recognize these characters, they must first be extracted from the images. This paper proposes two novel techniques that together constitute a robust character extraction algorithm. First, we search for color connected components by applying a new region-growing algorithm, the color run-length adjacency graph (CRAG) algorithm, and then divide the image into several layers by clustering the central colors of all the components. Finally, the character layers are selected using connected component (CC) analysis and recognition information from OCR. The algorithm modifies and extends the BAG algorithm to color document images, fully utilizing color and position information. Experiments show that the new method is especially good at extracting gradient-color characters at high speed and can also restore the original color of the characters.

Study of a Method for JiaGuWen Symbol Coding

XIAO Ming,ZHAO Hui,GAN Zong-wei

2003, 17(5): 61-66.

This paper studies JiaGuWen symbol coding using fuzzy mathematics theory and establishes a method for clustering JiaGuWen symbol code roots and coding JiaGuWen characters. On this basis, we use entropy from information theory to analyze the efficiency and rationality of the coding, thereby providing a theoretical foundation for the scientific coding of JiaGuWen characters.
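An entropy-based efficiency measure of the kind mentioned above can be sketched in Python. The abstract does not say how the authors apply entropy, so this assumes the common approach of comparing the Shannon entropy of code-root usage with its maximum possible value; the function names and the sample sequences are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def coding_efficiency(symbols):
    """Entropy divided by its maximum log2(n) for n distinct symbols;
    1.0 means every code root is used equally often."""
    n = len(set(symbols))
    if n <= 1:
        return 1.0
    return shannon_entropy(symbols) / math.log2(n)

# Hypothetical code-root sequences: one evenly used, one skewed.
even = coding_efficiency("abcd")
skewed = coding_efficiency("aaab")
```

A value near 1.0 indicates that the code roots are used evenly, while a low value suggests that a few roots carry most of the load.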

2003 Volume 17 Issue 5 Published: 15 October 2003