Journal of Chinese Information Processing

Select

The Construction and Utilization of A Comprehensive Language Knowledge-base

YU Shi-wen,DUAN Hui-ming,ZHU Xue-feng,ZHANG Hua-rui

2004, 18(5): 2-11.

Abstract ( ) PDF ( )

Knowledge map

Save

The scale and quality of the knowledge-base decides the success or failure of the natural language processing system. Institute of computational linguistics of Peking university has accumulated a series of languages-data resources that have good quality with considerable scale after 18 years of diligent work : the grammatical knowledge-base of contemporary Chinese , the large-scale POS-Tagged corpus of contemporary Chinese , Semantics Knowledge-base of Contemporary Chinese (SKCC) , Chinese Concept Dictionary (CCD) , a bilingual parallel corpus with different aligned units , special term bank of different disciplines , the phrase structure knowledge-base of contemporary Chinese , a corpus of ancient Chinese poems. The present research will integrate these language data resources into one unified and comprehensive language knowledge-base. While incorporating all these different resources , the gaps between them must be filled up. The comprehensive language knowledge-base being planned will provide not only friendly using interface and convenient application program interface but also various software toolssupporting knowledge mining. Therefore , the research promotes the present language data resources to develop constantly from primary products into deep processed products. It will set up diversified forms of knowledge spreading mechanism and information service mechanism to offer omni-directional and multi-level support to language information processing , traditional linguistics research and language teaching.

Select

The Inspecting Method Study on Consistence of Part of Speech Tagging of Corpus

ZHANG Hu，ZHENG Jia-heng，LIU Jiang

2004, 18(5): 12-17.

Abstract ( ) PDF ( )

Knowledge map

Save

In the deep processing of large-scale corpus , it has been a chief problem to assure the consistence of part of speech tagging to build the high-quantity corpus. A new inspecting method on consistence of part of speech tagging based on clustering and classifying is put forward , firstly we cluster the sequences of part of speech of the example and get the threshold value , then classify the test sequences to judge its’correctness , furthermore , we can know the condition of the consistence of part of speech on every text , and assure the correctness of the part of speech tagging on large-scale corpus further.

Select

A Statistical Approach for Content Extraction from Web Page

SUN Cheng-jie,GUAN Yi

2004, 18(5): 18-23.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper proposes a statistical approach for extracting text content from Chinese news web pages in order to effectively apply natural language processing technologies to web page documents. The method uses a tree to represent a web page according to HTML tags , and then chooses the node which contains text content by using the number of the Chinese characters in each node of the tree. In comparison with traditional methods , the method needn't construct different wrappers for different data sources. It is simple , accurate and easy to be implemented. Experimental results show that the extraction precision is higher than 95 %. The method has been adopted to provide web text data for a question answering system of traveling domain.

Select

Study on Logical description of Chinese Metaphor Comprehension

ZHANG Wei,ZHOU Chang-le

2004, 18(5): 24-29.

Abstract ( ) PDF ( )

Knowledge map

Save

People always encounter metaphor phenomenon in daily life. It plays an important role in language epistemology and discourse understanding. But its logic system and computational method researches are all in the early stage. The paper creates a metaphor logic system by giving definition , constructing , and analyzing the properties of the logic. The paper also fractionizes the rules of the logic to analyze Chinese sentences containing nominal metaphor , verbal metaphor and so on. Then it uses a logical method to uncover the latent information of metaphorical sentences based on metaphor logic system. The logic system we proposed here gives a new method for Chinese metaphor comprehension in natural language processing field. The result shows that the metaphor logic has good capability to analyze the metaphor sentences' meaning. It gives instructional method for computer to perform metaphor comprehension.

Select

The Research and Development of Web-based Chinese-English CAPP

YANG Yu-tu,YE Wen-hua,WANG Ning-sheng

2004, 18(5): 30-36.

Abstract ( ) PDF ( )

Knowledge map

Save

It is necessary for CAPP system to support long-distance design and product data sharing. However most CAPP systems developed so far can only support local design and generate process plans in one kind of language. In addition , general translation software could not been used in CAPP system directly. Therefore , it is compulsory to develop Web-based Chinese-English CAPP system integrated with computer aided translation software. In this paper , based on reviewing of Computer Aided Translation (CAT) and analyzing of process plans language feature , the framework , function and key technologies of the system are probed. One kind of Web-based Chinese-English CAPP system is put forwards. At last a process planning and computer aided translation software aimed at aircraft enterprises is implemented and applied successfully.

Select

A Method for Ordering Tibetan Text in Arccord with ISO 14651

LIN He-shui,CHENG Wei,CAO Hui,LI Wen-bo,WU Jia,SUN Yu-fang

2004, 18(5): 37-42.

Abstract ( ) PDF ( )

Knowledge map

Save

This thesis discusses the machine ordering of Tibetan words on the basis of linear characters , which means any pre-composed forms or vertical stack can be processed as a single Tibetan character. Our method is to divide Tibetan words into two types : with or without pre-consonant character. And by defining base characters without pre-consonants and base characters with pre-consonants , we convert the Tibetan words into all kinds of strings like base characters without pre-consonants , base characters with pre-consonants , pre-consonant characters , base characters , post-consonant characters , ppost-consonant characters. Then compare all the defined units with their weight and acquire results. The method is according with the semantic of ISO/IEC 14651.

Select

Text Location in Natural Scene

OU Wen-wu,ZHU Jun-min,LIU Chang-ping

2004, 18(5): 43-48,64.

Abstract ( ) PDF ( )

Knowledge map

Save

With the rapid growth of research on text recognition in natural scene , it turns to be urgent to understand the development situation of this art and to establish common benchmark datasets. So the organizers of international conference on document analysis and recognition 2003 develop a dataset on this art specially and organize the robust reading competition , and we take part in the sub-competition of robust text location. In this paper we shall introduce our algorithm on this competition and give the competition result ; in the end of paper we give the compare of each entries' algorithms and point out the development situation of robust text location presently.

Select

Assigning Break Indices for Unrestricted Texts in Mandarin Text to Speech System

ZHAO Yong-zhen,LIU Ting,WANG Zhi-wei,CHEN Hui-peng,SHAO Yan-qiu

2004, 18(5): 49-56.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper uses a corpus with break indices based on C-TOBI. Applying supervised learning method , some useful attempts are made in the field of automatic break indices intonation. Three approaches , namely , the basic Markov model approach , the Markov model using word length approach , and the Markov model using word length combining transformation-based error-driven learning approach , are presented. After implementing these three approaches , open tests are made on a corpus of 3,000 sentences. The performances are getting better and the last approach produces the highest accuracy , 78.5% , and results in 14.5% decrease in error-cost taking the result of Markov model as baseline.

Select

Prosody Phrase Break Prediction Based on Maximum Entropy Model

LI Jian-feng,HU Guo-ping,WANG Ren-hua

2004, 18(5): 57-64.

Abstract ( ) PDF ( )

Knowledge map

Save

In TTS (Text-To-Speech) systems , prosody phrase breaks can not be predicted with high accuracy , which slows down the improvement of naturalness of synthesized speech. In this paper , a maximum entropy based model for prosody phrase break prediction is proposed , and a comparison is conducted on large corpora between the new model and the decision tree based model which is the mainstream method for prosody phrase break prediction. The contribution of lexical feature set and influences of different cutoff values are also investigated. It is demonstrated that , utilizing the same feature set , maximum entropy based model makes an improvement of 5.5% on F-Score over decision tree based model. Integrating lexical information , an improvement of 9.4% over decision tree based model is achieved. In the end , it is pointed out that a maximum entropy model can be considered as a weighted rule system , which solves the problem of rule conflicting in an elegant way.

Select

Robust Speech Recognition Based on the Compensation of Speech Enhancement Distortion

DING Pei,CAO Zhi-gang

2004, 18(5): 65-70.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper proposes a roubst speech recognition method based on the compensation of speech enhancement distortion. In the front-end ,speech enhancement effectively suppresses the background noise to improve the Signal-to-Noise Ratio (SNR) of the input signal. The residual noise and the spectrum distortion after enhancement are adverse factors for speech recognition , and their effects will be compensated by Parallel Model Combination (PMC) in recognition stage or by Cepstral Mean Normalization (CMN) in feature extraction stage. Experiment results show the proposed method can significantly improve the accuracy of speech recognition system across a wide range of SNRs , especially in very noisy environments. For example , in -5dB white noise , this method can reduce the error rate by 67.4% versus the baseline recognizer.

Select

A Novel Endpoint Detection Algorithm Based on HOS in Noisy Environment

WANG Zhuo,SU Mu,LI Peng,XU Bo

2004, 18(5): 71-78.

Abstract ( ) PDF ( )

Knowledge map

Save

In this paper , various endpoint detection methods are classified into three kinds : robust feature , feature filtering and template match , which are evaluated and analyzed against each other. After Exploring the essential differences between noise and speech , Higher Order Statistics (HOS) are introduced in this paper and a method of using radially integrated polyspectra is applied as a feature , where multi-dimentional spectra space is transformed into one-dimentional spectra space. Experiments show that this algorithm can achieve fairly good performance in different kind of noise and various SNR.

Select

Research of Chinese-English Bilingual Acoustic Modeling

YU Sheng-min,ZHANG Shu-wu,XU Bo

2004, 18(5): 79-85.

Abstract ( ) PDF ( )

Knowledge map

Save

In this paper , three different approaches of Chinese-English bilingual acoustic modeling are investigated and compared. The first approach is to simply combine Chinese and English phone inventories together without phone shared across the languages. The second one is to map language-dependent phones to the inventory of the International Phonetic Association (IPA) based on phonetic knowledge to construct the bilingual phone inventory. The third one is to merge the language-dependent phone models by hierarchical phone clustering algorithm to get a compact bilingual inventory. Experimental results show that phone clustering approach outperforms IPA-based phone mapping approach , and it can also achieve comparable performance to the simple combination of language-dependent phone inventories with less model parameters , especially when using acoustic likelihood measurement .

Please choose a citation manager

Content to export

2004 Volume 18 Issue 5 Published: 15 October 2004