2006 Volume 20 Issue 2 Published: 15 April 2006
  

  • LIN Ying,SHI Xiao-dong,GUO Feng
    2006, 20(2): 3-9,34.
    This paper studies the limitations of probabilistic context-free grammar (PCFG) and proposes the concept of co-occurrence in syntactic structure in order to exploit context information. To address the small scale of the Chinese Treebank, an Inside-Outside algorithm is given to estimate the parameters of the syntactic rules. Finally, we present a probabilistic top-down Chinese parser. In the closed test, labeled precision and labeled recall reach 88.1% and 86.8% respectively, showing that this method has the potential for better parsing performance and deserves further research.
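As a rough illustration of the inside computation the abstract refers to, a CKY-style inside-probability pass over a toy PCFG might look like the sketch below. The grammar, words, and probabilities are invented for illustration and are not from the paper.

```python
from collections import defaultdict

# Toy PCFG: lexical rules (word -> [(symbol, prob)]) and binary rules
# (lhs -> rhs1 rhs2 with prob). All values are made up for illustration.
lexical = {"time": [("N", 0.7)], "flies": [("N", 0.3), ("V", 0.7)],
           "fast": [("V", 0.3), ("Adv", 1.0)]}
binary = [("S", "N", "VP", 1.0), ("VP", "V", "Adv", 1.0)]

def inside(words):
    """Return the inside probability P(words | S) by CKY dynamic programming."""
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                      # width-1 spans
        for sym, p in lexical.get(w, []):
            chart[i][i + 1][sym] += p
    for span in range(2, n + 1):                       # wider spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for lhs, b, c, p in binary:
                    chart[i][j][lhs] += p * chart[i][k][b] * chart[k][j][c]
    return chart[0][n]["S"]
```

The Inside-Outside algorithm iterates this inside pass (plus a symmetric outside pass) to re-estimate rule probabilities from unannotated text.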
  • ZUO Yun-cun,ZONG Cheng-qing
    2006, 20(2): 10-17.
    Spoken language understanding is a crucial part of spoken language translation systems and human-machine dialogue systems. In this paper, we propose a new approach to spoken Chinese understanding that combines statistical and rule-based methods. In this approach, semantic classification trees, built from semantic rules automatically learned from the training data, are used to disambiguate keywords related to a sentence's shallow semantic meaning; a statistical model is then used to extract the whole sentence's domain action. The experimental results show that this approach performs well and is feasible for restricted-domain Chinese spoken language understanding at the shallow semantic level.
  • ZHANG Yong-chen,SUN Le,LI Fei,LI Wen-bo,Nishino,YU Hao,FANG Gao-lin
    2006, 20(2): 18-25.
    Bilingual dictionaries are the basis of many NLP applications such as multilingual information retrieval and machine translation. This paper proposes a method of extracting a domain-specific bilingual dictionary from non-parallel corpora: first, it discusses the fundamental assumption and reviews related research; second, it presents an algorithm for extracting the domain-specific bilingual dictionary from non-parallel corpora using the word relation matrix; and finally, it analyzes the influence of the seed words on dictionary extraction through extensive experiments. The experiments demonstrate that the quantity and average frequency of the seed word pairs contribute effectively to the results.
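The core idea behind this family of methods is that a word and its translation tend to co-occur with translations of the same seed words. A minimal sketch, with invented seed dimensions, co-occurrence counts, and placeholder romanized target words (none taken from the paper):

```python
import math

# Each word is represented by its co-occurrence counts with a small seed
# lexicon; candidate translations are ranked by cosine similarity.
seeds = ["economy", "market", "bank"]          # seed dimension order

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Source-side word vectors over the seed dimensions (counts invented).
src = {"inflation": [5, 3, 1]}
# Target-side word vectors over the translated seed dimensions.
tgt = {"tonghuopengzhang": [4, 3, 1], "yinhang": [0, 1, 9]}

def best_translation(word):
    """Return the target word whose context vector is most similar."""
    v = src[word]
    return max(tgt, key=lambda t: cosine(v, tgt[t]))
```

The abstract's finding that seed quantity and frequency matter corresponds here to how many dimensions the vectors have and how reliably their counts are estimated.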
  • FANG Zhi-fei,LIN Hong-fei,YANG Zhi-hao,ZHAO Jing
    2006, 20(2): 26-34.
    Genre is defined as a category on the basis of external criteria, so genre classification differs from content-based classification. A new mechanism for the automatic classification of Chinese text genres is presented, whose main idea is as follows. Features for genre classification, the essential factor in the mechanism, are described in two ways: one as word sets, such as affective words and political words derived from related dictionaries and corpus statistics; the other in rule format, such as document identifiers and items. In terms of the correlation and variance of features, a parametric-distribution approach is applied to evaluate the various features of the genres and to extract the features for genre classification. A Support Vector Machine is then used as the learning algorithm to build the classifier. An experiment on automatic classification of Chinese text genres, run on a text corpus consisting of five genres, shows that the mechanism improves classification precision.
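The two feature types the abstract distinguishes, word-set features and rule-format features, can be sketched as a small feature extractor. The lexicons and regular expressions below are hypothetical stand-ins, not the paper's actual resources.

```python
import re

# Hypothetical word-set features: counts of words from genre-indicative lexicons.
affective = {"love", "fear", "joy"}
political = {"policy", "government"}
# Hypothetical rule-format features: document-identifier patterns.
rules = [r"^No\.\s*\d+", r"Article \d+"]

def genre_features(text):
    """Map a document to [affective count, political count, rule hits...]."""
    words = text.lower().split()
    vec = [sum(w in affective for w in words),
           sum(w in political for w in words)]
    vec += [1 if re.search(p, text) else 0 for p in rules]
    return vec
```

Such vectors would then be fed to the SVM learner to build the genre classifier.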
  • WEN Xu,ZHANG Yu,LIU Ting,MA Jin-shan
    2006, 20(2): 35-41.
    Question classification is very important for question answering, and its result directly affects the quality of question answering. This paper presents a new feature-extraction method for question classification. The output of syntactic parsing is used to extract the subject-predicate structure, as well as interrogative words and their adjuncts, as classification features, leading to a substantial reduction in noise and an emphasis on the main features of the question. A Bayesian classifier is used for classification, which effectively increases the precision of question classification. The experimental results validate the effectiveness of this method: the classification precision of coarse classes and fine classes reaches 86.62% and 71.92% respectively, attaining the expected effects.
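A minimal multinomial naive Bayes classifier over hand-picked question features (an interrogative word plus a head word, standing in for the paper's parser-derived features) might look as follows. The training pairs and class labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented training data: (features, class).
train = [(("who", "person"), "HUMAN"),
         (("where", "city"), "LOCATION"),
         (("who", "author"), "HUMAN"),
         (("where", "country"), "LOCATION")]

class_counts = Counter(c for _, c in train)
feat_counts = defaultdict(Counter)
vocab = set()
for feats, c in train:
    feat_counts[c].update(feats)
    vocab.update(feats)

def classify(feats):
    """Pick the class maximizing log P(class) + sum log P(feat | class)."""
    def score(c):
        s = math.log(class_counts[c] / len(train))
        total = sum(feat_counts[c].values())
        for f in feats:                       # add-one smoothing
            s += math.log((feat_counts[c][f] + 1) / (total + len(vocab)))
        return s
    return max(class_counts, key=score)
```

Restricting the features to the interrogative word and its syntactic head is what gives the noise reduction the abstract claims.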
  • LIANG Han,CHEN Qun-xiu,WU Ping-bo
    2006, 20(2): 42-48.
    Information extraction technologies can provide high-quality retrieval services. In this paper we present an information extraction model based on event frames and build a unified calamity event frame. The extraction system can be implemented easily thanks to the inheritance and induction properties of the frame. We also use the frame to collect event information and then output the results in chronological order. A calamity event information extraction system was built using these methods, and the experiment indicates that the methods are feasible.
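The inheritance property of event frames can be sketched as a generic calamity frame whose slots are inherited by more specific event frames. The slot names below are hypothetical, not the paper's actual frame definition.

```python
class Frame:
    """A frame with named slots; slots are inherited from the parent frame."""
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent = name, parent
        self.slots = dict(slots)

    def all_slots(self):
        # Parent slots first, then this frame's own slots (which may override).
        base = self.parent.all_slots() if self.parent else {}
        base.update(self.slots)
        return base

# A generic calamity frame and a specific event frame inheriting from it.
calamity = Frame("Calamity", time=None, place=None, casualties=None)
earthquake = Frame("Earthquake", parent=calamity, magnitude=None)
```

Filling the inherited slots from text and sorting filled frames by their time slot gives the chronological output the abstract describes.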
  • SUN Xiao-shan,WANG Qiang,GUAN Yi,WANG Xiao-long
    2006, 20(2): 49-54.
    The Wu-Manber multiple-pattern matching algorithm does not work well when some patterns are suffixes of other patterns. To solve this problem, an improved algorithm is introduced that reduces the number of comparisons during pattern matching, leading to faster matching. The text retrieval experiments use 52,067 passages randomly selected from TREC2000. Three algorithms, namely the original Wu-Manber algorithm, the improved algorithm, and an algorithm that simply breaks off halfway, are compared, and the results show that the improved algorithm steadily reduces the number of character comparisons and thus works more efficiently.
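For context, the baseline Wu-Manber scheme shifts the scan window by blocks of B characters using a SHIFT table built from all patterns, verifying candidates when the shift is zero. This is a simplified sketch (block size 2, naive verification, patterns assumed at least B characters long), not the paper's improved variant.

```python
def wu_manber(text, patterns, B=2):
    """Report (start, pattern) hits using a block-based shift table."""
    m = min(len(p) for p in patterns)      # window length = shortest pattern
    shift = {}
    for p in patterns:
        for i in range(m - B + 1):
            block = p[i:i + B]
            d = m - B - i                  # block distance from window end
            shift[block] = min(shift.get(block, m - B + 1), d)
    hits = []
    pos = m
    while pos <= len(text):
        block = text[pos - B:pos]          # block at the window end
        d = shift.get(block, m - B + 1)
        if d == 0:                         # candidate: verify every pattern
            for p in patterns:
                if text.startswith(p, pos - m):
                    hits.append((pos - m, p))
            pos += 1
        else:
            pos += d
    return hits
```

With patterns like "she" and "he" (one a suffix of the other), zero shifts become frequent and verification costs grow, which is the weakness the paper's improvement targets.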
  • SONG Ji-hua,LI Guo-yu,WANG Ning
    2006, 20(2): 55-61.
    The study of Chinese semantic relationships relies on the study of the phonetic-semantic relations among Chinese characters, which fall into three types: homophony, synonymy, and paronymy. One of the most important tasks for Shuowenjiezi (SWJZ) researchers is to explore the phonetic-semantic relationship of Chinese characters and then infer the semantic relations between characters from their phonemes. In order to better understand these phonetic-semantic relations, especially paronymy, with computer technology, this paper formalizes the expression of the phoneme rules, which serves as a foundation for future research. The Knowledge Base of SWJZ includes the Shuangsheng Rule Base and the Dieyun Rule Base, both of which are built as productive frames by combining rule slots with the traditional attribute slots and bases, and which can represent descriptive and rule knowledge efficiently.
  • YAN Long,LIU Gang,GUO Jun
    2006, 20(2): 62-67.
    In this paper, wavelet decomposition is used to decompose the speech signal into five levels, and the wavelet coefficients of each part are reconstructed. Because different frequency bands of the speech signal influence system performance differently, an acoustic model is trained and tested for each level. The experimental results show that the method is effective against Gaussian white noise and real environmental noise, but not against pink noise.
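A toy five-level Haar analysis illustrates the multi-level decomposition idea: the signal is repeatedly split into a low-pass approximation and a high-pass detail band. Real systems would use a proper wavelet library and filter bank; this sketch (which assumes the signal length is a multiple of 2^levels) only shows the structure.

```python
def haar_step(x):
    """One Haar analysis step: pairwise averages (low) and differences (high)."""
    approx = [(a + b) / 2 for a, b in zip(x[::2], x[1::2])]
    detail = [(a - b) / 2 for a, b in zip(x[::2], x[1::2])]
    return approx, detail

def decompose(signal, levels=5):
    """Return [detail_1, ..., detail_levels, final_approximation]."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        coeffs.append(detail)
    coeffs.append(approx)      # the remaining low-frequency band
    return coeffs
```

Training a separate acoustic model per band, as the abstract describes, lets bands corrupted by noise contribute less to recognition.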
  • XU Ri-jun,LIU Chang-ping
    2006, 20(2): 68-73.
    Hangul characters are composed of graphemes representing consonants and vowels in Korean. One important Hangul character recognition method is therefore to separate the graphemes of a character and recognize them independently. For grapheme separation, this paper proposes a background-thinning technique combined with the structural information of the characters. The separated graphemes are then recognized by a statistical method using peripheral features. In a test on machine-printed Hangul in four fonts, the proposed approach achieved a grapheme segmentation rate of 97.4% and a grapheme recognition rate of over 99%.
  • Riyiman TURSUN,Woxur SILAMU
    2006, 20(2): 74-79.
    Designing a cell phone supporting Uighur, Chinese, and English is of great practical and commercial value to the development of communication and the economy in minority areas. This paper studies the characteristics of the Uighur script, the difficulties in designing an input method for mobile phones, and the problems of displaying mixed Chinese, English, and Uighur data with different character widths and input directions. It then suggests a Uighur keyboard layout for mobile phones, considering the characteristics of the Uighur script and the physical features of the display screen. In addition, a Uighur input method supporting mixed multilingual display is given, together with a program flowchart for realizing its important parts.
  • LU Ya-jun
    2006, 20(2): 80-88.
    In order to improve the layout and input methods of current Tibetan computer keyboards, this paper draws on the basic theory of keyboard arrangement, relevant principles and scientific data, and Tibetan corpus statistics on characters, parts, syllables, and vocabulary. It also follows Tibetan grammatical rules and specific characteristics as the basis of a specialized study of the properties of current Tibetan keyboards. The author proposes a multi-character key, that is, a key integrating many parts, so that Tibetan text can be input without shifting to other keys. This method improves the speed and efficiency of Tibetan text input and can be widely used in Tibetan printing, office automation, and information processing.
  • RUI Jian-wu,WU Jian,SUN Yu-fang
    2006, 20(2): 89-95.
    The implementation of multilingual text I/O is essential for computers to interact with all sorts of users in the world. One of the most important qualities of a computer is how, and to what extent, its operating system supports languages with multiple scripts. Owing to considerable differences among scripts, multilingual text processing in a global operating system is very complicated. In this paper, firstly, the scope and content of multilingual text processing are defined, including text input, storage, processing, and interaction in an internationalized manner. Secondly, models for text processing are outlined, several technical solutions are discussed, and their pros and cons are listed. Thirdly, the technical features of text processing implemented in current operating systems are analyzed. Finally, some challenges in the realm of internationalized text processing are presented.
  • CHENG Wei,LIN He-shui,WU Jian,SUN Yu-fang
    2006, 20(2): 96-102.
    Almost all large database systems currently in use, such as Oracle, Sybase, and DB2, lack support for the minority languages of China. How to store, query, and index minority-language information in databases, and how to support database applications in such a multilingual environment, are important tasks. This paper proposes a DBMS multilingual support framework for minority languages, along with a multilingual application programming interface. Moreover, it proposes a sorting algorithm for Tibetan words according to the semantics of ISO/IEC 14651, leading to full support for Tibetan information processing in PostgreSQL. The framework has been implemented in the PostgreSQL database on the Redflag Linux OS.
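Table-driven collation in the spirit of ISO/IEC 14651 maps each text element to collation weights looked up in a tailorable table, and strings compare by those weights rather than by code points. The weight table below is a made-up fragment over romanized syllables, not real Tibetan collation data.

```python
# Hypothetical primary weights for a few romanized syllables.
weights = {"ka": 1, "kha": 2, "ga": 3, "nga": 4}

def collation_key(word):
    """Map a word (given as a list of syllables) to a tuple of weights."""
    return tuple(weights[s] for s in word)

def tibetan_sort(words):
    # Sorting by the key tuple realizes the table's ordering, independent
    # of the code-point order of the underlying encoding.
    return sorted(words, key=collation_key)
```

A real ISO/IEC 14651 implementation uses several weight levels (primary, secondary, ...) and a much larger table, but the sort-by-key structure is the same.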
  • JIANG Xiao-jing
    2006, 20(2): 103-106.
    There are many homophones and polyphones in Chinese personal names and geographical names, causing defects in database query technology. This paper analyzes the many-to-many relations between pronunciations and Chinese characters, points out the limitation of existing database query technologies that ignore the polyphone problem, and gives a solution for database queries involving Chinese polyphones and homophones: instead of inputting Pinyin (the Chinese phonetic alphabet), we input Chinese characters, with some changes to the query logic in database processing. This solution can be used in most database systems, for example Oracle and MS SQL Server, and works well.
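The many-to-many relation between characters and readings can be modeled by mapping each character to the set of its possible readings, and treating two names as homophonic when some reading sequence coincides. The small reading table below is illustrative, not a complete pronunciation dictionary.

```python
# Each character maps to the set of its possible Pinyin readings.
# 乐 is a polyphone (le / yue); 月 and 安 have single readings.
readings = {"乐": {"le", "yue"}, "长": {"chang", "zhang"},
            "安": {"an"}, "月": {"yue"}}

def homophonic(a, b):
    """True if names a and b can share a reading character by character."""
    if len(a) != len(b):
        return False
    # Two characters are compatible when their reading sets intersect.
    return all(readings[x] & readings[y] for x, y in zip(a, b))
```

A query layer built this way matches names by possible pronunciations even when the stored and queried characters differ, which is the behavior Pinyin-input schemes that pick one fixed reading per character cannot provide.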