Journal of Chinese Information Processing

Skeleton Parsing for Specific Domain Chinese Text

QI Hao-liang,YANG Mu-yun,MENG Yao,HAN Xi-wu,ZHAO Tie-jun

2004, 18(1): 2-6,14.

This paper proposes a method of skeleton parsing for domain-specific Chinese text. The method has two key steps: shallow parsing and template matching. Templates are used to represent the sentence skeleton. We use shallow parsing, based on a cascaded hidden Markov model, to combine phrases. Skeleton parsing is then achieved by matching templates against the shallow parse tree. An experiment on sports news shows that the proposed method achieves 98.04% precision and 81.43% recall for template matching, and 96.97% precision and 84.85% recall at the sentence level.

Automatic Segmentation of Web information block

QU You-li,YU Hao,XU Guo-wei,NIsino

2004, 18(1): 7-14.

With the development of the Internet, the number of Web pages is increasing dramatically, and efficient information extraction from Web pages is becoming more and more important. Some Web pages contain multiple information units arranged in an orderly, compact layout with the same presentation style and similar HTML syntax, for example a BBS page that contains multiple posts. For information extraction, information filtering, and similar Web applications, such a page needs to be segmented into appropriate information blocks as a preprocessing step. This paper proposes a new automatic approach to segmenting a Web page into information blocks. First, we construct a structural HTML parse tree for the Web page and locate the sub-tree that contains all information blocks. Then a 2-rank PAT algorithm is applied to segment the sub-tree according to its depth and the information of the nodes under it. Our experiments on BBS pages show that this approach is fairly effective.
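
The first two steps of this abstract (building a parse tree, then locating the sub-tree that holds the repeated blocks) can be illustrated with a rough sketch. It does not implement the paper's 2-rank PAT algorithm; the heuristic of picking the element with the most same-tag direct children, and the names `Node`, `TreeBuilder`, `parse_html`, and `find_block_container`, are assumptions made for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def find_block_container(root):
    """Return (node, count) for the element with the most same-tag direct children."""
    best, best_count = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            _, count = Counter(c.tag for c in node.children).most_common(1)[0]
            if count > best_count:
                best, best_count = node, count
        stack.extend(node.children)
    return best, best_count
```

On a BBS-style page, the returned node would be the element whose children are the individual posts; a real system would still need a segmentation step such as the one the authors describe.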

Identification of Chinese Unknown Word Based on Decision Tree

QIN Wen,YUAN Chun-fa

2004, 18(1): 15-20.

Unknown words can cause segmentation errors in the automatic word segmentation of large Chinese texts, and recognizing them is one of the difficult problems in word segmentation. This article first treats unknown word recognition as a classification problem: the fragments left after segmentation are divided into two categories, “combination” (combining into an unknown word) and “segregation” (separating into two single-character words). A decision tree is then used to solve this classification problem. Six attributes are derived from the corpus and the modern Chinese morpheme database: the front-position formation probability of the former character, the back-position formation probability of the latter character, the freedom of the former character, the freedom of the latter character, mutual information, and the co-occurrence probability of single-character words. A training set is constructed with these attributes, and the decision tree is built with the C4.5 algorithm. After word segmentation some unknown words have already been recognized, but some segmentation fragments usually remain; the method is applied to these fragments. In an open test, recall is 69.42% and precision is 40.41%. The experimental results show that this decision-tree-based recognition method merits further study.
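
Of the six attributes listed, mutual information is the easiest to illustrate. The sketch below computes the pointwise mutual information of an adjacent character pair from raw counts in a string. The abstract does not give the exact estimator the authors used, so the probability estimates and the function name `pointwise_mutual_information` are assumptions.

```python
import math
from collections import Counter


def pointwise_mutual_information(corpus, first, second):
    """PMI in bits of the adjacent pair (first, second) in the string `corpus`."""
    if len(corpus) < 2:
        raise ValueError("corpus must contain at least two characters")
    char_counts = Counter(corpus)
    pair_counts = Counter(zip(corpus, corpus[1:]))
    pair = pair_counts[(first, second)]
    if pair == 0:
        # The pair never occurs adjacently, so the PMI is unbounded below.
        return float("-inf")
    p_pair = pair / (len(corpus) - 1)
    p_first = char_counts[first] / len(corpus)
    p_second = char_counts[second] / len(corpus)
    return math.log2(p_pair / (p_first * p_second))
```

A high value suggests that the two characters occur together more often than chance would predict, which is evidence for the “combination” class; in the paper this value would be one input to the decision tree rather than a decision on its own.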

Tagging of the Idiom in the Corpus

AN Na,LIU Hai-tao,HOU Min

2004, 18(1): 21-26,42.

Idiomaticity is a common phenomenon in natural languages. This paper analyses some known methods of tagging idioms in Chinese corpora. These tagging methods are problematic for the subsequent syntactic tagging and parsing of the corpus. To find a solution suitable for natural language processing applications, the authors introduce a new concept, the “fixed expression”, which covers idioms, customary usages, two-part allegorical sayings, terms, and abbreviations. Fixed expressions have the same grammatical function as common words, so we can tag them according to their function in the text and also assign them a suitable fixed-expression category. This is called the two-level tagging method. The proposed solution is useful for building a parsed corpus as a knowledge source for NLP.

A Comparative Study on Feature Selection in Chinese Text Categorization

DAI Liu-ling,HUANG He-yan,CHEN Zhao-xiong

2004, 18(1): 27-33.

This paper is a comparative study of feature selection methods in text categorization. Four methods were evaluated: document frequency (DF), information gain (IG), mutual information (MI), and the χ²-test (CHI). A Support Vector Machine (SVM) and a k-nearest neighbor (KNN) classifier were selected as the evaluating classifiers. We found that IG, MI, and CHI performed poorly in our test, although they behave well in English text categorization. We analyzed the reasons theoretically and put forward possible solutions. A further experiment proved that the combined feature selection method is effective.
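
For readers unfamiliar with the CHI criterion, the sketch below computes the usual χ² statistic for one term and one category from a 2×2 table of document counts. This is the textbook formulation, not necessarily the exact variant evaluated in the paper; the function name `chi_square` and its argument names are illustrative assumptions.

```python
def chi_square(in_cat_with, out_cat_with, in_cat_without, out_cat_without):
    """Chi-square statistic for a term/category pair from document counts.

    in_cat_with:     documents in the category that contain the term
    out_cat_with:    documents outside the category that contain the term
    in_cat_without:  documents in the category that lack the term
    out_cat_without: documents outside the category that lack the term
    """
    a, b, c, d = in_cat_with, out_cat_with, in_cat_without, out_cat_without
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A margin is empty, so the term carries no usable evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator
```

Terms are then ranked by this score per category, and the top-scoring terms are kept as features. A term distributed independently of the category scores 0.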

The Method of Creating Text Compression Model Based on Adjacency Matrix Full-text Index

TAO Xiao-peng,HU Yun-fa

2004, 18(1): 34-42.

This paper puts forward an algorithm that creates a compression model of variable-length words based on a new kind of full-text index called the adjacency matrix model. It is well known that a compression model based on words compresses Chinese documents more efficiently than one based on Chinese characters. A good word list is the key to an efficient text compression model based on variable-length words. We find such a word list by using the minimal average entropy per character together with the adjacency matrix full-text index. Because the adjacency matrix index model provides an efficient way to count occurrences of two-character words, the processing time of the algorithm increases linearly with the number of distinct characters, which makes it suitable for practical systems. We also improve the compression of Chinese text. According to the experiments, the compression ratio of our method is 0.47, an improvement of about 25% over the traditional character-based compression model.
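
The criterion of minimal average entropy per character can be made concrete with a small sketch. It treats a text as a sequence of tokens, computes the empirical entropy per token, and divides by the mean token length. The abstract does not define the measure precisely, so this formulation and the name `entropy_per_character` are assumptions.

```python
import math
from collections import Counter


def entropy_per_character(tokens):
    """Empirical entropy (bits per token) divided by mean characters per token."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = len(tokens)
    total_chars = sum(len(token) for token in tokens)
    if total_chars == 0:
        raise ValueError("tokens must contain at least one character")
    counts = Counter(tokens)
    bits_per_token = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return bits_per_token / (total_chars / total)
```

Under this measure, a word list that groups frequent character sequences into single tokens lowers the number of bits spent per character, which is the intuition behind preferring word-based over character-based models.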

Feature Selection and Weighting Scheme Based on Text Set Density

WU Ke,SHI Bing,LU Jun,NIU Xiao-fei

2004, 18(1): 43-48.

In the vector space model of information retrieval, a text is represented as a weighted vector composed of the weights of its terms. How to represent the content of a text as exactly and efficiently as possible is a fundamental issue. In this paper, we propose a feature selection method and weighting scheme based on text set density, which measures the contribution of a word to the density of the text set. By this means, we can find the smallest set of terms that represents all the valuable information of a text, and devise a more reasonable weighting scheme. The paper also presents a new criterion for judging the quality of weighting schemes: meta-scoring. Under this criterion, the approach is shown to help.

Research on Automatic Generation of Extraction Patterns

ZHENG Jia-heng,WANG Xing-yi,LI Fei

2004, 18(1): 49-55.

Most information extraction (IE) systems adopt a pattern-matching approach. As a result, generating extraction patterns has become an essential step. Because the cost of manually written patterns is very high, we propose a method that generates extraction patterns automatically by clustering. By calculating the similarity between pattern examples and applying single-link clustering, the examples are clustered into categories, each of which represents a pattern. We applied the method to Chinese agricultural texts. After clustering, the rate of wrong classification and the rate of missed classification are 0.21% and 1.07%, respectively. The patterns obtained from merging cover 24 of the 25 information types identified by manual analysis.
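
Single-link clustering cut at a fixed similarity threshold is equivalent to finding connected components among the examples whose pairwise similarity reaches the threshold. The sketch below shows that idea with a union-find structure. The Jaccard similarity over token sets, the threshold values, and the names `jaccard` and `single_link_clusters` are assumptions; the abstract does not say which similarity measure the authors used.

```python
def jaccard(left, right):
    """Jaccard similarity of two token collections."""
    left_set, right_set = set(left), set(right)
    union = left_set | right_set
    return len(left_set & right_set) / len(union) if union else 0.0


def single_link_clusters(items, similarity, threshold):
    """Group items so that any pair with similarity >= threshold shares a cluster."""
    parent = list(range(len(items)))

    def find(i):
        # Find the cluster representative, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for index, item in enumerate(items):
        clusters.setdefault(find(index), []).append(item)
    return list(clusters.values())
```

Because a single sufficiently similar pair is enough to merge two groups, single-link clustering can chain loosely related examples together, so the threshold has to be chosen with care.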

Extraction of Translation Lexicon with Multi-word Units for EBMT

CHENG Jie,DU Li-min

2004, 18(1): 56-62.

Example-based machine translation (EBMT) is a corpus-based machine translation method whose main idea is to apply analogy to translation. Research has focused on how to extract usable lexicons for computer-aided translation systems. This article discusses how to extract a translation lexicon of multi-word units by combining threshold filtering with association values. In these two methods, the choice of threshold depends too heavily on subjective estimation, and the association value cannot be calculated efficiently, so neither meets the demands of translation lexicon extraction. The algorithm proposed in this paper first extracts candidate multi-word units. By using four pairs of thresholds, it lessens the subjective influence and covers all of the multi-word units, which reduces the effect of the threshold itself. The results are then filtered three times, which further improves correctness. The algorithm also increases efficiency by incorporating the multi-word-unit translations of single words with the multi-word-unit translations of multi-word units.

A Fast Algorithm of Skew Detection and Correction on Gray Business Card Image

BU Fei-yu,LIU Chang-song,DING Xiao-qing

2004, 18(1): 63-70.

To meet the needs of business card OCR systems, this paper presents a new skew detection and correction method based on the black border of a gray business card image. First, the method determines the skew angle of the business card image from four lines fitted to the borders of the card. Then a block-move method is used to correct the image, and the black border is erased based on the position of the fitted border lines. Experiments show that this approach is fast, accurate, and effective. The algorithm can be extended and applied to other gray and color scanned images.
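
The border-fitting step can be illustrated for a single near-horizontal border: fit a least-squares line through detected border points and read the skew angle from its slope. The paper fits four border lines and the abstract does not say how they are combined, so this one-line version and the name `fit_line_angle_degrees` are assumptions.

```python
import math


def fit_line_angle_degrees(points):
    """Angle in degrees of the least-squares line y = m*x + b through (x, y) points."""
    count = len(points)
    if count < 2:
        raise ValueError("at least two points are required")
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned; fit x as a function of y instead")
    # The slope is sxy / sxx; atan2 avoids an explicit division.
    return math.degrees(math.atan2(sxy, sxx))
```

For the left and right borders, which are close to vertical, the same fit would be applied with the roles of x and y swapped.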

The Application of Segment Models in Hypothesis Testing

ZHANG Yi-yan,LIU Wen-ju,XU Bo

2004, 18(1): 71-78.

This paper introduces the application of Segment Models (SM) in hypothesis testing. Compared with HMM, SM relaxes the assumption that frame features are independent, and is therefore capable of more precise modeling. For decades researchers have used it to improve recognition accuracy, but other applications have rarely been addressed. This paper mainly investigates SM verification (e.g., the Parametric Trajectory Model, PTM) in hypothesis testing: an alternative PTM provides a confidence measure for the HMM result, which is simple but effective. On this basis, a Fisher classifier is then used to meet the specific requirement of acceptance/rejection. Here the SM N-best results are shown to be superior to those of the HMM as input features. Compared with traditional methods, the new SM verification achieves excellent performance.

Voice Conversion by GA-based RBF Neural Network

ZUO Guo-yu,LIU Wen-ju,RUAN Xiao-gang

2004, 18(1): 79-85.

Voice conversion technology makes the speech of one speaker sound as though it were uttered by another speaker, giving it a new identity while preserving the original content. This paper presents a study on voice conversion that uses a genetic algorithm (GA) to train the hidden layers of an RBF neural network, which helps to better capture the nonlinear mapping between different speakers. Both subjective and objective evaluations are conducted on the quality of the transformed speech, using six mono-vowel phones in Mandarin speech. Experimental results show that the desired performance of converted speech can be obtained when a neural network method is applied to voice conversion. The evaluations show that, compared with the K-means method, a GA-based RBF network has the ability of global optimization, with a 10% decrease in the spectral distance between the transformed speech and the target speech.

2004 Volume 18 Issue 1 Published: 16 February 2004