2006 Volume 20 Issue 6 Published: 18 December 2006
  

  • HU Guo-ping,ZHANG Wei,WANG Ren-hua
    2006, 20(6): 3-11,105.
    This paper addresses content extraction from news web pages based on a two-layer decision. The first layer predicts the scope of the content within a web page, and the second layer judges whether each paragraph within the predicted scope is content or not. We first give a strict, application-oriented definition of web page content, then analyze the characteristics of news web pages and their content. Based on this analysis, we propose a two-layer content extraction method and evaluate it on a corpus of 1,867 HTML pages collected from 10 major Chinese news sites. The experimental results show that the method predicts the content of news web pages quite well: only 18.14% of pages contain any mismatch in the extracted content, a decrease of 29.85% compared with using the second-layer prediction alone, and only 7.11% of extracted pages have more than 10% mismatch, indicating that the method is suitable for practical applications.
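    A minimal sketch of how such a two-layer decision could be organized is given below; the scope predictor and paragraph classifier are hypothetical stand-ins, not the authors' actual models.

        # Hypothetical two-layer content-extraction pipeline (illustrative only).
        def extract_content(paragraphs, scope_model, paragraph_model):
            # Layer 1: predict the index range in which the content is expected to lie.
            start, end = scope_model.predict_scope(paragraphs)
            content = []
            # Layer 2: keep only paragraphs inside the predicted scope judged to be content.
            for p in paragraphs[start:end]:
                if paragraph_model.is_content(p):
                    content.append(p)
            return content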
  • ZHANG Shun-zhong,WANG Shu-mei,HUANG He-yan,CHEN Zhao-xiong
    2006, 20(6): 12-18.
    This paper proposes a method for mining long frequent itemsets based on variable neighborhood search with a genetic algorithm (VNS-GA). Exploiting the high search efficiency of the GA, maximal frequent patterns can be mined rapidly. The fitness function simultaneously considers the support of an itemset, its length, and the distance between its support and the central support of the neighborhood. A single run of the algorithm finds the maximal frequent itemsets within the neighborhood, and by changing the neighborhood's central support we can find the maximal frequent itemsets of interest. The validity of this method has been verified experimentally: the VNS-GA algorithm is highly efficient for long pattern mining because its time complexity is independent of the support threshold.
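    A rough sketch of a fitness function in the spirit described above, combining itemset support, length, and distance from the neighborhood's central support; the weights and exact form are assumptions rather than the paper's formula.

        def fitness(itemset, transactions, central_support, w_len=1.0, w_dist=1.0):
            """Toy fitness for a candidate itemset: reward support and length,
            penalize distance from the neighborhood's central support."""
            n = len(transactions)
            support = sum(1 for t in transactions if itemset <= t) / n
            distance = abs(support - central_support)
            return support + w_len * len(itemset) - w_dist * distance

        # Example: transactions as sets of items, neighborhood centered at support 0.4.
        transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}]
        print(fitness(frozenset({"a", "b"}), transactions, central_support=0.4))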
  • HUANG De-gen,WANG Ying-ying
    2006, 20(6): 19-26.
    Chunk parsing of Chinese texts can reduce the difficulty of full syntactic parsing. This paper proposes a chunking approach that combines support vector machines with error-driven learning. First, an SVM model is used to chunk the training data. Then, through error-driven learning, tuning rules are automatically acquired from the SVM chunking results. After filtering, the rules are used to revise the SVM chunk parsing results. The experimental results show that this approach is effective for Chinese chunk parsing: compared with pure SVM-based chunking, the performance is improved.
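    The rule-based revision step could look roughly like the following, where each error-driven rule maps a local context and the SVM's tag to a corrected tag; this is a simplified illustration, not the paper's rule formalism.

        def revise_chunks(tokens, svm_tags, rules):
            """Apply error-driven tuning rules to SVM chunk tags.
            rules maps (previous_tag, token, svm_tag) -> corrected_tag."""
            revised = list(svm_tags)
            for i, (tok, tag) in enumerate(zip(tokens, svm_tags)):
                prev = revised[i - 1] if i > 0 else "<S>"
                key = (prev, tok, tag)
                if key in rules:
                    revised[i] = rules[key]
            return revised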
  • SUO Hong-guang,LIU Yu-shu,CAO Shu-ying
    2006, 20(6): 27-32.
    Keywords are very useful for information retrieval, automatic summarization, text clustering/classification, and so on. A lexical chain is a sequence of semantically related words and is primarily used in analyzing text structure. This paper proposes a lexical-chain-based keyword indexing method for Chinese texts, together with an algorithm for constructing lexical chains based on the HowNet knowledge base. In this method, lexical chains are first constructed by calculating the semantic similarity between terms, and keywords are then selected by taking term frequency and distribution area into account. The experimental results show that considering semantic relationships between terms improves the system notably: precision can be improved by 9.33 percent and recall by 7.78 percent compared with using term frequency and area alone.
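    A simplified sketch of greedy lexical-chain construction from pairwise semantic similarity; the similarity function and threshold are placeholders for the HowNet-based computation described above.

        def build_chains(terms, similarity, threshold=0.8):
            """Greedily attach each term to the first chain containing a
            sufficiently similar term; otherwise start a new chain."""
            chains = []
            for term in terms:
                for chain in chains:
                    if any(similarity(term, t) >= threshold for t in chain):
                        chain.append(term)
                        break
                else:
                    chains.append([term])
            return chains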
  • WANG Su-ge,YANG Jun-ling,ZHANG Wu
    2006, 20(6): 33-39.
    As a lexical phenomenon, collocation plays a very important role in natural language processing. In this paper, four word association measures and three word structure distribution measures are compared and analyzed, and a hybrid collocation extraction method based on mutual information and entropy is proposed. The experimental results indicate that the four association measures (mutual information, the cosine coefficient, the χ² test, and the likelihood ratio) perform similarly for collocation acquisition at high co-occurrence frequencies, and that entropy is superior to variance and spread for measuring word structure distribution. The proposed method relies on fewer measures, allows coefficient thresholds to be selected easily, and achieves the same effect as existing methods.
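    For reference, pointwise mutual information for a word pair and the entropy of a structure (position) distribution can be computed roughly as follows; this is a generic illustration of the two measures, not the paper's exact estimation procedure.

        import math

        def pmi(count_xy, count_x, count_y, n):
            """Pointwise mutual information of a co-occurring word pair."""
            p_xy = count_xy / n
            p_x, p_y = count_x / n, count_y / n
            return math.log2(p_xy / (p_x * p_y))

        def entropy(position_counts):
            """Entropy of the distribution of relative positions (structure spread)."""
            total = sum(position_counts.values())
            return -sum((c / total) * math.log2(c / total)
                        for c in position_counts.values() if c > 0)

        print(pmi(count_xy=50, count_x=200, count_y=300, n=100000))
        print(entropy({-1: 40, 1: 8, 2: 2}))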
  • HE Ting-ting,ZHU Yi,ZHANG Yong,REN Han
    2006, 20(6): 40-47.
    This paper introduces our research on computer-aided extraction of popular words and phrases. Using web pages downloaded from SINA.COM between January 1, 2005 and June 25, 2005 as the corpus, we extract the valid words and phrases, analyze the determinant attributes of each word, and define popularity. On this basis, we present a method for measuring the degree of attention a word or phrase receives, filter and sort the words according to their tendency curves, and finally obtain the candidate popular words and phrases. Through this research, we demonstrate the rationality of the definition of the determinant attributes and provide reference data for research on word characteristics.
  • LIU De-xi,HE Yan-xiang,JI Dong-hong,YANG Hua
    2006, 20(6): 48-55.
    The multi-document summarizer based on genetic-algorithm sentence extraction (SBGA) regards summarization as an optimization problem in which the optimal summary is chosen from the set of summaries formed by combining sentences from the original articles. To solve this NP-hard optimization problem, SBGA adopts a genetic algorithm, which can choose the optimal summary from a global perspective. The evaluation function employs four features corresponding to the criteria of a good summary: suitable length, high coverage, high informativeness, and low redundancy. To improve the accuracy of term frequency, SBGA employs a new method, TFS, which takes word sense into account when calculating term frequency. Experiments on the DUC04 data show that the strategy is effective: the ROUGE-1 score is only 0.55% lower than that of the best participant in DUC04.
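    The kind of evaluation function described above can be sketched as a weighted combination of the four criteria; the scoring functions and weights below are placeholders, not the paper's definitions.

        def redundancy(sentences):
            """Average pairwise word overlap between the selected sentences."""
            pairs, overlap = 0, 0.0
            for i in range(len(sentences)):
                for j in range(i + 1, len(sentences)):
                    a, b = set(sentences[i].split()), set(sentences[j].split())
                    overlap += len(a & b) / max(len(a | b), 1)
                    pairs += 1
            return overlap / pairs if pairs else 0.0

        def summary_fitness(sentences, coverage, informativeness, target_len,
                            w=(1.0, 1.0, 1.0, 1.0)):
            """Toy GA fitness: reward coverage and informativeness, prefer summaries
            near the target length, penalize redundancy."""
            length_score = 1.0 - abs(sum(len(s) for s in sentences) - target_len) / target_len
            return (w[0] * length_score + w[1] * coverage(sentences)
                    + w[2] * informativeness(sentences) - w[3] * redundancy(sentences))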
  • GENG Zeng-min,JIA Yun-de,LIU Wan-chun,ZHU Yu-wen
    2006, 20(6): 56-62,110.
    Web Document Summarization (WDS) is becoming a hot topic in text summarization due to the rapidly increasing number of documents on the Web. WDS differs from traditional text summarization in that it processes hyperlinked texts. This paper first analyzes the features of Web documents, then gives a definition of WDS, and finally presents a sentence-extraction-based WDS algorithm. Each sentence's weight is a weighted sum of its words' weights and its sentence-structure weight; the word weights are adjusted by a document class tree graph, and the structure weight considers both Web formatting and hyperlink attributes. The proportion between the word and structure weights is learned by a machine learning approach. Experiments on 1,000 Web documents show that the algorithm is feasible.
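    The sentence scoring described above amounts to something like the following, where alpha is the learned proportion between the word weights and the structure weight; the inputs stand in for the class-tree, format, and hyperlink cues in the paper.

        def sentence_weight(sentence_words, word_weight, structure_weight, alpha=0.7):
            """Weighted sum of word weights and a sentence-structure weight.
            alpha is the proportion learned by the machine-learning step."""
            word_part = sum(word_weight(w) for w in sentence_words)
            return alpha * word_part + (1.0 - alpha) * structure_weight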
  • WU Xiao-chun,HUANG Xuan-jing,WU Li-de
    2006, 20(6): 63-70.
    Authorship identification techniques are popular in various research areas. The key problems of authorship identification are extracting style marks and evaluating document similarity in terms of writing style. Traditional methods examine features that reveal the author's writing habits, such as the author's style of using words, constructing sentences, and organizing paragraphs; among these, analyzing the frequency of punctuation marks or function words is prevalent. Drawing on theoretical stylistics, this paper proposes a new similarity evaluation method based on semantic analysis using HowNet. Experimental results show that content words can also serve as style marks for discriminating among authors.
  • LONG Chong,ZHUANG Li,ZHU Xiao-yan,HUANG Kai-zhu,SUN Jun,Yoshinobu Hotta,Satoshi Naoi
    2006, 20(6): 71-76.
    OCR (Optical Character Recognition), a convenient and efficient automatic character recognition technology, is becoming more and more important in office automation, information retrieval, and digital libraries. Language models are widely used in OCR post-processing, especially for Chinese. In this paper, we focus on the post-processing of handwritten Chinese addresses and discuss the relationship between language model granularity and system performance. Both character-based and word-based language models are discussed, together with their advantages and disadvantages. Based on this analysis, the word-based language model is adopted, and a weighted word graph and its search algorithm are proposed. Experiments on 58,269 handwritten Chinese addresses show that the performance of the OCR system is greatly improved: the recognition precision increases from 28.56% to 74.15%, a 63.82% error reduction.
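    One simple way to use a word-based language model over OCR candidates is a Viterbi-style search through a weighted word graph, roughly as below; the lattice layout and scoring are illustrative assumptions, not the paper's exact formulation.

        import math

        def best_path(lattice, bigram_prob, lm_weight=1.0):
            """lattice: list of positions; each position is a list of (word, ocr_confidence > 0).
            Choose the word sequence maximizing OCR confidence plus weighted bigram LM score."""
            # best[i][w] = (score, previous_word) for choosing word w at position i
            best = [{} for _ in lattice]
            for w, conf in lattice[0]:
                best[0][w] = (math.log(conf) + lm_weight * math.log(bigram_prob("<s>", w)), None)
            for i in range(1, len(lattice)):
                for w, conf in lattice[i]:
                    score, prev = max(
                        (s + math.log(conf) + lm_weight * math.log(bigram_prob(p, w)), p)
                        for p, (s, _) in best[i - 1].items())
                    best[i][w] = (score, prev)
            # Trace back from the highest-scoring final word.
            word, (score, prev) = max(best[-1].items(), key=lambda kv: kv[1][0])
            path = [word]
            for i in range(len(lattice) - 1, 0, -1):
                path.append(best[i][path[-1]][1])
            return list(reversed(path))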
  • ZHENG Rong,ZHANG Shu-wu,XU Bo
    2006, 20(6): 77-84.
    In speaker verification, performance deteriorates significantly when the training and testing acoustic conditions are mismatched. In this paper, two compensation approaches, based on feature normalization and on score normalization respectively, are presented. First, segment-based cepstral mean and variance normalization (SCMVN) is modified to normalize the cepstral coefficients with a similar segmental Gaussian distribution, improving the match between different environmental conditions. Second, to cope with the score variability across speakers and test utterances, two-stage score normalization techniques, i.e., test-dependent zero score normalization (TZnorm) and zero-dependent test score normalization (ZTnorm), are presented to transform the output scores and make the speaker-independent decision threshold more robust under adverse conditions. Experiments on the NIST 2002 speaker recognition evaluation (SRE) corpus show that SCMVN and ZTnorm yield better performance: compared to the baseline system using CMS and zero normalization, combining both techniques yields a 20.3% relative improvement in EER and 18.1% in the minimal DCF.
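    Per-segment cepstral mean and variance normalization and zero-style score normalization can be illustrated roughly as follows; this is a generic sketch, since the paper's SCMVN assumes a segmental Gaussian distribution and the actual TZnorm/ZTnorm combination is more involved.

        import numpy as np

        def segment_cmvn(features, segment_len=300):
            """Normalize cepstral features (frames x dims) to zero mean and unit
            variance within each fixed-length segment of frames."""
            out = np.empty_like(features, dtype=float)
            for start in range(0, len(features), segment_len):
                seg = features[start:start + segment_len]
                out[start:start + segment_len] = (seg - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
            return out

        def z_norm(score, impostor_scores):
            """Normalize a verification score using impostor score statistics."""
            return (score - np.mean(impostor_scores)) / (np.std(impostor_scores) + 1e-8)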
  • HE Jue,LIU Jia
    2006, 20(6): 85-90.
    To optimize the performance of HMM-based Mandarin continuous speech recognition, a method is proposed for selecting the optimal number of states for each initial and final semi-syllable acoustic hidden Markov model. Three kinds of information, namely the duration mean, the duration variance, and the recognition correctness of each initial/final semi-syllable HMM, are combined as the criterion for selecting the optimal state number for each semi-syllable model. The resulting system improves semi-syllable recognition performance by 5.07% compared with the HMM system in which all models use the same number of states. The research demonstrates that the state number of each initial and final semi-syllable acoustic HMM should be set according to the actual data, and that recognition performance can be increased by this optimal selection.
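    One plausible way to combine the three kinds of information when picking a state number per initial/final model is sketched below; the abstract does not give the exact criterion, so the weights and form here are pure assumptions.

        def select_state_number(candidates):
            """candidates: mapping state_number -> (duration_mean, duration_var, correctness).
            Pick the state number whose statistics best balance the three cues:
            higher recognition correctness and longer average duration favor more
            states, while high duration variance argues for caution."""
            def score(stats):
                dur_mean, dur_var, correctness = stats
                return correctness + 0.1 * dur_mean - 0.05 * dur_var
            return max(candidates, key=lambda n: score(candidates[n]))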
  • WEI Si,LIU Qing-sheng,HU Yu,WANG Ren-hua
    2006, 20(6): 91-98.
    This paper develops an automatic PSC (Putonghua proficiency test) scoring system aimed at evaluating spoken Chinese efficiently. On the basis of a 100-hour standard Chinese speech database, the paper uses the characteristics of Chinese and linguists' expert knowledge to optimize the traditional speech evaluation algorithm. In addition, a corpus-adaptive method is proposed to enhance the robustness and performance of the algorithm. Experiments on a PSC test database of 500 speakers show that the new algorithm is much better than the original one. After linear mapping, the error between the machine score and the human score is almost equal to the error between human raters, i.e., 2.44. The result indicates that the automatic PSC scoring system can replace human raters in evaluating spoken Chinese under text-dependent conditions.
  • XIE Qian,RUI Jian-wu,WU Jian
    2006, 20(6): 99-105.
    The encoding framework defined by ISO 2022 has a pervasive influence on national character sets of all kinds, yet many underspecified entries in the standard hinder its accurate comprehension. In this paper, a finite state machine (FSM) is introduced to describe the features of ISO 2022 formally. For the FSM 5-tuple, the state set is thoroughly decomposed, the input set is divided into categories, the start state and the set of accepting states are given, and the scale of the transition function is analyzed. This FSM description is also applied to several coded character sets, such as ISO-2022-CN, EUC-CN, and compound text, to reveal their internal relationship with ISO 2022. This work helps in checking the consistency of ISO 2022, drafting extended standards, and evaluating the complexity of implementation. Seldom used before in this area, this formal method is a new approach to research on coded character sets.
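    A toy finite state machine over byte classes gives the flavor of this kind of description; the states and transitions below cover only a tiny fragment (ESC-led designation sequences plus SO/SI) and are illustrative, not the paper's full decomposition of ISO 2022.

        # Toy FSM fragment: recognize ESC (0x1B) followed by intermediate bytes
        # (0x20-0x2F) and a final byte (0x30-0x7E), plus SO/SI shift controls.
        TRANSITIONS = {
            ("text", "ESC"): "escape",
            ("text", "SO"): "text",            # shift out (G1); shift state kept simple here
            ("text", "SI"): "text",            # shift in (G0)
            ("text", "intermediate"): "text",
            ("text", "final"): "text",
            ("text", "other"): "text",
            ("escape", "intermediate"): "escape",
            ("escape", "final"): "text",
        }

        def classify(byte):
            if byte == 0x1B: return "ESC"
            if byte == 0x0E: return "SO"
            if byte == 0x0F: return "SI"
            if 0x20 <= byte <= 0x2F: return "intermediate"
            if 0x30 <= byte <= 0x7E: return "final"
            return "other"

        def accepts(data, start="text"):
            """Run the toy FSM; the stream is well-formed if it ends outside an escape."""
            state = start
            for b in data:
                state = TRANSITIONS.get((state, classify(b)))
                if state is None:
                    return False
            return state == "text"

        print(accepts(b"\x1b$)Ahello"))  # ESC $ ) A designates GB 2312 into G1 in ISO-2022-CN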
  • ZHU Xiao-xu,LI Pei-feng,ZHU Qiao-ming,DIAO Hong-jun
    2006, 20(6): 106-110.
    With the rapid development of handheld devices such as PDAs and smartphones, the need for various Chinese character input methods (IMs) for them is becoming more and more important in China. Because there are many types of operating systems and devices, and an IM can usually run only on a specific device, developing a new IM is time-consuming and labor-intensive. This paper introduces a multi-layer, general Chinese character input method model for handheld devices. We describe the function and characteristics of each layer in detail, explain how to create an IM according to this model, and conclude with a discussion of the model's advantages.