2007 Volume 21 Issue 2 Published: 16 April 2007
  

  • Review
    WU Yun-fang, JIN Peng, GUO Tao
    2007, 21(2): 1-8.
    This paper presents a simple but effective feature-based approach to Chinese word sense disambiguation (WSD) using the distributional features available from the Grammatical Knowledge-base of Contemporary Chinese. The test data is the sense-tagged corpus of People's Daily. A Naïve Bayes (NB) classifier is also tried as a comparable statistical method. The feature-based approach achieves a precision of 90%, which is comparable to the NB classifier. The striking advantages of the feature-based approach are that 1) it is not influenced by the data size, and 2) it can disambiguate some specific words with a precision of 100%. The features appropriate for different parts of speech in Chinese WSD are also discussed. This paper demonstrates that sense features described in the lexicon are worth including in WSD.
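    As a rough illustration of the statistical baseline compared against in this paper, the sketch below implements a bag-of-context Naïve Bayes sense classifier with Laplace smoothing. The toy senses and context words are invented; the paper's actual features come from the Grammatical Knowledge-base rather than raw context words.

    ```python
    from collections import Counter, defaultdict
    import math

    class NaiveBayesWSD:
        """Naive Bayes word sense disambiguation over bag-of-context features."""

        def __init__(self, alpha=1.0):
            self.alpha = alpha                    # Laplace smoothing constant
            self.sense_counts = Counter()         # P(sense) numerators
            self.feat_counts = defaultdict(Counter)  # P(word | sense) numerators
            self.vocab = set()

        def train(self, examples):
            # examples: list of (context_words, sense) pairs
            for context, sense in examples:
                self.sense_counts[sense] += 1
                for w in context:
                    self.feat_counts[sense][w] += 1
                    self.vocab.add(w)

        def classify(self, context):
            total = sum(self.sense_counts.values())
            best, best_lp = None, float("-inf")
            for sense, n in self.sense_counts.items():
                lp = math.log(n / total)          # log prior
                denom = sum(self.feat_counts[sense].values()) + self.alpha * len(self.vocab)
                for w in context:
                    num = self.feat_counts[sense][w] + self.alpha
                    lp += math.log(num / denom)   # smoothed log likelihood
                if lp > best_lp:
                    best, best_lp = sense, lp
            return best
    ```

    Training on a handful of sense-tagged contexts and classifying a new context is then a two-call affair.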
  • Review
    GUO Yong-hui, WU Bao-min, WANG Bing-xi
    2007, 21(2): 9-13.
    Part-of-speech (POS) tagging approaches usually utilize linguistic knowledge described from a single perspective. Based on a study of four POS tagging methods, namely TBED, DT, HMM and ME, we propose a novel data fusion strategy for POS tagging: the correlation voting method. Experimental results show that the linguistic knowledge relevant to POS tagging can be described more comprehensively by applying data fusion, and that correlation voting outperforms other fusion methods, with an average decrease of 27.85% in tagging error rate.
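    The correlation voting method itself is not specified in the abstract; as a simplified stand-in, the sketch below fuses the outputs of several POS taggers by plain weighted voting, where the per-tagger weights (for example, held-out accuracies) are a hypothetical input.

    ```python
    from collections import Counter

    def vote_tags(tagger_outputs, weights=None):
        """Fuse per-token POS tags from several taggers by weighted voting.

        tagger_outputs: list of tag sequences, one per tagger, all the same length.
        weights: optional per-tagger reliability weights; defaults to equal votes.
        """
        if weights is None:
            weights = [1.0] * len(tagger_outputs)
        fused = []
        for position in zip(*tagger_outputs):   # tags proposed for one token
            score = Counter()
            for tag, w in zip(position, weights):
                score[tag] += w
            fused.append(score.most_common(1)[0][0])
        return fused
    ```

    With unequal weights, a highly reliable tagger can outvote two weak ones on any given token.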
  • Review
    LIU Fei-fan, ZHAO Jun, XU Bo
    2007, 21(2): 14-21.
    Entity recognition plays an important role in many natural language processing applications. Previous studies of entity recognition have mainly focused on Named Entity Recognition (NER) and have not considered nested named entities. This paper proposes a multi-scale nested entity mention recognition system in the context of ACE (Automatic Content Extraction), which aims to identify named, nominal and pronominal mentions of entities within unstructured texts and to assign multiple attributes to all the mentions. We separate this task into two subtasks: multi-scale nested boundary detection and multiple information recognition. First, we propose an information encoding method for nested structures that provides an effective way to recast the multi-scale nested boundary detection problem as the classical sequence labeling problem. Second, a parallel two-agent classifier is presented to conduct multiple information recognition for each entity mention. Furthermore, abundant multi-level linguistic features are integrated into our machine-learning-based framework to achieve competitive performance. We evaluate the proposed framework on the ACE standard corpus through extensive experiments, obtaining an accuracy of 71% for nested boundary detection and accuracies of 89.05% and 82.17% for the two classification agents respectively.
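    One common way to recast nested boundary detection as ordinary sequence labeling, which may or may not match the encoding used in this paper, is to give each nesting depth its own BIO layer:

    ```python
    def layered_bio(tokens, mentions):
        """Encode possibly nested mention spans as one BIO tag sequence per layer.

        mentions: list of (start, end) spans over tokens, end exclusive; inner
        spans may be nested inside outer ones. Returns a list of per-layer tag
        sequences, so multi-scale boundary detection becomes several runs of a
        flat sequence labeler.
        """
        # Place outer spans first: sort by start, then by decreasing length.
        spans = sorted(mentions, key=lambda s: (s[0], -(s[1] - s[0])))
        layers = []
        for start, end in spans:
            # Reuse the first layer where this span does not collide.
            for tags in layers:
                if all(t == "O" for t in tags[start:end]):
                    break
            else:
                tags = ["O"] * len(tokens)
                layers.append(tags)
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return layers
    ```

    A mention nested inside a longer one simply lands on the next layer down, and decoding is the reverse mapping.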
  • Review
    MAO Ting-ting, LI Li-shuang, HUANG De-gen
    2007, 21(2): 22-28.
    This paper describes a hybrid model, and the corresponding algorithm, combining support vector machines (SVM) with statistical methods to improve the performance of SVM for the task of Chinese person name recognition. In this algorithm, a training set is obtained by extracting the attributes of character-based feature vectors, and the SVM model for automatic identification of Chinese person names is set up by choosing a proper kernel function. A threshold on the distance from the test sample to the SVM hyperplane in feature space is then used to separate the SVM region from the statistical-method region: if the distance is greater than the given threshold, the test sample is classified using the SVM; otherwise, the statistical model is used. The experimental results show that the recall, precision and F-measure for recognition of Chinese person names based on the hybrid model reach 91.96%, 94.62% and 93.27% respectively in an open test. Compared with the SVM alone, the F-measure increases by 1.51%. By integrating the advantages of the two methods, the performance is clearly improved.
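    The routing rule described above can be sketched in a few lines; `svm_decision` and `stat_classify` below are hypothetical stand-ins for the trained SVM's signed decision function and the statistical back-off model, and the label strings are invented.

    ```python
    def hybrid_classify(x, svm_decision, stat_classify, threshold):
        """Route a sample to the SVM or to a statistical back-off model.

        svm_decision(x) returns the signed distance to the SVM hyperplane.
        Samples far from the hyperplane (the confident region) keep the SVM
        label; samples inside the margin band fall back to the statistical
        model, which is assumed to handle borderline cases better.
        """
        d = svm_decision(x)
        if abs(d) > threshold:
            return "name" if d > 0 else "not-name"
        return stat_classify(x)
    ```

    The threshold trades off how often the (cheaper but coarser) statistical model is consulted against SVM confidence.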
  • Review
    MAO Qi, LIAN Le-xin, ZHOU Wen-cui, YUAN Chun-feng
    2007, 21(2): 29-34.
    So far, most syntactic parsers neglect punctuation marks or oversimplify their functions, even though punctuation actually carries very important syntactic information. According to the features of punctuation in syntactic structure, this paper proposes a new concept, the separately parsed phrase, and, according to the typical character and position of punctuation in a sentence, presents a way to identify such phrases based on the ID3 decision tree algorithm. In this way, punctuation is integrated into syntactic analysis. All the experimental data sets, including the training data and test data, are derived from the Chinese Penn Treebank 5.0, and the experiments use only sentences longer than 40 Chinese words. The results indicate that accuracy and recall are improved by 1.59% and 0.93% respectively, and that the time cost is reduced by nearly 66.6%, showing that punctuation is quite useful and effective for parsing long Chinese sentences.
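    The ID3 algorithm mentioned above chooses decision tree splits by information gain. A minimal sketch of that criterion, with invented punctuation features and labels standing in for the paper's actual feature set:

    ```python
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (bits) of a label multiset."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, feature):
        """ID3 split criterion: entropy reduction from splitting on `feature`.

        examples: list of (feature_dict, label) pairs.
        """
        labels = [y for _, y in examples]
        base = entropy(labels)
        by_value = {}
        for feats, y in examples:
            by_value.setdefault(feats[feature], []).append(y)
        remainder = sum(len(ys) / len(examples) * entropy(ys)
                        for ys in by_value.values())
        return base - remainder
    ```

    ID3 greedily picks the feature with the highest gain at each node, so a punctuation feature that cleanly separates phrase boundaries from non-boundaries rises to the top of the tree.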
  • Review
    WANG Can-hui, ZHANG Min, MA Shao-ping
    2007, 21(2): 35-45.
    Natural language processing (NLP) has been applied to information retrieval (IR) by researchers in the hope of improving retrieval effectiveness, but most of the results are the opposite of what was hypothesized. In most cases, NLP did not increase IR precision and even had a negative effect. Even where NLP helped IR under some circumstances, the improvements were much smaller than the processing cost NLP required. Researchers have analyzed these phenomena and come to the following conclusions: IR-related tasks that require accurate results, such as question answering (QA) and information extraction (IE), are better suited to the use of NLP, and NLP needs to be optimized for IR in order to be effective. Recent research, e.g. adding NLP factors to language models, has largely confirmed these conclusions.
  • Review
    LIU Jia-bin, HU Guo-ping, CHEN Chao, SHAO Zheng-rong
    2007, 21(2): 46-51.
    A question answering system can give users a precise answer to a question presented in natural language, and a major factor influencing the system's performance is the scale of its question-answer pairs. In order to enlarge the scale of question-answer pairs and make full use of Web page resources, this paper proposes a method that uses a decision tree and a Markov model to extract question-answer pairs from Web pages. The method uses a DOM tree to represent a Web page according to its HTML tags, acquires feature values from every node of the DOM tree, and then passes the features through a classification model, built from the decision tree and the Markov model, to obtain each node's final classification. Experimental results show that the precision reaches 90.40% and the recall reaches 86.03%, and also that this model can extract information from all kinds of Web pages.
  • Review
    SU Jin-song, ZHOU Chang-le, LI Yi-hong
    2007, 21(2): 52-57.
    The annotated corpus of Song Dynasty poetry is the foundation of computer-based study of Song Dynasty poetry. In this paper, we propose a new definition of "word" in Song poetry and a new method for building the annotated corpus. Two available methods, statistical word extraction and segmentation based on rules and forms, are taken into consideration. The former is adopted to extract closely combined two-character words and to establish word lists in combination with related resources; the latter, combined with the word lists, is used to segment Song Dynasty poetry. The experimental results show that the method applied in this paper is effective.
  • Review
    ZHAO Shi-qi, LIU Ting, LI Sheng
    2007, 21(2): 58-62.
    Few of the existing document clustering methods can detect or describe document topics properly, which makes it difficult to conduct clustering based on topics. In this paper, we introduce a novel topical document clustering method called Linguistic Features Indexing Clustering (LFIC), which can identify topics accurately and cluster documents according to these topics. In LFIC, “topic elements” are defined and extracted for indexing base clusters. Additionally, linguistic features are exploited. Experimental results show that LFIC can gain a higher precision (94.66%) than some widely used traditional clustering methods.
  • Review
    HUANG Yong-guang, LIU Ting, CHE Wan-xiang, HU Xiao-guang
    2007, 21(2): 63-68.
    This paper focuses on short texts such as those occurring in mobile short messages and chat rooms. Because of their irregular style and high mutual similarity, we call them abnormal texts. We propose an efficient clustering algorithm based on a duplicated-information deletion algorithm. It takes account of the features of abnormal short texts and applies special techniques, such as feature code extraction and code compression, to solve this problem. Experiments show that a clustering system based on this algorithm can process millions of abnormal short texts per hour with high accuracy.
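    The abstract mentions feature code extraction and code compression without detail, so the following is a speculative sketch of the general idea only: compress each short text to a compact signature and bucket texts by signature, so near-duplicates cluster in a single linear pass.

    ```python
    from collections import Counter, defaultdict

    def feature_code(text, k=4):
        """Compress a short text to a small signature: its k most frequent
        characters (ties broken alphabetically), ignoring whitespace."""
        counts = Counter(c for c in text if not c.isspace())
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        return "".join(c for c, _ in top)

    def cluster_by_code(texts, k=4):
        """One-pass bucketing of texts by their feature code."""
        buckets = defaultdict(list)
        for t in texts:
            buckets[feature_code(t, k)].append(t)
        return dict(buckets)
    ```

    Because each text is hashed once, throughput scales linearly with the number of messages, which is consistent with the millions-per-hour figure claimed above.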
  • Review
    WU You-zheng, ZHAO Jun, XU Bo
    2007, 21(2): 69-76.
    This paper presents an unsupervised learning algorithm that learns answer patterns for the answer extraction module of Chinese question answering (QA). Given two or more questions of one question type, the algorithm can learn the corresponding answer patterns from the Internet via Web search, topic segmentation, pattern extraction, vertical clustering, horizontal clustering, etc. The experimental results show that the performance of pattern-based answer extraction for Chinese QA is improved significantly.
  • Review
    ZHOU Jun-sheng, HUANG Shu-jian, CHEN Jia-jun, QU Wei-guang
    2007, 21(2): 77-82.
    Coreference resolution plays an important role in natural language processing. Facing the fact that Chinese training corpora for coreference resolution are heavily lacking, this paper presents a new unsupervised clustering algorithm for noun phrase coreference resolution. In this approach, the coreference resolution problem is first converted into a graph clustering problem, and an objective function, the modularity function, which allows automatic selection of the number of clusters, is then chosen for graph clustering. The proposed algorithm does not make pairwise coreference decisions independently of each other. The experimental results on the Chinese ACE training corpus demonstrate that the proposed method is a feasible unsupervised algorithm for noun phrase coreference resolution.
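    The modularity function used as the clustering objective can be computed as below (Newman's Q for an undirected graph). How the graph is built from noun phrase mentions is the paper's contribution and is not shown here; the toy graph in the usage is invented.

    ```python
    def modularity(adj, communities):
        """Newman modularity Q of a partition of an undirected graph.

        adj: dict node -> set of neighbours (symmetric).
        communities: list of disjoint node sets covering the graph.
        Q = (1/2m) * sum over same-community pairs of (A_ij - k_i*k_j/2m),
        where m is the edge count and k_i the degree of node i.
        """
        two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2 * edge count
        q = 0.0
        for comm in communities:
            for i in comm:
                for j in comm:
                    a_ij = 1.0 if j in adj[i] else 0.0
                    q += a_ij - len(adj[i]) * len(adj[j]) / two_m
        return q / two_m
    ```

    Because Q peaks at a particular partition rather than growing with the number of clusters, maximizing it selects the cluster count automatically, which is the property the abstract relies on.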
  • Review
    CHANG E, HOU Han-qing, CAO Ling
    2007, 21(2): 83-88.
    Automatic version comparison and analysis of ancient books means that the differences among different versions of an ancient book are automatically found and marked by computer, assisting the textual critic through supporting tools. This article first expounds the significance of automatic version comparison and analysis of ancient books. It then describes the comparison and analysis system in detail, including subject selection, data collection, objectives, the comparison and analysis algorithm, and the overall system plan. Finally, we discuss in depth how to develop the supporting tools, including lists of ancient official titles, personal names, place names, etc. The experimental results show a recall of 92.3% and a precision of 95.2%.
  • Review
    CHEN Guang, ZHANG Hong-gang, GUO Jun
    2007, 21(2): 89-93.
    A new feature extraction method that improves the performance of a handwritten Chinese character recognition system is presented. By using enhanced weighted dynamic meshes based on nonlinear normalization, this method not only avoids the zigzags and other undesirable side effects introduced by the original nonlinear normalization method of Yamada et al., but also avoids the additional feature normalization step in the original dynamic mesh methods of Lian-Wen Jin et al. and WU Tian-lei et al. Experiments on HCL2000 show that our method achieves superior performance.
  • Review
    CAI Lian-hong, CUI Dan-dan, CAI Rui
    2007, 21(2): 94-99.
    This paper describes our work on building and analyzing a corpus for Mandarin text-to-speech (TTS) systems, named TH-CoSS. The text script consists of four parts: sentences for TTS system building, sentences for TTS system evaluation, special syllable groups, and sentences with special sentence types to convey special intonation. The finished corpus contains about 20K sentences read by one female and one male speaker. The annotation files are in XML format, including segmental and prosodic tags, and software tools have been developed as well. On the basis of the syllables in TH-CoSS, an analysis of the influence of context features on the prosody of speech is carried out.
  • Review
    JIANG Wei, GUAN Yi , WANG Xiao-long, LIU Bing-quan
    2007, 21(2): 100-105.
    In order to overcome the difficulty of fusing more features into an n-gram model, a Pinyin-to-character conversion model based on support vector machines (SVM) is proposed in this paper, providing the ability to integrate more statistical information. The excellent generalization performance of SVM effectively overcomes the overfitting problem of the traditional model, and the soft-margin strategy overcomes, to some extent, the noise problem in the corpus. Furthermore, rough set theory is applied to extract complicated long-distance features, which are fused into the SVM model as a new kind of feature, solving the difficulty traditional models have in incorporating long-distance dependencies. The experimental results show that this SVM Pinyin-to-character conversion model achieves 1.2% higher precision than a trigram model with absolute smoothing; moreover, the SVM model with long-distance features achieves 1.6% higher accuracy.
  • Review
    ZHANG Shi-lei, ZHANG Shu-wu, XU Bo
    2007, 21(2): 106-111.
    We propose a two-level unsupervised method for audio segmentation that effectively detects acoustic changes of speaker, environment and channel in a continuous audio stream. In our approach, the change detection process is divided into two levels: a region level that detects potential change regions containing candidate acoustic change points, and a boundary level that searches for and refines the true change points. At the region level, we employ a modified Generalized Likelihood Ratio metric to search for potential change regions in continuous local windows without setting any threshold. At the boundary level, we apply the T2 statistic and the Bayesian Information Criterion (BIC) algorithm to detect segment boundaries within the potential windows. Experimental results on the 1997 Broadcast News Hub4-NE Mandarin corpus show that the proposed scheme obtains a nearly 10.5% increase in recall.
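    A minimal sketch of the BIC boundary test on a one-dimensional feature stream, assuming scalar Gaussian segment models; the paper works on multi-dimensional acoustic features and adds the T2 pre-filter, both omitted here, and the penalty weight is the usual tunable λ.

    ```python
    import math
    from statistics import pvariance

    def delta_bic(samples, cut, penalty_weight=1.0):
        """Delta-BIC for one candidate change point in a 1-D Gaussian stream.

        Positive values favour modelling the two sides with separate Gaussians,
        i.e. an acoustic change at index `cut`.
        """
        n = len(samples)
        left, right = samples[:cut], samples[cut:]
        v, v1, v2 = pvariance(samples), pvariance(left), pvariance(right)
        # Penalty: the change model has 2 extra free parameters (mean, variance).
        penalty = penalty_weight * 0.5 * 2 * math.log(n)
        return 0.5 * (n * math.log(v)
                      - len(left) * math.log(v1)
                      - len(right) * math.log(v2)) - penalty

    def best_change_point(samples, min_seg=2):
        """Scan all cuts (keeping min_seg samples per side) for the BIC peak."""
        cuts = range(min_seg, len(samples) - min_seg + 1)
        return max(cuts, key=lambda c: delta_bic(samples, c))
    ```

    On a stream whose mean jumps partway through, the pooled variance is large while the per-side variances are small, so delta-BIC peaks sharply at the true boundary.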
  • Review
    LU You-fei, ZHANG Wei, ZHANG Yan, MIAO Cheng, LI Chun
    2007, 21(2): 112-116.
    Designing an office suite supporting the Uighur, Chinese and English languages is of great value to the development of information technology in minority areas. In view of this, the article first introduces the characteristics of Uighur, then analyzes and implements the essential techniques in the design of the Uighur version of Evermore Integrated Office, including automatic glyph selection, syllable-based line breaking and automatic lengthening. After these techniques were applied in the Uighur version of the office suite, testing showed that Uighur typesetting became extremely neat. These techniques also provide general guidance for Uighur script processing and for the development of other Uighur-language software.
  • Review
    MENG Fan-qiang, WU Jian, JIA Yan-min
    2007, 21(2): 117-121.
    Mongolian belongs to the complex scripts, and so far it cannot be rendered correctly in operating systems and office suites. OpenOffice.org is a multi-platform office suite that runs on Linux and Windows and invokes the ICU LayoutEngine and Uniscribe to process complex scripts on Linux and Windows respectively. In this paper, building on OpenOffice.org's support for Mongolian on Linux, we analyze the complex-script processing of OpenOffice.org on Linux and Windows, and implement Mongolian rendering in OpenOffice.org by integrating the ICU LayoutEngine with Uniscribe.
  • Review
    ZHANG Xing-liang, RUI Jian-wu, XIE Qian, CHENG Wei, WU Jian
    2007, 21(2): 122-128.
    There has been a lack of coded character standards regulating the commonly used BrdaRten characters in software development. The newly released national standards, "Information technology—Tibetan coded character set—Extension A" and "Information technology—Tibetan coded character set—Extension B", are of great importance to the standardization and globalization of software development in China. In this paper, the encoding methods adopted by the Tibetan character set defined in ISO/IEC 10646 and by the Tibetan coded character set Extensions A and B are compared, and critical implementation problems are analyzed. Finally, considering the special characteristics of Tibetan coded character set Extension B, a reasonable solution based on the Linux I18N architecture is proposed to fully support the newly released national standards.