Journal of Chinese Information Processing

Select

Review

Research on Rule Based Question Answering for ChineseReading Comprehension

LI Jihong,YANG Xingli, WANG Ruibo, ZHANG Na, LI Guochen

2009, 23(4): 3-10.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper constructs a set of heuristic rules for six types of question regarding to time,human, location, number, entity and description in Chinese QARC system. Each rule is further assigned with a weight optimized by the orthogonal array. Then the calculation of each candidate answer sentence is described over corresponding rules. The experiment on the CRCC v1.1 (Chinese reading comprehension corpus) built by Shanxi University produces 83.09% HumSent accuracy. Compare with the results of ME-based method, the proposed approach achieves 81.13% HumSent accuracy, which is about 1% higher than the ME-based results on the same training and testing environment.
Key wordscomputer application; Chinese information processing; reading comprehension; question answering; heuristic rules; orthogonal array

Select

Review

Zero Anaphora in Chinese—the State of Art

HUANG Xian,ZHANG Keliang

2009, 23(4): 10-16.

Abstract ( ) PDF ( )

Knowledge map

Save

Anaphora has always been a focus of linguistic researches, and the anaphora resolution is also of utmost importance to natural language processing (NLP). This paper introduces theoretical researches on the zero anaphora (ZA) in Chinese from four aspects, namely, syntax, pragmatics, discourse analysis and cognitive linguistics. The paper also summarizes how zero anaphors are used and distributed in different languages and various styles of writing. In terms of natural language processing, there are some substantial researches made on ZA in Chinese, such as the ZA-resolution models based on the Centering Theory, HNC-based analysis of ZA with its chunk-sharing model and DRT theory based efforts. The paper concludes by suggesting that NLP experts should pay more attention to theoretical researches of linguistics, while linguists engaged in this field should also orient their researches toward formalization of natural languages.
Key words computer application; Chinese information processing; zero anaphora; linguistics; natural language processing

Select

Review

An SVMTool-Based Chinese POS Tagger

WANG Lijie, CHE Wanxiang, LIU Ting

2009, 23(4): 16-22.

Abstract ( ) PDF ( )

Knowledge map

Save

The SVMTool is a simple, flexible and effective generator of sequential tagger based on Support Vector Machines, capable of dealing with a large number of linguistic features. In this paper, SVMTool is applied in Chinese POS tagging task and improves the accuracy by 2.07% compared with the baseline system on the Hidden Markov Model. To further improve the accuracy of unknown words, we introduce some features of Chinese characters and words, such as radicals of Chinese characters and reduplicate words, and probe into a theoretical analysis for their feasibility. Experiments indicate that these features can improve the accuracy of unknown words by 1.16% as well as reduce the error rate by 7.40%.
Key words computer application; Chinese information processing; part of speech tagging; SVMTool; unknown word; radicals of Chinese

Select

Review

SMS-2008: An Annotated Chinese Short Messages Corpus

MA Xu, XU Weiran, GUO Jun, HU Rile
(. Peking University Health Science Center, Beijing 0008, China;
. School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
Beijing 00876, China; . Nokia Research Center(China), Beijing 000, China)

2009, 23(4): 22-27.

Abstract ( ) PDF ( )

Knowledge map

Save

With the popularity of short messages, smart SMS tools are urgently demanded by users, operators and government departments. However, there is no open standard SMS corpus, which is an indispensable resource for the algorithm research, system development and performance test etc, due to the technology, the copyright protection, the privacy right and other various reasons. SMS-2008, as an annotated Chinese SMS Corpus, takes the lead in establishing a multi-purpose Chinese text message corpus, which includes the original corpus, privacy tagged corpus, content tagged corpus, errors tagged corpus. This Corpus can be applied in the research of SMS language, SMS classification, privacy protection algorithm or automatically correcting system.
Key words computer application; Chinese information processing; Chinese short message; tagged corpus

Select

Review

Research on Temporal Information Based Sentences Ordering in
Multi-Document Automatic Summarization

XU Yongdong, WANG Yadong, LIU Yang, WANG Wei, QUAN Guangri

2009, 23(4): 27-34.

Abstract ( ) PDF ( )

Knowledge map

Save

Sentences ordering is a key issue in the multi-documents automatic summarization, which influences the fluency and readability of the summarization. Among them, temporal information processing is the bottleneck technology which affects the quality of the ordering algorithm. Traditional ordering methods ignore this factor because the temporal information processing is very difficult, and, as a result, they could not achieve steady and high-quality ordering effects. To address this issue, this paper proposes an algorithm of Chinese text temporal information extraction, semantics computation and temporal reasoning. Then, based on the strategy of the majority ordering and the computation of sentences similarity, we propose sentences ordering algorithm based on the temporal information. The experiments show that the quality of this algorithm outperforms the calssical majority ordering algorithm and the chronological ordering algorithm.
Key words computer application; Chinese information processing; multi-documents automatic summarization; sentences ordering; Chinese temporal information processing

Select

Review

Research on Stylistic Feature of the Discourse Marker and Its Application

MENG Xiaoliang, HOU Min

2009, 23(4): 34-40.

Abstract ( ) PDF ( )

Knowledge map

Save

As a common discourse phenomenon, discourse markers have become an important subject in the discourse analysis. Due to various research perspectives, there still exist substantial differences in the perception and classification of discourse markers. From the perspective of style, this paper proposes the concept of “style degree” for the discourse marker, hypothesizing it bears certain stylistic features. The distribution of sampling discourse markers in the corpus of different styles is found with obvious distinction, and the Rocchio method based on these markers classify the text with a precision of 82.9%. It is concluded that the stylistic feature of discourse markers is a valuable in the text classification.
Key words computer application; Chinese information processing; discourse marker; stylistic feature; style degree; similarity; classification of texts

Select

Review

Characters of Query and Feedback in Chinese Search Engines

LAI Maosheng, QU Peng

2009, 23(4): 40-48.

Abstract ( ) PDF ( )

Knowledge map

Save

Query is Web user’s primary method to express his/her information need in searching. Related term provided by systems is a useful tool to refine his/her query. The paper focuses on query and related term; describes and analyzes them from user’s utilization behavior aspect. Log mining is used to give descriptive statistics on query words; qualitative categorization is then used to divide the query words into primary and auxiliary keywords. The result of qualitative analysis is compared with the result of a questionnaire survey. Important finding are as the following. Users use auxiliary keywords greatly. The content of primary keyword is relatively concentrated. Query length is short and the query syntax is simple. From both the questionnaire and the controlled experiment results, we find that users have high recognition and low utilizations on related terms. The study provides empirical results to understand user’s language utilization and also data for search engine to refine its index.
Key words computer application; Chinese information processing; Chinese Search Engines; Information Behavior; Language Utilization; Log Mining; Questionnaire Survey; Controlled Experiment

Select

Review

Multi View Text Categorization Based on Random Forests

TIAN Baoming, DAI Xinyu, CHEN Jiajun

2009, 23(4): 48-55.

Abstract ( ) PDF ( )

Knowledge map

Save

Term-based Vector Space Model (VSM) is a traditional approach to representing documents, which defects in its neglecting of the relations between terms. To capture the relations between the terms, some latent topics-based document representations such as LDA (Latent Dirichlet Allocation) have arisen much attention recently. However, simple latent topic-based text representations may cause loss of information carried by terms. In this paper, we use a modified random forests method to combine the term based and the LDA latent topic based documents representation. Random forests are constructed separately for two kinds of text representations and the final classification result is decided by vote scheme. The experimental results on some standard datasets show that, compared with methods only using one set of text features, our method can efficiently combine two kinds of text representations and improve the performance of text categorization.
Key words computer application; Chinese information processing; text categorization; VSM; latent dirichlet allocation; ensemble classification; random forests

Select

Review

High Embedding Ratio Text Steganography
by Ci-Poetry of the Song Dynasty

YU Zhenshan, HUANG Liusheng, CHEN Zhili, LI Lingjun, YANG Wei, ZHAO Xinxin

2009, 23(4): 55-63.

Abstract ( ) PDF ( )

Knowledge map

Save

Text steganography is a method of concealing secrets in texts. Different from cryptography which encrypts plain text to meaningless strings, text steganography generates innocuous stego-texts, which arouse less suspicion. However, compared with other types of multimedia documents such as image and video, text is not a well developed kind of carrier in information hiding because of its low redundancy and, consequently, the low embedding ratio achieved. A novel text steganography algorithm using Ci-poetry of the Song Dynasty is proposed in this paper, and the system composed of the encoder, the decoder, the lexicon and the tune template is realized. Secret messages are embedded into stego-Cis of the tune with proper number of lines, words, sentence patterns, rhythm and rhyme. This system reaches 16% embedding ratio while ensuring linguistic robustness. This is, to the best of our knowledge, the first text steganography algorithm making use of special type of literature.
Key wordscomputer application; Chinese information processing; information hiding; text steganography; embedding ratio; linguistic security; Ci-poetry of Song Dynasty; tune

Select

Review

WNCT: a Method for Automatic Translation of WordNet Concepts into Chinese

WANG Shi, CAO Cungen

2009, 23(4): 63-71.

Abstract ( ) PDF ( )

Knowledge map

Save

WordNet is an important English lexical semantic knowledge base. This paper presents a method for the automatic translation of the synsets in the WordNet into Chinese, named as WNCT. Firstly, WNCT uses dictionaries and term translation tools to translate the senses of English words in the WordNet into Chinese. Then WNCT regards the selection for correct sense of the words in a synset as a classification issue. The classification model is then trained by 12 features extracted according to the uniqueness of translation, the translation intersections within and between the concepts, the construction rules for Chinese phrase as well as PMI based translation relevance. Experimental results show that WNCT achieve 85.21% coverage rate and 81.37% accuracy for the Chinese translation of the synsets in WordNet 3.0.
Key words artificial intelligence; machine translation; WordNet translation; word translation; translation disambiguation; Chinese lexical knowledge base; Chinese information processing

Select

Review

Contour Extraction of Xixia Characters Based on Level Set

LIU Changqing

2009, 23(4): 71-77.

Abstract ( ) PDF ( )

Knowledge map

Save

Recently, researches on Xixia characters develop so much and a large number of Xixia documents have been published with their original forms at home and abroad. How to carry out the fast digitalization of those documents is of great importance. Based on the level set technique, we first process those documents by the smooth algorithm, and then the contours of Xixia characters are extracted by Level set. Level Set evolutionary function is descritized by the fourth-order symmetrical compact finite different scheme in spatial direction. Narrow-band algorithm and global optimization methods are adopted in computation. The experiment proves to be effective and can be applied to extracting relatively accurate contours of Xixia characters.
Key words artificial intelligence; pattern recognition; Xixia characters information processing; level set method; Xixia characters; contour extraction; compact difference

Select

Review

Factor Analysis in GMM-Based Language Identification

FU Qiang, SONG Yan, DAI Lirong

2009, 23(4): 77-82.

Abstract ( ) PDF ( )

Knowledge map

Save

In language identification system, the performance is substantially affected by the session variability including speaker variability; channel variability etc. In this paper, factor analysis is introduced to estimate the session variability subspace. According to the characteristics of the language identification task, the statistical model construction algorithm is discussed. Finally, both the model and the feature domain compensation methods are proposed. In NIST LRE 2007 30s test corpus, the experiment results show advantage of the proposed method, with a relative reduction in the equal error rate (EER) for about 36.5% compared with the baseline GMM-UBM system.
Key wordscomputer application; Chinese information processing; language identification; GMM model; factor analysis

Select

Review

Durational Characteristics and Pitch Characteristics
of the Prosodic Phrase in Mandarin Chinese

NI Chongjia, LIU Wenju, XU Bo

2009, 23(4): 82-88.

Abstract ( ) PDF ( )

Knowledge map

Save

The analysis and modeling of the information structure and the prosodic structure in a sentence or in the discourse is the key to improve the natural degree of speech synthesis and reduce the error rate of speech recognition that analyzes. Based on large speech corpus (ASCCD) with prosodic structure label, this paper presents the statistical results on the characteristics of the duration and the pitch included. The first discovery is that the prosodic border can obviously prolong the duration of syllable and different tone and accent have different effect to prolong the syllable duration. The second finding is that the break duration at prosodic border, especially at little prosodic border is more obvious. It is obvious that F₀ reset always occurs between prosodic phrases. The F₀ bottom line is always declined. The F₀ top line is declined after the accent. And at accent position, the rage of pitch is big and the top line is high.
Key wordscomputer application; Chinese information processing; major prosodic phrase (MAP); minor prosodic phrase (MIP); duration; pitch

Select

Review

Research on the Robust Speaker Identification Based
on Adaptive Frequency Warping

LI Yanping, TANG Zhenmin, ZHANG Yan, DING Hui

2009, 23(4): 88-95.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper presents a new discriminative feature based on adaptive frequency warping. Based on the discriminative analysis of the frequency components and their quantification results, this new feature is extracted by non-uniform sub-band filters designed according to the adaptive frequency warping in different frequency bands; Furthermore, in order to overcome the mismatch between training speech and testing speech under the noisy environment, we adopt pre-enhancement before the feature extraction. Through a series of controlled experiments, it is shown that the proposed feature is insensitive to the speech content and thus more discriminative and robust in comparison to the conventional Mel frequency cepstral coefficients. The experimental results demonstrate that combining pre-enhancement and proposed feature leads to noticeable improvement on speaker recognition rate and robustness.
Key wordscomputer application; Chinese information processing; speaker identification; adaptive frequency warping; discriminative feature; robustness

Select

Review

Survey of Automatic Pronunciation Error Detection

WAN Jiping, XIAO Yunpeng, YE Weiping

2009, 23(4): 95-103.

Abstract ( ) PDF ( )

Knowledge map

Save

To detect the pronunciation errors and correct them is very important for pronunciation learning. The Automatic Pronunciation Error Detection (APED) is the technique of detecting pronunciation errors in the speech stream, which is one of the main research issues in the Computer Assisted Pronunciation Training (CAPT) area. This paper reviews the literature on the techniques, introducing in detail three APED methodsthe automatic speech recognition (ASR) based APED, the pronunciation error networks based APED and the acoustic-phonetic approach. It also summarizes the applications of APED in CAPT, and the automatic pronunciation evaluation technologies for mandarin.. Finally, the paper gives some analysis and suggestions for research on automatic pronunciation error detection.
Key wordscomputer application; Chinese information processing; automatic pronunciation error detection; computer assisted language learning; computer assisted pronunciation training; pronunciation evaluation; automatic speech recognition

Select

Review

An Experimental Study on the Acoustic Features of Consonant
Clusters in Monosyllabic Uyghur Words

Hankiz Ilahun, Zulfiya Aman, Askar Hamdulla

2009, 23(4): 103-107.

Abstract ( ) PDF ( )

Knowledge map

Save

To improve the naturalness of speech synthesis, this paper investigates the acoustic features of 63 monosyllabic words with consonant cluster from the “Uyghur voice acoustic parameters database”, which is recorded by both a male and a female speaker. We focus on the rule of combination and the statistics of consonant cluster of the monosyllabic words in Uyghur language. From the language typology point of view, that monosyllabic words including consonant cluster in modern Uyghur language have a fixed acoustic feature of a shorter length while a stronger intensity in the first consonant than the second. In contrast, the combination of consonants is not fixed, because the composition of the consonant cluster remains open.
Key wordscomputer application; Chinese information processing; Uyghur Language; consonant cluster; acoustic analysis; acoustic parameters

Select

Review

Research on the Word Categories and Its Annotation
Scheme for Tibetan Corpus

CAI Rangjia

2009, 23(4): 107-113.

Abstract ( ) PDF ( )

Knowledge map

Save

For the automatic segmentation and POS tagging, this paper proposes a Tibetan word category system and a annotation scheme after a careful analysis over a large Tibetan corpus. According to the practical demands on the Tibetan corpus, the Tibetan words are first divided into several main categories according to where they are content words or function words. Then several fine granularized sub-categories are further suggested. This framework has been proved valid for the processing of a Tibetan Corpus with 10 million characters.
Key wordscomputer application; Chinese information processing; corpus;Tibetan phrases; category; mark gathering

Select

Review

Research on Tibetan Segmentation Scheme for Information Processing

TASHI Gyal, ZHU Jie

2009, 23(4): 113-118.

Abstract ( ) PDF ( )

Knowledge map

Save

Automatic word segmentation is essential to Tibetan Information Processing as well as a key technology in intelligent Tibetan information processing area. To resolve the standards for the word class and the word segmentation which is a premise for this issue, this paper firstly classifies the Tibetan words accoring to requirements of Tibetan information processing, and then provides a systemic and applicable word segmentation scheme.
Key wordscomputer application; Chinese information processing; segmentation scheme; Tibetan; information processing

Select

Review

Design and Implementation of Tibetan Code Conversion Based on ISO/IEC 10646

ZHANG Qing, HUANG Heming, ZHANG Dengyi

2009, 23(4): 118-124.

Abstract ( ) PDF ( )

Knowledge map

Save

At present, many publishing systems, such as Bei Da Fang Zheng and Hua Guang are widely applied in the printing industry for issuing Tibetan publications in the domestic minority areas. Due to the different coding system in these systems, the valuable electronic resources for Tibetan languages cannot be exchanged and shared. This paper proposes a solution to convert Tibetan code of different system into the international standard. It further realizes such conversion system for Hua Guang windows encoding of Tibetan into the ISO/IEC 10646 encoding, with a designed sub-table&group strategy in hash.
Key wordscomputer application; Chinese information processing; Tibetan; character encoding standard; code conversion;encoding sort ; query

Select

Review

Efficient Hash Algorithm for Uyhur Words in EBMT

TIAN Shengwei, Turgun Ibrahim, YU Long

2009, 23(4): 124-129.

Abstract ( ) PDF ( )

Knowledge map

Save

The efficient retrieval of the candidate translation example from the large scale translation example base is fundamental issue in the study of EBMT. This paper proposes an Uyhur t Hash function designed according to the distribution of the uyhur words and characters, which, on the equiprobable condition, facilitate an average search length of 1.59. To resovle the conflict in the Hash table, a new mechanism name second optimal tree for synonym is established as regards to the frequency of the conflicting Urhur words. The experiments show that the proposed approach achieves 27.5% and 21.8% improvement in the performance compared with the sequential chain and binary search approach respectively.
Key wordscomputer application; Chinese information processing; EBMT; hash; average search length; second optimal tree

Please choose a citation manager

Content to export

2009 Volume 23 Issue 4 Published: 17 August 2009