2010 Volume 24 Issue 6 Published: 15 December 2010
  

  • Review
    WANG Meng1,2, HUANG Chu-ren2, YU Shiwen1, LI Bin3
    2010, 24(6): 3-10.
    Noun compound interpretation aims to recover the implicit semantic relation between the head and the modifier. In this paper, we present a dynamic approach that uses paraphrasing verbs to interpret the meaning of Chinese noun compounds automatically, the first such attempt in the literature. The experimental results show that this approach not only provides possible interpretations for a noun compound, but also reflects the subtle semantic differences between similar noun compounds. In addition, our research can be applied in other fields such as question answering, information retrieval, and lexicography.
    Key words: Chinese noun compounds; interpretation; paraphrase; paraphrasing verbs
  • Review
    JIANG Xin1, JIANG Yi1, FANG Miao2, WANG Rongpei1
    2010, 24(6): 10-14.
    This study proposes a new fast segmentation method for classical Chinese texts based on a tree pruning process. Firstly, word candidates of two, three, and more characters are selected with likelihood ratio statistics. Then a fast segmentation algorithm is presented and its basic flow chart is illustrated. Finally, The Classic of Tea is used to verify its validity and effectiveness. Theoretical analysis and experimental instances show that the algorithm is effective and promising for computer-aided translation of classical Chinese texts.
    Key words: segmentation; tree pruning; likelihood ratio; The Classic of Tea; computer-aided translation
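    The likelihood-ratio candidate selection mentioned in this abstract is not spelled out; the Python sketch below assumes Dunning's log-likelihood ratio over a 2x2 character co-occurrence table for two-character candidates. The threshold and the restriction to two-character pairs are illustrative, not the paper's actual settings.

        import math
        from collections import Counter

        def llr(k11, k12, k21, k22):
            # Dunning's log-likelihood ratio for a 2x2 contingency table.
            def h(*ks):
                total = sum(ks)
                return sum(k * math.log(k / total) for k in ks if k > 0)
            return 2 * (h(k11, k12, k21, k22)
                        - h(k11 + k12, k21 + k22)
                        - h(k11 + k21, k12 + k22))

        def two_char_candidates(text, threshold=10.0):
            # Score adjacent character pairs; pairs with a high LLR are word candidates.
            left, right = Counter(text[:-1]), Counter(text[1:])
            bigrams = Counter(zip(text, text[1:]))
            n = len(text) - 1
            scored = {}
            for (c1, c2), k11 in bigrams.items():
                k12 = left[c1] - k11    # c1 followed by a different character
                k21 = right[c2] - k11   # c2 preceded by a different character
                k22 = n - k11 - k12 - k21
                scored[c1 + c2] = llr(k11, k12, k21, k22)
            return {w: s for w, s in scored.items() if s >= threshold}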
  • Review
    JIAN Ping, ZONG Chengqing
    2010, 24(6): 14-23.
    A layer-based projective dependency parsing approach is presented. This novel approach works layer by layer in a bottom-up manner, where the depth of the token dependencies handled in each pass is limited to one. Within a layer the dependency graphs are searched exhaustively, while between layers the parser state is transferred deterministically. Taking the dependency layer as the parsing unit, the proposed parser has a lower computational complexity than graph-based models that search for a whole dependency graph, and it alleviates the error propagation of transition-based models to some extent. Furthermore, our parser adopts sequence labeling models to find the optimal sub-graph of a layer, which demonstrates that sequence labeling techniques are qualified for hierarchical structure analysis tasks. Experimental results indicate that the proposed approach offers desirable accuracy and an especially fast parsing speed of 2 500 words per second on the Penn Treebank.
    Key words: dependency parsing; dependency layer; sequence labeling
  • Review
    ZHANG Liang1,2, YIN Cunyan1, CHEN Jiajun1
    2010, 24(6): 23-31.
    Word similarity analysis and computation is one of the key technologies in natural language processing. It can offer substantial help to parsing, machine translation, information retrieval, etc. Recently, Chinese word similarity computation based on HowNet has become a hot research issue, though most existing methods are improvements or modifications of the approach proposed in (Liu, 2002). Based on the new HowNet (2007), with its concept frame and multi-dimensional semantic expression form, this paper proposes a new method to analyze and compute Chinese word similarity from three dimensions: the main sememe, the main sememe frame, and the concept characteristic description. This method also distinguishes semantic similarity from syntactic similarity in the computation. Experiments show that the method achieves good performance.
    Key words: semantic tree; word similarity; HowNet 2007; semantic distance
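    For background on the HowNet similarity line of work cited above, the sketch below illustrates the classic sememe-distance similarity sim(p1, p2) = alpha / (d + alpha) usually attributed to (Liu, 2002), where d is the path length between two sememes in the sememe tree. The toy tree fragment and the alpha value are assumptions for illustration; the paper's own three-dimension method is not reproduced here.

        ALPHA = 1.6  # smoothing parameter commonly used in the literature

        # child -> parent edges of a tiny, hypothetical sememe-tree fragment
        PARENT = {
            "human|人": "animate|生物",
            "animal|兽": "animate|生物",
            "animate|生物": "entity|实体",
        }

        def path_to_root(sememe):
            path = [sememe]
            while path[-1] in PARENT:
                path.append(PARENT[path[-1]])
            return path

        def sememe_distance(s1, s2):
            # Path length between two sememes via their lowest common ancestor.
            p1, p2 = path_to_root(s1), path_to_root(s2)
            for i, node in enumerate(p1):
                if node in p2:
                    return i + p2.index(node)
            return len(p1) + len(p2)  # no common ancestor in this toy fragment

        def sememe_similarity(s1, s2):
            return ALPHA / (sememe_distance(s1, s2) + ALPHA)

        print(sememe_similarity("human|人", "animal|兽"))  # 1.6 / (2 + 1.6) ≈ 0.44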
  • Review
    LIU Qinglei, GU Xiaofeng
    2010, 24(6): 31-37.
    Word (and sentence) similarity computation based on HowNet usually treats the optimal matches between the primitives or words as the basic unit, with the final result being a weighted sum of the matches. However, this approach often results in information duplication and some irrational constructions. To deal with these issues, this paper proposes to calculate the similarity of sets from statistics on the common information (commonality) and the differing information (differences) between the two sets of direct primitives. Moreover, the paper introduces this measure into the calculation of sentence similarity. The final experimental analysis shows that the proposed method is more stable and effective.
    Key words: HowNet; word similarity; sentence similarity; common information; different information
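    The commonality/difference idea in this abstract can be pictured with a minimal set-based sketch. The Jaccard-style ratio of common to total information below is an assumption, since the abstract does not give the paper's exact weighting.

        def set_similarity(primitives_a, primitives_b):
            # Similarity of two sets of direct primitives based on how much
            # information they share versus how much differs between them.
            a, b = set(primitives_a), set(primitives_b)
            common = a & b                   # shared primitives (commonality)
            diff = (a - b) | (b - a)         # primitives in only one set (differences)
            if not common and not diff:
                return 1.0                   # two empty descriptions treated as identical
            return len(common) / (len(common) + len(diff))

        # Hypothetical primitive sets of two words:
        print(set_similarity({"human|人", "occupation|职位"}, {"human|人", "study|学"}))  # 1/3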
  • Review
    LI Zhihua, REN Qiuying, GU Yan, WANG Shitong
    2010, 24(6): 37-43.
    A kernel-based nominal data classification (KNDC) method is proposed in this paper, with a new distance definition and a simple inner product computation method. Its insensitivity to outliers and its classification capability on unbalanced real-world datasets are further analyzed. The calculation of the inner product over nominal data is difficult and is often regarded as a bottleneck of SVM. KNDC possesses a lower computational complexity than SVM on nominal datasets, and its validity is discussed. Experimental results on standard datasets demonstrate that the proposed method has promising performance compared with other methods.
    Key words: kernel-based classification method; nominal dataset; dissimilarity measure; inner product calculation
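    The abstract does not give the paper's actual distance or inner-product definition. As a rough illustration of kernelizing nominal attributes, the sketch below uses the standard overlap (matching) kernel; it is not the KNDC formulation proposed in the paper.

        import numpy as np

        def overlap_kernel(x, y):
            # Fraction of nominal attributes on which the two samples agree.
            # A standard positive-definite kernel for categorical data,
            # shown here only to illustrate the general idea.
            x, y = np.asarray(x, dtype=object), np.asarray(y, dtype=object)
            return float(np.mean(x == y))

        # Hypothetical nominal samples:
        a = ["red", "smooth", "forest"]
        b = ["red", "rough",  "forest"]
        print(overlap_kernel(a, b))  # 2 of 3 attributes match -> 0.666...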
  • Review
    SHAN Bin, LI Fang
    2010, 24(6): 43-50.
    As topics evolve over time, new topics emerge and old ones decay. Much research has been devoted to detecting topic evolution automatically. Latent Dirichlet Allocation (LDA), as a recently emerged probabilistic topic model, has been widely used in topic evolution research. This paper discusses two aspects of topic evolution, i.e., topic content and topic intensity. It summarizes three methods for LDA-based topic evolution detection according to how they handle time: incorporating time into the LDA model, post-discretizing, and pre-discretizing. The three methods are also compared on several features: time granularity, on-line versus off-line operation, etc. In addition, the evaluation methods for topic evolution are introduced. Finally, the paper gives some analysis and suggestions for future research on topic evolution based on LDA.
    Key words: topic model; topic evolution; Latent Dirichlet Allocation
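    The post-discretizing strategy summarized above can be sketched as follows: fit a single LDA model on the whole collection, then measure topic intensity per time slice as the average topic proportion of the documents in that slice. The documents, time stamps, and scikit-learn pipeline below are illustrative assumptions, not any surveyed paper's setup.

        from collections import defaultdict

        import numpy as np
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["topic models evolve over time", "old topics decay", "new topics emerge over time"]
        years = [2008, 2009, 2009]  # one time stamp per document

        vec = CountVectorizer()
        X = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        theta = lda.transform(X)  # per-document topic proportions

        intensity = defaultdict(lambda: np.zeros(lda.n_components))
        counts = defaultdict(int)
        for doc_theta, year in zip(theta, years):
            intensity[year] += doc_theta
            counts[year] += 1

        for year in sorted(intensity):
            print(year, intensity[year] / counts[year])  # average topic intensity per slice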
  • Review
    CAO Jie, LV Yajuan, SU Jinsong, LIU Qun
    2010, 24(6): 50-57.
    The domain adaptation problem arises when a statistical machine translation (SMT) system is used to translate domain-specific texts. When the texts to be translated and the training data come from the same domain, the SMT system can achieve good performance; otherwise, the translation quality degrades dramatically. In general, domain-specific parallel corpora are limited, while domain-mixed parallel corpora and domain-specific monolingual corpora are easy to obtain. Based on this fact, this paper proposes a new translation model that utilizes a domain-mixed parallel corpus and a domain-specific monolingual corpus to improve domain-specific translation quality. Experiments show that the proposed method significantly improves translation performance on three IWSLT evaluation test sets.
    Key words: statistical machine translation; domain adaptation; context information
  • Review
    ZHOU Keyan, ZONG Chengqing
    2010, 24(6): 57-64.
    How to exploit semantic and pragmatic information is one of the difficulties in spoken language translation research. The dialog act, as a description of shallow discourse structure, has been utilized in several types of translation systems. In this paper, we first introduce dialog act theory and several well-known dialog-act-annotated corpora. Based on annotated corpora and automatic dialog act recognition, we propose three applications of dialog acts in phrase-based translation. By introducing dialog act classification, our approach improves the consistency between the training data and the test data, between the development set and the test set, and between the source language and the target language. Further, the translation process is more effective, and the translation result reflects the intention of the source language more accurately. The experimental results on Chinese-to-English spoken language translation show that dialog acts can make the spoken language translation system more accurate and effective.
    Key words: dialog act; spoken language translation; dialog act classification
  • Review
    GUO Haoting1,2, ZHENG Fang2, LUO Canhua2, LI Yinguo1
    2010, 24(6): 64-69.
    Speaker recognition is an important and popular user authentication method in daily life due to its convenience, low cost, and ease of acceptance. However, current algorithms cannot meet the real-time requirements of embedded applications. Based on the Non-Linear Partition (NLP) algorithms used in speech recognition, a novel algorithm is proposed and applied to embedded text-dependent speaker recognition. Compared with traditional Dynamic Time Warping (DTW) based algorithms, it achieves good practical results in terms of real-time performance.
    Key words: speaker recognition; text-dependent; embedded application; Non-Linear Partition
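    The proposed Non-Linear Partition algorithm is not detailed in the abstract. For reference, the DTW baseline it is compared against is the classic dynamic programming procedure sketched below; Euclidean frame distance is assumed.

        import numpy as np

        def dtw_distance(a, b):
            # Classic dynamic time warping distance between two feature
            # sequences a (n x d) and b (m x d).
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # Toy example with two short 2-dimensional feature sequences:
        print(dtw_distance(np.array([[0.0, 1.0], [1.0, 1.0]]),
                           np.array([[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]])))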
  • Review
    YE Na, CAI Dongfeng
    2010, 24(6): 69-75.
    There are two difficulties in query-focused multi-document summarization. First, to ensure high relevancy to the query, the summary tends to be repetitive. Second, the original query needs to be expanded to fully reflect the user's intention, but current query expansion methods usually depend on external linguistic resources. To solve these problems, this paper proposes a query-focused multi-document summarization approach in which subtopics are identified by topic analysis. When selecting sentences, both the relevancy to the query and the importance of the subtopic are considered. Then, the query is expanded according to the co-occurrence of words among subtopics, without using any external knowledge. Experimental results on the DUC 2006 corpus show that the new approach achieves higher performance than the baseline system, and the query expansion method further improves the summarization quality.
    Key words: query-focused; multi-document summarization; subtopic; relevancy; query expansion
  • Review
    LI Yanan1,2, WANG Bin1, LI Jintao1
    2010, 24(6): 75-85.
    Query suggestion, i.e., generating related queries or keywords for an initial query, has been widely utilized in search engines and sponsored search systems. As one of the necessary techniques in search engines, query suggestion draws more and more attention in the NLP and IR communities. In recent years, many studies have been conducted to validate the usefulness of query suggestion and to improve its effectiveness. This paper introduces the state of the art in query suggestion, including its history, approaches, and evaluation methods. The paper analyzes the challenges, discusses possible solutions, and suggests future work.
    Key words: computer application; Chinese information processing; survey; query suggestion; information retrieval
  • Review
    LIU Xiangtao1,2, GONG Caichun3, LIU Yue1, BAI Shuo1
    2010, 24(6): 85-92.
    In the Kad network, there are hundreds of millions of shared resources, among which a considerable part can be rated as questionable information. In order to understand the characteristics of resources, especially questionable ones, in the Kad network, the file resources of peers are measured and analyzed using the Kad-network crawler Rainbow. We find that: 1) both the popularity of files and the number of filenames corresponding to a file approximately fit a Zipf distribution; 2) the severity of questionable files can be judged more accurately using co-occurring words in the multiple filenames corresponding to the same file-content hash; 3) questionable resources occupy only 6.34% of the random samples, and 74.8% of them are video files.
    Key words: peer-to-peer network; Kad network; measurement and analysis; questionable resource
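    Finding 1) above (an approximately Zipf-distributed popularity) can be checked with a simple log-log fit, sketched below with placeholder counts rather than the actual Kad measurement data: a roughly linear log-log relation with slope near -1 indicates a Zipf distribution.

        import numpy as np

        # Hypothetical file-popularity counts, already sorted by rank (most popular first).
        popularity = np.array([980, 470, 320, 250, 190, 160, 140, 120], dtype=float)
        ranks = np.arange(1, len(popularity) + 1, dtype=float)

        # Least-squares fit of log(frequency) against log(rank).
        slope, intercept = np.polyfit(np.log(ranks), np.log(popularity), 1)
        print(f"estimated Zipf exponent: {-slope:.2f}")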
  • Review
    JIANG Peng, SONG Jihua
    2010, 24(6): 92-97.
    This paper combines DF and CHI to select features of web pages related to the field of teaching Chinese as a second language (TCSL). A classifier is first constructed based on a two-step topic similarity measurement over the title and the main text. The classifier is then applied to crawling web pages related to TCSL, and the results show substantial improvements in efficiency and recall compared with traditional methods. The classifier has already been deployed in practice for collecting data for a large TCSL corpus.
    Key words: DF; CHI; classifier; focused crawler
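    The DF + CHI feature selection described above can be sketched as follows: filter terms by document frequency, then rank the survivors by the 2x2 chi-square statistic between term presence and the class label. The thresholds and toy documents below are illustrative assumptions, not the paper's settings.

        from collections import Counter

        def chi_square(A, B, C, D):
            # 2x2 chi-square: A = positive docs with term, B = negative docs with term,
            # C = positive docs without term, D = negative docs without term.
            N = A + B + C + D
            denom = (A + C) * (B + D) * (A + B) * (C + D)
            return N * (A * D - B * C) ** 2 / denom if denom else 0.0

        def select_features(docs, labels, min_df=2, top_k=100):
            pos_total = sum(labels)
            neg_total = len(labels) - pos_total
            df, df_pos = Counter(), Counter()
            for doc, y in zip(docs, labels):
                for term in set(doc):
                    df[term] += 1
                    if y == 1:
                        df_pos[term] += 1
            scored = []
            for term, n in df.items():
                if n < min_df:          # DF filter
                    continue
                A = df_pos[term]
                B = n - A
                C = pos_total - A
                D = neg_total - B
                scored.append((chi_square(A, B, C, D), term))
            return [t for _, t in sorted(scored, reverse=True)[:top_k]]

        # Tiny hypothetical corpus: 1 = related to TCSL, 0 = not related.
        docs = [["对外", "汉语", "教学"], ["汉语", "语料库"], ["足球", "新闻"], ["足球", "比赛"]]
        labels = [1, 1, 0, 0]
        print(select_features(docs, labels, min_df=1, top_k=3))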
  • Review
    CAO Fang, WU Zhongke, AO Xuefeng, ZHOU Mingquan
    2010, 24(6): 97-103.
    Vector Chinese characters are popular for their high-quality display and output under transformations such as zooming and rotation. Therefore, the vectorization of Chinese characters is a fundamental issue in Chinese information processing. We propose a vectorization algorithm to trace the outlines of the 3 755 frequently used characters in the style of Qi Gong calligraphy. A vector character includes the representation of its strokes and their sequence, which may serve as a kind of support for the study of Qi Gong calligraphy. The paper presents the details of contour extraction, stroke extraction, and the final optimization.
    Key words: vectorization; Chinese calligraphy; Qi font; stroke
  • Review
    NIE Yanzhao, LIU Yongge
    2010, 24(6): 103-108.
    Digitization of the Carapace-bone-script (oracle bone script) requires the support of an input method. To improve on the existing input methods for the Carapace-bone-script, a stroke-based coding scheme is presented. The implementation of the corresponding input method proves its feasibility, and it may serve as a more convenient alternative for inputting the Carapace-bone-script.
    Key words: Carapace-bone-script; input method; stroke
  • Review
    HUANG Jinwen, JIN Hua, WANG Fan, CHEN Binhong, HE Yongshu, CHEN Xiaowei, LIN Qingwen, HUANG Xiaoming
    2010, 24(6): 108-114.
    In order to improve the current keyboard so that it better supports Chinese Pinyin IMEs, this paper proposes a new letter layout scheme based on statistics of letter frequencies in Chinese Pinyin (the phonetic alphabet). The new scheme is compared with the current keyboard layout from three aspects: static load, dynamic load, and the alternation rate between the left and right hands. In terms of workload, there is a linear decline from the forefinger to the middle finger, the ring finger, and the little finger, which better matches the practical efficiency of each finger. The alternation rate between the left and right hands is 0.74833, which indicates a more relaxed typing condition. These statistics validate that the new design would significantly enhance the efficiency of Chinese character input.
    Key words: Chinese information processing; keyboard layout; Chinese Pinyin; Pinyin IME
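    The left/right-hand alternation rate reported above is the share of consecutive keystroke pairs typed with different hands. A minimal sketch follows; the hand assignment is a hypothetical fragment of a standard layout, not the layout proposed in the paper, and the Pinyin sample is only for illustration.

        LEFT, RIGHT = "qwertasdfgzxcvb", "yuiophjklnm"
        HAND = {c: "L" for c in LEFT}
        HAND.update({c: "R" for c in RIGHT})

        def alternation_rate(keystrokes):
            # Fraction of adjacent keystroke pairs typed with different hands.
            pairs = [(a, b) for a, b in zip(keystrokes, keystrokes[1:])
                     if a in HAND and b in HAND]
            if not pairs:
                return 0.0
            return sum(HAND[a] != HAND[b] for a, b in pairs) / len(pairs)

        print(alternation_rate("zhongwenxinxichuli"))  # Pinyin of 中文信息处理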
  • Review
    SUN Ruina1, Gulila·Altenbek2
    2010, 24(6): 114-120.
    An automatic identification system for Kazakh base noun phrases is presented. Adopting a rule-based identification method together with manual annotation, a corpus of Kazakh base noun phrases is first constructed. Then, a combined approach using statistical information and linguistic rules is presented, which predicts base noun phrase boundaries by mutual information and corrects them with base noun phrase constitution rules. Experiments show that the precision is improved from 80.2% to 82.5% by incorporating the rules.
    Key words: corpus; base noun phrase; Kazakh; mutual information; rules
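    The mutual-information boundary prediction described above can be sketched as follows: compute pointwise mutual information for adjacent word pairs and hypothesize a base noun phrase boundary wherever the score falls below a threshold. The counts, the threshold, and the omission of the rule-based correction step are simplifying assumptions.

        import math
        from collections import Counter

        def pmi_scores(sentences):
            # Pointwise mutual information for every adjacent word pair in a corpus.
            unigrams, bigrams, n = Counter(), Counter(), 0
            for words in sentences:
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
                n += len(words)
            return {
                (w1, w2): math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
                for (w1, w2), c in bigrams.items()
            }

        def predict_boundaries(words, pmi, threshold=0.0):
            # Positions i where a boundary is predicted between words[i] and words[i+1]:
            # low association suggests the two words belong to different base NPs.
            return [i for i in range(len(words) - 1)
                    if pmi.get((words[i], words[i + 1]), float("-inf")) < threshold]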
  • Review
    FAN Daoerji, BAI Fengshan, WU Huijuan
    2010, 24(6): 120-125.
    Microsoft's operating system has fully supported traditional Mongolian input, editing, and typesetting since Windows Vista. Building on the Microsoft Mongolian input method, this paper proposes a new Mongolian input algorithm based on the unique characteristics of the language. The algorithm supports automatic deformation calculation, automatic association input, automatic learning, and resource sharing. The paper presents a theory of automatic deformation and a detailed algorithm for its computation. It also discusses the details of Mongolian dictionary data storage, and describes the automatic learning algorithms and the solution for resource sharing.
    Key words: Mongolian input method; Unicode; automatic deformation; Uniscribe
  • Review
    BAI Fengshan, FAN Daoerji, JIN Yuxin, WU Wei, ZHANG Lihong
    2010, 24(6): 125-129.
    Mongolian is the language generally used by the Mongolian people in China. Most popular word processing tools do not support Mongolian because of its distinct writing style and variant glyph shapes. Linux with QTE has become a popular choice in the field of embedded system products and applications. This paper presents a Unicode-based algorithm for displaying Mongolian dots and transforming variant shapes under QTE, and defines the QTE modules that support Mongolian. The method provides a solution for processing Mongolian in embedded systems based on Linux plus QTE.
    Key words: QTE; Linux; Mongolian; Unicode