2007 Volume 21 Issue 1 Published: 15 February 2007
  

  • Review
    QIN Ying, WANG Xiao-jie, ZHANG Su-xiang
    2007, 21(1): 1-8.
    One of the challenges in Chinese word segmentation is the combinational ambiguity problem, which poses two main obstacles: detecting combinational ambiguities and resolving them. This paper investigates the structures of combinational ambiguities and proposes a new approach for automatically detecting this type of ambiguity. Experimental results show that the approach is effective: on the tagged corpus of the January 1998 People's Daily, containing about 1 million words, it detected more than 400 combinational ambiguities, far more than common approaches detect. The resolution of 60 combinational ambiguities is then carried out using a maximum entropy model, and the effect of six kinds of features, as well as their combinations, on disambiguation performance is further studied. The average disambiguation accuracy reaches 88.05%.
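For reference, the maximum entropy model named in the abstract is conventionally the conditional log-linear classifier below; the abstract itself does not spell out the formula.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)
```

Here $x$ is the ambiguous string with its context, $y$ a segmentation decision, $f_i$ the (six kinds of) binary features, and $\lambda_i$ their learned weights.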
  • Review
    Kang Byeong-Kwu, ZHANG Qin-long, CHEN Yi-rong, CHANG Bao-bao
    2007, 21(1): 9-16.
    This paper suggests a methodology aimed at extracting multi-word chunks for translation purposes. Our basic idea is to use a hybrid method that combines statistical measures with linguistic rules. The extraction system operates in four steps: (1) tokenization of the Chinese corpus; (2) extraction of multi-word chunks (2-gram to 10-gram) using Nagao's algorithm and the substring reduction algorithm; (3) statistical filtering that combines mutual information (or the log-likelihood ratio) with left/right entropy; (4) linguistic filtering by chunk-formation rules and a stop-word list. The hybrid method proved suitable for selecting multi-word chunks: it considerably improved extraction precision, well above that of a purely statistical method. We believe that multi-word chunks extracted in this way can effectively supplement existing translation memory databases.
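As a rough illustration of step (3), the following sketch scores candidate bigram chunks by pointwise mutual information and left/right context entropy. It is a simplification (bigrams only, names of our own choosing), not the paper's extraction system.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy of a context distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

def chunk_scores(tokens):
    """Score candidate bigram chunks by PMI and left/right context entropy."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(1, n - 2):
        pair = (tokens[i], tokens[i + 1])
        left[pair][tokens[i - 1]] += 1
        right[pair][tokens[i + 2]] += 1
    scores = {}
    for pair, f in bigrams.items():
        # Pointwise mutual information: log p(xy) / (p(x) * p(y)).
        pmi = math.log((f / n) / ((unigrams[pair[0]] / n) * (unigrams[pair[1]] / n)))
        # High entropy on both sides suggests a free-standing chunk
        # rather than a fragment of a longer fixed expression.
        scores[pair] = (pmi, entropy(left[pair]), entropy(right[pair]))
    return scores
```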
  • Review
    HUANG He-yan, ZHANG Ke-liang, ZHANG Xiao-fei
    2007, 21(1): 17-22.
    In the design and implementation of specialty machine translation systems, a crucial concern is the efficient organization of domain-specific technical terms and the intelligent selection of terminological meanings on the basis of the text being processed. This paper begins with an analysis of some problems ubiquitous in technical lexicons for specialty MT systems and a brief introduction to the features of ontology-based domain-specific conceptual systems. Some important aspects of specialty MT-oriented technical lexicons are then studied, including the design of a general-purpose specialty ontology, the description of technical terms and their mapping to the specialty ontology, and the organization and application of bilingual or multilingual MT domain-specific lexicons. Finally, the paper presents some experimental work, covering the design of a draft MT-oriented specialty classification system, the mapping from technical lexicons to the specialty classification system, and the mapping from ICS (International Classification for Standards) to the MT specialty classification system. The results of the mapping experiments show that the classification system constructed in the paper has desirable coverage of MT technical lexicons.
  • Review
    JIANG Long, ZHOU Ming, Chien Lee-feng
    2007, 21(1): 23-29.
    This paper presents a novel approach to improving named entity translation by combining transliteration with web mining. In this approach, a transliteration model is used to generate translation candidates, and web information is then applied to obtain more translations. A maximum entropy (ME) model is employed to rank the translation candidates using various features such as pronunciation similarity, contextual features, co-occurrence, etc. The experimental results show that our approach improves the precision of named entity translation by a large margin.
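A minimal sketch of the ranking step, assuming a standard log-linear (maximum entropy) scorer over the kinds of features the abstract lists; the feature functions and weights here are placeholders, not the paper's.

```python
# Placeholder feature functions and weights; a standard log-linear
# (maximum entropy) scorer over candidate translations.
def rank_candidates(entity, candidates, features, weights):
    def score(cand):
        # Weighted sum of feature values, e.g. pronunciation similarity,
        # contextual match, web co-occurrence counts.
        return sum(w * f(entity, cand) for f, w in zip(features, weights))
    return sorted(candidates, key=score, reverse=True)
```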
  • Review
    WANG Hong-jun, SHI Shui-cai, YU Shi-wen, XIAO Shi-bin
    2007, 21(1): 30-37.
    Retrieving the translations of a document is very helpful for constructing bilingual parallel corpora. This paper proposes an improved approach for this purpose, which uses a statistical translation model to match bilingual word pairs, uses the weights of word pairs as features for computing similarity, and uses a new Dice-based method to compute cross-language document similarity. The approach was evaluated by measuring how often the translation of a given document was identified among the top N similar documents. Although two noisy datasets were used in the experiment, about 90% of translations were identified within the top 5 similar documents. The experimental results show that the weights of bilingual word pairs are good features for similarity computation and that this approach can effectively find a document's translation equivalent in other languages.
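A minimal sketch of one plausible weighted, Dice-based similarity of the kind the abstract describes; the paper's exact formula is not given here, so the thresholding and weighting are assumptions.

```python
def weighted_dice(src_words, tgt_words, trans_prob, threshold=0.1):
    """Weighted Dice similarity between a source and a target document,
    using statistical translation probabilities as word-pair weights.
    A sketch only; the paper's exact Dice variant is not in the abstract."""
    matched = 0.0
    for s in src_words:
        for t in tgt_words:
            p = trans_prob.get((s, t), 0.0)
            if p >= threshold:          # count only confident word pairs
                matched += p
    # Dice form: matched mass relative to the two document lengths.
    return 2.0 * matched / (len(src_words) + len(tgt_words))
```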
  • Review
    JIANG Hong-chen, ZHENG Rong, ZHANG Shu-wu, XU Bo
    2007, 21(1): 49-53.
    This paper presents an automatic language identification (LID) system that uses shifted delta cepstra (SDC) feature vectors and a universal background model (UBM). The SDC feature is created by stacking delta cepstra computed across multiple speech frames and carries much more temporal information than the conventional MFCC feature. The UBM represents the characteristics of all the languages together, and each language model is obtained by Bayesian adaptation from this UBM. Compared with the conventional GMM method, training and testing are much faster. System performance is evaluated on the OGI corpus. The best identification accuracy over 11 languages is 73.28% for 10-s utterances, 82.62% for 30-s utterances, and 85.23% for 45-s utterances. The processing speed is about 0.03 times real time.
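For concreteness, here is a sketch of SDC computation under the usual N-d-P-k parameterization (e.g. 7-1-3-7); the paper's parameter settings are not stated in the abstract.

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: cep is a (T, N) matrix of cepstral frames;
    returns a (T', N*k) matrix of stacked delta vectors."""
    T, _ = cep.shape
    frames = []
    for t in range(d, T - d - (k - 1) * P):
        # Delta cepstra at shifted offsets t + i*P, stacked into one
        # vector, so each frame carries temporal context of ~k*P frames.
        deltas = [cep[t + i * P + d] - cep[t + i * P - d] for i in range(k)]
        frames.append(np.concatenate(deltas))
    return np.array(frames)
```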
  • Review
    DONG Hong-hui, TAO Jian-hua, XU Bo
    2007, 21(1): 54-59.
    This paper presents a linguistic constraint model and a phrase-length constraint model to describe the prosodic phrasing process, and describes each in detail. In the linguistic constraint model, the chunk is taken as an important basic unit. An HMM is used to model the phrase-length constraints, which include the distribution of prosodic phrase lengths and of the number of prosodic words per prosodic phrase. A k-candidate method is then introduced to combine the two models, making full use of both the linguistic constraints and the phrase-length constraints. Experiments show that this approach achieves good performance, with a phrasing F-score of 82.9%.
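A sketch of the k-candidate combination idea, with hypothetical model interfaces: the linguistic model proposes k candidate phrasings and the phrase-length HMM rescores them.

```python
# Hypothetical interfaces; an illustration of k-candidate combination,
# not the authors' code.
def combine(linguistic_model, length_model, sentence, k=10):
    candidates = linguistic_model.k_best_phrasings(sentence, k)
    def total_score(phrasing):
        ling = linguistic_model.score(phrasing)
        # The HMM scores the sequence of phrase lengths
        # (prosodic words per prosodic phrase).
        length = length_model.score([len(p) for p in phrasing])
        return ling + length     # log-domain combination
    return max(candidates, key=total_score)
```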
  • Review
    JIA Yan-min, WU Jian
    2007, 21(1): 60-66.
    Document processing is a key part of script handling. For typesetting multilingual text, this paper proposes a frame-based document processing model that supports multiple text layout directions. In this model, the text layout process is encapsulated in the document formatting module, so the problem of laying out text in multiple directions is reduced to the problem of laying out text horizontally from left to right. A recursive document-formatting algorithm for multi-directional text layout is also designed for this model. The different layout directions of various scripts, including Mongolian, Tibetan, and Uighur, are supported.
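A minimal sketch of the frame idea under assumed details: content is laid out left-to-right in frame-local coordinates, and a per-script direction vector maps local positions onto the page.

```python
# Assumed details, not the paper's model: each frame lays out its
# content left-to-right in local coordinates; a direction vector
# maps local advance positions onto the page.
DIRECTIONS = {
    "ltr": (1, 0),    # Latin, modern Chinese: left-to-right rows
    "rtl": (-1, 0),   # Uighur: right-to-left rows
    "ttb": (0, 1),    # traditional Mongolian: top-to-bottom columns
}

def place(frame_origin, direction, advances):
    """Map items laid out left-to-right in local coordinates to page positions."""
    dx, dy = DIRECTIONS[direction]
    x, y = frame_origin
    positions, pos = [], 0.0
    for adv in advances:
        positions.append((x + dx * pos, y + dy * pos))
        pos += adv
    return positions
```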
  • Review
    ZHOU Ke-lan, LV Qiang, ZHANG Yu-hua, PAN Ji-si, QIAN Pei-de
    2007, 21(1): 67-73.
    The national standard GB18031 plays an important role in the evaluation of digital input methods, but many difficulties arise when the standard is applied in practice, and there is no standard for the functions of digital input method software. Establishing a scientific national standard for the software functions of digital input methods has therefore become urgent. This paper first discusses how to judge ease of learning under the GB18031 specification and gives a quantitative analysis of the difficulties of applying the selection-rate criterion. It then summarizes the characteristics of digital input systems from the viewpoint of non-professional users' requirements and makes concrete suggestions on how to revise GB18031. Finally, the paper explains the necessity of establishing a national standard for the functions of digital input methods and presents a preliminary study on how to set up such a standard.
  • Review
    FANG Gui-ming
    2007, 21(1): 74-78.
    The scheme of the Chinese phonetic alphabet (pinyin) plays an important role in Chinese information processing, and Hanzi input by pinyin is one of the most popular input methods in China. The final of a Chinese syllable contains from 1 to 4 letters. A compression scheme to shorten the finals is proposed: finals originally written with 2 to 4 letters are coded with only 2 letters from the set {a, o, e, i, u, v}, which are distinct from the 20 letters used in the initials of Chinese syllables. Hanzi input for Chinese phrases can then be done with the key combination "initial, final, initial". The scheme can be used on an alphabetic keyboard and works even better on a numeric keypad. The 4 keys for pinyin tones differ from the 6 keys for the finals, so the boundary of each syllable is clear even when the finals are omitted. On a numeric keypad, each pair of fuzzy initials or finals has 3 codes, to help users who need fuzzy codes for some Chinese characters.
  • Review
    LIU Huai-jun, CHE Wan-xiang, LIU Ting
    2007, 21(1): 79-84.
    In the natural language processing field, researchers have shown growing interest in semantic role labeling with statistical and machine-learning methods, and rich features are the most important part of a semantic parsing system. In this paper, new effective features and combination features are proposed, such as the word following the constituent, the predicate-phrase type combination, the predicate class-path combination, and so on. We then report experiments on the Chinese Proposition Bank (CPB) dataset. With the new features, the final system improves the F-score from 89.76% to 91.31%, a statistically significant improvement. Finding better features is therefore very important for semantic role labeling.
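An illustrative feature-extraction sketch; the helper names and exact feature templates are assumptions based on the features the abstract lists.

```python
# Hypothetical parse/constituent interfaces; templates assumed from
# the abstract, not the paper's exact feature set.
def extract_features(constituent, predicate, parse):
    feats = {
        "phrase_type": constituent.label,
        "path": parse.path(constituent, predicate),
        "next_word": parse.word_after(constituent),  # word following the constituent
    }
    # Combination features conjoin two atomic features into one string.
    feats["pred+phrase_type"] = predicate.lemma + "|" + feats["phrase_type"]
    feats["pred_class+path"] = predicate.verb_class + "|" + feats["path"]
    return feats
```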
  • Review
    HE Zhong-jun, LIU Qun, LIN Shou-xun
    2007, 21(1): 85-89.
  • Review
    SUN Jing-guang, CAI Dong-feng, LV De-xin, DONG Yan-ju
    2007, 21(1): 90-95.
    A question answering system provides a precise and concise answer to a natural language query. Question classification is the first task of a question answering system, and its precision has a great effect on the subsequent processing. In this paper, we present a new feature extraction method that uses HowNet as a semantic resource, realized with a maximum entropy model. We choose the interrogative words, syntactic structure, question focus words, and their first sememes as classification features. The experimental results show that the first sememe in HowNet can express the main meaning of the question focus word and serves as an important feature. The method improves the precision of question classification: classification precision reaches 92.18% on coarse classes and 83.86% on fine classes.
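A sketch of the feature template described above, with hypothetical helper functions; the focus word's first sememe in HowNet stands in for its meaning.

```python
# Hypothetical helpers throughout; an illustration of the features
# named in the abstract, not the paper's implementation.
def question_features(question, hownet):
    wh = interrogative_word(question)   # e.g. "who", "where"
    focus = focus_word(question)        # head noun the question asks about
    return {
        "wh": wh,
        "syntax": syntax_pattern(question),
        "focus": focus,
        "focus_first_sememe": hownet.first_sememe(focus),  # e.g. human|人
    }
```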
  • Review
    XU Lin-hong, LIN Hong-fei, YANG Zhi-hao
    2007, 21(1): 96-100.
    In the fields of spam filtering, information security, and automatic summarization, text orientation identification is widely used. This paper presents a mechanism for text orientation identification based on semantic comprehension. First, it acquires the semantic orientation of words by computing the semantic similarity between each word and tagged seed words in HowNet, and adopts the derogatory or commendatory terms as classification features. It then uses a support vector machine classifier to identify the text orientation. Finally, it handles negative sentences by matching negation rules, and it also identifies derogatory or commendatory intensity through degree adverbs in order to improve classification accuracy.
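A minimal sketch of the word-orientation step, assuming orientation is scored as the difference in average HowNet similarity to commendatory versus derogatory seed words; the paper's exact scoring is not given in the abstract.

```python
# sim(word, seed) is an assumed HowNet-based similarity function.
def orientation(word, pos_seeds, neg_seeds, sim):
    pos = sum(sim(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(sim(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg   # > 0 commendatory, < 0 derogatory
```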
  • Review
    ZHANG Da-kun, ZHANG Wei, FENG Yuan-yong, SUN Le
    2007, 21(1): 101-108.
    The phrase-based statistical machine translation model is still the most popular model today, but it does not take non-contiguous phrases into account. A statistical machine translation model based on non-contiguous phrases is proposed in this paper. The units of translation are extended from contiguous phrases to phrases with gaps, in order to take advantage of context dependence. With fewer phrases, the efficiency of the decoder in our model is also improved. Experiments show that, with better efficiency, the translation results of our non-contiguous phrase-based model are comparable to those of the hierarchical model.
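A toy illustration of what a non-contiguous (gapped) phrase pair might look like; the paper's actual data structures are not given in the abstract.

```python
# Toy representation of a non-contiguous phrase pair with one gap;
# None marks the gap that another phrase fills during decoding.
src = ("把", None, "打开")    # "ba X da-kai"
tgt = ("turn", None, "on")    # "turn X on"
```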
  • Review
    YU Hui-jia, LIU Yi-qun, ZHANG Min, RU Li-yun, MA Shao-ping
    2007, 21(1): 109-114.
    User log analysis is important both for Web information retrieval technologies and for commercial search engine algorithms. To better understand the search behavior of Chinese Web search users, we present an analysis of a Sogou search engine query log consisting of approximately 50 million search request entries over a period of one month. The analysis covers retrieval behavior in the distribution of individual queries, user habits within a session, and the use of advanced search functions. The conclusions may help improve Web information retrieval algorithms and search performance evaluation methods.
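A minimal sketch of sessionizing such a query log, with an assumed tab-separated field layout and a conventional 30-minute session gap; neither detail comes from the paper.

```python
import csv
from datetime import datetime, timedelta

def sessions(log_path, gap=timedelta(minutes=30)):
    """Group queries by user; a gap over `gap` starts a new session.
    Assumed log layout: user_id \t ISO timestamp \t query."""
    by_user = {}
    with open(log_path, encoding="utf-8") as f:
        for user, ts, query in csv.reader(f, delimiter="\t"):
            by_user.setdefault(user, []).append((datetime.fromisoformat(ts), query))
    for user, events in by_user.items():
        events.sort()
        session = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                yield user, session
                session = []
            session.append(cur)
        yield user, session
```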
  • Review
    XU Fang, ZONG Cheng-qing, WANG Xia
    2007, 21(1): 115-119.
    This paper proposes a hybrid error-driven combination approach to chunking Chinese base noun phrases (Chinese base NPs), which combines a TBL (transformation-based learning) model and a CRF (conditional random field) model. First, we give an overview of Chinese and English base NP chunking, followed by a description of the Chinese base NP chunking task. To analyze the respective results of the two (TBL-based and CRF-based) classifiers and improve the performance of the base NP chunkers, an error-driven SVM (support vector machine) classifier is trained on the classification errors of the two classifiers. In our experiments, the hybrid method achieves the best results, with an F-measure of 89.72%, improving by 2.35% in the best case compared with other methods.
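A sketch of the error-driven combination, with assumed interfaces: the SVM arbitrates only on tokens where the TBL and CRF chunkers disagree.

```python
# Assumed interfaces; an illustration of error-driven combination,
# not the paper's implementation.
def combine_predictions(tbl_tags, crf_tags, features, svm):
    final = []
    for tbl, crf, feat in zip(tbl_tags, crf_tags, features):
        if tbl == crf:
            final.append(tbl)            # classifiers agree: keep the tag
        else:
            # SVM trained on past disagreement cases picks the label,
            # seeing both base predictions as extra features.
            final.append(svm.predict(feat + [tbl, crf]))
    return final
```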