2008 Volume 22 Issue 2 Published: 15 April 2008
  

  • Review
    LI De-yi, XIAO Li-ping,
    2008, 22(2): 3-9.
    Pattern recognition, knowledge engineering, and robotics have made significant progress in the 50-year history of artificial intelligence; however, AI is still far from human intelligence. To meet the requirements of data mining, machine learning, and knowledge discovery, this paper discusses in detail three important directions for AI research and development in the network age: cognitive physics, intelligence with uncertainty, and networked intelligence.
  • Review
    MI Hai-tao, XIONG De-yi, LIU Qun
    2008, 22(2): 10-17.
    External resources can be used effectively to improve parsing accuracy. In this paper, we introduce an external Chinese lexical analysis system into parsing and propose a general transformation method to integrate them. Transformation-Based Error-Driven Learning and Conditional Random Fields are used to solve the problem of transformation between two different standards of segmentation and POS tagging. We also propose a parsing model that effectively combines the head-driven parsing model and the structural context parsing model. Experimental results show that our new integrated parsing model achieves an F1 score of 82.5% on the Penn Chinese Treebank Version 1.0, higher than state-of-the-art parsers.
  • Review
    YU Zhong-hua, CHEN Rong, HU Jun-feng, CHEN Yuan
    2008, 22(2): 18-23.
    Information extraction from biomedical literature is very useful for exploiting the achievements of the biomedical field and promoting further progress in biology and medicine. Aiming at biomedical abbreviation analysis and understanding, this paper proposes an approach for disambiguating biomedical abbreviations based on K-nearest neighbors (K-NN) with weighted voting. In the approach, labeled samples are generated automatically based on the hypothesis of "one sense per discourse", and the words describing the topic of a discourse are chosen as the features for abbreviation disambiguation. The classification model used in the approach is K-NN with weighted voting. Experimental results on a testing set containing 177 762 Medline abstracts show that the proposed approach obtains higher precision than others in related work. The experiments also prove that, compared with traditional K-NN, K-NN with weighted voting achieves not only higher precision but also better stability in the abbreviation disambiguation task.
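The weighted-voting variant of K-NN described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparse bag-of-words vectors and the toy long-form labels are invented for the example, and the similarity measure (cosine) is an assumption.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_weighted_vote(query, labeled, k=3):
    """Classify `query` by letting each of the k nearest neighbors cast a vote
    weighted by its similarity to the query, instead of one equal vote
    as in traditional K-NN."""
    neighbors = sorted(labeled, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += cosine(query, vec)
    return max(votes, key=votes.get)
```

The similarity-weighted vote makes the decision less sensitive to an unlucky choice of k, which is one plausible source of the stability gain the abstract reports.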
  • Review
    PANG Ning, YANG Er-hong
    2008, 22(2): 24-27,54.
    Coreference is a common phenomenon in news reports about paroxysmal events, and coreference resolution is essential for information extraction. In this paper, we present an approach to coreference resolution in Chinese news reports about paroxysmal events based on the maximum entropy model. With this approach, we can extract the pronouns, nouns, and noun phrases that refer to the same entity in a news report. The training corpus contains 200 000 Chinese characters and the testing corpus 100 000. Eight kinds of features are chosen for the maximum entropy model according to the characteristics of the problem. Experimental results show that the approach achieves an F-measure of 64.6%.
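The pairwise setup behind such a system can be sketched as follows: each (anaphor, candidate antecedent) mention pair becomes a feature vector that a maximum entropy classifier would then score. The four features below are illustrative stand-ins, not the paper's eight, and the mention representation is invented for the example.

```python
def pair_features(ana, cand):
    """Features for one (anaphor, candidate antecedent) mention pair.
    `ana` and `cand` are dicts with keys: text, pos, sent_idx."""
    return {
        "string_match": ana["text"] == cand["text"],   # exact surface match
        "is_pronoun": ana["pos"] == "PN",              # anaphor is a pronoun
        "same_sentence": ana["sent_idx"] == cand["sent_idx"],
        "sent_distance": ana["sent_idx"] - cand["sent_idx"],
    }

def candidate_pairs(mentions):
    """Pair every mention with each earlier mention; a trained classifier
    would decide which pairs are coreferent."""
    for i, ana in enumerate(mentions):
        for cand in mentions[:i]:
            yield ana, cand, pair_features(ana, cand)
```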
  • Review
    XIONG De-yi, LIU Qun, LIN Shou-xun
    2008, 22(2): 28-39.
    This paper presents an overview of recent syntax-based statistical machine translation (SMT). According to the grammars on which the translation models are based, we classify syntax-based SMT into two categories: formally syntax-based SMT and linguistically syntax-based SMT. For each category, we discuss the representative work, including model design, training, and decoding. We also compare the different models. Finally, we point out the problems in designing syntax models and give a prediction of the future development of syntax-based SMT.
  • Review
    HUANG Jin, LV Ya-juan, LIU Qun,
    2008, 22(2): 40-46.
    Parallel corpora are an indispensable resource for translation model training in statistical machine translation (SMT) systems. Instead of collecting more and more parallel training corpora, this paper aims to improve the performance of an SMT system by exploiting the full potential of the existing parallel corpora. We propose an approach to select and optimize the training corpus using information retrieval methods. First, sentences similar to the test text are selected to form a small, adapted training set. This yields comparable or even better performance with only a subset of the total data and lower hardware requirements. Second, we add the selected subset to the entire corpus to optimize the data distribution and obtain a better result. Experiments show that this method can effectively improve the performance of the SMT system.
  • Review
    BAI Shun
    2008, 22(2): 47-54.
    This paper describes the implementation of a Japanese-Mongolian machine translation system for verb phrases. In Japanese derivational grammar there is no concept of conjugation; a word is analyzed into stems and suffixes. After translating the Japanese stems and suffixes into Mongolian stems and suffixes, Mongolian phonetic rules are applied to generate the verbal phrases. We tested the system on 403 verb phrases from 30 Japanese reports and achieved an accuracy of 95.78%.
  • Review
    ZHANG Jian-feng, ZHANG Qi, WU Li-de, HUANG Xuan-jing
    2008, 22(2): 55-59,86.
    This paper presents a novel method to extract the subjective relationship between opinion-bearing terms and opinion targets. The method extracts pairs of opinion-bearing terms and opinion targets as a candidate set, and then employs the maximum entropy model to combine lexical, part-of-speech, semantic, and positional features derived from the text. Our method incorporates relation extraction into opinion mining and, to some extent, solves the problems of coreference and omission of opinion targets. Experiments show that the F-measure of our method is 15% higher than that of the baseline, which takes the nearest opinion target as the real target. The experiments also found that intensifiers can improve the performance of subjective relation extraction.
  • Review
    ZHU Lei, JIANG Jie, ZHENG Rong, XU Bo,
    2008, 22(2): 60-63.
    Speaker retrieval has recently emerged as an important task due to the rapidly growing volume of audio archives. This paper presents a novel approach to accelerate speaker recognition. The approach combines the state-of-the-art speaker recognition system (the GMM-UBM system) with indexing and simulation, and it can greatly accelerate the search with little reduction in accuracy. Specifically, a two-pass search strategy is proposed: first, we compute the Euclidean distance between two indexes to find candidates, and then we use simulation to find the best target. Experimental results show that our approach effectively improves the speed of the process with little degradation in accuracy.
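The two-pass strategy can be illustrated with a short sketch: a cheap Euclidean distance over fixed-length index vectors prunes the archive to a few candidates, and only those candidates are ranked with the expensive full score. The index vectors and the `fine_score` callback below are invented placeholders; in the paper the index and the second-pass "simulation" are both derived from the GMM-UBM models.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length index vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_pass_search(query_index, archive, fine_score, n_candidates=2):
    """First pass: keep the n_candidates archive entries whose index vectors
    are closest to the query. Second pass: rank only those candidates with
    the expensive fine_score and return the best speaker id.
    `archive` is a list of (speaker_id, index_vector) tuples."""
    candidates = sorted(archive, key=lambda e: euclidean(query_index, e[1]))[:n_candidates]
    return max(candidates, key=lambda e: fine_score(e[0]))[0]
```

Because the costly scorer runs on `n_candidates` entries instead of the whole archive, the speed-up grows with archive size, at the price of missing a target the coarse index filters out.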
  • Review
    CHEN Yao-dong, WANG Ting, CHEN Huo-wang
    2008, 22(2): 70-75.
    Semantic analysis is one of the fundamental and key problems in the research of content-based text mining. Most supervised machine learning methods perform poorly when trained on limited tagged data. This paper investigates a novel semi-supervised learning algorithm, the Transductive Support Vector Machine (TSVM), for shallow semantic parsing. An optimization strategy for selecting training instances, based on active learning, is integrated with the TSVM. Experimental results show that the method integrating the TSVM with this optimization strategy outperforms supervised methods for shallow semantic parsing on small tagged data sets.
  • Review
    WANG Yong, LIU Yi-qun, ZHANG Min, MA Shao-ping, RU Li-yun
    2008, 22(2): 76-80.
    The activeness of a web page varies during its lifetime. Some pages are valuable only in a specific period and then become obsolete. Analyzing web page lifetime from the users' perspective is important for enhancing the performance of web crawlers and search engines, and for improving the efficiency of web advertising. With page view data collected by a proxy server, we were able to perform large-scale analysis of web page lifetime. A model is given to describe user interest evolution, based on an experiment conducted with the page view data of more than 36 000 000 web pages over two months. The model is a foundation for better understanding how the web is organized and operates.
  • Review
    FU Yu-peng, ZHANG Min, MA Shao-ping
    2008, 22(2): 81-86.
    Enterprise search has become increasingly important in research as information technology develops. Discussion search in enterprise email collections is a frequently faced problem: enterprise corporations hold large volumes of emails containing valuable information, so retrieving the required data from those emails effectively is important. In this paper, based on the structural features of emails and a study of their semantic topology, we introduce an email-feature-based retrieval model. In the TREC 2006 discussion search task, our model achieved the best performance among all participants.
  • Review
    GUO Rui, SONG Ji-hua, LIAO Min
    2008, 22(2): 87-91,105.
    With the prosperity and development of corpus linguistics, research on example-based machine translation (EBMT) has a flourishing prospect. In this area, two problems must be solved: 1) constructing a large-scale parallel corpus with high accuracy and speed; 2) searching the huge set of aligned examples for the sentence most similar to the input sentence. This paper addresses EBMT between ancient and modern Chinese. First, a new translation model is built which takes sentence length, character information, and punctuation into account at the same time. Then, a new approach for aligning bilingual sentences automatically is proposed based on a genetic algorithm and dynamic programming. Finally, a new similarity method is given based on the information entropy of Chinese characters. Experimental results show that our methods achieve good performance.
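The dynamic-programming part of such an aligner can be sketched as a classic length-based DP over 1-1, 1-2, and 2-1 sentence beads. The cost function below (absolute difference of scaled lengths) is a simplified stand-in for the paper's combined length, character, and punctuation score, and the bead inventory is an assumption.

```python
def align_by_length(src_lens, tgt_lens, ratio=1.0):
    """Length-based DP sentence alignment with 1-1, 1-2, and 2-1 beads.
    The cost of a bead is |sum(src lengths) * ratio - sum(tgt lengths)|,
    where `ratio` rescales source lengths (e.g. ancient vs modern Chinese).
    Returns the minimum-cost bead sequence as (src_count, tgt_count) pairs."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    beads = [(1, 1), (1, 2), (2, 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in beads:
                if i + di <= n and j + dj <= m:
                    cost = abs(sum(src_lens[i:i + di]) * ratio
                               - sum(tgt_lens[j:j + dj]))
                    if best[i][j] + cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + cost
                        back[i + di][j + dj] = (di, dj)
    # Trace back the chosen bead sequence from the final cell.
    result, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        result.append((di, dj))
        i, j = i - di, j - dj
    return result[::-1]
```

The paper's genetic algorithm could then search over the cost parameters (such as `ratio`) rather than over alignments directly, with this DP scoring each candidate setting.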
  • Review
    FANG Gao-lin, YU Hao, MENG Yao, ZOU Gang
    2008, 22(2): 92-98.
    As one of the important research topics, computer-aided Chinese learning is attracting more and more interest in the natural language processing community. A computer-aided reading and learning system that takes the character as its unit of analysis is proposed in this paper to provide reading and learning assistance for Chinese learners. The system first employs character-based Chinese morphological analysis to segment Chinese texts into words, and then presents a method based on the structural information of constituent characters for new word finding. For unknown words unregistered in the dictionary (such as technical terms, proper nouns, and fixed phrases), a method based on semantic prediction and feedback learning is proposed to mine their native translations from the Web. For frequent words, real-time translation display is implemented via the Chinese-English (Chinese-Japanese) dictionary database, and users can also obtain typical examples of a word's usage through a word usage retrieval module. The key technologies in this system include morphological analysis based on character information, word segmentation based on the structural information of constituent characters, and translation acquisition of new words based on semantic prediction and feedback learning. The character analysis unit is the core of all the methods proposed in the whole system. Experiments show that our system performs well in every aspect.
  • Review
    YAN Zhi-jie, HU Yu, WANG Ren-hua
    2008, 22(2): 99-105.
  • Review
    SUN Cheng-li , LIU Gang, GUO Jun
    2008, 22(2): 106-109,128.
    A Minimum Classification Error (MCE) criterion based sub-word weighting parameter estimation algorithm is proposed, in which the sub-word weighting parameters are derived by MCE training. An investigation of the contribution of different sub-words to the word-level confidence measure shows that Finals significantly outperform Initials, with more reliable and stable confidence performance, and that Finals have more discriminative power than Initials. Experiments on a keyword spotting system with 130 keywords show that the system with different sub-word weighting contributions achieves a relative Equal Error Rate (EER) reduction of 3.05% compared with the equal-weighting case.
  • Review
    GUO Qing, Nobuyuki Katae, YU Hao, Hitoshi Iwamida
    2008, 22(2): 110-115.
    The Fujitsu Mandarin TTS system is a state-of-the-art, unit-selection based concatenative speech synthesis system. This paper describes the current status of the system, especially the aspects related to prosody generation. The decision-tree-based duration prediction method and the statistical pitch contour prediction method are described in detail. Finally, the prosody evaluation results and the system evaluation results are presented.
  • Review
    SHAO Yan-qiu, SUI Zhi-fang, HAN Ji-qing,WU Yun-fang
    2008, 22(2): 116-123.
    A prosodic hierarchy divides texts into prosodic chunks for better speaking and understanding. Currently, many shallow features, such as part of speech and word length, are used to predict the prosodic hierarchy, but these features are not powerful enough for predicting some prosodic units, such as prosodic phrases. In fact, syntactic structure is closely related to prosodic structure; the two influence and restrict each other. In this paper, based on dependency grammar, some deep features related to the prosodic hierarchy are extracted. Compared with the shallow features, deep features such as inner-arc span and inner-arc type are more effective for predicting middle-level units such as prosodic phrases: the F-score increases by about 11%.
  • Review
    WANG Shi-jin, ZHENG Rong, XU Bo
    2008, 22(2): 124-128.
    This paper presents an automatic language identification (LID) system based on the lattice-based PPRLM method. As an extension of the original PPRLM, the lattice-based method uses a lattice to generate the acoustic hypothesis space, which contains more information than the 1-best phoneme sequence of the original PPRLM. Evaluations on broadcast speech in real environments show that lattice-based PPRLM improves the accuracy rate by 6%. The results are also comparable with those of other approaches across different languages when a four-hour training set is given for each language.