2010 Volume 24 Issue 2 Published: 15 April 2010
  

  • Review
    YUAN Yulin1,WANG Minghua2
    2010, 24(2): 3-14.
    This article first presents an inference model consisting of a knowledge base of entailment patterns together with a set of inference rules and their probability estimates, which approximates the textual entailment relationship and predicts whether entailment holds for a given text-hypothesis pair. It then introduces methods for learning the inference rules, the entailment patterns, and their probabilities, including learning from a single corpus, from parallel or comparable corpora, or from the web. Finally, it describes recognition models based on lexical probability, e.g. lexical entailment probability models and lexical reference matching models, as well as syntax- and semantics-driven models, e.g. models that match dependency tree nodes or predicate-argument structures between a given text-hypothesis pair.
    Key words: computer application; Chinese information processing; textual entailment; inference model; entailment pattern; recognizing models; lexical probability; syntax and semantics
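    As an illustration of the lexical entailment probability models mentioned above, the following sketch scores a text-hypothesis pair as the product, over hypothesis words, of the best lexical entailment probability offered by any text word. The word pairs, probabilities, and threshold are invented for the example, not taken from the paper.

    ```python
    # Hypothetical lexical entailment probabilities P(h entailed by t);
    # values here are illustrative only.
    LEX_PROB = {
        ("purchased", "bought"): 0.9,
        ("vehicle", "car"): 0.8,
    }

    def entailment_prob(text_words, hyp_words):
        """Approximate P(T entails H) as a product of per-word maxima."""
        p = 1.0
        for h in hyp_words:
            best = max(
                (1.0 if t == h else LEX_PROB.get((t, h), 0.0) for t in text_words),
                default=0.0,
            )
            p *= best
        return p

    text = ["john", "purchased", "a", "vehicle"]
    hyp = ["john", "bought", "a", "car"]
    p = entailment_prob(text, hyp)   # 1.0 * 0.9 * 1.0 * 0.8 = 0.72
    holds = p > 0.5                  # decision threshold is illustrative
    ```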
  • Review
    Chu-Ren Huang1,2, Shu-Kai Hsieh3, Jia-Fei Hong4,
    Yun-Zhu Chen1, I-Li Su1, Yong-Xiang Chen5, Sheng-Wei Huang1
    2010, 24(2): 14-24.
    The design criterion of Chinese WordNet (CWN) is to build a complete and robust knowledge system that also embodies a precise expression of semantic relations. Such a precise expression of Chinese sense division and semantic relations must be grounded in linguistic theory, especially lexical semantics. All word sense examples and lexical semantic relations in CWN are attested with corpus data. Our methodology first analyzes the language data and then re-examines the accuracy of the analysis by sense-tagging the corpus with the analyzed results. For formal representation and computational application, a complete and robust knowledge system needs the formal integrity of an ontology; the Suggested Upper Merged Ontology (SUMO) is adopted for this purpose.
    Key words: computer application; Chinese information processing; Chinese WordNet; global WordNet grid; ontology; multi-language processing; cross-lingual integration
  • Review
    TANG Xuri, CHEN Xiaohe, XU Chao, LI Bin
    2010, 24(2): 24-33.
    The paper presents a system for recognizing Chinese location names at the discourse level. The system employs three modules in sequence: a CRFs-based module for simple location name recognition, a discourse-based module for identifying relationships between the simple location names, and a CRFs-based module for complex location name recognition. The module for simple location name recognition takes raw text as input and models both the internal structure of basic location names and the information of neighboring characters. The discourse-based module employs toponymhood calculation and discourse-based location name relations for recognition. The complex location name recognition module is also based on CRFs but operates on the result of simple toponym recognition. Experiments show that the system achieves F-scores of 92.87% and 89.76% in closed and open tests, respectively.
    Key words: computer application; Chinese information processing; discourse-based location name relation; conditional random fields; toponymhood calculation
  • Review
    PENG Weiminɡ,SONG Jihua
    2010, 24(2): 33-39.
    The paper analyzes certain issues in constructing a domain ontology, especially the position of the "instance". Accordingly, a historical domain ontology project on Zizhi Tongjian is proposed, together with the corresponding construction method. Adopting a pattern-driven, bottom-up strategy, the Ontology of Pre-Qin History is implemented and evaluated through SPARQL querying and TouchGraph visualization. This engineering practice may offer useful guidance to practitioners constructing domain ontologies.
    Key words: computer application; Chinese information processing; domain ontology; construction method; ontology engineering
  • Review
    SHI Min, LI Bin, CHEN Xiaohe
    2010, 24(2): 39-46.
    This paper explores the intersection of NLP and ancient Chinese, particularly the pre-Qin documents. The text of "Zuo Zhuan" is first analyzed after manual segmentation and POS tagging. Then the Conditional Random Fields (CRF) model is adopted for word segmentation (WS), POS tagging (PT), and a unified process of WS and PT, respectively. The precision and recall of the unified approach are much higher than those of independent WS and PT in the open test, with an F-score of 94.60% in WS and 89.65% in PT. This method is suitable for the study of ancient Chinese vocabulary and corpus construction, and can be applied to supplement manual tagging.
    Key words: computer application; Chinese information processing; pre-Qin Chinese; word segmentation; POS tagging; Zuo Zhuan; conditional random fields model
  • Review
    TANG Qin,LIN Hongfei
    2010, 24(2): 46-52.
    In addition to the word features of a character's name, we can recognize a character's gender from the differences in the words used when a man or a woman is described in the text. Based on the different descriptions of men and women in various aspects, we obtain a large number of words with significant gender differences: gender bias feature words and gender bias personal appellations. The experiment shows that gender bias feature words describe different gender roles better than gender bias personal appellations. Moreover, combining gender bias feature words with gender bias personal appellations and the word features of a character's name outperforms using the person name features alone.
    Key words: computer application; Chinese information processing; gender bias feature words; gender bias personal appellations; gender recognition
  • Review
    WU Kui1, ZHOU Xianzhong2, WANG Jianyu1, ZHAO Jiabao2
    2010, 24(2): 52-58.
    Traditional algorithms for semantic similarity computation fall into two categories: distance-based and information-based methods. The former ignores objective statistics, while the latter suffers from insufficient domain data. In this paper, a new method for similarity computation based on Bayesian estimation is proposed. First, the concept emergence probability is assumed to be a random variable with a prior Beta distribution. Second, its prior parameters are set from the distance-based similarity algorithm, and the posterior is calculated by Bayesian estimation. The resulting semantic similarity, obtained through the information-based method, thereby integrates subjective experience with objective statistics. Finally, the proposed method is implemented and validated on WordNet, showing a slightly higher correlation with human judgments.
    Key words: computer application; Chinese information processing; ontology; semantic similarity; Bayesian estimation; Beta distribution
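    The Beta-prior idea described above can be sketched as a conjugate Beta-Binomial update. The mapping from the distance-based similarity score to the prior parameters, and the fixed prior strength, are assumptions made for illustration, not the paper's exact formulation.

    ```python
    # Sketch: a concept's emergence probability p ~ Beta(a, b); the prior is
    # seeded from a distance-based similarity score, then updated with corpus
    # counts, and the posterior mean would feed an information-based measure.

    def beta_prior_from_distance(sim, strength=10.0):
        # Hypothetical mapping: treat the distance-based similarity in [0, 1]
        # as the prior mean, with a fixed equivalent sample size (strength).
        a = sim * strength
        b = (1.0 - sim) * strength
        return a, b

    def posterior_mean(a, b, occurrences, trials):
        # Conjugate Beta-Binomial update: E[p | data] = (a + k) / (a + b + n).
        return (a + occurrences) / (a + b + trials)

    a, b = beta_prior_from_distance(0.6)      # prior mean 0.6 -> Beta(6, 4)
    p_hat = posterior_mean(a, b, 30, 100)     # observed 30 occurrences in 100
    ```

    The posterior mean here is (6 + 30) / (10 + 100), i.e. the corpus counts pull the estimate away from the distance-based prior as evidence accumulates.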
  • Review
    SU Chong, CHEN Qingcai, WANG Xiaolong, MENG Xianjun
    2010, 24(2): 58-68.
    Most existing web page clustering algorithms operate on the short and uneven snippets of web pages, which often causes poor clustering performance (e.g., the STC and Lingo algorithms). On the other hand, the classical clustering algorithms for full web pages are too complex to provide good cluster labels, in addition to being incapable of online clustering (for example, the k-means algorithm). To address these problems, this paper presents an online web page clustering algorithm based on maximal frequent itemsets (MFIC). First, the maximal frequent itemsets are mined; then the web pages are clustered based on shared frequent itemsets; finally, clusters are labeled with the frequent items. Experimental results show that MFIC effectively reduces clustering time, improves clustering accuracy by 15%, and generates understandable labels.
    Key words: computer application; Chinese information processing; search engine; Web page clustering; frequent itemset
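    A toy illustration of the clustering idea above: documents are term sets, a miner finds the maximal frequent itemsets, and the documents sharing one form a cluster labeled by its items. The brute-force miner is exponential and purely illustrative; it does not reproduce the paper's MFIC algorithm.

    ```python
    from itertools import combinations

    def maximal_frequent_itemsets(docs, min_support):
        """Brute-force miner: all frequent itemsets, then keep the maximal ones."""
        items = sorted({w for d in docs for w in d})
        frequent = []
        for k in range(1, len(items) + 1):
            level = [set(c) for c in combinations(items, k)
                     if sum(1 for d in docs if set(c) <= d) >= min_support]
            if not level:
                break
            frequent.extend(level)
        # maximal = not strictly contained in any other frequent itemset
        return [s for s in frequent if not any(s < t for t in frequent)]

    def cluster(docs, min_support=2):
        """Group documents by shared maximal frequent itemset; label with items."""
        clusters = {}
        for mfi in maximal_frequent_itemsets(docs, min_support):
            label = " ".join(sorted(mfi))
            clusters[label] = [i for i, d in enumerate(docs) if mfi <= d]
        return clusters

    docs = [{"web", "page", "cluster"}, {"web", "page"}, {"search", "engine"}]
    result = cluster(docs)   # {"page web": [0, 1]}
    ```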
  • Review
    WANG Yun, LI Bicheng, LIN Chen
    2010, 24(2): 68-76.
    Web forums contain a wealth of information resources, and making full use of them relies on web forum data extraction technology. This paper addresses what data should be extracted from web forums and how to extract it, using a proposed method based on the similarity of page layout. The method effectively avoids the drawbacks of current approaches, namely low automation or low accuracy. It first recognizes the topic block by exploiting the special layout of web forum pages, then extracts data from the topic block using rules. Experimental results show that the method performs well in adjustability, precision, and recall.
    Key words: computer application; Chinese information processing; Web forum; data extraction; similarity
  • Review
    QI Haoliang1, CHENG Xiaolong1, YANG Muyun2, HE Xiaoning3, LI Sheng2, LEI Guohua1
    2010, 24(2): 76-84.
    We designed and implemented a high-performance Chinese spam filter. An online filtering mode is adopted to defend against the evolution of spam emails. A logistic regression model serves as the filtering model; byte-level N-grams are used to extract email features; and the filter is trained with the TONE (Train On or Near Error) method. The performance of our filter is evaluated on Chinese spam corpora: it outperforms the best system in the TREC 06 spam filtering track, achieves a 1-ROCA score of 0.0000% on the SEWM07 immediate feedback task, and ranks first in all SEWM08 online learning tasks.
    Key words: computer application; Chinese information processing; Chinese spam filtering; online learning; logistic regression model; byte N-gram; TONE
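    The pipeline described above (byte-level n-gram features, a logistic regression score, TONE updates only on errors or near-threshold messages) might be sketched as follows. The learning rate, margin, and n-gram size are illustrative guesses, not the authors' settings.

    ```python
    import math

    class OnlineFilter:
        def __init__(self, n=4, lr=0.1, margin=0.2):
            self.n, self.lr, self.margin = n, lr, margin
            self.w = {}   # sparse weights over byte n-grams

        def features(self, msg: bytes):
            # byte-level n-grams of the raw message
            return {msg[i:i + self.n] for i in range(len(msg) - self.n + 1)}

        def score(self, msg):
            # logistic regression: P(spam) = sigmoid(sum of feature weights)
            z = sum(self.w.get(f, 0.0) for f in self.features(msg))
            return 1.0 / (1.0 + math.exp(-z))

        def train(self, msg, is_spam):
            p = self.score(msg)
            # TONE: update only on misclassification or near-threshold scores
            if (p > 0.5) != is_spam or abs(p - 0.5) < self.margin:
                g = self.lr * ((1.0 if is_spam else 0.0) - p)
                for f in self.features(msg):
                    self.w[f] = self.w.get(f, 0.0) + g

    f = OnlineFilter()
    for _ in range(20):   # simulate an online feedback stream
        f.train(b"cheap pills buy now", True)
        f.train(b"meeting agenda attached", False)
    ```

    After a few online updates the filter separates the two message families, and because of the TONE rule it stops adjusting weights once a message is classified correctly with enough margin.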
  • Review
    PAN Tuoyu1,2, ZHU Zhenmin1,2, TENG Ji1,2, YE Jian1, ZENG Qingfeng1
    2010, 24(2): 84-91.
    With the dramatic increase of information available on the Internet, providing users with personalized service is clearly a trend. In this paper, by building a generalized service model based on ontology, items are classified into service sub-categories and the probability distribution of users' interests is calculated. Combining content filtering with item-based collaborative filtering, a new ontology-based hybrid personalized recommendation model (OHR) is put forward. The experimental results show that OHR provides better recommendation results than traditional collaborative filtering algorithms, as well as a better ability to discover users' new interests.
    Key words: computer application; Chinese information processing; ontology; hybrid personalized recommendations; item-based collaborative filtering; probabilistic model
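    A minimal sketch of the item-based collaborative filtering component combined in the hybrid model: item-item cosine similarity over user rating vectors, with a similarity-weighted prediction. The data and function names are hypothetical, not from the paper's OHR model.

    ```python
    import math

    def cosine(a, b):
        """Cosine similarity between two sparse item rating vectors."""
        users = set(a) & set(b)
        num = sum(a[u] * b[u] for u in users)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def predict(user_ratings, item_vectors, target):
        """Similarity-weighted average of the user's ratings on rated items."""
        num = den = 0.0
        for item, r in user_ratings.items():
            s = cosine(item_vectors[item], item_vectors[target])
            num += s * r
            den += abs(s)
        return num / den if den else 0.0

    # toy data: items rated by users u1/u2; items "a" and "b" are identical
    item_vectors = {"a": {"u1": 5, "u2": 3}, "b": {"u1": 5, "u2": 3},
                    "c": {"u1": 1}}
    score = predict({"a": 4}, item_vectors, "b")   # 4.0, since sim(a, b) = 1
    ```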
  • Review
    ZHU Conghui1, ZHAO Tiejun1, HAN xiwu2, ZHENG Dequan1
    2010, 24(2): 91-96.
    Verb subcategorization (SCF) is a concise classification based on the syntactic behaviors of verbs, composed of a verb and several arguments. It has recently attracted substantial research for single languages, e.g. English and Chinese, whereas cross-lingual subcategorization demands more systematic efforts. We present a novel method to obtain SCF argument correspondences between Chinese and English based on active learning. The method can find new relations from bilingual parallel sentence pairs almost without any prior linguistic knowledge. We also integrated these relations into a statistical machine translation (SMT) system, and experimental results show that the SMT system combined with the bilingual argument relationships achieves a significant improvement, which indicates the validity of the automatically obtained argument correspondences.
    Key words: artificial intelligence; machine translation; verb subcategorization; cross-lingual argument correspondence; automatic acquisition; statistical machine translation
  • Review
    HE Jing1,2, ZHOU Ming2, JIANG Long2
    2010, 24(2): 96-104.
    Automatic poetry generation is considered difficult. In this paper, we propose a novel statistical approach for the automatic generation of traditional Chinese metrical poetry from a few user-supplied keywords. A template-based model automatically generates the first sentence of the poem; a phrase-based statistical machine translation model then generates the subsequent sentences one by one. With our interactive model, the user can select the best sentence from the system's N-best output at each step. The approach has been evaluated on the generation of quatrains of 5- and 7-character lines. Evaluation metrics for single lines as well as for the whole generated poem suggest that this method is very promising.
    Key words: artificial intelligence; machine translation; statistical machine translation; poem generation; poem evaluation
  • Review
    DU Jun, DAI Lirong, WANG Renhua
    2010, 24(2): 104-110.
    In this paper, we propose a new feature normalization approach for robust speech recognition. We observe that the shape of speech feature distributions changes in noisy environments compared with the clean condition. Accordingly, Cepstral Shape Normalization (CSN), which normalizes the shape of feature distributions by exploiting an exponential factor, is performed. The method proves effective in noisy environments, especially at low SNRs. Experimental results show that the proposed method yields relative word error rate reductions of 38% and 25% on the Aurora2 and Aurora3 databases, respectively, compared with conventional Mean and Variance Normalization (MVN). CSN also consistently outperforms other traditional methods such as Histogram EQualization (HEQ) and Higher Order Cepstral Moment Normalization (HOCMN).
    Key words: computer application; Chinese information processing; robust speech recognition; shape normalization
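    One way to picture the exponential-factor idea behind CSN (the details here are assumptions for illustration, not the paper's exact formulation): after mean and variance normalization, a sign-preserving power transform reshapes the feature distribution.

    ```python
    import math

    def mvn(frames):
        """Conventional mean and variance normalization over a feature track."""
        n = len(frames)
        mean = sum(frames) / n
        var = sum((x - mean) ** 2 for x in frames) / n
        std = math.sqrt(var) or 1.0
        return [(x - mean) / std for x in frames]

    def csn(frames, alpha=0.75):
        # Sign-preserving exponential factor applied after MVN; the value
        # of alpha is hypothetical, chosen only to show the reshaping effect.
        return [math.copysign(abs(x) ** alpha, x) for x in mvn(frames)]

    normalized = csn([1.0, 2.0, 3.0, 4.0])
    ```

    For alpha < 1 the transform compresses the tails of the distribution relative to MVN alone, which is the kind of shape change the abstract attributes to noisy conditions.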
  • Review
    ZHANG Feng1, HUANG Chao2, DAI Lirong1
    2010, 24(2): 110-116.
    Current automatic mispronunciation detection systems are mostly based on the automatic speech recognition (ASR) framework with statistical models. This paper presents methods to improve syllable-level mispronunciation detection for Mandarin Chinese in two aspects: introducing speaker adaptive training (SAT) and selective maximum likelihood linear regression (SMLLR) to obtain a better acoustic model, and proposing a speaker normalization backend to cope with the limited information and the different rating levels for different pronunciation proficiencies. Experiments on a database of 8 000 syllables pronounced by 40 speakers of varied pronunciation proficiency indicate the promising effect of these strategies, improving the precision from 45.8% to 53.6% at 30% recall, and from 64.6% to 79.9% at 10% recall.
    Key words: computer application; Chinese information processing; automatic mispronunciation detection; speaker adaptive training (SAT); selective maximum likelihood linear regression (SMLLR); speaker normalization
  • Review
    GU Shaotong1,2,3
    2010, 24(2): 116-122.
    A character image restoration method for Jiagu rubbings based on adaptive thresholding and fractal geometry is proposed in this paper. The paper analyzes the characteristics of the image noise and the character edges on Jiagu rubbings. First, we estimate the adaptive threshold by means of a Bayes risk function and remove the noise regions. Then we calculate the fractal dimension of the character edges on Jiagu rubbings by statistical means. Finally, we transform the character edges so as to smooth them in the Jiagu rubbings image. The experimental results show that the proposed method smooths the character edges of Jiagu rubbings significantly.
    Key words: computer application; Chinese information processing; Jiagu rubbings; adaptive threshold; fractal geometry; fractal dimension; compression transformation; character image restoration
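    The fractal dimension of a character edge can be estimated, for instance, by box counting over a binary edge image. This is a generic illustration of the concept, not the authors' statistical method.

    ```python
    import math

    def box_count(img, s):
        """Number of s-by-s boxes that contain at least one edge pixel."""
        h, w = len(img), len(img[0])
        count = 0
        for by in range(0, h, s):
            for bx in range(0, w, s):
                if any(img[y][x]
                       for y in range(by, min(by + s, h))
                       for x in range(bx, min(bx + s, w))):
                    count += 1
        return count

    def fractal_dimension(img, sizes=(1, 2, 4)):
        # least-squares slope of log N(s) versus log(1/s)
        xs = [math.log(1.0 / s) for s in sizes]
        ys = [math.log(box_count(img, s)) for s in sizes]
        n = len(sizes)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
               sum((x - mx) ** 2 for x in xs)

    # a straight horizontal edge: its box-counting dimension is 1
    line_img = [[1] * 8] + [[0] * 8 for _ in range(7)]
    ```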
  • Review
    WANG Kunlun1,ZHANG Guanhong1, Turghunjan Abdukirim 2
    2010, 24(2): 122-129.
    As a Turkic Altaic language, Uighur has a unique word-building system in which its eight vowels play very important roles in speech recognition and synthesis, especially in the selection of the recognition base unit. Focusing on the acoustic frequency characteristics and formant frequency parameters of Uighur vowels, this paper adopts the theory and methods of experimental phonetics to reveal the vowels' formant frequency distribution rules on a Uighur synthetic speech database (office environment). The accuracy of the formant frequency distribution parameters of Uighur's eight vowels is further validated in a speech recognition test. The experiment confirms that, once vowel harmony phenomena are removed, Uighur vowels are more distinguishable in the audio frequency range, which benefits correct speech transmission and reception.
    Key words: computer application; Chinese information processing; speech recognition; acoustic frequency characteristics; formant frequency; vowel; Uighur