2006 Volume 20 Issue 1 Published: 15 February 2006
  

  • LI Bin,CHEN Xiao-he,FANG Fang,XU Yan-hua
    2006, 20(1): 3-8.
    Overlapping ambiguity remains an open issue in Chinese word segmentation. This paper investigates the Maximal Overlapping Ambiguity String (MOAS) in depth. First, we discuss the disadvantages of using FBMM to detect overlapping ambiguity strings (OAS). Then, through word omni-segmentation, we collect 14,906 high-frequency MOASs from a People's Daily corpus of about 400M characters. For these MOASs, 1,354,270 sample sentences are randomly selected and manually labeled. The results show that about 70% of truly ambiguous MOASs have a strong bias toward one segmentation, and consequently a disambiguation strategy for dealing with overlapping ambiguities is put forward.
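The FBMM detection mentioned above compares forward and backward maximum matching: when the two greedy scans segment a string differently, an overlapping ambiguity is flagged. A minimal sketch, using a tiny illustrative dictionary (the classic 结合/合成 overlap; the real system's lexicon is of course far larger):

```python
def fmm(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word from the left."""
    result, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in dictionary or l == 1:
                result.append(text[i:i + l])
                i += l
                break
    return result

def bmm(text, dictionary, max_len=4):
    """Backward maximum matching: greedily take the longest dictionary word from the right."""
    result, j = [], len(text)
    while j > 0:
        for l in range(min(max_len, j), 0, -1):
            if text[j - l:j] in dictionary or l == 1:
                result.insert(0, text[j - l:j])
                j -= l
                break
    return result

def fbmm_detects_oas(text, dictionary):
    """FBMM flags an overlapping ambiguity when the two segmentations disagree."""
    return fmm(text, dictionary) != bmm(text, dictionary)

words = {"结合", "合成", "结", "合", "成"}
print(fmm("结合成", words))  # ['结合', '成']
print(bmm("结合成", words))  # ['结', '合成']
print(fbmm_detects_oas("结合成", words))  # True
```

The paper's point is precisely that this disagreement test misses some true ambiguities and over-flags others, which motivates the MOAS-based analysis.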
  • LIU Fei-fan,ZHAO Jun,LV Bi-bo,XU Bo,YU Hao,XIA Ying-ju
    2006, 20(1): 9-15.
    Electronic business has recently fueled increasing research interest in business information extraction and market intelligence management. As one of the key techniques, product named entity recognition (product NER) has also begun to draw more attention in the field of natural language processing. In this paper, the characteristics and challenges of product NER are analyzed in detail, and a hierarchical hidden Markov model (HHMM) based approach to product NER from Chinese free text is presented. Experimental results in both the digital product and mobile phone domains show that our approach performs well in these two different domains, achieving overall F-measures of 79.7%, 86.9%, and 75.8% for the three types of product named entities, respectively. In comparison with a maximum entropy model, the HHMM is experimentally shown to be more powerful for the multi-scale embedded sequence problem.
  • ZHU Yan-lan,MIN Jin,ZHOU Ya-qian,HUANG Xuan-jing,WU Li-de
    2006, 20(1): 16-22.
    Nowadays, with the development of the Internet and the explosion of information, automated techniques for analyzing an author's attitude toward specific events would greatly benefit business intelligence and public opinion surveys. Semantic orientation inference has become a meaningful tool that can provide useful information for text classification, summarization, filtering, etc. Measuring the semantic orientation of words contributes greatly to predicting an author's attitude in a passage. In this paper, a simple HowNet-based method for computing the semantic orientation of Chinese words is introduced. Although the method requires only a few seed words, satisfactory results can still be obtained, and performance is even better for frequently used words, with a frequency-weighted accuracy above 80%.
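The seed-word scheme the abstract describes can be sketched as scoring a word by its average similarity to positive seeds minus its average similarity to negative seeds. The sketch below stands in a toy similarity table for the actual HowNet sememe-based similarity (the words and values here are invented for illustration):

```python
# Toy similarity table standing in for HowNet-based word similarity (values invented).
SIM = {
    ("优秀", "好"): 0.9, ("优秀", "坏"): 0.1,
    ("糟糕", "好"): 0.2, ("糟糕", "坏"): 0.8,
}

def sim(w1, w2):
    """Symmetric lookup; 0.0 for unknown pairs."""
    return SIM.get((w1, w2), SIM.get((w2, w1), 0.0))

def orientation(word, pos_seeds, neg_seeds):
    """Positive score suggests positive orientation; negative suggests negative."""
    pos = sum(sim(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(sim(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

print(orientation("优秀", ["好"], ["坏"]))  # > 0, i.e. positive
print(orientation("糟糕", ["好"], ["坏"]))  # < 0, i.e. negative
```

With a real similarity measure, only the seed lists need to be supplied, which matches the abstract's claim that a few seed words suffice.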
  • WU Ping-bo,CHEN Qun-xiu,MA Liang
    2006, 20(1): 23-30.
    Information extraction (IE) technology can provide a high-quality service for retrieval. Targeting events in web news, this paper constructs a system that extracts and integrates the key information of events that interest people. The methodologies and strategies of the system are as follows: (1) Extraction rules are built in terms of sentence patterns, and event information is extracted directly from text in which temporal phrases (TP) and spatial phrases (SP) are recognized and normalized; the extraction system can thus be implemented easily because complex syntactic parsing is skipped. (2) The same event in different documents is linked by the normalized TP and SP of the event, and the information associated with an event is merged. (3) When a new event appears in a text, the text is segmented, so that isolated information about an event in the same segment can be merged into the corresponding event. Preliminary experiments show that the methodologies and strategies in this paper are feasible.
  • LUO Wei-hua,YU Man-quan,XU Hong-bo,WANG Bin,CHENG Xue-qi
    2006, 20(1): 31-38.
    Topic Detection and Tracking (TDT) is evaluation-driven research that aims to organize and utilize streams of text according to events. Since it was brought forward in 1996, it has attracted more and more attention. Based on a study of today's mature algorithms, this paper proposes an algorithm of division and multi-level clustering with multi-strategy optimization. The core idea is to divide all data into groups, each with intrinsic relevance; to cluster within each group to produce micro-clusters; and then to cluster all micro-clusters to produce the final topics. During this process, various strategies are employed to improve the clustering. A system implementing the algorithm has been tested on the TDT4 corpus, and the test indicates that the algorithm is among the best currently available.
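The divide-then-cluster-twice idea can be sketched with a simple single-pass cosine clusterer applied at both levels. This is only a skeleton under assumed thresholds; the paper's multi-strategy optimizations are not reproduced here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of a list of sparse vectors."""
    c = {}
    for v in vectors:
        for k, x in v.items():
            c[k] = c.get(k, 0.0) + x / len(vectors)
    return c

def single_pass(vectors, threshold):
    """Assign each vector to the most similar existing cluster, or start a new one."""
    clusters = []
    for v in vectors:
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(v, centroid(c))
            if s >= best_sim:
                best, best_sim = c, s
        (best.append(v) if best else clusters.append([v]))
    return clusters

def multilevel(groups, micro_t=0.5, macro_t=0.3):
    """Micro-cluster within each group, then cluster micro-centroids into topics."""
    micros = [centroid(c) for g in groups for c in single_pass(g, micro_t)]
    return single_pass(micros, macro_t)

groups = [[{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}], [{"a": 1.0}]]
print(len(multilevel(groups)))  # two topics: one around "a", one around "c"
```

Clustering micro-cluster centroids rather than raw documents is what keeps the final pass cheap on a large stream.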
  • JIANG Di
    2006, 20(1): 39-45.
    This paper discusses the classification of Tibetan verbs according to their semantic and syntactic types, which describe the number of arguments and the characteristics of the components in a sentence. There are 12 types of verbs in Tibetan: stative verbs, action verbs, cognition verbs, perception verbs, verbs of change, directional verbs, narration verbs, copulas, verbs of possession, existential verbs, interrelation verbs, and causative verbs. Each type requires different case markers for nouns in different positions, different word orders, or different syntactic particles. All the semantic and syntactic features of verbs and the structural models described in this paper can be used as information in the dictionary we are now constructing. There is no doubt that the analysis of the semantic and syntactic features of Tibetan verbs is infrastructure for Tibetan natural language processing.
  • YANG Er-hong,FANG Ying,LIU Dong-ming,QIAO Yu
    2006, 20(1): 46-51,99.
    This paper presents the results of the '863 Chinese and Interface Technology' integrative evaluation of automatic Chinese word segmentation and part-of-speech (POS) tagging held in Beijing in 2003. It describes the evaluation content, the evaluation method, the selection of test questions, the test guidelines, etc., and summarizes the types of word segmentation and POS tagging errors in the tested results. We also present a flexible automatic method used in the evaluation. Finally, we give some analysis of the results and make some recommendations for the future.
  • FENG Chong,CHEN Zhao-xiong,HUANG He-yan,GUAN Zhen-zhen
    2006, 20(1): 52-60.
    Word segmentation is a fundamental task in Chinese processing. To overcome the difficulties traditional methods have in coping with varied application domains and evolving language phenomena, this paper adopts an unsupervised learning framework, using the EM algorithm to train an n-multigram language model. A new certainty-based active-learning segmentation algorithm is proposed, which combines labeled and unlabeled data to optimize the language model. In experiments it outperforms other unsupervised word segmentation algorithms.
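The EM training loop can be illustrated with a much-simplified hard-EM unigram variant (the paper trains a richer n-multigram model with active learning, which this sketch does not attempt): segment the corpus with the current word probabilities, then re-estimate the probabilities from the resulting counts.

```python
import math
from collections import defaultdict

def best_seg(text, probs, max_len=4):
    """Viterbi search: most probable segmentation under a unigram word model."""
    n = len(text)
    score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            w = text[i - l:i]
            p = probs.get(w, 1e-6 if l == 1 else 0.0)  # tiny floor keeps single chars viable
            if p > 0 and score[i - l] + math.log(p) > score[i]:
                score[i] = score[i - l] + math.log(p)
                back[i] = i - l
    seg, i = [], n
    while i > 0:
        seg.insert(0, text[back[i]:i])
        i = back[i]
    return seg

def em_train(corpus, rounds=5, max_len=4):
    """Hard EM: initialize counts from all substrings, then alternate segment/re-estimate."""
    counts = defaultdict(float)
    for line in corpus:
        for i in range(len(line)):
            for l in range(1, min(max_len, len(line) - i) + 1):
                counts[line[i:i + l]] += 1.0
    for _ in range(rounds):
        total = sum(counts.values())
        probs = {w: c / total for w, c in counts.items()}
        counts = defaultdict(float)
        for line in corpus:
            for w in best_seg(line, probs, max_len):
                counts[w] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

True EM would use expected (fractional) counts over all segmentations rather than the single best path; the hard variant is shown only because it is short.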
  • YU Kun,GUAN Gang,ZHOU Ming,WANG Xu-fa,CAI Qing-sheng
    2006, 20(1): 61-68.
    This paper presents an approach to resume information extraction based on cascaded two-layer text classification. The approach first divides a resume into blocks and strings, and divides the target information into general information and detailed information. It extracts general information by block segmentation and classification, then uses a fuzzy strategy to select the blocks that may contain predefined detailed information, and finally segments these blocks into strings and labels the strings with detailed-information classes. Experimental results on 1,200 Chinese resumes show that our approach is suitable for the information extraction and management of resumes.
  • LIN Da-zhen,LI Shao-zi
    2006, 20(1): 69-77.
    As far as NLP is concerned, the tense of the Chinese language is especially hard to tackle. One of the outstanding characteristics of Chinese is that tense is usually implied rather than explicit. Hence, rule-based solutions are far from suitable for recognizing tense in situations where tense-informing words are missing or where more than one such word is present. In this paper, we introduce a solution based on pattern classification, which evaluates each word in terms of its contribution to recognizing the tense of the sentence concerned. This solution proves effective when processing sentences containing no tense-informing words or more than one. Furthermore, the use of a linear discriminant function gives the solution the ability to process and train on multi-dimensional data, and helps to achieve decent performance. Evaluated under open conditions, the precision and recall of the solution for single sentences are 79.8% and 95.3%, respectively.
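Scoring every word's contribution with linear discriminant functions amounts to one weight vector per tense class and an argmax over the class scores. A minimal sketch with perceptron-style training on hypothetical toy sentences (the words, labels, and training rule here are illustrative, not the paper's actual setup):

```python
from collections import defaultdict

class LinearTenseClassifier:
    """Multi-class linear discriminants g_c(x) = w_c . x over bag-of-words features."""
    def __init__(self, classes):
        self.classes = classes
        self.weights = {c: defaultdict(float) for c in classes}

    def score(self, cls, features):
        return sum(self.weights[cls][f] for f in features)

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(c, features))

    def train(self, data, epochs=10):
        """Perceptron updates: reward the gold class, penalize the wrong guess."""
        for _ in range(epochs):
            for features, gold in data:
                guess = self.predict(features)
                if guess != gold:
                    for f in features:
                        self.weights[gold][f] += 1.0
                        self.weights[guess][f] -= 1.0

# Hypothetical toy data: each sentence as a bag of words, labelled with tense.
data = [
    (["他", "昨天", "去", "了"], "past"),
    (["他", "明天", "去"], "future"),
    (["他", "正在", "吃"], "present"),
]
clf = LinearTenseClassifier(["past", "present", "future"])
clf.train(data)
print(clf.predict(["她", "昨天", "来", "了"]))  # "past"
```

Because every word carries a weight, the classifier still produces a decision when no single tense-informing word is present, which is the case rule systems handle worst.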
  • SHI Jing,DAI Guo-zhong
    2006, 20(1): 78-86.
    Text inference is central to natural language applications. This paper presents an inference method based on HowNet, which organizes knowledge as a semantic net and infers by marker passing. The method introduces the construction-integration model, generates knowledge structures dynamically, and builds guided paths between text words. Examples from 16 inference classes are used to test it. The results show that ideal inferences can be extracted at low cost if enough context is given.
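Marker passing over a semantic net can be approximated, at its simplest, by a bounded breadth-first spread of markers from one word until they reach another; the path of links traversed is the inference chain. A sketch over a hypothetical mini-net (the nodes and links below are invented, and real marker passing also weights and guides the spread, which this omits):

```python
from collections import deque

def marker_passing_path(graph, src, dst, max_depth=4):
    """Spread markers outward from src by BFS; return the first path that reaches dst."""
    frontier, seen = deque([(src, [src])]), {src}
    while frontier:
        node, path = frontier.popleft()
        if node == dst:
            return path
        if len(path) > max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # markers never met within the depth bound

# Hypothetical mini semantic net in the spirit of HowNet concept links.
net = {
    "doctor": ["human", "cure"],
    "cure": ["disease"],
    "disease": ["patient"],
}
print(marker_passing_path(net, "doctor", "patient"))
```

The depth bound plays the role of marker energy decay: paths that grow too long are considered too weak to support an inference.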
  • HAN Zhi,LIU Chang-ping,YIN Xu-cheng
    2006, 20(1): 87-92.
    Character segmentation of mail addresses is a crucial step in address recognition for automatic postal mail sorting systems. In this paper, a character segmentation algorithm is proposed according to the characteristics of handwritten mail address characters. First, a simple segmentation is performed using structure-based methods, including vertical projection, connected-component extraction, and stroke-crossing-number analysis, to extract the block sequence from the mail address image. Next, candidate segmentation paths are created by merging neighboring blocks. These paths are then evaluated using character recognition confidence and knowledge from a known postal address database. An experiment with the algorithm was carried out on more than 500 real envelope images, achieving a correct sorting rate for address recognition of up to 78.61% and a rate for integrated address and postcode recognition of up to 95.42%.
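The vertical-projection step of the first stage can be sketched directly: count the foreground pixels in each column of the binary image and cut the image at empty columns. This is only the simplest of the three structural cues the paper combines:

```python
def vertical_projection(image):
    """Count foreground pixels (1s) in each column of a binary image (list of rows)."""
    return [sum(row[c] for row in image) for c in range(len(image[0]))]

def split_blocks(image):
    """Split the image into (start, end) column blocks at empty columns."""
    proj = vertical_projection(image)
    blocks, start = [], None
    for c, v in enumerate(proj):
        if v > 0 and start is None:
            start = c
        elif v == 0 and start is not None:
            blocks.append((start, c))
            start = None
    if start is not None:
        blocks.append((start, len(proj)))
    return blocks

# Two tiny "characters" separated by an empty column.
img = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
print(split_blocks(img))  # [(0, 2), (3, 5)]
```

Projection alone over-segments characters with disconnected strokes, which is why the paper then merges neighboring blocks into candidate paths and lets recognition confidence choose among them.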
  • BAI Jun-mei,ZHANG Shi-lei,ZHANG Shu-wu,XU Bo
    2006, 20(1): 93-99.
    Speaker recognition (SR) has achieved excellent results on clean speech. However, noise or channel mismatch can cause significant performance degradation in practical applications. The focus of this paper is to address the problems of robust and efficient speaker identification (SI) in noisy environments. The main contributions center on two areas: signal processing based on Wiener filtering, and the integration of F0 and MFCC speaker features. Experimental results on the YOHO corpus show that the Wiener filter is an efficient front-end processing technique and that F0 is a robust feature for SR in noisy environments. Performance improves by more than 20% over the baseline system.
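The Wiener front end attenuates each frequency bin according to its estimated signal-to-noise ratio: the classical gain is H(f) = S(f) / (S(f) + N(f)), where S and N are the speech and noise power spectral densities. A minimal per-bin sketch with made-up PSD values (how S and N are estimated per frame is the hard part and is not shown):

```python
def wiener_gain(signal_psd, noise_psd):
    """Per-frequency Wiener gain H(f) = S(f) / (S(f) + N(f))."""
    return [s / (s + n) if (s + n) > 0 else 0.0
            for s, n in zip(signal_psd, noise_psd)]

def apply_gain(noisy_spectrum, gain):
    """Attenuate each bin of the noisy spectrum by its Wiener gain."""
    return [x * g for x, g in zip(noisy_spectrum, gain)]

# Hypothetical PSD estimates for three frequency bins.
signal_psd = [4.0, 1.0, 0.0]
noise_psd = [1.0, 1.0, 2.0]
print(wiener_gain(signal_psd, noise_psd))  # [0.8, 0.5, 0.0]
```

Bins dominated by speech pass nearly unchanged while noise-only bins are suppressed, which is why it works well as a front end before MFCC and F0 extraction.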
  • LIU Hao-jie,DU Li-min
    2006, 20(1): 100-106.
    The F0 contour of a Chinese prosodic word is greatly influenced by the stress of its syllables. Based on a mathematical model for producing the optimized F0 contour, this paper proposes a χ² fitting method to optimize the continuity, smoothness, shape, and average features of the F0 contour of a prosodic word, achieving optimization of the F0 contour under the influence of the stress of the corresponding syllables. Based on the HNM speech synthesis system, we show the optimized results for the 64 to...