2006 Volume 20 Issue 1 Published: 15 February 2006
  

  • LI Bin,CHEN Xiao-he,FANG Fang,XU Yan-hua
    2006, 20(1): 3-8.
    Overlapping ambiguity remains an open issue in Chinese word segmentation. This paper investigates the Maximal Overlapping Ambiguity String (MOAS) in depth. First, we discuss the disadvantages of using FBMM to detect overlapping ambiguity strings (OAS). Then, through word omni-segmentation, we collect 14,906 high-frequency MOASs from a People's Daily corpus of about 400M characters. For these MOASs, 1,354,270 sample sentences are randomly selected and manually labeled. The results show that about 70% of truly ambiguous MOASs have a strong bias toward one segmentation, and consequently a disambiguation strategy for dealing with overlapping ambiguities is put forward.
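The FBMM detection mentioned above compares forward and backward maximum matching: when the two greedy scans segment a string differently, an overlapping ambiguity is flagged. A minimal sketch, using a tiny illustrative dictionary (the classic 结合/合成 overlap; the real system's lexicon is of course far larger):

```python
def fmm(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word from the left."""
    result, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in dictionary or l == 1:
                result.append(text[i:i + l])
                i += l
                break
    return result

def bmm(text, dictionary, max_len=4):
    """Backward maximum matching: greedily take the longest dictionary word from the right."""
    result, j = [], len(text)
    while j > 0:
        for l in range(min(max_len, j), 0, -1):
            if text[j - l:j] in dictionary or l == 1:
                result.insert(0, text[j - l:j])
                j -= l
                break
    return result

def fbmm_detects_oas(text, dictionary):
    """FBMM flags an overlapping ambiguity when the two segmentations disagree."""
    return fmm(text, dictionary) != bmm(text, dictionary)

words = {"结合", "合成", "结", "合", "成"}
print(fmm("结合成", words))  # ['结合', '成']
print(bmm("结合成", words))  # ['结', '合成']
print(fbmm_detects_oas("结合成", words))  # True
```

The paper's point is precisely that this disagreement test misses some true ambiguities and over-flags others, which motivates the MOAS-based analysis.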
  • LIU Fei-fan,ZHAO Jun,LV Bi-bo,XU Bo,YU Hao,XIA Ying-ju
    2006, 20(1): 9-15.
    Electronic business has recently fueled increasing research interest in business information extraction and market intelligence management. As one of the key techniques, product named entity recognition (product NER) has also begun to draw more attention in the field of natural language processing. In this paper, the characteristics and challenges of product NER are analyzed in detail, and a hierarchical hidden Markov model (HHMM) based approach to product NER from Chinese free text is presented. Experimental results in both the digital product and mobile phone domains show that our approach performs well in these two different domains, achieving overall F-measures of 79.7%, 86.9%, and 75.8% for the three types of product named entities, respectively. In comparison with a maximum entropy model, the HHMM is experimentally shown to be more powerful for the multi-scale embedded sequence problem.
  • ZHU Yan-lan,MIN Jin,ZHOU Ya-qian,HUANG Xuan-jing,WU Li-de
    2006, 20(1): 16-22.
    Nowadays, with the development of the Internet and the explosion of information, automated techniques for analyzing an author's attitude toward specific events would greatly benefit business intelligence and public opinion surveys. Semantic orientation inference has become a meaningful tool that can provide useful information for text classification, summarization, filtering, etc. Measuring the semantic orientation of words contributes greatly to predicting an author's attitude in a passage. In this paper, a simple HowNet-based method for computing the semantic orientation of Chinese words is introduced. Although the method requires only a few seed words, satisfactory results can still be obtained, and performance is even better for frequently used words, with a frequency-weighted accuracy above 80%.
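The seed-word scheme the abstract describes can be sketched as scoring a word by its average similarity to positive seeds minus its average similarity to negative seeds. The sketch below stands in a toy similarity table for the actual HowNet sememe-based similarity (the words and values here are invented for illustration):

```python
# Toy similarity table standing in for HowNet-based word similarity (values invented).
SIM = {
    ("优秀", "好"): 0.9, ("优秀", "坏"): 0.1,
    ("糟糕", "好"): 0.2, ("糟糕", "坏"): 0.8,
}

def sim(w1, w2):
    """Symmetric lookup; 0.0 for unknown pairs."""
    return SIM.get((w1, w2), SIM.get((w2, w1), 0.0))

def orientation(word, pos_seeds, neg_seeds):
    """Positive score suggests positive orientation; negative suggests negative."""
    pos = sum(sim(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(sim(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

print(orientation("优秀", ["好"], ["坏"]))  # > 0, i.e. positive
print(orientation("糟糕", ["好"], ["坏"]))  # < 0, i.e. negative
```

With a real similarity measure, only the seed lists need to be supplied, which matches the abstract's claim that a few seed words suffice.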
  • WU Ping-bo,CHEN Qun-xiu,MA Liang
    2006, 20(1): 23-30.
    Information extraction (IE) technology can provide a high-quality service for retrieval. Targeting events in web news, this paper constructs a system that extracts and integrates the key information of events that interest people. The methodologies and strategies of the system are as follows: (1) Extraction rules are built in terms of sentence patterns, and event information is extracted directly from text in which temporal phrases (TP) and spatial phrases (SP) are recognized and normalized; the extraction system can thus be implemented easily because complex syntactic parsing is skipped. (2) The same event in different documents is linked by the normalized TP and SP of the event, and the information associated with an event is merged. (3) When a new event appears in a text, the text is segmented, so that isolated information about an event in the same segment can be merged into the corresponding event. Preliminary experiments show that the methodologies and strategies in this paper are feasible.
  • LUO Wei-hua,YU Man-quan,XU Hong-bo,WANG Bin,CHENG Xue-qi
    2006, 20(1): 31-38.
    Topic Detection and Tracking (TDT) is evaluation-driven research that aims to organize and utilize streams of text according to events. Since it was brought forward in 1996, it has attracted more and more attention. Based on a study of today's mature algorithms, this paper proposes an algorithm of division and multi-level clustering with multi-strategy optimization. The core idea is to divide all data into groups, each with intrinsic relevance; to cluster within each group to produce micro-clusters; and then to cluster all micro-clusters to produce the final topics. During this process, various strategies are employed to improve the clustering. A system implementing the algorithm has been tested on the TDT4 corpus, and the test indicates that the algorithm is among the best currently available.
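The divide-then-cluster-twice idea can be sketched with a simple single-pass cosine clusterer applied at both levels. This is only a skeleton under assumed thresholds; the paper's multi-strategy optimizations are not reproduced here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of a list of sparse vectors."""
    c = {}
    for v in vectors:
        for k, x in v.items():
            c[k] = c.get(k, 0.0) + x / len(vectors)
    return c

def single_pass(vectors, threshold):
    """Assign each vector to the most similar existing cluster, or start a new one."""
    clusters = []
    for v in vectors:
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(v, centroid(c))
            if s >= best_sim:
                best, best_sim = c, s
        (best.append(v) if best else clusters.append([v]))
    return clusters

def multilevel(groups, micro_t=0.5, macro_t=0.3):
    """Micro-cluster within each group, then cluster micro-centroids into topics."""
    micros = [centroid(c) for g in groups for c in single_pass(g, micro_t)]
    return single_pass(micros, macro_t)

groups = [[{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}], [{"a": 1.0}]]
print(len(multilevel(groups)))  # two topics: one around "a", one around "c"
```

Clustering micro-cluster centroids rather than raw documents is what keeps the final pass cheap on a large stream.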
  • JIANG Di
    2006, 20(1): 39-45.
    This paper discusses the classification of Tibetan verbs according to their semantic and syntactic types, which describe the number of arguments and the characteristics of the components in a sentence. There are 12 types of verbs in Tibetan: stative verbs, action verbs, cognition verbs, perception verbs, verbs of change, directional verbs, narration verbs, copulas, verbs of possession, existential verbs, interrelation verbs, and causative verbs. Each type requires different case markers for nouns in different positions, different word orders, or different syntactic particles. All the semantic and syntactic features of verbs and the structural models described in this paper can be used as information in the dictionary we are now constructing. There is no doubt that the analysis of the semantic and syntactic features of Tibetan verbs is infrastructure for Tibetan natural language processing.
  • YANG Er-hong,FANG Ying,LIU Dong-ming,QIAO Yu
    2006, 20(1): 46-51,99.
    This paper presents the results of the '863 Chinese and Interface Technology' integrative evaluation of automatic Chinese word segmentation and part-of-speech (POS) tagging held in Beijing in 2003. It describes the evaluation content, the evaluation method, the selection of test questions, the test guidelines, etc., and summarizes the types of word segmentation and POS tagging errors in the tested results. We also present a flexible automatic method used in the evaluation. Finally, we give some analysis of the results and make some recommendations for the future.
  • FENG Chong,CHEN Zhao-xiong,HUANG He-yan,GUAN Zhen-zhen
    2006, 20(1): 52-60.
    Word segmentation is a fundamental task in Chinese processing. To overcome the difficulties traditional methods have in coping with varied application domains and evolving language phenomena, this paper adopts an unsupervised learning framework, using the EM algorithm to train an n-multigram language model. A new certainty-based active-learning segmentation algorithm is proposed, which combines labeled and unlabeled data to optimize the language model. In experiments it outperforms other unsupervised word segmentation algorithms.
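The EM training loop can be illustrated with a much-simplified hard-EM unigram variant (the paper trains a richer n-multigram model with active learning, which this sketch does not attempt): segment the corpus with the current word probabilities, then re-estimate the probabilities from the resulting counts.

```python
import math
from collections import defaultdict

def best_seg(text, probs, max_len=4):
    """Viterbi search: most probable segmentation under a unigram word model."""
    n = len(text)
    score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            w = text[i - l:i]
            p = probs.get(w, 1e-6 if l == 1 else 0.0)  # tiny floor keeps single chars viable
            if p > 0 and score[i - l] + math.log(p) > score[i]:
                score[i] = score[i - l] + math.log(p)
                back[i] = i - l
    seg, i = [], n
    while i > 0:
        seg.insert(0, text[back[i]:i])
        i = back[i]
    return seg

def em_train(corpus, rounds=5, max_len=4):
    """Hard EM: initialize counts from all substrings, then alternate segment/re-estimate."""
    counts = defaultdict(float)
    for line in corpus:
        for i in range(len(line)):
            for l in range(1, min(max_len, len(line) - i) + 1):
                counts[line[i:i + l]] += 1.0
    for _ in range(rounds):
        total = sum(counts.values())
        probs = {w: c / total for w, c in counts.items()}
        counts = defaultdict(float)
        for line in corpus:
            for w in best_seg(line, probs, max_len):
                counts[w] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

True EM would use expected (fractional) counts over all segmentations rather than the single best path; the hard variant is shown only because it is short.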
  • YU Kun,GUAN Gang,ZHOU Ming,WANG Xu-fa,CAI Qing-sheng
    2006, 20(1): 61-68.
    This paper presents an approach to resume information extraction based on cascaded two-layer text classification. The approach first divides a resume into blocks and strings, and divides the target information into general information and detailed information. It extracts general information by block segmentation and classification, then uses a fuzzy strategy to select the blocks that may contain predefined detailed information, and finally segments these blocks into strings and labels the strings with detailed-information classes. Experimental results on 1,200 Chinese resumes show that our approach is suitable for the information extraction and management of resumes.
  • LIN Da-zhen,LI Shao-zi
    2006, 20(1): 69-77.
    As far as NLP is concerned, the tense of the Chinese language is especially hard to tackle. One of the outstanding characteristics of Chinese is that tense is usually implied rather than explicit. Hence, rule-based solutions are far from suitable for recognizing tense in situations where tense-informing words are missing or where more than one such word is present. In this paper, we introduce a solution based on pattern classification, which evaluates each word in terms of its contribution to recognizing the tense of the sentence concerned. This solution proves effective when processing sentences containing no tense-informing words or more than one. Furthermore, the use of a linear discriminant function gives the solution the ability to process and train on multi-dimensional data, and helps to achieve decent performance. Evaluated under open conditions, the precision and recall of the solution for single sentences are 79.8% and 95.3%, respectively.
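Scoring every word's contribution with linear discriminant functions amounts to one weight vector per tense class and an argmax over the class scores. A minimal sketch with perceptron-style training on hypothetical toy sentences (the words, labels, and training rule here are illustrative, not the paper's actual setup):

```python
from collections import defaultdict

class LinearTenseClassifier:
    """Multi-class linear discriminants g_c(x) = w_c . x over bag-of-words features."""
    def __init__(self, classes):
        self.classes = classes
        self.weights = {c: defaultdict(float) for c in classes}

    def score(self, cls, features):
        return sum(self.weights[cls][f] for f in features)

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(c, features))

    def train(self, data, epochs=10):
        """Perceptron updates: reward the gold class, penalize the wrong guess."""
        for _ in range(epochs):
            for features, gold in data:
                guess = self.predict(features)
                if guess != gold:
                    for f in features:
                        self.weights[gold][f] += 1.0
                        self.weights[guess][f] -= 1.0

# Hypothetical toy data: each sentence as a bag of words, labelled with tense.
data = [
    (["他", "昨天", "去", "了"], "past"),
    (["他", "明天", "去"], "future"),
    (["他", "正在", "吃"], "present"),
]
clf = LinearTenseClassifier(["past", "present", "future"])
clf.train(data)
print(clf.predict(["她", "昨天", "来", "了"]))  # "past"
```

Because every word carries a weight, the classifier still produces a decision when no single tense-informing word is present, which is the case rule systems handle worst.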
  • SHI Jing,DAI Guo-zhong
    2006, 20(1): 78-86.
    Text inference is central to natural language applications. This paper presents an inference method based on HowNet, which organizes knowledge as a semantic net and infers by marker passing. The method introduces the construction-integration model, generates knowledge structures dynamically, and builds guided paths between text words. Examples from 16 inference classes are used to test it. The results show that ideal inferences can be extracted at low cost if enough context is given.
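Marker passing over a semantic net can be approximated, at its simplest, by a bounded breadth-first spread of markers from one word until they reach another; the path of links traversed is the inference chain. A sketch over a hypothetical mini-net (the nodes and links below are invented, and real marker passing also weights and guides the spread, which this omits):

```python
from collections import deque

def marker_passing_path(graph, src, dst, max_depth=4):
    """Spread markers outward from src by BFS; return the first path that reaches dst."""
    frontier, seen = deque([(src, [src])]), {src}
    while frontier:
        node, path = frontier.popleft()
        if node == dst:
            return path
        if len(path) > max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # markers never met within the depth bound

# Hypothetical mini semantic net in the spirit of HowNet concept links.
net = {
    "doctor": ["human", "cure"],
    "cure": ["disease"],
    "disease": ["patient"],
}
print(marker_passing_path(net, "doctor", "patient"))
```

The depth bound plays the role of marker energy decay: paths that grow too long are considered too weak to support an inference.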
  • HAN Zhi,LIU Chang-ping,YIN Xu-cheng
    2006, 20(1): 87-92.
    Character segmentation of mail addresses is a crucial step in address recognition for automatic postal mail sorting systems. In this paper, a character segmentation algorithm is proposed according to the characteristics of handwritten mail address characters. First, a simple segmentation is performed using structure-based methods, including vertical projection, connected-component extraction, and stroke-crossing-number analysis, to extract the block sequence from the mail address image. Next, candidate segmentation paths are created by merging neighboring blocks. These paths are then evaluated using character recognition confidence and knowledge from a known postal address database. An experiment with the algorithm was carried out on more than 500 real envelope images, achieving a correct sorting rate for address recognition of up to 78.61% and a rate for integrated address and postcode recognition of up to 95.42%.
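The vertical-projection step of the first stage can be sketched directly: count the foreground pixels in each column of the binary image and cut the image at empty columns. This is only the simplest of the three structural cues the paper combines:

```python
def vertical_projection(image):
    """Count foreground pixels (1s) in each column of a binary image (list of rows)."""
    return [sum(row[c] for row in image) for c in range(len(image[0]))]

def split_blocks(image):
    """Split the image into (start, end) column blocks at empty columns."""
    proj = vertical_projection(image)
    blocks, start = [], None
    for c, v in enumerate(proj):
        if v > 0 and start is None:
            start = c
        elif v == 0 and start is not None:
            blocks.append((start, c))
            start = None
    if start is not None:
        blocks.append((start, len(proj)))
    return blocks

# Two tiny "characters" separated by an empty column.
img = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
print(split_blocks(img))  # [(0, 2), (3, 5)]
```

Projection alone over-segments characters with disconnected strokes, which is why the paper then merges neighboring blocks into candidate paths and lets recognition confidence choose among them.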
  • BAI Jun-mei,ZHANG Shi-lei,ZHANG Shu-wu,XU Bo
    2006, 20(1): 93-99.
    Speaker recognition (SR) has achieved excellent results on clean speech. However, noise or channel mismatch can cause significant performance degradation in practical applications. The focus of this paper is to address the problems of robust and efficient speaker identification (SI) in noisy environments. The main contributions center on two areas: signal processing based on Wiener filtering, and the integration of F0 and MFCC speaker features. Experimental results on the YOHO corpus show that the Wiener filter is an efficient front-end processing technique and that F0 is a robust feature for SR in noisy environments. Performance improves by more than 20% over the baseline system.
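The Wiener front end attenuates each frequency bin according to its estimated signal-to-noise ratio: the classical gain is H(f) = S(f) / (S(f) + N(f)), where S and N are the speech and noise power spectral densities. A minimal per-bin sketch with made-up PSD values (how S and N are estimated per frame is the hard part and is not shown):

```python
def wiener_gain(signal_psd, noise_psd):
    """Per-frequency Wiener gain H(f) = S(f) / (S(f) + N(f))."""
    return [s / (s + n) if (s + n) > 0 else 0.0
            for s, n in zip(signal_psd, noise_psd)]

def apply_gain(noisy_spectrum, gain):
    """Attenuate each bin of the noisy spectrum by its Wiener gain."""
    return [x * g for x, g in zip(noisy_spectrum, gain)]

# Hypothetical PSD estimates for three frequency bins.
signal_psd = [4.0, 1.0, 0.0]
noise_psd = [1.0, 1.0, 2.0]
print(wiener_gain(signal_psd, noise_psd))  # [0.8, 0.5, 0.0]
```

Bins dominated by speech pass nearly unchanged while noise-only bins are suppressed, which is why it works well as a front end before MFCC and F0 extraction.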
  • LIU Hao-jie,DU Li-min
    2006, 20(1): 100-106.
    The F0 contour of a Chinese prosodic word is greatly influenced by the stress of its syllables. Based on a mathematical model for producing the optimized F0 contour, this paper proposes a χ² fitting method to optimize the continuity, smoothness, shape, and average features of the F0 contour of a prosodic word, achieving optimization of the F0 contour under the influence of the stress of the corresponding syllables. Based on the HNM speech synthesis system, we show the optimized results for the 64 to...