Journal of Chinese Information Processing

Select

Internet-oriented Chinese New Words Detection

ZOU Gang,LIU Yang,LIU Qun,MENG Yao,YU Hao,Nishino Fumihito,KANG Shi-yong

2004, 18(6): 2-10.

Abstract ( ) PDF ( )

Knowledge map

Save

With the fast development of the society ,more and more new words come out in our life. It is one of the important topics in Chinese natural language processing to collect those new words. A method is presented for detecting these new words automaitcally in this paper. Through analysing webpages grabbed from the Internet , a large word and string set is built , which new words are detected from and filtered by rules. At last new words which exist in the webpages grabbed are extracted. The system built in this way can find new words in any length and in any field. Now it is applying to the compilation of Modern Chinese New Word Information Dictionary. It reduced human labor a lot in practise.

Select

Chinese Name Identification Integrated Decision Tree Learning

WANG Zhen-hua,KONG Xiang-long,LU Ru-zhan,LIU Shao-ming

2004, 18(6): 11-16.

Abstract ( ) PDF ( )

Knowledge map

Save

Chinese person name identification is a subfield of Named Entity Identification in natural language processing. This identification is divided into three stages in this paper : extraction , classification , and disambiguation. The candidate Chinese person names are extracted using statistical information. The morphological , syntax , and semantic features of the context are also extracted to compose the sample of classification. The estimation of the candidate is deemed to classification. We classify every candidate using decision tree to distinguish whether it is a real Chinese person name. In the end , the inconsistency in classification is disambiguated. Recall and precision are all above 90% in experiments using this method.

Select

Co-occurrence Word Retrieval Based on the Lexical Attraction and Repulsion Model

GUO Feng,LI Shao-zi,ZHOU Chang-le,LIN Ying,LI Sheng-rui

2004, 18(6): 17-23.

Abstract ( ) PDF ( )

Knowledge map

Save

Co-occurrence word retrieval is very important in information mining and natural language processing. But traditional co-occurrence word retrieval methods used only a single statistic method , so the result is very imprecise , and needs lots of manual collation. In this paper we present a co-occurrence words extraction algorithm based on the lexical attraction and repulsion model , and combine some common statistical methods with the algorithm to improve its effect. In the open test , our system’s Interesting performance is 60.87%. We show good performance in speed and precision when applied the algorithm on a co-occurrence search system based on web.

Select

Dual Distributional Semantic Knowledge Acqusition with Small Corpora and Machine Readable Dictionaries

HAO Xiu-lan,YANG Er-hong

2004, 18(6): 24-30.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper presents a systemfor unsupervised verb semantic knowledge acquisition using small corpus and a machine-readable dictionary (MRD) . The system does not depend on sense-tagged corpus , but learns a set of typical usages listed in the MRD usage examples for each of the senses of a polysemous verb in the MRD definitions and uses verb-object co-occurrences acquired from the corpus. This paper concentrates on the problemof data sparseness in two ways. First , extending word similarity measures from direct co-occurrences to co-occurrences of co-occurred words , we compute the word similarities using not co-occurred words but co-occurred clusters. Second , we acquire IS-A relations of nouns from the MRD definitions. It is possible to cluster the nouns roughly by the identification of the IS-A relationship. By these methods , two words may be considered similar even if they do not share any word. Experiments show that this method can learn from very small training corpus and achieve over 85.7% correct disambiguation performance without a restriction of word’s senses.

Select

Integrating Class Frequency Into Association Rules Based Chinese Text Categorization

QIAN Tie-yun,WANG Yuan-zhen,FENG Xiao-nian

2004, 18(6): 31-37.

Abstract ( ) PDF ( )

Knowledge map

Save

In this paper , a new algorithm that integrates class frequency into association rules based document classification is introduced into Chinese text categorization. This algorithm views each document as a transaction and each term as an item. The class frequency of a term is used to filter the words that are irrelevant to classification , and the mining algorithm of association rules is used to mine the correlation between item and category. Class character words sets are formed basing on the rules , and unlabeled documents are classified by intersecting with these sets. Experiments confirm that this method has a promising recall , precision rate and F-Measure while speeding up both training and test time.

Select

E-commerce Oriented Knowledge Description Language

HE Jian,QIN Zheng,JIA Xiao-lin,XIE Guo-tong

2004, 18(6): 38-43.

Abstract ( ) PDF ( )

Knowledge map

Save

According to the increasingly intelligent and mobile characteristic of E-commerce , an XML-based and ontology-supported E-commerce Knowledge Description Language (KDL) is first presented , which has three-tier structure (Core KDL , Extended KDL and Complex KDL) , and takes advantages of strongpoint of ontology , XML , description logics , frame-based systems. And then , we introduce the XML-Based syntax of KDL , and give the methods of translating KDL into first order logic. At last , the reasoning ability of KDL proved by experiment is illustrated in detail.

Select

An HNC Approach to the Resolution of Syntactic Structural Ambiguity

ZHANG Ke-liang

2004, 18(6): 44-53.

Abstract ( ) PDF ( )

Knowledge map

Save

Disambiguation has always been the focus of natural language understanding and processing. Successful disambiguation relies on the correct understanding of a given context. The HNC theory is characteristic of its formalized representation of conceptual primitives , its arrangement of concepts in a hierarchical network , and its development of the sentence category (SC) and sentence format (SF) systems. All this provides the utmost possibility for resolving ambiguity in natural languages. The overall principle for HNC-based disambiguation is to take sentence as the basis for disambiguation and to integrate micro disambiguation into macro disambiguation. In the case of V + NP1 + 的+ NP2 , a triple ambiguous syntactic structure in the HNC perspective , proper rules are suggested for its disambiguation.

Select

Research on Software Architecture for Language Engineering : A Survey

FENG Chong,CHEN Zhao-xiong,HUANG He-yan

2004, 18(6): 54-61,73.

Abstract ( ) PDF ( )

Knowledge map

Save

Providing reference architectures for general natural language applications , software architecture for language engineering has gradually became one of the main research fields of language engineering in the past several years. This paper makes a short review on this fresh area , introduces its primary concepts , and discusses some representative progresses. Based on the analysis to the current work , we present some promising direction for future research.

Select

Research on Uighur Word Segmentation

Gulila Adongbieke,Mijit Ablimit

2004, 18(6): 62-66.

Abstract ( ) PDF ( )

Knowledge map

Save

Root-affix and syllable segmentation of Uighur word bring great facilities in Uighur natural language processing. Affix in Uighur are various , they link between themselves and to a root in different ways. But there are intricate rules in their linkage. In this paper , we propose methods of handling with the basic phonetic features of Uighur words , such as the final vowel change , rules of vowel and consonant harmony , and syllable segmentation. We also summarized the word structures and phonetic structures of Uighur , and proposed some rules of Uighur word segmentation and implementation of this segmentation. According to the implementation of these rules on regular words from scientific publishing in Xinjiang , the accuracy is 95%.

Select

A Contrastive Investigation of Vowel Systems Between Standard Mandarin and Shanghai-Accented Mandarin

YU Jue,LI Ai-jun,WANG Xia

2004, 18(6): 67-73.

Abstract ( ) PDF ( )

Knowledge map

Save

Dialectal differences are widely investigated for dialect identification , language (L2) learning and pronunciation modeling for Automatic Speech Recognition (ASR) . Especially in Chinese ASR systems , how to deal with the accent issue becomes a big challenge due to the variability of the language. We compared these pure monophthongs [?a ,u ,? ?,y ,i] for 10 SM (Standard Mandarin) and 20 ASH (Shanghai-accented Mandarin) speakers in NOKIA-CASS corpus and tried to find out the differences in monophthongs between SC and ASH: (1) under the influence of its dialectal vowel inventory , the vocalic space of ASH would be inevitably more peripheral. (2) there is a large overlap between the two vowel ellipses of in ASH speakers while in SC there is not. (3) in comparison with SC , the first two formants of [ y ,i ] are all drawn much further in ASH speakers. Meanwhile the formant pattern of is very similar to that of in ASH speakers. (4) vowel [??] in most speakers has a tendency of diphthongization , especially in ASH speakers.

Select

Research and Realization of Embedded Speech Recognition System

FANG Min,PU Jian-tao,LI Cheng-rong,TAI Xian-qing

2004, 18(6): 74-79.

Abstract ( ) PDF ( )

Knowledge map

Save

Proposed in this paper is a novel speaker-independent speech recognition system , which is command-variable and suitable for realization based on embedded platform. Compared with traditional speaker-independent speech recognition system based on PC , our system is featured small storage and computation cost. The system is evaluated on several embedded platforms that are specially designed. According to the result of the evaluation , the feasibility of speaker-independent speech recognition system based on embedded platform is proved and the least requirement for the hardware is given. Then we analyzed the main problems and difficulties in the development of high performance speech recognition SOC (System On a Chip) from the point of technology , and pointed out some future works.

Select

Study on Hierarchical Speech Recognition

XU Ming-xing|YANG Da-li|WU Wen-hu

2004, 18(6): 80-85.

Abstract ( ) PDF ( )

Knowledge map

Save

Hierarchical recognition has been proposed for a long time in the pattern recognition field. Although it is a familiar action when human performs a recognition task , there is not an effective and systematic method to implement it for the speech recognition. This paper presents our recent experimental results on this topic , which uses the principle of sub-space partition to realize a hierarchical recogntion and a tree-based architecture to organize multi-recognizers. The results show that the proposed algorithm can achieve about 10% error reduction compared with traditional methods. In future works , we will test all Chinese syllables and extend them for the continous speech recogntion.

Please choose a citation manager

Content to export

2004 Volume 18 Issue 6 Published: 15 December 2004