2004 Volume 18 Issue 6 Published: 15 December 2004
  

  • ZOU Gang,LIU Yang,LIU Qun,MENG Yao,YU Hao,Nishino Fumihito,KANG Shi-yong
    2004, 18(6): 2-10.
    With the rapid development of society, more and more new words appear in everyday life, and collecting them is an important topic in Chinese natural language processing. This paper presents a method for detecting such new words automatically. By analyzing web pages crawled from the Internet, a large set of words and strings is built, from which candidate new words are detected and then filtered by rules. Finally, the new words present in the crawled web pages are extracted. A system built in this way can find new words of any length and in any field. It is now being applied to the compilation of the Modern Chinese New Word Information Dictionary, where it has greatly reduced the human labor required in practice.
  • WANG Zhen-hua,KONG Xiang-long,LU Ru-zhan,LIU Shao-ming
    2004, 18(6): 11-16.
    Chinese person name identification is a subfield of named entity identification in natural language processing. In this paper, identification is divided into three stages: extraction, classification, and disambiguation. Candidate Chinese person names are extracted using statistical information, and morphological, syntactic, and semantic features of the context are extracted to compose the classification samples. Evaluating a candidate is treated as a classification problem: a decision tree classifier decides whether each candidate is a real Chinese person name. Finally, inconsistencies in the classification results are resolved by disambiguation. In experiments, both recall and precision of this method are above 90%.
  • GUO Feng,LI Shao-zi,ZHOU Chang-le,LIN Ying,LI Sheng-rui
    2004, 18(6): 17-23.
    Co-occurrence word retrieval is very important in information mining and natural language processing. However, traditional co-occurrence retrieval methods rely on a single statistical measure, so the results are imprecise and require extensive manual collation. This paper presents a co-occurrence word extraction algorithm based on a lexical attraction and repulsion model, and combines several common statistical measures with the algorithm to improve its effectiveness. In an open test, the system's interestingness performance is 60.87%. The algorithm also shows good speed and precision when applied to a Web-based co-occurrence search system.
  • HAO Xiu-lan,YANG Er-hong
    2004, 18(6): 24-30.
    This paper presents a system for unsupervised acquisition of verb semantic knowledge using a small corpus and a machine-readable dictionary (MRD). The system does not depend on a sense-tagged corpus; instead, for each sense of a polysemous verb, it learns a set of typical usages from the usage examples listed in the MRD definitions, together with verb-object co-occurrences acquired from the corpus. The paper addresses the data sparseness problem in two ways. First, by extending word similarity measures from direct co-occurrences to co-occurrences of co-occurring words, word similarities are computed using co-occurring clusters rather than co-occurring words alone. Second, IS-A relations between nouns are acquired from the MRD definitions, which makes it possible to cluster the nouns roughly by identifying IS-A relationships. With these methods, two words can be considered similar even if they share no common co-occurring word. Experiments show that the method can learn from a very small training corpus and achieves over 85.7% correct disambiguation without restricting the words' senses.
  • QIAN Tie-yun,WANG Yuan-zhen,FENG Xiao-nian
    2004, 18(6): 31-37.
    This paper introduces into Chinese text categorization a new algorithm that integrates class frequency into association-rule-based document classification. The algorithm views each document as a transaction and each term as an item. The class frequency of a term is used to filter out words irrelevant to classification, and an association rule mining algorithm is used to mine the correlations between items and categories. Sets of class-characteristic words are formed from the mined rules, and unlabeled documents are classified by intersecting them with these sets. Experiments confirm that the method achieves promising recall, precision, and F-measure while speeding up both training and testing.
  • HE Jian,QIN Zheng,JIA Xiao-lin,XIE Guo-tong
    2004, 18(6): 38-43.
    To meet the increasingly intelligent and mobile character of e-commerce, this paper first presents an XML-based, ontology-supported e-commerce Knowledge Description Language (KDL). KDL has a three-tier structure (Core KDL, Extended KDL, and Complex KDL) and combines the strengths of ontologies, XML, description logics, and frame-based systems. The paper then introduces the XML-based syntax of KDL and gives methods for translating KDL into first-order logic. Finally, the reasoning ability of KDL, verified by experiment, is illustrated in detail.
  • ZHANG Ke-liang
    2004, 18(6): 44-53.
    Disambiguation has always been a focus of natural language understanding and processing, and successful disambiguation relies on a correct understanding of the given context. The HNC theory is characterized by its formalized representation of conceptual primitives, its arrangement of concepts in a hierarchical network, and its development of the sentence category (SC) and sentence format (SF) systems. All of this offers great potential for resolving ambiguity in natural languages. The overall principle of HNC-based disambiguation is to take the sentence as the basic unit of disambiguation and to integrate micro-level disambiguation into macro-level disambiguation. For the structure V + NP1 + 的 + NP2, a triply ambiguous syntactic structure from the HNC perspective, appropriate disambiguation rules are proposed.
  • FENG Chong,CHEN Zhao-xiong,HUANG He-yan
    2004, 18(6): 54-61,73.
    By providing reference architectures for general natural language applications, software architecture for language engineering has gradually become one of the main research fields of language engineering over the past several years. This paper gives a short review of this young area, introduces its primary concepts, and discusses some representative progress. Based on an analysis of current work, we present some promising directions for future research.
  • Gulila Adongbieke,Mijit Ablimit
    2004, 18(6): 62-66.
    Root-affix and syllable segmentation of Uighur words greatly facilitates Uighur natural language processing. Uighur affixes are various: they link to one another and to a root in different ways, governed by intricate rules. In this paper, we propose methods for handling the basic phonetic features of Uighur words, such as final vowel change, the rules of vowel and consonant harmony, and syllable segmentation. We also summarize the word structures and phonetic structures of Uighur, and propose rules for Uighur word segmentation together with their implementation. Applying these rules to regular words from scientific publications in Xinjiang yields an accuracy of 95%.
  • YU Jue,LI Ai-jun,WANG Xia
    2004, 18(6): 67-73.
    Dialectal differences are widely investigated for dialect identification, second-language (L2) learning, and pronunciation modeling for Automatic Speech Recognition (ASR). In Chinese ASR systems in particular, handling accents is a major challenge because of the variability of the language. We compared the pure monophthongs [a, u, ɤ, y, i] for 10 SM (Standard Mandarin) and 20 ASH (Shanghai-accented Mandarin) speakers in the NOKIA-CASS corpus, and tried to find the differences in monophthongs between SM and ASH: (1) under the influence of its dialectal vowel inventory, the vocalic space of ASH is inevitably more peripheral; (2) there is a large overlap between two of the vowel ellipses in ASH speakers, while in SM there is not; (3) compared with SM, the first two formants of [y, i] are drawn much further in ASH speakers, and the formant patterns of these two vowels are very similar in ASH speakers; (4) the vowel [ɤ] shows a tendency toward diphthongization in most speakers, especially ASH speakers.
  • FANG Min,PU Jian-tao,LI Cheng-rong,TAI Xian-qing
    2004, 18(6): 74-79.
    This paper proposes a novel speaker-independent speech recognition system with a variable command set, suitable for implementation on an embedded platform. Compared with traditional PC-based speaker-independent speech recognition systems, our system features small storage and computation costs. The system is evaluated on several specially designed embedded platforms. The evaluation results prove the feasibility of speaker-independent speech recognition on embedded platforms and establish the minimum hardware requirements. We then analyze, from a technical point of view, the main problems and difficulties in developing a high-performance speech recognition SoC (System on a Chip), and point out some future work.
  • XU Ming-xing,YANG Da-li,WU Wen-hu
    2004, 18(6): 80-85.
    Hierarchical recognition was proposed long ago in the pattern recognition field. Although it is a familiar strategy when humans perform recognition tasks, there has been no effective and systematic method for implementing it in speech recognition. This paper presents our recent experimental results on this topic, using the principle of sub-space partitioning to realize hierarchical recognition and a tree-based architecture to organize multiple recognizers. The results show that the proposed algorithm achieves about 10% error reduction compared with traditional methods. In future work, we will test all Chinese syllables and extend the method to continuous speech recognition.