2004 Volume 18 Issue 4 Published: 16 August 2004
  

  • ZHOU Qiang
    2004, 18(4): 2-9.
    Syntactically annotated corpora, commonly called 'treebanks', play an important role in empirical linguistics as well as in machine learning methods for natural language processing. After briefly summarizing several treebank annotation schemes for different languages, we propose a new annotation scheme for a Chinese treebank. Under this scheme, every Chinese sentence is annotated with a complete parse tree in which each non-terminal constituent is assigned two tags. One is the syntactic constituent tag, which describes its external functional relation with other constituents in the parse tree. The other is the grammatical relation tag, which describes the internal structural relation of its sub-components. The two tag sets consist of 16 and 27 tags respectively. Together they form an integrated annotation of each syntactic constituent in a parse tree through top-down and bottom-up descriptions. Based on this scheme, we built a 1,000,000-word Chinese treebank covering a balanced collection of journalistic, literary, academic, and other documents. Annotation experiments on different kinds of complex linguistic phenomena show the usability and compatibility of the scheme.
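The two-tag annotation described in this abstract can be pictured as a small tree data structure. The following is a minimal sketch only: the class name `Node`, the bracket notation, and the labels `S`, `VP`, `SP`, and `VO` are illustrative assumptions, not the tag sets or file format defined in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One parse-tree node: leaves carry a word, non-terminals carry two tags."""
    constituent_tag: Optional[str] = None  # external function of the constituent (hypothetical label)
    relation_tag: Optional[str] = None     # internal relation among its children (hypothetical label)
    word: Optional[str] = None             # set only on leaves
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def to_bracket(self) -> str:
        """Render the tree as [constituent/relation child child ...]."""
        if self.is_leaf():
            return self.word or ""
        inner = " ".join(child.to_bracket() for child in self.children)
        return f"[{self.constituent_tag}/{self.relation_tag} {inner}]"


def untagged_nonterminals(node: Node) -> List[Node]:
    """Collect non-terminal nodes that lack either of the two tags."""
    if node.is_leaf():
        return []
    missing = [node] if not (node.constituent_tag and node.relation_tag) else []
    for child in node.children:
        missing.extend(untagged_nonterminals(child))
    return missing


# Toy sentence with made-up labels: a clause whose predicate is a verb-object phrase.
tree = Node("S", "SP", children=[
    Node(word="我们"),
    Node("VP", "VO", children=[Node(word="学习"), Node(word="语言")]),
])
bracketed = tree.to_bracket()
print(bracketed)
print(len(untagged_nonterminals(tree)))
```

Keeping both tags on the same node means a completeness check such as `untagged_nonterminals` can verify that every non-terminal carries the full annotation.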
  • LI Yu-qin,SUN Li-hua
    2004, 18(4): 10-15.
    Automatic text categorization assigns texts to one or more classes according to some strategy. This paper first reviews three statistical text categorization techniques (k-nearest neighbor, support vector machine, and naïve Bayes) and analyzes their advantages and disadvantages. The weakness of statistical categorization is that precision decreases as the feature overlap between classes increases, especially in multi-level classification. To improve on statistical categorization, rule-based categorization is introduced. Finally, we combine the statistical and rule-based methods and design and implement a hybrid categorization system, which achieves very good categorization performance.
  • QU Wei-min,SUN Le,SUN Yu-fang
    2004, 18(4): 16-23.
    This paper studies the problem of producing ranked results for keyword search over text-rich XML documents. We analyze the challenges that XML data pose when traditional information retrieval techniques are applied to this problem. We then propose a dynamic element-oriented method for computing keyword weights, and a ranking function that considers both the frequency distribution and the structural distribution of keywords in the result. Experimental results demonstrate the effectiveness of our solution.
  • WANG Yun,YUAN Chun-fa
    2004, 18(4): 24-31.
    In recent years, temporal information processing and extraction have received increasing attention. Nevertheless, only a few researchers have investigated how to recognize the temporal expression that corresponds to an event in Chinese text. This paper investigates both temporal information extraction and the determination of the mapping between an event and its temporal expression. Unlike many other techniques, we use a machine learning method, the transformation-based error-driven learning algorithm, to determine the time-event mapping. The method acquires the analytical rules automatically. The system first builds an initial time-event tagger. Then, through machine learning, it obtains a patch rule set that improves the performance of the initial tagger. With the patch rule set, the system achieves a 6.5% decrease in error rate for time-event mapping determination. The experiment indicates that transformation-based error-driven learning is a good complement to the rule-based method.
  • YANG Yun,ZHOU Chang-le,WANG Xue-mei,DAI Shuai-xiang
    2004, 18(4): 32-37.
    This paper introduces a computational model of Chinese metaphor for machine understanding of Chinese. By analyzing large-scale samples of Chinese metaphors, we classify Chinese metaphors on the basis of understanding. The cognitive features of Chinese metaphor are also considered to improve the classification. The classification focuses on the similarity between the tenor and the vehicle in a metaphor, reflecting the mode and difficulty of metaphor understanding. The knowledge relevant to metaphor understanding is also discussed. The classification is verified statistically, and a program was developed to validate its soundness. Finally, an understanding-based classification system for Chinese metaphor is put forward, compared with other classifications, and given a linguistic explanation.
  • XU Yong,XUN En-dong,JIA Ai-ping,SONG Rou
    2004, 18(4): 38-44.
    This paper presents an experimental web-based term definition retrieval system. It gives users a convenient way to obtain definition-related knowledge about newly emerged terms such as Clone and ADSL. The system consists mainly of two modules: a web crawling module and a term definition matching module. Built on a multi-threaded architecture, the web crawling module downloads web pages efficiently, while the term definition matching module simultaneously searches them for term definitions using a set of linguistic patterns for term definitions. These patterns are obtained from technology journal corpora. Experiments show that the system retrieves term definitions from the web effectively and that the accuracy of the retrieved results is acceptable.
  • CHEN Yan,SUN Yu-fei,ZHANG Yu-zhi
    2004, 18(4): 45-50.
    To overcome the weaknesses of conventional segmentation algorithms in OCR, this paper presents a new segmentation method for grayscale document images. Key features of the method include grading the gray levels of the pixels in the image and constructing a tree structure for the whole document image. By dividing the branches and leaves of this tree, characters, pictures, and forms can be segmented correctly. Experimental results show that the method is very effective for documents containing both Chinese and English characters and for documents with different backgrounds.
  • ZHANG Yu-hua,ZHOU Ke-lan
    2004, 18(4): 51-55.
    The main way of entering Chinese characters into a computer is the encoding-based Chinese input method. Evaluating Chinese input methods scientifically is very important, because it helps developers improve the technology and helps users make their choice. Based on practical application, this paper proposes a way to evaluate the performance of Chinese input methods. It describes how to establish the input rules of any Chinese input method, how the computer simulates the input process for a given input method to obtain its character and phrase code tables, and how to use the code tables to evaluate the input method according to the national information technology standard.
  • CAO Juan,ZHOU Jing-ye
    2004, 18(4): 56-60.
    In this paper we put forward a new concept, the degree of cohering of Chinese strings, and its computation. Its value reflects how closely two strings are related. The method takes full account of the context of the Chinese strings and the local usage frequency of the words. Its definition and examples of its application to word segmentation are presented. Compared with the mutual information method proposed previously, this method can solve some difficult problems in word segmentation and improves precision.
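For orientation, the mutual information baseline that this abstract compares against can be sketched as follows. This is the standard pointwise mutual information between adjacent characters, not the degree of cohering itself, whose definition the abstract does not give; the function name and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def adjacent_mutual_information(corpus: List[str]) -> Dict[Tuple[str, str], float]:
    """Pointwise mutual information log2(p(xy) / (p(x) * p(y))) for adjacent characters."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        unigrams.update(sentence)                    # count single characters
        bigrams.update(zip(sentence, sentence[1:]))  # count adjacent character pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores: Dict[Tuple[str, str], float] = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (made up): pairs that co-occur often score higher and are less
# likely to be split by a segmenter that cuts at low-scoring boundaries.
toy_scores = adjacent_mutual_information(["中国人民", "中国语言", "人民语言"])
for pair, score in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print("".join(pair), round(score, 3))
```

Because this score uses only global corpus counts, it ignores the surrounding context of a string, which is the limitation the abstract says the degree of cohering is meant to address.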
  • ZHANG Xiao-heng
    2004, 18(4): 61-66.
    Chinese Character Component Standard of GB13000.1 Character Set for Information Processing is an important document for the standardization of Chinese character input methods. Yet when applied to the design and implementation of a nontrivial Chinese character input system, the standard runs into a number of difficulties: the large and hard-to-remember number of coding components, the hard-to-apply definition of basic components, and the poor rules for component disassembly and assembly. The sources of these difficulties include (a) definition of basic components by enumeration, (b) disassembly and assembly of components based on the etymology and formation of characters, (c) reliance on judging the character-forming capability of candidate components, and (d) over-emphasis on restricting the number of basic components. To overcome these difficulties, we urgently need convenient and reliable rules for component identification and segmentation, which can be built on the existing component standard by taking full advantage of the form features of Chinese characters. The feasibility and effectiveness of the proposed methodology have been verified by the successful development of the ZYQ Chinese character input system.
  • CHU Min
    2004, 18(4): 67-72.
    This paper explores the uncertainty of prosody in a speech corpus that contains two read versions of 1000 sentences, recorded by a professional voice talent under the same linguistic and affective planning. It is found that corresponding prosodic features in the two versions vary over a rather wide range. The scope of local variations can be as large as 45-50% of the overall variation range of a speaker. Based on this observation, the paper proposes a minimum error-rate criterion (MERC) to replace the traditional maximum correct-rate criterion in prosody generation. Furthermore, it proposes an approach for integrating the MERC into the unit selection algorithm. Among all instances of a speech unit, those with the lowest likelihood of producing unnatural prosody are picked out first, and then the most suitable path is selected from all prosodically equivalent candidates under a smoothness criterion, to ensure the smoothest concatenation of all units on this path.
  • ZHUANG Li,BAO Ta,ZHU Xiao-yan
    2004, 18(4): 73-79.
    This paper introduces some of the speech and language processing techniques used in the "Aurora" software system for the blind, developed by the State Key Laboratory of Intelligent Technology and Systems. The system obtains and analyzes screen information that requires feedback and reads it aloud by means of a speech synthesis platform. Using natural language processing techniques, including Chinese word segmentation and language modeling, the system converts Chinese characters to Braille and exports the information to the Braille display service. Users can obtain the information they need by touching the Braille display device. Conversely, users can input Braille with the Braille input tool, and the input is converted to character text.
  • LIU Peng,WANG Zuo-ying
    2004, 18(4): 80-85.
    In this paper, we investigate the use of visual features in Mandarin multimodal speech recognition. An audio-visual fusion strategy based on the multi-stream hidden Markov model is presented. Then the key technologies related to visual features, including lip location and visual feature extraction, are discussed. First, we study the lip location algorithm based on model matching. Subsequently, the low-level visual feature based on linear transform is investigated and compared with the high-level visual feature based on active shape models. Experiments show that, with visual features used, the word error rate of the first candidate at the acoustic level is reduced by 36.09% relative to the audio-only speech recognition system. Further experiments demonstrate that our audio-visual system provides a significant robustness enhancement in noisy environments.
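Two ideas in this abstract lend themselves to a short worked sketch: the per-state score of a two-stream hidden Markov model and the meaning of a relative error-rate reduction. The code below uses the common textbook formulation in which the stream log-likelihoods are combined with exponent weights that sum to one; that weighting convention, the function names, and all numbers are assumptions for illustration and are not taken from the paper.

```python
def fused_log_likelihood(log_b_audio: float, log_b_visual: float, audio_weight: float) -> float:
    """Two-stream state score: w * log b_audio + (1 - w) * log b_visual.

    Assumes the stream weights are non-negative and sum to 1 (a common
    convention, not confirmed by the abstract).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction: (baseline - new) / baseline."""
    if baseline_error <= 0.0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers only: in clean audio the audio stream is trusted more,
# in noise the weight can be shifted toward the visual stream.
clean_score = fused_log_likelihood(-2.0, -4.0, audio_weight=0.75)
noisy_score = fused_log_likelihood(-6.0, -4.0, audio_weight=0.25)
print(clean_score, noisy_score)

# A drop from a 20% to a 15% word error rate is a 25% relative reduction,
# even though the absolute reduction is only 5 points.
print(relative_reduction(0.20, 0.15))
```

The second function is the sense in which a figure such as the abstract's relative reduction should be read: it is measured against the baseline error rate, not in absolute percentage points.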