1997 Volume 11 Issue 4 Published: 15 November 1997
  

  • Zhou Ming, Pan Haihua
    1997, 11(4): 2-11.
    A transformation-based method is applied to tag the syntactic function of each word in a Chinese sentence. The system takes as input a Chinese sentence with word boundaries and part-of-speech information, and outputs the syntactic function of every word in the sentence. To this end, a Chinese dependency formalism consisting of 44 kinds of dependency relations is first designed, and a corpus of 1300 sentences tagged with dependency relations is prepared in an efficient man-machine interactive mode. Of these sentences, 1100 are used as the training corpus and the remaining 200 are used for testing. In total, 60 ordered tagging transformations are acquired from 17 kinds of transformation templates with the transformation-based method. To improve robustness and coverage, each new word is initially annotated with the most frequent dependency relation for its part of speech. The method is simple and easy to implement, and the experiment shows promising preliminary results.
  • Zhang Xiaoheng, Wang Lingling
    1997, 11(4): 22-33.
    As important proper nouns, Chinese names of organizations and institutions play an indispensable role in language communication. Unfortunately, because they are unlimited in number, constantly created and discarded, and relatively long and complex, most of these names have not found their way into the Chinese dictionaries of computer systems. Linguistically, however, these proper nouns can be viewed as a special group of compound nouns and as a simple category of noun phrase, with their own formation rules and physical markers. This paper presents a pioneering discussion of the analysis of Chinese names of organizations and institutions from a computational point of view. Useful linguistic rules have been drawn from the discussion and applied to identifying names of organizations and institutions in the 6,000,000-character Mainland-Hongkong-Taiwan corpus of modern Chinese developed by Hong Kong Polytechnic University. Preliminary experiments show that both precision and recall for identifying names of colleges and universities are over 96%.
  • Zeng Xiangning, Zhang Xinzhong, Shen Lanshen, Ren Kunpeng
    1997, 11(4): 34-42.
    This paper introduces a system for the automatic analysis and recognition of printed forms and their characters. It uses an analysis algorithm designed specifically for printed forms that exploits the form's feature points. Based on form image analysis, the algorithm analyzes the form lines with high quality and, by taking into account the intersections between form lines and characters, correctly extracts the character blocks from the form image. Next, the segmentation module separates Chinese characters from English characters; the recognition module then recognizes all contents; finally, the system restores the form in text format. Extensive experiments have shown the system to be effective.
  • Zhou Qiang, Zhang Wei, Yu Shiwen
    1997, 11(4): 43-52.
    This paper discusses some basic issues in building a Chinese treebank, including a Chinese syntactic tagset suitable for both automatic analysis and manual annotation, a working standard for Chinese treebank construction, and a man-machine mutually dependent corpus processing model. An automatic syntactic tagging system for the Chinese language is then proposed and some experimental results are given. Some ideas for building a large-scale Chinese treebank are also discussed.
  • Zhao Xusheng
    1997, 11(4): 53-60.
    The traditional way of processing text causes more and more problems today. In this paper, we put forward a new method called COTP (Character Oriented Text Process), which unifies the world's character codes, separates the logical character from its storage format, and processes text with such characters as units. This method brings convenience to both programmers and users in editing, displaying and searching texts in Chinese and other languages. It can also make the size of the character set unbounded. We first explain VLC and COTP, and then describe how to take advantage of COTP in several different ways.
  • Liu Xiaohu, Li Sheng, Wu Wei
    1997, 11(4): 61-66.
    This paper presents a method for dictionary look-up and fast word input using a subset of the letters in a word. Lexical aid, i.e. dictionary look-up, is very important in a MAT (machine aided translation) system. In many cases, however, the user does not clearly remember the spelling of a word, i.e. the user remembers only several of its letters. This paper addresses this fuzzy dictionary look-up problem. The central technique of the method is full-text indexing, which depends on a special index of the dictionary. Furthermore, using the full-text index and the fuzzy dictionary look-up technique, we implement a fast word input function in the MAT system.
  • Mei Yong, Xu Bingzheng
    1997, 11(4): 67-73.
    To improve the Chinese speech recognition rate, this paper employs a spelling-to-character conversion method based on a Markov model. In the implementation, we put forward a simplified model. The model not only ensures real-time performance but also serves as the basis of our future work; in addition, a new solution to the sparseness of the training text is put forward. Using the above model, the simulation test shows that the forward-backward Markov model has better recognition performance than the others. Furthermore, models whose output units are words perform better than those whose output units are characters.
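
    As a rough illustration of the kind of Markov-model conversion from spelling to characters described in the last abstract, the sketch below decodes a syllable sequence with a first-order Markov chain over characters and a Viterbi search. It is not the authors' simplified model or their forward-backward variant, whose details the abstract does not give. The candidate table, the probabilities, the floor value used for unseen events, and the function name `convert` are all hypothetical and chosen only to make the example runnable.

```python
import math

# Hypothetical toy data: candidate characters for each spelled syllable.
CANDIDATES = {
    "zhong": ["中", "种"],
    "wen": ["文", "问"],
}

# Hypothetical probabilities of a first-order Markov chain over characters.
START = {"中": 0.6, "种": 0.4}
TRANS = {
    ("中", "文"): 0.7, ("中", "问"): 0.3,
    ("种", "文"): 0.2, ("种", "问"): 0.8,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen events (an assumption)


def convert(syllables):
    """Return the most probable character string for a non-empty syllable list."""
    # best maps each candidate character to (log-probability, best path ending in it).
    best = {
        c: (math.log(START.get(c, FLOOR)), [c])
        for c in CANDIDATES[syllables[0]]
    }
    for syl in syllables[1:]:
        nxt = {}
        for c in CANDIDATES[syl]:
            # Choose the predecessor that maximizes the path score into c.
            score, path = max(
                (prev_score + math.log(TRANS.get((p, c), FLOOR)), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            nxt[c] = (score, path + [c])
        best = nxt
    return "".join(max(best.values())[1])


print(convert(["zhong", "wen"]))
```

    With these toy numbers the path through "中" then "文" has the highest probability (0.6 × 0.7), so that pair is printed. A model whose output units are words, as the abstract favors, would use word candidates and word-to-word transitions in place of the character tables.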