2001 Volume 15 Issue 2 Published: 16 April 2001
  

  • XU Jia-lu
    2001, 15(2): 2-9.
The paper surveys the state of the art of Chinese information processing and the major obstacles it currently faces, arguing that the underlying factor blocking the development of Chinese information processing is the lag in systematic, in-depth study of contemporary Chinese. In recent years the mainstream of the Chinese information processing community has depended heavily on corpus-based methods, making full use of statistical relationships among words. The slow progress of Chinese information processing techniques shows that this scheme has strong limitations. To make computers more intelligent and capable of handling Chinese texts automatically, we must teach them more linguistic knowledge. This requires strengthening research on the Chinese language, particularly on semantics, to the maximum extent from a computational perspective. On this point, the author goes on to discuss ongoing research in China, which is being actively promoted along three distinct technical lines, weighs the advantages and disadvantages of each, and concludes with a proposed strategy of "sharing resources nation-wide, doing research hand-in-hand, and tackling key problems jointly", drawing on experience from the national "95" key project "Study on Lexicology of Contemporary Chinese Language for Information Processing" supervised by the author.
  • YU Jiang-sheng
    2001, 15(2): 10-16.
In this article, we present the algebraic structure of syntactic categories based on the monoid, and define categorial equations whose solutions are described by consistency and correlation. The result "If X is a solution of a categorial equation, then there exists a unique essential solution Y such that Y?X" makes it possible for the essential categories of a word to generate all its possible syntactic categories by deductive rules. Finally, the author describes the deductive system of syntactic categories from the viewpoint of Category Theory in mathematics.
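The idea that a word's essential categories generate derived categories through deductive rules can be illustrated with the two standard application rules of categorial grammar (this is a generic sketch, not the paper's monoid-based formalism; category names and the suffix-matching trick below are simplifying assumptions that ignore parenthesized subcategories):

```python
def reduce_pair(left, right):
    """Try to combine two adjacent categories; return the result or None."""
    if left.endswith("/" + right):       # forward application:  X/Y  Y  => X
        return left[: -(len(right) + 1)]
    if right.endswith("\\" + left):      # backward application: Y  X\Y  => X
        return right[: -(len(left) + 1)]
    return None

def derive(cats):
    """Greedily reduce a sequence of categories left to right."""
    stack = []
    for c in cats:
        stack.append(c)
        while len(stack) >= 2:
            r = reduce_pair(stack[-2], stack[-1])
            if r is None:
                break
            stack[-2:] = [r]
    return stack
```

For example, a transitive verb category `S\NP/NP` between two `NP`s reduces to the sentence category `S`.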
  • WANG Zhong-xiao,FAN Zhi-hua
    2001, 15(2): 17-23.
Building on the algorithm proposed in [1], this paper presents an adaptive hashing algorithm for Chinese characters. By introducing an oblivious policy and ordering Chinese characters according to their dynamic frequencies, the algorithm substantially improves the average search length for Chinese characters, better meeting the strict timing demands of applications driven by dynamic statistics of Chinese texts. In addition, a simpler hash function is given that works almost as well as the one in [1].
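The adaptive principle can be sketched as a chained hash table that reorders entries by access pattern, so frequently seen characters acquire short search chains. This is a minimal illustration under assumed details (a plain modulo hash and move-to-front reordering), not the paper's tuned hash function or exact frequency-sorting policy:

```python
class AdaptiveHash:
    """Chained hash table that moves each hit to the front of its bucket."""

    def __init__(self, nbuckets=256):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, ch):
        # simple modulo hash on the code point; the paper derives a tuned function
        return self.buckets[ord(ch) % len(self.buckets)]

    def access(self, ch):
        """Look up ch, counting probes; insert it on a miss."""
        bucket = self._bucket(ch)
        for i, entry in enumerate(bucket):
            if entry == ch:
                bucket.pop(i)
                bucket.insert(0, ch)   # adapt: frequent characters drift forward
                return i + 1           # probes taken to find it
        bucket.insert(0, ch)
        return len(bucket) - 1         # probes taken before the miss
```

After repeated accesses, high-frequency characters sit near their bucket heads, which is what shortens the average search length.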
  • HAN Ke-song,WANG Yong-cheng,CHEN Gui-lin
    2001, 15(2): 24-31.
In this paper we describe a fast algorithm for extracting high-frequency strings. Our approach uses hashing to avoid relying on a corpus or on word segmentation: high-frequency strings are extracted from statistical information alone. After processing prefixes and suffixes, the extracted high-frequency strings can serve as supplementary knowledge for unknown-word processing, word-sense disambiguation, and word weighting. Experimental results show that the algorithm is fast and works on arbitrary texts; the method performs well on novels and other real texts.
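The core statistics-only idea can be sketched as a one-pass substring count over raw text, with no word list and no segmenter (the n-gram length bounds and threshold here are assumed for illustration; the paper's prefix/suffix processing, which prunes fragments of longer strings, is omitted):

```python
from collections import Counter

def frequent_strings(text, min_len=2, max_len=4, min_count=2):
    """Count every substring of length min_len..max_len; keep the frequent ones."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1      # pure statistics, no lexicon
    return {s: c for s, c in counts.items() if c >= min_count}
```

Because only substring counts are used, the same code runs unchanged on novels or any other raw text.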
  • HUANG De-gen,YANG Yuan-sheng,WANG Xing,ZHANG Yan-li,ZHONG Wan-xie
    2001, 15(2): 32-38,45.
Identification of Chinese personal names is an important technique for improving the accuracy of automatic word segmentation. This paper proposes an effective statistical model for identifying Chinese names. It establishes a rewards-punishment mechanism and a supervised-learning mechanism, and defines a reliability measure for word segmentation within the model. Experiments show that precision and recall reach 95.97% and 95.52% respectively in the closed test, and 92.37% and 88.62% in the open test.
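The statistical core can be sketched as scoring a candidate string by the probability that its first character acts as a surname and the rest as given-name characters, accepting it when the score clears a reliability threshold. The probability tables and threshold below are illustrative toy values, not the paper's trained model, and the rewards-punishment updating is omitted:

```python
# assumed toy probabilities: P(char used as surname) and P(char used in a given name)
P_SURNAME = {"王": 0.90, "李": 0.90, "张": 0.85}
P_GIVEN   = {"伟": 0.70, "芳": 0.60, "明": 0.65, "是": 0.01}

def name_score(cand):
    """Product of surname and given-name character probabilities."""
    p = P_SURNAME.get(cand[0], 0.001)
    for ch in cand[1:]:
        p *= P_GIVEN.get(ch, 0.001)
    return p

def is_name(cand, threshold=0.1):
    return name_score(cand) >= threshold
```

A rewards-punishment mechanism in this setting would raise or lower the character probabilities whenever supervision confirms or rejects a proposed name.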
  • WANG Wei,ZHONG Yi-xin,SUN Jian,YANG Li
    2001, 15(2): 39-45.
This paper presents a word-segmentation ambiguity resolution scheme based on unsupervised training. Following the idea of EM, a language model is built incrementally by collecting fractional counts of patterns (such as bigram pairs) from all the segmentation candidates of a sentence. The learned language model is then incorporated into a statistical segmenter. Experiments show that this scheme resolves 85.36% of ambiguities on a test set in which every sentence contains at least one ambiguous part (accuracy measured per sentence).
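One EM-style step of this scheme can be sketched as follows: every segmentation candidate of a sentence contributes counts weighted by its probability under the current model, rather than a single hard count. This simplification uses unigram counts and an assumed smoothing constant where the paper accumulates bigram statistics:

```python
from collections import defaultdict

def fractional_counts(candidates, model):
    """candidates: list of segmentations (word lists); model: word -> prob."""
    def score(seg):
        p = 1.0
        for w in seg:
            p *= model.get(w, 1e-4)   # assumed floor for unseen words
        return p

    weights = [score(seg) for seg in candidates]
    total = sum(weights)
    counts = defaultdict(float)
    for seg, w in zip(candidates, weights):
        for word in seg:
            counts[word] += w / total   # fractional, not hard, counts
    return dict(counts)
```

Re-estimating the model from these counts and repeating gradually concentrates probability on the segmentations the data supports, without any labeled training text.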
  • ZHU Xiao-yan,WANG Yu,LIU Jun
    2001, 15(2): 46-51.
Mandarin is a tonal language: its tones are recognized using pitch-contour information, which can be expressed by fundamental frequencies. Classic approaches to fundamental-frequency smoothing, such as linear smoothing, median smoothing, and linear interpolation, do not work well when the fundamental frequency is detected incorrectly over several consecutive frames. In this paper, a new smoothing approach is presented, in which a search method is used to obtain a more accurate pitch contour. The approach is simple, reliable, and fast. Experimental results show that the new approach decreases the recognition error rate by 40%.
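A search over candidate corrections is one way such multi-frame errors can be repaired where frame-local smoothing fails. The sketch below is illustrative, not the paper's algorithm: it assumes the common halving/doubling failure mode of pitch trackers, gives each frame three candidates {f, f/2, 2f}, and runs a Viterbi-style search for the path with the smallest total frame-to-frame jump:

```python
def smooth_f0(f0):
    """f0: detected fundamental frequency per frame, in Hz."""
    # candidate corrections per frame: as detected, halving error, doubling error
    cands = [(f, f / 2.0, 2.0 * f) for f in f0]
    prev = [0.0, 0.0, 0.0]   # best cumulative jump ending in each candidate
    back = []
    for i in range(1, len(f0)):
        cur, bp = [], []
        for j in range(3):
            costs = [prev[k] + abs(cands[i][j] - cands[i - 1][k])
                     for k in range(3)]
            k = min(range(3), key=costs.__getitem__)
            cur.append(costs[k])
            bp.append(k)
        prev = cur
        back.append(bp)
    # backtrack the cheapest (smoothest) path
    j = min(range(3), key=prev.__getitem__)
    path = [j]
    for bp in reversed(back):
        j = bp[j]
        path.append(j)
    path.reverse()
    return [cands[i][path[i]] for i in range(len(f0))]
```

Because the search is global over the whole contour, a run of several consecutive halved frames is lifted back as a block, which a frame-local median or linear filter cannot do.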
  • XU Zhan-wu,ZHANG Tao,LIU Xiao-lin
    2001, 15(2): 52-57.
Automatic vectorization of scanned topographic maps is an important and difficult problem that urgently needs to be solved. A topographic map contains many numbers in various fonts, which indicate properties and other features of the terrain. Extracting and recognizing these numbers correctly is an important part of map processing. This paper analyzes the disadvantages of many existing extraction methods and presents a new extraction and recognition algorithm for numbers. The algorithm first fixes candidates according to prior sizes, then recognizes true numbers with a BP neural network of OCON (one-class-one-network) structure. Finally, it extracts extended numbers using neighborhood relations. Experiments show that the algorithm is fast, efficient, and reliable.
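The first stage, fixing candidates by prior sizes, can be sketched as a filter over connected-component bounding boxes; the size bounds below are assumed placeholder values, not the paper's priors, and the BP/OCON recognition and neighborhood-extension stages are not shown:

```python
def digit_candidates(components, min_wh=(4, 8), max_wh=(30, 40)):
    """components: list of (x, y, w, h) bounding boxes from the scanned map.

    Keep only boxes whose width and height fall inside the prior size
    range expected for digit glyphs; everything else (contour lines,
    large symbols, speckle noise) is rejected before recognition.
    """
    (min_w, min_h), (max_w, max_h) = min_wh, max_wh
    return [c for c in components
            if min_w <= c[2] <= max_w and min_h <= c[3] <= max_h]
```

Filtering by size priors first keeps the expensive neural-network recognition stage from being run on every component of the map.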
  • S·soyoltu
    2001, 15(2): 58-65.
Based on the agglutinative pattern of written Mongolian and the syllabic writing rules of its words, a whole-word coding method for the Mongolian language is proposed. Using computability theory, whole-word coding is divided into two parts, writing-input codes and computational codes, and an input method for this spelling language that requires no keyboard mapping is proposed. A near-optimal human-computer interaction pattern is achieved by imitating the natural spelling and writing rules of traditional Mongolian whole words in the design of the writing codes. The computational codes of whole words both carry the information of complex whole-word features and guarantee the computability of that information, thus establishing a feasible and scientific basis for the unified computation and parallel processing of Mongolian whole words.