2003 Volume 17 Issue 4 Published: 18 August 2003
  

  • LIU Qun
    2003, 17(4): 2-13.
    The paper gives a survey of three approaches to statistical machine translation (SMT) and of the evaluation methods used in SMT. The basic idea of the parallel-grammar-based approach is to build parallel grammars for the source and target languages which conform to the same probability distribution. In the source-channel approach, the translation probability is decomposed into a language model and a translation model. In the maximum entropy approach, the optimal translation is searched for according to a linear combination of a series of real-valued feature functions. The source-channel approach can be regarded as a special case of the maximum entropy approach.
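    For reference, the standard formulations of the last two approaches (not quoted from the paper itself) are: the source-channel model chooses the target sentence e that maximizes the product of the language model and the translation model, while the maximum entropy (log-linear) model maximizes a weighted sum of real-valued feature functions h_m:

```latex
\hat{e} = \arg\max_{e} P(e)\, P(f \mid e)
\qquad\text{vs.}\qquad
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)
```

    Choosing M = 2 with h_1 = log P(e), h_2 = log P(f|e) and lambda_1 = lambda_2 = 1 recovers the source-channel model, which is exactly the special-case relation noted in the abstract.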
  • LI Qing-hu,CHEN Yu-jian,SUN Jia-guang
    2003, 17(4): 14-19.
    Chinese word segmentation is a preparatory step for Chinese information processing. As a basic component of Chinese word segmentation systems, the dictionary mechanism significantly influences the speed and efficiency of segmentation. In this paper, we present a new dictionary mechanism named double-character hash indexing (DCHI). Compared with existing typical dictionary mechanisms (i.e. binary-seek-by-word, TRIE indexing tree and binary-seek-by-characters), DCHI improves the speed and efficiency of segmentation without increasing space and time complexity or maintenance difficulty.
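    The abstract does not spell out the DCHI data structure itself; the Python sketch below only illustrates the general idea under the assumption that entries are hashed on their first two characters, with maximum (longest) matching inside each bucket:

```python
# Sketch of a two-character hash index for dictionary lookup (assumed layout,
# not the paper's exact DCHI structure): words are bucketed by their first two
# characters, and each bucket stores the remaining suffixes.
from collections import defaultdict

class DoubleCharIndex:
    def __init__(self, words):
        self.single = {w for w in words if len(w) == 1}
        self.buckets = defaultdict(set)   # (c1, c2) -> suffixes ("" for 2-char words)
        self.max_len = max((len(w) for w in words), default=1)
        for w in words:
            if len(w) >= 2:
                self.buckets[w[0], w[1]].add(w[2:])

    def longest_match(self, text, start=0):
        """Return the longest dictionary word beginning at `start`, or None."""
        best = text[start] if text[start] in self.single else None
        if start + 1 < len(text):
            bucket = self.buckets.get((text[start], text[start + 1]))
            if bucket:
                end = min(len(text), start + self.max_len)
                for stop in range(end, start + 1, -1):   # longest candidate first
                    if text[start + 2:stop] in bucket:
                        return text[start:stop]
        return best

# Maximum-matching segmentation with the index:
dic = DoubleCharIndex(["中国", "中华人民共和国", "成立", "中", "华人"])
text, pos, out = "中华人民共和国成立", 0, []
while pos < len(text):
    w = dic.longest_match(text, pos) or text[pos]
    out.append(w)
    pos += len(w)
print(out)   # -> ['中华人民共和国', '成立']
```

    In this sketch the two-character hash narrows every lookup to one small bucket instead of a binary search over the whole word list, which is the kind of saving the abstract attributes to DCHI.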
  • SONG Rui-hua,MA Shao-ping,CHEN Gang,LI Jing-yang
    2003, 17(4): 20-27.
    When using a search engine, people often find many irrelevant or only peripherally relevant items in the result list. Most of these items are produced by words that are irrelevant to the topic of a web page. It is costly or even impossible to remove such items using traditional keyword methods. In this paper, we define the concept of noise in web pages and propose a novel approach that cleans the noise information of web pages in the pre-processing stage. A novel model of Chinese web pages and four simple rules are built to discard noise from HTML files. Experimental results show that all the indirect items that appear in the results of site grouping are removed correctly, and about 11% of the irrelevant or indirect items that cannot be excluded by commercial Chinese search engines are removed by our approach.
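    The page model and the four rules are not listed in the abstract; the sketch below only illustrates this kind of rule-based cleaning with two hypothetical rules (strip script/style/comment blocks, and drop lines whose visible text is dominated by link anchors):

```python
# Hypothetical rule-based noise cleaning for HTML (illustrative only; the
# paper's page model and four rules are not reproduced here).
import re

def clean_html(html, max_link_ratio=0.5):
    # Rule A: remove script, style and comment blocks entirely.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>|<!--.*?-->", " ", html)
    cleaned_lines = []
    for line in html.splitlines():
        text = re.sub(r"(?s)<[^>]+>", " ", line)            # visible text of the line
        anchor_text = " ".join(re.findall(r"(?is)<a\b[^>]*>(.*?)</a>", line))
        anchor_text = re.sub(r"(?s)<[^>]+>", " ", anchor_text)
        # Rule B: drop lines dominated by anchor text (navigation bars, link lists).
        if text.strip() and len(anchor_text.strip()) / max(len(text.strip()), 1) > max_link_ratio:
            continue
        cleaned_lines.append(text.strip())
    return "\n".join(l for l in cleaned_lines if l)

page = """<html><body>
<div><a href="/">Home</a> | <a href="/news">News</a> | <a href="/login">Login</a></div>
<p>正文内容：这是网页的主题段落。</p>
<script>var x = 1;</script>
</body></html>"""
print(clean_html(page))   # prints only the topical paragraph text
```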
  • KANG Heng,LIU Wen-ju
    2003, 17(4): 28-33.
    The performance of continuous speech recognition systems depends heavily on the speech database, and text selection is the key step in designing such a database. Conventional text selection methods consider too few factors, so the recognition systems cannot use linguistic information effectively. This paper describes a method that selects text automatically while considering multiple factors: triphone coverage rate, triphone coverage efficiency, triphone sparseness rate, the distribution of commonly used words, etc. The selected text set covers 94.1% of the triphones and 75.4% of the most commonly used words, and both the coverage rate and the sparseness rate are better than those of conventional methods.
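    The exact scoring function that combines these factors is given in the paper; the sketch below only shows the greedy selection scheme such methods are usually built on, with a hypothetical score rewarding new triphone coverage per unit of text plus coverage of frequent words (the triphones_of function, standing in for a pronunciation lexicon, is assumed):

```python
# Greedy text selection sketch (illustrative; the paper's weighting of coverage
# rate, coverage efficiency, sparseness rate and word distribution is not
# reproduced here). `triphones_of` maps a sentence to its set of triphones.
def greedy_select(sentences, triphones_of, common_words, budget, w_word=0.2):
    covered_tri, covered_words, selected = set(), set(), []
    remaining = list(sentences)
    for _ in range(budget):
        if not remaining:
            break
        def score(s):
            new_tri = triphones_of(s) - covered_tri
            new_words = (set(s.split()) & common_words) - covered_words
            # reward new triphones per unit of text, plus newly covered frequent words
            return len(new_tri) / max(len(s), 1) + w_word * len(new_words)
        best = max(remaining, key=score)
        if score(best) == 0:
            break                        # nothing new left to cover
        selected.append(best)
        covered_tri |= triphones_of(best)
        covered_words |= set(best.split()) & common_words
        remaining.remove(best)
    return selected
```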
  • HAN Zhao-bing,JIA Lei,ZHANG Shu-wu,XU Bo
    2003, 17(4): 34-39.
    A crucial issue in triphone-based continuous speech recognition is the large number of parameters to be estimated against the limited availability of training data. To cope with this problem, two major context-clustering methods, agglomerative (AGG) and tree-based (TB) clustering, have been widely investigated. We analyze both algorithms with respect to their advantages and disadvantages, develop several methods to improve them, and introduce a novel combined method in the maximum likelihood framework. For LVCSR, the experimental results show that performance can be much improved by the proposed combined method, compared with the existing TB method alone.
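    For reference, tree-based clustering in the maximum likelihood framework conventionally splits a node S by the phonetic question q that maximizes the log-likelihood gain; with a single shared Gaussian per node the standard approximation (not the paper's refined criterion) is:

```latex
\Delta L(q) = L\big(S_{y}(q)\big) + L\big(S_{n}(q)\big) - L(S),
\qquad
L(S) = -\tfrac{1}{2}\Big(\log\big((2\pi)^{d}\,\lvert\Sigma_{S}\rvert\big) + d\Big)\sum_{s \in S}\gamma_{s}
```

    where S_y(q) and S_n(q) are the states answering yes/no to q, Sigma_S is the pooled covariance, d the feature dimension and gamma_s the occupation count of state s; agglomerative clustering instead merges, at each step, the pair of states whose merge loses the least likelihood.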
  • NIE Xin,WANG Zuo-ying
    2003, 17(4): 40-45.
    In a TTS system, it is very important to mark phrase breaks correctly in order to achieve high naturalness and quality of the output speech. The paper discusses an algorithm for automatically predicting phrase breaks in Chinese sentences. First, the text is segmented into words and converted to a sequence of part-of-speech (POS) tags; then, based on the POS tag sequence parameters and the phrase-break distance information obtained from training, a Markov model is used to find the most likely phrase-break sequence. With the model parameters and rules used in this paper, the recall rate of predicted breaks is 68.2%, and the overall juncture prediction accuracy is 85.1%.
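    One common formulation of such a model (the paper's exact parameterization is not given in the abstract) treats the juncture after each word as a hidden break/no-break decision and searches, e.g. with the Viterbi algorithm, for the sequence

```latex
\hat{B} = \arg\max_{b_1 \dots b_n} \prod_{i=1}^{n} P(b_i \mid t_i, t_{i+1})\; P(b_i \mid d_i)
```

    where t_i and t_{i+1} are the POS tags around the i-th juncture and d_i is the distance from the previous predicted break, which is where the abstract's POS tag sequence parameters and phrase-break distance information would typically enter.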
  • CHEN Zhi-gang,HU Guo-ping,WANG Xi-fa
    2003, 17(4): 46-52.
    Text normalization is a procedure that correctly generates information, such as pronunciation and rhythm, for special symbols. In this paper, a method based on hierarchical, external rules is presented. By matching rules, we can recognize common special symbols and generate the correct information. The paper first introduces the concept of the analysis tree, then shows the steps of constructing the rules and presents the experimental results. The results show that the method achieves easy maintainability and easy expandability, and the accuracy on an open test is 99.76%.
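    The rule format and the analysis tree are not shown in the abstract; the snippet below only illustrates externally maintained, pattern-based normalization rules with a few hypothetical examples for a Chinese TTS front end:

```python
# Hypothetical externalized normalization rules (illustrative only; not the
# paper's hierarchical rule format).
import re

DIGITS = "零一二三四五六七八九"
def read_digits(s):                       # "2003" -> "二零零三"
    return "".join(DIGITS[int(c)] for c in s)

RULES = [   # (pattern, expansion) pairs, applied in order
    (re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日"),
     lambda m: f"{read_digits(m.group(1))}年{m.group(2)}月{m.group(3)}日"),
    (re.compile(r"(\d+(?:\.\d+)?)%"), lambda m: f"百分之{m.group(1)}"),
    (re.compile(r"(\d+):(\d+)"),      lambda m: f"{m.group(1)}点{m.group(2)}分"),
]

def normalize(text):
    for pattern, expand in RULES:
        text = pattern.sub(expand, text)
    return text

print(normalize("2003年8月18日 8:30 涨幅达3.5%"))
# -> 二零零三年8月18日 8点30分 涨幅达百分之3.5
```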
  • HUANG Ya-ping,LUO Si-wei,CHEN En-yi
    2003, 17(4): 53-59.
    Writer recognition, as an identification technology, has many advantages, such as natural interaction and non-intrusive detection, and has therefore become a hot topic in pattern recognition and machine learning research. This paper proposes a new text-independent writer recognition algorithm, which adopts Independent Component Analysis (ICA) to extract texture features and a competitive learning mechanism to determine the class centers. Experimental results show that our algorithm is efficient.
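    Neither the ICA basis construction nor the competitive learning update is specified in the abstract; the sketch below shows one plausible pipeline under those assumptions: ICA coefficients as texture features, a simplified winner-take-all update of one center per writer, and nearest-center classification:

```python
# Illustrative sketch (assumed pipeline, not the paper's implementation):
# ICA texture features + competitive learning of per-writer class centers.
import numpy as np
from sklearn.decomposition import FastICA

def ica_features(patches, n_components=16, seed=0):
    """patches: (n_samples, patch_dim) array of flattened handwriting patches."""
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    return ica.fit_transform(patches)          # independent-component coefficients

def competitive_centers(features, labels, lr=0.1, epochs=20, seed=0):
    """Simplified, supervised variant of competitive learning: one center per
    writer, pulled toward that writer's samples (winner fixed by the label)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    centers = {w: features[rng.choice(np.flatnonzero(labels == w))].copy()
               for w in np.unique(labels)}
    for _ in range(epochs):
        for x, y in zip(features, labels):
            centers[y] += lr * (x - centers[y])   # move the center toward the sample
    return centers

def classify(x, centers):
    """Assign a feature vector to the writer with the nearest center."""
    return min(centers, key=lambda w: np.linalg.norm(x - centers[w]))
```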
  • LU Hong-lin,CHENG Yi-min,WANG Yi-xiao,TIAN Yuan
    2003, 17(4): 60-66.
    This paper describes a new method based on ICA for transmitting hidden Chinese characters, where a color image is used as the host image and analyzed to obtain its independent components. After the Chinese characters have been encoded, the code is embedded in the less significant bits of the appropriate independent components. In this way an image with the Chinese characters hidden in it can be composed and then transmitted over the Internet, so the Chinese characters can be transmitted secretly. The method has been simulated on a PC. The experiments show that with this method both the transmission rate of the Chinese characters and the security can be kept considerably high while the image quality remains relatively high.
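    The ICA decomposition of the host image and the choice of components are specific to the paper; the sketch below only shows the basic least-significant-bit embed/extract step such schemes rest on, with a plain integer array standing in for the selected independent components:

```python
# LSB embedding/extraction sketch (illustrative; the paper embeds into the
# independent components of a color host image rather than raw values).
import numpy as np

def embed(carrier, message: bytes):
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    assert bits.size <= carrier.size, "carrier too small for the message"
    stego = carrier.copy().ravel()
    stego[:bits.size] = (stego[:bits.size] & 0xFE) | bits   # overwrite the lowest bit
    return stego.reshape(carrier.shape)

def extract(stego, n_bytes):
    bits = (stego.ravel()[:n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

carrier = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
secret = "汉字".encode("utf-8")                      # any byte encoding of the characters works
stego = embed(carrier, secret)
print(extract(stego, len(secret)).decode("utf-8"))   # -> 汉字
```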