2004 Volume 18 Issue 2 Published: 15 April 2004
  

  • LI Heng,ZHU Jing-bo,YAO Tian-shun
    2004, 18(2): 2-8.
    Classification algorithms based on the support vector machine (SVM) have attracted increasing attention from researchers due to their solid theoretical properties and good empirical results. Compared with other classification algorithms, SVMs, which are based on structural risk minimization, achieve high generalization performance from a small number of samples. Text chunking, a preprocessing step for parsing, divides text into syntactically related, non-overlapping groups of words (chunks), reducing the complexity of full parsing. In this paper, we treat Chinese text chunking as a classification problem and apply SVMs to solve it. The chunking experiments were carried out on the HIT Chinese Treebank corpus. Experimental results show that the approach is effective, achieving an F-score of 88.67%, especially with a small number of labeled Chinese samples.
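    The reduction of chunking to per-word classification can be sketched as follows: bracketed chunks are converted to IOB labels, one per word, which a classifier such as an SVM then predicts. This is an illustrative framing only; the tag names and span representation are our assumptions, not the paper's exact scheme.

```python
def chunks_to_iob(words, chunks):
    """words: list of tokens; chunks: list of (start, end, type) spans,
    end-exclusive and non-overlapping. Returns one IOB label per word:
    B-X opens a chunk of type X, I-X continues it, O is outside any chunk."""
    labels = ["O"] * len(words)
    for start, end, ctype in chunks:
        labels[start] = "B-" + ctype          # first word of the chunk
        for i in range(start + 1, end):       # remaining words of the chunk
            labels[i] = "I-" + ctype
    return labels
```

    With this encoding, chunking accuracy reduces to per-label classification accuracy, which is exactly what an F-score over chunks is computed from.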
  • ZHANG Min,JIN Yi-jiang,MA Shao-ping
    2004, 18(2): 9-15.
    Given the variety and huge volume of web pages on the Internet, effective search over distributed collections has become necessary for today's Web IR, which gives rise to the problem of fusing retrieval results. In this paper, a novel rank-based weighted-insertion result fusion algorithm is proposed. Although the similarity scores of different result sets may be entirely incomparable, the proposed algorithm still works effectively. Experimental results on an 18GB large-scale standard Web test collection show that the weighted-insertion fusion strategy enhances retrieval performance consistently. As the performance of the distributed results improves, the enhancement increases as well, reaching 10%. Furthermore, it breaks the limitation observed in traditional result fusion studies that the result merged from distributed collections is always worse than that obtained from a single central database.
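    One plausible realization of rank-based fusion — scoring each document only by its rank in each list, weighted per list, so that incomparable similarity scores are never compared directly — can be sketched as below. The reciprocal-rank weighting is our illustrative choice, not necessarily the paper's exact insertion rule.

```python
def fuse(ranked_lists, weights):
    """ranked_lists: one list of doc ids per collection, best first.
    weights: one reliability weight per collection.
    Each document accumulates weight / rank from every list containing it;
    the merged list is sorted by this rank-based score."""
    scores = {}
    for docs, w in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / rank
    return sorted(scores, key=scores.get, reverse=True)
```

    A document ranked highly in several collections rises in the merged list, which is how fusion can beat every individual collection.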
  • SUN Lian-heng,YANG Ying,YAO Tian-shun
    2004, 18(2): 16-23.
    Evaluations are very helpful for machine translation (MT) research. Their aim is not only to quantify the differences among MT systems but also to stimulate the improvement of key technologies in this area. In the past, MT evaluations were performed by humans; with the increasing needs of MT research, automating MT evaluation has become more and more important. This paper introduces the basic framework of automatic MT evaluation using n-gram co-occurrence statistics. Three methods based on this framework (BLEU, NIST, and OpenE) are described, and their advantages and disadvantages are discussed through the analysis of several experiments. Among these methods, OpenE adopts a new n-gram weighting scheme that employs a local corpus and a large global corpus. The experiments show this method to be practical for machine translation evaluation.
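    The core of the n-gram co-occurrence framework is clipped n-gram precision, as used in BLEU: each candidate n-gram is credited at most as many times as it appears in the reference. A minimal single-reference sketch, without BLEU's brevity penalty or geometric mean over orders:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams matched in the reference,
    with each match clipped to the reference count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0
```

    The clipping is what keeps a degenerate candidate like "the the the" from scoring perfect unigram precision.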
  • ZHANG Jun-lin,QU Wei-min,SUN Le,SUN Yu-fang
    2004, 18(2): 24-30,44.
    Language-model-based IR systems, proposed over the past five years, have introduced the language modeling approach from speech recognition into the IR community and effectively improved retrieval performance. However, the underlying assumption that all indexed words are independent does not hold. Although the statistical MT approach alleviates the situation by taking synonymy into account, it does not help distinguish the different meanings of the same word in varied contexts. In this paper we propose a trigger-language-model-based IR system to resolve this problem. We first compute the association ratio of words from a training corpus, then obtain the collection of words triggered by each query word, in order to identify the word's actual meaning in a specific textual context. We introduce the corresponding parameters into the document language model to form the trigger-language-model-based IR system. Experiments show that its performance improves greatly: compared with the classical language-model IR system, precision increases by almost 12% and recall by 10.8%.
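    Association ratios for trigger pairs are commonly pointwise-mutual-information-style statistics over co-occurrence counts; the sketch below makes that assumption (the paper's exact statistic and windowing may differ). Documents are modeled simply as sets of words.

```python
import math

def association_ratio(docs, x, y):
    """PMI-style association of words x and y over a corpus of documents
    (each document a set of words): log2( P(x,y) / (P(x) * P(y)) ),
    estimated from document co-occurrence counts. Returns 0.0 when a
    count is zero. Words with high ratios become triggers of each other."""
    n = len(docs)
    nx = sum(1 for d in docs if x in d)
    ny = sum(1 for d in docs if y in d)
    nxy = sum(1 for d in docs if x in d and y in d)
    if not (nx and ny and nxy):
        return 0.0
    return math.log2((nxy * n) / (nx * ny))
```

    The words with the highest ratios against a query word form its triggered collection, which then reweights the document language model.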
  • QIAN Yi-li,ZHENG Jia-heng
    2004, 18(2): 31-36.
    The disambiguation of multi-category words is one of the difficulties in part-of-speech tagging of Chinese text and greatly affects the quality of processed corpora. Addressing this problem, the paper describes an approach to automatically correcting the part-of-speech tags of multi-category words. It acquires correction rules from correctly tagged corpora using rough sets and data mining, and then corrects the corpora automatically based on these rules. In closed and open tests on a corpus of 500,000 Chinese characters, the tagging accuracy of multi-category words increased by 11.32% and 5.97% respectively.
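    The shape of applying mined correction rules can be sketched as follows. The (word, previous-tag → new-tag) pattern here is purely illustrative — the rough-set rules in the paper condition on richer contextual attributes — but the correction loop has the same structure.

```python
def apply_rules(tagged, rules):
    """tagged: list of (word, tag) pairs.
    rules: list of (word, prev_tag, new_tag) triples — retag `word`
    as `new_tag` when the preceding token carries `prev_tag`.
    Returns a corrected copy of the tagged sequence."""
    out = list(tagged)
    for i, (w, t) in enumerate(out):
        for word, prev_tag, new_tag in rules:
            if w == word and i > 0 and out[i - 1][1] == prev_tag:
                out[i] = (w, new_tag)
    return out
```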
  • CHENG Yan-xiang,DAI Bei-qian,ZHOU Xi,LI Hui
    2004, 18(2): 37-44.
    In this paper, a text-independent speaker verification system based on conversation is proposed. The key difference between this system and a conventional single-speaker verification system is that the speech used for training and testing is conversational, so speaker-based speech segmentation is applied to train the speaker models and make the final decision. The GMM-UBM framework is introduced, and an unsupervised speech segmentation algorithm based on the GLR distance measure is emphasized. Score normalization, including ZNORM and a duration penalty, then improves performance by 10%.
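    The GLR distance between two speech segments compares modeling them separately against modeling them pooled; a segment boundary is hypothesized where this distance peaks. The one-dimensional Gaussian sketch below illustrates the statistic only — real systems fit multivariate Gaussians over cepstral feature vectors.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under a single Gaussian fit by maximum
    likelihood (variance floored to avoid log(0) on constant segments)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def glr_distance(a, b):
    """Generalized likelihood ratio: large when segments a and b are
    better explained by separate models than by one pooled model."""
    return gauss_loglik(a) + gauss_loglik(b) - gauss_loglik(a + b)
```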
  • WU Zhi-yong,CAI Lian-hong
    2004, 18(2): 45-51.
    In this paper, a new unit selection approach for concatenative text-to-speech (TTS) synthesis based on a prosodic correlation model is proposed. First, prosodic correlations in continuous speech are studied. Then a set of prosodic parameters, including prosodic correlation parameters, is derived. Finally, a prosodic correlation model (an association-rules model from data mining) is applied in unit selection. Experiments show that the proposed unit selection method improves the naturalness of the synthesized speech: the MOS score is 12.22% higher than before (3.49 vs. 3.11).
  • SUN Yu-fei,CHEN Yan,ZHANG Yu-zhi
    2004, 18(2): 52-58.
    The recognition of similar characters has a great impact on the accuracy and usability of a whole OCR system. In this paper, the asymmetry in similar Chinese character recognition is introduced, and its causes are discussed and analyzed in detail. Based on this asymmetry, we propose a category-based partial-area matching method for recognizing similar Chinese characters. According to their structural characteristics, similar characters are divided into several elementary categories, and category-specific features extracted from the corresponding partial areas are used to recognize them. Experimental results show the validity of the proposed method, which significantly improves the accuracy of similar Chinese character recognition: a 4.55% improvement on error-prone similar characters and a 0.38% improvement on less error-prone ones.
  • CHEN Kai-qu,ZHAO Jie,PENG Zhi-wei
    2004, 18(2): 59-66.
    At present there are two effective ways to speed up approximate string matching: bit-vector methods and filter methods. Since the Chinese alphabet has many characters, the bit-vector method requires a large amount of memory, which is a problem for small machines such as embedded systems. We present a new bit-vector method that needs only about 5% of the memory of the original bit-vector method. We also exploit the very large Chinese alphabet to develop a new filter method, BPM-BM, for approximate string matching in Chinese text. It runs at least 14% faster than the fastest known algorithms; in most cases, our algorithm is 1.5-2 times faster.
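    The classic bit-vector method that such work builds on is Myers' bit-parallel edit distance, which packs a dynamic-programming column into machine words; the per-character table over a large Chinese alphabet is exactly where the memory cost arises. A sketch of the standard algorithm (not the paper's memory-reduced variant; Python's unbounded integers stand in for machine words):

```python
def edit_distance(p, t):
    """Myers' bit-parallel Levenshtein distance between p and t.
    peq[c] marks the positions of character c in p; pv/mv are the
    +1/-1 vertical-delta bit vectors of the current DP column."""
    m = len(p)
    if m == 0:
        return len(t)
    peq = {}
    for i, c in enumerate(p):
        peq[c] = peq.get(c, 0) | (1 << i)     # one bitmask per alphabet symbol
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    pv, mv, score = mask, 0, m
    for c in t:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & last:                         # bottom cell increased
            score += 1
        if mh & last:                         # bottom cell decreased
            score -= 1
        ph = ((ph << 1) | 1) & mask           # carry-in 1: first DP row is 0..n
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score
```

    The `peq` table is the memory bottleneck the paper attacks: for a word-size pattern over a Chinese alphabet it holds thousands of entries, versus a few dozen for ASCII.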
  • DONG Zhi-jiang,WU Jian,ZHONG Yi-xin
    2004, 18(2): 67-73.
    When processing minority scripts in computers based on the ISO/IEC 10646 and Unicode standards, a bottleneck problem is that presentation variants of characters have no definite code points. This is why many software systems for processing minority scripts are developed redundantly and are incompatible with one another. Based on the script processing architecture of ICU, this paper illustrates how to implement minority script processing in compliance with the Unicode standard. First, we analyze the characteristics of minority scripts and point out the difficulties of processing them. Then the OpenType font technology, which can satisfy the requirements of minority language processing, is introduced. Finally, we explain the principle of the Layout Engine and present how to embed minority script processing in ICU.
  • LI Pei-feng,ZHU Qiao-ming,QIAN Pei-de
    2004, 18(2): 74-80.
    It is a general tendency that the Chinese character internal codes used in computers will migrate to ISO/IEC 10646, but multiple internal codes coexist today, and this situation will persist for a long time. Automatic recognition of Chinese character internal codes is therefore key to building a multilingual environment. This paper discusses internal code recognition algorithms for the multilingual environment and provides four of them: the internal code bound recognition algorithm, the punctuation recognition algorithm, the Chinese character frequency recognition algorithm, and the semantic recognition algorithm. The paper also evaluates these algorithms; used together on the test documents, they reach a recognition rate of 99.9%.
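    The bound-recognition idea can be sketched as a byte-range check: GB2312 trail bytes lie in 0xA1-0xFE, while Big5 additionally allows trail bytes in 0x40-0x7E. The function below is our illustration, not the paper's algorithm; its "ambiguous" outcome on short GB-compatible input shows why the frequency and semantic algorithms are also needed.

```python
def guess_encoding(data: bytes):
    """Classify a byte string as GB2312, Big5, ambiguous (both ranges
    fit), or None (neither fits / truncated), by lead/trail byte ranges.
    ASCII bytes (< 0x80) are skipped as single-byte characters."""
    gb_ok = big5_ok = True
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):            # dangling lead byte
            return None
        t = data[i + 1]
        if not (0xA1 <= b <= 0xF7 and 0xA1 <= t <= 0xFE):
            gb_ok = False
        if not (0x81 <= b <= 0xFE and (0x40 <= t <= 0x7E or 0xA1 <= t <= 0xFE)):
            big5_ok = False
        i += 2
    if gb_ok and not big5_ok:
        return "GB2312"
    if big5_ok and not gb_ok:
        return "Big5"
    return "ambiguous" if gb_ok else None
```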
  • WANG You-zhi,ZHAO Min,CHEN Jun-feng
    2004, 18(2): 81-86.
    A systems analysis approach is applied to evaluating and nominating bilingual IT terms in this paper. First, all the terms concerned are split into basic term elements (termels); then eight quantitative indexes are given: the ratios of printing errors, of inconsistent or uncanonical spellings, of unnecessary one-to-many or many-to-one correspondences, of inequivalent terms, of terms discrepant across disciplines, of terms not conforming to the national standard (GB), of incomplete or redundant terms, and the overall ratio of terms worthy of revision or discussion. Based on these criteria, any bilingual IT vocabulary (or indeed that of any other discipline of the natural sciences and technologies) can be evaluated and analyzed systematically by means of termels. Finally, software for such systems analysis and evaluation is suggested and its functional specification proposed.