2006 Volume 20 Issue 4 Published: 15 August 2006
  

  • ZHANG Yang-sen,CAO Yuan-da,YU Shi-wen
    2006, 20(4): 3-9,57.
    Automatic proofreading of Chinese text is an important research subject in NLP. This article proposes a hybrid model that combines rules and statistics. Based on the distribution of single Chinese characters after word segmentation and on the concept of the “non-multi-character word error”, we propose a group of rules for finding errors in text, construct an automatic error-detection model, and implement its algorithm by combining a scattered single-character Bigram model with part-of-speech Bigram and Trigram models. An experiment on 30 texts containing 578 error test points shows a recall rate of 86.85%, an accuracy rate of 69.43%, and a distortion rate of 30.57%.
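As a rough illustration of the statistical half of such a model, the sketch below flags single-character tokens whose transition from the preceding token is unusually rare under a bigram model. It is not the authors' algorithm: the function names, the add-one smoothing, and the `threshold` cutoff are all assumptions made for this example, and the rule component and part-of-speech models are omitted.

```python
from collections import Counter

def train_bigrams(segmented_corpus):
    """Count token unigrams and adjacent-token bigrams from segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in segmented_corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def suspicious_positions(tokens, unigrams, bigrams, threshold=0.2):
    """Return indexes of single-character tokens that rarely follow their predecessor.

    `threshold` is a hypothetical cutoff; a real system would tune it on data.
    The first token has no predecessor and is never flagged.
    """
    flagged = []
    vocab_size = len(unigrams)
    for i in range(1, len(tokens)):
        if len(tokens[i]) != 1:
            continue  # only scattered single characters are checked
        prev = tokens[i - 1]
        # Add-one smoothed estimate of P(current token | previous token).
        p = (bigrams[(prev, tokens[i])] + 1) / (unigrams[prev] + vocab_size)
        if p < threshold:
            flagged.append(i)
    return flagged
```

Flagged positions would then be candidates for the rule-based checks described in the abstract.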
  • LI Xing,ZONG Cheng-qing
    2006, 20(4): 10-17.
    Based on an analysis of the usage and syntactic functions of Chinese punctuation, this paper proposes a new hierarchical approach to parsing long Chinese sentences. Traditional parsing approaches parse in a single pass and give punctuation marks no special treatment. In our approach, by contrast, complex long sentences are broken into sub-sentences or units (hereafter ‘units’) at punctuation marks with special functions, so that the original sentence is parsed unit by unit. This ‘divide-and-conquer’ idea greatly reduces the difficulty that traditional approaches have in recognizing syntactic relationships between and within sub-sentences and phrases. In addition, grammatical rules involving punctuation marks, together with their probabilities, are extracted from a large-scale treebank, which greatly helps syntactic disambiguation. Experimental results show that, compared with the traditional chart parsing algorithm, our approach significantly reduces parsing time and the number of ambiguous edges, and improves precision and recall by about 7% when parsing long Chinese sentences.
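The first step of this divide-and-conquer idea, cutting a long sentence into units at selected punctuation marks, might look like the sketch below. The set of splitting marks in `SPLIT_MARKS` and the function name are assumptions for illustration; the abstract does not list which marks the authors treat as unit boundaries, and the per-unit parsing itself is not shown.

```python
import re

# Hypothetical set of unit-separating marks (full-width and ASCII comma, semicolon, colon).
SPLIT_MARKS = ",;:,;:"

def split_into_units(sentence, marks=SPLIT_MARKS):
    """Split a sentence into (unit_text, trailing_mark) pairs at the given marks."""
    pattern = "([" + re.escape(marks) + "])"
    # With a capturing group, re.split alternates text and separators.
    parts = re.split(pattern, sentence)
    units = []
    for i in range(0, len(parts), 2):
        text = parts[i].strip()
        mark = parts[i + 1] if i + 1 < len(parts) else ""
        if text:
            units.append((text, mark))
    return units
```

Each unit could then be handed to an ordinary parser, with the retained marks guiding how the partial parses are combined.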
  • WANG Zhi-min
    2006, 20(4): 18-26.
    As one of the intractable problems in NLP, metaphor has attracted increasing attention from researchers in recent years. Western scholars have done much research in this field and made significant progress, whereas research on Chinese metaphor is just beginning. To find a solution for the formalization of Chinese metaphor, this paper surveys recent developments in the design of computational models of metaphor and the construction of metaphor knowledge bases. Several computational models are introduced, including (1) the Met5 system, guided by the preference-constraint view, (2) the MIDAS system, which uses an example-based method, and (3) the CorMet system, which uses a statistical approach.
  • LIU Ting,LI Wei-gang,ZHANG Yu,LI Sheng
    2006, 20(4): 27-34.
    Paraphrase is a common phenomenon in natural language that captures core aspects of linguistic variability. The study of paraphrase concerns synonymy between phrases or sentences. With the development of the foundational technologies of natural language processing, research on paraphrase has recently received growing attention. Paraphrasing technology has been applied in many NLP fields, such as information retrieval, question answering, information extraction, automatic text summarization, machine translation, and text watermarking, to improve the performance of these systems. This paper surveys the following aspects of paraphrasing technology: paraphrase corpus construction, paraphrase rule extraction, paraphrase generation, and paraphrase evaluation. Some of our own work on paraphrase is also briefly introduced. The final section points out some challenges and future directions for paraphrasing technology.
  • MIN Jin-ming,SUN Le,ZHANG Jun-lin
    2006, 20(4): 35-42.
    One of the most frustrating obstacles to sharing online information among people in different countries is the multilingual problem. Research on Cross-Language Information Retrieval (CLIR) plays an important role in addressing it. This paper first gives a formal definition and the standard framework of CLIR. Second, we present the evaluation method for a CLIR system. Then three mainstream approaches in CLIR research are reassessed, and two key problems in CLIR, out-of-vocabulary (OOV) terms and word sense disambiguation (WSD), are identified. Finally, based on observations of the state of the art in CLIR, we suggest several promising directions for CLIR research in the near future.
  • LI Xiao-guang,YU Ge,WANG Da-ling
    2006, 20(4): 43-50.
    To overcome the incomplete modeling of document characteristics and the lack of theoretical grounding in current document similarity models, this paper proposes using a mixture language model (MLM) to evaluate document-to-document similarity. In MLM, the characteristics of a document are described by a statistical language model, the factors that influence those characteristics are viewed as latent models, and the document language model is a mixture of these latent models. MLM not only models document characteristics more completely, but is also flexible and scalable enough to be adapted to different applications. Within the MLM framework, a document similarity method is presented from the viewpoint of document content. Experimental results on the TREC9 dataset indicate that MLM outperforms VSM.
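The general idea of a mixture language model can be sketched as a weighted sum of component models, with two documents compared through a divergence between their mixtures. The abstract does not say which latent models, weights, or similarity measure the paper uses, so the choices below (a document unigram model mixed with a collection-wide background model, compared by KL divergence) are assumptions for illustration only.

```python
import math
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over the given tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mixture_model(components, weights):
    """Linear mixture: P(w) = sum_k weights[k] * components[k](w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    vocab = set().union(*components)
    return {
        w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, components))
        for w in vocab
    }

def kl_divergence(p, q):
    """KL(p || q); assumes q[w] > 0 wherever p[w] > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)
```

Mixing every document with the same background model gives all mixtures the same vocabulary, which keeps the divergence finite; a smaller divergence then indicates more similar documents.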
  • WU Chen,ZHANG Quan
    2006, 20(4): 51-57.
    Concept-based Question Answering (QA) is a new research topic that takes concepts, rather than lexical terms, as the objects of processing. Concepts, as a formalized representation of meaning, help resolve word sense ambiguity. However, using concepts raises new problems, such as concept extraction and the calculation of semantic relatedness between concepts, as well as QA-specific issues such as how to understand the query, how to search for answers, and how to generate natural language answers. Most of these problems, especially the QA-specific ones, have not been addressed. In this paper, we discuss these key issues in building a concept-based QA system and propose several algorithms to solve them. Experiments indicate that a concept-based QA system using the proposed algorithms performs well, with precision reaching almost 40%. Practical use also indicates that these algorithms contribute substantially in a commercial concept-based QA setting.
  • I. Dawa,Yu-jie ZHANG,K. Uezono,Sen ZHANG,Okawa,H. Isahara,K. Shirai
    2006, 20(4): 58-64,95.
    In this paper, we first address the significance of digitizing Mongolian and the current state of the relevant technology. We then focus on the differences in spoken and written Mongolian across areas and countries, and on the problems of processing Mongolian by computer. Finally, we introduce our work in Japan on designing and creating Mongolian language corpora. This work includes two kinds of corpus: a multi-dialectal speech corpus, and the Multilingual Parallel Electronic Dictionary of Mongolian/English/Chinese/Japanese/Korean, which is supported by an international joint project.
  • ZHAO Ji,LI Jing-jiao,WANG Li-jun,ZHANG Ji-sheng
    2006, 20(4): 65-69.
    This study proposes a post-processing method to improve the performance of Manchu character recognition. An evaluation model based on Bayes' rule is used to estimate the probability of candidate Manchu words, taking into account both the posterior probability of each candidate and the prior probability of Manchu phrases. A Hidden Markov Model and the Viterbi dynamic programming algorithm are adopted to check the output of the character recognizer and to correct rejected and misrecognized words. This effectively improves the recognition rate for Manchu manuscripts. The results indicate that post-processing performance depends on the language model and on the accuracy of the evaluation model. In addition, higher SCR (Single Character Recognition) precision yields better error correction in post-processing.
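A minimal sketch of Viterbi rescoring over recognizer candidates is given below. It assumes each position comes with a small set of candidates and recognizer probabilities, and that a bigram table supplies the language-model prior; the data layout, the `floor` value for unseen pairs, and the function name are hypothetical and not taken from the paper.

```python
import math

def viterbi_rescore(lattice, bigram, floor=1e-6):
    """Pick the best candidate sequence from a recognition lattice.

    lattice: list of {candidate: recognizer_probability}, one dict per position.
    bigram:  {(previous, current): probability}; unseen pairs get `floor`.
    """
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (math.log(p), [c]) for c, p in lattice[0].items()}
    for column in lattice[1:]:
        new_best = {}
        for cur, p_rec in column.items():
            # Choose the predecessor that maximizes path score plus transition.
            score, path = max(
                (prev_score + math.log(bigram.get((prev, cur), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cur] = (score + math.log(p_rec), path + [cur])
        best = new_best
    return max(best.values())[1]
```

The point of the search is that a candidate ranked second by the recognizer can still win when the language model strongly prefers it in context.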
  • LIANG Qi,ZHENG Fang,XU Ming-xing,WU Wen-hu
    2006, 20(4): 70-76.
    This paper proposes a language-style-based adaptation method for language models, built on the differences between spoken and written language. Several interpolation methods based on trigram counts are used for the adaptation. An interpolation method that takes Katz smoothing into account computes weights according to the confidence score of a trigram. An adaptation method based on classifying a trigram's style feature computes weights dynamically from the trigram's language style tendency, using several proposed weight generation functions. Experiments on spoken Chinese corpora show that these methods reduce the Chinese character error rate for pinyin-to-character conversion to varying degrees. The method that considers both a trigram's confidence and its style tendency performs best, reducing the character error rate by 50.2% and 23.7%, respectively, relative to the two baselines used in this paper.
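Count-dependent interpolation of two trigram models can be sketched as follows. The weight function here, which trusts the spoken-language estimate more as its trigram count grows, is a hypothetical stand-in; the abstract does not give the paper's confidence score or weight generation functions, and the constant `k` is an assumption.

```python
def confidence_weight(count, k=5.0):
    """Hypothetical weight in [0, 1) that grows with the in-domain trigram count."""
    return count / (count + k)

def adapted_probability(p_spoken, p_written, spoken_count, k=5.0):
    """Interpolate spoken- and written-style trigram probabilities.

    The spoken-model estimate dominates when its trigram was seen often;
    otherwise the written-model estimate is used as a fallback.
    """
    lam = confidence_weight(spoken_count, k)
    return lam * p_spoken + (1.0 - lam) * p_written
```

With `spoken_count` equal to zero the result is the written-model probability, and the result moves toward the spoken-model probability as the count increases.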
  • WU Yi-jian,WANG Ren-hua
    2006, 20(4): 77-83.
    In this paper, HMM-based trainable speech synthesis is applied to Chinese. Appropriate HMM parameters are selected and optimized, and the contextual features and the corresponding question set for tree-based HMM clustering are designed around the characteristics of Chinese, in order to improve HMM modeling and training. In the evaluation, the preference score of the synthetic speech after these improvements is 98.5%. Furthermore, to improve the rhythm of the synthetic speech, a two-level model is introduced for duration modeling and prediction, which reduces the duration prediction RMSE from 29.56 ms to 27.01 ms. Evaluation of the final system shows that the synthetic speech is stable, fluent, and rhythmic. Because the speech synthesis system requires very little storage, it is especially suitable for embedded applications.
  • REN Ji-sheng,WANG Zuo-ying
    2006, 20(4): 84-89.
    A topic-based language model adaptation algorithm must meet the real-time requirements of speech recognition. This goal can be achieved by speeding up the update of the language model weighting coefficients and reducing the use of the language model. This paper proposes a novel quantized representation, obtained by clustering, for consecutive adjacent bigram word pairs, and uses it to characterize both the speech recognition predictive history and each text topic center. The new scheme does not use the global language model; the language model weighting coefficients are updated according to the similarity between the predictive history vector and each topic center vector. Experiments show that, compared with the traditional EM-based topic adaptation method, the new method yields a clear gain in recognition accuracy with better efficiency, with a relative reduction in recognition error rate of about 5.1%. We conclude that the new adaptation algorithm identifies the topic of the test content more accurately.
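Updating topic weights from vector similarity, rather than by EM iterations, might be sketched as below. The abstract does not specify the similarity measure or how similarities become weights, so cosine similarity, the clipping of negative values, and the normalization to sum to one are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_weights(history, centers):
    """Turn history-to-center similarities into mixture weights that sum to 1."""
    sims = [max(cosine(history, c), 0.0) for c in centers]
    total = sum(sims)
    if total == 0:
        # No topic matches the history at all: fall back to uniform weights.
        return [1.0 / len(centers)] * len(centers)
    return [s / total for s in sims]
```

Each update costs one similarity computation per topic center, which is what makes this style of update attractive under real-time constraints.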
  • SU Guo-ping,MIAO Cheng,XIA Guo-ping
    2006, 20(4): 90-95.
    Based on the linguistic characteristics of Uighur, Kazakh and Khalkhas (hereafter UKK) and the special requirement of supporting these minority languages alongside Chinese and English, this paper presents the design goals and general framework of a multilingual GUI processing platform for Linux, drawing on an analysis of national language support within the I18N framework. The platform consists of four sub-systems for UKK (localization, display, auto-adaptive input, and printing), which together comprise more than ten modules. The implementation of these modules is described in detail. Our tests show that the platform smoothly supports input, display, editing, and printing of UKK, Chinese, and Western languages in common applications under Red Hat Linux 8.0 and Turbolinux.
  • LIU Hui-dan,RUI Jian-wu,YAO Yan-dong,WU Jian
    2006, 20(4): 96-101.
    The world's scripts have different writing directions, and it is a challenge to develop a graphical user interface that adapts to the writing direction of the script being processed. In this paper, the requirements for a graphical user interface adaptable to various scripts are analyzed, and four run-time modes are presented in accordance with the writing directions of the scripts. The mechanism by which the Qt library supports scripts such as Arabic, which is written from right to left, is then analyzed. Based on this mechanism, a solution is proposed and implemented to support scripts such as traditional Mongolian, which is written vertically from top to bottom. Tests of this implementation show that the graphical user interface automatically adapts to the different writing directions and that the solution satisfies the requirements of a multilingual graphical user interface.
  • GU Ping,ZHU Qiao-ming,LI Pei-feng,QIAN Pei-de
    2006, 20(4): 102-107.
    An intelligent digital code-based input technique for Chinese characters is proposed, which improves the input rules without modifying the original coding scheme and incorporates a language model. The paper first discusses how to design the Chinese character and word codes to suit the various input modes. It then designs a dynamic self-learning language model and analyzes the data smoothing algorithm used in it. Finally, experimental results on input performance are given, comparing the intelligent input method with the original method. They show that the proposed technique not only reduces the average input code length but also improves the hit rate of the first candidate character.