2006 Volume 20 Issue 3 Published: 15 June 2006
  

  • XUN En-dong,QIAN Yi-li,GUO Qing,SONG Rou
    2006, 20(3): 3-7,30.
It is important to recognize prosodic phrase breaks in text-to-speech synthesis. In this paper, a new method is introduced for this purpose, which uses a binary tree as the pruning strategy in the Maximum Entropy (MaxEnt) framework. First, the concept of a binary tree generated from a statistical language model is given. Then the process of generating the binary tree is discussed. When applying MaxEnt to seek the optimal prosodic phrases, the binary tree is exploited to narrow the search space and improve performance. Experimental results show that the F-score of predicting prosodic phrase breaks is about 35% better than that of the previous system, in which the binary tree strategy is not adopted.
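As a rough illustration of the pruning idea (a simplified sketch, not the paper's exact algorithm): if each inter-word boundary carries a cohesion score from a statistical language model, the boundaries with the weakest cohesion are the only candidate break positions the MaxEnt search needs to consider. All names and scores below are hypothetical.

```python
def candidate_breaks(words, cohesion, keep=2):
    """Return indices of the `keep` weakest inter-word boundaries.

    cohesion[i] scores how tightly words[i] and words[i+1] bind
    (e.g. bigram mutual information from a language model -- assumed here).
    """
    boundaries = sorted(range(len(words) - 1), key=lambda i: cohesion[i])
    return sorted(boundaries[:keep])  # weakest junctures = break candidates

# Hypothetical cohesion scores for a 5-word sentence:
words = ["w1", "w2", "w3", "w4", "w5"]
cohesion = [0.9, 0.1, 0.8, 0.2]          # low = loose juncture
print(candidate_breaks(words, cohesion))  # -> [1, 3]
```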
  • LIU Feng-cheng,HUANG De-gen,JIANG Peng
    2006, 20(3): 8-15.
An approach based on the supervised AdaBoost.MH learning algorithm for Chinese word sense disambiguation is presented. The AdaBoost.MH algorithm boosts the accuracy of weak decision-stump rules by repeatedly calling a weak learner, finally producing a more accurate rule. A simple stopping criterion is also presented. In order to extract more contextual information, we introduce a new knowledge source, semantic categorization, which improves both the learning efficiency of the algorithm and the disambiguation accuracy, in addition to two classical knowledge sources: the parts of speech of neighboring words and local collocations. Making use of these knowledge sources, the AdaBoost.MH algorithm achieves 85.75% disambiguation accuracy in an open test on 6 typical polysemous words and 20 polysemous words from the SENSEVAL-3 Chinese corpus.
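The boosting loop described above can be sketched as plain binary AdaBoost with decision stumps (a two-class simplification of AdaBoost.MH; the data and parameters below are hypothetical):

```python
import math

def stump_predict(x, feat, thresh, sign):
    # Weak rule: a one-feature threshold test.
    return sign if x[feat] > thresh else -sign

def adaboost_train(X, y, rounds=10):
    """Tiny binary AdaBoost with decision stumps: repeatedly reweight the
    examples and fit the weak rule with the lowest weighted error."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if stump_predict(x, feat, thresh, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, sign)
        err, feat, thresh, sign = best
        err = max(err, 1e-10)
        if err >= 0.5:
            break  # weak rule no better than chance: stop
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feat, thresh, sign))
        # Up-weight the examples this rule got wrong, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, feat, thresh, sign))
             for wi, x, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1
```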
  • GUO Yong-hui,YANG Hong-wei,MA Fang,WANG Bing-xi
    2006, 20(3): 16-23.
An approach to base noun phrase (BaseNP) identification based on rough sets is proposed in this paper. It divides BaseNP identification into two sequential subtasks, tagging and identification, and regards BaseNP tagging as a decision-making problem that can be solved within rough set theory, which naturally provides feature reduction and rule optimization. The paper first briefly introduces the rough-set-based rule learning method and the relevant algorithms, then describes the flow of BaseNP tagging and identification, and puts forward a solution to instance collisions that improves the performance of BaseNP identification. Detailed experimental steps and results, together with a comparison against several representative similar systems, are given at last. Based on an analysis of the results, the paper also points out directions for further improvement of the approach.
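The feature reduction that rough set theory provides can be illustrated with a brute-force reduct search: find the smallest attribute subset whose indiscernibility classes still determine the tag unambiguously. This is a generic sketch, not the paper's algorithm, and the tagging features are hypothetical:

```python
from itertools import combinations

def partitions(rows, attrs):
    # Group examples that are indiscernible on `attrs`; collect their labels.
    groups = {}
    for row, label in rows:
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, set()).add(label)
    return groups

def consistent(rows, attrs):
    # Consistent iff every indiscernible group has exactly one decision label.
    return all(len(labels) == 1 for labels in partitions(rows, attrs).values())

def reduct(rows, all_attrs):
    """Brute-force rough-set reduct: the smallest attribute subset that
    still decides the label unambiguously (exponential; sketch only)."""
    for size in range(1, len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            if consistent(rows, subset):
                return subset
    return tuple(all_attrs)
```

For example, if the part-of-speech attribute alone already determines the BaseNP tag, the capitalisation attribute is redundant and the reduct is `("pos",)`.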
  • LIAO Sha-sha,JIANG Ming-hu
    2006, 20(3): 24-30.
In this paper, we propose a novel feature selection method based on concept extraction and a shielded level. In this method, we use HowNet as the semantic dictionary to extract concept attributes. Based on their positions in the concept tree, the attributes are assigned weights that represent their descriptive power. A concept attribute is not selected as a feature if its weight is lower than the shielded level; in that case the original word is reserved for use instead. For each word, we calculate the weights of all the concept attributes in its DEF and decide whether to extract the concept attributes or reserve the word. We focus mainly on how to weight the concept attributes, how to balance concept features against word features, and how to treat out-of-dictionary words. Experiments show that if the shielded level is set properly, it not only reduces the feature dimension to a proper scale but also improves classification precision. Moreover, it reduces the difference in classification precision among different categories.
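A minimal sketch of the shielded-level idea, with a made-up depth-based weighting (the paper's actual weighting scheme is not reproduced here; all names are hypothetical):

```python
def select_features(words, def_attrs, attr_depth, shield=0.5, max_depth=8):
    """Shielded-level concept extraction sketch: an attribute's weight
    falls with its depth in the concept tree; attributes at or above the
    shield replace the word, otherwise the word itself is kept."""
    features = []
    for w in words:
        attrs = def_attrs.get(w, [])  # the word's DEF attributes, if any
        weighted = [(a, 1.0 - attr_depth[a] / max_depth) for a in attrs]
        strong = [a for a, wt in weighted if wt >= shield]
        features.extend(strong if strong else [w])  # fall back to the word
    return features
```

Out-of-dictionary words (no DEF entry) simply survive as word features, which is one plausible reading of how such words are treated.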
  • MAO Wei,XU Wei-ran,GUO Jun
    2006, 20(3): 31-37.
An automatic Chinese text categorization method based on an n-gram language model and a chain augmented naïve Bayesian classifier is proposed. The paper introduces the representation of a text through an n-gram language model, argues for the advantage of combining the n-gram language model with the chain augmented naïve Bayesian classifier, analyzes how to choose the parameters of the n-gram language model, and discusses some crucial problems of the categorization system. The effect of the quantity and quality of the training corpus on classifier performance is also studied experimentally. The categorization system is tested on the 863-project data set for Chinese text categorization, and the experimental results show that it performs well.
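A toy version of per-class n-gram language model classification, simplified from the chain augmented naïve Bayes idea: train one character-bigram model per class with add-one smoothing and pick the class whose model gives the text the highest log-likelihood.

```python
import math
from collections import Counter

class NgramNB:
    """One character-bigram model per class; classify by log-likelihood."""
    def __init__(self, n=2):
        self.n, self.models, self.vocab = n, {}, set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            counts = self.models.setdefault(label, Counter())
            for i in range(len(text) - self.n + 1):
                counts[text[i:i + self.n]] += 1
                self.vocab.add(text[i:i + self.n])

    def predict(self, text):
        def loglik(counts):
            total = sum(counts.values())
            v = len(self.vocab)  # add-one smoothing over the seen vocabulary
            return sum(math.log((counts[text[i:i + self.n]] + 1) / (total + v))
                       for i in range(len(text) - self.n + 1))
        return max(self.models, key=lambda c: loglik(self.models[c]))
```

A real system would use word or character n-grams over Chinese text with a stronger smoothing scheme; the choice of n and smoothing is exactly the parameter question the abstract raises.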
  • JIANG Di,YAN Hai-lin,SUN Bo-jun,Siqin Chaoketu,MENG Da-lai
    2006, 20(3): 38-44.
This paper first gives a brief introduction to the background of the Secret History of the Mongols, the great book published in the 13th century during the Yuan Dynasty, and to its especially complicated original typeface. After an analysis of its content and page layout, a scheme for an electronic retrieval system is designed for it, which solves the problem of restoring the original shape of the archaic writing form, in which three lines represent one unit of content. Furthermore, the retrieval system also provides functions for retrieving and counting each part of the original, including the Chinese transliteration, the Chinese translation, and the phonological and grammatical markers. The retrieval results include the numbers of the traditional academic chapters and sections, which are very important for users, the numbers of the original volumes and pages, and the positions of the retrieved objects in the electronic text. In addition, the system makes full use of concordance technology, which can present retrieval units with their contexts. Generally speaking, this retrieval system can basically satisfy the needs of studying the history, literature, and language of this important historical document.
  • AYKIZ·KADIR,KAYSAR·KADIR,TURGUN·IBRAHIM
    2006, 20(3): 45-50,100.
The noun is one of the basic word classes in human languages. As Uighur is a highly inflectional language, morphological analysis of the Uighur noun, a highly inflected word class, is very important for the study of Uighur grammar and for Uighur language information processing. This paper concerns the formalized morphological description and analysis of the Uighur noun (number, person, case, etc.). It identifies the essential morphological parameters of the Uighur noun, summarizes the rules of its composition and their statistical types, and gives a method for parsing suffixes. This approach provides an effective way to analyze nouns in Uighur language information processing.
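A generic suffix-stripping sketch for an agglutinative noun, peeling case, possessive, and plural suffixes from the end of a word form. The slot order and suffix inventory below are illustrative only, not a faithful Uighur paradigm:

```python
# Outermost-first slot order; the suffix lists are hypothetical examples.
SUFFIX_SLOTS = [
    ("case", ["ni", "ning", "da", "din", "gha"]),
    ("possessive", ["im", "ing", "i"]),
    ("plural", ["lar", "ler"]),
]

def parse_noun(form):
    """Strip at most one suffix per slot, longest match first, keeping at
    least a two-character stem; return the stem and the inner-to-outer
    suffix analysis."""
    analysis = []
    for slot, suffixes in SUFFIX_SLOTS:
        for s in sorted(suffixes, key=len, reverse=True):
            if form.endswith(s) and len(form) > len(s) + 1:
                analysis.append((slot, s))
                form = form[: -len(s)]
                break
    return form, list(reversed(analysis))
```

For example, a form like "kitablarni" would analyze as stem "kitab" plus a plural and a case suffix, read inner to outer.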
  • JIA Yan-min,WU Jian,Ngodrup,SUN Yu-fang
    2006, 20(3): 51-56,79.
The office suite is one of the most widely used kinds of information processing software. Presently, no office suite fully supports Tibetan, which is a main bottleneck in the development of Tibetan information technology. The open source project OpenOffice.org provides a good opportunity for developing a Tibetan office suite. Based on the source code of OpenOffice.org and the newly coded Tibetan character sets (Extension A), we have developed a Tibetan office suite supporting Tibetan typesetting styles and the Tibetan locale, in which the problem of Tibetan text line breaking has been solved. This office suite can meet the office automation requirements of Tibetan users.
  • LIU Yuan-chao,WANG Xiao-long,XU Zhi-ming,GUAN Yi
    2006, 20(3): 57-64.
As an unsupervised machine learning method, document clustering has been widely used in many NLP applications such as information retrieval and automatic multi-document summarization. This paper first discusses the background and architecture of document clustering, then surveys related problems including clustering algorithms, feature space construction, dimension reduction, and semantic issues. Finally, it introduces the evaluation of cluster quality.
  • YUAN Xing-yu,WANG Ting,ZHOU Hui-ping,XIAO Jun
    2006, 20(3): 65-71.
In the task of information filtering, the profile of the user's interests and preferences is the key to system performance. In the vector space model, the profile is usually represented as a set of features, but such a profile cannot exactly reflect the user's information requirements because it lacks semantic information. This paper proposes an approach to constructing the user's profile based on an ontology. In our method, the features of the vector space model are defined as instances in the ontology, and their semantic relations are represented by means of a concept-properties model. We also design a method to compute the semantic association between instances according to the length of the paths between them in the ontology graph. Furthermore, the semantic relation between a document and the profile is calculated through the semantic associations between their features.
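The path-length association mentioned above might look like a breadth-first-search distance over the ontology graph mapped into (0, 1]; the graph representation and the 1/(1+d) mapping below are assumptions, not the paper's exact definition:

```python
from collections import deque

def semantic_assoc(graph, a, b):
    """BFS distance between two instances in an (assumed undirected)
    ontology graph, mapped to a similarity: 1/(1 + distance)."""
    if a == b:
        return 1.0
    seen, frontier, dist = {a}, deque([a]), {a: 0}
    while frontier:
        node = frontier.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                dist[nxt] = dist[node] + 1
                if nxt == b:
                    return 1.0 / (1.0 + dist[nxt])
                seen.add(nxt)
                frontier.append(nxt)
    return 0.0  # unconnected instances have no association
```

Document-profile similarity could then aggregate these pairwise scores over feature pairs, which is one plausible reading of the last sentence of the abstract.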
  • LIU Yi-qun,ZHANG Min,MA Shao-ping
    2006, 20(3): 72-79.
The existence of low-quality Web pages affects both the effectiveness and the efficiency of Web search. In this paper, we define Web page quality estimation as a learning problem. First, several query-independent features are investigated that can separate search target pages from ordinary ones. Bayesian estimation based on these features is then used to train a model that assigns importance scores to Web pages. In TREC-based experiments, the top-scored set removes 45% of low-quality pages while retaining 95% of high-quality ones. This shows the possibility of achieving better performance with less storage and computing resource for search engines.
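A naive-Bayes-style sketch of combining query-independent features into a quality score; the feature names and likelihood tables below are hypothetical, standing in for whatever features the paper actually investigates:

```python
import math

def quality_score(page, likelihoods, prior_high=0.5):
    """Log-odds that a page is high quality, combining discretized
    query-independent features under a conditional-independence assumption.

    likelihoods[feat][value] = (P(value | high quality), P(value | low quality))
    """
    score = math.log(prior_high / (1 - prior_high))
    for feat, value in page.items():
        p_high, p_low = likelihoods[feat][value]
        score += math.log(p_high / p_low)
    return score  # > 0 favours the high-quality class
```

Thresholding this score yields the kind of top-scored subset the abstract evaluates against TREC judgments.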
  • LV Bi-bo,ZHAO Jun
    2006, 20(3): 80-85.
In information retrieval, relevance feedback is an effective way to improve retrieval performance. Its goal is to take the user's judgments on previously retrieved documents as input and to select terms for query expansion using a certain strategy. This paper introduces some common query expansion approaches for relevance feedback based on the probability model and the vector space model, and then introduces a new term selection method based on the language model, which takes into account two features of expanded terms: "relevance" and "coverage". The evaluation is conducted on the TREC collection and shows that our method is better than traditional ones in average precision.
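One way to combine the two criteria named above, "relevance" and "coverage", is sketched below. The scoring formula is an assumption, not the paper's: relevance is taken as how much more frequent a term is in the judged-relevant documents than in the whole collection, and coverage as the fraction of relevant documents containing it.

```python
import math

def expansion_terms(relevant_docs, collection_docs, k=2):
    """Rank candidate expansion terms by relevance x coverage (sketch)."""
    def tf(docs):
        counts, total = {}, 0
        for d in docs:
            for t in d.split():
                counts[t] = counts.get(t, 0) + 1
                total += 1
        return counts, total

    rel_tf, rel_n = tf(relevant_docs)
    col_tf, col_n = tf(collection_docs)
    scores = {}
    for term, cnt in rel_tf.items():
        # Log-ratio of relevant-set frequency to (smoothed) collection frequency.
        relevance = math.log((cnt / rel_n) /
                             ((col_tf.get(term, 0) + 1) / (col_n + 1)))
        coverage = sum(term in d.split() for d in relevant_docs) / len(relevant_docs)
        scores[term] = relevance * coverage
    return sorted(scores, key=scores.get, reverse=True)[:k]
```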
  • DING Guo-dong,BAI Shuo,WANG Bin
    2006, 20(3): 86-93.
Techniques for automatic query expansion have been extensively studied in information retrieval research as a solution to the word mismatch problem between queries and documents. Using the idea of Local Context Analysis, in this paper we propose a novel expansion method, called LOCOOC, which utilizes the local co-occurrence information in top-ranked documents and the global statistical information in the whole collection to select the most appropriate expansion terms. Experimental results show that LOCOOC offers more effective and robust retrieval performance than expansion methods based on local feedback or LCA.
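In the spirit of LCA-style expansion, the sketch below scores candidate terms by their co-occurrence with query terms in the top-ranked documents, discounted by an idf-like global factor. The exact formula and all inputs are assumptions, not LOCOOC's actual definition:

```python
import math

def locooc_expand(query_terms, top_docs, doc_freq, n_docs, k=2):
    """Score candidates by local co-occurrence with the query, then
    discount globally frequent terms with an idf-like factor (sketch)."""
    scores = {}
    for doc in top_docs:
        words = set(doc.split())
        hits = len(words & set(query_terms))
        if not hits:
            continue  # no query term here: no local co-occurrence evidence
        for term in words - set(query_terms):
            scores[term] = scores.get(term, 0.0) + hits
    for term in scores:
        scores[term] *= math.log(n_docs / doc_freq.get(term, 1))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The global discount is what keeps common-everywhere terms from being promoted just because they appear in the feedback documents.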
  • WANG Hui-zhen,ZHU Jing-bo,JI Duo,YE Na,ZHANG Bin
    2006, 20(3): 94-100.
In the field of topic detection and tracking, since topics develop dynamically, topic drift may appear during tracking. To overcome this problem and the shortcomings of current adaptive methods, we propose a new adaptive method based on feedback learning. Building on the idea of incremental learning, the paper presents a new algorithm for the adaptive learning mechanism in the topic tracking task. This algorithm alleviates topic drift and remedies the deficiencies of current adaptive methods. The time sequence of the topic tracking task is also considered in the algorithm, and time information is introduced. In the experiments, we use the Chinese part of the TDT4 corpus as the test corpus and the TDT2004 evaluation metric to evaluate the adaptive Chinese topic tracking system based on feedback learning. The experimental results show that the adaptive method based on feedback learning improves topic tracking performance.
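A minimal sketch of feedback-based adaptation: fold confidently on-topic stories into the topic profile with a time decay, so the profile can follow the topic as it drifts. The threshold and decay values are hypothetical, and real systems weight terms far more carefully:

```python
def update_profile(profile, doc_vec, sim, threshold=0.6, decay=0.9):
    """If a tracked story is confidently on-topic (sim >= threshold),
    decay the old profile and mix in the new story's term weights."""
    if sim < threshold:
        return profile  # not confident enough to learn from this story
    updated = {t: w * decay for t, w in profile.items()}
    for t, w in doc_vec.items():
        updated[t] = updated.get(t, 0.0) + (1 - decay) * w
    return updated
```

The confidence gate is what keeps the feedback loop from learning off-topic stories and drifting away from the original topic, which is the failure mode the abstract targets.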
  • ZHOU Liang,GAO Peng,DING Peng,XU Bo
    2006, 20(3): 101-106.
Integrating speech recognition and information retrieval techniques to implement content-based retrieval over massive speech data is a common paradigm. This paper studies the relationship between speech recognition performance and retrieval performance by analyzing the differences in keyword retrieval over recognition transcripts with different recognition rates, which are adjusted through the language models. Experiments on 114 hours of speech data indicate that speech recognition performance is correlated with retrieval performance to some extent, and show that improving the retrieval method can compensate for some speech recognition errors. The results provide a basis for further advances in the speech recognition engine, the representation of speech recognition results, and rapid retrieval methods.