2005 Volume 19 Issue 3 Published: 15 June 2005
  

  • WU You-zheng, ZHAO Jun, DUAN Xiang-yu, XU Bo
    2005, 19(3): 2-14.
    Question Answering (QA) is a next-generation form of search engine that draws on natural language processing (NLP), information retrieval (IR), and related fields. QA aims to provide more powerful information access tools that help users cope with information overload. Over the last decade, QA has become an important subfield of NLP and IR. Its development track, namely accelerating research through systematic, large-scale evaluation, together with some successful experiences, such as the effectiveness of partial-parsing techniques based on character surface forms and the importance of fast NLP tools, has made it a major impetus to NLP research. Moreover, QA has built a more effective connection between NLP research and NLP applications. It is therefore helpful to review the history of QA and survey its state of the art.
  • ZHANG Jun-lin, SUN Le, SUN Yu-fang
    2005, 19(3): 15-21.
    Accurate estimation of the document language model is important to the performance of language-model-based IR systems. In this paper we propose a topic-based approach to language modeling for ad hoc information retrieval. An improved two-stage k-means clustering method is designed to process the document collection, and the resulting clusters are regarded as the topic information contained in the collection. By combining the aspect model with text clustering, we derive a more accurate document language model for ad hoc information retrieval. Experiments show that the performance of the IR system improves greatly. Compared with the Jelinek-Mercer language model IR system, the precision of the trigger language model based IR system increased by almost 16.17% and its recall increased by 9.64%.
  • DANG Zheng-fa, ZHOU Qiang
    2005, 19(3): 22-28.
    Automatic conversion between differently annotated treebanks is an important subject in natural language processing. After briefly summarizing several treebank annotation schemes and the conversions between them, we propose a new algorithm that automatically converts the Tsinghua Chinese Treebank (TCT for short) from phrase structure to dependency structure. The algorithm makes full use of TCT's syntactic constituent tags and grammatical relation tags to generate a dependency treebank. The output dependency treebank indicates not only the hierarchical dependency relationships between nodes but also the specific type of each dependency relationship. The precision of the conversion reaches 97.37%.
  • BI Yu-de
    2005, 19(3): 29-33,45.
    In any NLP system, including MT systems, a dictionary of syntactic and semantic information is an essential component. Building on achievements in semantic projects both at home and abroad, this paper provides an integrated description of the syntax and semantics of Korean predicates, with the aim of constructing an information-processing-oriented Korean knowledge database. The semantic framework is drawn from theta structure theory and semantic field theory. We begin with a semantic classification of Korean predicates, followed by a detailed description of their semantic properties. In constructing the dictionary, we integrate syntactic and semantic properties in a structured way.
  • HAN Yong, XU De
    2005, 19(3): 34-40.
    Pen and paper is one of the most important means of communication in daily life. Traditional pen and paper has its shortcomings: content on paper is difficult to modify or reprocess, and information kept on paper lacks effective means of maintenance and search, whereas computers are superior in these respects. Research on pen user interfaces aims to make such traditional work computable through the study of hardware and software in related fields. In our work, we describe the interaction tasks of text input and editing in terms of the input events of the pen device. We design and implement a new pen-based text editor that simulates the everyday pen-and-paper working mode. Our editor integrates an advanced Rubin algorithm with rule-based recognition techniques. The system is suitable for users who are not familiar with computers.
  • CAI Zeng-yu, GU Wen-xiang
    2005, 19(3): 41-45.
    Chinese character input plays a key role in Chinese information processing, and the input model is very important to the study of Chinese character input. This paper studies the construction of Chinese character input models based on automata theory, handles control operations in the input model, and introduces the concepts of the two-way deterministic Chinese finite automaton and the two-way nondeterministic Chinese finite automaton. Compared with previous models, our models can handle control operations during character input and have greater expressive power. Our input models can therefore be treated as extensions of the previous models.
  • LIU Yang, LIU Qun, LIN Shou-xun
    2005, 19(3): 46-54.
    Most current automatic metrics for machine translation evaluation do not consider that unmatched words may carry neglected information. In this paper, we describe a strategy for automatically finding fuzzy-matched word pairs between reference and candidate translations and propose an approach to computing their similarity. The whole process of finding fuzzy-matched word pairs and computing their similarity is demonstrated in detail. Experiments show that our method finds neglected meaningful word pairs fairly well. More importantly, the performance of BLEU is significantly improved by integrating fuzzy matching. Fuzzy matching could also be used to improve other automatic evaluation methods.
  • ZHOU Yu, ZONG Cheng-qing, XU Bo
    2005, 19(3): 55-61.
    In this paper we propose a new algorithm, called multi-layer filtering, that automatically extracts bilingual aligned chunks from Chinese-English parallel texts. Different layers extract bilingual chunks according to the different features that chunks exhibit in the bilingual corpus. Unlike most conventional algorithms, our chunking and alignment algorithm does not rely on information from tagging, parsing, or syntactic analysis. Preliminary experimental results show that the algorithm performs well in chunking and alignment: the F-measure of chunking is 0.7 and the F-measure of alignment is 0.8. Moreover, the translations generated by this algorithm are much better than those generated by the baseline word alignment algorithm, an improvement of almost 10%.
  • XIONG De-yi, LIU Qun, LIN Shou-xun
    2005, 19(3): 62-67.
    Knowledge acquisition is regarded as a bottleneck in many NLP tasks, such as machine translation and information extraction. Treebank-based statistical parsing is no exception. The latent linguistic knowledge in a treebank is very rich but cannot be acquired directly. Our model incorporates such rich linguistic features into Chinese statistical parsing in three ways. First, non-recursive noun and verb phrases are annotated in the Penn Chinese Treebank because they strongly mark boundaries. Second, a new head percolation table is designed based on Xia's table. Third, our model uses the context configuration frame, which provides a stronger representation of bilexical dependency structures. Together, these three linguistic features yield a remarkable improvement of 2.37% in F1 measure and 5.36% in complete match ratio.
  • LUO Zhi-yong, SONG Rou
    2005, 19(3): 68-73,87.
    Proper noun recognition is one of the most important parts of a word segmentation system for modern Chinese. This paper first analyzes the shortcomings of traditional proper noun recognition methods in statistical language models and other corpus-based models. We then put forward a reliability-based strategy for recognizing person names. We also train the model with a bootstrapping method, free of the limits of manually tagged corpora. A large-scale test on a real corpus shows that this method successfully resolves the problem of mis-estimating candidate proper nouns in traditional methods. In addition, our method is comparable to traditional corpus-based methods.
  • LIU Shi-yue, LI Heng, ZHANG Li, YAO Tian-shun
    2005, 19(3): 74-80.
    In this paper we discuss the application of co-training, a semi-supervised machine learning method, to Chinese text chunking. We first give the definition of a Chinese chunk and then a formal definition of the co-training algorithm. We propose a consistency-based example selection method that uses two classifiers, Transductive HMM and fnTBL, combined into a classification system that performs Chinese text chunking with a small-scale labeled Chinese treebank and a large-scale unlabeled Chinese corpus. The results are compared with those of self-training and of a non-co-training experiment in which only the small-scale Chinese treebank is used as training data and a single classifier (Transductive HMM or fnTBL) recognizes the Chinese chunks. The improvement is significant: the F values of the two classifiers reach 83.41% and 85.34%, improvements of 2.13 and 7.21 points respectively.
  • WANG Li-xia, SUN Hong-lin
    2005, 19(3): 81-87.
    Right-boundary ambiguity of the prepositional phrase (PP) is one of the most prominent structural ambiguities in Chinese. The goal of this paper is to recognize the boundaries of prepositional phrases automatically without introducing parsing techniques. A statistical algorithm is applied in which each word after a preposition in a sentence is viewed as a candidate for the right boundary of a PP, and the likelihood of each position being the right boundary is estimated. To simplify the model, we assume that the right boundary of a PP depends only on the one word before and the one word after the boundary. Although this simplified model is not fully accurate, it still faces a sparse-data problem in parameter estimation, since it depends on co-occurrences of words within one sentence. To obtain more reliable probability estimates, we smooth the parameters with deleted interpolation, combining different models that mix words and parts of speech. Finally, three simple rules are used to correct some of the errors produced by the statistical algorithm. The preposition "ZAI" ("在") is chosen for our recognition experiments, in which the algorithm achieves precisions of 97% and 93% on the closed test and open test respectively, a significant improvement over previous similar work.
  • WANG Meng, HE Ting-ting, JI Dong-hong, WANG Xiao-rong
    2005, 19(3): 88-94.
    This paper presents an approach to Chinese text summarization. Unlike conventional statistical methods, we use concepts (word senses) rather than words as features. The weight of each sentence is computed from the paragraph weight and a thematic conceptual vector space model; once all sentence weights have been computed, the sentences are ranked by weight. Sentences with high weights are selected as summary sentences. To evaluate the summarization system, we use two methods: one compares machine abstracts with manual abstracts; the other compares the precision of different summarization methods.
  • CHEN Zhen-yu, CHEN Zhen-ning
    2005, 19(3): 95-105.
    The temporal information of a sentence in modern Chinese is represented jointly by the units of the sentence and the relations among them. An integrated temporal cognitive model should therefore be established in order to compute this temporal information. The model involves three procedures. First, it reduces the concept of temporal properties to cognitive constituents, namely the three fundamental phases of an event (its beginning, continuance, and end), and categorizes the types of events; the temporal model is constructed on this basis. Second, it translates every possible unit and relation in modern Chinese sentences into symbols for event types and signs for temporal constituents. The result of the translation, namely the meta-linguistic expression (i.e. the translation form) of the unit or relation, makes clear what the unit or relation means in the process of carrying temporal information. Third, it establishes a rule-driven system of calculation in the cognitive model, which simplifies the meta-linguistic expressions into the most economical expression; this is exactly the temporal information encoded in the sentence as a whole.
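
Editorial illustration for the abstract at 2005, 19(3): 15-21, which compares its topic-based model against a Jelinek-Mercer language model baseline. The sketch below shows only standard Jelinek-Mercer query-likelihood scoring, not the authors' topic-based method. The function name `jm_score`, the mixing weight `lam`, and the toy documents are assumptions introduced here for illustration; none of them come from the paper.

```python
import math
from collections import Counter


def jm_score(query, doc, collection, lam=0.5):
    """Log query likelihood under Jelinek-Mercer smoothing.

    P(w | d) = (1 - lam) * tf(w, d) / |d| + lam * cf(w) / |C|

    query, doc, and collection are lists of tokens; lam is a
    hypothetical mixing weight chosen for this illustration.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)

    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / doc_len if doc_len else 0.0
        p_coll = coll_tf[w] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The word appears nowhere in the collection, so even the
            # smoothed probability is zero.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data, invented for this example.
doc_a = ["language", "model", "retrieval"]
doc_b = ["pen", "text", "editor"]
collection = doc_a + doc_b
query = ["language", "model"]

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: jm_score(query, item[1], collection),
    reverse=True,
)
```

Smoothing with the collection model keeps a document that lacks a query word from receiving zero probability, so `doc_b` still gets a finite score but ranks below `doc_a`.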