2005 Volume 19 Issue 2 Published: 15 April 2005
  

  • CHE Wan-xiang, LIU Ting, LI Sheng
    2005, 19(2): 2-7.
    Entity relation extraction is an important research field in information extraction. Two machine learning algorithms, Winnow and Support Vector Machine (SVM), were used to automatically extract entity relations from the training data of the ACE (Automatic Content Extraction) Evaluation 2004. Both algorithms need appropriate feature selection. The performance of both algorithms peaked when two words around an entity were selected as features. The average weighted F-scores of Winnow and SVM were 73.08% and 73.27% respectively. We conclude that, given the same feature set, different machine learning algorithms differ little in performance, so more attention should be paid to finding better features when using automatic learning methods to extract entity relations.
  • CHEN Ning-yu, ZHOU Ya-qian, HUANG Xuan-jing, WU Li-de
    2005, 19(2): 8-12,28.
    In this paper, we describe a system that applies maximum entropy (ME) models to the task of named entity recognition (NER). We start with an annotated corpus and a set of features. These features include morphological features and context information, and they are easily obtainable for almost any language. We build a baseline NE recognizer based on these features. We then construct a named entity thesaurus from information on the web. The baseline, together with the thesaurus, is used to extract named entities and their context information from additional un-annotated data. In turn, these data are incorporated into the final recognizer. The experiments show that these data further improve recognition accuracy.
  • ZHAO Zhang-jie, BAI Shuo
    2005, 19(2): 13-20.
    To describe the components of linguistic expressions and the relationships between them, most grammar theories now focus on syntax, whereas the theory of computation of categorical expressions focuses on semantics. We first investigate some important language phenomena, such as complete and incomplete expressions, syntactic and semantic types, inheritance, word order, component extraction, and coordination, together with the interpretation each grammar theory gives for them. We then provide a formal definition of a categorical expression and analyze how the formal constraints of syntax can guide the organization of semantic content. Furthermore, we provide typical examples to show how to compute categorical expressions with the help of phrase structure. Such a mechanism can be formalized and verified, and it does well at capturing the relationships among the components of a linguistic expression and at discovering what a linguistic expression says.
  • LIU Ping, TAN Jian-long
    2005, 19(2): 21-28.
    We propose XMatch, an algorithm for fast string matching in XML files. When matching pattern strings that contain path information against XML files, traditional string matching algorithms cannot be applied directly and effectively because of the structured nature of XML files, and most available methods of XML content filtering are driven by SAX events, which is not very efficient. While analyzing the schema (the structure of the XML files), XMatch uses the path information of the pattern strings to construct a DFA. In addition, the algorithm supports pattern matching with path information that contains loop references. XMatch is scalable and can support string matching in general structured text. Experimental results show that its efficiency is distinctly better than that of the SAX event-driven method.
  • SUN Bin
    2005, 19(2): 29-36.
    This paper presents a brief introduction to the Sense Matrix Model (SMM), which employs a matrix representation of text for information retrieval. By taking the distribution of words in the sense direction into account, SMM represents a document as a term-sense matrix and a document collection as a term-sense-document space. With such a document representation, some useful data analysis techniques can be introduced or developed, including similarities based on matrix norms, sense weighting, document transforms with DCT as well as MAD (multi-way data decomposition), and kNN and SVM classification using the sense matrix representation. The model also provides novel techniques for cross-lingual IR and multi-lingual text classification without using any separate or integrated translation or "model training". Some initial experimental results of document DCT with the SMART IR system are also discussed.
  • ZHOU Zhao-tao, BU Dong-bo, CHENG Xue-qi
    2005, 19(2): 37-44.
    Text representation is the basis of text processing. Most text representation models do not consider the order of the terms in the text, which results in a loss of text semantics. We try to introduce order into the representation by using a graph structure to represent text. In this paper, we propose a novel text representation model named the Graph Space Model (GSM) and introduce a novel way to measure the representation ability of a text representation model. We compare the ability of the Vector Space Model (VSM) with that of our GSM. Our model does not yet outperform the VSM, and many problems in text representation remain to be investigated in order to recover the lost semantics.
  • LIU Yi-qun, ZHANG Min, MA Shao-ping
    2005, 19(2): 45-51.
    Virtual Organization (VO) is a basic concept in grid architecture. Analysis of the link structure of Web pages shows that similar organizations, which we call Virtual Sites, exist on the Internet. Many features of virtual organizations can be found in Virtual Sites, especially some non-content features, which are further used to select the entry pages of Virtual Sites. This subset of Virtual Site entry pages proves to be of high quality under both content and link structure analysis. Although this entry page set contains only about 21% of the pages in the whole collection, it covers more than 70% of its links. Furthermore, information retrieval on this page set achieves an improvement of more than 60% over retrieval on all pages.
  • CHEN Kang, WU Gang-shan
    2005, 19(2): 52-58.
    With the rapid development of the Web, the information resources on it are becoming more and more abundant. People obtain information from the Web mainly through search tools, but are often frustrated by their precision. To solve this problem, we adopt a domain ontology in our information retrieval system. By using the domain knowledge in the ontology, the retrieval system can improve its semantic understanding of retrieved documents and give users the chance to express their information requests in a more natural (and more precise) way. Experimental results show that this method can increase the precision of information retrieval.
  • SU Qi, ZAN Hong-ying, HU Jing-he, XIANG Kun
    2005, 19(2): 59-66.
    NLP technology combined with information retrieval has become mainstream in the IR field. In this article, the authors combine POS tagging with IR in an attempt to find the effects of POS tagging on the performance of IR systems. Using the SMART system, the authors performed experiments with different tagsets and different term vector weighting schemes. The experiments show that retrieval performance using tags improved for certain topics and documents. The effects, however, are smaller than those of assigning appropriate term weighting, and they depend on the concrete words in the topics and documents. Further research is needed to find general rules.
  • LU Jiao-li, ZHENG Jia-heng
    2005, 19(2): 67-71.
    This paper performs text categorization using the reduction theory of rough sets. The work consists of the following steps. The documents are preprocessed. The Okapi term weighting formula is improved. The term weights are discretized, and attribute reduction and rule extraction are carried out: first the dimension of the feature vector is reduced using a discernibility matrix, then it is reduced again by computing relative reducts, and finally the decision rules are produced and a rule-combination strategy is employed to obtain the final decision rules. An algorithm for matching documents to rules is designed so that the matching process is as simple and orderly as possible. The experimental results indicate that the approach is effective.
  • JIANG Ji-fa, WANG Shu-xi
    2005, 19(2): 72-78.
    This paper presents BRPAM, a method for acquiring bi-relations and bi-relation patterns from free text, and its implementation, BRPAM-Texts. A test using BRPAM-Texts to extract bi-relations of the form 〈organization, headquarter location of organization〉 indicates that BRPAM can acquire more bi-relations of the same class from a large free-text set, starting from a few seed bi-relations initially given by users, and that the precision and recall of bi-relation extraction using this method are comparatively high.
  • ZHENG Ze-zhi, ZHANG Pu, YANG Jian-guo
    2005, 19(2): 79-86.
    More and more lettered-words are used in Chinese texts, most of which are new terms or proper nouns, and this trend is becoming quite obvious. In automatic Chinese segmentation, lettered-words are usually unknown phrases or words. Our observation of lettered-words in our Chinese corpus suggests that identifying them correctly will improve the quality of Chinese segmentation, information retrieval, search technology, machine translation, and so on. This paper analyzes the complex features of Chinese lettered-words and discusses the difficulties in extracting them. An algorithm for the automatic identification of Chinese lettered-words is presented, which uses a letter string as the anchor and searches its left and right contexts for the boundaries of the lettered-word. The algorithm is simple but effective. Our experiment on the corpus of the People's Daily (Year 2002) shows precision and recall rates of 80% and 100% respectively.
  • CHEN Wen-liang, ZHU Mu-hua, ZHU Jing-bo, YAO Tian-shun
    2005, 19(2): 87-93.
    This paper proposes a semi-supervised text categorization method using bootstrapping. The system uses the Maximum Entropy Model as the text classifier. Starting from a small set of seed training samples, it automatically labels unlabeled samples and learns from them as new seed training samples. We use a weighting factor to adjust the weight of the new seed samples during the subsequent training process. The experimental results show that the proposed system performs better than the conventional system given the same labeled documents. It yields 70.56% F1 using only 100 labeled documents for each category, 4.17% higher than the conventional system. It can also match the performance of the conventional system using 50% or fewer of the training samples. The results also show that the weighting factor improves performance.
  • ZHANG Qi, HUANG Xuan-jing, WU Li-de
    2005, 19(2): 94-100.
    This paper introduces a new method for calculating the similarity between sentences. The algorithm uses not only unigrams but also bigrams and trigrams to calculate similarity, and it is based on regression methods. Experiments show that the method is effective: the final summarization result is better than that of an algorithm that does not use it. We also propose a new summarization algorithm based on sentence weights and the new sentence similarity method. While the most important sentences are extracted, redundancy is also reduced. Evaluation on DUC 2003 and DUC 2004 shows its effectiveness. Our system ranked second among all systems that took part in DUC 2004.
  • ZHANG Yu, LIU Ting, WEN Xu
    2005, 19(2): 101-106.
    With the rapid development of the Internet, open-domain question answering systems are becoming more and more attractive because they can give compact and exact results. An open-domain question answering system is composed of five parts: question classification, question expansion, search engine, answer extraction, and answer selection. Question classification plays an important role in the system and directly affects the correctness of its answers. In this paper, a modified Bayesian model is introduced based on an analysis of the Bayesian model. A training set and a testing set were constructed to verify the effect of this model. Experiments show that this method achieves better results in practice.
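
Note on the last abstract above: it does not say how the Bayesian model was modified, so the following is only a minimal sketch of the unmodified baseline, a multinomial naive Bayes question classifier with add-one smoothing. The class name, the whitespace tokenizer, the labels and the toy questions are all assumptions made for illustration, not material from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesQuestionClassifier:
    """Plain multinomial naive Bayes baseline (hypothetical helper, not the paper's model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive (Laplace) smoothing constant
        self.class_counts = Counter()           # number of training questions per class
        self.word_counts = defaultdict(Counter) # token counts per class
        self.vocab = set()

    def fit(self, questions, labels):
        for question, label in zip(questions, labels):
            tokens = question.lower().split()   # assumed tokenizer: whitespace split
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, question):
        tokens = question.lower().split()
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log prior plus the sum of smoothed log likelihoods of each token
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            for token in tokens:
                score += math.log((self.word_counts[label][token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train_questions = [
    "who wrote hamlet",
    "who invented the telephone",
    "where is paris",
    "where is the eiffel tower",
]
train_labels = ["PERSON", "PERSON", "LOCATION", "LOCATION"]

classifier = NaiveBayesQuestionClassifier().fit(train_questions, train_labels)
print(classifier.predict("who discovered penicillin"))
print(classifier.predict("where is london"))
```

In a pipeline like the one the abstract lists, the predicted question class would typically be passed on to answer extraction to constrain the expected answer type.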