2008 Volume 22 Issue 1 Published: 15 February 2008
  

  • Review
    ZHAO Yan-yan, QIN Bing, CHE Wan-xiang, LIU Ting
    2008, 22(1): 3-8.
    Event extraction is an important research topic in information extraction. This paper studies the two stages of Chinese event extraction, namely event type recognition and event argument recognition. For event type recognition, a novel method combining event trigger expansion with a binary classifier is presented; for argument recognition, a multi-class classifier based on maximum entropy is introduced. These methods effectively mitigate the class imbalance in model training and the data sparseness caused by the small training set, and the resulting event extraction system achieves good performance.
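    The two-stage pipeline lends itself to a short sketch. Below is a minimal, hypothetical rendering in Python, with maximum entropy realized as logistic regression and the feature extractors left as assumed placeholders; it illustrates the architecture, not the paper's implementation.

    ```python
    from sklearn.linear_model import LogisticRegression

    # Both classifiers are assumed to be fit beforehand on labeled event data.
    type_clf = LogisticRegression()   # stage 1: binary, does this trigger evoke an event?
    role_clf = LogisticRegression()   # stage 2: maxent multi-class over argument roles

    def extract_events(tokens, trigger_lexicon, trig_feats, arg_feats):
        """tokens: word list; trigger_lexicon: the expanded trigger set;
        trig_feats / arg_feats: hypothetical feature extractors."""
        events = []
        for i, tok in enumerate(tokens):
            if tok in trigger_lexicon and type_clf.predict([trig_feats(tokens, i)])[0] == 1:
                roles = {j: role_clf.predict([arg_feats(tokens, i, j)])[0]
                         for j in range(len(tokens)) if j != i}
                events.append((tok, roles))
        return events
    ```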
  • Review
    ZHANG Xiao-yan, WANG Ting, CHEN Huo-wang
    2008, 22(1): 9-14.
    Based on an analysis of news stories and experimental verification, this paper introduces a multi-vector model for story representation that represents the feature set in as much detail as possible. A fuzzy matching method is proposed to compute the relatedness between two named-entity sub-vectors in the multi-vector model. To measure the similarity of stories, all features, together with the named-entity relatedness, are integrated by a Support Vector Machine (SVM). The proposed methods were tested on the TDT4 Chinese corpus for story link detection. The results indicate that story link detection based on the multi-vector model improves performance, and that the relational information generated by fuzzy matching contributes to the improvement.
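    The fuzzy matching step can be sketched as follows; the paper's exact matching function is not given in the abstract, so character-overlap similarity from Python's standard difflib stands in for it here.

    ```python
    from difflib import SequenceMatcher

    def ne_relatedness(ents_a, ents_b):
        """Average best-match similarity between two named-entity sub-vectors."""
        if not ents_a or not ents_b:
            return 0.0
        best = [max(SequenceMatcher(None, a, b).ratio() for b in ents_b) for a in ents_a]
        return sum(best) / len(best)
    ```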
  • Review
    HAN Xian-pei, LIU Kang, ZHAO Jun
    2008, 22(1): 15-21.
    This paper analyzes the different feature types of webpage blocks and presents a content block detection method based on layout features and language features, which effectively resolves the trade-off between detection accuracy and model generality across different types of webpages. The method represents a webpage as a vision-block tree, builds two individual classifiers for layout features and language features respectively, and combines the two classifiers with different strategies. Experimental results show that, while holding content block detection recall above 90%, the combined classifier reaches 85% accuracy, 5% higher than the classifier using only layout features and 15% higher than the classifier using only language features. The combined classifier also performs well across five selected websites, indicating good generality.
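    The abstract does not spell out the combination strategies, so the sketch below shows two generic ones (score-level fusion and a disjunctive rule), assuming each classifier emits a per-block probability of being a content block.

    ```python
    def combine_linear(p_layout, p_lang, alpha=0.5):
        # Score-level fusion: weighted average of the two classifiers' probabilities.
        return alpha * p_layout + (1 - alpha) * p_lang

    def combine_or(p_layout, p_lang, threshold=0.5):
        # Disjunctive rule: keep the block if either classifier is confident enough.
        return p_layout >= threshold or p_lang >= threshold
    ```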
  • Review
    MEI Xue, CHENG Xue-qi, GUO Yan, ZHANG Gang, DING Guo-dong
    2008, 22(1): 22-29.
    Web information extraction has been a hot topic in recent years. The challenge is to extract important information from a large number of web pages as quickly and accurately as possible. This paper proposes a novel method for fully automatic wrapper generation for Web information extraction. The method makes extensive use of the structure of Web templates: it applies the Web Page Link_Sort and Web Page Structure_Seperator algorithms to extract information from Web pages and outputs a wrapper accordingly. Experimental results show that the method performs well on both rigidly and loosely structured records in Web pages.
  • Review
    LI Jing-jing, YAN Hong-fei
    2008, 22(1): 30-36.
    With the rapid development of the World Wide Web, Web information retrieval (IR) has become a hot research topic, but progress has been restricted by the lack of appropriate test collections. Following the framework of existing foreign test collections, we constructed the large-scale Chinese Web Test collections (CWT) and organized the SEWM Chinese Web search evaluation. Based on an investigation and analysis of current research, the details of constructing each component are introduced, and statistical analyses and experiments are carried out. The methodology used in engineering CWT should be readily applicable to the construction of future Web corpora.
  • Review
    LUO Chang-sheng , DUAN Jian-guo , GUO Li
    2008, 22(1): 37-43.
    The ability to learn incrementally from batches of data is an important feature that makes a learning algorithm more applicable to real-world problems. Incremental learning can keep the memory and time consumption of a learning algorithm at a manageable level, and incremental algorithms have been widely used on large-scale datasets. This paper discusses the general issues of incremental learning for text classification and, based on the DragPush strategy, introduces an incremental text classification method named ICCDP. It then explores incremental learning issues based on ICCDP. Experimental results reveal that ICCDP trains fast and delivers excellent classification performance.
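    DragPush and ICCDP are not specified in the abstract, so the sketch below shows only the generic incremental-learning skeleton the paragraph describes, using scikit-learn's partial_fit as an assumed stand-in: the model is updated batch by batch instead of being retrained on the full dataset.

    ```python
    from sklearn.linear_model import SGDClassifier

    ALL_LABELS = [0, 1]                   # hypothetical label set, known in advance
    clf = SGDClassifier(loss="log_loss")
    for i, (X_batch, y_batch) in enumerate(batches):  # `batches` yields (features, labels)
        if i == 0:
            clf.partial_fit(X_batch, y_batch, classes=ALL_LABELS)
        else:
            clf.partial_fit(X_batch, y_batch)
    ```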
  • Review
    XU Yan, WANG Bin, LI Jin-tao, SUN Chun-ming
    2008, 22(1): 44-50.
    Feature selection (FS) plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, and existing experiments show that IG is one of the most effective. This paper proposes a feature selection method based on Rough Set theory. In Rough Set theory, knowledge about a universe of objects is defined as the ability to classify those objects according to certain properties. We quantify this classification ability, call the measure knowledge quantity, and from it derive a notion of knowledge gain, leading to a knowledge-gain-based feature selection method (the KG method). Experiments on the NewsGroup and OHSUMED collections show that KG performs better than IG, especially under extremely aggressive reduction.
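    The knowledge-quantity and knowledge-gain definitions are the paper's own and their formulas are not given in the abstract, so the sketch below shows only the information-gain baseline that KG is compared against, computed from document counts.

    ```python
    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(n_ct, n_c, n_t, n):
        """IG of term t: n_ct[c] = docs of class c containing t,
        n_c[c] = docs of class c, n_t = docs containing t, n = total docs."""
        h_c = entropy([n_c[c] / n for c in n_c])
        p_t = n_t / n
        h_given_t = entropy([n_ct[c] / n_t for c in n_c]) if n_t else 0.0
        n_not = n - n_t
        h_given_not = entropy([(n_c[c] - n_ct[c]) / n_not for c in n_c]) if n_not else 0.0
        return h_c - p_t * h_given_t - (1 - p_t) * h_given_not
    ```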
  • Review
    ZENG Yi-ling, XU Hong-bo, BAI Shuo
    2008, 22(1): 51-55,60.
    As a density-based clustering algorithm, OPTICS can reveal the intrinsic corpus structure in a visual plot. However, due to an improper strategy for organizing points in sparse regions, the algorithm does not reach its best performance. To solve this problem, we propose an effective result-reorganization strategy that reorders those sparse points. Based on this strategy, a new text clustering algorithm named OPTICS-Plus is proposed according to the characteristics of text mining. Experiments on the Fudan text classification corpus show that the result-reorganization strategy helps the reachability plot give a clearer view of corpus structure, and a comparison with K-means shows that the clustering performance of OPTICS-Plus is satisfactory.
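    A reachability plot of the kind the paper builds on can be produced with stock OPTICS; the sketch below (scikit-learn, cosine distance over TF-IDF vectors, parameters assumed) shows the baseline only, since OPTICS-Plus's reordering of sparse points is the paper's own extension.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import OPTICS

    X = TfidfVectorizer().fit_transform(docs).toarray()   # `docs`: list of raw texts
    opt = OPTICS(min_samples=5, metric="cosine").fit(X)
    reachability = opt.reachability_[opt.ordering_]       # values to plot, in visit order
    ```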
  • Review
    ZHANG Gang, LIU Yue, CHENG Xue-qi
    2008, 22(1): 56-60.
    Proper collection partitioning can greatly improve the performance of distributed information retrieval. To solve the collection partitioning problem, this paper proposes a query-space based collection partitioning algorithm. Compared with the traditional document-space based approach, this algorithm offers a new understanding of and viewpoint on collection partitioning, and gives a good solution for partitioning large-scale document collections. Experiments indicate that the algorithm achieves significant improvements in both effectiveness and efficiency.
  • Review
    WEN Jian , LI Zhou-jun
    2008, 22(1): 61-66, 122.
    Recent research shows that topic language models improve information retrieval performance, but many problems remain unsolved, including data sparseness, synonymy and polysemy, and the smoothing of seen and unseen terms. All of these problems matter for IR, especially for domain literature such as biological publications. This paper proposes a new cluster-based topic language model. The work has two main parts. First, documents are represented by ontology concepts, and concept-based clustering is performed with Fuzzy C-Means; the clustering result is taken as the set of topics of the document collection, and the probability of a document generating a topic is estimated by the similarity between the document and each cluster. Second, the probability of a topic generating words is estimated with the Expectation Maximization algorithm. Finally, integrating these estimates into the aspect model yields our topic language model. The model accurately describes the probability distribution of words over topics and the probability of a document generating a topic, and it can partly resolve synonymy and polysemy. The method was evaluated on the TREC 2004/05 Genomics Track collections, and experiments show that it improves retrieval performance over a simple language model.
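    The final scoring step the abstract describes is the standard aspect-model interpolation; a minimal sketch follows, with the mixture weight and the estimated distributions assumed to come from the Fuzzy C-Means and EM steps above.

    ```python
    def p_word_given_doc(w, d, p_ml, p_w_topic, p_topic_doc, lam=0.7):
        """P(w|d) = lam * P_ml(w|d) + (1 - lam) * sum_z P(w|z) P(z|d).
        p_ml[d]: document language model; p_w_topic[z]: topic-word model;
        p_topic_doc[d][z]: document-topic mixture (all assumed pre-estimated)."""
        topic_part = sum(p_w_topic[z].get(w, 0.0) * p_topic_doc[d][z]
                         for z in p_topic_doc[d])
        return lam * p_ml[d].get(w, 0.0) + (1 - lam) * topic_part
    ```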
  • Review
    LIU Wu-ying, WANG Ting
    2008, 22(1): 67-73.
    Spam filtering is the task of labeling emails as spam or ham in an online setting, essentially a self-learning procedure driven by user feedback. Simple filters already exist that apply linguistic features or behavior features. In this paper, we use ensemble learning to combine multiple filters and achieve higher performance than any single filter. The experimental results show that single-feature learning is fast while ensemble learning is more effective, with the proposed SVM ensemble method achieving the highest performance.
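    One standard way to realize the SVM ensemble is stacking: each base filter scores the email, and an SVM meta-classifier combines the scores. The sketch below assumes this setup; the paper's concrete base filters are not named in the abstract.

    ```python
    import numpy as np
    from sklearn.svm import SVC

    def stacked_features(filters, email):
        # Each base filter is assumed to expose score(email) -> float.
        return [f.score(email) for f in filters]

    meta = SVC()
    X_meta = np.array([stacked_features(filters, e) for e in train_emails])
    meta.fit(X_meta, train_labels)          # labels: spam / ham
    ```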
  • Review
    WANG Peng-ming, WU Shui-xiu, WANG Ming-wen, HUANG Guo-bin
    2008, 22(1): 74-79.
    With the arrival of the network era, research on spam filtering technology has become imperative. However, characteristics of mail datasets such as data sparseness, high dimensionality, and multi-collinearity in mail content set spam filtering apart from general text classification. This paper proposes a Partial Least Squares (PLS) feature extraction method for spam filtering, which extracts latent semantic components that capture both content information and class information, and which copes with multi-collinearity. Experiments on the Enron-Spam dataset show that the method performs very well in spam filtering compared with χ2-statistic feature selection.
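    PLS as a supervised feature extractor can be sketched with scikit-learn; the component count and the 0/1 label encoding are assumptions, and Enron-Spam preprocessing is omitted.

    ```python
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.linear_model import LogisticRegression

    pls = PLSRegression(n_components=50)              # latent semantic components
    Z_train = pls.fit_transform(X_train, y_train)[0]  # y: 0/1 ham/spam labels
    clf = LogisticRegression().fit(Z_train, y_train)
    pred = clf.predict(pls.transform(X_test))
    ```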
  • Review
    ZHANG Zhi-chang, ZHANG Yu, LIU Ting, LI Sheng
    2008, 22(1): 80-86.
    Automatic reading comprehension systems analyze a given passage and generate or extract answers to questions about the passage. This paper proposes an approach that integrates shallow semantic information to extract the answer sentence. The labeled semantic roles in the question and the candidate sentences are represented as semantic trees, and their structural similarity is computed with a tree kernel. After combining this similarity with the matching-word count obtained by a bag-of-words method, the sentence with the highest score is chosen as the answer sentence. The approach achieves 43.3% HumSent accuracy on the Remedia corpus.
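    The answer-sentence selection step can be sketched as a combined score; `tree_kernel` is a placeholder for an actual tree-kernel implementation over the semantic-role trees, and the weighting is an assumption.

    ```python
    def pick_answer(question, q_tree, candidates, tree_kernel, w=0.5):
        """candidates: objects with .text and .tree (semantic-role tree)."""
        q_words = set(question.split())
        def score(c):
            overlap = len(q_words & set(c.text.split()))   # bag-of-words match count
            return w * tree_kernel(q_tree, c.tree) + (1 - w) * overlap
        return max(candidates, key=score)
    ```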
  • Review
    SONG Rui, LIN Hong-fei, YANG Zhi-hao
    2008, 22(1): 87-92.
    Mobile-oriented automatic summarization is constrained in summary length by small screens. This paper designs and implements a mobile summarization system for Chinese news. After parsing the news web page, maximally repeated strings are extracted as the keyword set, and the summary displayed on the mobile terminal is generated using edit distance. Since some web pages are structured with subtitles, hierarchical summarization is applied to them to improve summary coverage. A Q&A-based evaluation is then designed to verify the effectiveness of this kind of summary. Experiments show that the generated summaries do well in conciseness, readability, and coverage.
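    The edit-distance component is standard Levenshtein distance; below is a minimal routine of the kind used to align candidate summary strings (not the paper's exact usage).

    ```python
    def edit_distance(a, b):
        dp = list(range(len(b) + 1))        # distances for the empty prefix of a
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]
    ```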
  • Review
    LI Fang-tao, ZHANG Xian, SUN Jian-shu, ZHU Xiao-yan
    2008, 22(1): 93-98.
    Question classification is one of the most crucial modules in a question answering system, and key words play a very important role in the task. This paper investigates the roles of the question word and the head word in question classification and proposes a novel hierarchical question classifier based on them. Using question words, it first coarsely classifies a question into one of three categories, with an appropriate classifier designed for each category; for what-type questions, a head-word based classifier using association rules is constructed. The hierarchical classifier is tested on the TREC 2007 QA question set and the UIUC dataset, achieving accuracies of 90.6% and 84.0% respectively, which demonstrates the importance of question words and head words in question classification.
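    The hierarchical routing can be sketched as below; the branch classifiers are placeholder objects, and the exact three-way split is an assumption based on the abstract.

    ```python
    def classify_question(q, what_clf, wh_clf, other_clf, head_word):
        first = q.lower().split()[0]
        if first == "what":
            return what_clf.predict(head_word(q))   # head-word association rules
        if first in {"who", "when", "where", "which"}:
            return wh_clf.predict(q)
        return other_clf.predict(q)
    ```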
  • Review
    SONG Li-zhe , ZHAN Chi-bing , WANG Sheng-hai
    2008, 22(1): 99-103.
    In most personalized services of digital libraries, users' interests are represented by keywords or keyword-based methods. This paper addresses the lack of semantic information in keyword representations by proposing an ontology-based method for representing user interests, with a detailed account of constructing a digital library domain ontology, representing user information, and providing services with this method. Taking information retrieval as an example, it describes an implementation of services based on concept similarity and on relations between concepts such as synonymy, IsA, and PartOf. Experiments show that the ontology-based representation provides more personalized information than keyword-based methods.
  • Review
    FENG Yuan-yong, SUN Le, LI Wen-bo, ZHANG Da-kun
    2008, 22(1): 104-110.
    The Conditional Random Fields (CRF) model has become prevalent for sequence labeling tasks in NLP. A general but slow optimization algorithm, L-BFGS, is commonly used for parameter estimation of CRF models. This paper proposes an improved algorithm to train CRF models faster. First, a small set of character hint features is introduced to shrink the feature space. Then, a task-specific rule is applied to prune search paths in the Viterbi and Baum-Welch procedures. Experiments on the China 863 program NER and SIGHAN 2006 corpora show that the scheme saves training time significantly without a performance drop.
  • Review
    HUANG Yu-lan ,GONG Cai-chun ,XU Hong-bo ,CHENG Xue-qi
    2008, 22(1): 111-115.
    This paper presents a domain dictionary generation algorithm based on a pseudo-feedback model. Dictionary generation is treated as a domain term retrieval process: assume the top N strings in the current retrieval result set are relevant to the target domain C, append these strings to the dictionary, and retrieve again, iterating until a predefined number of domain terms has been generated. Experiments on a corpus show that the precision of the pseudo-feedback-based algorithm is much higher than that of existing algorithms.
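    The iterative loop reads directly as pseudocode; a minimal sketch follows, with `search` standing in for the term-retrieval step over the corpus.

    ```python
    def grow_dictionary(seeds, search, n_top=20, target_size=1000):
        dictionary = set(seeds)
        while len(dictionary) < target_size:
            candidates = search(dictionary)         # ranked candidate strings
            new_terms = [t for t in candidates if t not in dictionary][:n_top]
            if not new_terms:                       # pseudo feedback has converged
                break
            dictionary.update(new_terms)            # accept top N as domain terms
        return dictionary
    ```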
  • Review
    XU Lin-hong, LIN Hong-fei, ZHAO Jing
    2008, 22(1): 116-122.
    This paper reports experience in constructing an emotional corpus and discusses several basic issues, including the tagging criteria, tag set, tagging tools, and quality control. The corpus contains about 40,000 sentences. On this basis, statistics on emotion distribution and rules of emotion transfer are derived, and the characteristics and applications of the corpus are analyzed; the emotional corpus thus provides support for text affective computing.
  • Review
    LIU Kang, ZHAO Jun
    2008, 22(1): 123-128.
    This paper focuses on sentence-level sentiment analysis. Traditional methods have two problems: the classification approach cannot take contextual information into account, and label redundancy in a single-layer model hurts the labeling accuracy of the second layer. To address these problems, this paper proposes a sentence sentiment analysis method based on cascaded CRFs, which uses multiple CRF models to compute sentence sentiment and sentiment strength in a cascaded way. The cascaded framework alleviates the negative impact of related labels on labeling accuracy, while the CRF models take contextual information into account. The method improves the accuracy of sentiment strength labeling while labeling sentence sentiment effectively. Experiments validate the method, whose performance greatly exceeds an SVM baseline and the classical single-layer CRF model.
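    A two-layer cascade can be sketched with the sklearn-crfsuite package (an assumed stand-in for the paper's CRF implementation): the first CRF labels sentence sentiment, and its predictions are appended to the features of the second CRF, which labels sentiment strength.

    ```python
    import sklearn_crfsuite

    crf_sent = sklearn_crfsuite.CRF(algorithm="lbfgs")      # layer 1: sentiment
    crf_strength = sklearn_crfsuite.CRF(algorithm="lbfgs")  # layer 2: strength

    # X_feats: list of sequences of feature dicts; y_*: label sequences.
    crf_sent.fit(X_feats, y_sentiment)
    sent_pred = crf_sent.predict(X_feats)
    X_stage2 = [[dict(f, sentiment=p) for f, p in zip(seq, preds)]
                for seq, preds in zip(X_feats, sent_pred)]
    crf_strength.fit(X_stage2, y_strength)
    ```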