2008 Volume 22 Issue 6 Published: 15 December 2008
  

  • Review
    WANG Dong-bo, CHEN Xiao-he, NIAN Hong-dong
    2008, 22(6): 3-7.
    After introducing the basic principle of Conditional Random Fields (CRFs), this article first defines a tag set of seven tags based on the linguistic characteristics of Chinese coordination with overt conjunctions. It then designs feature templates with 18 complex features plus 4 additional linguistic features for the CRF-based identification of coordination with overt conjunctions. Experiments on nested coordination, non-nested coordination and longest coordination in the Peking University Corpus and the Tsinghua University 973 Treebank achieve best F-scores of 88.21%, 87.85% and 84.42% respectively in the open tests.
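    A minimal sketch of how such a CRF tagger could be wired up is given below, using the sklearn-crfsuite library; the feature template and the conjunction POS tag are illustrative assumptions, not the paper's actual seven-tag set or its 18+4 feature templates.

```python
# Illustrative sketch only: tag inventory and features are assumptions, not the paper's.
import sklearn_crfsuite

def token_features(sent, i):
    """Features for token i of sent, a list of (word, pos) pairs."""
    word, pos = sent[i]
    return {
        'word': word,
        'pos': pos,
        'is_conjunction': pos == 'c',  # overt conjunction (assumed POS tag)
        'prev_word': sent[i - 1][0] if i > 0 else '<BOS>',
        'next_word': sent[i + 1][0] if i < len(sent) - 1 else '<EOS>',
    }

def train_crf(train_sents, train_tags):
    """train_sents: list of (word, pos) sequences; train_tags: coordination tags per token."""
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
    crf.fit([[token_features(s, i) for i in range(len(s))] for s in train_sents], train_tags)
    return crf
```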
  • Review
    SONG Wei, QIN Bing, LANG Jun, LIU Ting
    2008, 22(6): 8-13.
    Syntactic knowledge is important for pronoun resolution. In recent years, research on dependency parsing has become active because dependency grammar is well suited to representing the relations between words. We propose an SVM-based method for Chinese pronoun resolution that employs effective syntactic role features and word sense similarities between the head words of noun phrases. Experimental results on the ACE 2005 data show that these dependency-parsing-based features are effective.
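    A minimal sketch of an SVM mention-pair classifier of this kind is given below; the feature names and the precomputed similarity value are illustrative assumptions, not the paper's actual feature set.

```python
# Hedged sketch: the features below are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def pair_features(pronoun, candidate):
    """pronoun / candidate: dicts describing mentions (dependency role, head word, etc.)."""
    return {
        'pron_dep_role': pronoun['dep_role'],           # syntactic role under the governing verb
        'cand_dep_role': candidate['dep_role'],
        'same_role': pronoun['dep_role'] == candidate['dep_role'],
        'head_word_similarity': candidate['head_sim'],  # precomputed word-sense similarity of head words
        'sentence_distance': pronoun['sent_id'] - candidate['sent_id'],
    }

def train_resolver(pairs, labels):
    """pairs: (pronoun, candidate) dicts; labels: 1 = coreferent, 0 = not."""
    model = make_pipeline(DictVectorizer(), SVC(kernel='rbf', C=1.0))
    model.fit([pair_features(p, c) for p, c in pairs], labels)
    return model
```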
  • Review
    LI Zheng-hua, CHE Wan-xiang, LIU Ting
    2008, 22(6): 14-19.
    The construction of Chinese dependency treebanks has fallen behind that of other languages, such as English, in both scale and quality. Building a large-scale treebank requires substantial human and material resources, and it is difficult to guarantee its quality. In this paper, we explore a new method that combines rule-based and statistical approaches to convert a constituent treebank, the Penn Chinese Treebank, into a dependency treebank following the annotation standard of the HIT Chinese Dependency Treebank (HIT-IR-CDT). We enlarge the training data by adding the converted treebank to HIT-IR-CDT and re-training the dependency parser. Experiments show that adding a small amount of converted treebank improves the performance of the dependency parser, while adding a large amount degrades it. Through detailed analysis, we believe that constituent-to-dependency treebank conversion, as a means of improving dependency parser performance by exploiting different treebanks, still requires in-depth research.
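    The core of such a conversion is head finding: one child of each constituent is chosen as the head, and every non-head child attaches to the head's lexical head word. The sketch below illustrates the idea with toy head rules; the actual conversion combines a much richer rule set with a statistical component and targets the HIT-IR-CDT standard.

```python
# Toy head-percolation sketch; nodes are assumed to have .label, .word and .children.
HEAD_RULES = {            # phrase label -> preferred head-child labels, in priority order
    'NP': ['NN', 'NR', 'NP'],
    'VP': ['VV', 'VP'],
    'IP': ['VP'],
}

def find_head(label, children):
    """Pick the head child by the rules; fall back to the rightmost child."""
    for target in HEAD_RULES.get(label, []):
        for child in children:
            if child.label == target:
                return child
    return children[-1]

def to_dependencies(node, deps):
    """Collect (head_word, dependent_word) arcs into deps; return the lexical head of node."""
    if not node.children:                         # leaf: the word heads itself
        return node.word
    head_child = find_head(node.label, node.children)
    child_heads = [(child, to_dependencies(child, deps)) for child in node.children]
    head_word = next(h for c, h in child_heads if c is head_child)
    for child, child_head in child_heads:
        if child is not head_child:
            deps.append((head_word, child_head))  # non-head children depend on the head word
    return head_word
```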
  • Review
    DING Wei-wei, CHANG Bao-bao
    2008, 22(6): 20-26.
    Semantic role labeling (SRL) is a new research area of natural language processing in recent years. Compared with the work on English, Chinese SRL is still in its infancy. In this paper, we focus on semantic role classification (SRC), one key step of SRL. Besides introducing some new features, we also explore the inter-dependence of semantic roles and employ context features to improve classification performance. A greedy algorithm is designed to select different context windows for different feature templates, since the best performance is achieved with different window sizes for different templates. In the experiments, the precision of our SRC system reaches 95.00%, demonstrating the validity of our approach.
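    The greedy window search can be sketched as below; evaluate() stands in for training and scoring the role classifier under the given window sizes, and is an assumption rather than the paper's implementation.

```python
# Greedy per-template window selection: enlarge a template's context window only while
# held-out performance keeps improving.
def greedy_window_search(templates, max_window, evaluate):
    windows = {t: 0 for t in templates}        # start every template with no context
    best = evaluate(windows)
    for t in templates:
        while windows[t] < max_window:
            trial = dict(windows, **{t: windows[t] + 1})
            score = evaluate(trial)
            if score > best:                   # keep the larger window for this template
                best, windows = score, trial
            else:                              # stop at the first non-improving step
                break
    return windows, best
```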
  • Review
    LI Bin, YU Li-li, SHI Min, QU Wei-guang
    2008, 22(6): 27-32.
    The computation of metaphor in Chinese is a challenging issue. Simile, marked by an overt cue word, is a good starting point for the automatic processing of metaphor. This paper focuses on the automatic identification of Chinese simile phrases with the word "xiang" (like). Altogether 1 586 sentences containing "xiang" are first retrieved from the corpus, then manually tagged and analyzed. A Maximum Entropy model is then applied to detect the simile sense of "xiang", resulting in an F-score of 89% in the open test. Finally, a Conditional Random Fields (CRF) model is used to identify the tenor, vehicle and similarity in the simile, achieving acceptable F-scores of 73%, 86% and 83% respectively.
  • Review
    JIA Ning, ZHANG Quan
    2008, 22(6): 33-37.
    A sentence is composed of semantic chunks, and therefore ellipsis in a sentence means ellipsis of semantic chunks. This paper tries to resolve ellipsis from the perspective of sharing relationships among sentences, on the basis of sentence category analysis. Ellipsis is divided into two categories: ellipsis formed by the sharing of full semantic chunks and ellipsis formed by the sharing of integrant semantic chunks. An algorithm is presented for ellipsis recovery, which proves effective for both kinds of ellipsis in the experiments.
  • Review
    GUO Yu-hang, CHE Wan-xiang, LIU Ting
    2008, 22(6): 38-42.
    The lack of hand-crafted training data is a critical issue for supervised word sense disambiguation (WSD) systems. Substituting target words with their monosemous lexical relatives has been proposed to acquire WSD corpora from the Web automatically. However, in some cases the monosemous lexical relative cannot be suitably replaced by the target word, and noise is introduced. We propose a language model validation method to filter out this noise, which purifies the training data and accordingly improves performance. Our experiments on the Senseval-3 Chinese lexical sample task show that the system trained on Web-acquired data with language model validation achieves better accuracy than the one without it.
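    One way such language model validation could work is sketched below: the target word is substituted back in place of its monosemous relative, and the sentence is kept only if an n-gram language model does not find the result much less fluent. The scoring interface and threshold are illustrative assumptions.

```python
# Hedged sketch of LM-based filtering of Web-acquired WSD examples.
def validate(sentences, relative, target, score_lm, threshold=5.0):
    """sentences: tokenized Web sentences containing the monosemous relative.
    score_lm(tokens) -> average negative log-probability (lower = more fluent)."""
    kept = []
    for tokens in sentences:
        substituted = [target if tok == relative else tok for tok in tokens]
        if score_lm(substituted) - score_lm(tokens) < threshold:
            kept.append(substituted)           # plausible as a sentence about the target word
    return kept
```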
  • Review
    ZENG Xiao-bing, ZHANG Zhi-ping, LIU Rong, YANG Er-hong, ZHANG Pu
    2008, 22(6): 43-49.
    As a new item in the 2007 Chinese Language Situation Report, the investigation of Chinese idioms and idiomatic phrases indicates growing attention to research on them, research of extensive and profound significance to applied linguistics. Based on an investigation of large-scale authentic corpora, this paper compares the character-level differences between Chinese idioms and idiomatic phrases, revealing some language phenomena and offering suggestions on linguistic evidence discovery, linguistic rule summarization and lexicon standardization.
  • Review
    MA Liang, HE Ting-ting, LI Fang, CHEN Jin-guang, SHAO Wei
    2008, 22(6): 50-54.
    This paper proposes a strategy of summary sentence selection via keyword extraction for query-focused multi-document summarization. The method extracts query-related word features through query expansion, calculates a topic-related feature through maximum likelihood estimation, and combines the two to determine the importance of each word. The score of a candidate sentence is the sum of the importance of the words it contains, and a modified MMR technique is used to generate the final summary. Owing to the introduction of word-level features, the experimental results show satisfactory performance on the DUC 2005 corpus.
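    A compact sketch of the sentence scoring and MMR-style selection described above is given below, assuming the per-word importances and a sentence similarity function are already computed; the interpolation weight is illustrative.

```python
# Score = sum of word importances; select sentences by an MMR-style trade-off between
# score and redundancy with what is already selected.
def summarize(sentences, word_importance, sim, max_sentences, lam=0.7):
    """sentences: pre-tokenized sentences as whitespace-joined strings."""
    score = {s: sum(word_importance.get(w, 0.0) for w in s.split()) for s in sentences}
    remaining, selected = list(sentences), []
    while remaining and len(selected) < max_sentences:
        def mmr(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * score[s] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```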
  • Review
    LI Ji-hong, WANG Rui-bo, WANG Kai-hua, LI Guo-chen
    2008, 22(6): 55-62.
    Based on CRCC v1.1 (the Chinese reading comprehension corpus built by Shanxi University), we build a question answering model for Chinese reading comprehension with the maximum entropy (ME) model. Altogether 35 word-level and syntax-level features are proposed according to the relationship between the question sentence and candidate answer sentences, yielding 75.46% HumSent accuracy with the ME model. To overcome the dependence among these features, which may affect the estimation of parameter weights, we apply principal component analysis (PCA) to all 35 features, achieving 80.18% HumSent accuracy on the test set. The experimental results show that PCA is effective for feature selection in the ME model and enhances the accuracy of the automatic reading comprehension (RC) system.
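    The PCA-before-ME step can be sketched as a simple pipeline, with logistic regression standing in for the maximum entropy classifier; the number of retained components is an assumption.

```python
# Decorrelate the 35 features with PCA, then fit a maximum-entropy (logistic regression) ranker.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_answer_ranker(X, y, n_components=20):
    """X: (n_candidates, 35) features for question / candidate-sentence pairs;
    y: 1 if the candidate is the answer sentence, else 0."""
    model = make_pipeline(PCA(n_components=n_components),
                          LogisticRegression(max_iter=1000))
    model.fit(X, y)
    return model
```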
  • Review
    LIU Quan-sheng, YAO Tian-fang, HUANG Gao-hui, LIU Jun, SONG Hong-yan
    2008, 22(6): 63-68.
    Subjective text is unrestricted text that describes people's ideas, emotions and opinions. It differs greatly, in both content and structure, from objective text, which states facts. Opinionated text is a type of subjective text containing opinion elements (holder, claim, topic, sentiment). It is very common on BBS, forums and blogs on the Internet, and is attracting more and more attention as a corpus for opinion mining. This paper introduces the differences between subjective and objective text, focusing on the definition, characteristics and category architecture of opinionated text and its application in opinion mining.
  • Review
    ZHAO Qing-liang, SUI Zhi-fang
    2008, 22(6): 69-74.
    Attribute values are among the most important information for describing an ontology. However, little research has addressed attribute value extraction so far. This paper proposes a method for automatically extracting ontology attribute values from the Web. Firstly, an interactive method is described that exploits the interplay between attribute-value-related sentence selection and attribute value extraction; it expands the target attribute value set from a seed set by exploiting the redundancy of the Web. Secondly, we present a method to construct the seed set automatically. Experiments are conducted to evaluate the method in terms of precision and recall. In addition, the automatically enriched ontology information is applied to webpage content extraction to test the usefulness of our approach.
  • Review
    YANG Jie, JI Duo, CAI Dong-feng, LIN Xiao-qing, BAI Yu
    2008, 22(6): 75-79.
    This paper presents a keyword extraction method that first calculates word weights with ATF×PDF (Average Term Frequency × Proportional Document Frequency) and then determines the keywords by a joint weight that also considers the semantic similarity between words. The method takes into account frequency, part-of-speech and semantic relation information simultaneously. The results show that it can efficiently extract keywords covering the topics of multiple documents, improving precision, recall and F-measure by 3%, 7% and 4.4% respectively over a keyword-based cluster-labeling algorithm.
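    A hedged sketch of the ATF×PDF weighting pass is given below; the exact definitions used here (average in-document frequency, and the proportion of documents containing the word) are assumptions and may differ in detail from the paper's formulation.

```python
from collections import Counter

def atf_pdf_weights(docs):
    """docs: list of token lists for one document set; returns word -> ATF*PDF weight."""
    doc_counts = [Counter(doc) for doc in docs]
    weights = {}
    for w in set().union(*doc_counts):
        containing = [c[w] for c in doc_counts if w in c]
        atf = sum(containing) / len(containing)   # average term frequency where the word occurs
        pdf = len(containing) / len(docs)         # proportional document frequency (assumed form)
        weights[w] = atf * pdf
    return weights
```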
  • Review
    LIU Xing-xing, HE Ting-ting, GONG Hai-jun, CHEN Long
    2008, 22(6): 80-85.
    We propose a system to detect hot web events automatically. The system focuses on the stream of news reports on the Internet, provides a diagram of an event's tendency over time, and can be used to detect hot web events within any period of time. Since the news corpus is characterized by large-scale data and distinct temporal features, it is first divided into hundreds of groups by date. We further divide each group into macro-clusters using agglomerative clustering, select the macro-clusters within a given period of time, and then combine the selected macro-clusters into event lists by single-pass clustering. Finally, we rank the candidate events by calculating their hotness. Experiments on a 2007 news corpus show that our system produces satisfactory results.
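    The single-pass step that merges the selected macro-clusters into event lists can be sketched as follows; the similarity function and threshold are assumptions.

```python
# Single-pass clustering: each macro-cluster joins the most similar existing event if the
# similarity clears a threshold, otherwise it starts a new event.
def single_pass(macro_clusters, similarity, threshold=0.5):
    events = []                                   # each event is a list of macro-clusters
    for mc in macro_clusters:                     # assumed to arrive in time order
        best, best_sim = None, 0.0
        for event in events:
            sim = similarity(mc, event)
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            best.append(mc)
        else:
            events.append([mc])
    return events
```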
  • Review
    LIU Wei, LIAO Xiang-wen, XU Hong-bo, WANG Li-hong
    2008, 22(6): 86-91.
    In this paper, we analyze a number of effective statistical features for splog filtering by investigating the differences between splogs and normal blogs. We then present a splog filtering approach based on the statistical characteristics of blog content. Experimental results on the Blog06 data set show that the approach reaches an accuracy of 97%, an improvement of 7% over a term-frequency-based method. As the test set grows, its accuracy stays around 95%, indicating good generalization ability.
  • Review
    CHEN Lei, LIU Yi-qun, RU Li-yun, MA Shao-ping
    2008, 22(6): 92-97.
    With the explosive growth of information available on the Web, sponsored search has become one of the most popular forms of Internet advertising because of its effectiveness and feasibility. However, it remains an open question whether sponsored search results become obstacles in users' information acquisition process. By analyzing large-scale Web user access logs, we obtain sponsored search statistics for several Chinese commercial search engines. We also examine users' interaction behavior with sponsored search results and find that search engines are hardly affected by sponsored search in meeting users' information needs.
  • Review
    WANG Bing-qing, ZHANG Qi, WU Li-de, HUANG Xuan-jing
    2008, 22(6): 98-102.
    A novel query expansion approach is presented in this paper, which applies machine learning to query expansion. It improves retrieval performance by training a machine learning module to predict and select query expansion words. With pseudo-relevance feedback, a set of candidate expansion words is generated for a given topic. A Support Vector Machine (SVM) then judges these candidate words and forms an optimized query from the top-ranked candidates. Training such an SVM for query word judgment is difficult because no training data set is available; this issue is resolved by generating the training data from the retrieval results and the available evaluation tools. In the opinion retrieval task of the TREC Blog Track, this query expansion method improves the Mean Average Precision (MAP) by 33.1% over the baseline result.
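    The candidate-selection step can be sketched as below, assuming a trained SVM and precomputed feature vectors for the candidate words; the ranking by decision value and the cutoff are illustrative.

```python
# Rank pseudo-relevance-feedback candidates with a trained SVM and keep the top terms.
def select_expansion_terms(candidates, features, svm, top_k=10):
    """candidates: candidate expansion words; features: word -> feature vector;
    svm: a trained binary sklearn SVC scoring whether a word is a useful expansion."""
    scored = [(w, svm.decision_function([features[w]])[0]) for w in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in scored[:top_k]]
```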
  • Review
    GUO Ji, LV Ya-juan, LIU Qun
    2008, 22(6): 103-109.
    The translations of named entities, out-of-vocabulary words and terms play an important role in many applications such as machine translation, cross-language information retrieval and question answering. However, these translations are hard to obtain from traditional bilingual dictionaries. This paper proposes a method to automatically extract high-quality translation pairs from Chinese web corpora. It analyzes the features of bilingual translation pairs in web pages, and then a statistical discriminative model combining multiple features is used to extract translation pairs. Experimental results show that the quality of the extracted bilingual translations is greatly improved, with a Top-1 accuracy of 82.1% and a Top-3 accuracy of 94.5%. The paper also proposes a verification method that further improves the accuracy of the initial extractions with the help of search engines; Top-1 accuracy rises to 84.3% after verification.
  • Review
    DAI Cui, ZHOU Qiao-li, CAI Dong-feng, YANG Jie
    2008, 22(6): 110-115.
    By analyzing the characteristics of Chinese maximum noun phrases, this research proposes an automatic identification method based on statistics and rules. Firstly, the feature set is empirically selected from combinations of words and parts of speech, and a conditional random fields (CRF) model is built for automatic identification. Then a rule base is constructed from the boundary information and internal structure knowledge of maximum noun phrases for a post-processing module. Experimental results show that the method is effective for identifying Chinese maximum noun phrases, with a 90.2% F-score in the open test.
  • Review
    NING Wei, CAI Dong-feng, ZHANG Gui-ping, JI Duo, MIAO Xue-lei
    2008, 22(6): 116-122.
    Article choice is a difficult problem in Chinese-to-English translation, since it involves complex grammatical, semantic and world knowledge. Traditional rule-based or machine learning approaches deal only with the articles used within noun phrases. This paper treats the article as a label and thus casts the problem as a sequence labeling task, proposing a strategy based on Conditional Random Fields. In feature extraction, the proposed method makes good use of word and part-of-speech information, especially the mutual information feature. Experimental results on a test corpus of patent abstracts containing 91 106 articles show that the algorithm yields an F-score of 80%.
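    As an illustration of the kind of mutual information feature such a labeler could draw on, the sketch below estimates pointwise mutual information between a context word and an article choice from labeled data; the counting scheme is an assumption.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """pairs: (context_word, article) tuples from labeled data, where article is
    'a', 'the' or 'NULL' (no article); returns PMI for every observed pair."""
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    arts = Counter(a for _, a in pairs)
    n = len(pairs)
    return {
        (w, a): math.log((joint[(w, a)] / n) / ((words[w] / n) * (arts[a] / n)))
        for (w, a) in joint
    }
```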