2012 Volume 26 Issue 2 Published: 16 April 2012
  

  • Review
    MENG Fandong1, XU Jinan2, JIANG Wenbin1, LIU Qun1
    2012, 26(2): 3-8.
    Large-scale manually annotated corpora are usually used in research on statistical Chinese lexical analysis, and the scale and quality of the corpora directly affect the performance of lexical analyzers. High-quality corpora with good coverage are valuable but scarce, and corpora from different domains are difficult to combine directly because they follow different word segmentation and part-of-speech (POS) tagging standards. These problems hinder the use of existing resources and limit further improvement in Chinese lexical analysis. To address this issue, this paper presents a simple but effective strategy that improves the performance and domain adaptability of Chinese lexical analysis by automatically merging corpora from different domains. Our experiments verify the validity, practicability, and scalability of the proposed method when applied to multiple corpora.
    Key words: lexical analysis; merging corpora; domain adaptation
  • Review
    ZHANG Meishan, DENG Zhilong, CHE Wanxiang, LIU Ting
    2012, 26(2): 8-13.
    Statistical methods for Chinese word segmentation generally have poor domain adaptability because they are tied to a specific training corpus. In practice, domain dictionaries are much easier to obtain than manually annotated segmentation corpora, and they contain plenty of domain information. We propose an approach that integrates dictionary information into a statistical model (a CRF model in this paper) to achieve domain adaptation for Chinese word segmentation. Experimental results show that our approach adapts well: when the test corpus comes from the same domain as the training corpus, the F-measure increases by 2%; when it comes from a different domain, the F-measure increases by 6%.
    Key words: Chinese word segmentation; CRF; domain adaptation
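    As a rough illustration of the dictionary-feature idea in the abstract above, here is a minimal Python sketch (not the paper's implementation): besides context features, each character receives features recording the longest lexicon word that starts or ends at it. The feature names, the helper char_features, and the toy lexicon are assumptions for illustration.

```python
def char_features(sent, i, lexicon, max_len=4):
    """Feature dict for character sent[i]: context n-grams plus
    longest-dictionary-match features."""
    feats = {
        "c0": sent[i],
        "c-1": sent[i - 1] if i > 0 else "<S>",
        "c+1": sent[i + 1] if i + 1 < len(sent) else "</S>",
    }
    begin = end = 0
    for n in range(max_len, 1, -1):  # check matches of length max_len..2
        if sent[i:i + n] in lexicon:
            begin = max(begin, n)    # a lexicon word of length n starts here
        if i - n + 1 >= 0 and sent[i - n + 1:i + 1] in lexicon:
            end = max(end, n)        # a lexicon word of length n ends here
    feats["dict_begin"] = begin
    feats["dict_end"] = end
    return feats

# Toy usage with a two-word domain lexicon:
lexicon = {"冠心病", "心绞痛"}
sent = "冠心病引发心绞痛"
X = [char_features(sent, i, lexicon) for i in range(len(sent))]
```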
  • Review
    XU Runhua1, CHEN Xiaohe2
    2012, 26(2): 13-18.
    Commentaries on pre-Qin documents contain a wealth of lexical-semantic knowledge that provides substantial evidence for word segmentation. Taking the “Zuo Zhuan” as the research object, this paper proposes a new segmentation method based on commentaries aligned to the “Zuo Zhuan”. The segmentation F-score reaches 89.0%, much higher than the baseline in our experiments. The method requires no training, and the idea of commentary-assisted segmentation can be applied to the segmentation of other pre-Qin documents.
    Key words: pre-Qin documents; commentary documents; automatic alignment; automatic segmentation
  • Review
    CHE Wanxiang, ZHANG Meishan, LIU Ting
    2012, 26(2): 18-23.
    A large annotated treebank is necessary to build a statistical dependency parser, but acquiring such a treebank is time-consuming, tedious, and expensive. This paper presents a method to reduce this demand via active learning, which selects the most uncertain samples for annotation instead of the whole training corpus. Experiments are carried out on the HIT-CIR-CDT; the results show that active learning raises parsing accuracy by about 0.8 percentage points for the same amount of training data. In other words, for roughly the same parsing accuracy, we only need to annotate 70% of the samples required by the usual random selection method.
    Key words: active learning; dependency parsing; uncertainty-based sampling; query-by-committee
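    The selection loop the abstract describes is the standard pool-based active learning cycle. Below is a minimal sketch under stated assumptions: train_parser, parse_confidence, and annotate are hypothetical stand-ins for the parser's training routine, its per-sentence confidence score, and the human annotation step.

```python
def active_learning(labeled, unlabeled, batch=100, rounds=10):
    """Pool-based active learning with uncertainty sampling."""
    for _ in range(rounds):
        model = train_parser(labeled)          # retrain on current data
        scored = sorted(unlabeled, key=lambda s: parse_confidence(model, s))
        chosen = scored[:batch]                # most uncertain samples first
        labeled = labeled + annotate(chosen)   # obtain gold trees from humans
        unlabeled = scored[batch:]             # remove the chosen samples
    return train_parser(labeled)
```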
  • Review
    LI Yanjiao, YANG Erhong
    2012, 26(2): 23-28.
    As a valuable resource in Chinese information processing, a Chinese treebank includes rich information on sentence structure and constituent combination. Studying the combination of POS strings is basic work for the effective use of treebank information. This paper investigates ambiguous combinations in a Chinese treebank, revealing that resolving ambiguous combinations and structures in Chinese largely requires semantic features and cannot be achieved simply with the grammatical features (such as POS information) of words.
    Key words: ambiguous combinations; semantic relations; treebank
  • Review
    ZHOU Yun1, WANG Ting1, YI Mianzhu2, ZHANG Lupeng3, WANG Zhiyuan1,4
    2012, 26(2): 28-35.
    All-words word sense disambiguation (WSD) can be regarded as a sequence labeling problem, and this paper proposes two all-words WSD methods based on sequence labeling: one based on the Hidden Markov Model (HMM) and one on the Maximum Entropy Markov Model (MEMM). First, we model all-words WSD with an HMM. Since an HMM can only exploit lexical observations, we generalize it to an MEMM by incorporating a large number of non-independent features. All-words WSD is a typical very large state space problem, in which data sparsity and high time complexity seriously hinder the application of HMM and MEMM models; we address these problems with a beam-search Viterbi algorithm and a smoothing strategy. Finally, we test our methods on the all-words WSD datasets of Senseval-2 and Senseval-3, achieving an F1 of 0.654 with the MEMM method, which outperforms other sequence-labeling methods.
    Key words: all-words word sense disambiguation; hidden Markov model; maximum entropy Markov model; very large state problem
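    The beam-search Viterbi mentioned above can be sketched compactly. In the fragment below (a sketch, not the paper's code), senses_of(word) returns the candidate senses of a word so the huge global state space is never enumerated, and probabilities are floored with a small constant as a crude stand-in for the paper's smoothing strategy.

```python
import math

def beam_viterbi(words, senses_of, trans, emit, start, beam=5):
    """Viterbi decoding with beam pruning: at each position only the
    `beam` highest-scoring states survive."""
    def lp(table, key):
        return math.log(table.get(key, 1e-12))  # crude smoothing floor

    paths = {s: (lp(start, s) + lp(emit, (s, words[0])), [s])
             for s in senses_of(words[0])}
    for w in words[1:]:
        nxt = {}
        for s in senses_of(w):
            score, path = max(
                (prev_score + lp(trans, (p, s)), prev_path)
                for p, (prev_score, prev_path) in paths.items())
            nxt[s] = (score + lp(emit, (s, w)), path + [s])
        # beam pruning: keep only the top-scoring states
        paths = dict(sorted(nxt.items(), key=lambda kv: -kv[1][0])[:beam])
    return max(paths.values(), key=lambda v: v[0])[1]
```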
  • Review
    WANG Qinghai1, MA Haihui1,2, CHI Yuhuan3, LI Ying1, DONG Lingchong4
    2012, 26(2): 35-40.
    The Chinese lexical knowledge base is an important component of the HNC knowledge base system, but its overly simple structure complicates the interpretation of HNC symbols. In this paper, a new HNC-based Chinese lexical knowledge base is proposed to improve its efficiency in natural language processing. Specifically, the entity-attribute design method and the principles for populating the lexical knowledge base are illustrated, and two examples are provided to demonstrate its application.
    Key words: Chinese lexical knowledge base; HNC theory; relational database
  • Review
    WANG Xin, SUI Zhifang
    2012, 26(2): 40-46.
    In research on dependency-based semantic role labeling, most systems apply machine learning to both argument identification and argument classification. This paper analyzes the characteristics of the dependency tree and finds that arguments are distributed in specific areas of the tree: the maximal tree distance from a candidate argument to its verb is no more than three. We therefore propose a novel rule-based method for argument identification according to dependency tree distance, which also yields the best candidate arguments for each verb. On gold syntactic dependency trees, this method recognizes 98.5% of the arguments in the CoNLL 2009 Chinese dataset. Combined with machine-learning-based argument classification, the F-measure of the system reaches 89.46%, a significant improvement over previous work (81.68%).
    Key words: argument identification; dependency tree distance based method; semantic role labeling
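    The distance rule is simple enough to state in code. A minimal sketch follows, assuming the dependency tree is given as an undirected adjacency map; the toy sentence and node names are illustrative only.

```python
from collections import deque

def arguments_within(neighbors, predicate, max_dist=3):
    """BFS over the dependency tree (treated as an undirected graph);
    returns nodes whose tree distance from the predicate is 1..max_dist,
    the area in which almost all arguments are found."""
    dist = {predicate: 0}
    queue = deque([predicate])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue
        for nb in neighbors.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [n for n, d in dist.items() if d > 0]

# Toy tree: 买 -> (我, 书), 书 -> 新
tree = {"买": ["我", "书"], "我": ["买"], "书": ["买", "新"], "新": ["书"]}
print(arguments_within(tree, "买"))  # ['我', '书', '新']
```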
  • Review
    ZHANG Peng1, LI Guochen1, LI Ru1,2, LIU Haijing1, SHI Xiangrong3, Collin Baker4
    2012, 26(2): 46-51.
    This paper proposes a textual entailment recognition method that uses the frames and frame relations in FrameNet together with relevant knowledge from WordNet. The method finds, via depth-first search in the FrameNet graph, a path between the frames evoked by the lexical units in the text T and the hypothesis H, so as to identify hyponymy relations between frames; it then recognizes entailment by comparing the spans that fill the mapped frame element (FE) slots. Our experiments are based on part of the RTE2007 evaluation corpus. The results reach 76.6% precision, which is on par with the best results of the RTE2007 Recognizing Textual Entailment evaluation.
    Key words: recognizing textual entailment; FrameNet; frame-to-frame relations
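    The depth-first search over frame relations can be sketched as follows; the relations map and the frame names below are toy stand-ins for the FrameNet graph, not the paper's data structures.

```python
def frame_path(relations, src, dst, limit=6, path=None):
    """Depth-first search for a relation path between two frames.
    `relations` maps a frame to the frames it is linked to
    (e.g., by Inheritance)."""
    path = (path or []) + [src]
    if src == dst:
        return path
    if len(path) > limit:      # bound the search depth
        return None
    for nxt in relations.get(src, []):
        if nxt not in path:    # avoid cycles
            found = frame_path(relations, nxt, dst, limit, path)
            if found:
                return found
    return None

toy = {"Commerce_buy": ["Getting"], "Getting": ["Event"]}
print(frame_path(toy, "Commerce_buy", "Event"))
# ['Commerce_buy', 'Getting', 'Event']
```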
  • Review
    WANG Xun, LI Sujian, SONG Tao, JIANG Boping
    2012, 26(2): 51-56.
    Texts usually contain various aspects of information, and many natural language processing tasks would benefit from recognizing these aspects. In summarization, for example, the traditional sentence-extraction method is mainly based on word-frequency features, so important sentences may be ignored if their words appear infrequently; aspect recognition can remedy this defect. In this paper, we use the FrameNet corpus as an ontology to annotate sentences based on lexical and syntactic features, and the frame assigned to a sentence expresses its aspect information. The method works well on a news corpus, where the precision of frame-based aspect recognition reaches 61%.
    Key words: FrameNet; aspect recognition; frame identification
  • Review
    WANG Rongyang, JU Jiupeng, LI Shoushan, ZHOU Guodong
    2012, 26(2): 56-62.
    The opinion target is an important component of sentiment information in sentiment analysis. This paper explores opinion target extraction based on Conditional Random Fields (CRFs). Beyond the features frequently used in sentiment extraction, we group all features into four categories: lexical, dependency, relative-position, and semantic. More importantly, we propose using the semantic role as a specific feature. Detailed comparative studies evaluate the performance of various features and their combinations, and the experimental results show that the semantic role is a good indicator of the opinion target.
    Key words: sentiment analysis; opinion target extraction; feature combination; semantic role labeling
  • Review
    JIAO Yan1,2, WANG Houfeng1, ZHANG Longkai1
    2012, 26(2): 62-69.
    Abbreviations are commonly used in natural language and constitute a substantial proportion of unknown words, which challenges natural language processing. This article proposes a strategy for predicting the abbreviation of a Chinese full form. For a given full form, it first generates a number of candidates using a Conditional Random Field; each candidate is then re-scored according to results from a web search engine under different search conditions and statistical measures, and the candidate with the highest score is selected as the abbreviation. Experiments show the precision improves by about 5% over the single Conditional Random Field method.
    Key words: abbreviation; CRF model; web data
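    The re-scoring step can be sketched as a simple interpolation between the CRF score and web co-occurrence counts; the abstract does not give the exact formula, so this is only one plausible instantiation. In the fragment below, web_hits is a hypothetical stand-in for querying a search engine and reading off the hit count, and the weight alpha is an assumed parameter.

```python
def best_abbreviation(full_form, candidates, web_hits, alpha=0.7):
    """candidates: list of (abbreviation, crf_score) pairs.
    Re-rank by interpolating the CRF score with normalized
    co-occurrence counts of the full form and the candidate."""
    hits = {c: web_hits('"%s" "%s"' % (full_form, c)) for c, _ in candidates}
    total = float(sum(hits.values())) or 1.0
    return max(candidates,
               key=lambda cs: alpha * cs[1]
               + (1 - alpha) * hits[cs[0]] / total)[0]
```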
  • Review
    LIU Youqiang1, LI Bin1,2, XI Ning1, CHEN Jiajun1
    2012, 26(2): 69-75.
    Chinese abbreviations are widely used in modern Chinese texts, and research on them is important for Chinese information processing. In this paper, we propose an approach to extracting Chinese abbreviations from a Chinese-English parallel corpus. We first generate word alignments for the corpus and extract Chinese-English phrase pairs consistent with the alignments; we then separate high-quality phrase pairs from bad ones with an SVM classifier. Finally, we extract Chinese abbreviation and full-form phrase pairs from the high-quality group using their corresponding English translations and some rules. The experiments show that our approach extracts abbreviations with high accuracy and is an effective way to obtain Chinese abbreviation and full-form phrase pairs.
    Key words: abbreviation; parallel corpus; phrase extraction; classification
  • Review
    ZHANG Weiru1,2, SUN Le1, HAN Xianpei1
    2012, 26(2): 75-82.
    This paper proposes a method to extract high-accuracy Chinese entity relations from open text based on Wikipedia and pattern clustering. We obtain relation instances via a mapping from HowNet to Wikipedia and the structural characteristics of Wikipedia; on this basis, the method handles entity recognition and generates significant sentence instances. Furthermore, a significance assumption and a keyword assumption are proposed to support the classification and hierarchical clustering algorithms that estimate pattern reliability. The results show that the method performs well given high-quality seeds and patterns.
    Key words: relation extraction; Wikipedia; pattern clustering
  • Review
    LI Wenjie1,2, SUI Zhifang1,2
    2012, 26(2): 82-88.
    Most research on extracting concept instances and concept attributes focuses on pattern-based approaches, which usually suffer from low recall. In this paper, we present a method of extracting concept instances and attributes based on the coordinate structure. Since some of the candidate instances and attributes extracted by the coordination patterns can be put into similar-concept-phrase sets in advance, we can use these sets to expand the extraction results during co-occurrence pattern-based extraction. Compared with a baseline that does not use the coordination patterns, experimental results show that the coverage of this method is significantly improved without reducing precision.
    Key words: coordinate structure; search engine; instance extraction; attribute extraction; contextual pattern
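    A minimal example of the kind of coordination pattern the method relies on: items separated by “、” or “和” and closed by “等”. Real pattern sets are richer; this single regex is only illustrative.

```python
import re

# Coordination cue: "A、B和C等" (items joined by 、/和, closed by 等).
COORD = re.compile(r"([^，。、]+(?:、[^，。、]+)+)等")

text = "苹果、香蕉和橙子等水果都富含维生素。"
m = COORD.search(text)
if m:
    items = re.split(r"[、和]", m.group(1))
    print(items)  # ['苹果', '香蕉', '橙子']
```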
  • Review
    HE Zhengyan, WANG Houfeng
    2012, 26(2): 88-92.
    Baidu Baike contains a large amount of knowledge about named entities, link relationships, and category information. To recognize product (or brand) names in open text, we propose a graph-based method that discovers product names from a few seeds, incorporating the “related entry” and “open category” structures of Baidu Baike to reinforce the similarity measures. Applying this method to 1.3 million entries yields satisfactory results for product-name mining.
    Key words: brand name mining; semi-supervised learning; graph method
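    The graph-based discovery step resembles label propagation from seeds. Here is a minimal sketch, assuming a row-normalized adjacency matrix W built from “related entry” / “open category” links; the matrix below is toy data, not Baidu Baike.

```python
import numpy as np

def propagate(W, seeds, iters=50):
    """Simple label propagation: spread seed scores along graph edges,
    clamping the known product-name seeds each round."""
    f = np.zeros(W.shape[0])
    f[list(seeds)] = 1.0
    for _ in range(iters):
        f = W @ f                 # spread scores to neighbors
        f[list(seeds)] = 1.0      # clamp the seeds
    return f                      # higher score = more product-like

# Toy graph of 4 entries; entry 0 is a seed product name.
W = np.array([[0, .5, .5, 0],
              [.5, 0, .5, 0],
              [1/3, 1/3, 0, 1/3],
              [0, 0, 1, 0]], dtype=float)
print(propagate(W, {0}).round(2))
```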
  • Review
    WANG Hongling, ZHOU Guodong, ZHU Qiaoming
    2012, 26(2): 92-97.
    Multi-document summarization helps people access information automatically and quickly. Compared with single-document summarization, it places more emphasis on the correlation and redundancy between documents, so controlling information redundancy is a key problem. This paper proposes a redundancy-control model based on the features of the summary, in which various similarities among text units over topic probability distributions determine whether a sentence is chosen. Experimental results show that this method reduces redundancy effectively and produces better overall performance than existing systems.
    Key words: redundancy control; multi-document summarization; Chinese automatic summarization
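    One simple way to realize the redundancy control the abstract describes is greedy selection with a similarity cutoff over topic distributions. The sketch below assumes precomputed sentence scores and a threshold max_sim; both are illustrative parameters, not the paper's model.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def select(topic_dists, scores, k, max_sim=0.8):
    """Pick high-scoring sentences greedily, skipping any sentence too
    similar (over topic probability distributions) to one already chosen."""
    chosen = []
    for i in np.argsort(scores)[::-1]:          # best sentence first
        if all(cosine(topic_dists[i], topic_dists[j]) <= max_sim
               for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```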
  • Review
    ZHANG Longkai, WANG Houfeng
    2012, 26(2): 97-102.
    Extractive summarization extracts important sentences from the original text and re-organizes them into a summary. In this paper we propose a method to automatically identify significant sentences, the basic idea being to label each sentence with one of two tags using the sequence labeling model of Conditional Random Fields. Considering that too many sentences tend to be rejected because a summary contains far fewer sentences than the original text, we introduce a correction factor to correct this label bias. Experimental results show that the proposed method achieves good performance.
    Key words: text summarization; sentence extraction; CRF
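    One plausible reading of the correction factor, sketched below: since summary sentences are rare, the positive-label probability from the CRF is boosted before thresholding. The factor value and the marginals interface are assumptions, not the paper's formulation.

```python
def corrected_labels(marginals, factor=3.0):
    """marginals: per-sentence (p_not_summary, p_summary) pairs from the
    CRF. Boost the rare positive label by `factor` before deciding."""
    return [1 if p_pos * factor > p_neg else 0 for p_neg, p_pos in marginals]

print(corrected_labels([(0.7, 0.3), (0.95, 0.05)]))  # [1, 0]
```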
  • Review
    PENG Xingyuan1, KE Dengfeng1, ZHAO Zhi1, CHEN Zhenbiao1, XU Bo1,2
    2012, 26(2): 102-109.
    This paper studies new methods for automated Chinese essay scoring based on word scores. Under the hypothesis that the essay score correlates highly with the scores of the words in the essay, we define an equation for this relation and implement both a conventional method and three enhanced methods to estimate its parameters. Compared with the e-rater approach, our new methods achieve a correlation close to 0.7, demonstrating better performance; in addition, the performance of our methods is close to that of manual scoring.
    Key words: word scores; automated essay scoring
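    One simple instantiation of the hypothesized word-score/essay-score relation is an affine function of the mean word score, fitted by least squares. The sketch below is illustrative only; the paper's actual equation and estimation methods may differ.

```python
import numpy as np

def fit(word_score_lists, essay_scores):
    """Least-squares estimate of (w, b) in score = w * mean(word scores) + b."""
    x = np.array([np.mean(ws) for ws in word_score_lists])
    A = np.vstack([x, np.ones_like(x)]).T
    w, b = np.linalg.lstsq(A, np.asarray(essay_scores, float), rcond=None)[0]
    return w, b

def predict(word_scores, w, b):
    return w * np.mean(word_scores) + b

# Toy data: three scored training essays, then a new essay.
w, b = fit([[3, 4, 5], [1, 2, 2], [4, 5, 5]], [85, 60, 92])
print(round(predict([2, 3, 4], w, b), 1))
```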
  • Review
    HE Liang, LI Fang
    2012, 26(2): 109-116.
    Automatically extracting topics from the scientific literature and finding research trends are of substantial significance to researchers. In this paper, we use the LDA model to generate topics from the scientific literature, calculate the strength and impact of each topic, and then identify trends among hot versus cold topics and high- versus low-impact topics. The method of calculating topic strength and impact is applicable to any document collection. Experiments on the ACL Anthology reveal research trends in computational linguistics, and a contrast experiment confirms the validity of the proposed calculation method.
    Key words: topic model; trend analysis
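    A common definition of topic strength, used here as an illustrative sketch, is the average LDA topic proportion over the documents of a given year; the paper's exact formulas for strength and impact may differ.

```python
import numpy as np

def topic_strength(doc_topic, years, topic, year):
    """Average proportion of `topic` over the documents published in `year`.
    doc_topic: (num_docs x num_topics) LDA document-topic matrix."""
    idx = [i for i, y in enumerate(years) if y == year]
    return float(np.mean(doc_topic[idx, topic])) if idx else 0.0

# Toy data: 3 documents, 2 topics.
doc_topic = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
years = [2010, 2010, 2011]
print(topic_strength(doc_topic, years, topic=0, year=2010))  # 0.65
```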
  • Review
    TANG Guoyu1, XIA Yunqing1, ZHANG Min2, ZHENG Thomas Fang1
    2012, 26(2): 116-121.
    Cross-lingual document clustering is the task of automatically organizing a large collection of cross-lingual documents into groups according to their content or topics. This work extends the traditional monolingual Generalized Vector Space Model (GVSM) to a Cross-Lingual GVSM (CLGVSM) by adopting cross-lingual term-similarity measures to represent documents in different languages, and compares different term-similarity measures on the cross-lingual document clustering task. We also propose a new feature selection method for CLGVSM. Experimental results show that CLGVSM with the Second-Order Co-occurrence Pointwise Mutual Information (SOCPMI) term-similarity measure outperforms the latent semantic analysis (LSA) method.
    Key words: cross-lingual document clustering; CLGVSM; text similarity; document clustering
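    The CLGVSM similarity can be sketched as a generalized cosine with a term-term similarity matrix S; setting S to the identity recovers the plain vector space model. The toy bilingual vocabulary below is an assumption for illustration, not the paper's data.

```python
import numpy as np

def clgvsm_sim(d1, d2, S):
    """Generalized cosine: d1 and d2 are term-weight vectors over a
    bilingual vocabulary; S holds cross-lingual term similarities
    (e.g., from SOCPMI)."""
    num = d1 @ S @ d2
    den = np.sqrt(d1 @ S @ d1) * np.sqrt(d2 @ S @ d2) + 1e-12
    return num / den

# Toy vocabulary ["dog", "犬"] with high cross-lingual similarity.
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
d_en = np.array([1.0, 0.0])   # English document mentions "dog"
d_zh = np.array([0.0, 1.0])   # Chinese document mentions "犬"
print(round(clgvsm_sim(d_en, d_zh, S), 2))  # 0.9
```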
  • Review
    YU Xiaojie1, WU Ji1, KONG Fanting2, LI Shusen1
    2012, 26(2): 121-128.
    Automatic story segmentation is very important for the retrieval of broadcast news data. Recent research on automatic story segmentation has focused on video data, while semantic information extracted from speech recognition results and acoustic event information from the audio also provide important cues for story segmentation. This paper proposes a rule-based multi-information fusion method that uses audio information to adjust the results of text-based story segmentation. Experiments show that after fusing this information, automatic story segmentation of broadcast news reaches an F-measure of 64.8%, with Pk and WindowDiff of 18.3% and 24.5%, respectively.
    Key words: automatic story segmentation; improved SeLeCT algorithm; multi-information fusion