2011 Volume 25 Issue 3 Published: 15 June 2011
  

    Review
  • Review
    ZHANG Muyu, LI Yaobing, QIN Bing, LIU Ting
    2011, 25(3): 3-9.
    Coreference resolution is one of the core problems in Natural Language Processing. Building on the flat features used by traditional machine learning methods, we propose a new method that exploits head information. First, we introduce an instance-matching algorithm for coreference resolution based on simple flat features. On top of this algorithm, we introduce the head strings of the antecedent and the anaphor as a new feature, and propose a competition mode to integrate the head-string feature into instance matching. Compared with traditional machine learning methods that consider only flat features, our method fully exploits the feature information of each training instance, and fusing in the head-string feature produces more accurate results.
    Key words: head match; instance match; coreference resolution
  • Review
    YOU Hongliang1, ZHANG Wei2, SHEN Junyi1, LIU Ting3
    2011, 25(3): 9-17.
    Automatic Term Recognition (ATR), an important task in Information Extraction and Text Mining, aims at acquiring formalized terms that have not yet been recorded in the glossary. In recent years, statistical methods have made substantial progress in this field, and methods such as C-Value, NC-Value, and Term-Extractor have shown clear advantages on this task. However, little work has been done on Weighted Voting algorithms that merge these statistical metrics. In this paper, we first collect part-of-speech rules from known terms, then match them against POS-tagged strings to acquire candidate terms, and finally rank the candidates with a Weighted Voting algorithm. Experiments on Electrical Engineering literature from the IEEE 2006-2007 metadata show that the weighted voting algorithm performs better than any single metric alone.
    Key words: automatic term recognition; voting algorithm; information extraction; text mining
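    The idea of merging several termhood metrics by weighted voting can be illustrated with a minimal sketch. The abstract does not specify the voting scheme, so the reciprocal-rank votes, the function name `weighted_vote`, the weights and the example scores below are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List


def weighted_vote(
    metric_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[str]:
    """Rank candidate terms by a weighted sum of per-metric rank votes."""
    combined: Dict[str, float] = {}
    for metric, scores in metric_scores.items():
        # Sort candidates by this metric, best score first.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for position, term in enumerate(ranked, start=1):
            # Assumed scheme: reciprocal-rank vote scaled by the metric weight.
            combined[term] = combined.get(term, 0.0) + weights[metric] / position
    # Highest combined vote first; ties broken alphabetically.
    return sorted(combined, key=lambda t: (-combined[t], t))


# Hypothetical scores for three candidate terms under two metrics.
scores = {
    "c_value": {"power grid": 9.1, "voltage regulator": 7.4, "paper shows": 2.0},
    "nc_value": {"power grid": 6.0, "voltage regulator": 6.5, "paper shows": 1.1},
}
ranking = weighted_vote(scores, weights={"c_value": 0.6, "nc_value": 0.4})
```

    Voting on ranks rather than raw scores sidesteps the fact that different metrics live on different scales; other schemes (for example, normalized score sums) would fit the same description.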
  • Review
    SHI Yingchao1,2, WANG Huizhen1,2, XIAO Tong1,2, HU Minghan1,2
    2011, 25(3): 17-23.
    The personal name disambiguation track of the CLP2010 evaluation (CIPS-SIGHAN Joint Conference on Chinese Language Processing) is essentially a clustering task: given a document set containing a query string, group the documents by the entity each document refers to. The input files are documents retrieved by character-based matching from a collection of Xinhua news documents. The primary problem for this task is therefore to determine whether the query is a personal name (a full name or just part of a name) or not. For this subtask, this paper presents a personal name recognition system that combines multiple named entity recognition systems with post-processing based on heuristic rules. Experiments on the training set of the CLP2010 evaluation demonstrate a precision of 98.89%.
    Key words: named entity recognition; personal name disambiguation; system combination; heuristic rules
  • Review
    ZHANG Li1,2, QIAN Lingfei1, XU Xin3
    2011, 25(3): 23-30.
    Opinion mining has become a hot topic in recent years. We focus on one of the opinion mining sub-tasks in COAE2009 and propose an approach that learns from nuclear sentences. Ten types of syntactic relations are defined as features, and the Conditional Random Fields (CRF) model is applied to analyze and compare the original sentence against the nuclear sentence in terms of words, part-of-speech (POS) tags and syntactic relations. CRF re-learning is then carried out to further enhance extraction performance. The experimental results show a measurable improvement, supporting the feasibility and value of this method.
    Key words: opinion mining; comment target extraction; nuclear sentences; syntactic relation; conditional random fields
  • Review
    PAN Xu, GU Hongbin
    2011, 25(3): 30-38.
    In this paper, we introduce a classification method to identify the definitions of terms in an aviation domain corpus. The method proposes a novel approach to over-sampling minority instances using distance distribution information, which is combined with random under-sampling of majority instances to construct a balanced training set. It adopts the balanced random forest (BRF) to build the final aggregated classifier from C4.5 decision trees. The method achieves its best scores of 65% in F1-measure and 78% in F2-measure, outperforming the BRF baseline.
    Key words: natural language processing; term definition; definition extraction; text categorization; re-sampling
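    For reference, the F1 and F2 scores quoted in this abstract are both instances of the standard F-beta measure. The symbols P (precision), R (recall) and beta are notation introduced here rather than taken from the abstract; beta = 1 weights the two equally, while beta = 2 weights recall more heavily than precision.

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{1} = \frac{2 P R}{P + R},
\qquad
F_{2} = \frac{5 P R}{4 P + R}
```

    An F2 score above the F1 score, as reported here, is what one would expect when recall exceeds precision.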
  • Review
    LI Jihong1, GAO Yahui2, WANG Ruibo1, LI Guochen3
    2011, 25(3): 38-45.
    This paper addresses the issue of disambiguation in the automatic identification of Chinese frames, i.e., assigning an appropriate frame from the current CFN (Chinese FrameNet) to a given target word within a sentence. The frame disambiguation task is treated as a context-based frame classification problem and solved with a maximum entropy model. The features selected in this paper include BOW (bag-of-words), the current word, part of speech, base chunk information, and the label in the dependency syntax tree, together with sliding windows of optional sizes. The training and test sets contain 2077 annotated sentences with 88 lexical units from the current Chinese FrameNet. The best result achieves an accuracy of 69.28% in 3-fold cross-validation experiments.
    Key words: Chinese FrameNet; frame semantics; frame disambiguation; maximum entropy
  • Review
    NI Ji, KONG Fang, ZHU Qiaoming, LI Peifeng
    2011, 25(3): 45-51.
    Based on the formation of Chinese person names, this paper summarizes the features of name character frequency and name boundary templates. It then combines the probability of cohesion, the probability of discrimination and the trustworthiness of boundary templates into an integrated trustworthiness score for a person name. The approach has been embedded into a simple named entity recognition platform and tested on the MSRA corpus. The experimental results show that our approach increases the F-measure by 2.27%, achieving a final F-measure of 91.72% for person name recognition.
    Key words: Chinese name recognition; trustworthiness; cohesion; boundary templates
  • Review
    WU Fanglei, LI Junhui, ZHU Qiaoming, LI Peifeng
    2011, 25(3): 51-59.
    This paper explores semantic role classification in Chinese via tree kernel methods, focusing on how to effectively capture the inherent structured knowledge in a parse tree. It extends the minimum syntactic structure and explores three syntactic structures suited to the characteristics of semantic role classification. It also explores a composite kernel that integrates feature-based and kernel-based methods. Evaluation on the Chinese PropBank shows that tree kernel-based semantic role classification achieves an accuracy of 91.79%. Moreover, our tree kernel method is complementary to the feature-based methods, and combining them further boosts accuracy to 94.28%, better than the state of the art.
    Key words: semantic role labeling; semantic role classification; tree kernel
  • Review
    GAO Song1,2, FENG Zhiwei3
    2011, 25(3): 59-64.
    Text clustering is of substantial importance to information retrieval. This paper presents a method that applies syntactic distribution information to text clustering, in order to avoid complex clustering algorithms while enabling a linguistic interpretation of the clustering features and results. Based on a dependency treebank, ten dependency relations are identified whose distributions differ distinctly between spoken and written Chinese. Using five of them as clustering features, the similarity of the spoken and written classes reaches 71.98% and 83.13%, respectively. The experimental results show that applying dependency relations to text clustering is feasible and effective.
    Key words: text clustering; clustering feature; dependency treebank; dependency relations; part of speech
  • Review
    GUO Yan1, LIU Chunyang2, YU Zhihua1, ZHANG Jin1, DAI Yuan1
    2011, 25(3): 64-72.
    After a thorough study of concepts such as Internet public opinion, web information sources and the impact of information sources, this paper proposes an evaluation index system for the impact of web information sources of public opinion. The evaluation method tries to capture the essential characteristics of Internet public opinion, taking into account the expressive force of the information source, the feedback from netizens and the feedback from other media. Similar in idea to PageRank, an algorithm called SrcRank is presented for computing the importance rating of information sources based on the reproduction relations among them. The case analysis results show that the proposed evaluation index system for the impact factor is reasonable.
    Key words: public opinion; the impact of information source; the evaluation index system; PageRank
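    To make the PageRank analogy concrete, the sketch below runs a plain PageRank-style power iteration over a reprint graph, where a source that reprints an item passes part of its score to the source it reprinted from. This is not the SrcRank algorithm itself, whose details the abstract does not give; the function name `rank_sources`, the damping factor, the edge direction and the example graph are all assumptions.

```python
from typing import Dict, List


def rank_sources(
    reprints: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """PageRank-style scores; reprints[b] lists the sources b reprinted from."""
    nodes = set(reprints)
    for origins in reprints.values():
        nodes.update(origins)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every source starts each round with the same baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            origins = reprints.get(node, [])
            if origins:
                # A reprinting source splits its score among its origins.
                share = damping * rank[node] / len(origins)
                for origin in origins:
                    new_rank[origin] += share
            else:
                # A source that reprints nothing spreads its score evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: two portals and a forum all reprint from one agency.
example = {
    "portal_a": ["agency"],
    "portal_b": ["agency"],
    "forum": ["agency", "portal_a"],
}
scores = rank_sources(example)
```

    In this toy graph the agency that everyone reprints from ends up with the highest score, which is the intuition behind rating sources by reproduction relations.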
  • Review
    LIU Jun, YAO Tianfang, QIU Wei
    2011, 25(3): 72-79.
    This paper introduces space-time elements into the opinion model and proposes the concept of an Opinion Important Factor. It analyzes the formulae of the Time Important Factor and carries out Resource Important Factor experiments on cellular phone and mini-car forums. To validate the space-time elements, it mines the trends of mini-car reviews and discusses opinion trend mining methods and their evaluation.
    Key words: space-time elements; opinion model; important factor; opinion trend; mining method
  • Review
    LU Dongyuan, LI Qiudan
    2011, 25(3): 79-86.
    The rapid development of Internet technologies enhances the interrelationship between users and online news. Besides the traditional characteristics of content and time, readers' interactive information, such as reader mood, is also considered a characteristic of the news. Fully exploring these characteristics to improve users' news browsing experience has recently become a challenging task. In this study, we propose a novel news recommendation method that integrates reader mood information with traditional news information such as content and time. The proposed method ranks news according to reader mood, the relevance between queries and news content, and the decay of importance over time. Additionally, we build a novel news recommendation system, which demonstrates the effectiveness of the proposed method.
    Key words: news recommendation; news characteristics; reader mood; semi-supervised ranking algorithm; negative correlation constraint
  • Review
    YANG Yuan, LIN Hongfei
    2011, 25(3): 86-93.
    This paper identifies conditional sentences in product reviews and determines whether the opinions expressed on different features are positive or negative. Conditional sentences usually contain conditional conjunctions, but some do not; these are called implicit conditional sentences. Implicit conditional sentences contain words that can express conditional relationships, which are called implicit conditional words. Conditional sentences are identified using conditional conjunctions, implicit conditional words, their tags and class sequential rules. To analyze their orientation, conditional sentences are classified by SVM into four classes according to conditional conjunctions and implicit conditional words. The experimental results show that the proposed method is useful.
    Key words: conditional sentences; conditional sentences identification; orientation
  • Review
    LIU Yufan1, GUO Jinzhong2, CHEN Qinghua2
    2011, 25(3): 93-98.
    It is meaningful to investigate character frequency distributions in Chinese literature across different periods, since this helps us learn more about how the Chinese language evolves over time. This paper presents the change in Chinese character frequency distribution since the Tang Dynasty by counting the character frequencies of 5 classical and modern Chinese literary works. Two character frequency distributions are clearly more similar when they come from closer periods, and all the distributions can be well fitted by exponentially truncated power law functions. Over time, the exponential property increases while the power law feature decreases.
    Key words: Chinese literature; character frequency distribution; exponential truncated power law
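    One common form of an exponentially truncated power law is given below for orientation. The abstract does not state the exact parameterization used, so the symbols here are assumptions: r is the frequency rank of a character, p(r) its relative frequency, alpha the power law exponent, lambda the exponential decay rate and C a normalizing constant.

```latex
p(r) = C \, r^{-\alpha} \, e^{-\lambda r},
\qquad \alpha \ge 0,\ \lambda \ge 0
```

    Under this reading, a growing exponential property with a weakening power law feature would correspond to lambda increasing and alpha decreasing in the fits for later periods.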
  • Review
    YE Na, ZHANG Guiping, HAN Yadong, CAI Dongfeng
    2011, 25(3): 98-104.
    Compared with fully automatic machine translation, computer assisted translation is more practical for real applications. In traditional computer assisted translation, users can only passively accept the translation provided by the system and post-edit it. This paper proposes a computer assisted translation approach based on a user behavior model, in which users' explicit behaviors in the post-editing process are recorded and users' translation decisions are discovered. In this way, the system can dynamically acquire and share users' translation knowledge to improve the quality of aided translation. Experimental results show that the user behavior model built on the post-editing of the first 30% of the text in a document improves the BLEU score of the translation candidates for the remaining 70% of the text by 4.9%. The precision of the translation knowledge in the user model reaches 94.1%.
    Key words: computer assisted translation; post-editing; user behavior model; translation knowledge; BLEU
  • Review
    ZHAI Feifei, ZONG Chengqing
    2011, 25(3): 104-112.
    Fillers and redundancy are among the most common phenomena in spoken dialogs, and they consistently affect the results of spoken language understanding and translation systems. Based on an analysis and statistical classification of lexical-level fillers in a spoken dialog corpus, we propose statistical methods to recognize fillers. Experiments were conducted on translating spoken sentences before and after filler processing. The experimental results show that the performance of the spoken language translation system improves significantly if fillers are processed before translation.
    Key words: fillers; maximum entropy classifier; SVM; CRFs
  • Review
    NUO Minghua1,2, WU Jian1, LIU Huidan1,2, DING Zhiming1
    2011, 25(3): 112-118.
    This paper describes methods to extract Tibetan phrase translations for translation-ready Chinese phrases from a Tibetan corpus of laws, regulations and official documents. The phrase extraction methods widely used so far depend heavily on the results of word alignment or on additional resources such as part-of-speech tagging or syntactic analysis. Given the currently inadequate resources for Tibetan, this paper proposes the Tibetan word string Statistical Method (TSM) and the Tibetan word sequence Intersection Algorithm (TIA) to extract Tibetan phrases. TSM works well on 1-1 translations, with over 90% accuracy, but misses 1-n translations. TIA can extract both continuous and discontinuous Tibetan phrases, for both 1-1 and 1-n translations. The experiments show that TIA achieves 81% accuracy on 1-n translations.
    Key words: Tibetan chunk; phrase translation extraction; Tibetan information processing; Chinese information processing
  • Review
    SU Jinsong1,2, LIU Qun1, LV Yajuan1
    2011, 25(3): 118-123.
    Phrase translation probability features have a great effect on statistical machine translation. The traditional estimation method is deficient in that it only handles phrases with consistent word alignments. In this paper, we modify the traditional formula to consider all occurrences of phrases in the corpus. Experimental results on various test sets demonstrate the effectiveness of our method.
    Key words: statistical machine translation; alignment inconsistency; phrase translation probability
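    As background, the conventional relative-frequency estimate used in phrase-based statistical machine translation can be written as follows. The notation is assumed here rather than taken from the abstract: the count is the number of times the source phrase and target phrase are extracted together as a pair consistent with the word alignment. The abstract does not give the modified formula, so only the conventional one is shown.

```latex
\phi(\bar{f} \mid \bar{e})
  = \frac{\operatorname{count}(\bar{e}, \bar{f})}
         {\sum_{\bar{f}'} \operatorname{count}(\bar{e}, \bar{f}')}
```

    Because both counts come only from alignment-consistent extractions, occurrences of a phrase in sentences where no consistent pair can be extracted do not enter the estimate, which is the deficiency the abstract describes.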
  • Review
    CHEN Shunqiang
    2011, 25(3): 123-129.
    After describing the necessity and significance of word segmentation for the Yi language, this paper introduces the principles of the Yi word norms and vocabulary and discusses the Yi word segmentation algorithm. Based on the characteristics of the Yi language, the Yi Man Automatic Segmentation is implemented in Java and yields good segmentation results.
    Key words: automatic segmentation; Yi language; sub-word units