2012 Volume 26 Issue 3 Published: 15 June 2012
  

  • Review
    SONG Yan1, HUANG Changning2, KIT Chunyu1
    2012, 26(3): 3-9.
    Combinatory Categorial Grammar (CCG) is a type-driven lexicalized grammar formalism with a transparent interface between syntax and semantics, which is essential to in-depth text processing. To apply CCG to real texts, however, a large-scale lexicon must be constructed as indispensable support, demanding a great deal of manpower and resources. An effective way to alleviate this problem is to transform an existing treebank into a CCGbank. This paper presents an approach to deriving a Chinese CCGbank from the Tsinghua Chinese Treebank, with the aid of a number of predefined verb sub-categorization frames and Chinese sentence patterns. The resulting CCGbank includes 32,737 sentences with over 350k word tokens. The effectiveness of this approach is confirmed by an evaluation against manually annotated references and a comparison with previously reported work on the construction of several CCGbanks.
    Key words: combinatory categorial grammar; treebank; Chinese sentence pattern; verb sub-categorization frame
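    The type-driven derivation at the heart of CCG can be illustrated with a toy sketch. This is an illustration of the formalism only, not the treebank-conversion procedure the paper describes; categories are written as plain strings with outer parentheses omitted, so the transitive verb (S\NP)/NP appears as "S\NP/NP".

```python
# Toy CCG combinators: categories as strings, outer parentheses omitted.

def forward_apply(left, right):
    """X/Y combined with Y yields X (forward application, >)."""
    if left.endswith("/" + right):
        return left[:-(len(right) + 1)]
    return None

def backward_apply(left, right):
    """Y combined with X\\Y yields X (backward application, <)."""
    if right.endswith("\\" + left):
        return right[:-(len(left) + 1)]
    return None

# "John sees Mary": NP  (S\NP)/NP  NP
vp = forward_apply("S\\NP/NP", "NP")   # verb + object  -> "S\NP"
s = backward_apply("NP", vp)           # subject + VP   -> "S"
```

    A CCGbank pairs every word with such a category, so a derivation for a whole sentence is just a sequence of these combinator applications.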
  • Review
    MA Ji, ZHU Muhua, XIAO Tong, ZHU Jingbo
    2012, 26(3): 9-16.
    This paper proposes a single-model system combination method for shift-reduce parsing. In the training step, a group of shift-reduce parsers is automatically constructed by varying the distribution of the training data. In the decoding step, our method first uses these parsers to parse the input sentence, and then uses a linear model to select the parse tree with the highest score as the final result. Experiments on the English Penn Treebank show that our method yields significant improvements over the baseline system.
    Key words: parsing; system combination; shift-reduce parser
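    The decoding step described above amounts to scoring each candidate parse with a linear model and taking the argmax. A minimal sketch follows; the feature names and weights are invented for illustration and are not the paper's actual feature set.

```python
# Sketch of the selection step: several parsers each propose a parse,
# and a linear model picks the highest-scoring candidate.

def linear_score(features, weights):
    """Dot product of a sparse feature dict with a weight dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def select_parse(candidates, weights):
    """Return the candidate parse whose features score highest."""
    return max(candidates, key=lambda c: linear_score(c["features"], weights))

weights = {"log_prob": 1.0, "n_agree": 0.5}   # hypothetical learned weights
candidates = [
    {"tree": "(S ...)", "features": {"log_prob": -2.3, "n_agree": 3}},
    {"tree": "(S ...)", "features": {"log_prob": -1.9, "n_agree": 1}},
]
best = select_parse(candidates, weights)
```

    A parse that more of the ensemble members agree on (high "n_agree") can beat one with a slightly better model probability, which is the point of combining the parsers.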
  • Review
    ZHOU Huiwei, HUANG Degen, GAO Jie, YANG Yuansheng
    2012, 26(3): 16-22.
    We present a method that combines the maximum spanning tree (MST) algorithm and a deterministic algorithm for Chinese dependency parsing. We introduce the results and the dependency degree of the Nivre parser into the MST parser. Our system achieves an accuracy of 86.49% using 10-fold cross-validation on the Penn Chinese Treebank corpus, a significant improvement in parsing accuracy.
    Key words: Chinese dependency analysis; maximum spanning tree algorithm; deterministic algorithm
  • Review
    CHEN Bo1,2, JI Donghong2, LV Chen2
    2012, 26(3): 22-27.
    Constructing large-scale Chinese semantic resources is one of the major tasks in current Chinese information processing. However, conventional approaches to semantic parsing are deficient in that they cannot denote the semantic relatedness between Chinese words and constituents. We propose a semantic annotation approach based on feature structure and accordingly construct large-scale Chinese semantic resources. We choose the "subject-predicate predicate sentence" as the research target, summarize seven categories of feature triples, and compare the results of three different analysis methods. Parsing based on feature structure provides an appropriate way to tackle special Chinese sentence patterns. It also provides more semantic information than conventional approaches, and improves annotation efficiency and accuracy.
    Key words: feature structure; Chinese subject-predicate predicate sentence; semantic tagging; semantic resource
  • Review
    S. Loglo, HUA Shabao, Sarula
    2012, 26(3): 27-33.
    Mongolian information processing has completed the basic stage of word processing and is now entering the stage of sentence processing. With the support of the National Natural Science Foundation, we have constructed the Mongolian Dependency Treebank (MDTB). In this paper, we use the MDTB as training and evaluation data to design and implement a Mongolian dependency parsing model based on lexical dependency probability. Currently, the model achieves accuracies of 71.24%, 61.42% and 93.05% in the unlabeled annotation score, the labeled annotation score and the head-word annotation score, respectively.
    Key words: Mongolian; dependency grammar; parsing; probability model
  • Review
    WANG Zhongqing, LI Shoushan, ZHU Qiaoming, LI Peifeng, ZHOU Guodong
    2012, 26(3): 33-38.
    Sentiment classification has undergone significant development in recent years. However, most existing studies assume a balance between the numbers of negative and positive samples, which may not hold in reality. In this paper, we collect product reviews from four domains and find that positive samples greatly outnumber negative ones. To handle imbalanced Chinese sentiment classification, we propose a novel approach that combines sampling and classification algorithms under an ensemble learning framework. Evaluation across different domains shows that the proposed approach outperforms several existing imbalanced classification methods.
    Key words: sentiment classification; imbalanced classification; ensemble learning
  • Review
    SONG Yanxue, ZHANG Shaowu, LIN Hongfei
    2012, 26(3): 38-44.
    This paper studies ambiguous collocations from the perspective of sentiment analysis, which are of substantial importance for text sentiment analysis. We use association rules to determine whether word pairs are collocations, filter out unnecessary ones by PMI, and then resolve the collocation ambiguity. As a result, we build a dictionary of ambiguous collocations, which is adopted in sentence sentiment analysis. Experiments on the COAE2008 corpus show that ambiguous collocations are essential to sentiment analysis and significantly influence its accuracy.
    Key words: ambiguous collocations; association rule; orientation; affective computing
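    The PMI filtering step mentioned above can be sketched as follows; the cutoff value is an assumption for illustration, not a threshold taken from the paper.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x,y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def is_collocation(count_xy, count_x, count_y, total, threshold=3.0):
    """Keep a candidate word pair only if its PMI clears the cutoff."""
    return pmi(count_xy, count_x, count_y, total) >= threshold
```

    Word pairs that co-occur no more often than chance predicts get a PMI near zero and are cut; strongly associated pairs survive as collocation candidates.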
  • Review
    WEI Xiangfeng, ZHANG Quan, MIAO Jianming, CHI Yuhuan
    2012, 26(3): 44-49.
    Event sentiment analysis is of substantial significance in mining public opinion and hot topics. The words that impact the effect of event sentiment analysis are classified into four categories: object, polarity, logic and grade words. This paper proposes bi-gram and tri-gram models for sentence sentiment analysis, so that the opinion of an article can be derived via semantic chunks. In the experiment, the key objects and their opinions are confirmed by humans, and the system then obtains the opinion of an article based on the sentence sentiment analysis results. Experimental results show that semantic chunks improve the accuracy of analysis and that standpoint decision is a key step in event sentiment analysis.
    Key words: opinion; semantic chunks; standpoint decision
  • Review
    YU Miao, LV Yajuan, SU Jinsong, LI Xianhua
    2012, 26(3): 49-54.
    This paper presents a method for Chinese address translation that combines rules and statistics. First, word segmentation and word-type tagging are performed using a division dictionary, a keyword dictionary and a pattern table. Then the address is chunked into several address units according to word type. Each address unit is translated by a statistical hierarchical phrase translation model with the help of a few dictionary entries and manual templates. Finally, the address unit translations are assembled into the whole in reverse order. Experiments show that this approach achieves good translation results.
    Key words: Chinese address; machine translation; address unit
  • Review
    XI Ning, LI Boyuan, HUANG Shujian, CHEN Jiajun
    2012, 26(3): 54-59.
    Chinese word segmentation is the first phase in building statistical machine translation (SMT) systems from Chinese into other languages. However, Chinese word segmenters trained on monolingual corpora are not necessarily suitable for SMT systems, so it is necessary to build an MT-motivated Chinese word segmenter to improve translation quality. In this paper, we incorporate two kinds of knowledge to train a Chinese word segmenter: the first comes from Chinese-character-based bilingual alignment, and the other comes from conventional monolingual Chinese word segmentation. Both kinds of knowledge are jointly employed to train an MT-motivated word segmenter using conditional random fields. In the experiment, we segment the Chinese portions of the training, development and test sets with the proposed segmenter and build a phrase-based machine translation system. The results show an effective improvement over the baselines in terms of translation quality.
    Key words: Chinese word segmentation; statistical machine translation; word alignment reliability
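    A CRF segmenter of this kind typically labels each character with a position tag (e.g., B/M/E/S) predicted from local features. The sketch below shows character-window features plus a per-character alignment indicator standing in for the bilingual knowledge; these feature templates are assumptions for illustration, not the paper's actual templates.

```python
# Hedged sketch of character-level CRF features for segmentation.

def char_features(sent, i, aligned=None):
    """Unigram/bigram character features around position i, plus an
    optional per-character bilingual-alignment indicator (assumed)."""
    prev_c = sent[i - 1] if i > 0 else "<s>"
    next_c = sent[i + 1] if i + 1 < len(sent) else "</s>"
    feats = {
        "c0=" + sent[i]: 1,          # current character
        "c-1=" + prev_c: 1,          # previous character
        "c+1=" + next_c: 1,          # next character
        "c-1c0=" + prev_c + sent[i]: 1,   # character bigram
    }
    if aligned is not None:
        feats["aligned=" + str(aligned[i])] = 1   # bilingual knowledge
    return feats
```

    Feeding both feature sources into one CRF is what lets the alignment-derived evidence and the monolingual evidence be weighed jointly during training.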
  • Review
    XIU Chi1, SONG Rou1,2
    2012, 26(3): 59-65.
    Chinese word segmentation based on CRFs (conditional random fields) has attracted the most attention in recent research. However, this method has certain defects in handling segmentation ambiguity: it eliminates most original ambiguity errors at the cost of introducing new ones. In this paper, we attempt a simple, example-based machine learning method to deal with segmentation ambiguity: a method based on stable strings. The experimental results indicate that the stable-string-based method can resolve ambiguity simply and effectively, without introducing new errors.
    Key words: Chinese word segmentation (CWS); CRF; stable string; ambiguity; machine learning
  • Review
    ZHANG Yingjie1, LI Bin1,2, CHEN Jiajun1, CHEN Xiaohe2
    2012, 26(3): 65-72.
    Word sense disambiguation (WSD) is a basic task of natural language processing, including the processing of ancient Chinese documents. In this paper we focus on the specific field of analyzing pre-Qin ancient Chinese documents. Considering the shortage of training data and semantic resources, we employ a semi-supervised machine learning method to perform all-word WSD of Zuo Zhuan, using Chinese Dictionary v2.0 as the knowledge resource. We randomly select 22 words of varying frequency and sense number to evaluate the proposed method. On the selected words, our method achieves an average accuracy of 67%, significantly higher than the baseline of selecting the most frequent sense. This method is promising for sense tagging of ancient Chinese documents when no training data is available. It also provides a raw sense tagging result for human correction, enriching traditional dictionaries, which usually suffer from insufficient word sense entries.
    Key words: word sense disambiguation; sense tagging; ancient Chinese; natural language processing
  • Review
    ZHANG Yangsen, HUANG Gaijuan, SU Wenjie
    2012, 26(3): 72-79.
    We present a new approach to Chinese word sense disambiguation based on the latent maximum entropy principle (LME), which differs from Jaynes' maximum entropy principle in that the latter only uses context statistical characteristics to construct the language model. After studying the relationship between words and sememes in HowNet, we convert the word collocations obtained from the context of the training corpus into sememe collocations, realizing the extraction of latent semantic features of text based on sememe collocations. Combined with traditional context features, the latent maximum entropy principle is applied to disambiguate polysemous words. Experimental results show that the proposed method improves accuracy by about 4% in the sense disambiguation of 10 polysemous verbs.
    Key words: latent maximum entropy principle; text latent features; sememe collocation information; word sense disambiguation
  • Review
    ZHANG Shujuan, DONG Xishuang, GUAN Yi
    2012, 26(3): 79-86.
    Focusing on synonym recognition in e-commerce, this paper presents a method to recognize synonyms based on user behaviors, in order to deal with the considerable number of new words, typos and near-synonyms in this domain. First, candidate synonym sets are retrieved by analyzing titles and their corresponding queries based on SimRank theory. Then features are extracted, including literal, title, query and click features. Finally, a gradient boosted decision tree (GBDT) model is adopted to determine whether candidate synonyms are true synonyms. The experimental results show that GBDT is well suited to this task, achieving a precision of 56.52%.
    Key words: synonym recognition; user behaviors; SimRank; Gradient Boost Decision Tree
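    The SimRank retrieval step works on the intuition that two queries are similar if they lead to similar titles, and vice versa. A minimal sketch of basic SimRank on a small title-query graph follows; the decay factor, iteration count and toy graph are illustrative, not the paper's settings.

```python
# Minimal SimRank sketch on a bipartite title-query graph.
from itertools import product

def simrank(neighbors, nodes, c=0.8, iterations=5):
    """neighbors[n] = set of nodes linked to n; returns pairwise similarities.
    s(a,b) = c / (|N(a)||N(b)|) * sum of s(x,y) over neighbor pairs."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif neighbors[a] and neighbors[b]:
                total = sum(sim[(x, y)] for x in neighbors[a] for y in neighbors[b])
                new[(a, b)] = c * total / (len(neighbors[a]) * len(neighbors[b]))
            else:
                new[(a, b)] = 0.0
        sim = new
    return sim
```

    Two queries that click through to the same title end up with a nonzero similarity even though they never co-occur, which is what surfaces typo and near-synonym candidates.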
  • Review
    ZHOU Qiang1,2, WANG Junjun3, CHEN Liou3
    2012, 26(3): 86-92.
    This paper proposes a solution to the construction of a large-scale Chinese event knowledge base. Static knowledge bases and a dynamically annotated corpus are integrated to describe complete event content. Within a unified framework, five different sub-databases are partitioned and developed independently. They can be combined into a whole event knowledge base through built-in keywords designed in advance. A demonstration knowledge base describing Chinese existence and ownership events was built under this framework. Its static knowledge base covers 72 situations and 1,548 word senses, and the dynamically annotated corpus contains 100,000 event-chunk-annotated sentences for 598 event target verbs. The experimental results prove the feasibility of the proposed method.
    Key words: event analysis; event annotation; event knowledge base
  • Review
    YANG Erhong1, ZENG Qingqing1, LI Tingting2
    2012, 26(3): 92-98.
    The distribution of event words in a text reveals its event information structure. Through observation of real news texts on sudden events, our research indicates that a news text is composed of two elements: the main information chain and the secondary information chain. The main information chain is the text's event information structure, comprising the preceding-core event information chain, the core event information chain, the secondary event information chain and the post-generation event information chain. We also study event information structure detection with event words as triggers, adopting HowNet to improve the event-word-based detection.
    Key words: event word; event information structure; main information chain; secondary information chain
  • Review
    NUO Minghua1,2, LIU Huidan1,2, WU Jian1, DING Zhiming1
    2012, 26(3): 98-104.
    This paper aims to construct a Chinese-Tibetan multi-word equivalence dictionary for a machine-aided translation system. It proposes the CMWEPM model, which extracts multi-word equivalences in two phases. First, CMWEPM defines the boundaries of Chinese multi-word units by collocation and binding degree. Then it extracts strict and constrained multi-word equivalences based on word alignments, respectively. The CMWEPM model classifies multi-word units according to their length and frequency, and sets different thresholds for different types. This strategy improves translation quality through a higher recall of multi-word equivalent pairs, which play a significant role in a Chinese-Tibetan machine-aided translation system.
    Key words: Tibetan information processing; multi-word units; collocation
  • Review
    YANG Yuan, MA Yunlong, LIN Hongfei
    2012, 26(3): 104-109.
    This paper focuses on clustering different feature expressions in product reviews into proper groups. In product reviews, the same feature may have different expressions; e.g., "appearance" and "design" of a mobile phone actually indicate the same feature. Considering that different expressions are often used with the same sentiment words in a sentence, this paper first extracts product feature expressions and sentiment words in pairs to build a bipartite graph, then adopts weight-normalized SimRank to compute the similarity between different feature expressions in the bipartite graph, and finally optimizes a Bayesian classifier in semi-supervised learning via the similarity. Experimental results show that the proposed method is valid.
    Key words: product features; group; SimRank; semi-supervised learning
  • Review
    TU Xinhui, ZHANG Hongchun1,2, ZHOU Kunfeng1,2, HE Tingting1,2
    2012, 26(3): 109-116.
    Wikipedia is the largest web-based encyclopedia, written collaboratively by volunteers around the world. It has many advantages, such as wide knowledge coverage, a highly structured organization and rapid information update. However, the official Wikipedia website just provides some original data files, and much of the structured semantic knowledge cannot be used directly. Therefore, in this paper, we first extract the structured information from these data files; then we design an object model for the information in Wikipedia and provide an open API for it; finally, we propose a novel method to compute the relatedness between words.
    Key words: semantic relatedness; Chinese Wikipedia; structured information
  • Review
    WANG Suge1,2,WU Suhong3
    2012, 26(3): 116-122.
    Feature-opinion extraction is one of the key research problems in opinion mining, with a significant effect on the performance of opinion orientation identification. This paper proposes an approach to mining evaluation features and opinions based on dependency information and chunk information. Using the dependency relations between words, we construct rules to obtain chunks containing the evaluation feature and opinion, and further design three algorithms to get the candidate evaluation features and candidate feature-opinion pairs. Experimental results show that the overall F1-measure reaches 87.10% on scenic spot reviews of Shanxi, proving the effectiveness of the proposed method.
    Key words: feature-opinion; dependency relation; chunk
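    Rule-based harvesting of feature-opinion pairs from dependency parses can be sketched as below. The two relations shown follow common dependency-grammar conventions and are stand-ins for the paper's actual rule set, which operates on chunks.

```python
# Hedged sketch: harvesting candidate feature-opinion pairs from
# dependency triples of the form (head, relation, dependent).

def extract_pairs(triples):
    """Pair a noun feature with an opinion word linked by a dependency rule."""
    pairs = []
    for head, rel, dep in triples:
        if rel == "amod":        # "beautiful scenery" -> (scenery, beautiful)
            pairs.append((head, dep))
        elif rel == "nsubj":     # "the hall is crowded" -> (hall, crowded)
            pairs.append((dep, head))
    return pairs

triples = [("scenery", "amod", "beautiful"), ("crowded", "nsubj", "hall")]
candidate_pairs = extract_pairs(triples)
```

    Each rule fires on one dependency configuration; adding rules for more configurations is how such systems trade precision against recall.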
  • Review
    XU Liheng1, LIU Yang1, LAI Siwei1, LIU Kang1, TIAN Ye2, WANG Yuli2, ZHAO Jun1
    2012, 26(3): 122-129.
    This paper proposes an ontology concept acquisition method based on heterogeneous features. We regard the Encyclopedia of China as the taxonomy of the ontology, extract Web knowledge base articles as concepts, and learn taxonomic relations between concepts by considering text content, folksonomies and semi-structured information. We extend the Encyclopedia of China to a mega-scale global Chinese ontology, which provides practical support for concept attribute extraction, non-taxonomic relation extraction and other applications such as question answering systems. Experimental results show that the proposed method achieves an 11.8% performance improvement compared with the single-feature method.
    Key words: ontology; heterogeneous features; concept acquisition
  • Review
    HAO Xiulan1, HU Yunfa2, SHEN Qing1
    2012, 26(3): 129-137.
    The Internet is flooded with user-generated content, such as posts in web forums. How to monitor these scrappy, rambling messages is a concern for safety agencies. Topic detection and tracking (TDT) is one effective way to monitor sensitive information. However, the salient features of replies to posts in a web forum (e.g., short length and swift "topic drifting") challenge TDT over web forums. According to the characteristics of replies, three models are proposed in this paper. First, a baseline model employing a single-pass clustering procedure is described. Second, to alleviate "topic drifting", an improved model is proposed, in which terms in the title are used to adjust the weights of terms in the post, and a topic is represented by a seminal vector and a tracked vector. Third, a late reweighting technique for named entities (NEs) is applied. To deal with the free format of user-generated content and meet the speed requirement, a new feature extraction procedure is proposed. Experimental results on a real data set prove that the proposed models and feature extraction procedure are feasible.
    Key words: content monitor; Chinese web forum; feature extraction