2013 Volume 27 Issue 5 Published: 15 October 2013
  

  • Review
    ZHANG Kaixu, ZHOU Changle
    2013, 27(5): 1-8.
    Large-scale unlabeled data contains abundant lexical information for NLP tasks such as Chinese word segmentation and POS tagging. This work extracts high-dimensional distributional lexical information from a large-scale unlabeled Chinese corpus; an auto-encoder then performs unsupervised dimension reduction. The learned low-dimensional representations are used as new lexical features in a joint Chinese word segmentation and POS tagging task. Experiments on the Chinese Treebank 5 corpus show that the additional lexicon features improve performance and outperform features learned by principal component analysis or the k-means algorithm.
    Key words: unsupervised feature learning; Chinese word segmentation; part-of-speech tagging
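As a rough standard-library sketch of the kind of high-dimensional distributional features involved (the feature definition, corpus, and all names below are illustrative, not the paper's): each word can be represented by counts of its left and right neighbours, which a dimension-reduction step such as an auto-encoder would then compress.

```python
from collections import Counter, defaultdict

def distributional_features(corpus, vocab):
    """Count left/right neighbouring tokens for each word of interest.

    Each word ends up with a high-dimensional sparse count vector; the
    paper applies an auto-encoder on top of features of this kind.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sentence in corpus:  # sentence: list of tokens
        for i, w in enumerate(sentence):
            if w not in vocab:
                continue
            if i > 0:
                left[w][sentence[i - 1]] += 1
            if i < len(sentence) - 1:
                right[w][sentence[i + 1]] += 1
    return left, right

# Toy two-sentence corpus, tracking one word.
corpus = [["我", "喜欢", "苹果"], ["我", "喜欢", "音乐"]]
left, right = distributional_features(corpus, {"喜欢"})
```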
  • Review
    LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun
    2013, 27(5): 8-15.
    Word segmentation is a fundamental task in Chinese natural language processing, and character-based statistical machine learning is currently the mainstream approach. However, conventional machine learning methods rely heavily on manually designed features, and modifying these features and verifying their effectiveness is labor-intensive. With the rapid development of neural-network-based representation learning, learning features automatically has become realistic. This paper investigates a Chinese word segmentation method based on representation learning. We first learn embedding vectors for Chinese characters from a large corpus in an unsupervised manner, and then apply them to a supervised neural-network-based Chinese word segmenter. Experimental results show that representation learning is effective for Chinese word segmentation. However, due to the limited corpus size, it cannot yet replace conventional machine learning methods based on manually designed features.
    Key words: representation learning; Chinese word segmentation
  • Review
    XU Runhua1, QU Weiguang2, CHEN Xiaohe3, WANG Dongbo4
    2013, 27(5): 15-22.
    Four-character idioms are highly productive and derivative, and the use of the four-character pattern to coin new words in modern Chinese vocabulary is still on the rise. This article focuses on the large number of four-character idioms in word-segmented corpora, analyzing and categorizing them. It then compares the segmentation of four-character idioms both within a single segmented corpus and across different segmented corpora. Finally, using the results of this segmentation comparison as training data for a CRF model, the article investigates the recognition of four-character idioms in corpora. The results show that recognition accuracy for four-character idioms exceeds 93% in both closed and open tests.
    Key words: four-character idioms; word-segmented corpora; segmentation comparison; CRF
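Character-based CRF recognition of this kind usually starts from a per-character tag encoding. The sketch below shows the standard BMES encoding such a model could be trained on; it is a generic illustration, not the paper's actual feature set or training scheme.

```python
def bmes_tags(words):
    """Map a segmented sentence (list of words) to per-character BMES
    tags: B = begin, M = middle, E = end, S = single-character word.
    A four-character idiom kept as one unit yields the pattern B M M E.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# "一心一意" (a four-character idiom) segmented as a single word.
tags = bmes_tags(["一心一意", "地", "工作"])
```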
  • Review
    ZHANG Yan1, ZHANG Yang2, SUN Maosong1
    2013, 27(5): 22-29.
    The study of a dialect comprises its phonology, vocabulary, and grammar, and the first step is to identify the dialect vocabulary. To date, the collection of Chinese dialect words has mainly been carried out by experts, which is time-consuming and labor-intensive. With the development of information technology, people communicate widely over the network, and input method data therefore contains a vast amount of vocabulary together with geographical information, which can help discover dialect words automatically. However, there have been very few studies on exploiting input method data to investigate dialects systematically. This paper analyzes the user behavior of Chinese input methods and, on that basis, proposes to discover geographical dialect vocabulary automatically. Specifically, the paper identifies two representative features of dialect words in Chinese input method data and uses different combinations of these features to recognize dialect words. Finally, extensive experiments evaluate the impact of the feature combinations on dialect word recognition.
    Key words: dialect detection; Chinese Pinyin input method; feature combination
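One plausible feature of the kind described can be sketched with the standard library (the scoring rule and all names below are hypothetical, not the paper's): how concentrated a candidate word's input-method usage is in a single region.

```python
from collections import Counter

def region_concentration(usage):
    """Share of a word's occurrences coming from its dominant region.

    `usage` is a list of region labels, one per observed use of the
    word in input method logs. A high score suggests a geographically
    restricted (dialect) word; the threshold and the second feature
    used in the paper are not reproduced here.
    """
    counts = Counter(usage)
    total = sum(counts.values())
    region, top = counts.most_common(1)[0]
    return region, top / total

# Toy usage log: three uses from Guangdong, one from Beijing.
region, score = region_concentration(["广东", "广东", "广东", "北京"])
```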
  • Review
    LI Guangyi, WANG Houfeng
    2013, 27(5): 29-35.
    Named entity recognition and disambiguation is an important topic in natural language understanding. For the task of named entity recognition and disambiguation with a given entity knowledge base, this paper presents a method based on multi-stage clustering. First, we link documents to entity definitions in the knowledge base by two rounds of clustering. Second, we group entities that do not exist in the knowledge base by hierarchical agglomerative clustering. Finally, we recognize ordinary words and adjust the results by k-means clustering. Experiments on the data of the CLP-2012 Chinese personal name disambiguation task show that our system performs well: the F-score on the test data is 86.68%, exceeding the best result of the bake-off by 6.46%.
    Key words: named entity recognition; named entity disambiguation; clustering
  • Review
    ZAN Hongying, ZHANG Jingjie, LOU Xinpo
    2013, 27(5): 35-43.
    Functional words play an important role in modern Chinese: together with word order, they constitute the syntactic means of the language, so they have an important influence on syntactic analysis. Dependency parsing is a research hotspot in natural language processing. To improve the recognition of dependency relations, this paper applies functional word usages to the dependency relation recognition process. Through the study of functional word usages and the analysis of dependency relations in parsing, we find that coordination relations are closely connected with conjunctions. Conjunction usages are therefore incorporated into the recognition of coordination relations to improve performance. The experimental results show that, by considering conjunction usages, the LAS and UAS of coordination relations increase by 3.43% and 2.29%, respectively.
    Key words: functional word usages; dependency parsing; coordination relations
  • Review
    SHI Cui1,2, ZHOU Qiaoli1, ZHANG Guiping1
    2013, 27(5): 43-51.
    Based on a Chinese patent corpus, this paper counts and analyzes the internal and external features of coordination with overt conjunctions (COC) in Chinese patent literature. The internal features investigated include the coordination tag, the internal structure of coordination, and the distribution of part-of-speech (POS) tags. For the external features, the paper counts candidate boundary markers and analyzes the contextual information of coordinate structures in Chinese patent literature.
    Key words: COC; Chinese patent literature; internal features; external features
  • Review
    XIONG Hao1,2, LIU Qun1, LV Yajuan1
    2013, 27(5): 51-60.
    Traditional methods for semantic role labeling (SRL) generally use local features to identify and classify semantic roles, and thus have difficulty capturing labeling inconsistencies. In this paper, we propose a graphical model that reranks the results via a label propagation algorithm. Experimental results on PropBank show that our model significantly improves performance by 2.4 points in F-score and obtains the best results on this data set without any system combination techniques.
    Key words: semantic role labeling; graphical model; reranking
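A minimal sketch of label propagation over a graph of labeling decisions (the graph construction and parameters here are illustrative; the paper's actual model is not reproduced): each node's score is repeatedly smoothed toward the weighted average of its neighbours while retaining part of its initial, locally predicted score.

```python
def propagate(adj, labels, iters=50, alpha=0.8):
    """Iterative label propagation.

    adj    : node -> list of (neighbour, edge_weight)
    labels : node -> initial score from a local classifier
    alpha  : weight kept on the initial score each iteration
    """
    scores = dict(labels)
    for _ in range(iters):
        new = {}
        for node, nbrs in adj.items():
            total = sum(w for _, w in nbrs)
            avg = (sum(scores[n] * w for n, w in nbrs) / total
                   if total else 0.0)
            new[node] = alpha * labels.get(node, 0.0) + (1 - alpha) * avg
        scores = new
    return scores

# Two connected decisions with conflicting local scores.
scores = propagate({"a": [("b", 1.0)], "b": [("a", 1.0)]},
                   {"a": 1.0, "b": 0.0})
```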
  • Review
    CHEN Bo1,2, JI Donghong2, LV Chen2
    2013, 27(5): 60-67.
    Parsing special Chinese sentence patterns is one of the major tasks in Chinese information processing. Current conventional approaches to semantic parsing cannot denote the semantic relatedness between Chinese words and constituents. In this paper, we choose the serial-verb sentence as the research target, propose a semantic annotation approach based on feature structures, and study semantic annotation models for serial-verb sentences. The feature structure model provides a different approach to semantic parsing for Chinese information processing, one that can represent the complicated semantic relations among the subject, predicates, and objects of a serial-verb sentence.
    Key words: feature structure; Chinese serial-verb sentence; semantic annotation; semantic resource
  • Review
    CHEN Tao1,2, XU Ruifeng1, WU Mingfen2, LIU Bin1
    2013, 27(5): 67-75.
    Considering that opinionated sentences often share the same or similar syntactic and semantic expression frameworks, this paper proposes a sentiment analysis approach based on sentiment sentence frameworks. First, we divide sentiment sentence frameworks into three categories and 105 subcategories. A framework extraction method is designed to semi-automatically extract sentiment sentence frameworks from annotated sentiment sentences using dependency, syntactic, and synonym features. The polarity of an input sentence is then determined by classifying its sentiment sentence frameworks. Evaluations on the NLP&CC 2013 micro-blog emotion analysis corpus and the Ren-CECps blog emotion corpus show that the proposed approach achieves better precision than word-based support vector machine classifiers.
    Key words: sentence framework; sentiment classification; syntactic feature; dependency feature
  • Review
    GUO Chong1, WANG Zhenyu2
    2013, 27(5): 75-84.
    This paper defines the concept of a sentiment ontology tree, which organizes evaluation pairs and the hierarchy of product aspects for fine-grained opinion mining, and proposes a method to construct the sentiment ontology tree automatically. We focus on evaluation pair extraction, orientation prediction for evaluation pairs, and aspect aggregation. Experimental results show that our algorithm is both appropriate and efficient.
    Key words: sentiment ontology tree; evaluation pair; orientation prediction; aspect aggregation
  • Review
    LV Yunyun1, LI Yang1, WANG Suge1,2
    2013, 27(5): 84-93.
    Large-scale, high-quality domain training data is an important guarantee for constructing a high-performance classifier, but labeling a large-scale corpus in a domain is expensive. In this paper, we propose a method for identifying Chinese opinion sentences using a small labeled corpus. First, the method uses bootstrapping to expand the small labeled corpus. With the expanded corpus we then train three classifiers based on naive Bayes, support vector machines, and maximum entropy, respectively. Finally, an ensemble classifier is obtained by assigning a set of probability weights to the three trained classifiers. Experimental results indicate that the ensemble classifier is superior to each of the three single classifiers, and that the proposed method achieves results with partially labeled training data comparable to those obtained with fully labeled training data.
    Key words: opinion sentence identification; bootstrapping; ensemble classifier
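The final combination step can be sketched as a convex weighting of the three classifiers' opinion probabilities (the weights and probability values below are invented for illustration; the paper's weight-setting procedure is not reproduced).

```python
def ensemble(probs, weights):
    """Combine per-classifier probabilities of the 'opinion' class with
    a convex set of weights (non-negative, summing to one)."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical outputs of NB, SVM, and MaxEnt on one sentence.
p = ensemble([0.9, 0.6, 0.7], [0.5, 0.3, 0.2])
label = "opinion" if p >= 0.5 else "fact"
```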
  • Review
    LI Sophia Yat Mei1, LI Shoushan1,2, HUANG Churen1, GAO Wei2
    2013, 27(5): 93-100.
    Emotion cause detection is an important task in emotion analysis research. It aims to detect the description of the cause of an emotion. In this study, we model the task as a sequence labeling problem, predicting whether each related sentence belongs to an emotion cause or not. Specifically, we apply a conditional random field (CRF) model with various features, such as basic word features, POS features, context features, and linguistic rule features. Empirical studies demonstrate that these features are effective for the task, especially the context features. Moreover, we find that the sequence labeling model is superior to a classification model when similar features are employed.
    Key words: sequence labeling; emotion cause detection; context feature; linguistic rule features
  • Review
    LI Xia1,2, LIU Jianda2
    2013, 27(5): 100-107.
    There are now a large number of Chinese learners of English in China, and the sheer quantity and difficulty of English writing assessment have become a bottleneck in English teaching and testing, so effective automatic essay scoring algorithms are in great need. In this paper, we first propose a feature selection method that can extract the writing characteristics of Chinese learners effectively and automatically. We then propose an ensemble-learning-based automatic essay scoring algorithm for unbalanced essay data. Classification results on 1,115 university CET-4 and CET-6 essays from CLEC show that our algorithm substantially improves precision, recall, and F-measure compared with classifiers designed for balanced data.
    Key words: automatic essay scoring; unbalanced data classification; multinomial naive Bayes
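For reference, a toy multinomial naive Bayes with add-one smoothing, the generic form of the classifier named in the key words (the features and data below are invented; the paper's feature selection and ensemble scheme are not reproduced).

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)          # class counts
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.prior.values())
        v = len(self.vocab)
        for c in self.classes:
            lp = math.log(self.prior[c] / n)
            total = sum(self.word_counts[c].values())
            for w in doc:
                lp += math.log((self.word_counts[c][w] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy essay-feature documents labeled with score bands.
nb = MultinomialNB().fit(
    [["good", "essay"], ["bad", "grammar"], ["good", "structure"]],
    ["high", "low", "high"])
```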
  • Review
    LIU Liu1, LI Bin1,2, QU Weiguang3, CHEN Xiaohe1
    2013, 27(5): 107-114.
    The temporal properties of words reveal how words change across particular periods. We divide the pre-Qin era into three periods: pre-Chunqiu, Chunqiu, and Zhanguo, and focus on three kinds of words: those used only in one period, those popular in one period, and those that arose in one period. We also propose methods using a vector space model (VSM) and a naive Bayes classifier to determine the period of a text, and experiment on 25 pre-Qin texts. The naive Bayes classifier performs much better. With the same method we verify that the Liezi was not written in the pre-Qin era.
    Key words: pre-Qin words; period; VSM; naive Bayes classifier
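The VSM variant can be sketched as cosine similarity between a text's bag-of-words vector and per-period vocabulary profiles (the profiles and words below are invented toy data, not the paper's period lexicons).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counter vectors."""
    num = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def date_text(doc, period_profiles):
    """Assign a text to the period whose vocabulary profile is closest
    in the vector space model."""
    vec = Counter(doc)
    return max(period_profiles,
               key=lambda p: cosine(vec, period_profiles[p]))

# Toy period vocabulary profiles.
profiles = {"Chunqiu": Counter(["王", "師", "公"]),
            "Zhanguo": Counter(["兵", "法", "攻"])}
period = date_text(["兵", "法", "兵"], profiles)
```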
  • Review
    KOU Wanqiu, LI Fang
    2013, 27(5): 114-122.
    Traditional topic models represent topics as probability distributions over words, which are difficult to understand and rarely express a consistent meaning. This paper proposes a topic label extraction method based on seed words. The method first extracts topic seed words according to weighting formulas, then uses a bootstrapping algorithm to generate a set of key phrases containing the seed words. Finally, it selects a topic label from the key phrase set according to the integrity and generality of each phrase. Experiments were carried out on two corpora: topic-oriented reports and event-based news reports. The results show that the method works well in extracting a meaningful phrase to represent a topic.
    Key words: topic labelling; seed word extraction; bootstrapping method
  • Review
    HONG Huan, WANG Mingwen, WAN Jianyi, LIAO Yanan
    2013, 27(5): 122-129.
    Query expansion is an effective way to improve retrieval effectiveness. Traditional query expansion methods mostly select expansion terms considering only the relevance of a single query word, without fully considering the relevance among terms, documents, and queries, which limits the effect of expansion. To solve this problem, we first construct a Markov network of the term and document subspaces to extract maximal term cliques and document cliques. We then divide the maximal term cliques into document-dependent and document-independent cliques through the mapping between term and document cliques, and build a Markov network retrieval model based on document clique dependency to perform the initial search. Next, we construct a Markov network of the query subspace from the search results and extract maximal query cliques. Finally, we calculate the probability between documents and queries iteratively and build the final multi-layer Markov network information retrieval model. Experimental results show that our model improves retrieval results.
    Key words: Markov network; query expansion; document dependency; clique; information retrieval
  • Review
    LIU Maofu1,2, LI Yan1,2, JI Donghong3
    2013, 27(5): 129-137.
    To strengthen deep semantic analysis and inference for textual entailment, this paper proposes an event-semantic-feature-based method for Chinese textual entailment recognition. The method generates event graphs from an event-labeled corpus, so that entailment recognition between text pairs is reduced to entailment recognition between event graphs. The event semantic feature is computed from the maximum common subgraph. Combined with surface statistical, lexical semantic, and syntactic features, it is used to classify textual entailment with a support vector machine, yielding a preliminary result; a correction module based on event semantic rules then refines this into the final result. The experimental results show that the event-semantic-feature-based method is effective and efficient for Chinese textual entailment recognition.
    Key words: textual entailment; event semantic feature; maximum common subgraph; support vector machine
  • Review
    ZHOU Huiwei, YANG Huan, HUANG Degen, LI Yao, LI Lishuang
    2013, 27(5): 137-144.
    Hedge scope detection distinguishes factual from uncertain information and can improve the authenticity and reliability of information extraction. It is a difficult task because of its dependence on semantic and syntactic structures. In this paper, we propose a hedge scope detection method based on syntactic structural constraints. First, two decision trees are constructed, on dependency structures and phrase structures respectively, to build the syntactic constraint set. The detection results based on this constraint set are then used as syntactic constraint features in conditional random field (CRF) models. Experiments on the CoNLL-2010 corpus achieve an F-score of 70.28% on gold-standard hedge cues, 4.22% higher than a system with common syntactic features.
    Key words: hedge scope detection; syntactic structural constraints; decision tree; conditional random fields
  • Review
    CHEN Peng1, GUO Jianyi1,2, YU Zhengtao1,2, XIAN Yantuan1,2, YAN Xin1,2, WEI Sichao1
    2013, 27(5): 144-149.
    In feature-based Chinese domain-specific entity relation extraction with kernel-based machine learning methods, different kernel functions lead to different performance. This paper proposes a convex combination kernel function method to address this problem. First, we choose lexical information, phrase syntactic information, and dependency syntactic information as features. Next, we obtain different high-dimensional matrices by mapping with different convex combination kernel functions. Finally, we select the optimal kernel by testing all classification models trained on these matrices with an SVM. We conducted relation extraction experiments on a collection of 600 tourism-domain texts; the results show that the proposed optimal convex combination kernel function effectively improves extraction performance, reaching a best F-score of 62.9.
    Key words: relation extraction; convex combination kernel function; support vector machine
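The core construction can be illustrated directly: a convex combination K = Σ λᵢ·Kᵢ of base kernel matrices, with non-negative weights summing to one, is itself a valid (positive semi-definite) kernel. The matrices and weights below are toy values, not the paper's learned combination.

```python
def convex_combine(kernels, lambdas):
    """Convex combination of square kernel matrices (lists of lists):
    all weights non-negative and summing to one."""
    assert all(l >= 0 for l in lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(l * K[i][j] for l, K in zip(lambdas, kernels))
             for j in range(n)] for i in range(n)]

K_lex = [[1.0, 0.2], [0.2, 1.0]]  # hypothetical lexical-feature kernel
K_syn = [[1.0, 0.6], [0.6, 1.0]]  # hypothetical syntactic-feature kernel
K = convex_combine([K_lex, K_syn], [0.5, 0.5])
```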
  • Review
    SHI Bei, SUN Le, HAN Xianpei
    2013, 27(5): 149-156.
    An alias of an entity is a different name that refers to the same entity. Traditional alias extraction methods often face two problems: 1) the difficulty of constructing a training corpus; 2) a lack of timeliness. To resolve these problems, this paper proposes a graph-based alias extraction method using query logs. The method uses context information and query-link information, constructs a two-layer graph (comprising a candidate alias layer and a query-link layer), and ranks the aliases using a random walk algorithm. The experimental results show that: 1) our method achieves an accuracy of 71.8%, demonstrating its effectiveness; 2) using query-link information outperforms using context information, and combining the two types of information further improves performance.
    Key words: query log; alias extraction
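The ranking step can be sketched as a random walk with restart over a small two-layer graph (the graph, restart vector, and damping factor below are illustrative, not the paper's actual construction).

```python
def random_walk(adj, restart, alpha=0.85, iters=100):
    """Random walk with restart by power iteration.

    adj     : node -> list of out-neighbours
    restart : node -> restart probability mass
    Mass flows along out-edges; (1 - alpha) of the restart mass is
    re-injected each step. Higher stationary score = better candidate.
    """
    nodes = list(adj)
    score = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = alpha * score[n] / len(out)
            for m in out:
                new[m] += share
        score = new
    return score

# Toy two-layer graph: query-link nodes feeding candidate aliases.
adj = {"q1": ["alias1"], "q2": ["alias1", "alias2"],
       "alias1": [], "alias2": []}
score = random_walk(adj, {"q1": 0.5, "q2": 0.5})
```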
  • Review
    ZHAO Jiandong, GAO Guanglai, BAO Feilong
    2013, 27(5): 156-160.
    Research on Mongolian machine translation, syntactic analysis, and semantic analysis is restricted by the scarcity of research on Mongolian automatic part-of-speech (POS) tagging. In view of this, we propose a history-based Mongolian automatic POS tagging method that incorporates a lookahead mechanism into the decision-making process. Experimental results show that the POS tagging accuracy on Mongolian unknown words, known words, and all words is 71.276 6%, 99.148 2%, and 95.301 0%, respectively, demonstrating that our method is well suited to Mongolian automatic POS tagging.
    Key words: history models; learning with lookahead; Mongolian; automatic POS tagging
  • Review
    YU Hongzhi1, LI Yachao1, WANG Kun2, TASHI Lengben1
    2013, 27(5): 160-166.
    Tibetan part-of-speech (POS) tagging is an important problem for Tibetan natural language processing. Based on an analysis of Tibetan scripts and statistical results, this paper studies the fusion of morphological features for Tibetan POS tagging with a maximum entropy model and defines the feature templates. Experimental results show that Tibetan POS tagging with maximum entropy achieves much better results: syllable features increase performance significantly, obtaining an error reduction of 6.4% compared to the baseline.
    Key words: Tibetan; part of speech; maximum entropy; morphological features
  • Review
    HUA Quecairang1,3, JIANG Wenbing2, ZHAO Haixing1, LIU Qun2
    2013, 27(5): 166-173.
    Following dependency syntax theory, this paper presents Tibetan typed dependencies and their hierarchy, and analyzes problems in building a Tibetan dependency treebank. We propose a semi-automatic mode for constructing dependency trees, comprising a word-pair dependency classification model and a dependency edge annotation model with rich feature templates based on Tibetan grammar. We also implemented a visual tool used to build and proofread a treebank of 11 thousand sentences. Experimental results on the baseline system show that dependency recognition accuracy improves by 3%.
    Key words: Tibetan dependency syntax; word-pair dependency classification; Tibetan treebank; Tibetan dependency annotation tool
  • Review
    MI Chenggang1,2, YANG Yating1, ZHOU Xi1, LI Xiao1, YANG Mingzhong3
    2013, 27(5): 173-179.
    There are many out-of-vocabulary words in Uyghur-Chinese machine translation, a large part of which are loan words (including person names, place names, etc.). This paper presents a novel method for recognizing Chinese loan words in Uyghur based on the observation that a loan word is pronounced similarly to its original word. The method first trains on an existing corpus to obtain Uyghur Latinization rules for recognizing Chinese loan words; it then Latinizes Uyghur words according to these rules and Romanizes Chinese words, transforming sound similarity into string similarity, which is easy to quantify. Three models are proposed: a position-related minimum edit distance model, a weighted common subsequence model, and a parameterized fusion of the two. The experimental results show that the fusion model, which considers both global and local string similarity, obtains the best recognition results.
    Key words: loan words; out-of-vocabulary words; pronunciation similarity; string similarity
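The abstract does not specify the exact position weighting; the sketch below assumes a simple linear decay in which edits near the beginning of a word cost more, matching the intuition that transliterations tend to agree on initial sounds. All details of the weighting are assumptions, not the paper's model.

```python
def weighted_edit_distance(a, b, pos_weight=True):
    """Levenshtein DP where edits at earlier positions cost more.

    With pos_weight=False this reduces to plain edit distance with
    unit costs.
    """
    m, n = len(a), len(b)

    def cost(i):  # assumed linear decay: position 0 costs the most
        return 2.0 - i / max(m, n, 1) if pos_weight else 1.0

    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(i - 1)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(min(i, j) - 1)
            d[i][j] = min(d[i - 1][j] + cost(i - 1),      # deletion
                          d[i][j - 1] + cost(j - 1),      # insertion
                          d[i - 1][j - 1] + sub)          # substitution
    return d[m][n]
```

With this weighting, a mismatch at the start of a Latinized form is penalized more heavily than the same mismatch at the end.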
  • Review
    WANG Haibo1,3, ZU Yiqing2, LITIFU·Tuohuti3
    2013, 27(5): 179-184.
    As a typical agglutinative language, Uyghur has rich suffixes expressing syntax and mood. This paper contrasts two POS tagging methods in Uyghur language processing: one based on word stems, the other based on suffixes. We compute the number, frequency, and coverage of common functional suffix strings in a large corpus in order to judge the feasibility of suffix-string-based POS tagging. We define suffix POS tagging rules based on the theory of Prof. Litip Tohti and label a corpus according to this definition, which is useful not only for Uyghur but also for other Turkic languages with similar suffixes.
    Key words: Uyghur; suffix strings; POS tagging
  • Review
    SU Chen1, ZHANG Yujie1, GUO Zhen1, XU Jin’an1
    2013, 27(5): 184-191.
    In developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large-scale training corpora often decreases because of unknown words, and the lack of domain-specific annotated corpora prevents supervised learning approaches from adapting. This causes many errors in translation knowledge extraction and therefore seriously affects translation quality. To address this domain adaptation problem, we implement Chinese word segmentation exploiting n-gram statistical features of raw corpora and bilingually motivated segmentation information from parallel corpora, respectively. We further propose a lattice-based method to combine the multiple segmentation results and use a dynamic programming algorithm to obtain the best segmentation. For evaluation, we conducted Chinese word segmentation and Chinese-English machine translation experiments on the data of the NTCIR-10 Chinese-English patent task. The results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
    Key words: Chinese word segmentation; domain adaptation; bilingual motivation; lattice; machine translation
  • Review
    HU Yanan, SHU Jiagen, QIAN Longhua, ZHU Qiaoming
    2013, 27(5): 191-198.
    The scale of the training corpus plays an important role in machine-learning-based semantic relation extraction between named entities; however, corpus annotation is time-consuming and labor-intensive. So that a resource-rich language can help a resource-poor language in semantic relation extraction, we propose an approach that translates relation instances from the source language to the target language via machine translation and then adds them to the target language's training corpus by way of entity alignment. Experiments on the ACE 2005 Chinese and English corpora show that Chinese and English can help each other in relation extraction, and that this help is particularly significant when the target language's training corpus is small.
    Key words: cross-lingual relation extraction; machine translation; entity alignment
  • Review
    CHEN Lei, LI Miao, ZHANG Jian, ZENG Weihui
    2013, 27(5): 198-205.
    Reordering models are significant for reducing word order differences between language pairs in statistical machine translation. Most reordering approaches place high demands on the scale of the parallel corpus. Chinese minority language resources are scarce and difficult to grow substantially in a short time, so current reordering approaches perform poorly in translation between Chinese and minority languages. After analyzing related studies, this paper proposes a source-side reordering method based on a small parallel corpus. Drawing on linguistic knowledge, we analyzed both the corpus and the translations to obtain the verb phrases that evidently affect the word order of translations. We then studied reordering rules for these verb phrases, including manually written rules and automatically extracted rules. Experiments show that our method can improve the performance of state-of-the-art phrase-based translation models.
    Key words: statistical machine translation; reordering; verb phrase; small parallel corpus