2016 Volume 30 Issue 4 Published: 15 August 2016
  

  • Review
    YAN Weirong, XU Yang, ZHU Shanshan, HONG Yu, YAO Jianmin, ZHU Qiaoming
    2016, 30(4): 1-11.
    Research on discourse relations aims to infer the inter-sentential semantic relationships that hold within the same discourse. These relations play an important role in discourse content understanding and structure analysis, making them a research focus in the field of discourse analysis. In this paper, we introduce the background, annotation schemes and evaluation systems of this field on the basis of three corpora: the Rhetorical Structure Theory Discourse Treebank (RSTDT), the Penn Discourse Treebank (PDTB) and the HIT Chinese Discourse Treebank (HIT-CDTB). Finally, through an analysis of current work, we summarize the main difficulties and challenges in recognizing discourse relations, especially implicit ones.
  • Review
    PARK Minjun, LI Qiang, YUAN Yulin
    2016, 30(4): 12-20.
    The Bi-structure, which highlights a contrasting characteristic between two elements, is the key comparative sentence structure in Chinese. This structure consists of 7 types of semantic items (SUB, BI, OBJ, ITM, DIM, RES, EXT), which may occur in various sequential patterns. To provide meaningful information for the keyword extraction task of this comparative structure, this study first tags the 7 semantic items on about 460 sentences. Second, association rules and sequential patterns are extracted using the Apriori and PrefixSpan algorithms, from which 6 rules of item distribution are established. Finally, this paper illustrates the rationale behind these 6 rules, providing a better understanding of the linguistic characteristics for the feature selection task of the Bi-comparative structure in Chinese.
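The sequential-pattern step described in this abstract can be sketched roughly as follows. This is a minimal subsequence-counting illustration in the spirit of Apriori/PrefixSpan over a toy tag corpus; the function name, corpus, and support threshold are invented for the example, not taken from the paper.

```python
from itertools import combinations
from collections import Counter

def frequent_subsequences(sequences, min_support, max_len=2):
    """Count order-preserving (possibly non-contiguous) subsequence patterns
    up to max_len items, and keep those meeting the support threshold."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            for combo in combinations(seq, n):
                seen.add(combo)          # count each pattern once per sentence
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

# Toy corpus of comparative sentences tagged with the 7 semantic items
corpus = [
    ["SUB", "BI", "OBJ", "DIM", "RES"],
    ["SUB", "BI", "OBJ", "RES", "EXT"],
    ["SUB", "BI", "OBJ", "ITM", "RES"],
]
patterns = frequent_subsequences(corpus, min_support=3)
```

Because `combinations` preserves input order, a pattern like `("SUB", "BI")` is counted only when SUB actually precedes BI, which is what distinguishes sequential patterns from plain itemsets.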
  • Review
    CAI Dongfeng, ZHAO Qimeng, RAO Qi, WANG Peiyan
    2016, 30(4): 21-28.
    The main problems limiting the development of maximal-length noun phrase (MNP) recognition in Chinese patent literature are the lack of annotated corpora and the difficulty of distinguishing verbs from nouns. This paper presents a new Markov Logic approach to identifying maximal-length noun phrases in Chinese patents. Instead of recognizing various kinds of noun phrases, the approach focuses on identifying MNP boundary markers. To recognize MNPs in Chinese patents, three categories of features are employed: word features from sentences, transfer features from treebanks, and bilingual features from patent abstracts. The experimental results show that bilingual features bring a notable improvement in identifying MNP boundary markers such as verbs, prepositions and conjunctions, and the F-score on MNP identification reaches 83.27%.
  • Review
    ZHU Xinhua, MA Runcong, SUN Liu, CHEN Hongchao
    2016, 30(4): 29-36.
    A word semantic similarity computation method based on HowNet and CiLin is proposed in this paper. First, according to the characteristics of the sememe hierarchical structure, an edge weighting strategy following a monotonically decreasing curve with a flat top and a steep bottom is used for the HowNet part. For the CiLin part, the distance between words is taken as the main factor, with the branch node quantity and branch interval as fine-tuning parameters. Then, according to the distribution of the words, a dynamic weighting strategy that considers both HowNet and CiLin is used to calculate the final similarity, which greatly expands the computable range of words and improves the accuracy of word similarity computation.
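The two ideas in this abstract — a depth-dependent edge weight that is flat near the root and steep further down, and a dynamic combination of two resource scores — can be sketched as below. The logistic shape, the parameter values, and the fallback rule are illustrative assumptions, not the paper's exact formulas.

```python
import math

def depth_weight(depth, mid=4.0, steep=1.5):
    """Edge weight: monotonically decreasing in depth, nearly flat near the
    top of the sememe hierarchy and steep lower down (reversed logistic)."""
    return 1.0 / (1.0 + math.exp(steep * (depth - mid)))

def combined_similarity(sim_hownet, sim_cilin, w_hownet=0.6):
    """Dynamic weighting: average the two resources when both cover the word
    pair, otherwise fall back to whichever resource has a score."""
    if sim_hownet is None and sim_cilin is None:
        return 0.0
    if sim_hownet is None:
        return sim_cilin
    if sim_cilin is None:
        return sim_hownet
    return w_hownet * sim_hownet + (1 - w_hownet) * sim_cilin
```

The fallback branches are what "expands the computable range": a pair found in only one resource still gets a score instead of failing.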
  • Review
    JIANG Lixue, JI Duo, CAI Dongfeng
    2016, 30(4): 37-43.
    A Chinese term is composed of one or more words carrying certain semantic roles. Traditional statistics-based similarity calculation methods, which regard the term as the basic unit of similarity computation, ignore the semantic roles inside a term. This paper presents a method for computing the similarity of Chinese terms based on their internal semantic roles, i.e., calculating term similarity according to the semantic roles automatically assigned to the component words. Experiments show that the proposed method achieves better results than traditional methods.
  • Review
    TU Hanfei,LI Ru,WANG Zhiqiang,ZHOU Tiefeng
    2016, 30(4): 44-55.
    Frame element labeling still mainly adopts supervised machine learning methods, which rely on large-scale manually annotated examples as the training corpus. To reduce the cost of manual annotation, this paper presents an active learning approach, which selects the most uncertain samples for annotation instead of the whole training corpus. Experimental results show that with the same amount of training samples, active learning raises the F-value of frame element labeling by about 4.83 percentage points. In other words, to reach about the same labeling performance, we need to annotate only 70% of the samples required by the usual random selection method.
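The sample selection at the heart of such an active learning loop can be sketched as follows, using least-confidence uncertainty. The mock probability table stands in for a real frame-element classifier; all names here are illustrative, not from the paper.

```python
def least_confidence(probs):
    """Uncertainty of one sample: 1 minus the highest class probability."""
    return 1.0 - max(probs)

def select_uncertain(samples, predict_proba, k):
    """Pick the k samples the current model is least confident about,
    i.e. the ones most worth sending to a human annotator."""
    ranked = sorted(samples,
                    key=lambda s: least_confidence(predict_proba(s)),
                    reverse=True)
    return ranked[:k]

# Mock model output: sample id -> class probability distribution
pool = ["s1", "s2", "s3", "s4"]
fake_proba = {"s1": [0.90, 0.10], "s2": [0.55, 0.45],
              "s3": [0.60, 0.40], "s4": [0.99, 0.01]}
chosen = select_uncertain(pool, lambda s: fake_proba[s], k=2)
```

In a full loop, the chosen samples are annotated, added to the training set, and the model is retrained before the next selection round.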
  • Review
    HU Quan, XIE Fang, LI Yuan, LIU Yanshen
    2016, 30(4): 56-64.
    Relative words are markers in Chinese complex sentences, indicating the relationships between clauses. The collocation relationship of relative words refers to the co-occurrence of one or more relative words in a single complex sentence, which influences the semantic and hierarchical relationships of the clauses. This paper constructs a collocation network of 470 relative words in Chinese complex sentences based on complex network theory. We study the average path length, clustering coefficient, and degree distribution of the collocation network. These results can be applied to analyzing the collocation strength of relative words, which may help identify the hierarchical relationships and logical semantics of complex sentences automatically.
  • Review
    SHI Jie, ZHOU Lanjiang, XIAN Yantuan, YU Zhengtao,
    2016, 30(4): 65-70.
    Text similarity calculation is widely used in information retrieval, question answering systems, plagiarism detection, and so on. At present, most research addresses text similarity within the same language only, and cross-language text similarity calculation remains an open issue. This paper proposes a WordNet-based method for Chinese-Thai cross-language text similarity calculation. We apply the semantic dictionary WordNet to convert the Chinese text and the Thai text into an intermediate-layer representation, and compute the similarity between the Chinese and Thai texts in that intermediate layer. Experimental results show that the proposed method achieves an accuracy of 82% in computing the similarity between Chinese and Thai texts.
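The pivot-layer idea can be sketched as follows: words of both languages are mapped into shared concept identifiers and similarity is measured there. The toy lexicons with WordNet-style synset ids are hypothetical illustrations, and Jaccard overlap stands in for whatever similarity the paper actually uses.

```python
def to_concepts(words, lexicon):
    """Map surface words of either language to pivot concept ids."""
    ids = set()
    for w in words:
        ids.update(lexicon.get(w, ()))
    return ids

def pivot_similarity(words_a, lex_a, words_b, lex_b):
    """Jaccard overlap of the two texts in the shared concept layer."""
    ca, cb = to_concepts(words_a, lex_a), to_concepts(words_b, lex_b)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)

# Hypothetical toy lexicons: word -> WordNet-style synset ids
zh_lex = {"狗": {"dog.n.01"}, "跑": {"run.v.01"}}
th_lex = {"หมา": {"dog.n.01"}, "วิ่ง": {"run.v.01"}, "เร็ว": {"fast.a.01"}}
sim = pivot_similarity(["狗", "跑"], zh_lex, ["หมา", "วิ่ง", "เร็ว"], th_lex)
```

The key design point is that neither text is ever compared to the other directly; all comparison happens over language-neutral concept ids, so any language with a WordNet mapping can be plugged in.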
  • Review
    ZHAN Zhijian, YANG Xiaoping
    2016, 30(4): 71-80.
    Traditional text similarity measures produce erroneous results when applied to short texts, because most of them treat a text as a set of words. Given the very brief content of short texts, such methods ignore not only the semantic information of the words but also word order and grammatical information. This paper proposes a new semantic similarity measure for short texts based on complex networks. The approach first pre-processes the short text and then models it as a complex network. With the definition of short-text semantic similarity, the semantic information of the terms in a short text is resolved. Finally, K-Means clustering is used to evaluate the performance of the new measure. Compared with TF-IDF and another semantics-based method, the results show that the proposed measure improves the F-measure.
  • Review
    LI Sheng, KONG Fang, ZHOU Guodong
    2016, 30(4): 81-89.
    Recognizing implicit discourse relations is a challenging task in discourse parsing. In this paper, we propose a method for recognizing implicit discourse relations in the Penn Discourse Treebank (PDTB) using traditional features (e.g., verbs, polarity, production rules), and provide a systematic analysis of the method. We apply all labeled data to build multiple classifiers and use the adding rule to determine the final classification result for each instance. We also use forward feature selection to select an optimal feature subset for each classification task. Experimental results on the PDTB corpus show that the proposed method significantly improves the state-of-the-art performance of implicit discourse relation recognition.
  • Review
    LI Ying, KONG Fang
    2016, 30(4): 90-97.
    Interactive question answering texts are rich in linguistic phenomena. Taking advantage of this, a novel coreference resolution approach for interactive question answering texts is proposed. On the basis of shallow semantic role analysis, the discourse structure is identified, upon which the preferred center and the types of center shift are further determined. These form a new feature set related to centering theory and discourse structure. Experiments on the TREC2004 to TREC2007 corpora show that the proposed approach significantly improves the performance of coreference resolution for interactive question answering texts, by about 3.2% in F-measure.
  • Review
    CAI Zhijie, CAI Rangzhuoma
    2016, 30(4): 98-105.
    Research on the distribution of Tibetan character forms is a foundation of Tibetan natural language processing, providing a theoretical basis for character attribute analysis, input method design, sorting, speech synthesis, and studies of character information entropy. This paper classifies Tibetan character forms into single-element and combined-element characters, and further classifies combined-element characters by the structure and number of their components. It conducts a statistical analysis of glyph structures over 85 million Tibetan words in a 450 MB corpus and establishes the distribution statistics of Tibetan glyph structures.
  • Review
    WANG Haiyan, WANG Hongjun, XU Xiaoli
    2016, 30(4): 106-109.
    Naxi Dongba characters are pictographs even more primitive than the Oracle bone script. As a large number of ancient Naxi classical books need to be preserved and input into computer systems, an input method based on the topological characteristics of Dongba characters is designed for ordinary users. First, five basic topological features of 1,561 Naxi Dongba characters are counted and recorded: the numbers of blocks, holes, end points, three-connection points, and four-connection points. The method is then tested by a Java-based program combined with a TTF font file, which proves that it is feasible. Statistics show that more than 50% of Dongba pictographs can be identified uniquely through these five features, and more than 80% can be identified with no more than 4 remaining candidates. This provides a new way to input Naxi Dongba pictographs manually with high identification efficiency.
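The lookup behind such an input method can be sketched as an index from the five-count feature tuple to candidate glyphs. The glyph names and feature values below are hypothetical placeholders, not the paper's actual 1,561-character table.

```python
from collections import defaultdict

# Hypothetical feature tuples:
# (blocks, holes, end points, 3-connection points, 4-connection points)
glyph_features = {
    "glyph_sun":  (1, 1, 0, 0, 0),
    "glyph_man":  (1, 0, 4, 1, 0),
    "glyph_bird": (1, 0, 3, 1, 0),
    "glyph_eye":  (1, 1, 0, 0, 0),
}

# Invert the table: feature tuple -> all glyphs sharing those counts
index = defaultdict(list)
for glyph, feats in glyph_features.items():
    index[feats].append(glyph)

def candidates(blocks, holes, ends, joints3, joints4):
    """Return all glyphs matching the five topological counts the user typed."""
    return sorted(index[(blocks, holes, ends, joints3, joints4)])
```

When the tuple is unique the glyph is entered directly; when several glyphs share the counts (as `glyph_sun` and `glyph_eye` do here), the user picks from the short candidate list — matching the abstract's "no more than 4" observation.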
  • Review
    PAN Huashan, YAN Xin, ZHOU Feng, YU Zhengtao, GUO Jianyi
    2016, 30(4): 110-116.
    This paper presents a Khmer automatic word segmentation and POS tagging method based on a Cascaded Conditional Random Fields (CCRFs) model. The approach consists of three layers of Conditional Random Fields (CRFs) models: the first layer is a word segmentation model at the Khmer character cluster (KCC) granularity, integrating the word formation characteristics of Khmer into the feature template; the second layer is a word segmentation correction model at the word granularity, integrating the characteristics of Khmer named entities into the feature template; the third layer is a POS tagging model, integrating rich affix information into the feature template to achieve Khmer POS tagging. We experimented on an open corpus and obtained a final accuracy of 95.44%, indicating that the proposed method can effectively solve the Khmer word segmentation and POS tagging problems.
  • Review
    LI Maoxi, XU Fan, WANG Mingwen
    2016, 30(4): 117-123.
    Automatic evaluation of machine translation promotes the rapid development and application of machine translation, and how to automatically identify and match synonyms between the machine translation and the human reference translation is a key issue. We take the source-language sentence as a bridge, utilize an indirect hidden Markov model to align the machine translation with the reference translation, and match the synonyms between them, to improve the correlation between automatic metrics and human judgment. Experimental results on the LDC2006T04 corpus and WMT datasets show that both the system-level and sentence-level correlations of the proposed approach with human judgment not only consistently outperform the widely used automatic metrics BLEU, NIST and TER, but also outperform the METEOR metric, which makes use of word stem information and a thesaurus.
  • Review
    NI Yaoqun, XU Hongbo, CHENG Xueqi
    2016, 30(4): 124-133.
    The content of a Uyghur webpage news article is usually only partially comparable with that of its Chinese counterpart: the Uyghur sentence sequence may be shuffled or even partially missing in the Chinese text, which makes mining parallel sentences (i.e., sentence beads) from bilingual news difficult. First, to improve the word matching rate, person and location names in the Chinese text are extracted and translated into Uyghur to enhance the bilingual mapping. Then we scan the Chinese sentences with translations of the Uyghur words and calculate the translation rate via string matching, avoiding errors from Chinese word segmentation. The final similarity of a sentence pair is calculated by combining the word translation rate with numbers, punctuation, and sentence length as features. The similarities of all bilingual sentence pairs form a weight matrix, on which we use a greedy algorithm and the maximum-weight matching algorithm on a bipartite graph to find the most probable parallel sentence pairs. Our method achieves an accuracy of 95.67% in sentence alignment.
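The greedy variant of the matching step can be sketched as follows: treat the similarity matrix as weighted edges of a bipartite graph and repeatedly take the heaviest edge whose endpoints are both unused. The matrix values and threshold are toy illustrations, not the paper's data.

```python
def greedy_match(weights, threshold=0.0):
    """Greedily pair rows (e.g. Uyghur sentences) with columns (Chinese
    sentences) by descending similarity; each side is used at most once."""
    edges = sorted(((w, i, j) for i, row in enumerate(weights)
                    for j, w in enumerate(row) if w > threshold),
                   reverse=True)
    used_i, used_j, pairs = set(), set(), []
    for w, i, j in edges:
        if i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return sorted(pairs)

# Toy sentence-pair similarity matrix (rows: Uyghur, columns: Chinese)
sim_matrix = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.0],
    [0.1, 0.7, 0.6],
]
alignment = greedy_match(sim_matrix, threshold=0.5)
```

Note how row 2 is paired with column 2 (0.6) rather than its best column 1 (0.7), because column 1 was already taken: greedy matching is fast but not globally optimal, which is why the paper also considers exact maximum-weight bipartite matching.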
  • Review
    WANG Mingwen, HONG Huan, JIANG Aiwen, ZUO Jiali
    2016, 30(4): 134-141.
    In information retrieval modeling, determining the importance of the index terms of a document is a central issue. Retrieval models that use a bag-of-words document representation are mostly based on the term independence assumption and calculate term importance with functions of TF and IDF, without considering the relationships between terms. In this paper, we use a graph-of-word document representation to capture the dependencies between terms, and propose a novel graph-based retrieval model, TI-IDF. From the graph we obtain the co-occurrence matrix and the transition probability matrix of the terms, determine term importance (TI) using the Markov chain computation method, and use TI to replace the traditional term frequency at indexing time. The model is more robust; we compared it with traditional retrieval models on public international datasets. Experimental results show that the proposed model is consistently superior to BM25 and better than its extensions, TW-IDF, and other models in most cases.
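The Markov-chain computation of term importance can be sketched as follows: row-normalize the co-occurrence counts into transition probabilities and power-iterate to the stationary distribution. The tiny co-occurrence table is a toy illustration; the paper's exact graph construction and any damping details are not reproduced here.

```python
def term_importance(cooc, iters=50):
    """Stationary distribution of a random walk over the term co-occurrence
    graph, used as term importance (TI) in place of raw term frequency."""
    terms = sorted(cooc)
    # Row-normalise co-occurrence counts into transition probabilities
    trans = {}
    for t in terms:
        total = sum(cooc[t].values())
        trans[t] = {u: c / total for u, c in cooc[t].items()}
    rank = {t: 1.0 / len(terms) for t in terms}
    for _ in range(iters):
        nxt = {t: 0.0 for t in terms}
        for t in terms:
            for u, p in trans[t].items():
                nxt[u] += rank[t] * p
        rank = nxt
    return rank

# Toy symmetric co-occurrence counts between three terms of one document
cooc = {
    "retrieval": {"model": 3, "term": 1},
    "model":     {"retrieval": 3, "term": 2},
    "term":      {"retrieval": 1, "model": 2},
}
ti = term_importance(cooc)
```

On a symmetric graph like this, the stationary distribution is proportional to each term's weighted degree, so "model" (total weight 5) ends up more important than "retrieval" (4) and "term" (3) — importance now reflects connectivity, not raw frequency.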
  • Review
    JIA Zhen, YE Zhonglin, YIN Hongfeng, HE Dake
    2016, 30(4): 142-149.
    Weakly supervised relation extraction uses entity pairs to obtain training data from texts automatically, which can effectively address the lack of training data. However, weakly supervised training data suffer from problems such as noise, inadequate features, and sample imbalance, leading to low relation extraction performance. In this paper, a weakly supervised relation extraction algorithm named NF-Tri-training (Tri-training with Noise Filtering) is proposed. NF-Tri-training employs under-sampling to address sample imbalance, learns new samples iteratively from unlabeled data, and uses a data editing technique to identify and discard possibly mislabeled samples both in the initial training data and among the new samples generated at each iteration. Experiments on a Hudong encyclopedia dataset indicate that the proposed method improves the performance of the relation classifiers.
  • Review
    XI Yahui
    2016, 30(4): 150-158.
    With the great development of e-commerce, product review mining has recently received a lot of attention. In product reviews, people often use different words and phrases to describe the same product feature, and recognizing these as synonyms is necessary for effective opinion summarization. In this paper, we first calculate the similarity of product features. Then must-link and cannot-link constraints are extracted from an analysis of the product reviews. Finally, a constrained hierarchical clustering algorithm with the extracted constraints is applied to recognize product feature synonyms. Experiments on diverse real-life datasets show promising results.
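A constrained agglomerative clustering of this kind can be sketched as below: must-link pairs are merged up front, and a merge is skipped whenever it would put a cannot-link pair in the same cluster. The average-link criterion, the toy feature set, and the similarity values are assumptions for illustration, not the paper's data.

```python
def constrained_clustering(items, sim, must_link, cannot_link, k):
    """Average-link agglomerative clustering honouring must-link /
    cannot-link constraints; stops at k clusters or when stuck."""
    clusters = [{x} for x in items]

    def find(x):
        return next(c for c in clusters if x in c)

    # Enforce must-link constraints up front
    for a, b in must_link:
        ca, cb = find(a), find(b)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb

    def violates(ca, cb):
        return any((a in ca and b in cb) or (a in cb and b in ca)
                   for a, b in cannot_link)

    def avg_sim(ca, cb):
        return sum(sim[a][b] for a in ca for b in cb) / (len(ca) * len(cb))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if violates(clusters[i], clusters[j]):
                    continue
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best is None:
            break          # every remaining merge would break a constraint
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy product features with pairwise similarities
feats = ["screen", "display", "battery", "power"]
S = {f: {} for f in feats}
def set_sim(a, b, v):
    S[a][b] = S[b][a] = v
set_sim("screen", "display", 0.9)
set_sim("battery", "power", 0.8)
for a in ("screen", "display"):
    for b in ("battery", "power"):
        set_sim(a, b, 0.1)
groups = constrained_clustering(feats, S, [("screen", "display")],
                                [("screen", "battery")], k=2)
```

Here "screen"/"display" are forced together and "screen"/"battery" are kept apart, yielding the two synonym groups one would expect from the similarities alone.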
  • Review
    DING Shengchun, WU Jingchanyuan, LI Xiao
    2016, 30(4): 159-166.
    Fine-grained sentiment analysis of microblogs is very important, and extracting opinion targets from opinion sentences is its key issue. To improve the performance of opinion target extraction, this paper selects features from words, parts of speech, sentiment words, and an ontology, based on the characteristics of Chinese microblogs and a constructed microblog comment ontology, and then uses a CRFs model to extract the opinion targets. We applied the proposed method to Task 5 of COAE2014: the accuracy of opinion target extraction is 61.20%, ranking first among all participating teams. The experimental results show that introducing the ontology into the CRFs model can effectively improve the accuracy of opinion target extraction.
  • Review
    LIU Wei, LIU Feijing, WANG Dong, LIU Zongtian
    2016, 30(4): 167-175.
    The extraction of event elements is a challenge in event-based information extraction. Current solutions are mainly based on machine learning, which is subject to corpus sparsity. This paper proposes an event element extraction method based on an event ontology. The reasoning process includes two steps: first, element values are initially filled in according to the positional relations between event element words and event indicator words, and the event with the highest confidence is selected as the seed event; second, the event-class restrictions and non-taxonomic relations of the seed events are retrieved from the event ontology to complement and revise the event elements. The experimental results show that this method can improve the accuracy of event element extraction.
  • Review
    CHEN Yubo, HE Shizhu, LIU Kang, ZHAO Jun, LV Xueqiang
    2016, 30(4): 176-183.
    Entity linking is an important method of entity disambiguation, which aims to map an entity mention to an entry stored in an existing knowledge base. Several methods have been proposed for this problem, most of which are based on co-occurrence statistics and do not capture the various semantic relations. In this paper, we make use of multiple features and propose a learning-to-rank algorithm for entity linking. It effectively utilizes the relational information among the candidates and saves much time and effort. The experimental results on the TAC KBP 2009 dataset demonstrate the effectiveness of the proposed features and framework, with an accuracy of 84.38%, exceeding the best result of TAC KBP 2009 by 2.21%.
  • Review
    CHEN Tieming, MIAO Ruyi, WANG Xiaohao
    2016, 30(4): 184-192.
    Micro-blog sentiment analysis is a key technique of public opinion research on social networks. Micro-blog emoticons and sentiment words are intuitive and are called explicit emotion features, while the content semantics, called implicit features, are sometimes very important for micro-blog emotion discrimination. Therefore, this paper proposes a new systematic methodology for sentiment analysis using both explicit and implicit emotion features. First, the sentiment analysis dictionary, the glossary of social networking terms, and the emoticon library are initialized. Then the frequent word sets of the micro-blog texts are defined, and according to the word feature sets, initial micro-blog clusters are generated directly from the maximal frequent item sets. Furthermore, to solve the problem of micro-blogs overlapping between multiple initial clusters, an efficient elimination method is proposed that employs the extended membership degree of the short-message semantics. Finally, a semantic similarity matrix is defined for each separated cluster, based on which hierarchical sentiment clustering of the micro-blogs is conducted. Taking the well-known Chinese evaluation NLPCC2013 as an instance, comparative experiments demonstrate the efficiency of the proposed method. A real-world case study also shows the emotion change in Chinese micro-blogs concerning the Malaysia Airlines disappearance incident from March 8 to April 8, 2014.
  • Review
    LIU Dexi, NIE Jianyun, ZHANG Jing, LIU Xiaohua, WAN Changxuan, LIAO Guoqiong
    2016, 30(4): 193-205.
    Sentiment analysis heavily relies on resources such as sentiment dictionaries. However, it is difficult to manually build such resources with satisfactory coverage. A promising avenue is to automatically extract sentiment lexicons from microblog data. In this paper, we target the problem of identifying new sentiment words in a Chinese microblog collection provided at COAE 2014. We observe that traditional co-occurrence measures, such as pointwise mutual information, are not effective at determining new sentiment words. Therefore, we propose a group of context-based N-gram features for classification, which capture the lexical surroundings and lexical patterns of sentiment words. A classifier trained on the known sentiment words is then employed to classify the candidate words. We show that this method works better than the traditional approaches. In addition, we observe that, unlike in English, many sentiment words in Chinese are nouns, which cannot be discriminated using co-occurrence-based measures but can be better determined by our classification method.
  • Review
    LIU Peiyu, XUN Jing, FEI Shaodong, ZHU Zhenfang
    2016, 30(4): 206-212.
    Current subjective/objective text classification methods are mainly based on statistical models over a feature lexicon, which do not take into account the syntactic and semantic relationships between features. This paper proposes a Chinese subjective sentence recognition method based on the Hidden Markov Model. In this method, seven kinds of subjective and objective features are extracted and tagged in each sentence by the HMM, and subjective sentences are determined by the importance of the features and the syntactic structure of the sentences. The method is examined on the COAE2014 task, demonstrating its effectiveness.
  • Review
    WU Yuexin, ZHAO Xin, GUO Yanwei, YAN Hongfei
    2016, 30(4): 213-223.
    User churn prediction is a research focus in many fields. The main current approach is classification-based, predicting whether users will churn via a two-class classification process. This paper presents an approach to online game user churn prediction based on two-class classification. We summarize several important features for the problem of online game user churn prediction. Furthermore, noticing that churned users are relatively rare, we introduce imbalanced learning methods into our work, with a focus on sampling methods. We conducted experiments on the major sampling methods and analyzed the results.
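The simplest of the sampling methods mentioned above, random under-sampling of the majority class, can be sketched as follows. The user ids and class sizes are toy illustrations, not the paper's data.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority-class samples as there are
    minority-class samples, yielding a balanced training set."""
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    return rng.sample(majority, len(minority)) + list(minority)

# Toy churn data: staying users vastly outnumber churned ones
stayed = [f"user{i}" for i in range(100)]   # majority class (label 0)
churned = ["u_a", "u_b", "u_c"]             # rare minority class (label 1)
balanced = undersample(stayed, churned)
```

After balancing, a standard two-class classifier no longer achieves high accuracy by simply predicting "stays" for everyone, which is the failure mode imbalanced learning methods are meant to avoid.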