Journal of Chinese Information Processing

Language Analysis and Calculation

Initial Seed Set Automatic Generation Algorithm for Domain Knowledge Graph Based on Thesaurus

HAN Qichen, ZHAO Yawei, YAO Zheng, FU Lijun

2018, 32(8): 1-8.

With the rapid development of cognitive computing, the automatic construction of general knowledge graphs has made great progress. Progress in vertical domains has been slower, however, because semantic information such as ontologies is lacking. Thesauri, by contrast, are widespread across domains and carry abundant semantic information. If this information can be extracted and used appropriately, the automatic construction of domain knowledge graphs can be improved. In this paper, we propose two hypotheses that allow entity types and relationship types to be extracted from the internal structure of a thesaurus. We then design a thesaurus-based algorithm that automatically generates an initial seed set for a domain knowledge graph. Finally, the initial seed sets generated from geology and forestry thesauri are used as input to a Bootstrapping algorithm for extraction. Experimental results demonstrate that the initial seed sets obtained from the thesauri are close to manually designed seed sets. In addition, the proposed model is generally applicable and offers a new way of using thesauri in domain knowledge graph construction.

Language Analysis and Calculation

Word Sense Disambiguation Based on Context Similarity with POS Tagging

MENG Yuguang, ZHOU Qiaoli, ZHANG Guiping, CAI Dongfeng

2018, 32(8): 9-18.

While learning embeddings, context2vec represents words with different parts of speech as a single point, without considering that they often have different meanings. To avoid the resulting low-quality context vectors and context similarity, we present a context2vec model with POS features that differentiates the meanings represented by one point in the vector space. Experiments show that the accuracy of word sense disambiguation reaches 75.3% on Senseval-3 (2004), and that the model outperforms the baselines on SemEval-13 and SemEval-15.

Language Resources Construction

Chinese Nested Named Entity Recognition Corpus Construction

LI Yanqun, HE Yunqi, QIAN Longhua, ZHOU Guodong

2018, 32(8): 19-26.

Nested named entities contain rich entities and the semantic relations between them, which helps improve the effectiveness of information extraction. Because uniform, standard Chinese nested named entity corpora are lacking, it is currently difficult to compare research on Chinese nested named entities. Building on existing named entity corpora, this paper proposes a semi-automatic method for constructing two Chinese nested named entity corpora. First, we use the annotation information in the Chinese named entity corpora to automatically construct as many nested named entities as possible; we then manually adjust them to meet our annotation requirements for Chinese nested entities, yielding high-quality corpora. Preliminary experiments on nested named entity recognition, both within and across the corpora, show that Chinese nested named entity recognition remains a difficult problem that requires further research.

Language Resources Construction

Noun Phrase Alignment in the Korean-Chinese Bilingual Corpus Based on Statistics and Lexicon

LING Tianbin, BI Yude

2018, 32(8): 27-31.

Phrase alignment in a bilingual corpus is of great significance to example-based Korean-Chinese machine translation systems. This paper first studies the structural features of Korean noun phrases, then experimentally analyzes statistics-based and lexicon-based word alignment methods, and, based on the results, puts forward a method for noun phrase alignment in a Korean-Chinese bilingual corpus. The approach uses statistics to obtain word alignment positions, which are then corrected using lexicon-based similarity calculation. Korean noun phrases and their Chinese translations are then extracted using rules for the left and right boundaries of Korean noun phrases, and a correlation measure is applied to filter the noun phrases and align them. Experiments show that the proposed method achieves satisfactory phrase alignment results on a large-scale corpus.

Language Resources Construction

Automatically Building a Large Scale Dictionary of Chinese Entity Sentiment Expressions

LU Qi, CHEN Wenliang

2018, 32(8): 32-41.

Apart from a few sentiment dictionaries, there are no resources of sentiment expressions for entities, which are very important for analysis. This paper proposes a method for automatically building a dictionary of entity sentiment expressions from large-scale raw text. In our method, we use a sorting algorithm based on a bipartite graph to rank the candidate sentiment expressions. We then present a refining algorithm based on semantic similarity to recover some expressions from the low-ranked set. Finally, we conduct experiments on three datasets from different domains. The experimental results show that the accuracy of the extracted expressions is better than 90%. In total, we obtain a large-scale dictionary of about 300K sentiment expressions.

Machine Translation

Data Generalization and Phrase Generation Methods in Neural Machine Translation

LI Qiang, HAN Yaqian, XIAO Tong, ZHU Jingbo

2018, 32(8): 42-52.

This paper studies data generalization and phrase generation methods in neural machine translation. A data generalization method based on the subword method is proposed to address the out-of-vocabulary and low-frequency word problems. Parallel consistency checking and decoding optimization methods are proposed to support this generalization method. Because standard neural machine translation is word-based, a phrase generation method is further proposed, and the generated phrases are incorporated into our neural machine translation systems to improve translation performance. Experiments show significant improvements of 1.3 and 1.2 BLEU points on Chinese-to-English and English-to-Chinese translation tasks, respectively.

Machine Translation

The Influence of Different Use of Training Corpus on Neural Machine Translation Model

KUANG Shaohui, XIONG Deyi

2018, 32(8): 53-59,67.

Neural machine translation (NMT) is an emerging end-to-end machine translation paradigm. In NMT, stochastic gradient descent (SGD) is used to update the model parameters. This paper explores how the batch size, dropout and data shuffling in SGD influence an NMT system. The results show that the batch size affects the convergence speed of the NMT model, the dropout hyperparameter has a large impact on the performance of the NMT model, and data shuffling can improve the translation quality of the NMT system.
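
As a rough illustration of two of the factors named in this abstract, batch size and data shuffling, the sketch below shows one common way to draw shuffled mini-batches for an SGD epoch. It is not taken from the paper: the function name `minibatches` and its parameters are hypothetical, and dropout and the NMT model itself are left out.

```python
import random

def minibatches(examples, batch_size, shuffle=True, seed=None):
    """Yield lists of examples, each at most batch_size long.

    Hypothetical helper for illustration only; not the authors' code.
    """
    order = list(range(len(examples)))
    if shuffle:
        # Reorder the indices so each epoch can see the data in a new order.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

# Usage: one pass over ten toy sentence pairs in batches of four.
pairs = [(f"src-{i}", f"tgt-{i}") for i in range(10)]
for batch in minibatches(pairs, batch_size=4, seed=0):
    pass  # an SGD parameter update on `batch` would go here
```

Calling the generator once per epoch with a different seed gives a different batch composition each time, which is the effect the shuffling experiments vary.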

Machine Translation

Mongolian-Chinese Neural Machine Translation Based on RNN and CNN

BAO Wugedele, ZHAO Xiaobing

2018, 32(8): 60-67.

This paper discusses Mongolian-Chinese neural machine translation (NMT) models based on RNN and CNN. A Mongolian word model, a segmentation model and a subword model are used as inputs to the translation system. We compare our method with traditional phrase-based statistical machine translation (SMT). Experimental results show that the subword model can effectively improve the quality of NMT, and that the RNN-based Mongolian-Chinese NMT model surpasses the traditional phrase-based SMT model.

Machine Translation

Mongolian-Chinese Machine Translation Research Based on Part of Speech Tagging with Gated Unit Neural Network

LIU Wanwan, SU Yila, Wunier, Renqingdaoerji

2018, 32(8): 68-74.

Statistical machine translation can predict a relatively accurate target word using statistical analysis, but it cannot produce a much better translation because it does not fully capture the semantic relations of the source text. To address this problem, a Mongolian-Chinese machine translation model is constructed using a gated-unit recurrent neural network structure, and a global attention mechanism is introduced to obtain bilingual alignment information. During dictionary construction, the bilingual words are annotated to strengthen the semantics, alleviating the problem caused by erroneous training. The results show that the BLEU score improves over previous benchmark research and the traditional statistical machine translation method.

Ethnic Language and Cross Language Information Processing

Auxiliary Feature Extraction for Kazakh Syntactic Parsing

CHEN Xue, Gulila Altenbek

2018, 32(8): 75-79,90.

In Kazakh syntactic parsing, we use the averaged perceptron algorithm to train the parsing model and the beam search algorithm to decode the sentence structure. During parsing, we construct a bidirectional LSTM model to extract structural information about each word in the sentence and predict its syntactic role. We then apply this information as a lookahead feature in the parsing process. Experiments show that this method improves precision and recall.

Ethnic Language and Cross Language Information Processing

Uyghur Text Sentiment Classification Based on Bi-tagged Features

Raxida Turhuntay, Wushour Slamu

2018, 32(8): 80-90.

Current Uyghur text sentiment classification methods use unigram features obtained by splitting on spaces as the text representation, and cannot capture the deeper language phenomena related to emotional expression. Based on the word order dependence of the Uyghur language, this paper summarizes several rules, extracts Bi-tagged features that can express rich emotional information, and classifies Uyghur sentiment corpora with a support vector machine (SVM) classifier. The results indicate that, in Uyghur text sentiment classification: (1) the Bi-tagged features perform best when they include all the part-of-speech rules presented in this paper; (2) the Bi-tagged features are able to extract rich emotional information as well as negative information; (3) compared with unigram features, bigram features and their combination on the datasets in this paper, the combination of Bi-tagged and unigram features leads to improved performance. The classification accuracy is 4.225% higher than the baseline accuracy used in this paper. Our results therefore further improve the efficiency of Uyghur text sentiment classification. In addition, the methods presented in this paper can serve as a reference for sentiment classification in other closely related languages such as Kazakh and Kirgiz.

Information Extraction and Text Mining

Text Feature Selection Based on Inclusion Degree and Frequent Pattern

CHI Yunxian, ZHAO Shuliang, LI Renjie

2018, 32(8): 91-102.

In the big data era, text information grows too fast to handle. Finding text features is one of the key issues in text mining. Because of the large number of words and patterns, ensuring the quality of features mined from texts is a great challenge. Pattern-based methods have many advantages that term-based methods lack: they can remove noise effectively and improve text mining performance. This paper proposes the algorithm Text Feature Selection Based on Inclusion Degree and Frequent Pattern (TFSIDFP). First, a similarity measure for frequent patterns based on inclusion degree is defined. Second, the algorithm Filtration of Redundancy for Frequent Patterns based on Inclusion Degree Theory (FRFPIDT) is put forward; it measures the similarity of frequent patterns based on inclusion degree and removes subpatterns and cross-patterns with a high degree of similarity. Cutting out redundant patterns improves the performance of frequent pattern mining. Finally, a feature weighting model is put forward, in which features are selected based on the non-redundant frequent patterns processed by algorithm TFSIDFP. The correlation between features and documents is taken into account in feature weighting, so the degree of correlation between them is higher and the classification results are better. Experimental results on data sets from Reuters-21578 indicate that TFSIDFP is superior to widely used feature selection and feature extraction methods.
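
The abstract does not give the inclusion degree formula, so the following is only a minimal sketch under an assumed definition: the inclusion degree of pattern A in pattern B is taken to be |A ∩ B| / |A|, and a pattern is dropped when it is mostly contained in an already kept, longer pattern. The names `inclusion_degree` and `filter_redundant` and the threshold value are hypothetical and are not the paper's FRFPIDT algorithm.

```python
def inclusion_degree(a, b):
    """Assumed definition: fraction of the items of pattern a found in pattern b."""
    a, b = frozenset(a), frozenset(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

def filter_redundant(patterns, threshold=0.8):
    """Keep longer patterns first; drop a pattern that is largely
    included in one already kept (illustrative threshold)."""
    kept = []
    for p in sorted(map(frozenset, patterns), key=len, reverse=True):
        if all(inclusion_degree(p, q) < threshold for q in kept):
            kept.append(p)
    return kept

# Usage: the subpattern {"oil", "price"} is filtered out because it is
# fully included in the longer pattern {"oil", "price", "rise"}.
patterns = [{"oil", "price"}, {"oil", "price", "rise"}, {"bank", "rate"}]
non_redundant = filter_redundant(patterns)
```

Lowering the threshold also removes partially overlapping patterns, which is one way to read the removal of cross-patterns mentioned in the abstract.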

Sentiment Analysis and Social Computing

Process-crash Detection by Analogy with Social Network

CHENG Ziqiang, HUANG Rong, YANG Yang

2018, 32(8): 103-110.

The various networks surrounding us can be divided into physical networks and information networks according to their internal mechanisms. For networks with obvious physical characteristics, we can use basic physical knowledge to explain the nature of their internal structure or nodes; for information networks, we may need to draw on some prior knowledge, and social networks are one such example. However, there are no clear ways of analyzing networks without a significant physical or social background. In this paper, we explore an approach analogous to social network analysis to analyze the process network on the China Telecom CSB cluster; specifically, to predict process crashes on the cluster. This approach yields credible results on this particular dataset. According to our research, running information such as CPU and memory load, communication between processes, and the structural features of the process network are valuable in predicting the states of processes and ports; furthermore, changes in these features over time reflect the states of processes or ports.

Sentiment Analysis and Social Computing

Research on Community Detection from Complex Weighted Network

TAN Hongye, WU Yongke, ZHANG Hu, LIU Quanming, LI Ru

2018, 32(8): 111-119.

The connection strength between nodes in a complex network can largely affect the network's community structure. It is therefore of great significance to use weights to describe differences in connection strength and to apply them to community discovery research. To this end, this paper proposes an improved method for measuring the correlation degree of nodes based on the direct link weights of nodes and the edge weights of common neighbor nodes. Furthermore, we construct a community discovery model based on the improved measure of the correlation degree between nodes and an aggregation method between groups. Experiments are performed on a weighted network of scientists and the national train network, and the results show the effectiveness of this method.

Sentiment Analysis and Social Computing

A Study on Deployment Strategy of Efficient Observers for Locating Spreading Source

LIU Dong, ZHAO Jing, NIE Hao

2018, 32(8): 120-127,142.

The spread of a rumor or a disease can be modeled as propagation in a network. To estimate the spreading source in a network, a partial set of observers is necessary. To select effective observers, this work analyzes the influence of deployment strategies, including random, degree-based, clustering-based, eigenvector-based, closeness-based and betweenness-based methods. The experiments simulate three kinds of synthetic networks and four real networks using the SI propagation model and a reverse greedy algorithm. The results show that the eigenvector deployment strategy contributes most to the accuracy of estimating the spreading source.
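
To make the idea of a deployment strategy concrete, the sketch below shows the simplest listed variant, degree-based deployment, under the assumption that it means placing observers on the k nodes with the most neighbors in an undirected network. The function name `degree_based_observers` is hypothetical; the paper's exact procedure, the SI simulation and the reverse greedy algorithm are not reproduced here.

```python
from collections import defaultdict

def degree_based_observers(edges, k):
    """Return the k highest-degree nodes of an undirected edge list.

    Illustrative only; ties are broken by node label for determinism.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    return ranked[:k]

# Usage: node "a" has degree 3, "b" and "c" have degree 2, "d" has degree 1.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
observers = degree_based_observers(edges, k=2)
```

The other strategies in the abstract differ only in the score used for ranking (for example eigenvector or betweenness centrality in place of degree).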

Sentiment Analysis and Social Computing

Sentiment Analysis Based on Collaborative Filter Attention Mechanism

ZHAO Dongmei, LI Ya, TAO Jianhua, GU Mingliang

2018, 32(8): 128-134.

This paper investigates the influence of user personality and product information on the emotion category of review data. Among the many factors that affect the emotion category, the subject of the evaluation (the user) and the object of the evaluation are particularly important to the emotion of a review. In this paper, an emotion analysis model (LSTM-CFA) based on a collaborative filtering (CF) attention mechanism is proposed. The user interest distribution matrix is calculated using the CF algorithm. After the matrix is decomposed with SVD, it is added to the hierarchical LSTM model as an attention mechanism to perform emotion classification. Experiments show that the LSTM-CFA model can efficiently extract user personality and product attribute information, improving the accuracy of emotion classification.

Sentiment Analysis and Social Computing

Binary Affective Cognitive Model for Product Reviews

CHEN Fang, WANG Ke, LIANG Shuang, HUANG Yongfeng

2018, 32(8): 135-142.

This paper proposes a binary affective cognitive model for product reviews, which consists of three main modules: a binary affective commonsense knowledge base, an evaluation system knowledge base and a sentiment analysis engine. The model has the following characteristics: (1) It can learn prior knowledge from large-scale reviews and save it in the knowledge bases. These knowledge bases make it easier to revise and reuse knowledge, which embodies the cognitive ability of the model. (2) The model is able to reveal explicit opinions and infer high-level sentiments. This paper gives the algorithms for constructing the binary affective commonsense knowledge base and the evaluation system knowledge base, and introduces the application of the sentiment analysis engine in opinion mining and sentiment inference. An experiment on a product review corpus verifies the validity of the model.

2018 Volume 32 Issue 8 Published: 15 August 2018