Journal of Chinese Information Processing

Survey

A Survey of Central Asian Language Processing

Tuergun Ibrahim, Kahaerjiang Abiderexiti, Aishan Wumaier, Maihemuti Maimaiti

2018, 32(5): 1-13,21.

This paper reviews progress in natural language processing for Turkish, Kazakh, and other Central Asian languages that belong to the same language family. First, morphological analysis, syntactic analysis, named entity recognition, and machine translation are reviewed. Then language-independent methods for morphological analysis of agglutinative languages are discussed. Finally, the problems and challenges of Central Asian language processing at home and abroad are summarized, and directions for future study are suggested.

Language Analysis and Calculation

Relation Structure Taxonomy and Annotation of Interactive Question Answering

ZHOU Xiaoqiang, WANG Xiaolong, CHEN Qingcai

2018, 32(5): 14-21.

Interactive question answering (iQA) is a form of information exchange that is conversational, continuous, and correlated. The relation structure of iQA reflects the contextual association of the interactive scenario at different linguistic levels. This paper summarizes the dialogue acts and sentence relations of iQA, and presents a relation structure taxonomy based on the results of the analysis. To verify the rationality of the taxonomy, we conduct annotation experiments on an iQA corpus collected from the real world, covering dialogue acts and contextual sentence relations. We use a Hidden Markov Model (HMM) to analyze the transition rules of dialogue acts, and characterize the sentence relation structure through statistical analysis.
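As a rough illustration of what analyzing dialogue-act transitions can look like, the sketch below estimates first-order transition probabilities from annotated act sequences. This is only the transition-matrix part of an HMM, not the authors' model; the act labels and the toy dialogues are invented for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next_act | current_act) by counting adjacent act pairs."""
    counts = defaultdict(Counter)
    for acts in sequences:
        # Pair every act with its successor inside the same dialogue.
        for current, nxt in zip(acts, acts[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[current] = {nxt: n / total for nxt, n in nxt_counts.items()}
    return probs


# Toy, made-up annotations (hypothetical act labels).
dialogues = [
    ["question", "answer", "question", "answer"],
    ["question", "clarification", "answer"],
]
probs = transition_probabilities(dialogues)
for act, row in probs.items():
    print(act, row)
```

A full HMM would additionally model emission probabilities from hidden acts to observed utterances; only the transition counts are shown here.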

Language Analysis and Calculation

An Entity-Action Relationship Model for Text Clustering

LIU Zuoguo, CHEN Xiaorong

2018, 32(5): 22-30.

This paper presents an Entity-Action Relationship Model (EARM) for text clustering, designed to describe Chinese semantic entities and behaviors. Since Chinese is a non-inflectional language, there is no simple one-to-one relationship between word properties and syntax elements at the surface level. A syntax element recognition mechanism is designed to recognize entities and actions according to word properties and positional characteristics. EARM is then built according to sentence patterns so as to describe the behaviors and states of entities. Complex sentences, such as nested sentences, are decomposed by action layer and simplified into simple sentences during syntax analysis so that the Entity-Action Relationship can be mined. For omission and inversion in the syntax, a recognition mechanism is designed to move entities and reorder sentences by matching inverted sentences against similar sentence patterns. Maximum common sub-graphs of syntax trees are introduced to calculate text similarity and perform clustering. Finally, the experiment shows that EARM is accurate and effective and that the clustering result is reasonable.

Language Resources Construction

Automatic Conversion of Phrase Structure Treebank to Sentence Structure Treebank

ZHANG Yinbing, SONG Jihua, PENG Weiming, ZHAO Yawei, SONG Tianbao

2018, 32(5): 31-41.

This paper proposes a method for converting a phrase structure treebank into a sentence structure treebank. Taking the Tsinghua Constituent Treebank (TCT) as the test corpus, we carry out the conversion between the two structures and display both with a scalable visualization system. This study enlarges the Chinese sentence structure treebank, which will support follow-up research on it.

Machine Translation

Mongolian-Chinese Machine Translation Based on CNN Etyma Morphological Selection Model

Wunier, Suyila, LIU Wanwan, REN qingdaoerji

2018, 32(5): 42-48.

Machine translation based on recurrent neural networks is gradually replacing statistical machine translation, especially between the major languages of the world. Given the shortage of Mongolian corpora, a Mongolian-Chinese machine translation method based on a Convolutional Neural Network (CNN) is proposed. While encoding the source corpus, the pooling layer lets the CNN capture the semantic relations and information in the sentence according to the characteristics of Mongolian word formation. The experimental results show that the method outperforms RNN-based NMT in both translation quality and training speed.

Ethnic Language and Cross Language Information Processing

The Acoustic Model for Tibetan Speech Recognition Based on Recurrent Neural Network

HUANG Xiaohui, LI Jing

2018, 32(5): 49-55.

The recurrent neural network and the connectionist temporal classification algorithm are applied to the acoustic modeling of Tibetan speech recognition, so as to achieve end-to-end model training. According to the relationship between the input and output of the acoustic model, a time-domain convolution over the output sequence of the hidden layer is introduced to reduce the time-domain expansion of the network's hidden layers. Experimental results show that the recurrent neural network model achieves better performance in Tibetan Lhasa phoneme recognition than traditional acoustic models based on the Hidden Markov Model, and that the recurrent acoustic model with time-domain convolution trains and decodes more efficiently while maintaining the same recognition performance.

Ethnic Language and Cross Language Information Processing

Zero Pronoun Resolution of Uyghur Based on Stacked Denoising Autoencoder and Word Embedding

QIN Yue, YU Long, TIAN Shengwei, FENG Guanjun, Turgun Ibrahim, Askar Hamdulla, ZHAO Jianguo

2018, 32(5): 56-64.

Using deep learning, this paper applies a Stacked Denoising Autoencoder (SDAE) to Uyghur zero pronoun resolution. First, word embeddings trained on a large-scale unlabeled Uyghur corpus are used as semantic features of candidate antecedents and zero pronouns. Second, according to the characteristics of Uyghur, we extract 14 hand-crafted features for zero pronoun resolution. Experimental results show that, compared with a Stacked Autoencoder (SAE), SVM, and ANN, the F value of SDAE is higher by 4.450%, 10.032%, and 8.140%, respectively.

Ethnic Language and Cross Language Information Processing

Identifying Accompanying Relationship between Uyghur Events Based on Deep Belief Network

HU Wei, YU Long, TIAN Shengwei, Turgun Ibrahim, FENG Guanjun, Askar Hamdulla

2018, 32(5): 65-73.

The accompanying relationship between events is common in the Uyghur language. This paper proposes a method based on a Deep Belief Network (DBN) to identify the accompanying relationship between Uyghur events. According to the characteristics of the Uyghur language, the paper extracts 12 features based on event structure information. It also applies word embeddings to calculate the semantic similarity between the two trigger words. The experiments show that the precision, recall, and F value of the proposed method reach 81.89%, 84.32%, and 82.48%, respectively, outperforming a Support Vector Machine (SVM).

Information Extraction and Text Mining

Text Fingerprint Extraction Based on Latent Semantic Analysis

CUI Tongtong, CUI Rongyi

2018, 32(5): 74-79.

The arrival of the network and big data era has enriched the information resources in cyberspace. However, the diversity and rapid growth of data put pressure on the storage and effective use of these resources. This paper presents a text fingerprint extraction method based on latent semantic analysis. The method is a compressed representation of the data and addresses the lack of semantics in current fingerprint extraction methods. The latent semantic features of the documents are obtained by singular value decomposition, and the original document vector space is transformed into the corresponding latent semantic space. Then, following the random hyperplane principle, each document in this space is transformed into a binary digital fingerprint, and the difference between fingerprints is measured by Hamming distance. The method is verified by similarity and clustering experiments on academic literature from CNKI. The experimental results show that the method better characterizes the semantic information of a document with an accurate and effective compressed representation.
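The last step of this pipeline, turning a vector into a binary fingerprint with random hyperplanes and comparing fingerprints by Hamming distance, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the documents have already been projected into a latent semantic space, and the vectors, the dimension, the number of bits, and all function names are made up for the example.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def fingerprint(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * h for v, h in zip(vector, plane))
        bits.append(1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions where two equal-length fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16, seed=42)
doc_a = [0.9, 0.1, 0.3, 0.0]       # hypothetical latent-space vector
doc_b = [2 * x for x in doc_a]     # same direction, different length
doc_c = [-x for x in doc_a]        # opposite direction

fp_a, fp_b, fp_c = (fingerprint(d, planes) for d in (doc_a, doc_b, doc_c))
print(hamming(fp_a, fp_b), hamming(fp_a, fp_c))
```

Because each bit depends only on the sign of a dot product, vectors pointing the same way share a fingerprint, and the Hamming distance grows with the angle between two vectors.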

Information Extraction and Text Mining

Word Feature for Cold Start Users Based on Trust Relationships and Word Correlation

GAO Hengde, WANG Zhiqiang, LI Ru

2018, 32(5): 80-88,96.

The word features of a user, obtained from the user's texts, are the basis for user theme modeling, interest mining, and personalized recommendation. To derive word features for cold start users, who have very few texts, this paper presents a method that merges the user's trust relations with word correlation. Specifically, we combine the user trust relation matrix, the word correlation matrix, and the feature word matrix via probabilistic matrix factorization. The experimental results on 4 data sets from Sina microblogging and Twitter show that the proposed algorithm achieves better results.

Information Retrieval and Question Answering

A Coreference Resolution Based Entity Search Model

XIONG Ling, XU Zengzhuang, WANG Xiaobin, HONG Yu, ZHU Qiaoming

2018, 32(5): 89-96.

The goal of Slot Filling (SF) is to extract certain attribute values of a given entity (the query) from a large-scale corpus. Entity search, an important component of SF, retrieves documents that refer to the given entity so that other components can extract attribute values from them. In contrast to existing entity search based on Boolean logic, we propose an entity search model based on cross-document coreference resolution (CDCR). CDCR improves the precision of the IR results by filtering out documents that contain no mention referring to the given entity. To minimize the loss of recall in the filtering process, we introduce pseudo relevance feedback to augment the information about the given entity. Experimental results show that our model outperforms the baseline, increasing precision and F1 score by 5.63% and 2.56%, respectively.
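The precision/recall trade-off that filtering introduces can be made concrete with a small sketch. The document IDs, the relevance judgments, and the set of documents kept by the filter are all invented toy data; the filter stands in for a hypothetical coreference component and does not reproduce the authors' model or results.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy data: documents that truly refer to the queried entity.
relevant = {"d1", "d2", "d3", "d4"}
# Documents returned by a keyword-style search (some are false matches).
retrieved = ["d1", "d2", "d3", "d5", "d6", "d7"]
# Documents a hypothetical coreference filter judges to mention the entity.
# It drops the false matches but also wrongly drops d3.
kept_by_filter = {"d1", "d2"}

before = precision_recall_f1(retrieved, relevant)
after = precision_recall_f1([d for d in retrieved if d in kept_by_filter], relevant)
print("before filtering:", before)
print("after filtering: ", after)
```

In this toy case filtering raises precision and lowers recall, which is the kind of loss the abstract says pseudo relevance feedback is meant to limit.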

Sentiment Analysis and Social Computing

LI Xin, LI Yang, WANG Suge

2018, 32(5): 97-104.

In text sentiment analysis, unsupervised clustering methods suffer from low precision. To improve the text similarity measure, which is the key to clustering, this paper proposes a semantic subspace (RESS) method that addresses the high dimensionality and sparseness of sentiment text representation. It also helps to capture implicit expressions of sentiment. The experimental results show that RESS can effectively reduce the feature dimension of the data set and generate better results.

Sentiment Analysis and Social Computing

Public Opinion Analysis for Xinjiang Violence News Based on Topic Model

ZHANG Shaowu, SHAO Hua, LIN Hongfei, YANG Liang

2018, 32(5): 105-113.

With the explosive growth of networks, internet public opinion has become a non-negligible issue. A typical example is the attention paid to the Xinjiang Violence events of recent years. To examine the corresponding public opinion trends, this paper investigates the key words of each topic and the changes in both topic strength and topic content. On news about Xinjiang Violence crawled from 2013.01 to 2015.12, we apply a dynamic topic model (DTM) that generates topics by applying NMF twice. We compare it with HDP and reveal some properties through visual analysis.

NLP Application

Research on the Evolution Mechanism of Policy Lineage Network Based on Semantics

LIU Gang, FU Weiping, MA Yingge

2018, 32(5): 114-127.

The article constructs a complex network system composed of a micro, a meso, and a macro complex network. The micro, meso, and macro policy networks are built, respectively, with an improved semantics-based method for calculating the similarity between policy words, a dependent clause analysis method, and a method based on the Vector Space Model. On the basis of this policy network system, the article extracts the hierarchy of the macro network, clears the policy fragments existing in the network, and constructs an ordered policy forest. It then introduces a policy anti-fragmentation method based on the forest. The experimental results indicate that the proposed methods can solve the problem of policy fragmentation effectively.

NLP Application

The Study of Multi-label Assistant Diagnosis of Obstetrics Based on Feature Fusion

MA Hongchao, ZHANG Kunli, ZHAO Yueshu, ZAN Hongying, ZHUANG Lei

2018, 32(5): 128-136.

Information extraction and assistant diagnosis for obstetric electronic medical records (EMRs) are of great significance in improving the fertility level of the population. Since the admitting diagnosis in the first course record of an EMR is reasoned from the chief complaints, auxiliary examinations, physical examinations, and so on, we treat the diagnostic process as a multi-label classification problem. Features extracted by LDA and the numerical features of the medical records are fused into new features by vector merging, and RAkEL, MLkNN, CC, and BP-MLL are used for multi-label classification. The experimental results show that the proposed method can improve assistant diagnosis on Chinese obstetric electronic medical records.

NLP Application

Predicting Knowledge Points of Questions: an Expertise-Enriched CNN Model

HU Guoping, ZHANG Dan, SU Yu, LI Jia, LIU Qingwen, WANG Rui

2018, 32(5): 137-146.

In online learning systems, a fundamental task for offering students better learning services is predicting the knowledge points of questions, i.e., the knowledge concepts or skills a question involves. Existing methods for this task usually rely on human labeling or traditional machine learning. The former is labor intensive, and the latter focuses only on shallow features without capturing the deep semantic relations between questions and knowledge points. In this paper, we propose an Expertise-enriched Convolutional Neural Network (ECNN) to predict the knowledge points of questions. Specifically, we first define and extract question features under the guidance of educational experience. Then, we use a convolutional neural network to learn question representations from a deep semantic perspective. After that, considering the relations between questions and expertise priors, we develop an attention-based method for calculating the importance of expertise for questions. Finally, we design an objective function for model learning that constrains both knowledge points and semantics. Extensive experiments on a large-scale dataset demonstrate the effectiveness of the proposed model and its application value.

2018 Volume 32 Issue 5 Published: 15 May 2018