Journal of Chinese Information Processing

Survey

A Survey of Image Captioning

MA Longlong, HAN Xianpei, SUN Le

2018, 32(4): 1-12.

As a new multimodal task connecting vision and language, image captioning can be applied to tasks such as text-based image retrieval and network image analysis, and has therefore drawn wide attention from the research and business communities. Existing image captioning methods generally fall into three categories: generation-based methods, retrieval-based methods and encoder-decoder methods. In this paper, we first present representative work for each of the three categories and analyze their advantages and disadvantages. We then describe the datasets, evaluation metrics and several open-source toolkits for image captioning. Finally, we identify the key technical problems in the image captioning task.

Language Analysis and Calculation

Towards the Relevance Between Effect and Time of Garden Path Sentence Processing

DU Jiali, YU Pingfang

2018, 32(4): 13-23,30.

This research examines the correlation between the decoding effect of English garden path sentences and reaction time, using a paired-sample t-test on data from 126 Chinese college students. A garden path sentence is a locally ambiguous sentence that can easily lead students into cognitive confusion. With other experimental conditions fixed, we extend the reaction time from 5 seconds to 10 seconds across two experiments on a single garden path sentence, and then compute the paired-sample t statistic for the students' results. The paired-sample t-test for S1 yields 3.71, which is larger than the critical value. For the experiments on S2-S100, we find that the choice of sentences may affect the experimental results. Generally speaking, there is a significant correlation effect between different reaction times during the processing of English garden path sentences, and extending the reading reaction time can, to some extent, help students better disambiguate locally ambiguous sentences.
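
The paired-sample t statistic mentioned in the abstract can be illustrated with a minimal sketch. The function name and the scores below are hypothetical and are not taken from the paper; the sketch only shows how the statistic is formed from per-participant differences between the two reaction-time conditions.

```python
import math

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired-sample t-test."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on per-participant differences between the two conditions.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical decoding scores for five participants (illustration only).
scores_5s = [2, 3, 1, 4, 2]
scores_10s = [3, 5, 2, 4, 4]
t_value, dof = paired_t_statistic(scores_5s, scores_10s)
print(f"t = {t_value:.2f}, df = {dof}")
```

The resulting t value is then compared with the critical value for the chosen significance level and degrees of freedom, as the abstract does for S1.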

Language Analysis and Calculation

Studies on the Discourse Connectives in Wenxin Diaolong

FENG Wenhe, GUO Haifang, LIU Tao

2018, 32(4): 24-30.

By labeling the discourse structure of the book Wenxin Diaolong, we investigate its explicit and implicit discourse connectives, covering their semantics and usage. We report the following results: 1) Implicit connectives (78.1%) are used more often than explicit connectives (21.9%). Among the 17 types of discourse relations, explicit connectives are used more often than implicit ones in only 4 relations (cause-result, transition, hypothetical, purpose). 2) There are different ways to use synonymous connectives to represent the same relation. On the one hand, connectives are used most frequently in “Continuity” (17 times). On the other hand, the relation of “Total and Background” can be set up without any connectives. 3) Among the 56 discourse connectives, monosemous connectives are dominant (44) in contrast to polysemous ones (12), and the most ambiguous connective has 5 semantic entries. In addition, we conduct a case analysis of the usage of synonymous and polysemous connectives, and compare discourse connective usage with that in other books of the same period.

Language Analysis and Calculation

LIANG Yongshi, HUANG Peijie, CEN Hongjie, TANG Jiecong, WANG Jundong

2018, 32(4): 31-39.

Current approaches to semantic similarity computing can be classified as either vector-based or lexical-taxonomy-based. This paper proposes a semantic similarity method that links a vector model to multi-source lexical taxonomies. In this method, the vector representation of a word is obtained from distributed representations learned by vector-based models, and synonym relations are derived from multi-source lexical resources. Furthermore, this paper explores how to select and fuse the knowledge from multiple lexical taxonomies. The combination strategy can alleviate the defects of the two classical methods. We experiment on PKU 500, the dataset of the NLPCC-ICCPOL 2016 shared task on Chinese word similarity measurement. Our method achieves a Spearman score of 0.637, i.e. a 23% improvement compared to the best result in the shared task.
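
As a rough illustration of combining a vector model with synonym knowledge from several lexical resources, the sketch below backs off from a synonym lookup to cosine similarity. The fusion rule, the function names, and the toy vectors and synonym sets are all assumptions made for illustration; the abstract does not specify the paper's actual selection and fusion strategy.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fused_similarity(w1, w2, vectors, taxonomies, synonym_score=1.0):
    """Assumed fusion rule: if any taxonomy lists the two words in the same
    synonym set, return synonym_score; otherwise fall back to vector cosine."""
    for synsets in taxonomies:
        for synset in synsets:
            if w1 in synset and w2 in synset:
                return synonym_score
    return cosine(vectors[w1], vectors[w2])

# Toy word vectors and two toy "taxonomies" (hypothetical data).
vectors = {"car": [1.0, 0.0], "auto": [0.6, 0.8], "tree": [0.0, 1.0]}
taxonomies = [[{"car", "auto"}], [{"tree", "plant"}]]

print(fused_similarity("car", "auto", vectors, taxonomies))
print(fused_similarity("car", "tree", vectors, taxonomies))
```

Predicted scores of this kind would then be compared with human ratings using the Spearman rank correlation, the metric reported in the abstract.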

Language Analysis and Calculation

Semantic Enhanced Topic Modeling by Bi-directional LSTM

PENG Min, YANG Shaoxiong, ZHU Jiahui

2018, 32(4): 40-49.

To construct a semantically coherent topic model, this paper proposes a probabilistic topic model, DGPU-LDA (Double Generalized Polya Urn with LDA), which is built on deep semantic reinforcement from a bi-directional LSTM. To embed the semantic information of documents, we design a document-wise semantic encoder, DS-Bi-LSTM (Document Semantic Bi-directional LSTM). For model inference, we employ mechanisms such as document-topic GPU semantic reinforcement, word-word GPU semantic reinforcement and LSTM iterative dependency modeling to capture the Gibbs sampling process. Finally, we evaluate our method and other baselines on the SogouCA and 20 Newsgroups datasets. Experimental results show that, in terms of topic semantic coherence and text classification, the proposed DGPU-LDA outperforms some of the state-of-the-art topic models. These improvements also indicate that DGPU-LDA is effective for text semantic feature representation.

Language Resources Construction

Constructing the Repository for Modern Chinese Adjectives

RAO Qi, WANG Houfeng, WANG Mengxiang, LI Hui

2018, 32(4): 50-58.

Adjectives, nouns and verbs form the main part of Chinese content words. Adjectives depend strongly on nouns in syntax, and their core function lies at the conceptual level: evaluating the features of nouns under the adjustment of the cognitive attention mechanism. This paper reports the procedure of constructing an adjective knowledge base of Modern Chinese. Firstly, the paper investigates the collection of adjectives in existing knowledge bases and reference books, and constructs a comprehensive set of adjectives by adding those newly created in the course of language change. Secondly, the paper describes the design concept of the knowledge base in detail. Then, the feature description system of the knowledge base is presented. Finally, the paper discusses prospective application scenarios of the knowledge base.

Language Resources Construction

Semantic Role Labeling Based on Correspondence Rules Between Syntactic Pattern and Semantic Pattern of Sentences

HE Baorong, QIU Likun, SUN Panpan

2018, 32(4): 59-65.

The construction of large-scale semantic corpora can provide useful training data for computers to understand the semantics of natural language. This paper focuses on the semantic rules used in constructing a semantic corpus. On the basis of manual semantic role tagging, the correspondence between the syntactic patterns and semantic patterns of sentences is analyzed, and a set of semantic role labeling rules based on sentence patterns is extracted, achieving 78.73% precision on the test set.

Knowledge Representation and Acquisition

A Method of Knowledge Representation and Ontology Modeling Based on Hierarchical Network of Concepts

WEN Liang, LI Juan, LIU Zhiying, JIN Yaohong

2018, 32(4): 66-73.

In the field of natural language processing (NLP), two important issues remain to be addressed: knowledge representation is not unified, and semantic information cannot be used systematically. This paper presents a multi-dimensional, unified semantic knowledge representation method covering words, sentences and discourses, based on the theory of the hierarchical network of concepts (HNC). With this method, we build a multi-language, ontology-based knowledge base (KB), which can provide a theoretical reference for the study of semantic processing of large-scale Chinese texts and support the construction of knowledge resources in specific domains.

Machine Translation

An Experimental Analysis of Unknown Words in Neural Machine Translation Using Sub-word Unit

HAN Dong, LI Junhui, XIONG Deyi, ZHOU Guodong

2018, 32(4): 74-79,119.

Neural machine translation, as the state-of-the-art method for machine translation, is substantially challenged by the issue of unknown word translation. Byte Pair Encoding (BPE) is a well-recognized solution, in which a word is decomposed into sub-word units of higher frequency before translation. This paper investigates the effectiveness of the BPE method in resolving unknown word translation in Chinese-English translation. Experimental results show that the BPE method achieves an improvement of 1.02 BLEU points. Further analysis reveals that neural machine translation with BPE achieves an accuracy of 0.45 in unknown word translation, comparable to that of classical statistical machine translation.
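
The idea of decomposing words into higher-frequency sub-word units can be sketched with a minimal version of the standard BPE merge-learning procedure. The toy word frequencies, the `</w>` end-of-word marker, and the tie-breaking rule are illustrative assumptions; this is not the configuration used in the paper.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} dictionary."""
    # Start from characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = merge_pair(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split a possibly unseen word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus statistics (hypothetical).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=3)
print(segment("lowest", merges))
```

Here "lowest" never appears in the toy corpus, yet it is segmented into units seen during training, which is how BPE lets a translation model handle otherwise unknown words.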

Information Extraction and Text Mining

Location Based Link Prediction for Knowledge Graph

ZHANG Ningyu, CHEN Xi, CHEN Jiaoyan, DENG Shumin, RUAN Wei, WU Chunming, CHEN Huajun

2018, 32(4): 80-86,129.

Link prediction is the basis of knowledge graph completion and analysis. To leverage the rich location characteristics of location-related entities and their relationships, this paper presents a location-based link prediction method for knowledge graphs. The method first classifies relations by analyzing the semantic features of entities and relationships, and proposes a way of mining features and rules from location-based entities and relationships. Then, using the mined entity location features and rules, we construct constraints on the prediction results of vectorization methods for entities and relationships, and obtain the final results. Experiments on the WikiData, FB and WN datasets show that the method is effective for location-based relationship and entity link prediction.

Information Extraction and Text Mining

Multiple-to-One Chinese Textual Entailment for Reading Comprehension

CHEN Qian, CHEN Xiafei, GUO Xin, WANG Suge

2018, 32(4): 87-94.

As a kind of micro reading pattern, machine reading comprehension has attracted much attention in the field of automatic question answering in recent years. Multiple-to-one textual entailment is a common phenomenon in machine reading comprehension. This paper first constructs the M2OCTE corpus with 8000 multiple-to-one Chinese textual entailment pairs. It then adopts a hierarchical neural network model, which can effectively integrate the semantic information across multiple sentences, to establish a unified representation for multiple-to-one entailment pairs in an end-to-end style. The accuracy of the method on the entailment data set built from modern-article reading comprehension in the university entrance exam is 58.92%, which is higher than that of the traditional one-to-one entailment method. We also verify the effectiveness of the proposed method on an English data set.

Information Extraction and Text Mining

Mining Hierarchical Structural Holes of the Network and Its Analysis

CUI Pingping, ZHAO Shu, CHEN Jie, QIAN Fulan, ZHANG Yiwen, ZHANG Yanping

2018, 32(4): 95-104.

In social networks, structural holes refer to a class of nodes that occupy important positions in information diffusion. According to prior research, 5% of structural holes control 50% of the information diffusion. Researchers have studied how to mine structural holes at a single granularity; however, many networks have a hierarchical, multi-granularity structure. It is therefore of great significance to mine and analyze the structural holes of a network at multiple granularities. In this paper, a method named HI-SH is proposed to mine multi-granularity structural holes in networks with hierarchical structure, and some analysis of structural holes at multiple granularities is given based on this method. In this method, we first detect the communities of the network at each hierarchical granularity. Then, according to the theory of two-step information diffusion, a structural hole mining algorithm is used to mine the top-k structural holes at each granularity. Experiments on the public dataset Topic16 and on real data show that the structural holes of a network are dynamic, and that a ranking of structural holes at a single granularity cannot represent the rank order across all granularities of the network.

Information Extraction and Text Mining

Domain Classification Based on Undefined Utterances Detection Optimization

KE Zixuan, HUANG Peijie, ZENG Zhen

2018, 32(4): 105-113.

Undefined utterances are common in task-oriented corpora, with complex composition and vague boundaries with respect to defined utterances. To deal with the misclassification of undefined utterances, which substantially hurts the user experience, this paper proposes an effective domain classification method based on optimizing undefined utterance detection. We first apply a clustering algorithm to aggregate defined utterances into several super groups. Then, a domain classifier is used to classify the defined utterance super groups and the undefined utterance group, with the efficiency of undefined utterance detection as the optimization target. Finally, the defined utterances are re-classified, free from the noise of undefined utterances. Specifically, we adopt the deep learning model LSTM as the classifier and train word embeddings on an unlabeled Weibo dataset for utterance feature representation. The evaluation on the multi-assignment corpus of the SMP 2017 domain classification competition shows that the proposed method yields a clear improvement in both the F1 score of undefined utterance detection and the domain classification accuracy over all utterances.

Information Retrieval and Question Answering

Deep Learning-based Personalized Paper Recommender System

WANG Yan, TANG Jie

2018, 32(4): 114-119.

In this paper, we propose a personalized paper recommender system based on Aminer, an academic search and data mining platform. The system is a hybrid recommender that combines collaborative filtering and content-based recommendation. Further, we boost the performance of our model by incorporating word embeddings and Word Mover's Distance (WMD) into the content-based recommendation. The experiments show that our system significantly outperforms competing approaches for paper recommendation (+4% in terms of precision).
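
One simple way to combine collaborative filtering with content-based recommendation is a weighted blend of the two scores, sketched below. The blending rule, the weight `alpha`, the function names, and the toy scores are assumptions for illustration; the abstract does not state how the paper combines the two components, and the content scores here stand in for whatever a WMD-based similarity would produce.

```python
def hybrid_scores(cf_scores, content_scores, alpha=0.5):
    """Blend two {paper_id: score} dictionaries.

    alpha weights the collaborative-filtering score and (1 - alpha) the
    content-based score; a paper missing from one source scores 0.0 there.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    papers = set(cf_scores) | set(content_scores)
    return {
        p: alpha * cf_scores.get(p, 0.0) + (1 - alpha) * content_scores.get(p, 0.0)
        for p in papers
    }

def recommend(cf_scores, content_scores, alpha=0.5, k=3):
    """Return the top-k paper ids by blended score (ties broken by id)."""
    blended = hybrid_scores(cf_scores, content_scores, alpha)
    return sorted(blended, key=lambda p: (-blended[p], p))[:k]

# Hypothetical scores for three candidate papers.
cf = {"p1": 0.9, "p2": 0.2, "p3": 0.5}
content = {"p1": 0.1, "p2": 0.8, "p3": 0.6}

print(recommend(cf, content, alpha=0.7))
print(recommend(cf, content, alpha=0.2))
```

Raising `alpha` favors papers liked by similar users, while lowering it favors papers whose text is close to the user's interests.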

Sentiment Analysis and Social Computing

User Emotion Modeling and Anomaly Detection Based on Social Media

SUN Xiao, ZHANG Chen, REN Fuji

2018, 32(4): 120-129.

For abnormal emotion detection among micro-blog users, this paper proposes an anomaly detection method based on the joint probability density of a multivariate Gaussian model and a power-law distribution. In the experiments, the anomaly detection accuracy is 83.49% in terms of individual users, and 87.84% in terms of months. Statistics reveal that individual users' neutral, happy and sad emotions follow a normal distribution, whereas the amazed and angry emotions do not. Emotions of micro-blogs released by groups conform to the power-law distribution, whereas those released by individuals do not.
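
Density-based anomaly detection of this kind can be sketched as follows for the Gaussian part only. The sketch assumes independent dimensions (a diagonal covariance), omits the power-law component entirely, and uses hypothetical emotion proportions and an arbitrary threshold choice; it is not the model described in the paper.

```python
import math

def fit_diagonal_gaussian(rows):
    """Estimate per-dimension means and variances from a list of vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A small floor keeps the variance strictly positive.
    variances = [
        max(sum((r[j] - means[j]) ** 2 for r in rows) / n, 1e-9)
        for j in range(d)
    ]
    return means, variances

def log_density(x, means, variances):
    """Log density of x under independent per-dimension Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )

def is_anomaly(x, means, variances, threshold):
    """Flag x when its log density falls below the threshold."""
    return log_density(x, means, variances) < threshold

# Hypothetical history: (share of happy posts, share of sad posts) per period.
history = [[0.5, 0.1], [0.6, 0.2], [0.4, 0.1], [0.5, 0.2]]
means, variances = fit_diagonal_gaussian(history)
# Assumed threshold: the lowest log density observed in the history.
threshold = min(log_density(r, means, variances) for r in history)

print(is_anomaly([0.5, 0.15], means, variances, threshold))
print(is_anomaly([0.0, 0.9], means, variances, threshold))
```

In practice the threshold would be tuned on labeled data, and a full covariance matrix would be needed to capture correlations between emotions.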

Sentiment Analysis and Social Computing

Retweet Behavior Prediction Using Topic Model

GUO Ya, GONG Yeyun, ZHANG Qi, HUANG Xuanjing

2018, 32(4): 130-136.

In microblogging services, retweeting is a key behavior for information diffusion. Predicting retweet behavior is an important step for various social network applications, such as social marketing, microblog retrieval and popular event prediction. We collect a large number of microblogs and the corresponding social networks from Sina Weibo, and identify several factors that affect users' retweet behavior: the author of the tweet, user interest and the popularity of the tweet. We then propose a novel retweet behavior prediction method based on the LDA model that combines structural, textual and author information. To evaluate the proposed method, we simulate the real user environment on the constructed dataset. Experimental results demonstrate that the proposed method achieves better performance than state-of-the-art methods. The relative improvement of the proposed method over the baseline method is 35% to 45% in terms of F1-Score.

NLP Application

An Online Data Collecting Framework Via Game for Children Second Language Development

MA Weizhi, ZHANG Min, ZHANG Chenyu, LIU Yiqun, MA Shaoping

2018, 32(4): 137-144.

Language cognition research is often based on datasets of children's first language vocabulary development, such as WordBank and other large-scale corpora. However, there is no large-scale dataset of second language vocabulary development, and it is very difficult to collect a large dataset with traditional data collection methods. This limits the study of second language learning and the comparison of first and second language learning. In this paper, we design a data collection framework for children based on the idea of games with a purpose, to collect children's vocabulary development status together with their attributes. So far we have implemented a second language vocabulary development collection system for children's English learning, and the system is currently collecting data online.

2018 Volume 32 Issue 4 Published: 16 April 2018