Journal of Chinese Information Processing

Language Analysis and Calculation

Macro Discourse Relation Classification Based on Macro Semantics Representation

ZHOU Yi, CHU Xiaomin, ZHU Qiaoming, JIANG Feng, LI Peifeng

2019, 33(3): 1-7,24.

Macro discourse analysis aims to analyze the semantic relations between adjacent paragraphs or paragraph groups, a fundamental but less-addressed task in natural language processing. This paper proposes a classification model to determine relations at the macro discourse level. The model introduces a distributed representation of macro discourse semantics based on word vectors, together with a set of structure features, to improve performance. Experimental results on the Macro Chinese Discourse Tree Bank (MCDTB) show that the F₁ value of our model reaches 68.22%, an improvement of 4.17%.

Language Analysis and Calculation

A Hybrid Neural Network Model on Constituent Tree Structure

HUO Huan, XUE Yaohuan, HUANG Junyang, JIN Xuancheng, ZOU Yiting

2019, 33(3): 8-16.

Current methods that combine constituent trees with LSTM (C-TreeLSTM) suffer from low accuracy in text modeling because words are not taken into account when computing the hidden states of internal nodes. This paper proposes a hybrid neural network model, SC-TreeLSTM, based on the constituent tree structure. During encoding, the model enhances each node's memory of text semantics by injecting the semantic vector of the phrase covered by that node. Experimental results show that SC-TreeLSTM achieves excellent performance in both sentiment classification and machine reading comprehension tasks.

Language Resources Construction

A Chinese Corpus for Non-task-oriented Dialogue Systems with Five-grade Manual Annotations

LI Jing, ZHANG Haisong, SONG Yan

2019, 33(3): 17-24.

This paper presents a large-scale corpus for non-task-oriented dialogue systems, which contains over 27K distinct prompts with more than 82K responses collected from social media. To annotate this corpus, we define a 5-grade rating scheme (bad, mediocre, acceptable, good, and excellent) with respect to the relevance, coherence, informativeness, interestingness, and the potential to move a conversation forward. To test the validity and usefulness of the produced corpus, we compare various unsupervised and supervised models for response selection. Experimental results confirm that the proposed corpus is helpful in training response selection models.

Knowledge Representation and Acquisition

Gated Mean-Max Autoencoder for Text Representations

ZHANG Minghua, WU Yunfang, LI Weikang, ZHANG Yangsen

2019, 33(3): 25-32.

To learn distributed representations of text sequences, previous methods focus on complex recurrent neural networks or supervised learning. In this paper, we propose a gated mean-max autoencoder for both Chinese and English text representations. In our model, we rely solely on the multi-head self-attention mechanism to construct the encoder and decoder. In the encoding stage, we propose a mean-max strategy that applies both mean and max pooling operations over the hidden vectors to capture diverse information of the input. To enable this information to steer the reconstruction process, the decoder employs an element-wise gate to select between the mean and max representations dynamically. By training our model on large amounts of unlabelled Chinese and English data respectively, we obtain high-quality text encoders, which are made publicly available. Experimental results on reconstructing coherent long texts from the encoded representations demonstrate the superiority of our model over traditional recurrent neural networks, in terms of both performance and complexity.
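
The mean-max strategy and the element-wise gate described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sigmoid gate, and the convex mixing formula `g * mean + (1 - g) * max` are assumptions chosen for illustration, and the gate logits are supplied by hand rather than learned by a decoder.

```python
import math


def mean_max_pool(hidden):
    """Pool a list of equal-length hidden vectors into (mean, max) vectors."""
    n = len(hidden)
    dims = list(zip(*hidden))  # transpose: one tuple of values per dimension
    mean_vec = [sum(d) / n for d in dims]
    max_vec = [max(d) for d in dims]
    return mean_vec, max_vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(mean_vec, max_vec, gate_logits):
    """Assumed gate: per dimension, mix mean and max with a sigmoid weight."""
    gates = [sigmoid(z) for z in gate_logits]
    return [g * m + (1.0 - g) * x for g, m, x in zip(gates, mean_vec, max_vec)]


# Three hypothetical hidden vectors of dimension 3.
hidden_states = [[1.0, -2.0, 0.5], [3.0, 0.0, 0.5], [2.0, 4.0, -1.0]]
mean_vec, max_vec = mean_max_pool(hidden_states)

# Zero logits give a gate of 0.5, i.e. an even mix of the two representations.
combined = gated_combine(mean_vec, max_vec, [0.0, 0.0, 0.0])
```

In a trained model the gate logits would be produced by the decoder at each step, so each dimension can lean toward the mean or the max representation as needed.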

Knowledge Representation and Acquisition

Semantic Representation Learning Based on HowNet

ZHU Jingwen, YANG Yuji, XU Bin, LI Juanzi

2019, 33(3): 33-41.

HowNet is a large-scale, high-quality cross-lingual commonsense knowledge base containing a wealth of semantic information. This paper disassembles HowNet's complex structure and obtains HownetGraph in the form of a knowledge graph. Network Representation Learning and Knowledge Representation Learning methods are then applied to obtain cross-lingual vector representations of different semantic units, i.e., word, sense, DEF_CONCEPT and sememe. Two series of experiments (word similarity and word analogy) are conducted on Chinese and English datasets, and the results show that the proposed method achieves the best results.

Machine Translation

On Ensemble Learning of Neural Machine Translation

LI Bei, WANG Qiang, XIAO Tong, JIANG Yufan, ZHANG Zheyang, LIU Jiqiang, ZHANG Li, YU Qing

2019, 33(3): 42-51.

Ensemble learning has proved effective in machine translation evaluation campaigns, but sub-model selection and integration strategies are not well addressed. This paper examines two kinds of ensemble learning methods in machine translation tasks, parameter averaging and model fusion, and investigates the impact of diversity and model quantity on system performance from the perspectives of data and model. Experimental results show that the best result yields an improvement of 3.19 BLEU points over the strong Transformer baseline on WMT Chinese-English MT tasks.
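
Of the two ensemble methods named in this abstract, parameter averaging is simple enough to sketch. The example below is an illustration under stated assumptions, not the paper's procedure: checkpoints are represented as plain dictionaries mapping hypothetical parameter names (`"w"`, `"b"`) to lists of floats, and all checkpoints are weighted equally.

```python
def average_parameters(checkpoints):
    """Average same-named parameters element-wise across checkpoints.

    Each checkpoint is a dict: parameter name -> list of floats.
    All checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th value of every checkpoint together.
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


# Three hypothetical checkpoints of the same tiny model.
checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
avg = average_parameters(checkpoints)
```

Parameter averaging produces a single model, so decoding cost is unchanged. Model fusion, by contrast, typically combines the output distributions of several models at decoding time.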

Machine Translation

Joint Prediction Model of English Words and Their Cases in Neural Machine Translation

ZHANG Nan, LI Xiang, JIN Xiaoning, CHEN Wei

2019, 33(3): 52-58.

To produce correctly cased English words in machine translation, this paper proposes a method for jointly predicting English words and their case in neural machine translation. It predicts words and their case in the same decoder, employing information from both the source corpus and the target corpus. The method not only decreases the vocabulary size and reduces the number of model parameters, but also improves translation quality. Compared to the baseline system on the WMT 2017 Chinese-English news translation task test set, the proposed method achieves a 0.97 BLEU improvement in the case-insensitive setting and 1.01 in the case-sensitive setting.
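
The vocabulary reduction mentioned in this abstract follows from separating each word into a lowercased form and a case label. The sketch below shows only that data representation, under assumptions: the four tag names and the helper functions are hypothetical, and the joint prediction in the decoder itself is not modeled.

```python
def case_tag(token):
    """Classify a token's casing with one of four assumed tags."""
    if token == token.lower():
        return "lower"
    if token == token.lower().capitalize():
        return "title"
    if token == token.upper():
        return "upper"
    return "mixed"  # e.g. "iPhone"; not restorable from the tag alone


def restore_case(word, tag):
    """Rebuild the surface form from a lowercased word and its case tag."""
    if tag == "title":
        return word.capitalize()
    if tag == "upper":
        return word.upper()
    return word  # "lower", and the lossy fallback for "mixed"


tokens = ["The", "the", "NASA", "launch", "THE"]
pairs = [(t.lower(), case_tag(t)) for t in tokens]

# The word vocabulary only needs lowercased entries.
vocab = {word for word, _ in pairs}
restored = [restore_case(word, tag) for word, tag in pairs]
```

Here five surface tokens share three vocabulary entries, with the casing carried by a small, fixed tag set.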

Ethnic Language and Cross Language Information Processing

Research on Korean Sentence Structure Similarity Metric

BI Yude, JIANG Bowen

2019, 33(3): 59-63,101.

Sentence similarity computing is a fundamental task in natural language processing; for example, it directly affects translation quality in an EBMT (Example-based Machine Translation) system. Focusing on Korean, this paper puts forward a method for computing Korean sentence structure similarity according to the characteristics of Korean sentences. The method first extracts the skeleton of the Korean sentence and then further processes the skeleton with the transformation rules designed in this paper. The final sentence similarity is measured in the resulting structure space, and experiments validate the feasibility and efficiency of the method.

Ethnic Language and Cross Language Information Processing

Uyghur Named Entity Recognition Based on Deep Neural Network

WANG Lulu, AISHAN Wumaier, TUERGEN Yibulayin, MAIHEMUTI Maimaiti, KAHAERJIANG Abiderexiti

2019, 33(3): 64-70.

Current Uyghur named entity recognition methods are based on statistical learning methods such as conditional random fields, which depend heavily on manual feature engineering and domain knowledge extraction. To resolve this issue, this paper proposes a method based on a deep neural network and introduces different feature vector representations for Uyghur named entity recognition. The word embedding and the character-level embedding are combined by an attention-based method and then fed into the Bi-LSTM-CRF. Experimental results show that the proposed method achieves an F-value of 90.13% for Uyghur named entity recognition.

Ethnic Language and Cross Language Information Processing

A Study on Discrimination Between Identification of Similar Languages on Short Conversational Texts with Out-of-domain Data

HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang

2019, 33(3): 71-78.

This paper aims to identify similar languages, such as Uyghur and Kazakh, in short conversational texts. To alleviate the severe data imbalance resulting from low-resource Kazakh, we leverage a compensation strategy and an assimilation method by selecting appropriate out-of-domain data. We then construct a maximum entropy (MaxEnt) classifier based on morphological features to discriminate between the two languages and investigate the contribution of each feature. Experimental results suggest that the MaxEnt classifier effectively discriminates between Uyghur and Kazakh on the test set with an accuracy of 95.7%, outperforming the champion of the VarDial’2016 DSL shared task on test sets B1 and B2 by 0.6% and 1.2%.

Information Extraction and Text Mining

Mining of Classical Musical Instrument Poetry Based on Topic Model

SHEN Zizhuo, YANG Ying, SHAO Yanqiu

2019, 33(3): 79-86.

This paper presents an investigation into BaYin, the eight classical musical instruments, as depicted in ancient Chinese poetry. Specifically, we apply LDA and NMF to establish author similarity based on the Author-Topic Model. Drawing on the corpus of Tang Poetry and Song Poetry, this paper delivers a panoramic view of the poems, the poets, the topics, and the verbs related to BaYin.

Information Extraction and Text Mining

User Relation Extraction via Text Information and Attention Mechanism

ZHAO Yun, WU Fan, WANG Zhongqing, LI Shoushan, ZHOU Guodong

2019, 33(3): 87-93.

With the development of social media, the relationship network between users has greatly helped the analysis of social media. Focusing on predicting user relationships from the text published by users, this paper proposes a friendship prediction model based on the attention mechanism and Long Short-Term Memory (LSTM), which separates the comments between friends and determines whether a friend relationship exists by analyzing the users' comments. The model takes as input the concatenated results of the two friends and applies the attention mechanism to the output of the LSTM. Experiments show that the proposed model achieves an accuracy of 77% without adding any non-text features.

Information Extraction and Text Mining

Multi-feature Bi-LSTM-CRF Model for Person Name Recognition from Movie Reviews

XUAN Zhenyu, JIANG Shengyi, ZHANG Liming, BAO Rui

2019, 33(3): 94-101.

Person names in movie reviews are characterized by abbreviations and neologisms, which degrades the performance of classical models (e.g. CRF). To deal with this issue, this paper proposes a novel person name recognition method named the Multi-Feature Bi-LSTM-CRF Model. The model extracts relevant character-level features using external corpora and unlabeled reviews, then applies Bi-LSTM-CRF to identify the sequence of person names. Experimental results show that our model can effectively identify different forms of person names in movie reviews.

Question Answering, Dialogue System and Machine Reading Comprehension

Integrating Question Understanding in Neural Networks to Answer the Description Problems in Reading Comprehension

TAN Hongye, LIU Bei, WANG Yuanlong

2019, 33(3): 102-109.

This paper explores solutions to the description problems in reading comprehension using the QU-NNs model, whose framework consists of the Embedding layer, the Encoding layer, the Interaction layer, the Prediction layer, and the answer Post-processing layer. To deal with the high degree of semantic generalization of the questions, we integrate three question features (question type, question topic, question focus) into the Encoding layer and the Interaction layer of the model to better understand the question. Specifically, the question type is identified by a convolutional neural network, and the question topic and question focus are obtained through syntactic analysis. Further, a heuristic method is designed to identify noise and redundant information in the answer. Experiments show that adding question features and removing redundant information increases performance by 2% to 10%.

NLP Application

Image Caption via Pivot Language

ZHANG Kai, LI Junhui, ZHOU Guodong

2019, 33(3): 110-117.

Because large-scale image datasets with manually labeled English captions are publicly available, most studies on image caption aim at generating captions in a single language (e.g., English). In this paper, we explore zero-resource image caption to generate Chinese captions via English as the pivot language. Specifically, we propose and compare two approaches that take advantage of recent advances in neural machine translation. The first approach, called the pipeline approach, first generates an English caption for a given image and then translates the English caption into Chinese. The second approach, called the pseudo-training set approach, first translates all English captions in the training and development sets into Chinese to obtain image-Chinese caption datasets, and then directly trains a model to generate a Chinese caption for a given image. Experimental results show that the second approach, i.e., the character-based Chinese caption generation model on the pseudo-training set, is superior to the pipeline approach.

NLP Application

A Case Study on Journey to the West Based on Sentiment Analysis

ZHANG Chenlin, WANG Mingwen, TAN Yiming, CHEN Zhiming, ZUO Jiali, LUO Yuansheng

2019, 33(3): 118-125,135.

As one of the Four Great Classical Novels, Journey to the West leaves much foreshadowing open to interpretation. In this paper, we conduct a case study on Monkey King using sentiment analysis. We apply NLP techniques, namely automatic word segmentation and sentiment lexicon collection, to calculate the sentiment of Monkey King. By comparing the sentiment of Monkey King before and after the episode of “Real and Fake Monkey King”, we propose that “Monkey King was not killed by Rulai, the supreme Buddha”, and that he became submissive to authority after the episode. This paper makes a tentative exploration of sentiment analysis for literary studies.

NLP Application

Neural Network-Based Jiju Poetry Generation

LIANG Jiannan, SUN Maosong, YI Xiaoyuan, YANG Cheng, CHEN Huimin, LIU Zhenghao

2019, 33(3): 126-135.

Jiju poetry is a special kind of Chinese classical poetry in which each line is selected from existing poems. As a form of artistic re-creation, the resulting poem should not only obey structural and phonological constraints, but also have an original theme, integrated content, and coherence. In this paper, we propose a novel automatic Jiju poetry generation model based on neural networks. We apply a Recurrent Neural Network (RNN) to learn the vector representation of each poetry line, and then investigate different methods to measure the context coherence of two lines. Both automatic and human evaluation results show that our model can generate high-quality Jiju poems, outperforming the baseline models significantly.

NLP Application

Study on Automatic Judgment of Traffic Accidents

YIN Heju, ZAN Hongying, CHEN Junyi, ZHAI Xinli

2019, 33(3): 136-144.

This article investigates automatic judgment of “traffic accident” civil cases in the legal field. A total of 14 000 samples are collected from the “China Judgment Document Network.” Three models are examined, i.e. an SVM-based model, a BI-GRU-based model, and an Attention+BI-GRU-based model, to classify the cases from the “China Judgment Document Network” into four classes and eight classes, respectively. The experimental results show that the Attention+BI-GRU model ranks first with 80.26% F1 in the first task, while the BI-GRU model ranks first with 48.59% F1 in the second.

2019 Volume 33 Issue 3 Published: 15 March 2019