2020 Volume 34 Issue 3 Published: 15 May 2020
  

  • Survey
    JIANG Yuru, ZHANG Yuyao, MAO Teng, ZHANG Yangsen
    2020, 34(3): 1-12.
    Zero anaphora resolution is a very important task in natural language processing. For more than two decades, scholars have proposed various methods based on linguistic rules, machine learning, and deep learning, reporting rich findings and empirical results. This paper first introduces the concept of zero anaphora, followed by the current international evaluation resource, the OntoNotes 5.0 dataset, and the associated evaluation metrics. After that, we systematically examine and summarize the methods used in Chinese zero anaphora resolution at home and abroad. Finally, we discuss the main limitations of current zero anaphora resolution research, as well as possible future research directions.
  • Language Analysis and Calculation
    FENG Yubo, CAI Dongfeng, SONG Yan
    2020, 34(3): 13-22.
    Word embeddings are low-dimensional dense real-valued vectors of words, which play an important role in various natural language processing tasks. This paper proposes three models that systematically learn embeddings for all the relevant concept fields defined in HowNet, obtaining better word vectors, especially for low-frequency words. Experimental results show that our models significantly improve the performance of word embeddings on word similarity and word analogy tasks.
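One common way HowNet information helps low-frequency words is to back a rare word's vector off to the vectors of its annotated sememes (minimal concept units). The sketch below is only an illustration of that idea, not the models proposed in the paper; the dictionaries `word_sememes` and `sememe_vecs` are hypothetical names.

```python
def sememe_word_vector(word, word_sememes, sememe_vecs):
    """Back off a word's embedding to the average of the vectors of its
    HowNet sememes -- useful when the word itself is too rare in the
    corpus to have a reliable directly trained vector."""
    sememes = word_sememes[word]
    dim = len(sememe_vecs[sememes[0]])
    return [sum(sememe_vecs[s][i] for s in sememes) / len(sememes)
            for i in range(dim)]
```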
  • Language Analysis and Calculation
    HU Yanxia, WANG Cheng, LI Bicheng, LI Hailin, WU Yiyin
    2020, 34(3): 23-33.
    A sentence semantic similarity computation method based on Tree-LSTM with multi-head attention (MA-Tree-LSTM), built over the dependency tree, is proposed. First, with external instructive features as input, the multi-head attention mechanism is applied to weigh the tree nodes in the Tree-LSTM. Second, a three-layer MA-Tree-LSTM is trained on the sentence semantic similarity task to obtain multi-layer representations of the semantic features. Finally, these multi-layer semantic features are used to build the sentence similarity model, making full use of the semantic structure in the sentence pairs. The proposed method is robust, interpretable, insensitive to word order, and requires no feature engineering. Experimental results on the SICK and STS datasets show that the proposed method outperforms Tree-LSTM and BiLSTM.
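The core operation — weighing a node's children by attention driven by an external feature vector — can be sketched roughly as follows. This is a single-head, scaled dot-product simplification for illustration only; the paper's multi-head variant and its learned projections are not reproduced here.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend_children(query, child_states):
    """Scaled dot-product attention over the hidden states of a tree
    node's children, using an external feature vector as the query;
    returns the attention-pooled child representation fed to the parent
    Tree-LSTM cell."""
    d = len(query)
    scores = [sum(q * c for q, c in zip(query, h)) / math.sqrt(d)
              for h in child_states]
    w = softmax(scores)
    return [sum(wi * h[i] for wi, h in zip(w, child_states))
            for i in range(d)]
```

With orthogonal child states, the pooled vector is just the attention weights themselves, which makes the weighting easy to inspect.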
  • Language Resources Construction
    LI Bohan, JIANG Shan, LIU Chang, YU Dong
    2020, 34(3): 34-43.
    Textual contradiction is a fundamental issue in natural language understanding. Most existing research focuses on contradiction detection without exploring the causes of contradictions, partly for lack of a readily available fine-grained Chinese contradiction corpus. Building on previously established contradiction types, we further clarify the concept of contradictory blocks, propose a labeling guideline, and construct a Chinese Contradiction Block (CCB) dataset. Several sequence labeling models and extractive machine reading comprehension models are examined on the dataset, with the latter performing better. We also analyze the factors affecting the correct identification of block boundaries, providing a baseline for follow-up research on this task.
  • Machine Translation
    LI Zhifeng, ZHANG Jiashuo, HONG Yu, YU Zhenkai, YAO Jianmin
    2020, 34(3): 44-55.
    Multimodal neural machine translation refers in this paper to a machine learning method that uses neural networks to translate image and text modal information in an end-to-end system. This paper proposes a multimodal machine translation model based on dual-attention decoding with a coverage mechanism. The model applies the coverage mechanism to the source language and the image respectively, reducing attention to information already attended to in past steps. The effectiveness of the proposed method is verified on the official evaluation datasets of WMT16 and WMT17. Experimental results show that the method improves multimodal neural machine translation performance by 1.2%, 0.8%, 0.7% and 0.6% on the four benchmark datasets of WMT16 En-De/En-Fr and WMT17 En-De/En-Fr, respectively.
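The general idea of a coverage mechanism — accumulate past attention and penalize positions that have already been attended to — can be sketched as below. This is a generic additive-penalty formulation for illustration; the paper's exact coverage update and its dual text/image application are not reproduced.

```python
import math

def coverage_attention(scores, coverage, penalty=1.0):
    """One decoding step of coverage-aware attention: raw alignment
    scores are reduced in proportion to the attention already
    accumulated at each position, discouraging repeated attention to
    the same content; returns the new weights and updated coverage."""
    adj = [s - penalty * c for s, c in zip(scores, coverage)]
    m = max(adj)
    es = [math.exp(a - m) for a in adj]
    z = sum(es)
    attn = [e / z for e in es]
    new_cov = [c + a for c, a in zip(coverage, attn)]
    return attn, new_cov
```

Run twice with identical raw scores and the second step shifts attention away from the position favored in the first.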
  • Machine Translation
    LI Peiyun, LI Maoxi, QIU Bailian, WANG Mingwen
    2020, 34(3): 56-63.
    The word embeddings of BERT contain semantic, syntactic and contextual information, pre-trained for various downstream natural language processing tasks. We propose to introduce BERT into neural quality estimation of MT outputs by employing a stacked BiLSTM (bidirectional long short-term memory), concatenated with the existing quality estimation network at the output layer. Experiments on the CWMT18 datasets show that quality estimation can be significantly improved by integrating the upper and middle layers of BERT, with the best improvement brought by average pooling of BERT's last four layers. Further analysis reveals that translation fluency is better exploited by BERT in the MT quality estimation task.
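The best-performing feature here — averaging the last four layers per token — is a standard pooling scheme and can be sketched directly. A minimal illustration on plain nested lists; in practice the layer states would come from a BERT model's hidden-state outputs.

```python
def pool_last_four(layer_states):
    """Average the per-token hidden states of the last four layers;
    layer_states[k][t] is token t's vector at layer k."""
    last4 = layer_states[-4:]
    n_tok, dim = len(last4[0]), len(last4[0][0])
    return [[sum(layer[t][i] for layer in last4) / len(last4)
             for i in range(dim)]
            for t in range(n_tok)]
```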
  • Information Extraction and Text Mining
    CAO Zhonghua, XIA Jiali, PENG Wenzhong, ZHANG Zhibin
    2020, 34(3): 64-71,106.
    Most word embedding models assign each word only one vector representation. Embeddings for polysemous words can be improved using external information such as word topics. Based on the original skip-gram (CBOW) and topic models, this paper designs two representation methods for multi-prototype word embeddings and one method of text generation via word embeddings. A joint learning approach is employed to simultaneously generate the topic information, the word embeddings and the topic embeddings, allowing the multi-prototype word vectors and the document topics to reinforce each other. Experiments show that the proposed method obtains distinct semantic vectors for polysemous words and more coherent topics.
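One simple way to see how a topic can disambiguate a polysemous word's vector: compose a topic-specific prototype from the word embedding and the topic embedding. The interpolation below is only a hedged illustration of the multi-prototype idea, not the paper's joint learning objective; all names and the mixing weight are assumptions.

```python
def topic_word_vector(word, topic, word_vecs, topic_vecs, alpha=0.5):
    """Compose a topic-specific prototype for a polysemous word by
    interpolating its base embedding with the embedding of the topic
    it appears under, so the same word gets a different vector in
    different topical contexts."""
    return [alpha * w + (1 - alpha) * t
            for w, t in zip(word_vecs[word], topic_vecs[topic])]
```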
  • Information Extraction and Text Mining
    WANG Like, SUN Yuan, XIA Tianci
    2020, 34(3): 72-79.
    Distant supervision for relation extraction is an efficient method that automatically aligns entities in texts to a given knowledge base (KB), alleviating the problem of manual labeling. In this paper, we propose an improved distantly supervised relation extraction model for Tibetan based on the Piecewise Convolutional Neural Network (PCNN). A language model and a selective-attention mechanism are combined to alleviate the wrong-labeling problem and to extract effective features. The soft-label method is also introduced to dynamically correct relation labels. Experimental results show that our method is effective and outperforms several competitive baselines.
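The "piecewise" part of PCNN refers to its pooling: each convolution filter's outputs are split into three segments delimited by the two entity positions, and each segment is max-pooled separately. A minimal sketch of that pooling step, assuming a pre-computed feature map:

```python
def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """PCNN-style piecewise max pooling: split each filter's outputs
    into three segments (up to entity 1, between the entities, after
    entity 2) and keep the maximum of each, yielding three values per
    filter instead of one."""
    pooled = []
    for row in feature_map:  # one row of conv outputs per filter
        segs = (row[:e1_pos + 1],
                row[e1_pos + 1:e2_pos + 1],
                row[e2_pos + 1:])
        pooled.extend(max(seg) if seg else 0.0 for seg in segs)
    return pooled
```

Compared with a single global max pool, this keeps coarse positional information about where each strong feature fired relative to the entity pair.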
  • Information Extraction and Text Mining
    YANG Jian, HUANG Ruizhang, DING Zhiyuan, CHEN Yanping, QIN Yongbin
    2020, 34(3): 80-87.
    Evidence in judgment documents is fundamental to the human judgment of a case, and can further be applied to assess case quality or support the "Intelligent Court". To handle long and nested evidence extraction, this paper proposes an extraction model based on boundary detection and combination. First, a BiLSTM-CRF model is used to detect the begin and end boundaries of the evidence. These boundaries are then assembled into candidate evidence spans carrying plenty of fine-grained information. Finally, a three-channel, multi-kernel CNN classification model is applied to select the correct candidates. Experimental results show that our method produces promising results.
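The combination step — pairing detected begin and end boundaries into candidate spans for the classifier — can be sketched as below. The `max_len` cutoff is an assumed practical constraint, not a detail from the paper; note that pairing every begin with every later end is exactly what lets nested and overlapping evidence survive to the classification stage.

```python
def assemble_candidates(begins, ends, max_len=50):
    """Pair every detected begin boundary with every end boundary at
    or after it (within max_len tokens) to form candidate evidence
    spans, which a downstream classifier then accepts or rejects."""
    return [(b, e) for b in begins for e in ends
            if b <= e <= b + max_len]
```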
  • Sentiment Analysis and Social Computing
    WEI Tingting, CHEN Weisheng, HU Yongjun, LUO Wei, BAO Xianyu
    2020, 34(3): 88-98.
    This paper proposes a fine-grained opinion analysis model based on syntactic rules and HowNet for product reviews. The model consists of three main modules: target mining, target-opinion mining, and overall opinion estimation. First, a target lexicon is constructed via part-of-speech tagging and frequent itemset mining, making it convenient to reuse and modify the overall opinion targets of products. Second, rules are designed for target-opinion extraction based on real e-commerce review data. Finally, the HowNet dictionary is adopted to estimate the overall score of all targets, and then to compare different brands of the same product along each evaluation perspective. The validity of this method is verified on a product review corpus.
  • NLP Application
    ZHAO Chengding, GUO Junjun, YU Zhengtao, HUANG Yuxin, LIU Quan, SONG Ran
    2020, 34(3): 99-106.
    The correlation analysis of news and cases is to predict whether a news article and a case are correlated, which is significant for news comment analysis in legal practice. Treating this issue as text similarity estimation, we propose an unbalanced Siamese network to address the length imbalance between news and case texts and the redundancy of news text. Since the news headline carries the main information, we select sentences in the news body similar to the title so as to remove redundant information. Since case elements represent the main semantic information of a case, we encode the news texts using case elements as supervisory information via the unbalanced Siamese network. Experimental results show that the proposed model improves accuracy by 2.52% over the baseline.
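The headline-guided sentence selection can be illustrated with a simple lexical-overlap ranking. The paper does not specify its similarity function; Jaccard word overlap here is a stand-in assumption, and the tokenized inputs and `k` cutoff are hypothetical.

```python
def select_by_title(title_tokens, sentences, k=2):
    """Rank body sentences by Jaccard word overlap with the headline
    and keep the top k, discarding redundant text before the news is
    encoded."""
    t = set(title_tokens)

    def score(sent):
        s = set(sent)
        return len(t & s) / len(t | s) if (t | s) else 0.0

    return sorted(sentences, key=score, reverse=True)[:k]
```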
  • NLP Application
    TAN Hongye, ZHANG Bowen, ZHANG Hu, LI Ru
    2020, 34(3): 107-114.
    Large-scale legal documents provide data for intelligent judicial adjudication research. This paper investigates penalty prediction based on interval partition and multi-model voting, revealing that this strategy can effectively alleviate the issues of excessive penalty categories and data imbalance. Further, we explore penalty prediction based on case attributes to fully capture the factors considered in human judgment. Experiments on the dataset provided by the 2018 Competition of AI and Law in China (CAIL2018) show that the above models outperform the baselines.
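The interval-partition-plus-voting strategy can be sketched in a few lines: penalty terms are discretized into intervals, several models each predict an interval, and a majority vote decides. The intervals, tie-breaking rule, and names below are illustrative assumptions, not the paper's configuration.

```python
def vote_interval(model_predictions, intervals):
    """Majority vote over several models' penalty-interval predictions
    (each prediction is an index into `intervals`); ties are broken
    toward the lower interval index."""
    counts = {}
    for p in model_predictions:
        counts[p] = counts.get(p, 0) + 1
    best = min(counts, key=lambda i: (-counts[i], i))
    return intervals[best]
```

Discretizing into intervals merges rare exact sentence lengths into broader, better-populated classes, which is how the scheme eases both the category explosion and the data imbalance the abstract mentions.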