Journal of Chinese Information Processing

Language Analysis and Calculation

An Investigation of Inconsistency in Semantic Role Labeling: A Case Study of Agent

CHEN Long, ZHAN Weidong

2019, 33(1): 1-9.

Semantic role labeling, which builds on theories of semantic roles, is an important task in natural language processing. It requires that the semantic roles assigned to a verb be consistent across different sentences. We analyzed the semantic distribution of agents in the semantic role labeling corpus. The analysis reveals that 85 verbs fail to meet the consistency requirement, which may be accounted for by their being causative rather than volitive verbs. This strategy can be extended to other inconsistency issues in semantic role labeling.

Language Analysis and Calculation

Improved Character-based Chinese Dependency Parsing Based on Stack-Tree LSTM

LIU Hang, LIU Mingtong, ZHANG Yujie, XU Jinan, CHEN Yufeng

2019, 33(1): 10-17.

In character-based Chinese dependency parsing, it is crucial to make the best use of the intermediate results of word segmentation, POS tagging and dependency parsing. To fully exploit dependency subtree information, this paper proposes a novel Stack-Tree LSTM, which is essentially a character-based neural joint model that integrates subtree and POS features in addition to N-gram features. Experiments on Penn Chinese Treebank 5 show that our model is comparable to the best reported results and outperforms other neural joint models.

Language Analysis and Calculation

Multi-grained Chinese Word Segmentation with Lattice-LSTM

ZHANG Wenjing, ZHANG Huimeng, YANG Liner, XUN Endong

2019, 33(1): 18-24.

Chinese word segmentation is crucial to Chinese information processing. To achieve multi-grained word segmentation, we propose a model based on the lattice-LSTM. Compared with the traditional character-based LSTM model, our method incorporates multi-granularity dictionary information. With the help of the lattice structure, our model has a strong ability to capture word segmentation standards of different granularities, without being confined to any single standard. Experiments show that the proposed method reaches state-of-the-art performance in multi-granularity Chinese word segmentation.

Language Analysis and Calculation

Negation Focus Identification via Bi-directional LSTM-CRF Model

SHEN Longxiang, ZOU Bowei, YE Jing, ZHOU Guodong, ZHU Qiaoming

2019, 33(1): 25-34.

Negative expressions are common in natural language text and play a critical role in various applications of natural language processing, such as sentiment analysis and information extraction. Negation focus identification is a finer-grained negative semantic analysis task, which aims at identifying the text fragment modified and emphasized by a negative keyword. Treating negation focus identification as a sequence labeling task, we propose a bidirectional Long Short-Term Memory network with a Conditional Random Field layer (BiLSTM-CRF). It learns contextual information from both directions and, through the CRF layer, the dependencies between output tags. Experimental results on the *SEM2012 dataset show that our approach achieves an accuracy of 69.58%, a 2.44% improvement over the state-of-the-art methods.

Language Analysis and Calculation

RNN Based Chinese Parsing for Binary Tree Structure

GU Bo, WANG Ruibo, LI Jihong, LI Guochen

2019, 33(1): 35-45.

We construct a binary Chinese treebank of 30 000 sentences based on the Chinese syntactic theory proposed by Zhu DeXi and Lu JianMin, in which each parse is a full binary tree and is represented by Huffman coding for simplicity. To parse it, we propose a sequential labeling model (RNN-Interval, abbreviated RNN-INT) based on an RNN (recurrent neural network) that tags the intervals between words. We compared our RNN-INT model with primary RNN, LSTM and CRF models, employing the m×2 cross-validated sequential t-test. The experimental results show that the proposed model achieves the best performance with window size 1 according to constituency F₁ and sentence accuracy, i.e. 71.25% and 43%, respectively.

Language Analysis and Calculation

New Word Detection in Ancient Chinese Corpus

LIU Yutong, WU Bin, XIE Tao, WANG Bai

2019, 33(1): 46-55.

New word detection, as a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. In this work, we present an AP-LSTM-CRF model to discover new words in ancient Chinese literature. This model consists of three steps. First, the parallelized improved-Apriori algorithm, implemented on Apache Spark (a distributed parallel computing framework), is used to generate candidate character sequences from a large-scale raw corpus. Second, a segmentation model that combines a recurrent neural network and a conditional random field is used to generate segmentation sequences with probabilities. Third, we design a rule-based filter to remove noise words from the candidate character sequences. Experimental results demonstrate that the method can effectively detect new words in large-scale ancient Chinese corpora. The F₁ score reaches 89.68% and 81.13% on the Song Poetry dataset and the History of the Song Dynasty dataset, respectively.
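
The first step can be pictured with a small single-machine sketch of level-wise (Apriori-style) mining of frequent character sequences. This is only an illustration under stated assumptions: the function name `frequent_sequences`, the thresholds, and the Latin-letter toy string are hypothetical, and neither the Spark parallelization nor the authors' specific improvements to Apriori are modeled.

```python
from collections import Counter


def frequent_sequences(text, min_count, max_len):
    """Level-wise (Apriori-style) mining of frequent character n-grams."""
    frequent = {}
    prev = set()  # frequent sequences of the previous length
    for n in range(1, max_len + 1):
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Apriori pruning: a sequence can only be frequent if both of
            # its (n-1)-length sub-sequences were frequent.
            if n > 1 and (gram[:-1] not in prev or gram[1:] not in prev):
                continue
            counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_count}
        if not prev:
            break
        frequent.update({g: counts[g] for g in prev})
    return frequent


# Hypothetical placeholder text standing in for a raw corpus.
candidates = frequent_sequences("abcabcabd", min_count=2, max_len=3)
```

Here `candidates` holds every sequence of up to three characters that occurs at least twice; sequences containing the rare character `d` are discarded without being counted at the longer lengths.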

Language Analysis and Calculation

Refining Word Vector Representation with Reliable Lexical Semantic Constraints

LIANG Yongshi, HUANG Peijie, HUANG Peisong, DU Zefeng

2019, 33(1): 56-67.

Word vector representation is the basis for various natural language processing (NLP) systems. Studies have shown that word vectors trained on large corpora can be refined by semantic constraints from various lexical taxonomies. Based on the interaction between lexicons and vectors and the interaction among heterogeneous taxonomies, we present a method for extracting reliable lexical semantic constraints to better refine word vector representations. In this method, the word class knowledge from lexical taxonomies is assessed for reliability through word vector calculation. Experimental results on PKU 500 from the NLPCC-ICCPOL 2016 shared task on Chinese word similarity measurement show that the proposed method achieves a Spearman score of 0.649 7 in word similarity calculation, a 25.4% improvement over the best result in the shared task.

Language Analysis and Calculation

Feature Enhanced CNN for Rhetorical Questions Identification

WEN Zhi, LI Yang, WANG Suge, LIAO Jian, CHEN Xin

2019, 33(1): 68-76.

The rhetorical question is a kind of expression that carries strong emotion. To automatically identify rhetorical questions, this paper proposes a CNN method that incorporates the sentence structure of rhetorical questions. First, candidate rhetorical questions are selected from microblogs according to feature words and sequence pattern features (>70% confidence). Then, word vectors and the features of the rhetorical questions are extracted to generate representations by multiple convolution kernels. Finally, a softmax classifier is used to classify the sentences. The experimental results show that the proposed method achieves 89.5%, 84.2% and 86.7% in terms of accuracy, recall, and F-measure, respectively.

Language Analysis and Calculation

Research on Automatic Summarization Coherence Based on Discourse Rhetoric Structure

LIU Kai, WANG Hongling

2019, 33(1): 77-84.

To improve the readability of automatic summaries, this article applies discourse rhetorical structure information to Chinese automatic summarization. First, abstracts are extracted based on the rhetorical structure of Chinese texts. Then LSTM-based methods are adopted to evaluate the coherence of the abstracts. The experimental results show that automatic abstraction based on discourse rhetorical structure achieves better ROUGE values than traditional methods. The coherence evaluation results show that the discourse structure information can help the system extract the subject of the article automatically.

Machine Translation

Application of Sub-word Segmentation in Mongolian-Chinese Neural Machine Translation

REN Zhong, HOU Hongxu, JI Yatu, WU Ziyu, BAI Tiangang, LEI Ying

2019, 33(1): 85-92.

In Mongolian-Chinese neural machine translation, data sparsity has a substantial effect on translation quality. This paper applies sub-word granularity segmentation to the Mongolian-Chinese neural machine translation model. The Byte Pair Encoding algorithm is adopted to alleviate data sparseness by splitting low-frequency words into relatively high-frequency sub-units. Experiments show that the sub-word segmentation technique can improve Mongolian-Chinese neural machine translation, achieving 4.81 and 2.96 improvements in BLEU score, respectively.
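
As a rough illustration of how Byte Pair Encoding turns rare words into more frequent sub-units, the following minimal Python sketch learns merge operations from a toy word-frequency table. The English toy vocabulary, the function names, the `</w>` end-of-word marker and the tie-breaking rule are assumptions made for illustration; this is not the authors' implementation or data.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(pair, vocab):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge list and the final segmented vocabulary."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Hypothetical word frequencies, not taken from the paper.
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy, num_merges=3)
```

After three merges, the shared ending of "newest" and "widest" has become the single sub-unit `est</w>`, which is more frequent than either word on its own.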

Machine Translation

On Storage Compression for Neural Machine Translation

LIN Ye, JIANG Yufan, XIAO Tong, LI Hengyu

2019, 33(1): 93-102.

Model storage compression aims to significantly reduce the storage cost by removing redundant model parameters without quality loss. Previous efforts have mostly been devoted to computer vision tasks, leaving neural machine translation less explored. In this paper, we compare model compression methods, including pruning, quantization, and low-precision compression, on Transformer and RNN models. Finally, we achieve 5.8× and 11.7× compression ratios on the Transformer and RNN models, respectively, by a combined approach, while maintaining the same BLEU score.
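
Two of the generic techniques named above can be sketched in a few lines of Python. The abstract does not specify the variants used in the paper, so magnitude-based pruning and uniform 8-bit quantization are assumed here purely for illustration, and the weight values and function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest |w|
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_uniform(weights, bits=8):
    """Map floats to integer codes on a uniform grid spanning [min, max]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights, not taken from any real model.
weights = [0.8, -0.05, 0.3, -0.6, 0.02, 0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
codes, lo, scale = quantize_uniform(pruned, bits=8)
restored = dequantize(codes, lo, scale)
```

Storing `codes` as 8-bit integers plus the two floats `lo` and `scale` takes roughly a quarter of the space of 32-bit floats, and the pruned zeros can be stored sparsely on top of that; the round-trip error per weight is at most half a grid step.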

Ethnic Language Processing and Cross Language Processing

Uyghur Dependency Treebank Based on Chinese-Uyghur Mapping

TURGHUN Osman, YANG Yating, WANG Lei, ZHOU Xi, CHENG Li

2019, 33(1): 103-110.

This paper presents a novel approach to constructing a Uyghur dependency treebank using a Chinese-Uyghur bilingual corpus. The Chinese dependency information is mapped onto Uyghur sentences through word alignment. The Uyghur dependency information is then further optimized by morphological constraints. Experimental results on the CoNLL 2017 Shared Task dataset show that the proposed Uyghur parsing model can achieve 34.38% and 52.53% improvements in terms of LAS and UAS accuracy, respectively.
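
The mapping step can be illustrated with a minimal sketch of dependency projection through a one-to-one word alignment. The function name, the data layout and the abstract three-token example are assumptions for illustration; real alignments are often many-to-many, and the morphological optimization described above is not modeled.

```python
def project_dependencies(src_heads, alignment):
    """Project source-side heads onto target tokens via a 1-to-1 alignment.

    src_heads[i] is the head index of source token i (-1 marks the root).
    alignment maps a source token index to a target token index.
    Returns {target_index: target_head_index}, with -1 for the root.
    Arcs with an unaligned dependent or head are skipped.
    """
    tgt_heads = {}
    for dep, head in enumerate(src_heads):
        if dep not in alignment:
            continue
        if head == -1:
            tgt_heads[alignment[dep]] = -1
        elif head in alignment:
            tgt_heads[alignment[dep]] = alignment[head]
    return tgt_heads


# Abstract example: a subject-verb-object source sentence whose verb
# (index 1) is the root, aligned to a subject-object-verb target sentence.
src_heads = [1, -1, 1]            # S -> V, V = root, O -> V
alignment = {0: 0, 1: 2, 2: 1}    # S->0, V->2, O->1 on the target side
projected = project_dependencies(src_heads, alignment)
```

In the projected tree the target-side verb (index 2) is the root and governs both the subject and the object, mirroring the source structure despite the different word order.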

Ethnic Language Processing and Cross Language Processing

A CNN Model for Tibetan Syllable Spell Checking

SE Chajia, GONG Baocairang, CAI Rangjia

2019, 33(1): 111-117.

Spell checking of Tibetan syllables is a basic task of Tibetan natural language processing. This paper proposes a method of syllable vectorization, called the syllable matrix, designed for the structure of Tibetan syllables. Then, a CNN model for spell checking is trained using 1 364 880 Tibetan syllables. The final test on 68 244 Tibetan syllables shows that the CNN model for Tibetan syllable spell checking outperforms the TSRM, RNN and LSTM models, achieving 99.52%, 99.30% and 99.41% in terms of accuracy, recall and F value, respectively.

Question-answering and Dialogue System

A Ranking Model for Answer Selection with Deep Matching Features

FENG Wenzheng, TANG Jie

2019, 33(1): 118-124.

Answer selection is one of the key tasks in question answering systems. Its main purpose is to rank the candidate answers according to their similarity to the question and to select the more relevant answers for users. It can be seen as a text pair matching problem. In this paper, we use deep learning models such as word embedding, bidirectional LSTM, 2D neural network and so on to extract semantic matching features for question-answer pairs, and incorporate these into a ranking model together with traditional NLP features. Experiments on the Qatar Living community question answering data show that the answer selection model with deep matching features is about 5% higher in MAP than a model using only traditional features.
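
For readers unfamiliar with the MAP metric reported above, the following short sketch computes mean average precision from ranked binary relevance labels. The function names and the two toy rankings are hypothetical and do not come from the Qatar Living data.

```python
def average_precision(relevance):
    """Average precision for one question.

    `relevance` lists 0/1 labels of the candidate answers in ranked order.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-question average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical questions with ranked candidate answers.
rankings = [[1, 0, 1], [0, 1]]
map_score = mean_average_precision(rankings)
```

The first question scores (1/1 + 2/3) / 2 and the second 1/2, so `map_score` is their mean, 2/3. A better ranker moves relevant answers toward the top and raises this value.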

Question-answering and Dialogue System

An Attention Based Contextual QA Pairing Method

WANG Lu, ZHANG Lu, LI Shoushan, ZHOU Guodong

2019, 33(1): 125-132.

To deal with informal texts in social media, where a question text contains several questions and the answer text contains several answers, we propose a new task named QA pairing, which aims to identify the answer sentence(s) for each question. First, we build a novel QA pairing corpus of informal text collected from a product review website. Then, to handle the noise in informal text, we propose a novel QA pairing approach, namely a contextual QA pairing method based on an attention network. Empirical studies demonstrate the effectiveness of the proposed approach to QA pairing.

Sentiment Analysis and Social Computing

Chinese Text Sentiment Orientation Analysis Based on Convolution Neural Network and Hierarchical Attention Network

CHENG Yan, YE Ziming, WANG Mingwen, ZHANG Qiang, ZHANG Guanghe

2019, 33(1): 133-142.

Text sentiment orientation analysis is a fundamental problem in natural language processing. To further improve the deep learning-based models used for this problem, this paper proposes a new model named C-HAN, i.e. Convolutional Neural Network-based and Hierarchical Attention Network-based Chinese Sentiment Classification Model. It utilizes a convolution layer to extract a sequence of higher-level phrase representations, which are then fed into a hierarchical attention network to obtain the final representations. On the Chinese sentiment analysis corpus, the character-level C-HAN achieves a sentiment prediction accuracy of 92.34%, slightly better than the word-level C-HAN, which yields 91.96% accuracy.

2019 Volume 33 Issue 1 Published: 21 January 2019