Journal of Chinese Information Processing

Language Analysis and Calculation

GMM-based Automatic Annotation of Chinese Constructions

HUANG Haibin, CHANG Baobao, ZHAN Weidong

2020, 34(9): 1-8.

This paper introduces an approach to the automatic annotation of Chinese constructions. Without annotated corpora as training data, it is difficult to extract knowledge of the various constructions. To address this issue, we apply an unsupervised method based on the Gaussian Mixture Model, combining token position features, linguistic features of constructions, and regular expressions to capture the structure of a construction, especially when its boundaries are hard to identify. Compared with annotation based on regular expressions and part-of-speech tags, the proposed method improves F₁ by 17.9% (for semi-concretionary constructions), 19.3% (for phrasal constructions) and 14.9% (for sentential constructions).

Language Resources Construction

A Fine-grained Evaluation Set for Chinese POS Tagging

TANG Qiantong, CHANG Baobao, ZHAN Weidong

2020, 34(9): 9-18.

This paper proposes a fine-grained evaluation scheme for Chinese POS tagging. The key to this task is to determine the evaluation items and the samples (words) for each item. This paper presents an evaluation set of 5,873 sentences covering 2,326 words across 70 evaluation items. Several common open-source POS taggers are evaluated. Finally, this paper discusses the merits of this evaluation approach, especially in contrast to classical methods.

Language Resources Construction

Construction of Semantic Role Bank for Chinese Verbs from the Perspective of Ternary Collocation

WANG Chengwen, QIAN Qingqing, XUN Endong, XING Dan, LI Meng, RAO Gaoqi

2020, 34(9): 19-27.

Research on semantic roles has always been a significant challenge in linguistics. Some resources on semantic relations have been constructed; however, most domestic research on Chinese word semantic relations focuses on labeling. This paper proposes a novel structure, the ternary collocation, to describe semantic relations with verbs at the core. The paper also puts forward a semantic role classification scheme, under which a semantic role bank for Chinese verbs is constructed. For every verb involved, the possible semantic roles are exhaustively identified and annotated together with other related knowledge. Altogether 5,260 verbs are collected, of which 2,685 are assigned 4,307 semantic roles together with their guiding words.

Language Resources Construction

Construction and Analysis of Fine-grained Car Review Corpus

CAO Ziyan, FENG Minxuan, MAO Xuefen, CHENG Ning, SONG Yang, LI Bin

2020, 34(9): 28-35.

Product reviews are an important research object in sentiment analysis. At present, most existing product review corpora are relatively coarse-grained, and the three elements of target, attribute and polarity are not always annotated. This paper constructs a fine-grained sentiment corpus of 9,343 short texts from car reviews. The target, attribute and polarity are all annotated on specific words and further associated with the ontology tree of products and attributes. Implicit expressions without sentiment words and special texts (such as suggestions and comparative sentences) are also annotated with specific labels and corresponding triples. The statistics show that the co-occurrence of target and attribute is as high as 77.54%, indicating that complete annotation is necessary for sentiment corpora. An experiment on automatic annotation achieves an F1-score of up to 70.82%.

Information Extraction and Text Mining

An Improved TextRank for Tibetan Summarization

LI Wei, YAN Xiaodong, XIE Xiaoqing

2020, 34(9): 36-43.

For Tibetan text summarization, this paper proposes an improved TextRank for extractive summarization. The method integrates information from an external corpus into the TextRank algorithm in the form of word vectors. Each sentence is represented by the vectors of its words, and the resulting sentence vector is used for sentence scoring. We select the sentences with the highest scores and reorder them to form a summary of the text. The experimental results demonstrate that the method effectively improves summary quality according to the ROUGE evaluation method.
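
The pipeline described in this abstract (word vectors, sentence vectors, graph scoring, selection, reordering) can be sketched as follows. This is a minimal illustration and not the authors' implementation: averaging word vectors to build a sentence vector, cosine similarity as the edge weight, and the function names `sentence_vector`, `cosine` and `textrank_summary` are all assumptions made here for illustration.

```python
import math

def sentence_vector(words, word_vectors):
    """Average the vectors of the known words (an assumed composition rule)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def textrank_summary(sentences, word_vectors, top_k=2, damping=0.85, iters=50):
    """Score tokenized sentences with weighted TextRank and return the
    top_k sentences in their original document order."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = [sentence_vector(s, word_vectors) for s in sentences]
    # Edge weights: non-negative cosine similarity between sentence vectors.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and vecs[i] is not None and vecs[j] is not None:
                weights[i][j] = max(cosine(vecs[i], vecs[j]), 0.0)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # restore document order
```

Sentences are passed as lists of tokens, so a real Tibetan pipeline would need a tokenizer and pretrained word vectors in front of this step; both are outside the scope of the sketch.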

Information Extraction and Text Mining

TextRank Keyword Extraction Algorithm Based on Rough Data-Deduction

ZHOU Ning, SHI Wenqian, ZHU Zhaozhao

2020, 34(9): 44-52.

The TextRank algorithm, which is based on a graph model, is an effective keyword extraction algorithm with high accuracy. However, when constructing the edges of the graph, the algorithm adopts a co-occurrence window rule that considers only the association between local words, which introduces considerable randomness and uncertainty. To address this issue, an improved TextRank keyword extraction algorithm based on rough data-deduction is proposed. In this method, candidate keywords are classified according to word meanings, and the association between candidate words in different classes is deduced by rough data-deduction. The experimental results show that the improved algorithm achieves significantly higher extraction precision.
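
To make the limitation concrete, the sketch below shows only the baseline co-occurrence window rule of standard TextRank, in which two words are linked solely when they appear within a few tokens of each other. It does not implement the rough data-deduction step, whose details the abstract does not give. The function names and the default window size are assumptions for illustration.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link each token to the distinct tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph

def textrank_keywords(tokens, window=2, damping=0.85, iters=50, top_k=3):
    """Rank words with unweighted TextRank over the co-occurrence graph."""
    graph = cooccurrence_graph(tokens, window)
    scores = {word: 1.0 for word in graph}
    for _ in range(iters):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[nb] / len(graph[nb]) for nb in graph[word]
            )
            for word in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because edges come only from the window, two related words that never occur near each other stay unconnected, which is the locality problem the abstract describes.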

Information Extraction and Text Mining

Image Captioning Based on Bidirectional Attention Mechanism

ZHANG Jiashuo, HONG Yu, LI Zhifeng, YAO Jianmin, ZHU Qiaoming

2020, 34(9): 53-61.

The attention-based encoder-decoder framework is widely used in image captioning. In previous methods, the single-directional attention mechanism does not check the consistency between semantic information and image content, which lowers the accuracy of the generated captions. To solve this problem, this paper proposes an image captioning method based on a bi-directional attention mechanism. On top of the single-directional attention mechanism, attention is also computed from the image features to the semantic information, enabling the image and the semantic information to interact in both directions. This paper designs a gated network to fuse the information from the two directions. In contrast to previous studies, this paper uses historical semantic information in the attention module to assist in generating the current word. Using two types of image features, the experimental results show that on the MSCOCO dataset the BLEU4 score increases by 1.3 and the CIDEr score by 6.3 on average. On Flickr30k, the BLEU4 score increases by 0.9 and the CIDEr score by 2.4 on average.
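
The abstract does not specify the form of the gated network. One common formulation, shown here purely as an assumption, computes an element-wise sigmoid gate from the concatenation of the two attention outputs and uses it to interpolate between them. The names `gated_fusion`, `gate_weights` and `gate_bias` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(a, b, gate_weights, gate_bias):
    """Fuse two equal-length vectors with an element-wise gate.

    gate  = sigmoid(W [a; b] + bias)
    fused = gate * a + (1 - gate) * b

    gate_weights has len(a) rows, each of length len(a) + len(b).
    """
    concat = list(a) + list(b)
    fused = []
    for i in range(len(a)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * a[i] + (1 - g) * b[i])
    return fused
```

With all-zero gate parameters the gate is 0.5 everywhere and the result is the plain average of the two inputs; training would move the gate toward whichever direction is more informative for each dimension.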

Sentiment Analysis and Social Computing

Feature-extended CNN Based Opinion Sentence Recognition from Case Related Microblog

WANG Xiaohan, YU Zhengtao, XIANG Yan, GUO Xianwei, HUANG Yuxin

2020, 34(9): 62-69.

In case-related microblogs, opinion sentence recognition should consider whether a comment discusses the topic of the specific case. To address this issue, this paper proposes an opinion sentence recognition model that incorporates the microblog content as a feature. Within a CNN framework, the keyword vectors of the case-related microblog are concatenated with the corresponding comment word vectors at the input layer. Experiments show that the accuracy of the model on two datasets of case-related microblogs reaches 84.74% and 82.09%, respectively, a significant improvement over existing benchmarks.
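
A rough sketch of the feature extension at the input layer is given below. It assumes the keyword vector is appended along the feature dimension of every comment word vector, which is one plausible reading of the abstract rather than a confirmed detail, and it follows the extension with a single convolution filter, ReLU and max-over-time pooling. All function names are hypothetical.

```python
def extend_features(comment_vectors, keyword_vector):
    """Append the microblog keyword vector to every comment word vector."""
    return [list(wv) + list(keyword_vector) for wv in comment_vectors]

def conv_max_pool(rows, kernel):
    """Apply one convolution filter spanning len(kernel) consecutive rows,
    then ReLU and max-over-time pooling.

    Assumes len(rows) >= len(kernel) and that each kernel row has the same
    length as each input row.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(rows) - k + 1):
        total = 0.0
        for offset in range(k):
            total += sum(w * x for w, x in zip(kernel[offset], rows[start + offset]))
        outputs.append(max(total, 0.0))  # ReLU
    return max(outputs)
```

A full model would use many filters of several widths and a classification layer on top of the pooled features; the sketch keeps one filter so the effect of the extra keyword dimensions is easy to trace.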

Sentiment Analysis and Social Computing

Helical Attention Networks for Aspect-level Sentiment Classification

DU Chengyu, LIU Pengyuan

2020, 34(9): 70-77.

Aspect-level sentiment classification is a fine-grained sentiment analysis task that aims to identify the sentiment polarity toward a particular aspect. This paper proposes BERT-based Helical Attention Networks (BHAN), which employ a helical attention mechanism to obtain better representations of the context and the aspect. Specifically, we first obtain a weighted context representation based on the averaged aspect vector and use it to compute the attention weights of the aspect. We then use the new weighted aspect representation to compute the context attention weights again. Iterating this process until convergence yields better representations of the context and the aspect. Evaluated on SemEval 2014 Task 4 and a Twitter dataset, the proposed method outperforms the existing state-of-the-art methods.
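
The alternating update described in this abstract can be sketched as below. This is an illustration under several assumptions, not the BHAN model: it uses plain dot-product attention, takes raw vectors instead of BERT outputs, and runs a fixed number of rounds rather than testing for convergence. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: return the attention-weighted average of `keys`."""
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def helical_attention(context, aspect, rounds=3):
    """Alternate between context and aspect attention, starting from the
    averaged aspect vector. Requires rounds >= 1."""
    aspect_repr = mean_vector(aspect)
    context_repr = None
    for _ in range(rounds):
        context_repr = attend(aspect_repr, context)  # aspect guides context
        aspect_repr = attend(context_repr, aspect)   # context guides aspect
    return context_repr, aspect_repr
```

Each round refines one representation using the other, which is the spiral the name refers to; a convergence check on successive representations could replace the fixed `rounds` parameter.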

Sentiment Analysis and Social Computing

An Improved Generative Adversarial Network for Rumor Detection

LI Ao, DAN Zhiping, DONG Fangmin, LIU Longwen, FENG Yang

2020, 34(9): 78-88.

Existing rumor detection algorithms, including general sequential models, are deficient in capturing text semantics and detecting key features, resulting in poor generalization capability. To address this issue, this paper proposes an improved generative adversarial network model named TGBiA for rumor detection. TGBiA adopts adversarial training to capture the augmentation, detraction, exaggeration and distortion that a rumor undergoes as it spreads. The generator extracts sequence semantics and features with a Transformer instead of an RNN. The discriminator is a BiLSTM-based classification model with an attention mechanism. Through the mutual promotion of the generator and the discriminator, the model progressively learns the indicative features of rumors. Experimental results on the Weibo and Twitter datasets show that the proposed method not only outperforms other existing detection methods but is also more robust.

Sentiment Analysis and Social Computing

Argument Recognition Based on Generative Adversarial Networks

YANG Liang, ZHOU Fengqing, ZHANG Li, MAO Guoqing, YI Bin, LIN Hongfei

2020, 34(9): 89-96.

In judicial trials, the prosecution and the defense often hold different views on the arguments of a case, which are also key factors in the final judgment. To identify the arguments in cases, this paper introduces a text summarization model, since the composition of an argument mostly depends on the analysis and summary of the case text. We construct an argument generation model by incorporating a generative adversarial network, and then obtain the arguments of the case. Experiments on real judicial data obtained from the China Judgements Online website show that the proposed model improves accuracy in the argument recognition task. In practice, this method can assist procuratorial personnel in pre-trial planning and in the trial itself.

NLP Application

An Analysis of Authorship of A Dream of Red Mansions Based on Optimal Document Embedding

XUE Yang, LIANG Xun, XIE Hualun, DU Wei

2020, 34(9): 97-110.

A document embedding model is designed and trained over a corpus of 51 contemporary and Ming and Qing literary works, including A Dream of Red Mansions. To obtain the optimal high-dimensional document embedding vector representing the semantic characteristics of words and document topics, the document embedding matrix and the loss function of different authors are defined according to the unitary invariance of the document embedding vector. An authorship identification method is designed using an unsupervised manifold-learning dimensionality reduction algorithm and a supervised classification algorithm. The classification accuracy for known authors reaches 99.6%, and even authors with similar styles, such as Lu Yao and Chen Zhongshi, can be effectively distinguished. A variable-scale sliding window classification model is further proposed to conduct an in-depth analysis of A Dream of Red Mansions. It is found that the first 80 chapters and the last 40 chapters may come from different authors, and that there are also some style differences between the first 100 and the last 20 chapters.

2020 Volume 34 Issue 9 Published: 12 October 2020