2021 Volume 35 Issue 7 Published: 30 July 2021
  

  • Survey
    WU Yunfang, ZHANG Yangsen
    2021, 35(7): 1-9.
    Question generation (QG) aims to automatically generate fluent and semantically relevant questions for a given text. QG can be applied to generate questions for reading comprehension tests in the education field, and to enhance question answering and dialog systems. This paper presents a comprehensive survey of research on QG. We first describe the significance of QG and its applications, especially in education. Then we outline traditional rule-based QG methods and describe neural network based models in detail from different perspectives. We also introduce the evaluation metrics for generated questions. Finally, we discuss the limitations of previous studies and suggest directions for future work.
  • Survey
    SUN Yi, QIU Hangping, ZHENG Yu, ZHANG Chaoran, HAO Chao
    2021, 35(7): 10-29.
    Introducing knowledge into data-driven artificial intelligence models is an important way to realize human-machine hybrid intelligence. Pre-trained language models represented by BERT have achieved remarkable success in natural language processing; however, they are trained on large-scale unstructured corpora, and external knowledge needs to be introduced to alleviate their defects in determinacy and interpretability. This paper analyzes the characteristics and limitations of two kinds of pre-trained language models, pre-trained word embeddings and pre-trained context encoders, and explains the related concepts of knowledge enhancement. Four types of knowledge enhancement methods for pre-trained word embeddings are summarized and analyzed: retrofitting of pre-trained word embeddings, hierarchizing the encoding and decoding process, attention mechanism optimization, and introduction of knowledge memory. Knowledge enhancement methods for pre-trained context encoders are described from two perspectives: 1) task-specific versus task-agnostic; 2) explicit versus implicit knowledge. This summary and analysis provides basic patterns and algorithms for human-machine hybrid artificial intelligence.
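    Of the four enhancement routes above, embedding retrofitting is the most self-contained. Below is a minimal Faruqui-style retrofitting sketch, assuming word vectors as a dict of NumPy arrays and a synonym lexicon; the weighting scheme and names are illustrative, not the survey's formulation.
```python
import numpy as np

def retrofit(embeddings, lexicon, iterations=10, alpha=1.0):
    """Nudge pre-trained vectors toward their lexicon neighbours.

    embeddings: dict word -> np.ndarray (the original vectors)
    lexicon:    dict word -> list of related words (e.g. synonyms)
    """
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if word not in new:
                continue
            nbrs = [n for n in neighbours if n in new]
            if not nbrs:
                continue
            # Neighbours pull with weight 1 each; the original vector
            # anchors with weight alpha per neighbour.
            pull = np.sum([new[n] for n in nbrs], axis=0)
            new[word] = (pull + alpha * len(nbrs) * embeddings[word]) / ((1 + alpha) * len(nbrs))
    return new
```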
  • Language Analysis and Calculation
    DU Qianlong, ZONG Chengqing, SU Keh-Yih
    2021, 35(7): 30-40.
    Recognizing textual entailment is the task of detecting whether a given text passage can be inferred from another passage. During the inference process, the sense of each word plays an important role in understanding the meaning of the passages and predicting the relationship of the passage pair. In this paper, we propose a novel approach to incorporate word sense information into the inference mechanism. We first use a word sense disambiguation system to generate the sense of each content word, and then use this sense information to improve the representations of the passages and enhance the prediction of the entailment relationship of the passage pair. Experimental results show that the proposed approach effectively improves performance.
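    A minimal sketch of the sense-augmentation step as described: each content word's embedding is concatenated with the embedding of the sense assigned by a WSD system. The interfaces (the wsd callable, embedding dicts, and dimensions) are assumptions for illustration only.
```python
import numpy as np

def sense_enhanced_repr(tokens, word_emb, sense_emb, wsd):
    """Build per-token vectors: word embedding + sense embedding."""
    vecs = []
    for tok in tokens:
        w = word_emb.get(tok, np.zeros(300))
        sense = wsd(tok, tokens)              # hypothetical WSD call, e.g. a synset id
        s = sense_emb.get(sense, np.zeros(100))
        vecs.append(np.concatenate([w, s]))   # sense-enriched token vector
    return np.stack(vecs)                     # shape: (len(tokens), 400)
```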
  • Language Resources Construction
    XING Fugui, ZHU Tingshao
    2021, 35(7): 41-46.
    Word segmentation for classical Chinese is an important step in analyzing existing ancient documents. In this paper, we first collect an unstructured classical Chinese online corpus and accumulate a basic dictionary. Then candidate new words are discovered by a multi-feature fusion strategy, including mutual information, information entropy, and position word probability. Finally, the resulting CCIDict of 349,740 words is applied with forward maximum matching to segment classical Chinese texts, achieving a 14% improvement in F-value over the open-source Jiayan toolkit.
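    The final lookup step is standard forward maximum matching; a minimal sketch, with the dictionary as a Python set and an illustrative cap on word length:
```python
def forward_maximum_matching(text, dictionary, max_len=8):
    """Greedy left-to-right segmentation: take the longest dictionary
    match at each position, falling back to a single character when
    nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```
    For example, forward_maximum_matching("天下大势", {"天下", "大势"}) yields ["天下", "大势"].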
  • Language Resources Construction
    MAO Cunli, LU Shan, WANG Hongbin, YU Zhengtao, WU Xia, WANG Zhenhan
    2021, 35(7): 47-53.
    The Chinese-Burmese bilingual dictionary is an important data resource for research on machine translation, cross-language retrieval, etc. At present, iterative self-learning methods based on a small-scale seed dictionary have achieved good results in extracting bilingual dictionaries from parallel corpora. However, for a low-resource pair such as Chinese-Burmese, the lack of bilingual parallel resources means that iterative self-learning cannot obtain effective bilingual word vector representations, resulting in low accuracy of the bilingual dictionary extraction model. Recent studies suggest that similar words in comparable corpora often have similar contexts. Therefore, this paper proposes a semi-supervised method for constructing a Chinese-Burmese bilingual dictionary. A pre-trained language model is used to construct context feature vectors for bilingual vocabulary, which semantically enhance the Chinese-Burmese word pairs obtained by iterative self-learning over comparable corpora and a small-scale seed dictionary. Experimental results show that the proposed method achieves a significant improvement over the baseline method.
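    The retrieval step common to such self-learning pipelines is nearest-neighbour search in a shared embedding space; a minimal sketch, assuming the cross-lingual mapping has already been learned (all names illustrative):
```python
import numpy as np

def nearest_translation(word, src_vecs, tgt_matrix, tgt_words):
    """Return the target word whose vector is most cosine-similar.

    src_vecs:   dict word -> vector already mapped into the shared space
    tgt_matrix: (V, d) array of target-language vectors
    tgt_words:  list of V target-language words, row-aligned with tgt_matrix
    """
    v = src_vecs[word]
    v = v / np.linalg.norm(v)
    t = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(t @ v))]   # highest cosine similarity
```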
  • Machine Translation
    LI Xinguang, CHEN Shuai, LONG Xiaolan
    2021, 35(7): 54-62.
    This paper proposes a sentence-based automatic scoring method for Chinese-English oral translation. Three main indicators are designed, evaluating keywords, the general idea of the sentence, and fluency. For keywords, synonym analysis is applied to identify synonyms among candidate keywords. At the sentence level, the translation is evaluated by an Unfolding Recursive Auto-Encoder (URAE). Fluency is scored by speech rate. Finally, the weighted sum of the three indicators is taken as the overall translation quality score. Experimental results demonstrate that this automatic scoring method is highly consistent with manual scoring.
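    The final combination is a plain weighted sum; a trivial sketch in which the weights are placeholders, not the paper's tuned values:
```python
def overall_score(keyword, sentence, fluency, weights=(0.4, 0.4, 0.2)):
    """Combine the three indicator scores (each assumed in [0, 1])
    into one overall translation quality score."""
    return weights[0] * keyword + weights[1] * sentence + weights[2] * fluency
```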
  • Ethnic Language Processing and Cross Language Processing
    JIAMILA Wushouer, WU Di, WANG Lulu, GULINIGEER Abudouwaili, MAIHEMUTI Maimaiti, TUERGEN Yibulayin
    2021, 35(7): 63-71.
    Uyghur is morphologically rich and resource-scarce, which challenges existing deep learning models for Uyghur text classification. This paper proposes a text classification model called MDPLC that combines Bi-LSTM+CNN and DPCNN. First, the pre-trained word vectors are fused with the semantic information processed by a Bi-LSTM to obtain the semantic dependencies of the whole sentence, and local semantic learning is further strengthened by a layer of pooled CNN. Meanwhile, text semantic information is captured by a multi-convolution-kernel DPCNN in a dual-channel fashion. Experiments on short- and long-text datasets in Chinese, English, and Uyghur show that the proposed model achieves better accuracy than existing popular deep learning models.
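    A rough PyTorch sketch of the dual-channel idea: one Bi-LSTM+CNN channel and one convolutional stack standing in for DPCNN. All dimensions are illustrative, and the real MDPLC details differ.
```python
import torch
import torch.nn as nn

class DualChannel(nn.Module):
    """Two channels over the same tokens: Bi-LSTM+CNN and a DPCNN stand-in."""
    def __init__(self, vocab, dim=128, classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.cnn = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.dp = nn.Sequential(                     # stand-in for DPCNN
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(2 * dim, classes)

    def forward(self, x):                            # x: (B, T) token ids
        e = self.emb(x)                              # (B, T, D)
        h, _ = self.lstm(e)                          # (B, T, 2D)
        c1 = torch.relu(self.cnn(h.transpose(1, 2))).max(dim=2).values  # pooled CNN channel
        c2 = self.dp(e.transpose(1, 2)).max(dim=2).values               # DPCNN-style channel
        return self.fc(torch.cat([c1, c2], dim=1))   # fuse both channels
```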
  • Ethnic Language Processing and Cross Language Processing
    CUI Zhiyuan, ZHAO Erping, LUO Weiqun, WANG Wei, SUN Hao
    2021, 35(7): 72-80.
    Domain-specific corpora such as a Tibetan animal husbandry corpus contain many unknown words formed by direct transliteration or synthesis. To improve word segmentation for such corpora, this paper proposes a Chinese word segmentation model based on multi-head attention. To capture the dependency relations and syncopation-point information between key character vectors, the multi-head attention mechanism computes, in parallel, the correlations between important character vectors and all other character vectors regardless of the distance between them. A conditional random field then models the lexeme labels to obtain the optimal segmentation sequence. Finally, a domain dictionary is constructed to further improve segmentation. Experiments on the Tibetan animal husbandry corpus show that, compared with classical models such as Bi-LSTM-CRF, the accuracy, recall, and F1 value of the proposed model increase by 3.93%, 5.3%, and 3.63%, respectively.
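    A minimal sketch of the attention-then-tag pipeline: multi-head self-attention relates every character to every other in parallel, and the outputs serve as emission scores for a CRF decoder (omitted here). Sizes and the BMES tag set are assumptions.
```python
import torch
import torch.nn as nn

class AttnSegTagger(nn.Module):
    """Self-attention over character embeddings, then per-character tag scores."""
    def __init__(self, vocab, dim=128, heads=8, tags=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, tags)       # e.g. BMES tags

    def forward(self, chars):                 # chars: (B, T) character ids
        x = self.emb(chars)
        h, _ = self.attn(x, x, x)             # parallel pairwise correlations, any distance
        return self.out(h)                    # emission scores for a CRF layer
```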
  • Information Extraction and Text Mining
    WANG Bingqian, SU Shaoxun, LIANG Tianxin
    2021, 35(7): 81-88.
    Event extraction (EE) refers to the technology of extracting events from natural language texts and identifying event types and event elements. This paper proposes an end-to-end multi-label pointer network for event extraction, in which the event detection task is integrated into the event element recognition task so that event elements and event types are extracted at the same time. This method avoids the error cascading and task isolation problems of traditional pipeline methods, and alleviates the role-overlap and element-overlap problems in event extraction. The proposed method achieves an 85.9% F1 score on the test set of the 2020 Language and Intelligence Challenge event extraction task.
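    A sketch of a multi-label pointer head of the kind described: for every (event type, role) label, independent per-token start/end probabilities are predicted, which is what lets overlapping arguments coexist. The encoder, hidden size, and label count are assumptions.
```python
import torch
import torch.nn as nn

class MultiLabelPointer(nn.Module):
    """Independent start/end pointers per (event type, role) label."""
    def __init__(self, hidden=768, labels=217):   # label count illustrative
        super().__init__()
        self.start = nn.Linear(hidden, labels)
        self.end = nn.Linear(hidden, labels)

    def forward(self, enc):                       # enc: (B, T, hidden) from any encoder
        # Sigmoid (not softmax) per token and label, so multiple spans
        # and overlapping roles can fire at once.
        return torch.sigmoid(self.start(enc)), torch.sigmoid(self.end(enc))
```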
  • Information Extraction and Text Mining
    WANG Yanggang, QIU Xipeng, HUANG Xuanjing, WANG Yining, LI Yunhui
    2021, 35(7): 89-97,108.
    Graph neural networks (GNNs) have recently proved effective for modeling the global context representation of samples, but they suffer from over-smoothing in noisy few-shot text classification scenarios. We propose a dual-channel graph neural network that models full context features while making full use of the label propagation mechanism. A multi-task parameter sharing mechanism is used across the two channels to effectively constrain the graph iteration process. Compared with the baseline graph neural network, our method achieves an average improvement of 1.51% on the FewRel dataset and 11.1% on the ARSC dataset.
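    Purely as an illustration of the dual-channel idea, not the paper's architecture: one message-passing channel propagates features, the other propagates label distributions, and a shared linear layer couples the two in a multi-task fashion.
```python
import torch
import torch.nn as nn

class DualChannelGNN(nn.Module):
    """Feature channel + label-propagation channel with shared weights."""
    def __init__(self, dim=64, classes=5):
        super().__init__()
        self.shared = nn.Linear(dim, dim)          # multi-task shared parameters
        self.label_proj = nn.Linear(classes, dim)  # lift label distributions to feature space
        self.out = nn.Linear(dim, classes)

    def forward(self, adj, feats, labels):
        # adj: (N, N) normalized adjacency; feats: (N, dim);
        # labels: (N, classes), zeros for unlabeled nodes.
        h = torch.relu(self.shared(adj @ feats))                        # feature channel
        l = torch.relu(self.shared(adj @ self.label_proj(labels)))      # label channel
        return self.out(h + l)
```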
  • Information Extraction and Text Mining
    LUO Fang, WANG Jinghang, ZHANG Yuheng, HE Daosen, PU Qiumei
    2021, 35(7): 98-108.
    This paper designs a new temporal summarization extraction method to model the evolution of hot events in social media. Based on an analysis of the evolution stages of hot events, the study explores the time-series and social-influence characteristics of social texts and proposes a novel temporal summarization method, LexRank Summarization with Timeline-Social Influence (LSTS). Experimental results show that LSTS achieves optimal results with a weight ratio of 0.4 between time-series and social influence, reaching 44.23%, 34.78%, and 27.86% in ROUGE-1, ROUGE-2, and ROUGE-S4, respectively.
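    One plausible reading of the scoring step, offered only as a sketch since the abstract does not give the exact formula: blend the time-series and social-influence signals at ratio lam and use the blend to modulate LexRank salience.
```python
def lsts_score(lexrank, recency, social, lam=0.4):
    """Illustrative sentence score: LexRank salience scaled by a
    lam-weighted blend of a recency signal and a social-influence
    signal (all inputs assumed normalized to [0, 1])."""
    return lexrank * (lam * recency + (1 - lam) * social)
```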
  • Machine Reading Comprehension
    LI Fangfang, REN Xingkai, MAO Xingliang, LIN Zhongyao, LIU Xiyao
    2021, 35(7): 109-117,125.
    The combination of artificial intelligence with law has become a hot research issue. Focusing on the machine reading comprehension task of the China AI Law Challenge 2020 (CAIL2020), this paper proposes multi-task joint training of four sub-modules: a word embedding module, an answer extraction module, an answer classification module, and a supporting facts discrimination module. The paper also proposes a data augmentation method based on TF-IDF "question-context" similarity matching, which re-labels the CAIL2019 training set for data augmentation. On the CAIL2020 machine reading comprehension task, the model achieves an F1 value of 74.49, ranking first in the task.
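    The augmentation rests on TF-IDF similarity between questions and contexts; a minimal scikit-learn sketch, where the threshold and best-match pairing policy are assumptions:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_contexts(questions, contexts, threshold=0.3):
    """Pair each question with its most TF-IDF-similar context
    if the cosine score clears a (hypothetical) threshold."""
    vec = TfidfVectorizer().fit(questions + contexts)
    sims = cosine_similarity(vec.transform(questions), vec.transform(contexts))
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= threshold:
            pairs.append((questions[i], contexts[j], float(row[j])))
    return pairs
```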
  • Machine Reading Comprehension
    XU Jiawei, LIU Ruifang, GAO Sheng, LI Si
    2021, 35(7): 118-125.
    To address machine reading comprehension over complex linguistic phenomena such as Chinese idioms, we propose an enhanced global attention module to better perceive the grammatical functions of idioms in different contexts. We adjust the original global attention by generating an extra attention factor for each spatial position, so as to enhance the recognition of specific word senses. We integrate this module with the popular BERT language model for the Chinese cloze task. Results on the recently released cloze-test dataset ChID show that our method achieves significant improvements compared with the fine-tuned BERT model and the global attention model.
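    A sketch of the per-position factor idea: an extra learned factor rescales ordinary global attention weights position by position before renormalization. Shapes and the factor's parameterization are assumptions.
```python
import torch
import torch.nn as nn

class EnhancedGlobalAttention(nn.Module):
    """Global attention with an extra learned per-position factor."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.factor = nn.Linear(dim, 1)       # the extra attention factor

    def forward(self, h):                     # h: (B, T, dim)
        w = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # ordinary global attention
        f = torch.sigmoid(self.factor(h).squeeze(-1))          # per-position rescaling
        w = w * f
        w = w / w.sum(dim=-1, keepdim=True)                    # renormalize
        return torch.einsum('bt,btd->bd', w, h)                # pooled representation
```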
  • Sentiment Analysis and Social Computing
    SONG Wei, WEN Zijian
    2021, 35(7): 126-133.
    Current methods for aspect-based sentiment analysis usually rely on the attention mechanism to model the interaction between sentence and aspect. However, attention often produces mismatches between sentence words and aspect words, introducing extraneous noise. To address this issue, this paper proposes a feature dual distillation network for aspect-based sentiment analysis. First, a BiLSTM extracts context semantic features, and a context-based aspect embedding provides the semantic feature of the aspect. A gate mechanism is then employed to construct a dual distillation gate, in which preliminary and fine distillation processes realize the interaction between the semantic features of sentence and aspect. Finally, Softmax predicts the sentiment polarities. On the commonly used Laptop, Restaurant, and Twitter datasets, the proposed method outperforms state-of-the-art methods with 79.26%, 84.53%, and 75.30% accuracy, and 75.77%, 75.63%, and 73.21% Macro-F1, respectively.
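    A sketch of a single gated distillation step consistent with this description: the aspect vector gates, per context position, how much of the sentence feature passes through; applying two such gates in sequence would mirror the preliminary-plus-fine scheme. Dimensions are illustrative.
```python
import torch
import torch.nn as nn

class DistillGate(nn.Module):
    """Aspect-conditioned gate over context features."""
    def __init__(self, dim=300):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, ctx, aspect):            # ctx: (B, T, D), aspect: (B, D)
        a = aspect.unsqueeze(1).expand_as(ctx) # broadcast aspect over positions
        g = torch.sigmoid(self.gate(torch.cat([ctx, a], dim=-1)))
        return g * ctx                         # distilled context features
```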
  • Natural Language Understanding and Generation
    LIU Xikai, LIN Hongfei, XU Bo, YANG Liang, REN Yuqi
    2021, 35(7): 134-142.
    Response generation is an important component of dialogue systems. To better combine retrieval-based and generation-based models, this paper proposes a response generation model with a retrieved-response fusion mechanism. The model uses a bidirectional LSTM to encode the retrieved response, and proposes a Long Short-Term Memory network with a fusion mechanism (fusion-LSTM) that fuses the retrieval results with the dialogue text inside the model, so as to better integrate the retrieved information into the generative model. Experimental results show that this method outperforms the baseline methods in both automatic and human evaluation.
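    The abstract does not detail the fusion-LSTM cell, so the sketch below is a simplified stand-in: a learned gate blends a retrieved-response summary into the decoder hidden state at each step, rather than modifying the LSTM cell internals as the paper does.
```python
import torch
import torch.nn as nn

class FusionCell(nn.Module):
    """LSTM step followed by a gate that mixes in retrieved information."""
    def __init__(self, dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, state, retrieved):    # x: (B, dim), retrieved: (B, dim)
        h, c = self.cell(x, state)
        g = torch.sigmoid(self.gate(torch.cat([h, retrieved], dim=-1)))
        h = g * h + (1 - g) * retrieved        # fuse retrieval into the hidden state
        return h, (h, c)
```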