2021 Volume 35 Issue 11 Published: 20 November 2021
  

  • Survey
    CHEN Xin, ZHOU Qiang
    2021, 35(11): 1-12.
    As a branch of dialogue systems, open-domain dialogue has good application prospects. Unlike task-oriented dialogue, it exhibits strong randomness and uncertainty. This paper reviews open-domain dialogue research from the perspective of reply generation, focusing on the application and improvement of the sequence-to-sequence model in dialogue generation scenarios. The research shows a clear progression from single-turn to multi-turn dialogue, and we further reveal that in multi-turn dialogue generation, the implementation characteristics of the sequence-to-sequence model do not fully match the application scenarios. Finally, we explore possible improvements for multi-turn dialogue generation: introducing external knowledge, a rewriting mechanism, and an agent mechanism.
  • Survey
    ZHAO Xujian, WANG Chongwei, JIN Peiquan, ZHANG Hui, YANG Chunming, LI Bo
    2021, 35(11): 13-33.
    The complexity of Web information makes it difficult for people to quickly and accurately grasp the storyline of news events. "Storyline mining" has therefore become an active research issue in recent years, aiming to extract the evolutionary stages of events and further explore their evolution patterns by analyzing the correlations between news events and subsequent related events. Storyline mining supports many applications, such as web news retrieval, text summarization, and public opinion monitoring. This paper first outlines the definition, process, and main tasks of storyline mining. Next, the main progress of current studies is reviewed in detail from the aspects of storyline construction and event evolution analysis. We then compare two types of datasets and their evaluation metrics. Finally, several future research directions and technical frameworks for storyline mining are discussed.
  • Language Analysis and Calculation
    ZHU Mengmeng, WU Kaili, HONG Yu, CHEN Xin, ZHANG Min
    2021, 35(11): 34-42.
    Question paraphrase identification aims to determine whether two natural-language questions are semantically equivalent, with semantic understanding as its core issue. Current approaches usually encode each sentence into a vector representation and then compare the two representations to judge equivalence. To better capture what the two questions share and where they differ, this paper proposes a model that integrates semantically orthogonal information: the two questions are decomposed into similar and different representations, which enriches the question representations and realizes multi-granularity fusion. Experiments on two real-world public datasets, LCQMC and Quora, demonstrate the effectiveness of this method.
  • Language Analysis and Calculation
    SONG Ting, GUO Zhancheng, HE Shizhu, LIU Kang, ZHAO Jun, LIU Shengping
    2021, 35(11): 43-50.
    Pre-trained models such as BERT achieve good results on natural language understanding tasks via the random masking strategy and the next-sentence prediction task. To capture the semantic matching relationship between sentences, this paper proposes a pre-trained model based on dynamic word masking. Large-scale sentence pairs are obtained via sentence embeddings, and the important words in them are masked to train a new kind of masked language model. Experiments on four datasets show that the proposed method improves the average accuracy of RBT3 and BERT-base by 1.03% and 0.61%, respectively.
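The core of dynamic word masking, as described above, is to mask salient words rather than uniformly random positions. A minimal sketch follows; the hand-made `importance` scores stand in for the embedding-derived salience used in the paper, and `dynamic_mask` is illustrative, not the authors' implementation.

```python
def dynamic_mask(tokens, importance, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask the most important words first, instead of sampling positions
    uniformly at random as in standard BERT pre-training."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    # Rank positions by the importance of the token occupying them.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: -importance.get(tokens[i], 0.0))
    to_mask = set(ranked[:n_mask])
    return [mask_token if i in to_mask else t
            for i, t in enumerate(tokens)]

tokens = "the court upheld the original judgment".split()
# Hand-made salience scores standing in for embedding-derived importance.
importance = {"court": 0.9, "judgment": 0.8, "upheld": 0.6,
              "original": 0.3, "the": 0.01}
masked = dynamic_mask(tokens, importance, mask_ratio=0.3)
# masked == ["the", "[MASK]", "upheld", "the", "original", "judgment"]
```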
  • Language Resources Construction
    SHA Jiu, FENG Chong, ZHOU Luqin, LI Hongzheng, ZHANG Tianfu, HUI Hui
    2021, 35(11): 51-59.
    Current Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain suffers from severe data sparsity. Building a high-quality judicial Ti-Zh corpus is obstructed by two issues: 1) the rigorous logical expressions and professional terminology of the judicial domain, and 2) the unique lexical expressions and syntactic structures of Tibetan. In this paper, we propose a lightweight method for constructing a judicial Ti-Zh parallel corpus. First, we build a medium-scale Tibetan-Chinese judicial terminology glossary as prior knowledge, so that logical expressions and domain terminology are not lost. Second, we collect case data, such as judgment documents, from the official websites of Chinese courts in various regions, prioritizing Tibetan case data. Finally, we build a high-quality 160,000-sentence Ti-Zh parallel corpus of the judicial domain and evaluate its quality and robustness via a variety of translation models and cross-validation experiments. The corpus will be open-sourced for related research.
  • Information Extraction and Text Mining
    LI Yunlong, YU Zhengtao, GAO Shengxiang, GUO Junjun, PENG Renjie
    2021, 35(11): 60-69.
    Case-related news detection clusters the news on a specific case for the purpose of public opinion analysis. To deal with cluster divergence, this paper proposes a case-element-guided deep clustering method for news. First, the method extracts key sentences to represent each text. Second, case elements are used to represent the case and to initialize the clustering centers. Finally, a convolutional auto-encoder is applied to the key sentences, trained with a joint loss combining reconstruction loss and clustering loss, so that text representations move close to the corresponding case. The auto-encoder parameters and the clustering parameters are updated alternately to achieve text clustering. Experiments show that the proposed method improves accuracy by 4.61% over the baseline method.
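The joint objective described above, reconstruction loss plus a clustering term that pulls latent codes toward case-element-initialized centers, can be sketched with NumPy. The simple mean-squared form below is an assumption for illustration, not the paper's exact loss.

```python
import numpy as np

def joint_loss(x, x_recon, z, centers, lam=0.1):
    """Auto-encoder reconstruction loss plus a clustering term that pulls
    each latent code z[i] toward its nearest cluster center."""
    recon = np.mean((x - x_recon) ** 2)
    # Squared distance from every latent code to every center: shape (n, k).
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    cluster = np.mean(d2.min(axis=1))  # distance to the nearest center
    return recon + lam * cluster

x = np.array([[1.0, 0.0], [0.0, 1.0]])        # toy "key sentence" inputs
z = np.array([[0.9, 0.1], [0.1, 0.9]])        # latent codes
centers = np.array([[1.0, 0.0], [0.0, 1.0]])  # case-element-initialized
loss = joint_loss(x, x, z, centers)           # perfect reconstruction case
```

With perfect reconstruction the loss reduces to the weighted nearest-center distance, which is what the alternating updates then drive toward zero.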
  • Information Extraction and Text Mining
    LU Xiaolei, NI Bin
    2021, 35(11): 70-79.
    An accurate automatic patent classifier is crucial to patent inventors and examiners, with potential applications in intellectual property protection, patent management, and patent information retrieval. This paper presents BERT-CNN, a hierarchical patent classifier based on a pre-trained language model, trained on national patent application documents collected from the State Information Center, China. Experimental results show that the proposed method achieves 84.3% accuracy, much better than the two compared baselines, Convolutional Neural Networks and Recurrent Neural Networks. In addition, the article discusses the differences between hierarchical and flat strategies in multi-layer text classification.
  • Information Extraction and Text Mining
    FAN Qinan, KONG Cunliang, YANG Liner, YANG Erhong
    2021, 35(11): 80-90.
    The definition modeling task aims to generate a definition for a target word. This paper introduces the context of the target word and proposes a definition generation model based on BERT and beam search. A Chinese definition modeling dataset, CWN, is constructed with contexts of the target words. Experiments on this Chinese dataset and the English Oxford dataset show that the model achieves significant improvements on both. On the CWN dataset in particular, the BLEU score improves by 10.47 and the semantic similarity by 0.105 over the baseline model.
  • Information Extraction and Text Mining
    LIU Haishun, WANG Lei, SUN Yuanyuan, CHEN Yanguang, ZHANG Shuchen, LIN Hongfei
    2021, 35(11): 91-100.
    As an important research issue in legal intelligence, case factor recognition aims to automatically extract important fact descriptions from legal case texts and classify them into a factor system designed by domain experts. Text encoding based on classical neural networks struggles to extract deep-level features, and threshold-based multi-label classification struggles to capture dependencies between labels. To address this, a multi-label text classification model based on a pre-trained language model is proposed: the encoder is a language model fine-tuned with a layer-attentive strategy, and the decoder is an LSTM-based sequence generation model. Experiments on the CAIL2019 dataset reveal that the proposed method improves the F1 score by 7.4% over a Recurrent Neural Network and by 3.2% over the basic language model (BERT) under the same hyperparameter settings.
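The sequence-generation view of multi-label classification described above, emitting labels one at a time so that earlier labels condition later ones, can be sketched as follows. The `toy_scores` function and the factor names are invented stand-ins for the LSTM decoder and the expert-designed factor system.

```python
STOP = "<eos>"

def decode_labels(score_next, labels, max_len=5):
    """Greedy label-sequence decoding: repeatedly pick the best next label
    given the labels emitted so far, stopping at the <eos> symbol."""
    emitted = []
    for _ in range(max_len):
        scores = score_next(emitted)
        best = max(labels + [STOP], key=lambda l: scores.get(l, float("-inf")))
        if best == STOP:
            break
        emitted.append(best)
    return emitted

def toy_scores(prefix):
    # Stand-in for the LSTM decoder: fixed scores, except that a label
    # already emitted is never proposed again (a crude label dependency).
    scores = {"theft": 2.0, "injury": 1.5, STOP: 1.0}
    for label in prefix:
        scores[label] = float("-inf")
    return scores

factors = decode_labels(toy_scores, ["theft", "injury"])
# factors == ["theft", "injury"]
```

Unlike thresholding per-label probabilities, the decoder sees the emitted prefix at every step, which is how label dependencies are captured.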
  • Information Extraction and Text Mining
    JIANG Liting, Gulila ALTENBEK, MA Yajing
    2021, 35(11): 101-108.
    Entity disambiguation for short texts is limited by the fact that a short text cannot fully express semantic relations and provides little context. This paper proposes a new method named mixed convolution network (MCN). After the data are preprocessed, BERT is applied for feature extraction, and the features are further refined through an attention mechanism before being fed to a CNN, which captures sentence dependency features; in parallel, a GCN captures textual semantic features. The two kinds of semantic information are then fused to produce the output. Experimental results on the CCKS2019 evaluation dataset show that MCN achieves an accuracy of 86.57%, verifying the effectiveness of the method.
  • Information Retrieval and Question Answering
    SONG Pengcheng, SHAN Lili, SUN Chengjie, LIN Lei
    2021, 35(11): 109-117,126.
    We propose a knowledge base question answering (KBQA) system based on query path ranking. The system handles both simple questions and complex multi-constraint questions. To improve performance, we use the LambdaRank algorithm to sort candidate query paths by their degree of correlation with the question; the most correlated candidate path is selected and used to extract answers. Moreover, the system adopts a novel fusion method that improves the accuracy of entity recognition. The system achieves promising results on both the CCKS2019 and CCKS2020 KBQA tasks.
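The path-selection step above, picking the candidate query path most correlated with the question, can be sketched as below. Cosine similarity over hand-made vectors replaces the learned LambdaRank scorer, and the path names and encodings are invented for illustration.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def best_path(question_vec, candidates):
    """Rank candidate query paths by correlation with the question and
    return the top one (the learned ranker is replaced by cosine here)."""
    return max(candidates, key=lambda c: cosine(question_vec, c[1]))[0]

question_vec = [1.0, 0.2]                        # hand-made question encoding
candidates = [
    ("actor -> birthplace", [0.9, 0.1]),         # invented path encodings
    ("actor -> spouse -> birthplace", [0.2, 0.9]),
]
chosen = best_path(question_vec, candidates)
# chosen == "actor -> birthplace"
```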
  • Information Retrieval and Question Answering
    QIN Hanzhong, YU Chongchong, JIANG Weijie, ZHAO Xia
    2021, 35(11): 118-126.
    To match response details effectively and avoid semantic confusion, this paper improves the Deep Attention Matching Network (DAM) via multi-head attention and Bi-directional Long Short-Term Memory (BiLSTM). The method can model longer multi-turn dialogues and handle the matching relationship between the response and the context. In addition, applying a BiLSTM in the feature fusion process captures temporal dependencies and improves accuracy on multi-turn response selection tasks. Tested on two public multi-turn response selection datasets, the Douban Conversation Corpus and the E-commerce Dialogue Corpus, our model outperforms the baseline model by 1.5% in R10@1 with word vector enhancement.
  • Information Retrieval and Question Answering
    LV Zhengwei, YANG Lei, SHI Zhizhong, LIANG Xiao, LEI Tao, LIU Duoxing
    2021, 35(11): 127-134.
    Reading comprehension, an advanced form of question answering, applies semantic understanding to analyze unstructured documents and generate answers, and has important research value and broad application prospects. Because training samples are costly to obtain, domain-specific reading comprehension suffers from poor accuracy and robustness. In this paper we propose a data augmentation method for domain-specific reading comprehension that constructs training samples from real user questions. Experiments in the automobile domain show that the method effectively improves the accuracy and robustness of the reading comprehension model.
  • Sentiment Analysis and Social Computing
    PENG Tao, YANG Liang, SANG Zhongyi, TANG Yu, LIN Hongfei
    2021, 35(11): 135-142.
    Conversational sentiment analysis aims to detect each speaker's emotional state when a conversation ends. Unlike traditional textual sentiment analysis, the dialogue context and the interactions between speakers strongly affect their emotions; moreover, dialogue text generally has complex syntactic structure, often with long-range dependencies between syntactic components, making the task very challenging. To address this, this paper introduces the syntactic dependencies of the text into the model. We first extract syntactic structure information through a graph convolutional network and then combine it with a text sentiment analysis model, yielding two models, H-BiLSTM+HGCL and BERT+HGCL, that model semantics and syntactic structure simultaneously. Experiments on the Chinese conversation sentiment analysis dataset we constructed show that the proposed models outperform the baselines without dependency relations.