2021 Volume 35 Issue 11 Published: 20 November 2021
  

  • Survey
    CHEN Xin, ZHOU Qiang
    2021, 35(11): 1-12.
    As a branch of dialogue systems, open-domain dialogue has good application prospects. Unlike task-oriented dialogue, it exhibits strong randomness and uncertainty. This paper reviews open-domain dialogue research from the perspective of reply generation, focusing on the application and improvement of the sequence-to-sequence model in dialogue generation scenarios. The research shows a clear progression from single-turn to multi-turn dialogue, and we further reveal that in multi-turn dialogue generation, the implementation characteristics of the sequence-to-sequence model do not fully match the application scenarios. Finally, we explore possible improvements for multi-turn dialogue generation: introducing external knowledge, a rewriting mechanism, and an agent mechanism.
  • Survey
    ZHAO Xujian, WANG Chongwei, JIN Peiquan, ZHANG Hui, YANG Chunming, LI Bo
    2021, 35(11): 13-33.
    The complexity of Web information makes it difficult for people to quickly and accurately grasp the storyline of news events. "Storyline mining" has therefore become an active research issue in recent years, aiming to extract the evolutionary stages of events and further explore their evolution patterns by analyzing the correlations between news events and subsequent related events. Storyline mining supports many applications, such as web news retrieval, text summarization, and public opinion monitoring. This paper first outlines the definition, process, and main tasks of storyline mining. Next, the main progress of current studies is reviewed in detail from the aspects of storyline construction and event evolution analysis. We then compare two types of datasets and their evaluation metrics. Finally, several future research directions and technical frameworks for storyline mining are discussed.
  • Language Analysis and Calculation
    ZHU Mengmeng, WU Kaili, HONG Yu, CHEN Xin, ZHANG Min
    2021, 35(11): 34-42.
    Question paraphrase identification aims to determine whether two natural-language questions are semantically equivalent, with semantic understanding as its core issue. Current approaches usually encode each sentence into a vector representation and then compare the two representations to judge equivalence. To better capture what the two questions share and where they differ, this paper proposes a model that integrates semantically orthogonal information: the two questions are decomposed into similar and different representations, which enriches the question representations and realizes multi-granularity fusion. Experiments on two real-world public datasets, LCQMC and Quora, demonstrate the effectiveness of this method.
  • Language Analysis and Calculation
    SONG Ting, GUO Zhancheng, HE Shizhu, LIU Kang, ZHAO Jun, LIU Shengping
    2021, 35(11): 43-50.
    Pre-trained models such as BERT achieve good results on natural language understanding tasks via the random masking strategy and the next-sentence prediction task. To capture the semantic matching relationship between sentences, this paper proposes a pre-trained model based on dynamic word masking. Large-scale sentence pairs are obtained via sentence embeddings, and the important words in them are masked to train a new kind of masked language model. Experiments on four datasets show that the proposed method improves the average accuracy of RBT3 and BERT-base by 1.03% and 0.61%, respectively.
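The core of dynamic word masking, as described above, is to mask salient words rather than uniformly random positions. A minimal sketch follows; the hand-made `importance` scores stand in for the embedding-derived salience used in the paper, and `dynamic_mask` is illustrative, not the authors' implementation.

```python
def dynamic_mask(tokens, importance, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask the most important words first, instead of sampling positions
    uniformly at random as in standard BERT pre-training."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    # Rank positions by the importance of the token occupying them.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: -importance.get(tokens[i], 0.0))
    to_mask = set(ranked[:n_mask])
    return [mask_token if i in to_mask else t
            for i, t in enumerate(tokens)]

tokens = "the court upheld the original judgment".split()
# Hand-made salience scores standing in for embedding-derived importance.
importance = {"court": 0.9, "judgment": 0.8, "upheld": 0.6,
              "original": 0.3, "the": 0.01}
masked = dynamic_mask(tokens, importance, mask_ratio=0.3)
# masked == ["the", "[MASK]", "upheld", "the", "original", "judgment"]
```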
  • Language Resources Construction
    SHA Jiu, FENG Chong, ZHOU Luqin, LI Hongzheng, ZHANG Tianfu, HUI Hui
    2021, 35(11): 51-59.
    Current Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain suffers from severe data sparsity. Building a high-quality judicial Ti-Zh corpus is obstructed by two issues: 1) the rigorous logical expressions and professional terminology of the judicial domain, and 2) the unique lexical expressions and syntactic structures of Tibetan. In this paper, we propose a lightweight method for constructing a judicial Ti-Zh parallel corpus. First, we build a medium-scale Tibetan-Chinese judicial terminology glossary as prior knowledge, so that logical expressions and domain terminology are not lost. Second, we collect case data, such as judgment documents, from the official websites of Chinese courts in various regions, prioritizing Tibetan case data. Finally, we build a high-quality 160,000-sentence Ti-Zh parallel corpus of the judicial domain and evaluate its quality and robustness via a variety of translation models and cross-validation experiments. The corpus will be open-sourced for related research.
  • Information Extraction and Text Mining
    LI Yunlong, YU Zhengtao, GAO Shengxiang, GUO Junjun, PENG Renjie
    2021, 35(11): 60-69.
    Case-related news detection clusters the news on a specific case for the purpose of public opinion analysis. To deal with cluster divergence, this paper proposes a case-element-guided deep clustering method for news. First, the method extracts key sentences to represent each text. Second, case elements are used to represent the case and to initialize the clustering centers. Finally, a convolutional auto-encoder is applied to the key sentences, trained with a joint loss combining reconstruction loss and clustering loss, so that text representations move close to the corresponding case. The auto-encoder parameters and the clustering parameters are updated alternately to achieve text clustering. Experiments show that the proposed method improves accuracy by 4.61% over the baseline method.
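The joint objective described above, reconstruction loss plus a clustering term that pulls latent codes toward case-element-initialized centers, can be sketched with NumPy. The simple mean-squared form below is an assumption for illustration, not the paper's exact loss.

```python
import numpy as np

def joint_loss(x, x_recon, z, centers, lam=0.1):
    """Auto-encoder reconstruction loss plus a clustering term that pulls
    each latent code z[i] toward its nearest cluster center."""
    recon = np.mean((x - x_recon) ** 2)
    # Squared distance from every latent code to every center: shape (n, k).
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    cluster = np.mean(d2.min(axis=1))  # distance to the nearest center
    return recon + lam * cluster

x = np.array([[1.0, 0.0], [0.0, 1.0]])        # toy "key sentence" inputs
z = np.array([[0.9, 0.1], [0.1, 0.9]])        # latent codes
centers = np.array([[1.0, 0.0], [0.0, 1.0]])  # case-element-initialized
loss = joint_loss(x, x, z, centers)           # perfect reconstruction case
```

With perfect reconstruction the loss reduces to the weighted nearest-center distance, which is what the alternating updates then drive toward zero.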
  • Information Extraction and Text Mining
    LU Xiaolei, NI Bin
    2021, 35(11): 70-79.
    An accurate automatic patent classifier is crucial to patent inventors and examiners, with potential applications in intellectual property protection, patent management, and patent information retrieval. This paper presents BERT-CNN, a hierarchical patent classifier based on a pre-trained language model, trained on national patent application documents collected from the State Information Center, China. Experimental results show that the proposed method achieves 84.3% accuracy, much better than the two compared baselines, Convolutional Neural Networks and Recurrent Neural Networks. In addition, the article discusses the differences between hierarchical and flat strategies in multi-layer text classification.
  • Information Extraction and Text Mining
    FAN Qinan, KONG Cunliang, YANG Liner, YANG Erhong
    2021, 35(11): 80-90.
    The definition modeling task aims to generate a definition for a target word. This paper introduces the context of the target word and proposes a definition generation model based on BERT and beam search. A Chinese definition modeling dataset, CWN, is constructed with contexts of the target words. Experiments on this Chinese dataset and the English Oxford dataset show that the model achieves significant improvements on both. On the CWN dataset in particular, the BLEU score improves by 10.47 and the semantic similarity by 0.105 over the baseline model.
  • Information Extraction and Text Mining
    LIU Haishun, WANG Lei, SUN Yuanyuan, CHEN Yanguang, ZHANG Shuchen, LIN Hongfei
    2021, 35(11): 91-100.
    As an important research issue in legal intelligence, case factor recognition aims to automatically extract important fact descriptions from legal case texts and classify them into a factor system designed by domain experts. Text encoding based on classical neural networks struggles to extract deep-level features, and threshold-based multi-label classification struggles to capture dependencies between labels. To address this, a multi-label text classification model based on a pre-trained language model is proposed: the encoder is a language model fine-tuned with a layer-attentive strategy, and the decoder is an LSTM-based sequence generation model. Experiments on the CAIL2019 dataset reveal that the proposed method improves the F1 score by 7.4% over a Recurrent Neural Network and by 3.2% over the basic language model (BERT) under the same hyperparameter settings.
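The sequence-generation view of multi-label classification described above, emitting labels one at a time so that earlier labels condition later ones, can be sketched as follows. The `toy_scores` function and the factor names are invented stand-ins for the LSTM decoder and the expert-designed factor system.

```python
STOP = "<eos>"

def decode_labels(score_next, labels, max_len=5):
    """Greedy label-sequence decoding: repeatedly pick the best next label
    given the labels emitted so far, stopping at the <eos> symbol."""
    emitted = []
    for _ in range(max_len):
        scores = score_next(emitted)
        best = max(labels + [STOP], key=lambda l: scores.get(l, float("-inf")))
        if best == STOP:
            break
        emitted.append(best)
    return emitted

def toy_scores(prefix):
    # Stand-in for the LSTM decoder: fixed scores, except that a label
    # already emitted is never proposed again (a crude label dependency).
    scores = {"theft": 2.0, "injury": 1.5, STOP: 1.0}
    for label in prefix:
        scores[label] = float("-inf")
    return scores

factors = decode_labels(toy_scores, ["theft", "injury"])
# factors == ["theft", "injury"]
```

Unlike thresholding per-label probabilities, the decoder sees the emitted prefix at every step, which is how label dependencies are captured.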
  • Information Extraction and Text Mining
    JIANG Liting, Gulila ALTENBEK, MA Yajing
    2021, 35(11): 101-108.
    Entity disambiguation for short texts is limited by the fact that a short text cannot fully express semantic relations and provides little context. This paper proposes a new method named mixed convolution network (MCN). After the data are preprocessed, BERT is applied for feature extraction, and the features are further refined through an attention mechanism before being fed to a CNN, which captures sentence dependency features; in parallel, a GCN captures textual semantic features. The two kinds of semantic information are then fused to produce the output. Experimental results on the CCKS2019 evaluation dataset show that MCN achieves an accuracy of 86.57%, verifying the effectiveness of the method.
  • Information Retrieval and Question Answering
    SONG Pengcheng, SHAN Lili, SUN Chengjie, LIN Lei
    2021, 35(11): 109-117,126.
    We propose a knowledge base question answering (KBQA) system based on query path ranking. The system handles both simple questions and complex multi-constraint questions. To improve performance, we use the LambdaRank algorithm to sort candidate query paths by their degree of correlation with the question; the most correlated candidate path is selected and used to extract answers. Moreover, the system adopts a novel fusion method that improves the accuracy of entity recognition. The system achieves promising results on both the CCKS2019 and CCKS2020 KBQA tasks.
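The path-selection step above, picking the candidate query path most correlated with the question, can be sketched as below. Cosine similarity over hand-made vectors replaces the learned LambdaRank scorer, and the path names and encodings are invented for illustration.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def best_path(question_vec, candidates):
    """Rank candidate query paths by correlation with the question and
    return the top one (the learned ranker is replaced by cosine here)."""
    return max(candidates, key=lambda c: cosine(question_vec, c[1]))[0]

question_vec = [1.0, 0.2]                        # hand-made question encoding
candidates = [
    ("actor -> birthplace", [0.9, 0.1]),         # invented path encodings
    ("actor -> spouse -> birthplace", [0.2, 0.9]),
]
chosen = best_path(question_vec, candidates)
# chosen == "actor -> birthplace"
```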
  • Information Retrieval and Question Answering
    QIN Hanzhong, YU Chongchong, JIANG Weijie, ZHAO Xia
    2021, 35(11): 118-126.
    To match response details effectively and avoid semantic confusion, this paper improves the Deep Attention Matching Network (DAM) via multi-head attention and Bi-directional Long Short-Term Memory (BiLSTM). The method can model longer multi-turn dialogues and handle the matching relationship between the response and the context. In addition, applying a BiLSTM in the feature fusion process captures temporal dependencies and improves accuracy on multi-turn response selection tasks. Tested on two public multi-turn response selection datasets, the Douban Conversation Corpus and the E-commerce Dialogue Corpus, our model outperforms the baseline model by 1.5% in R10@1 with word vector enhancement.
  • Information Retrieval and Question Answering
    LV Zhengwei, YANG Lei, SHI Zhizhong, LIANG Xiao, LEI Tao, LIU Duoxing
    2021, 35(11): 127-134.
    Reading comprehension, an advanced form of question answering, applies semantic understanding to analyze unstructured documents and generate answers, and has important research value and broad application prospects. Because training samples are costly to obtain, domain-specific reading comprehension suffers from poor accuracy and robustness. In this paper we propose a data augmentation method for domain-specific reading comprehension that constructs training samples from real user questions. Experiments in the automobile domain show that the method effectively improves the accuracy and robustness of the reading comprehension model.
  • Sentiment Analysis and Social Computing
    PENG Tao, YANG Liang, SANG Zhongyi, TANG Yu, LIN Hongfei
    2021, 35(11): 135-142.
    Conversational sentiment analysis aims to detect each speaker's emotional state when a conversation ends. Unlike traditional textual sentiment analysis, the dialogue context and the interactions between speakers strongly affect their emotions; moreover, dialogue text generally has complex syntactic structure, often with long-range dependencies between syntactic components, making the task very challenging. To address this, this paper introduces the syntactic dependencies of the text into the model. We first extract syntactic structure information through a graph convolutional network and then combine it with a text sentiment analysis model, yielding two models, H-BiLSTM+HGCL and BERT+HGCL, that model semantics and syntactic structure simultaneously. Experiments on the Chinese conversation sentiment analysis dataset we constructed show that the proposed models outperform the baselines without dependency relations.