2021 Volume 35 Issue 12 Published: 15 December 2021
  

  • Survey
    QIANG Jipeng, LI Yun, WU Xindong
    2021, 35(12): 1-16.
    Automatic Lexical Simplification (LS) is the process of replacing complex words in a given sentence with simpler alternatives of equivalent meaning, and is an important research direction in text simplification. With the rapid development of natural language processing technology, LS methods have evolved rapidly. This paper surveys the existing work on lexical simplification. After introducing the general LS framework, we summarize LS methods based on linguistic databases, automatic rules, word embeddings, hybrid models, and BERT. Finally, we discuss the difficulties in the study of lexical simplification, outline future directions for LS, and draw our conclusions.
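The LS framework mentioned above is typically a pipeline of complex-word identification, substitute generation, and substitute ranking. A minimal sketch follows; the frequency table, synonym list, and threshold are illustrative assumptions, not resources from any surveyed system.

```python
# Toy lexical simplification pipeline: identify complex words,
# generate substitutes, and rank them by a simplicity score.

# Illustrative resources (assumptions, not from any surveyed system).
WORD_FREQ = {"use": 900, "utilize": 30, "help": 800, "facilitate": 20}
SYNONYMS = {"utilize": ["use"], "facilitate": ["help"]}

def is_complex(word, threshold=100):
    """A word counts as 'complex' if its corpus frequency is below a threshold."""
    return WORD_FREQ.get(word, 0) < threshold

def simplify(sentence):
    out = []
    for word in sentence.split():
        if is_complex(word) and word in SYNONYMS:
            # Rank candidate substitutes by frequency (higher = simpler).
            best = max(SYNONYMS[word], key=lambda w: WORD_FREQ.get(w, 0))
            out.append(best)
        else:
            out.append(word)
    return " ".join(out)

print(simplify("we utilize tools to facilitate work"))
# -> we use tools to help work
```

Real systems replace the toy tables with linguistic databases, embeddings, or BERT-based substitute generation, but the three-stage structure stays the same.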
  • Language Analysis and Calculation
    XU Shaoyang, JIANG Feng, LI Peifeng
    2021, 35(12): 17-27.
    Topic segmentation, one of the classic tasks in natural language processing, is to segment an input discourse into paragraphs with continuous semantics. Previous works used word-frequency-based, latent-based, sequential-based, and Transformer-based methods to encode sentences, ignoring the global semantic information of the discourse. This paper proposes a Discourse Structure Graph Network to encode sentences into representations that carry the global information of the discourse. In detail, the model first constructs a single graph for each discourse, containing all of its sentence and word nodes as well as the adjacency information between them. The model then applies Gated Graph Neural Networks to iterate over the graph, yielding sentence representations with the global information of the discourse. These are finally fed into a Bi-LSTM layer to predict the segmentation points. The experimental results demonstrate that the model obtains sentence representations better suited to topic segmentation than other baselines and achieves the best performance on various popular datasets.
  • Language Analysis and Calculation
    2021, 35(12): 28-35.
    The Grammatical Error Correction (GEC) task is to automatically detect and correct errors in text, such as word order, spelling, and other grammatical errors, through natural language processing technology. Many existing Chinese GEC methods have achieved good results, but they have not taken into account the characteristics of learners, such as proficiency level and native language. Therefore, this paper proposes to personalize the GEC model to the characteristics of Chinese as a Second Language (CSL) learners and to correct the mistakes made by CSL learners with different characteristics. To verify our method, we construct domain adaptation datasets. Experimental results on these datasets demonstrate that the performance of the GEC model is greatly improved after adapting to the various domains of CSL learners.
  • Language Analysis and Calculation
    SU Yulan, CHEN Xin, HONG Yu, ZHU Mengmeng, ZHANG Min
    2021, 35(12): 36-46.
    To address the low efficiency of the binary classification applied in question answering systems, this paper proposes a similar question identification method based on a semantic space distance measure (SSDM), inspired by related research on face identification. This method obtains a semantic encoder through a multi-class classification process over similar questions, using the Margin Softmax introduced from the face identification community. The semantic encoder aggregates similar questions in the semantic space and pushes dissimilar questions far away from each other. The SSDM method transforms similar question identification into vector distance calculation in the semantic space, avoiding pairwise binary question matching and guaranteeing high efficiency. We test the SSDM method on the ASQD dataset from Biendata, and the experimental results show that it outperforms the baseline methods.
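The efficiency gain described above comes from reducing pairwise matching to a distance lookup: each question is encoded once, and matching a new question is a nearest-neighbour search in the semantic space. A minimal sketch, where the embeddings are illustrative stand-ins for the output of the Margin-Softmax-trained encoder:

```python
import numpy as np

# Instead of running a binary classifier on every question pair,
# each question is mapped once into a semantic space; matching a new
# question reduces to finding its nearest banked neighbour.

bank = {
    "how to reset my password": np.array([0.9, 0.1, 0.0]),
    "where is my order":        np.array([0.0, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec, threshold=0.8):
    """Return the banked question closest to the query, if close enough."""
    best_q, best_s = None, -1.0
    for q, v in bank.items():
        s = cosine(query_vec, v)
        if s > best_s:
            best_q, best_s = q, s
    return best_q if best_s >= threshold else None

query = np.array([0.85, 0.15, 0.0])  # encoder output for a new question
print(most_similar(query))
# -> how to reset my password
```

With the encoder trained under a Margin Softmax, intra-class vectors cluster tightly, so a simple distance threshold like the one above suffices at matching time.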
  • Knowledge Representation and Acquisition
    LIU Mengdi, LIANG Xun
    2021, 35(12): 47-59.
    The paper proposes a method for calculating the similarity of character glyphs, which aims to solve the problem of identifying similar Chinese characters. First, we construct a radical knowledge graph according to each character's composition. Then, based on the knowledge graph and structure features, the paper proposes 2CTransE to learn the semantic representations of entities. Finally, we calculate character similarity from the entity vectors. Results show that the method is effective in identifying similar characters, and the component library can be used in subsequent related research. We also propose a novel method for character similarity calculation in Japanese and other similar languages.
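2CTransE builds on the TransE family of knowledge graph embeddings. As background only (the paper's exact extension is not reproduced here), plain TransE scores a triple (head, relation, tail) by how closely head + relation approximates tail; the toy vectors below are illustrative.

```python
import numpy as np

# Plain TransE scoring, the base that 2CTransE extends: a triple
# (h, r, t) is plausible when the translated head h + r lands near t.

def transe_score(h, r, t):
    """Lower score = more plausible triple (L2 norm of h + r - t)."""
    return float(np.linalg.norm(h + r - t))

h = np.array([0.2, 0.5])   # entity:   a character
r = np.array([0.1, -0.1])  # relation: has_radical (hypothetical)
t = np.array([0.3, 0.4])   # entity:   a radical
print(transe_score(h, r, t))  # ~0.0, a perfectly consistent triple
```

Training pushes scores of observed (character, has_radical, radical) triples toward zero, after which characters sharing radicals end up with nearby entity vectors, enabling the glyph similarity computation described above.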
  • Machine Translation
    LUO Qi, LI Maoxi
    2021, 35(12): 60-67.
    Automatic evaluation of machine translation is a key issue in machine translation. Existing work completely ignores the source sentence and measures translation quality against the reference only. This paper presents a novel automatic evaluation metric incorporating source information: quality embeddings describing translation quality are extracted from tuples consisting of machine translations and their corresponding source sentences, and are incorporated into a contextual-embedding-based automatic evaluation method via a deep neural network. Experimental results on the WMT-19 Metrics task dataset show that the proposed method can effectively improve the correlation with human judgments. Further analysis reveals that the information in the source sentences plays an important role in the automatic evaluation of machine translation.
  • Machine Translation
    PU Liuqing, YU Zhengtao, WEN Yonghua, GAO Shengxiang, LIU Yiyang
    2021, 35(12): 68-75.
    Chinese-Vietnamese neural machine translation is a typical low-resource task. Due to the lack of a large-scale parallel corpus, the model may not learn enough about bilingual differences, and translation quality suffers. This paper proposes a Chinese-Vietnamese neural machine translation method based on a dependency graph network, which is constructed from dependency syntactic relations and incorporated into neural machine translation. Within the Transformer framework, a graph encoder is introduced to capture the dependency structure graph of the source language, which is then fused with the sequence encoding via a multi-head attention mechanism. During decoding, both the structured and sequence encodings guide the decoder to generate translations. Experimental results show that, in the Chinese-Vietnamese translation task, incorporating the dependency syntax graph improves the performance of the translation model.
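The fusion step described above can be sketched as cross-attention: sequence token states act as queries, and dependency-graph node states supply keys and values. This is a minimal single-head NumPy sketch under assumed shapes, not the paper's exact architecture.

```python
import numpy as np

# Fusing a graph encoding with a sequence encoding via attention:
# each source token attends over the dependency-graph node states.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(seq_states, graph_states):
    """Scaled dot-product attention from sequence tokens to graph nodes."""
    d = seq_states.shape[-1]
    scores = seq_states @ graph_states.T / np.sqrt(d)  # (T, N)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ graph_states                      # (T, d) fused states

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))    # 5 source tokens, hidden size 8
graph = rng.normal(size=(7, 8))  # 7 dependency-graph nodes, hidden size 8
fused = attend(seq, graph)
print(fused.shape)  # (5, 8)
```

In the full model this happens per head with learned projections, and the fused states join the sequence encoding in guiding the decoder.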
  • Machine Translation
    YOU Xindong, YANG Haixiang, CHEN Haitao, SUN Tian, LV Xueqiang
    2021, 35(12): 76-83.
    Traditional neural machine translation is a black box and cannot effectively incorporate terminology information, so using terms provided by the user to jointly train the neural machine translation model is of practical significance. Accordingly, we propose a patent machine translation model for the new energy domain that incorporates terminology information into the Transformer. The source term is replaced with the target term, and the target term is appended after the source term, to fuse the terminology information. Experimental results on a Chinese-English task with a patent termbase in the new energy field show that the proposed patent translation model outperforms the Transformer baseline, which is further confirmed by translation quality analysis on three datasets.
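The two term-fusion operations described above are source-side preprocessing steps, and can be sketched at the string level; the term pair below is an illustrative assumption, not an entry from the paper's termbase.

```python
# Sketch of the two term-fusion preprocessing operations:
# (1) replace the source term with its target term;
# (2) append the target term right after the source term.

TERMBASE = {"锂电池": "lithium battery"}  # illustrative term pair

def replace_term(src):
    """Replace each source-side term with its target-language term."""
    for s, t in TERMBASE.items():
        src = src.replace(s, t)
    return src

def append_term(src):
    """Append the target term immediately after the source term."""
    for s, t in TERMBASE.items():
        src = src.replace(s, f"{s} {t}")
    return src

sentence = "该 锂电池 容量 很大"
print(replace_term(sentence))  # 该 lithium battery 容量 很大
print(append_term(sentence))   # 该 锂电池 lithium battery 容量 很大
```

Training on sentences preprocessed this way teaches the model to copy the supplied target term into its output, which is how user-provided terminology constrains the otherwise black-box translation.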
  • Information Extraction and Text Mining
    JIANG Haoquan, ZHANG Ruqing, GUO Jiafeng, FAN Yixing, CHENG Xueqi
    2021, 35(12): 84-93.
    Graph Convolutional Networks have drawn much attention recently, and the self-attention mechanism has been widely applied as the core of the Transformer and many pre-trained models. We show that the self-attention mechanism can be seen as a generalization of Graph Convolutional Networks, in that it takes all input samples as nodes and constructs a directed, fully connected graph with learnable edge weights for convolution. Experiments show that the self-attention mechanism achieves better text classification accuracy than many state-of-the-art Graph Convolutional Networks, and the performance gap widens as the data size increases. These results show that the self-attention mechanism is more expressive and may surpass Graph Convolutional Networks, with potential performance improvements on the task of text classification.
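The graph view described above is easy to make concrete: the attention matrix is exactly an n-by-n weighted adjacency over the inputs, with every edge present and learnable. A minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

# Self-attention as convolution over a directed, fully connected graph:
# the attention matrix A plays the role of a learnable adjacency matrix.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 6                      # 4 input tokens, hidden size 6
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))  # n x n edge-weight ("adjacency") matrix
out = A @ V                        # graph convolution over the full graph

# Every node attends to every node: all edges exist, each row sums to 1.
assert A.shape == (n, n)
assert np.allclose(A.sum(axis=1), 1.0)
assert (A > 0).all()
```

A standard GCN fixes the adjacency from the input graph; here the edge weights are computed from the data through Wq and Wk, which is the extra expressiveness the abstract points to.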
  • Information Extraction and Text Mining
    WANG Mingtao, FANG Yewei, CHEN Wenliang
    2021, 35(12): 94-102.
    Mining events in E-commerce reviews is of great help in analyzing customer shopping behavior and classifying commodity scenes. This paper presents a definition of E-commerce events and treats event detection as a sequence labeling task, constructing an event detection corpus based on E-commerce comments. First, this paper extends the character-based BiLSTM-CRF model with Embeddings from Language Models (ELMo) to improve performance. Then, considering the characteristics of Chinese characters, including five-stroke (Wubi) codes and common strokes, two novel models are proposed to add glyph features into ELMo. Experimental results show that the proposed models improve performance on the newly built dataset. Finally, this paper uses two large text corpora, from the news and E-commerce domains, to train language models; the results show that the E-commerce corpus is more helpful to the system.
  • Machine Reading Comprehension
    QIAN Jin, HUANG Rongtao, ZOU Bowei, HONG Yu
    2021, 35(12): 103-111.
    Generative reading comprehension is a novel and challenging issue in machine reading comprehension. Compared with mainstream extractive reading comprehension, a generative reading comprehension model aims to combine questions and paragraphs to generate natural and complete statements as answers. To capture the boundary information of answers in paragraphs and the question type information, this paper proposes a generative reading comprehension model based on multi-task learning. In the training phase, the model takes answer generation as the main task, with answer extraction and question classification as auxiliary tasks for multi-task learning. The model jointly learns and optimizes the parameters of the encoding layer, which is then loaded in the test phase to decode and generate the answers. Experimental results show that the answer extraction and question classification tasks effectively improve the performance of the generative reading comprehension model.
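The training setup above combines one main loss with two auxiliary losses over a shared encoder. A one-line sketch of the objective; the weights are illustrative hyperparameters, not values from the paper.

```python
# Multi-task objective: answer generation is the main task; answer
# extraction and question classification are weighted auxiliary tasks.
# All three losses backpropagate into the shared encoding layer.

def multitask_loss(gen_loss, extract_loss, cls_loss,
                   w_extract=0.5, w_cls=0.3):
    """Weighted sum of the main and auxiliary task losses (weights assumed)."""
    return gen_loss + w_extract * extract_loss + w_cls * cls_loss

print(multitask_loss(2.0, 1.0, 1.0))
# -> 2.0 + 0.5*1.0 + 0.3*1.0 = 2.8
```

At test time the auxiliary heads are dropped; only the shared encoder, now shaped by all three signals, feeds the answer decoder.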
  • Information Retrieval and Question Answering
    GUO Yu, DOU Zhicheng, WEN Jirong
    2021, 35(12): 112-121.
    Dialogue systems are an important downstream task in natural language processing (NLP), receiving more and more attention in recent years. To make dialogue models more in line with the way humans converse and give them better personalized modeling capabilities, this paper proposes a new personalized model, PCC (a Personalized Chatbot with Convolution mechanism), to model a single user. At the encoder, a text convolutional neural network (TextCNN) processes the user's history posts to obtain user interest information. At the decoder, we search the user's historical answers by similarity for the reply that best matches the current question, so as to guide the model's generation together with the user ID. Experimental results show that our model improves the accuracy and diversity of generation and reveals the effectiveness of historical information in personalized modeling.
  • Information Retrieval and Question Answering
    JIN Jihao, RUAN Tong, GAO Daqi, YE Qi, LIU Xuli, XUE Kui
    2021, 35(12): 122-132.
    The existing knowledgebased question answering is difficult to handle natural language questions with complex logical relationships. This paper proposes a semantic graph driven natural language QA framework. The core of the framework is composed of primary chain structure, auxiliary chain structure, ring structure to express events in the field and the semantic relationship between events. Furthermore, the linear coding form of the semantic graph is constructed. The path generation model is used to translate the complex natural language question into a linear sequence of the semantic graph. In order to verify the validity of the framework, the paper constructed 3,000 natural language questions and answers with complex logical relationships through the open graph dataset in the medical field. The results indicate that the accuracy of the sequence-to-sequence model based on the attention mechanism is improved to 97.67%, accuracy of the slot filing with the heuristic rule 94.88%, and the accuracy of the overall system 91.5%.
  • Sentiment Analysis and Social Computing
    LUO Yunsong, HUANG Muyu, JIA Tao
    2021, 35(12): 133-148.
    With the increasing number of microblog robot accounts, their identification has become a prominent problem in data mining. To deal with the imbalanced data in this task, we choose a large dataset to explore the influence of resampling on supervised learning algorithms and propose a novel microblog robot recognition framework combined with resampling. A variety of indexes are used to evaluate the performance of 7 supervised learning algorithms on imbalanced validation sets under 5 different resampling methods. The experimental results show that the Recall of a model trained on a small balanced training set drops seriously in real situations, while the framework combined with resampling can significantly improve the recognition of robot accounts: the NearMiss undersampling method increases Recall, while the ADASYN oversampling method improves the G-mean measure. Generally speaking, release time, publishing region, and release interval are important features for distinguishing normal users from robots. At the same time, resampling can adjust the ranking of the features that the machine learning algorithm depends on, so that the model achieves better performance.
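The resampling step above rebalances the class distribution before training. As a simpler stand-in for the NearMiss and ADASYN methods named in the abstract, here is a plain random-undersampling sketch (class names and sizes are illustrative):

```python
import random
from collections import Counter

# Random undersampling: trim every class down to the size of the
# smallest class before training, so the classifier sees balanced data.

def undersample(samples, labels, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    n_min = min(len(v) for v in by_label.values())
    xs, ys = [], []
    for y, xs_y in by_label.items():
        for x in rng.sample(xs_y, n_min):  # keep n_min samples per class
            xs.append(x)
            ys.append(y)
    return xs, ys

# 90 "normal" accounts vs 10 "robot" accounts.
X = list(range(100))
y = ["normal"] * 90 + ["robot"] * 10
Xb, yb = undersample(X, y)
print(Counter(yb))
```

NearMiss refines this by keeping the majority samples closest to the minority class, while ADASYN goes the other way and synthesizes new minority samples; both plug into the same spot in the pipeline as the function above.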