2021 Volume 35 Issue 7 Published: 30 July 2021
  

  • Survey
    WU Yunfang, ZHANG Yangsen
    2021, 35(7): 1-9.
    Question generation (QG) aims to automatically generate fluent and semantically relevant questions for a given text. QG can be applied to generate questions for reading comprehension tests in the education field, and to enhance question answering and dialog systems. This paper presents a comprehensive survey of research on QG. We first describe the significance of QG and its applications, especially in education. Then we outline traditional rule-based QG methods and describe neural network based models in detail from different perspectives. We also introduce the evaluation metrics for generated questions. Finally, we discuss the limitations of previous studies and suggest directions for future work.
  • Survey
    SUN Yi, QIU Hangping, ZHENG Yu, ZHANG Chaoran, HAO Chao
    2021, 35(7): 10-29.
    Introducing knowledge into data-driven artificial intelligence models is an important way to realize human-machine hybrid intelligence. Pre-trained language models represented by BERT have achieved remarkable success in natural language processing; however, they are trained on large-scale unstructured corpora, and external knowledge needs to be introduced to alleviate their defects in determinacy and interpretability. This paper analyzes the characteristics and limitations of two kinds of pre-trained language models, pre-trained word embeddings and pre-trained context encoders, and explains the related concepts of knowledge enhancement. Four types of knowledge enhancement methods for pre-trained word embeddings are summarized and analyzed: retrofitting of pre-trained word embeddings, hierarchizing the encoding and decoding process, attention mechanism optimization, and introduction of knowledge memory. Knowledge enhancement methods for pre-trained context encoders are described from two perspectives: 1) task-specific versus task-agnostic; 2) explicit versus implicit knowledge. This summary and analysis provides basic patterns and algorithms for human-machine hybrid artificial intelligence.
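    Of the four enhancement routes above, embedding retrofitting is the most self-contained. Below is a minimal Faruqui-style retrofitting sketch, assuming word vectors as a dict of NumPy arrays and a synonym lexicon; the weighting scheme and names are illustrative, not the survey's formulation.
```python
import numpy as np

def retrofit(embeddings, lexicon, iterations=10, alpha=1.0):
    """Nudge pre-trained vectors toward their lexicon neighbours.

    embeddings: dict word -> np.ndarray (the original vectors)
    lexicon:    dict word -> list of related words (e.g. synonyms)
    """
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if word not in new:
                continue
            nbrs = [n for n in neighbours if n in new]
            if not nbrs:
                continue
            # Neighbours pull with weight 1 each; the original vector
            # anchors with weight alpha per neighbour.
            pull = np.sum([new[n] for n in nbrs], axis=0)
            new[word] = (pull + alpha * len(nbrs) * embeddings[word]) / ((1 + alpha) * len(nbrs))
    return new
```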
  • Language Analysis and Calculation
    DU Qianlong, ZONG Chengqing, SU Keh-Yih
    2021, 35(7): 30-40.
    Recognizing textual entailment is the task of detecting whether a given text passage can be inferred from another passage. During the inference process, the sense of each word plays an important role in understanding the meaning of the passages and predicting the relationship of the passage pair. In this paper, we propose a novel approach to incorporate word sense information into the inference mechanism. We first use a word sense disambiguation system to generate the sense of each content word, and then use this sense information to improve the representations of the passages and enhance the prediction of the entailment relationship of the passage pair. Experimental results show that the proposed approach effectively improves performance.
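    A minimal sketch of the sense-augmentation step as described: each content word's embedding is concatenated with the embedding of the sense assigned by a WSD system. The interfaces (the wsd callable, embedding dicts, and dimensions) are assumptions for illustration only.
```python
import numpy as np

def sense_enhanced_repr(tokens, word_emb, sense_emb, wsd):
    """Build per-token vectors: word embedding + sense embedding."""
    vecs = []
    for tok in tokens:
        w = word_emb.get(tok, np.zeros(300))
        sense = wsd(tok, tokens)              # hypothetical WSD call, e.g. a synset id
        s = sense_emb.get(sense, np.zeros(100))
        vecs.append(np.concatenate([w, s]))   # sense-enriched token vector
    return np.stack(vecs)                     # shape: (len(tokens), 400)
```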
  • Language Resources Construction
    XING Fugui, ZHU Tingshao
    2021, 35(7): 41-46.
    Word segmentation for classical Chinese is an important step in analyzing existing ancient documents. In this paper, we first collect an unstructured classical Chinese online corpus and accumulate a basic dictionary. Then candidate new words are discovered by a multi-feature fusion strategy, including mutual information, information entropy, and position word probability. Finally, the resulting CCIDict of 349,740 words is applied with forward maximum matching to segment classical Chinese texts, achieving a 14% improvement in F-value over the open-source Jiayan toolkit.
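    The final lookup step is standard forward maximum matching; a minimal sketch, with the dictionary as a Python set and an illustrative cap on word length:
```python
def forward_maximum_matching(text, dictionary, max_len=8):
    """Greedy left-to-right segmentation: take the longest dictionary
    match at each position, falling back to a single character when
    nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```
    For example, forward_maximum_matching("天下大势", {"天下", "大势"}) yields ["天下", "大势"].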
  • Language Resources Construction
    MAO Cunli, LU Shan, WANG Hongbin, YU Zhengtao, WU Xia, WANG Zhenhan
    2021, 35(7): 47-53.
    The Chinese-Burmese bilingual dictionary is an important data resource for research on machine translation, cross-language retrieval, etc. At present, iterative self-learning methods based on a small-scale seed dictionary have achieved good results in extracting bilingual dictionaries from parallel corpora. However, for a low-resource pair such as Chinese-Burmese, the lack of bilingual parallel resources means that iterative self-learning cannot obtain effective bilingual word vector representations, resulting in low accuracy of the bilingual dictionary extraction model. Recent studies suggest that similar words in comparable corpora often have similar contexts. Therefore, this paper proposes a semi-supervised method for constructing a Chinese-Burmese bilingual dictionary. A pre-trained language model is used to construct context feature vectors for bilingual vocabulary, which semantically enhance the Chinese-Burmese word pairs obtained by iterative self-learning over comparable corpora and a small-scale seed dictionary. Experimental results show that the proposed method achieves a significant improvement over the baseline method.
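    The retrieval step common to such self-learning pipelines is nearest-neighbour search in a shared embedding space; a minimal sketch, assuming the cross-lingual mapping has already been learned (all names illustrative):
```python
import numpy as np

def nearest_translation(word, src_vecs, tgt_matrix, tgt_words):
    """Return the target word whose vector is most cosine-similar.

    src_vecs:   dict word -> vector already mapped into the shared space
    tgt_matrix: (V, d) array of target-language vectors
    tgt_words:  list of V target-language words, row-aligned with tgt_matrix
    """
    v = src_vecs[word]
    v = v / np.linalg.norm(v)
    t = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(t @ v))]   # highest cosine similarity
```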
  • Machine Translation
    LI Xinguang, CHEN Shuai, LONG Xiaolan
    2021, 35(7): 54-62.
    This paper proposes a sentence-based automatic scoring method for Chinese-English oral translation. Three main indicators are designed, evaluating keywords, the general idea of the sentence, and fluency. For keywords, synonym analysis is applied to identify synonyms among candidate keywords. At the sentence level, the translation is evaluated by an Unfolding Recursive Auto-Encoder (URAE). Fluency is scored by speech rate. Finally, the weighted sum of the three indicators is taken as the overall translation quality score. Experimental results demonstrate that this automatic scoring method is highly consistent with manual scoring.
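    The final combination is a plain weighted sum; a trivial sketch in which the weights are placeholders, not the paper's tuned values:
```python
def overall_score(keyword, sentence, fluency, weights=(0.4, 0.4, 0.2)):
    """Combine the three indicator scores (each assumed in [0, 1])
    into one overall translation quality score."""
    return weights[0] * keyword + weights[1] * sentence + weights[2] * fluency
```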
  • Ethnic Language Processing and Cross Language Processing
    JIAMILA Wushouer, WU Di, WANG Lulu, GULINIGEER Abudouwaili, MAIHEMUTI Maimaiti, TUERGEN Yibulayin
    2021, 35(7): 63-71.
    Uyghur is morphologically rich and resource-scarce, which challenges existing deep learning models for Uyghur text classification. This paper proposes a text classification model called MDPLC that combines Bi-LSTM+CNN and DPCNN. First, the pre-trained word vectors are fused with the semantic information processed by a Bi-LSTM to obtain the semantic dependencies of the whole sentence, and local semantic learning is further strengthened by a layer of pooled CNN. Meanwhile, text semantic information is captured by a multi-convolution-kernel DPCNN in a dual-channel fashion. Experiments on short- and long-text datasets in Chinese, English, and Uyghur show that the proposed model achieves better accuracy than existing popular deep learning models.
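    A rough PyTorch sketch of the dual-channel idea: one Bi-LSTM+CNN channel and one convolutional stack standing in for DPCNN. All dimensions are illustrative, and the real MDPLC details differ.
```python
import torch
import torch.nn as nn

class DualChannel(nn.Module):
    """Two channels over the same tokens: Bi-LSTM+CNN and a DPCNN stand-in."""
    def __init__(self, vocab, dim=128, classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.cnn = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.dp = nn.Sequential(                     # stand-in for DPCNN
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(2 * dim, classes)

    def forward(self, x):                            # x: (B, T) token ids
        e = self.emb(x)                              # (B, T, D)
        h, _ = self.lstm(e)                          # (B, T, 2D)
        c1 = torch.relu(self.cnn(h.transpose(1, 2))).max(dim=2).values  # pooled CNN channel
        c2 = self.dp(e.transpose(1, 2)).max(dim=2).values               # DPCNN-style channel
        return self.fc(torch.cat([c1, c2], dim=1))   # fuse both channels
```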
  • Ethnic Language Processing and Cross Language Processing
    CUI Zhiyuan, ZHAO Erping, LUO Weiqun, WANG Wei, SUN Hao
    2021, 35(7): 72-80.
    Domain-specific corpora such as a Tibetan animal husbandry corpus contain many unknown words formed by direct transliteration or synthesis. To improve word segmentation for such corpora, this paper proposes a Chinese word segmentation model based on multi-head attention. To capture the dependency relations and syncopation-point information between key character vectors, the multi-head attention mechanism computes, in parallel, the correlations between important character vectors and all other character vectors regardless of the distance between them. A conditional random field then models the lexeme labels to obtain the optimal segmentation sequence. Finally, a domain dictionary is constructed to further improve segmentation. Experiments on the Tibetan animal husbandry corpus show that, compared with classical models such as Bi-LSTM-CRF, the accuracy, recall, and F1 value of the proposed model increase by 3.93%, 5.3%, and 3.63%, respectively.
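    A minimal sketch of the attention-then-tag pipeline: multi-head self-attention relates every character to every other in parallel, and the outputs serve as emission scores for a CRF decoder (omitted here). Sizes and the BMES tag set are assumptions.
```python
import torch
import torch.nn as nn

class AttnSegTagger(nn.Module):
    """Self-attention over character embeddings, then per-character tag scores."""
    def __init__(self, vocab, dim=128, heads=8, tags=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, tags)       # e.g. BMES tags

    def forward(self, chars):                 # chars: (B, T) character ids
        x = self.emb(chars)
        h, _ = self.attn(x, x, x)             # parallel pairwise correlations, any distance
        return self.out(h)                    # emission scores for a CRF layer
```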
  • Information Extraction and Text Mining
    WANG Bingqian, SU Shaoxun, LIANG Tianxin
    2021, 35(7): 81-88.
    Event extraction (EE) refers to the technology of extracting events from natural language texts and identifying event types and event elements. This paper proposes an end-to-end multi-label pointer network for event extraction, in which the event detection task is integrated into the event element recognition task so that event elements and event types are extracted at the same time. This method avoids the error cascading and task isolation problems of traditional pipeline methods, and alleviates the role-overlap and element-overlap problems in event extraction. The proposed method achieves an 85.9% F1 score on the test set of the 2020 Language and Intelligence Challenge event extraction task.
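    A sketch of a multi-label pointer head of the kind described: for every (event type, role) label, independent per-token start/end probabilities are predicted, which is what lets overlapping arguments coexist. The encoder, hidden size, and label count are assumptions.
```python
import torch
import torch.nn as nn

class MultiLabelPointer(nn.Module):
    """Independent start/end pointers per (event type, role) label."""
    def __init__(self, hidden=768, labels=217):   # label count illustrative
        super().__init__()
        self.start = nn.Linear(hidden, labels)
        self.end = nn.Linear(hidden, labels)

    def forward(self, enc):                       # enc: (B, T, hidden) from any encoder
        # Sigmoid (not softmax) per token and label, so multiple spans
        # and overlapping roles can fire at once.
        return torch.sigmoid(self.start(enc)), torch.sigmoid(self.end(enc))
```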
  • Information Extraction and Text Mining
    WANG Yanggang, QIU Xipeng, HUANG Xuanjing, WANG Yining, LI Yunhui
    2021, 35(7): 89-97,108.
    Graph neural networks (GNNs) have recently proved effective for modeling the global context representation of samples, but they suffer from over-smoothing in noisy few-shot text classification scenarios. We propose a dual-channel graph neural network that models full context features while making full use of the label propagation mechanism. A multi-task parameter sharing mechanism is used across the two channels to effectively constrain the graph iteration process. Compared with the baseline graph neural network, our method achieves an average improvement of 1.51% on the FewRel dataset and 11.1% on the ARSC dataset.
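    Purely as an illustration of the dual-channel idea, not the paper's architecture: one message-passing channel propagates features, the other propagates label distributions, and a shared linear layer couples the two in a multi-task fashion.
```python
import torch
import torch.nn as nn

class DualChannelGNN(nn.Module):
    """Feature channel + label-propagation channel with shared weights."""
    def __init__(self, dim=64, classes=5):
        super().__init__()
        self.shared = nn.Linear(dim, dim)          # multi-task shared parameters
        self.label_proj = nn.Linear(classes, dim)  # lift label distributions to feature space
        self.out = nn.Linear(dim, classes)

    def forward(self, adj, feats, labels):
        # adj: (N, N) normalized adjacency; feats: (N, dim);
        # labels: (N, classes), zeros for unlabeled nodes.
        h = torch.relu(self.shared(adj @ feats))                        # feature channel
        l = torch.relu(self.shared(adj @ self.label_proj(labels)))      # label channel
        return self.out(h + l)
```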
  • Information Extraction and Text Mining
    LUO Fang, WANG Jinghang, ZHANG Yuheng, HE Daosen, PU Qiumei
    2021, 35(7): 98-108.
    This paper designs a new temporal summarization extraction method to model the evolution of hot events in social media. Based on an analysis of the evolution stages of hot events, the study explores the time-series and social-influence characteristics of social texts and proposes a novel temporal summarization method, LexRank Summarization with Timeline-Social Influence (LSTS). Experimental results show that LSTS achieves optimal results with a weight ratio of 0.4 between time-series and social influence, reaching 44.23%, 34.78%, and 27.86% in ROUGE-1, ROUGE-2, and ROUGE-S4, respectively.
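    One plausible reading of the scoring step, offered only as a sketch since the abstract does not give the exact formula: blend the time-series and social-influence signals at ratio lam and use the blend to modulate LexRank salience.
```python
def lsts_score(lexrank, recency, social, lam=0.4):
    """Illustrative sentence score: LexRank salience scaled by a
    lam-weighted blend of a recency signal and a social-influence
    signal (all inputs assumed normalized to [0, 1])."""
    return lexrank * (lam * recency + (1 - lam) * social)
```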
  • Machine Reading Comprehension
    LI Fangfang, REN Xingkai, MAO Xingliang, LIN Zhongyao, LIU Xiyao
    2021, 35(7): 109-117,125.
    The combination of artificial intelligence with law has become a hot research issue. Focusing on the machine reading comprehension task of the China AI Law Challenge 2020 (CAIL2020), this paper proposes multi-task joint training of four sub-modules: a word embedding module, an answer extraction module, an answer classification module, and a supporting facts discrimination module. The paper also proposes a data augmentation method based on TF-IDF "question-context" similarity matching, which re-labels the CAIL2019 training set for data augmentation. On the CAIL2020 machine reading comprehension task, the model achieves an F1 value of 74.49, ranking first in the task.
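    The augmentation rests on TF-IDF similarity between questions and contexts; a minimal scikit-learn sketch, where the threshold and best-match pairing policy are assumptions:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_contexts(questions, contexts, threshold=0.3):
    """Pair each question with its most TF-IDF-similar context
    if the cosine score clears a (hypothetical) threshold."""
    vec = TfidfVectorizer().fit(questions + contexts)
    sims = cosine_similarity(vec.transform(questions), vec.transform(contexts))
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= threshold:
            pairs.append((questions[i], contexts[j], float(row[j])))
    return pairs
```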
  • Machine Reading Comprehension
    XU Jiawei, LIU Ruifang, GAO Sheng, LI Si
    2021, 35(7): 118-125.
    To address machine reading comprehension over complex linguistic phenomena such as Chinese idioms, we propose an enhanced global attention module to better perceive the grammatical functions of idioms in different contexts. We adjust the original global attention by generating an extra attention factor for each spatial position, so as to enhance the recognition of specific word senses. We integrate this module with the popular BERT language model for the Chinese cloze task. Results on the recently released cloze-test dataset ChID show that our method achieves significant improvements compared with the fine-tuned BERT model and the global attention model.
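    A sketch of the per-position factor idea: an extra learned factor rescales ordinary global attention weights position by position before renormalization. Shapes and the factor's parameterization are assumptions.
```python
import torch
import torch.nn as nn

class EnhancedGlobalAttention(nn.Module):
    """Global attention with an extra learned per-position factor."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.factor = nn.Linear(dim, 1)       # the extra attention factor

    def forward(self, h):                     # h: (B, T, dim)
        w = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # ordinary global attention
        f = torch.sigmoid(self.factor(h).squeeze(-1))          # per-position rescaling
        w = w * f
        w = w / w.sum(dim=-1, keepdim=True)                    # renormalize
        return torch.einsum('bt,btd->bd', w, h)                # pooled representation
```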
  • Sentiment Analysis and Social Computing
    SONG Wei, WEN Zijian
    2021, 35(7): 126-133.
    Current methods for aspect-based sentiment analysis usually rely on the attention mechanism to model the interaction between sentence and aspect. However, attention often produces mismatches between sentence words and aspect words, introducing extraneous noise. To address this issue, this paper proposes a feature dual distillation network for aspect-based sentiment analysis. First, a BiLSTM extracts context semantic features, and a context-based aspect embedding provides the semantic feature of the aspect. A gate mechanism is then employed to construct a dual distillation gate, in which preliminary and fine distillation processes realize the interaction between the semantic features of sentence and aspect. Finally, Softmax predicts the sentiment polarities. On the commonly used Laptop, Restaurant, and Twitter datasets, the proposed method outperforms state-of-the-art methods with 79.26%, 84.53%, and 75.30% accuracy, and 75.77%, 75.63%, and 73.21% Macro-F1, respectively.
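    A sketch of a single gated distillation step consistent with this description: the aspect vector gates, per context position, how much of the sentence feature passes through; applying two such gates in sequence would mirror the preliminary-plus-fine scheme. Dimensions are illustrative.
```python
import torch
import torch.nn as nn

class DistillGate(nn.Module):
    """Aspect-conditioned gate over context features."""
    def __init__(self, dim=300):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, ctx, aspect):            # ctx: (B, T, D), aspect: (B, D)
        a = aspect.unsqueeze(1).expand_as(ctx) # broadcast aspect over positions
        g = torch.sigmoid(self.gate(torch.cat([ctx, a], dim=-1)))
        return g * ctx                         # distilled context features
```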
  • Natural Language Understanding and Generation
    LIU Xikai, LIN Hongfei, XU Bo, YANG Liang, REN Yuqi
    2021, 35(7): 134-142.
    Response generation is an important component of dialogue systems. To better combine retrieval-based and generation-based models, this paper proposes a response generation model with a retrieved-response fusion mechanism. The model uses a bidirectional LSTM to encode the retrieved response, and proposes a Long Short-Term Memory network with a fusion mechanism (fusion-LSTM) that fuses the retrieval results with the dialogue text inside the model, so as to better integrate the retrieved information into the generative model. Experimental results show that this method outperforms the baseline methods in both automatic and human evaluation.
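    The abstract does not detail the fusion-LSTM cell, so the sketch below is a simplified stand-in: a learned gate blends a retrieved-response summary into the decoder hidden state at each step, rather than modifying the LSTM cell internals as the paper does.
```python
import torch
import torch.nn as nn

class FusionCell(nn.Module):
    """LSTM step followed by a gate that mixes in retrieved information."""
    def __init__(self, dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, state, retrieved):    # x: (B, dim), retrieved: (B, dim)
        h, c = self.cell(x, state)
        g = torch.sigmoid(self.gate(torch.cat([h, retrieved], dim=-1)))
        h = g * h + (1 - g) * retrieved        # fuse retrieval into the hidden state
        return h, (h, c)
```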