2022 Volume 36 Issue 6 Published: 25 July 2022
  

  • Survey
    CUI Lei, XU Yiheng, LYU Tengchao, WEI Furu
    2022, 36(6): 1-19.
    Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding and analyzing business documents. It is an important interdisciplinary field involving natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced Document AI tasks such as document layout analysis, document information extraction, document visual question answering, and document image classification. This paper briefly introduces early-stage heuristic rule-based document analysis and statistical machine learning algorithms, as well as deep learning-based approaches, especially pre-training approaches. Finally, we look into future directions of Document AI.
  • Survey
    ZHANG Rujia, DAI Lu, WANG Bang, GUO Peng
    2022, 36(6): 20-35.
    Chinese named entity recognition (CNER) is one of the basic tasks for natural language processing applications such as question answering, machine translation and information extraction. Although traditional CNER systems have achieved satisfactory results with the help of manually designed domain-specific features and grammatical rules, they still suffer from weak generalization, poor robustness and difficult maintenance. In recent years, deep learning techniques have been adopted to address these shortcomings by automatically extracting text features in an end-to-end manner. This article surveys recent advances in deep learning-based CNER. It first introduces the concepts, difficulties and applications of CNER, along with common datasets and evaluation metrics. Recent neural network models for CNER are then grouped according to their network architectures, and representative models in each group are detailed. Finally, future research directions are discussed.
  • Machine Translation
    WANG Tao, XIONG Deyi
    2022, 36(6): 36-43.
    Integrating pre-defined bilingual pairs into neural machine translation (NMT) is a challenging task with substantial application scenarios. Limited by the word-by-word decoding strategy, explicitly integrating external bilingual pairs into NMT often requires modifying the beam search decoding algorithm or even the model itself. This paper proposes a simple method for incorporating pre-defined bilingual pairs into NMT: (1) preprocessing the training data to add information about the pre-defined bilingual pairs; (2) using partially shared embeddings to help the model distinguish pre-defined bilingual pairs from other text. Experiments and analysis on multiple language pairs show that the method raises the probability of successfully translating pre-defined bilingual pairs to nearly 99% (versus 73.8% for the Chinese-English baseline).
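    As a rough illustration of step (1), the sketch below inlines a pre-defined bilingual pair into the source sentence using special tag tokens so the model can learn to emit the target-side term; the tag tokens and function name are hypothetical and not taken from the paper.
```python
# Hypothetical preprocessing: mark a pre-defined bilingual pair in the source
# side so the translation model sees both the source term and its required
# target translation. Tag tokens are illustrative.
def inline_bilingual_pairs(src_tokens, pairs):
    """Replace each source-side term with 'term <trans> target </term>'."""
    out = []
    for tok in src_tokens:
        if tok in pairs:
            out.extend(["<term>", tok, "<trans>", pairs[tok], "</term>"])
        else:
            out.append(tok)
    return out

if __name__ == "__main__":
    pairs = {"神经网络": "neural network"}          # pre-defined bilingual pair
    src = ["这", "是", "一个", "神经网络", "模型"]
    print(" ".join(inline_bilingual_pairs(src, pairs)))
```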
  • Machine Translation
    ZHU Junguo, YANG Fuan, YU Zhengtao, ZOU Xiang, ZHANG Zefeng
    2022, 36(6): 44-51.
    In neural machine translation, low-frequency words are a key factor affecting translation quality, a problem that is even more prominent in low-resource scenarios. This paper proposes a low-resource neural machine translation method that enhances the representation of low-frequency words. The main idea is to use contextual information from monolingual data to learn a probability distribution for each low-frequency word and to recalculate its word embedding based on this distribution. The Transformer model is then retrained with the new word embeddings, effectively alleviating the inaccurate representation of low-frequency words. Experimental results on the four translation directions between Chinese and Vietnamese and between Chinese and Mongolian show that the proposed method achieves a significant improvement over the baseline model.
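    A minimal sketch of the core idea as stated in the abstract: the embedding of a low-frequency word is recomputed as a probability-weighted average over the vocabulary, with the distribution assumed to come from context modeling on monolingual data. The toy distribution and dimensions are illustrative.
```python
# Re-estimate a low-frequency word's embedding from a context-derived
# distribution over the vocabulary (toy data; not the paper's exact procedure).
import numpy as np

def reestimate_embedding(prob_over_vocab, embedding_matrix):
    """prob_over_vocab: (V,) distribution for the low-frequency word.
    embedding_matrix: (V, d) embeddings of all vocabulary words."""
    return prob_over_vocab @ embedding_matrix      # (d,) weighted average

rng = np.random.default_rng(0)
V, d = 1000, 64
E = rng.normal(size=(V, d))
p = rng.random(V); p /= p.sum()                    # toy context distribution
print(reestimate_embedding(p, E).shape)            # (64,)
```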
  • Ethnic Language Processing and Cross Language Processing
    LYU Haotian, MA Zhiqiang, WANG Hongbin, XIE Xiulan
    2022, 36(6): 52-60.
    To address the low-resource corpus available for training Mongolian speech recognition models, this paper proposes a layer transfer method based on transfer learning and describes a variety of transfer strategies for Mongolian speech recognition based on CNN-CTC (Convolutional Neural Networks with Connectionist Temporal Classification). Using an English corpus of 10,000 sentences and a Mongolian corpus of 5,000 sentences, we conduct an empirical study on the selection of the learning rate during model training, the effectiveness of layer transfer, the choice of the best transfer layer strategy, and the impact of the high-resource model's training data on the layer transfer model. The experimental results show that the layer transfer model accelerates training, and that a bottom-up transfer layer selection strategy achieves, under limited Mongolian corpus resources, a WER 10.18% lower than an ordinary CNN-CTC based Mongolian speech recognition model.
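    The sketch below shows the general shape of bottom-up layer transfer: copy the lowest k convolutional layers from a high-resource model into the low-resource model and optionally freeze them. The module structure is a stand-in, not the paper's exact CNN-CTC architecture.
```python
# Bottom-up layer transfer between two CNN acoustic models (illustrative only).
import torch.nn as nn

def make_cnn(num_layers=4, channels=32, num_labels=60):
    layers, in_ch = [], 1
    for _ in range(num_layers):
        layers += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = channels
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(channels, num_labels))

def transfer_bottom_layers(src, dst, k, freeze=True):
    """Copy parameters of the first k Conv2d layers of src into dst."""
    src_convs = [m for m in src if isinstance(m, nn.Conv2d)]
    dst_convs = [m for m in dst if isinstance(m, nn.Conv2d)]
    for s, t in zip(src_convs[:k], dst_convs[:k]):
        t.load_state_dict(s.state_dict())
        if freeze:
            for p in t.parameters():
                p.requires_grad = False

english_model = make_cnn()        # stands in for the high-resource model
mongolian_model = make_cnn()      # low-resource target model
transfer_bottom_layers(english_model, mongolian_model, k=2)
```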
  • Ethnic Language Processing and Cross Language Processing
    TANG Lixin, ZHOU Lanjiang, ZHANG Li, ZHANG Jian'an
    2022, 36(6): 61-68,89.
    The identification of noun phrases is of fundamental significance to natural language processing tasks such as syntactic analysis. At present, research on identifying Lao noun phrases is still in its infancy. Compared with other languages, Lao poses problems such as fuzzy phrase boundaries, ambiguous definitions, limited corpora and excessively long sentences. This paper studies the structure of Lao noun phrases and builds a multi-channel model to identify them. The model forms different channels by combining character, word and POS features, and extracts more hidden information from different aspects with multiple BiLSTM networks, so as to alleviate the problem of unregistered noun phrases in a low-resource corpus. To deal with the excessively long sentences in Lao, the model introduces an attention mechanism to assign higher weights to important features, effectively reducing interference from useless information. The experimental results show that the F1 value of the model reaches 85.25% on a limited annotated corpus, outperforming other models and methods.
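    A hypothetical sketch of a multi-channel BiLSTM tagger with attention in the spirit of the model described above: character, word and POS channels are each encoded by a BiLSTM, and an attention layer weights the time steps before per-token tagging. All dimensions and layer choices are illustrative.
```python
import torch
import torch.nn as nn

class MultiChannelNPTagger(nn.Module):
    def __init__(self, vocab_sizes, emb_dim=64, hidden=128, num_tags=5):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.lstms = nn.ModuleList(
            nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            for _ in vocab_sizes)
        self.attn = nn.Linear(2 * hidden * len(vocab_sizes), 1)
        self.out = nn.Linear(2 * hidden * len(vocab_sizes), num_tags)

    def forward(self, channels):                 # list of (B, T) index tensors
        encoded = []
        for x, emb, lstm in zip(channels, self.embeds, self.lstms):
            h, _ = lstm(emb(x))                  # (B, T, 2*hidden)
            encoded.append(h)
        h = torch.cat(encoded, dim=-1)           # fuse channels
        weights = torch.softmax(self.attn(h), dim=1)
        return self.out(h * weights)             # per-token tag scores

# toy usage: character, word and POS channels for a batch of two sentences
model = MultiChannelNPTagger(vocab_sizes=[2000, 8000, 30])
chars = torch.randint(0, 2000, (2, 20))
words = torch.randint(0, 8000, (2, 20))
pos = torch.randint(0, 30, (2, 20))
print(model([chars, words, pos]).shape)          # torch.Size([2, 20, 5])
```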
  • Information Extraction and Text Mining
    LIU Yin, ZHANG Kai, WANG Huijian, YANG Guanqun
    2022, 36(6): 69-79.
    This paper proposes an unsupervised method for low-resource named entity recognition in the electric power domain. We collect the target domain corpus and use string statistics techniques to update the domain vocabulary. We also obtain a small set of entity words with their types by parsing structured electric power maintenance manuals, and representative words for each entity type are selected according to word-embedding-based similarity. At the same time, we pre-train an electric power BERT model with the whole word masking technique, and predict entity words in the text and their possible entity types by calculating their semantic similarities with the representative words. Experiments show that our method is feasible for low-resource data and can be easily reused in other domains.
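    A minimal sketch, under assumptions, of the similarity-based typing step: a candidate entity word is assigned the entity type whose representative words lie closest in embedding space. The toy vectors, type names and threshold are illustrative; the paper obtains embeddings from a domain BERT pre-trained with whole word masking.
```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predict_type(candidate_vec, type_to_rep_vecs, threshold=0.5):
    """Return the best-matching entity type, or None if below the threshold."""
    best_type, best_score = None, threshold
    for etype, rep_vecs in type_to_rep_vecs.items():
        score = max(cosine(candidate_vec, v) for v in rep_vecs)
        if score > best_score:
            best_type, best_score = etype, score
    return best_type

rng = np.random.default_rng(0)
reps = {"DEVICE": [rng.normal(size=32) for _ in range(3)],
        "FAULT": [rng.normal(size=32) for _ in range(3)]}
print(predict_type(rng.normal(size=32), reps))
```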
  • Information Extraction and Text Mining
    ZHANG Hu, JI Ze, WANG Yujie, LI Ru
    2022, 36(6): 80-89.
    Research on intelligent justice services based on natural language understanding has attracted increasing attention. To better provide judges with the focal points of dispute in a case, this paper focuses on identifying the logically interactive argument pairs between the prosecution and the defense in judgment documents. We investigate the semantic representation of interactive arguments, the interaction between interactive argument pairs, and related issues. We present a method combining a pre-trained language model, the attention mechanism, and adversarial training to identify interactive argument pairs. Experimental results show that the proposed method improves both the identification accuracy and the robustness of the model.
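    The adversarial-training component could, for instance, be realized with fast-gradient-method (FGM) style perturbations of the embedding layer; the sketch below shows that general technique only and is not the paper's exact procedure.
```python
# Generic FGM-style adversarial perturbation on embedding weights (illustrative;
# epsilon, the parameter-name filter and the usage pattern are assumptions).
import torch

class FGM:
    def __init__(self, model, epsilon=1.0, target="embedding"):
        self.model, self.epsilon, self.target = model, epsilon, target
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.target in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)
                if norm != 0:
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# usage inside a training step:
#   loss.backward(); fgm.attack(); adv_loss.backward(); fgm.restore(); optimizer.step()
```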
  • Information Extraction and Text Mining
    BAO Zhenshan, SONG Bingyan, ZHANG Wenbo, SUN Chao
    2022, 36(6): 90-100.
    Named entity recognition for traditional Chinese medicine books is a less addressed topic. Considering the difficulty and cost of annotating such professional texts in classical Chinese, this paper proposes a method for identifying traditional Chinese medicine entities based on a combination of semi-supervised learning and rules. Under the framework of the conditional random fields model, supervised features such as lexical features and dictionary features are introduced together with unsupervised semantic features derived from word vectors. The optimal semi-supervised learning model is obtained by examining the performance of different feature combinations. Finally, the recognition results of the model are analyzed and a rule-based post-processing step is established according to the linguistic characteristics of ancient books. Experimental results reveal an F-score of 83.18%, which demonstrates the validity of this method.
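    To make the feature mix concrete, here is a hypothetical per-character feature extractor that combines supervised dictionary features with unsupervised word-vector cluster ids, the kind of input a CRF toolkit would consume; the dictionaries, cluster ids and feature names are toy data, not the paper's feature set.
```python
def char_features(sent, i, herb_dict, cluster_id):
    ch = sent[i]
    return {
        "char": ch,
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
        "in_herb_dict": ch in herb_dict,          # supervised dictionary feature
        "wv_cluster": cluster_id.get(ch, -1),     # unsupervised word-vector cluster
    }

sent = "当归补血汤"
herb_dict = {"当", "归"}
cluster_id = {"当": 3, "归": 3, "汤": 7}
features = [char_features(sent, i, herb_dict, cluster_id) for i in range(len(sent))]
print(features[0])
# Such feature dicts would be paired with BIO labels on the annotated portion
# and fed to a CRF implementation (e.g. sklearn-crfsuite).
```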
  • Information Extraction and Text Mining
    GAN Zifa, ZAN Hongying, GUAN Tongfeng, LI Wenxin, ZHANG Huan, ZHU Tiantian, SUI Zhifang, CHEN Qingcai
    2022, 36(6): 101-108.
    The 6th China Conference on Health Information Processing (CHIP 2020) organized six shared tasks in Chinese medical information processing. The second task was entity and relation extraction, which automatically extracts triples consisting of entities and relations from Chinese medical texts. A total of 174 teams signed up for the task, and eventually 17 teams submitted 42 system runs. According to the micro-average F1 score, the key evaluation criterion of the task, the top submitted result reaches 0.6486.
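    For reference, a micro-averaged F1 over extracted (subject, relation, object) triples is typically computed as below; this is a toy example, not the official evaluation script, and the sample triples are invented.
```python
def micro_f1(gold_triples, pred_triples):
    gold, pred = set(gold_triples), set(pred_triples)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("阿司匹林", "治疗", "头痛"), ("青霉素", "不良反应", "过敏")]
pred = [("阿司匹林", "治疗", "头痛"), ("青霉素", "治疗", "肺炎")]
print(round(micro_f1(gold, pred), 4))   # 0.5
```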
  • Machine Reading Comprehension
    JI Yu, WANG Xiaoyue, LI Ru, GUO Shaoru, GUAN Yong
    2022, 36(6): 109-116.
    In machine reading comprehension, multiple-choice reading comprehension is challenged by evidence sentence extraction, owing to the absence of clue annotations and to questions that involve multi-hop reasoning. This paper proposes an evidence sentence extraction model based on a combination of multiple modules. We first use a small amount of labeled data to fine-tune the pre-trained model. Then the evidence sentences for multi-hop reasoning questions are extracted recursively through TF-IDF. Finally, an unsupervised method is combined to further filter the model predictions and reduce redundancy. Tested on the Chinese Gaokao and RACE datasets, the proposed method achieves an increase of 3.44% in F1 value over the best baseline model for evidence sentence extraction. Meanwhile, the final question-answering accuracy with the identified evidence sentences as input is improved by 3.68% and 3.6%, respectively, compared with using the full text as input.
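    A minimal sketch, under assumptions, of the recursive TF-IDF step described above: retrieve the sentence most similar to the query, append it to the query, and repeat so that bridge sentences for multi-hop questions can be reached. It uses scikit-learn's TfidfVectorizer; the hop count and toy sentences are illustrative.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recursive_evidence(question, sentences, hops=2):
    selected, query = [], question
    remaining = list(sentences)
    for _ in range(hops):
        if not remaining:
            break
        vec = TfidfVectorizer().fit(remaining + [query])
        sims = cosine_similarity(vec.transform([query]), vec.transform(remaining))[0]
        best = int(sims.argmax())
        selected.append(remaining.pop(best))
        query = query + " " + selected[-1]   # expand the query for the next hop
    return selected

sents = ["the capital of france is paris",
         "paris hosted the 1900 summer olympics",
         "bananas are rich in potassium"]
print(recursive_evidence("which city hosted the 1900 olympics", sents))
```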
  • Machine Reading Comprehension
    LI Zezheng, TIAN Zhixing, ZHANG Yuanzhe, LIU Kang, ZHAO Jun
    2022, 36(6): 117-124.
    Current knowledge-enhanced machine reading comprehension focuses on how to integrate external knowledge into existing MRC models, while ignoring the selection of the source of external knowledge. This article first uses the attention mechanism to encode external knowledge, then scores external knowledge from different sources, and finally selects the most helpful knowledge with respect to each question. Compared with the baseline models, our method improves the accuracy by 1.2%.
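    A toy version of scoring encoded knowledge from different sources against the question representation and keeping the most helpful one; a dot-product scorer stands in for the attention-based scoring mentioned above, and the source names and vectors are placeholders.
```python
import numpy as np

def select_knowledge_source(question_vec, source_vecs):
    """source_vecs: dict mapping source name -> (d,) encoded knowledge vector."""
    scores = {name: float(question_vec @ v) for name, v in source_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

rng = np.random.default_rng(1)
q = rng.normal(size=128)
sources = {"source_A": rng.normal(size=128),
           "source_B": rng.normal(size=128),
           "source_C": rng.normal(size=128)}
print(select_knowledge_source(q, sources)[0])
```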
  • Information Retrieval
    JI Xinting, NUO Minghua
    2022, 36(6): 125-134.
    Existing recommendation methods mostly rely on the interactive behavior between users and items, such as purchase records or ratings, to generate recommendations. To mitigate the sparse interactions that hurt recommendation accuracy, this paper proposes a recommendation method that combines tags with a knowledge graph. Tags, with their rich content and inherent semantic information, reflect a user's subjective evaluation of items and can play a key role in recommendation, while the knowledge graph, with its large number of entities, provides more effective features for items. In addition, this paper designs a hybrid attention model that combines attention and self-attention to assign hybrid attention weights to item features based on tags and entities. Experiments on the MovieLens and Last.FM datasets show improved performance of the proposed model compared with other recommendation algorithms.
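    The small sketch below shows one plausible reading of the attention step: tag- and entity-based feature vectors of an item are weighted against the user representation and fused into a single item vector. All vectors are toy data and the scoring function is an assumption, not the paper's exact hybrid attention.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_item_features(user_vec, feature_vecs):
    """feature_vecs: (k, d) tag/entity feature vectors for one item."""
    weights = softmax(feature_vecs @ user_vec)      # attention over features
    return weights @ feature_vecs                   # weighted item representation

rng = np.random.default_rng(0)
user = rng.normal(size=32)
tag_and_entity_feats = rng.normal(size=(5, 32))     # e.g. 3 tags + 2 KG entities
print(fuse_item_features(user, tag_and_entity_feats).shape)   # (32,)
```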
  • Information Retrieval
    WANG Baocheng, LIU Lijun, HUANG Qingsong
    2022, 36(6): 135-145.
    To improve existing keyword-based retrieval in medical question-answering platforms, a hash generation model based on an improved text convolutional neural network is used to detect semantically similar questions, so as to better handle the diversified expressions and frequent negative words in such text. The candidate set is then filtered and re-ranked with a more accurate text matching model. The whole model is constructed within an ensemble learning framework. First, a Siamese-BERT model is adopted to better extract semantics. Then, a BERT-Match model is applied to better capture the local correlations between questions with the help of BERT's multi-head attention mechanism. Finally, a gradient boosting decision tree is used to combine the semantic features and statistical features. Experiments show that this method achieves better results in similar question detection and text matching.
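    A sketch of the final ensemble step under stated assumptions: semantic similarity scores (standing in for the BERT-based model outputs) are concatenated with hand-crafted statistical features and fed to a gradient boosting classifier over question pairs. The feature values and labels are random stand-ins.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_pairs = 200
semantic_feats = rng.random((n_pairs, 2))     # e.g. Siamese-BERT and BERT-Match scores
statistical_feats = rng.random((n_pairs, 3))  # e.g. length ratio, token overlap, edit distance
X = np.hstack([semantic_feats, statistical_feats])
y = rng.integers(0, 2, n_pairs)               # 1 = the two questions match

clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict(X[:5]))
```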
  • Natural Language Generation
    CUI Zhuo, LI Honglian, ZHANG Le, LYU Xueqiang
    2022, 36(6): 146-154.
    Text summarization aims at generating a brief and accurate summary from a lengthy text without changing its original semantics. A novel summarization method called the Add Sememe-Pointer Model (ASPM) is proposed in this paper. ASPM applies a pointer network within the Seq2Seq framework to solve the out-of-vocabulary problem. Because of polysemy in Chinese, the pointer network alone does not fully capture the text semantics, leading to poor performance. Our method therefore uses a sememe knowledge base to train word vector representations of polysemous words, which can accurately capture the specific meaning of a word in context, and we annotate some polysemous words in the LCSTS dataset so that the method can better understand the semantic information of the words in the dataset. The experimental results show that ASPM achieves higher ROUGE scores and makes the Chinese summaries more readable.
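    A toy illustration of how sememe annotations can disambiguate a polysemous word: each sense is represented by the average of its sememe vectors, and the sense closest to the context vector is chosen. The sememe inventory and vectors are invented; this is only the general idea, not ASPM's training procedure.
```python
import numpy as np

def choose_sense_vector(context_vec, sense_to_sememe_vecs):
    best_vec, best_score = None, -np.inf
    for sense, sememe_vecs in sense_to_sememe_vecs.items():
        sense_vec = np.mean(sememe_vecs, axis=0)    # sense = average of its sememes
        score = float(context_vec @ sense_vec)
        if score > best_score:
            best_vec, best_score = sense_vec, score
    return best_vec

rng = np.random.default_rng(0)
senses = {"fruit": [rng.normal(size=50) for _ in range(2)],    # e.g. 苹果 as fruit
          "company": [rng.normal(size=50) for _ in range(3)]}  # e.g. 苹果 as brand
print(choose_sense_vector(rng.normal(size=50), senses).shape)
```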
  • NLP Application
    YANG Bingbing, ZHAO Huizhou, WANG Zhimin
    2022, 36(6): 155-161.
    COVID-19 has made online teaching an inevitable trend. This paper presents materials suitable for the automatic pushing of Chinese teaching content. First, we analyze the overall characteristics of the vocabulary based on a spoken Chinese corpus of 10,341 texts. On this basis, using the Chinese word vector data published by Tencent AI Lab, we apply the K-means algorithm to cluster the spoken words. We then construct a Chinese spoken topic-scene material library with reference to the word clustering results and an investigation of the topics and scenes in the spoken corpus. The library contains 15 primary topics, 102 secondary topics and 81 communicative scenes. We also summarize the common words for topics at all levels. This work provides resource support for a material library for the automatic customization of teaching materials.
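    A minimal example of the clustering step: K-means over pre-trained word vectors to group spoken-Chinese vocabulary into candidate topic clusters. Random vectors stand in for the Tencent AI Lab embeddings, and the word list and number of clusters are illustrative.
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
words = ["你好", "谢谢", "机票", "行李", "点菜", "菜单"]
vectors = rng.normal(size=(len(words), 200))      # stand-in for the pre-trained vectors

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(word, label)
```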
  • NLP Application
    ZHOU Ai, SANG Chen, ZHANG Yijia, LU Mingyu
    2022, 36(6): 162-170.
    Authorship attribution, which analyzes individuals' writing styles, has been extensively studied across a wide range of languages. To address Chinese authorship attribution for classical poetry, which has received little attention, this paper proposes a dual-channel Cap-Transformer model. The capsule network in the upper channel extracts features while reducing information loss, better capturing the semantics of each individual image in Tang poetry. The Transformer in the lower channel captures the global semantic information conveyed by all the imagery in a Tang poem with the help of the multi-head self-attention mechanism. The experimental results suggest that our model is well suited to authorship attribution of classical Tang Dynasty poetry, and the error analysis further probes the problems and challenges of this task.
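    A schematic dual-channel classifier in the spirit of Cap-Transformer: a local-feature encoder (a plain convolution standing in for the capsule network) and a Transformer encoder are fused for author classification. The architecture details, dimensions and vocabulary size are illustrative, not the paper's exact model.
```python
import torch
import torch.nn as nn

class DualChannelClassifier(nn.Module):
    def __init__(self, vocab_size=5000, d_model=128, num_authors=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # upper channel: local-feature encoder (placeholder for the capsule network)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # lower channel: global semantics via self-attention
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(2 * d_model, num_authors)

    def forward(self, x):                         # x: (B, T) token ids
        e = self.embed(x)                         # (B, T, d)
        upper = self.conv(e.transpose(1, 2)).mean(dim=2)      # (B, d)
        lower = self.transformer(e).mean(dim=1)               # (B, d)
        return self.fc(torch.cat([upper, lower], dim=-1))     # author logits

model = DualChannelClassifier()
poem = torch.randint(0, 5000, (2, 28))            # two 28-character quatrains
print(model(poem).shape)                          # torch.Size([2, 10])
```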