2020 Volume 34 Issue 4 Published: 01 June 2020
  

  • Language Analysis and Calculation
    CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan
    2020, 34(4): 1-9.
The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging and named entity recognition. To avoid error accumulation in pipeline processing, this paper proposes a joint approach to sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is evaluated on four cross-era test sets to verify its generalization ability and the effect of different label granularities on sentence segmentation and lexical analysis. Experiments show that the joint model improves the F1-scores of sentence segmentation, word segmentation and part-of-speech tagging: 78.95% for sentence segmentation (an average increase of 3.5%), 85.73% for word segmentation (an average increase of 0.18%), and 72.65% for part-of-speech tagging (an average increase of 0.35%).
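A joint approach of this kind typically tags each character with a composite label that encodes word boundary, part of speech and sentence boundary at once, so a single sequence labeler handles all three tasks. A minimal Python sketch of such a joint tag scheme (the label format and the example are illustrative, not the paper's exact scheme):

```python
# Hypothetical joint tag scheme: each character gets a composite label
# "<seg>-<pos>" with an extra "-EOS" flag on sentence-final characters,
# so one tagger jointly predicts segmentation, POS and sentence boundaries.

def joint_labels(sentence):
    """sentence: list of (word, pos) pairs forming one sentence.
    Returns one composite label per character."""
    labels = []
    for w_idx, (word, pos) in enumerate(sentence):
        for c_idx, _ in enumerate(word):
            if len(word) == 1:
                seg = "S"          # single-character word
            elif c_idx == 0:
                seg = "B"          # word-initial character
            elif c_idx == len(word) - 1:
                seg = "E"          # word-final character
            else:
                seg = "I"          # word-internal character
            last = (w_idx == len(sentence) - 1
                    and c_idx == len(word) - 1)
            labels.append(f"{seg}-{pos}" + ("-EOS" if last else ""))
    return labels

tags = joint_labels([("天下", "n"), ("大", "a"), ("同", "v")])
# → ['B-n', 'E-n', 'S-a', 'S-v-EOS']
```

Decoding the "-EOS" flag back out of the predicted label sequence recovers sentence boundaries without a separate segmentation pass.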
  • Language Analysis and Calculation
    LIU Yahui, YANG Haoping, LI Zhenghua, ZHANG Min
    2020, 34(4): 10-20.
As the main formalism of shallow semantic parsing, semantic role labeling is one of the hot research topics in natural language processing (NLP). There are three main problems in existing annotation guidelines (i.e., the PropBank annotation guideline and the Peking University guideline). First, the span-based argument representation complicates the annotation process. Second, it is difficult to define the frames of the predicates in the PropBank annotation guideline. Third, the Peking University guideline does not annotate omitted arguments. Through a thorough investigation of existing Chinese and English annotation guidelines, we develop a lightweight annotation guideline for Chinese semantic role labeling suitable for ordinary annotators, combining the advantages of existing guidelines and considering the real problems encountered during our annotation process. First, we choose the word-based argument representation to avoid determining span boundaries and thus reduce annotation difficulty. Second, annotators can directly annotate the arguments of a predicate word according to the sentential context, without pre-defining all semantic frames of the predicate. Third, we explicitly annotate omitted core arguments to describe the semantic information of sentences more precisely. Additionally, to ensure annotation consistency and improve annotation quality, the proposed guideline gives clear priority and difficulty analyses for various complex linguistic phenomena.
  • Language Analysis and Calculation
    DAI Yuling, DAI Rubing, FENG Minxuan, LI Bin, QU Weiguang
    2020, 34(4): 21-29.
Function words carry rich grammatical meanings and are crucial to sentence comprehension. Existing linguistic research on function words cannot be directly adopted in computational linguistics due to the lack of formal representation. In this paper, to represent their syntactic and semantic information, we align words and conceptual relations in the abstract meaning representation (AMR), which is based on concept graphs, so that function words correspond either to nodes or to arcs between conceptual nodes. Then, 8,587 sentences from PEP primary school Chinese textbooks are selected for AMR annotation. Among the 24,801 function word tokens in this corpus, 58.80% are prepositions, conjunctions and structural auxiliaries, which correspond to relations between concepts, and 41.20% are modal and aspectual words, which express concepts. This shows that AMR represents function words dynamically, providing better theory and resources for the syntactic and semantic analysis of whole sentences.
  • Language Resources Construction
    ZAN Hongying, HAN Yangchao, FAN Yaxin, NIU Chengzhi, ZHANG Kunli, SUI Zhifang
    2020, 34(4): 30-37.
Building a large-scale knowledge base is an essential task in the fields of artificial intelligence and natural language understanding. As an important basis for describing the subjective feelings of patients and diagnosing diseases, symptoms are important factors in optimizing tasks such as intelligent consultation and medical question answering. Building on existing research, this paper constructs an open Chinese symptom knowledge base according to the concept and characteristics of symptoms and their roles in medical diagnosis. The knowledge base describes relevant attributes such as the ontology taxonomy of symptoms, related diseases, related body parts and susceptible populations, covering a total of 146,631 attribute relationships over 8,772 symptoms. The constructed symptom knowledge base is an important part of the Chinese medical knowledge graph, providing a data foundation for applications such as KBQA, knowledge reasoning and decision support.
  • Language Resources Construction
    WU Ting, LI Mingyang, KONG Fang
    2020, 34(4): 38-46.
With the rapid development of the information age, data resources in the network show a spurt of growth. Building a usable knowledge base of a certain scale, by mining deep structured information from massive unordered data, is of great significance to natural language processing tasks. The hypernymy relation is a basic building block of a knowledge base, but most existing corpora are limited to the general domain and neglect hypernymy across sentences or discourses. This paper proposes a discourse-level hypernymy labeling strategy based on synonymous reasoning, and constructs a discourse-level corpus from news and technical literature in the field of defense science and technology. In total, we annotate 11,020 semantic relationships in 962 texts, and the consistency of the entity relationship labeling reaches 0.82. Our work lays a corpus foundation for research on hypernymy detection in the field of national defense science and technology.
  • Machine Translation
    MING Yuqin, XIA Tian, PENG Yanbing
    2020, 34(4): 47-54.
A subtle perturbation in the input can degrade the performance of Neural Machine Translation (NMT). This work proposes a neural machine translation method incorporating adversarial learning. Given a source sentence, we construct a new sequence by adding subtle noise to it, so that the two sequences have similar semantics. Both sequences are then fed to the encoder to generate their respective vector representations, which are passed to the generator and discriminator for further processing. Finally, we compare the translation performance before and after adding the noise. The results show that the model both improves translation performance and is robust to noisy input.
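As an illustration of the noise step, one simple way to build a semantics-preserving noisy copy of the source is to swap a few adjacent tokens. This is a hypothetical sketch, not the paper's exact perturbation procedure:

```python
import random

# Illustrative noise function: randomly swap adjacent tokens with a small
# probability, producing a noisy sequence with the same tokens (and hence
# close semantics) as the original source sentence.

def perturb(tokens, swap_prob=0.1, seed=0):
    rng = random.Random(seed)       # seeded for reproducibility
    noisy = list(tokens)
    i = 0
    while i < len(noisy) - 1:
        if rng.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2                  # do not move the same token twice
        else:
            i += 1
    return noisy

clean = ["the", "cat", "sat", "on", "the", "mat"]
noisy = perturb(clean, swap_prob=0.3, seed=42)
```

Both `clean` and `noisy` would then be encoded, and the discriminator trained to tell their representations apart while the encoder learns to make them indistinguishable.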
  • Ethnic Language Processing and Cross Language Processing
    LI Jin, GAO Jing, CHEN Junjie, WANG Yongjun
    2020, 34(4): 55-59,68.
Each morpheme of Mongolian has a different written form at different positions within a word, which makes the set of Mongolian script glyphs diverse and enormous. As a result, designing Mongolian fonts with computer-assisted or manual methods takes substantial manpower and material resources. This paper applies a conditional generative adversarial network model to Mongolian font style transfer. The model measures training progress with a generative loss and a discriminative loss, and the Adam optimizer automatically adjusts the learning rate, gradually reducing the difference until the generator and discriminator reach a Nash equilibrium. Experiments on a Mongolian font dataset show that new Mongolian fonts can be generated directly from the Mongolian title font, and the generated fonts are close to the real font styles.
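For paired glyph-to-glyph translation, the generator objective commonly combines an adversarial term with an L1 pixel term that keeps the generated glyph close to the target glyph. A small sketch under that assumption (the loss form and `lambda_l1` value are illustrative, not taken from the paper):

```python
import math

# Sketch of a pix2pix-style generator loss for font transfer:
# adversarial term (fool the discriminator) + weighted L1 pixel term
# (stay close to the target glyph). Images are flat lists of pixels in [0,1].

def generator_loss(d_fake, fake_img, real_img, lambda_l1=100.0):
    """d_fake: discriminator scores in (0,1] for generated glyphs."""
    adv = -sum(math.log(p + 1e-12) for p in d_fake) / len(d_fake)
    diffs = [abs(f - r) for f, r in zip(fake_img, real_img)]
    l1 = sum(diffs) / len(diffs)
    return adv + lambda_l1 * l1
```

When the generator fools the discriminator perfectly and reproduces the target glyph exactly, both terms vanish; the L1 term dominates early training and anchors the glyph shape.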
  • Information Extraction and Text Mining
    LIU Suwen, SHAO Yifan, QIAN Longhua
    2020, 34(4): 60-68.
Biomedical causality extraction is an evaluation task proposed by the BioCreative community to explore the rich semantic relationships between biomedical entities. Unlike traditional entity relation extraction, which focuses only on binary relationships, this task includes the identification of functions acting on one or more entities. Based on the idea of multi-task learning, a joint learning model is proposed in which binary relation extraction and unary function detection share decision-making. On top of shared word embeddings, LSTMs with a gating mechanism are employed to learn interactive representations between the two tasks, and the final predictions are made separately. Experimental results show that this method can exploit the information of both tasks, achieving a 45.3% F-score on the 2015 BC-V dataset.
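The gated cross-task interaction can be pictured as a per-dimension sigmoid gate that decides how much of the other task's hidden state leaks into each task's own representation. A toy sketch with plain lists and fixed gate logits (hypothetical, not the paper's exact architecture):

```python
import math

# Toy gated mixing of two task representations: a sigmoid gate per
# dimension interpolates between a task's own hidden vector and the
# other task's vector, letting the tasks exchange information.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_mix(own, other, gate_logits):
    """own/other: hidden vectors of the two tasks;
    gate_logits: per-dimension logits (learned in a real model)."""
    return [sigmoid(g) * o + (1.0 - sigmoid(g)) * t
            for o, t, g in zip(own, other, gate_logits)]
```

With large positive logits the gate keeps the task's own representation; with logits near zero it blends the two equally, which is how the model trades off task-specific and shared information.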
  • Information Extraction and Text Mining
    CHENG Yusi, SHI Yuntao
    2020, 34(4): 69-76.
Chinese person name recognition is restricted by the domain and size of existing annotated corpora and by class imbalance. Person name dictionaries and domain dictionaries are easier to obtain than manually annotated training corpora. This article incorporates dictionaries into bi-directional long short-term memory (Bi-LSTM) networks with a weighted conditional random field layer (WCRF). The model extracts the likelihood of family names and given names from person name dictionaries, while the domain dictionaries provide domain word information. The Bi-LSTM captures context information, and the weighted conditional random field improves the recall of person name recognition. Experiments on the People's Daily corpus and a construction law corpus show that, compared with the existing method based on hidden Markov models, the F1 value of person name recognition is improved by 18.34%; compared with the traditional Bi-LSTM-CRF model, the recall increases by 15.53% and the F1 value by 8.83%.
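The weighting idea behind a recall-oriented CRF layer can be illustrated by scaling per-token losses by tag class, so that mistakes on the rare person-name tags cost more than mistakes on the dominant O tag. A sketch with illustrative weights (the real WCRF applies the weighting inside CRF training, not token by token):

```python
# Class-weighted token loss: person-name tags are penalised more heavily
# than O tags, pushing the model toward higher recall on the rare class.
# The weight values are illustrative, not taken from the paper.

TAG_WEIGHTS = {"B-PER": 3.0, "I-PER": 3.0, "O": 1.0}

def weighted_loss(gold_tags, token_losses, weights=TAG_WEIGHTS):
    """gold_tags: gold label per token;
    token_losses: per-token negative log-likelihoods from the model."""
    return sum(weights.get(t, 1.0) * l
               for t, l in zip(gold_tags, token_losses))
```

Under this weighting, missing a `B-PER` token contributes three times as much loss as mislabeling an `O` token, which is the mechanism by which recall improves at some cost in precision.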
  • Machine Reading Comprehension and Text Generation
    ZHENG Jie, KONG Fang, ZHOU Guodong
    2020, 34(4): 77-84.
Ellipsis is a common linguistic phenomenon, especially in short texts such as question answering and dialogue. To better understand the semantic information of short texts, we propose a multi-attention fusion model for Chinese ellipsis recovery. The model combines context and text information through a gate mechanism, multi-attention and self-attention. Experiments on several short text corpora show that the model can efficiently detect ellipsis positions and recover elided content, facilitating better comprehension of short texts.
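One ingredient of such a fusion, self-attention over the short text, can be sketched in plain Python. This is unparameterized scaled dot-product attention; the actual model would add learned query/key/value projections:

```python
import math

# Minimal self-attention: every position attends to every position of the
# same sequence; each output vector is a softmax-weighted average of all
# input vectors, scaled by sqrt(d) as in standard dot-product attention.

def self_attention(vectors):
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        m = max(scores)                          # stabilise the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out
```

In the ellipsis-recovery setting, attention weights concentrated around a candidate position are one signal that elided material should be restored there.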
  • Machine Reading Comprehension and Text Generation
    TAN Hongye, LI Xuanying, LIU Bei
    2020, 34(4): 85-91.
Reading comprehension (RC) refers to automatically answering questions about a given text, and has become a popular topic in natural language processing. Many deep learning RC methods have been proposed, but they do not fully understand the question and the discourse, leading to poor model performance. To address this problem, this paper proposes a reading comprehension method based on external knowledge and hierarchical discourse representation. The method uses external knowledge and question types to enhance question comprehension, and utilizes a hierarchical discourse representation to improve the understanding of the discourse. Moreover, the two subtasks of question type prediction and answer prediction are jointly optimized in a unified framework. Experiments on the DuReader dataset show that the proposed method improves performance by up to 8.2%.
  • Sentiment Analysis and Social Computing
    CHENG Yan, ZHU Hai, XIANG Guoxiong, TANG Tianwei, ZHONG Linhui, WANG Guowei
    2020, 34(4): 92-100.
Text emotion classification is an active research task in natural language processing. To deal with imbalanced data, which hurts classification performance, this paper proposes an emotion classification method combining CNN and the EWC algorithm. First, the method uses random under-sampling to obtain multiple balanced datasets for training. Then it trains the CNN on each balanced dataset in sequence, introducing the EWC algorithm in the training process to overcome catastrophic forgetting in the CNN. Finally, the CNN model trained on the last dataset is treated as the final classifier. Experimental results show that the proposed method is superior to the ensemble learning framework based on under-sampling and multiple classification algorithms, and outperforms a multi-channel LSTM neural network by 1.9% and 2.1% in accuracy and G-mean, respectively.
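The EWC regularizer that counters catastrophic forgetting anchors parameters that were important on earlier balanced subsets, penalizing movement away from them in proportion to their (diagonal) Fisher information. A minimal sketch (the λ value is illustrative):

```python
# Elastic Weight Consolidation penalty: (lam / 2) * sum_i F_i * (p_i - p*_i)^2,
# where p* are the parameter values after the previous subset and F is the
# diagonal Fisher information estimating each parameter's importance.

def ewc_penalty(params, old_params, fisher, lam=0.5):
    """lam is the regularisation strength (illustrative value)."""
    return lam / 2.0 * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))
```

During training on the next balanced subset, this penalty is added to the CNN's classification loss, so parameters with high Fisher value stay near the solution found on earlier subsets.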
  • NLP Application
    CHENG Yong, XU Dekuan, DONG Jun
    2020, 34(4): 101-110.
Automatic grading of text reading difficulty aims to judge the difficulty level of a text automatically according to its features. In this paper, we propose a novel difficulty grading method based on multiple linguistic features and deep features. The method considers various linguistic features at the character, vocabulary and sentence levels, in terms of frequency, length, complexity, richness and coherence. In addition, a BERT-based pre-trained neural network model is used to extract deep features of the text's sentences. On this basis, an end-to-end neural network is constructed to fuse the linguistic features and deep features. Our method achieves good performance in automatic grading, outperforming methods based on traditional linguistic features and on popular neural networks.
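The shallow side of such a model can be illustrated with a few surface statistics at the character, vocabulary and sentence levels, concatenated with a deep sentence vector. The feature choices here are illustrative, not the paper's exact feature set, and the deep vector is a placeholder list standing in for a BERT embedding:

```python
# Toy readability features: text length (character level), type-token
# ratio (vocabulary richness) and average sentence length (sentence level),
# fused by simple concatenation with a deep feature vector.

def linguistic_features(text, sentence_sep="。"):
    sentences = [s for s in text.split(sentence_sep) if s]
    chars = [c for s in sentences for c in s]
    vocab = set(chars)
    return [
        len(chars),                                               # text length
        len(vocab) / max(len(chars), 1),                          # richness
        sum(len(s) for s in sentences) / max(len(sentences), 1),  # avg sent len
    ]

def fuse(text, deep_vector):
    """Concatenate shallow features with a deep sentence vector
    (placeholder for a BERT embedding)."""
    return linguistic_features(text) + list(deep_vector)
```

The fused vector would then feed a small classifier head that predicts the difficulty grade end to end.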