2020 Volume 34 Issue 5 Published: 15 June 2020
  

  • Language Analysis and Calculation
    DU Jiaju, QI Fanchao, SUN Maosong, LIU Zhiyuan
    2020, 34(5): 1-9.
    Sememes, defined in linguistics as the minimum semantic units of human languages, have been proven useful in many NLP tasks. Since the manual construction and update of sememe knowledge bases (KBs) are costly, automatic sememe prediction has been employed to assist sememe annotation. In this paper, we explore the use of dictionary definitions to predict sememes for unannotated words. We find that the sememes of a word are usually semantically related to different words in its dictionary definition, and we name this matching relationship local semantic correspondence. Accordingly, we propose a Sememe Correspondence Pooling (SCorP) model able to capture this kind of matching to predict sememes. Evaluated on HowNet, our model achieves state-of-the-art performance and properly learns the local semantic correspondence between sememes and words in dictionary definitions.
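The local-semantic-correspondence idea can be illustrated with a minimal numpy sketch (not the authors' implementation): score each candidate sememe against every word in the definition, then max-pool over definition words, so a sememe only needs to correspond to one local part of the definition to score highly. Shapes and the dot-product scoring are assumptions for illustration.

```python
import numpy as np

def sememe_scores(def_words: np.ndarray, sememes: np.ndarray) -> np.ndarray:
    """Score candidate sememes against a word's dictionary definition.

    def_words: (n_words, dim) embeddings of the definition's words.
    sememes:   (n_sememes, dim) embeddings of all candidate sememes.
    Each sememe is matched against every definition word (dot product),
    then max-pooled over words: a sememe scores highly as soon as it
    corresponds to some local part of the definition.
    """
    sim = sememes @ def_words.T   # (n_sememes, n_words) matching matrix
    return sim.max(axis=1)        # pool over definition words

# toy example: a 3-word definition, 4 candidate sememes, dim 2
rng = np.random.default_rng(0)
scores = sememe_scores(rng.normal(size=(3, 2)), rng.normal(size=(4, 2)))
```

In the paper's setting, the top-scoring sememes would then be predicted for the unannotated word.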
  • Language Analysis and Calculation
    HE Yuhong, HUANG Peijie, DU Zefeng, LIU Wei, ZHU Jiankai, ZHANG Jinchuan
    2020, 34(5): 10-18.
    Recently, recurrent neural networks with an attention mechanism have achieved strong results on text classification. However, when labeled training data are scarce, as in the Chinese utterance domain classification (DC) task, the data sparseness of domain entity mentions remains a significant challenge. To address this issue, this paper proposes knowledge-based neural DC (K-NDC) models that incorporate domain knowledge from external sources to enrich utterance representations. First, domain entities and types are obtained by distant supervision from CN-Probase. Then, domain-specific named entity recognition (NER) and a complementary knowledge base are exploited to further extend the knowledge coverage. Finally, we design a novel mechanism for merging knowledge with utterance representations at a fine-grained (Chinese word) level. Experiments on the SMP-ECDT benchmark corpus show that, compared with state-of-the-art text classification models, the proposed method achieves better performance, especially in knowledge-intensive domains.
  • Language Resources Construction
    ZAN Hongying, LIU Tao, NIU Changyong, ZHAO Yueshu, ZHANG Kunli, SUI Zhifang
    2020, 34(5): 19-26.
    In existing medical corpora, the classification systems for entities and entity relations can hardly meet the development requirements of precision medicine. This paper focuses on pediatric diseases and, under the guidance of medical experts, constructs an annotation system and detailed annotation schemes for named entities and entity relations. By fusing relevant medical standards, annotation tools are applied for machine pre-annotation, manual annotation, and manual proofreading of entities and entity relations in pediatric medical texts of more than 2.98 million words, yielding a corpus of medical entities and entity relations for 504 common pediatric diseases. In this corpus, 23,603 named entities and 36,513 entity relations were annotated, with multi-round annotation consistency of 0.85 and 0.82, respectively. Based on the annotated corpus, this paper also constructs a pediatric medical knowledge graph and develops a pediatric medical knowledge QA system.
  • Language Resources Construction
    WU Xian, HU Junfeng
    2020, 34(5): 27-35.
    Corpus linguistics discovers linguistic phenomena by means of large-scale corpora, and many online corpora now exist to assist linguists. This paper presents a corpus managed by time slices, and further proposes a community-maintained online lexicography system that dynamically combines corpus query results into edited entries. It also introduces a word-sense discovery and hierarchical clustering algorithm for polysemous words that automatically generates a default entry frame. The article reviews the overall lexicographic system and highlights its design and use.
  • Machine Translation
    CAO Qian, XIONG Deyi
    2020, 34(5): 36-43.
    Neural machine translation is currently the most popular approach in machine translation, while translation memory is a tool that helps professional translators avoid repeated translations. This paper proposes two methods to integrate translation memory into neural machine translation via data augmentation: (1) directly stitching the translation memory after the source sentence; (2) stitching the translation memory via tag embedding. Experiments on Chinese-English and English-German datasets show that the proposed methods achieve significant improvements.
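Method (1) amounts to a simple preprocessing step. The sketch below shows the idea under stated assumptions: the `<tm>` separator tag and the function name are illustrative, not taken from the paper.

```python
def augment_with_tm(source: str, tm_target: str, sep: str = "<tm>") -> str:
    """Data augmentation by stitching: append the target side of the
    retrieved translation-memory match after the source sentence,
    separated by a special tag, so a standard NMT encoder can attend
    to both the source and the TM fragment."""
    return f"{source} {sep} {tm_target}"

# the augmented string replaces the plain source in NMT training/decoding
augmented = augment_with_tm("他 喜欢 猫", "He likes dogs")
```

The decoder is then trained end to end and learns to copy or adapt useful fragments of the TM match.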
  • Ethnic Language Processing and Cross Language Processing
    CAI Zhijie, CAI Rangzhuoma, SUN Maosong
    2020, 34(5): 44-49.
    Word embedding represents words as optimized vectors so that computers can process natural language. The study of Tibetan word embedding is of great significance for analyzing Tibetan features and applying deep learning techniques to Tibetan processing. This paper proposes a Tibetan word embedding method that jointly trains components, characters, and words as multiple primitives, named the multi-primitive joint training model (TCCWE). Verified on word similarity/relatedness tasks, the proposed method improves performance by 3.35% on TWordSim215 and 4.36% on TWordRel215.
  • Ethnic Language Processing and Cross Language Processing
    HUA Danzhaxi, CAI Zhijie, BAN Mabao
    2020, 34(5): 50-55.
    The spelling check task aims to detect errors in text quickly and to improve the efficiency of text proofreading. Based on extensive research and analysis of Tibetan spell checking and language modeling, we utilize an LSTM architecture, well known for capturing long-distance dependencies, to build the TC_LSTM language model, and design a Tibetan spell-checking algorithm based on this language model. Experiments show that our approach surpasses the baseline model significantly, indicating the effectiveness of the proposed model.
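The detection step of an LM-based spell checker can be sketched independently of the model: flag any token whose probability under the language model, given its left context, falls below a threshold. In the sketch below, `logprob` is a stand-in for the TC_LSTM model; the threshold value and the toy scoring function are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

def flag_errors(tokens: List[str],
                logprob: Callable[[List[str], int], float],
                threshold: float = -6.0) -> List[Tuple[int, str]]:
    """Generic LM-based spell check: flag token i when the language
    model's log-probability of tokens[i] given its left context is
    below `threshold`. `logprob` stands in for a trained LM such as
    the paper's TC_LSTM."""
    return [(i, t) for i, t in enumerate(tokens)
            if logprob(tokens, i) < threshold]

# toy stand-in LM: tokens in a small "known" set are likely,
# everything else is assigned a very low log-probability
def toy_logprob(tokens: List[str], i: int) -> float:
    return -1.0 if tokens[i] in {"བོད", "ཡིག"} else -9.0

flagged = flag_errors(["བོད", "xxx", "ཡིག"], toy_logprob)  # [(1, 'xxx')]
```

A real checker would also propose corrections, e.g. by rescoring candidate replacements with the same LM.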
  • Information Extraction and Text Mining
    HAN Pengyu, GAO Shengxiang, YU Zhengtao, HUANG Yuxin, GUO Junjun
    2020, 34(5): 56-63,73.
    The summarization of public opinion news on a judicial case aims to condense the important information of public comments on the case into a short summary. Compared with open-domain text summarization, this kind of summary usually involves specific case elements that provide substantial guidance during summary generation. Therefore, a case-related news text summarization method is proposed within a deep learning framework. First, a dataset of public opinion news summaries is collected and the case elements are defined. Then, through the attention mechanism, the case element information is integrated into the two-layer encoding of words and sentences in the news text to generate a text representation that contains the case element information. Finally, a multi-feature classification layer is used to classify the sentences. Experiments on the public opinion news summary dataset show that the proposed method outperforms the baseline model.
  • Information Extraction and Text Mining
    LUO Zhunchen, ZHAO He, YE Yuming, LIU Xiaopeng
    2020, 34(5): 64-73.
    The hyperlinked resources in scientific literature include data, code, documents, and web pages, and the citation context of these resources reflects the relationship between research subjects and scientific resources in research activities. Based on a fine-grained analysis of citation information in the literature, this paper proposes a novel knowledge modeling method to characterize resource categories and resource citation purposes, and conducts an empirical evaluation on large-scale scientific literature datasets. Through a detailed analysis of the utilization of scientific resources at home and abroad, we explore the importance, development directions, and usage risks of such resources. This framework can be used to track the progress of advanced technologies at home and abroad, and can further provide scientific evidence for assessing critical resources in China's scientific research activities.
  • Machine Reading Comprehension and Text Generation
    TAN Hongye, SUN Xiuqin, YAN Zhen
    2020, 34(5): 74-81.
    Text-based question generation produces related questions from a given sentence or paragraph. In recent years, sequence-to-sequence neural models have been used to generate questions for sentences containing answers. However, these methods have two limitations: (1) the generated interrogatives do not match the answer type; and (2) the relevance between the question and the answer is weak. This paper proposes a question generation model based on the answer and its contextual information. The model first determines interrogatives that match the answer type according to the relationship between the answer and the context. Then, it uses the answer and the context to determine question-related words, so that questions reuse words from the original text as much as possible. Finally, it combines answer features, interrogatives, question-related words, and the original sentence as inputs to generate a question. Experiments show that the proposed model significantly outperforms the baseline systems.
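The first step, choosing an interrogative that matches the answer type, can be approximated by a type-to-question-word lookup. The type labels and table below are hypothetical placeholders, not the paper's actual inventory:

```python
# hypothetical answer-type -> interrogative table; a real system would
# derive the answer type from NER or from answer-context features
INTERROGATIVES = {
    "PERSON": "who",
    "LOCATION": "where",
    "TIME": "when",
    "NUMBER": "how many",
}

def pick_interrogative(answer_type: str) -> str:
    """Return a question word matching the answer type, with a
    generic fallback for unknown types."""
    return INTERROGATIVES.get(answer_type, "what")

qword = pick_interrogative("PERSON")  # 'who'
```

The selected interrogative is then fed to the generator alongside answer features and question-related words.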
  • Machine Reading Comprehension and Text Generation
    ZHOU Qi’an, LI Zhoujun
    2020, 34(5): 82-90.
    Natural language understanding in task-oriented dialog systems parses sentences entered by the user in natural language and extracts structured information for subsequent processing. This paper proposes an improved natural language understanding model that uses BERT as the encoder and builds the decoder with an LSTM and an attention mechanism. Furthermore, this paper proposes two tuning techniques for this model: a training method with fixed model parameters, and the use of a case-sensitive pretrained model. Experiments show that the improved model and tuning techniques achieve sentence-level accuracies of 0.8833 on ATIS and 0.9251 on Snips.
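The core of such a decoder is a single attention step: the LSTM state attends over the encoder outputs to form a context vector. The numpy sketch below shows plain dot-product attention as a simplified stand-in for the paper's decoder; the exact scoring function used in the paper is not specified here.

```python
import numpy as np

def attention_step(dec_state: np.ndarray, enc_states: np.ndarray):
    """One dot-product attention step of an LSTM decoder over
    encoder (e.g. BERT) outputs.

    dec_state:  (dim,)          current decoder hidden state
    enc_states: (seq_len, dim)  encoder output for each input token
    Returns the context vector and the attention weights.
    """
    scores = enc_states @ dec_state                  # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax
    context = weights @ enc_states                   # (dim,)
    return context, weights

ctx, w = attention_step(np.ones(4), np.eye(4))
```

The context vector is concatenated with the decoder state to predict, for example, the slot label of the current token.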
  • Information Retrieval and Question Answering
    ZHAO Yixin, NIU Shuzi, JI Chunyan, LU Fei, XU Rui
    2020, 34(5): 91-99.
    Entity retrieval from knowledge graphs is of substantial significance, given the emergence of large-scale knowledge graphs and the industrial demand for effectively managing domain knowledge graphs. Given a knowledge graph and a user query, entity retrieval aims to obtain a ranked list of entities from the knowledge graph according to their relevance to the query. Treating the task as matching between the query and entities, traditional entity retrieval models map both user queries and entities into the word feature space; however, this fails when the words in an entity name are assumed to be independent. In this paper, we propose to project both user queries and entities into a dual feature space, namely the entity-word feature space. First, we represent entities as multiple fields and extract ranking features from them. Then, learning-to-rank models are employed to train a ranking model in this dual feature space. Experimental results on benchmark datasets show that our proposed method outperforms state-of-the-art baselines significantly.
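A dual (entity-word) feature space pairs word-level signals with entity-level ones before learning to rank. The sketch below is illustrative only: the feature names, the overlap formula, and the entity-linking input are assumptions, not the paper's feature set.

```python
from typing import Dict, List, Set

def dual_features(query_words: List[str], entity_name: List[str],
                  entity_id: str, query_entities: Set[str]) -> Dict[str, float]:
    """Extract one word-level and one entity-level ranking feature
    for a (query, entity) pair; a learning-to-rank model would be
    trained on vectors like this."""
    overlap = len(set(query_words) & set(entity_name))
    return {
        # word space: fraction of the entity name covered by the query
        "word_overlap": overlap / max(len(entity_name), 1),
        # entity space: whether an entity linker found this entity
        # in the query (treats the name as one unit, not independent words)
        "entity_match": 1.0 if entity_id in query_entities else 0.0,
    }

feats = dual_features(["barack", "obama", "wife"],
                      ["barack", "obama"], "Q76", {"Q76"})
```

The entity-level feature is what distinguishes this from a purely word-based matcher: "barack obama" is matched as a unit rather than as two independent words.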
  • Information Retrieval and Question Answering
    WANG Jiaqian, GONG Zihan, XUE Yun, PANG Shiguan, GU Donghong
    2020, 34(5): 100-110.
    Aspect-based sentiment analysis aims to determine the sentiment polarity expressed by the context towards a given target. At present, most methods, such as recurrent neural networks or attention mechanisms, cannot fully capture semantic information over long distances and ignore the importance of position information. Given that the semantic and positional information of sentences and multi-level information fusion are crucial to this task, we propose a model based on hybrid multi-head attention and capsule networks. First, multi-head self-attention is used to encode long sentences based on position word vectors, and target words are encoded with a Bi-GRU. Then, a capsule network is used to encode positions based on the interactive concatenation of semantic information. Finally, on the basis of the original semantic information, the results are obtained by fusing the context with the target entity using multi-head interactive attention. Experiments on SemEval 2014 Task 4 and ACL 14 Twitter show that the model significantly outperforms classical deep learning and popular attention-based methods.
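The sentence encoder's building block, multi-head self-attention, can be sketched in a few lines of numpy. This is a structural illustration only: it uses identity projections instead of learned Q/K/V matrices, so it shows the head-split / attend / concatenate pattern rather than the paper's trained model.

```python
import numpy as np

def multi_head_self_attention(x: np.ndarray, heads: int = 2) -> np.ndarray:
    """Minimal multi-head self-attention over a sentence matrix x of
    shape (seq_len, dim). Identity projections for brevity: each head
    attends within its own slice of the embedding dimension."""
    seq_len, dim = x.shape
    assert dim % heads == 0, "dim must be divisible by the head count"
    outputs = []
    for h in np.split(x, heads, axis=1):            # (seq_len, dim/heads)
        scores = h @ h.T / np.sqrt(h.shape[1])      # scaled dot product
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
        outputs.append(w @ h)                       # attend within head
    return np.concatenate(outputs, axis=1)          # (seq_len, dim)

y = multi_head_self_attention(np.random.default_rng(1).normal(size=(5, 4)))
```

In the paper's model, position-aware word vectors are fed into this kind of encoder before the capsule and interactive-attention layers.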