2020 Volume 34 Issue 5 Published: 15 June 2020
  

  • Language Analysis and Calculation
    DU Jiaju, QI Fanchao, SUN Maosong, LIU Zhiyuan
    2020, 34(5): 1-9.
    Sememes, defined in linguistics as the minimum semantic units of human languages, have been proven useful in many NLP tasks. Since the manual construction and update of sememe knowledge bases (KBs) are costly, automatic sememe prediction has been employed to assist sememe annotation. In this paper, we explore the use of dictionary definitions to predict sememes for unannotated words. We find that the sememes of a word are usually semantically related to different words in its dictionary definition, and we name this matching relationship local semantic correspondence. Accordingly, we propose a Sememe Correspondence Pooling (SCorP) model able to capture this kind of matching to predict sememes. Evaluated on HowNet, our model achieves state-of-the-art performance and properly learns the local semantic correspondence between sememes and words in dictionary definitions.
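The local-semantic-correspondence idea can be illustrated with a minimal numpy sketch (not the authors' implementation): score each candidate sememe against every word in the definition, then max-pool over definition words, so a sememe only needs to correspond to one local part of the definition to score highly. Shapes and the dot-product scoring are assumptions for illustration.

```python
import numpy as np

def sememe_scores(def_words: np.ndarray, sememes: np.ndarray) -> np.ndarray:
    """Score candidate sememes against a word's dictionary definition.

    def_words: (n_words, dim) embeddings of the definition's words.
    sememes:   (n_sememes, dim) embeddings of all candidate sememes.
    Each sememe is matched against every definition word (dot product),
    then max-pooled over words: a sememe scores highly as soon as it
    corresponds to some local part of the definition.
    """
    sim = sememes @ def_words.T   # (n_sememes, n_words) matching matrix
    return sim.max(axis=1)        # pool over definition words

# toy example: a 3-word definition, 4 candidate sememes, dim 2
rng = np.random.default_rng(0)
scores = sememe_scores(rng.normal(size=(3, 2)), rng.normal(size=(4, 2)))
```

In the paper's setting, the top-scoring sememes would then be predicted for the unannotated word.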
  • Language Analysis and Calculation
    HE Yuhong, HUANG Peijie, DU Zefeng, LIU Wei, ZHU Jiankai, ZHANG Jinchuan
    2020, 34(5): 10-18.
    Recently, recurrent neural networks with an attention mechanism have achieved strong results on text classification. However, when labeled training data are scarce, as in the Chinese utterance domain classification (DC) task, the data sparseness of domain entity mentions remains a significant challenge. To address this issue, this paper proposes knowledge-based neural DC (K-NDC) models that incorporate domain knowledge from external sources to enrich utterance representations. First, domain entities and types are obtained by distant supervision from CN-Probase. Then, domain-specific named entity recognition (NER) and a complementary knowledge base are exploited to further extend the knowledge coverage. Finally, we design a novel mechanism for merging knowledge with utterance representations at a fine-grained (Chinese word) level. Experiments on the SMP-ECDT benchmark corpus show that, compared with state-of-the-art text classification models, the proposed method achieves better performance, especially in knowledge-intensive domains.
  • Language Resources Construction
    ZAN Hongying, LIU Tao, NIU Changyong, ZHAO Yueshu, ZHANG Kunli, SUI Zhifang
    2020, 34(5): 19-26.
    In existing medical corpora, the classification systems for entities and entity relations can hardly meet the development requirements of precision medicine. This paper focuses on pediatric diseases and, under the guidance of medical experts, constructs an annotation system and detailed annotation schemes for named entities and entity relations. By fusing relevant medical standards, annotation tools are applied for machine pre-annotation, manual annotation, and manual proofreading of entities and entity relations in pediatric medical texts of more than 2.98 million words, yielding a corpus of medical entities and entity relations for 504 common pediatric diseases. In this corpus, 23,603 named entities and 36,513 entity relations were annotated, with multi-round annotation consistency of 0.85 and 0.82, respectively. Based on the annotated corpus, this paper also constructs a pediatric medical knowledge graph and develops a pediatric medical knowledge QA system.
  • Language Resources Construction
    WU Xian, HU Junfeng
    2020, 34(5): 27-35.
    Corpus linguistics discovers linguistic phenomena by means of large-scale corpora, and many online corpora now exist to assist linguists. This paper presents a corpus managed by time slices, and further proposes a community-maintained online lexicography system that dynamically combines corpus query results into edited entries. It also introduces a word-sense discovery and hierarchical clustering algorithm for polysemous words that automatically generates a default entry frame. The article reviews the overall lexicographic system and highlights its design and use.
  • Machine Translation
    CAO Qian, XIONG Deyi
    2020, 34(5): 36-43.
    Neural machine translation is currently the most popular approach in machine translation, while translation memory is a tool that helps professional translators avoid repeated translations. This paper proposes two methods to integrate translation memory into neural machine translation via data augmentation: (1) directly stitching the translation memory after the source sentence; (2) stitching the translation memory via tag embedding. Experiments on Chinese-English and English-German datasets show that the proposed methods achieve significant improvements.
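Method (1) amounts to a simple preprocessing step. The sketch below shows the idea under stated assumptions: the `<tm>` separator tag and the function name are illustrative, not taken from the paper.

```python
def augment_with_tm(source: str, tm_target: str, sep: str = "<tm>") -> str:
    """Data augmentation by stitching: append the target side of the
    retrieved translation-memory match after the source sentence,
    separated by a special tag, so a standard NMT encoder can attend
    to both the source and the TM fragment."""
    return f"{source} {sep} {tm_target}"

# the augmented string replaces the plain source in NMT training/decoding
augmented = augment_with_tm("他 喜欢 猫", "He likes dogs")
```

The decoder is then trained end to end and learns to copy or adapt useful fragments of the TM match.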
  • Ethnic Language Processing and Cross Language Processing
    CAI Zhijie, CAI Rangzhuoma, SUN Maosong
    2020, 34(5): 44-49.
    Word embedding represents words as optimized vectors so that computers can process natural language. The study of Tibetan word embedding is of great significance for analyzing Tibetan features and applying deep learning techniques to Tibetan processing. This paper proposes a Tibetan word embedding method that jointly trains components, characters, and words as multiple primitives, named the multi-primitive joint training model (TCCWE). Verified on word similarity/relatedness tasks, the proposed method improves performance by 3.35% on TWordSim215 and 4.36% on TWordRel215.
  • Ethnic Language Processing and Cross Language Processing
    HUA Danzhaxi, CAI Zhijie, BAN Mabao
    2020, 34(5): 50-55.
    The spelling check task aims to detect errors in text quickly and to improve the efficiency of text proofreading. Based on extensive research and analysis of Tibetan spell checking and language modeling, we utilize an LSTM architecture, well known for capturing long-distance dependencies, to build the TC_LSTM language model, and design a Tibetan spell-checking algorithm based on this language model. Experiments show that our approach surpasses the baseline model significantly, indicating the effectiveness of the proposed model.
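The detection step of an LM-based spell checker can be sketched independently of the model: flag any token whose probability under the language model, given its left context, falls below a threshold. In the sketch below, `logprob` is a stand-in for the TC_LSTM model; the threshold value and the toy scoring function are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

def flag_errors(tokens: List[str],
                logprob: Callable[[List[str], int], float],
                threshold: float = -6.0) -> List[Tuple[int, str]]:
    """Generic LM-based spell check: flag token i when the language
    model's log-probability of tokens[i] given its left context is
    below `threshold`. `logprob` stands in for a trained LM such as
    the paper's TC_LSTM."""
    return [(i, t) for i, t in enumerate(tokens)
            if logprob(tokens, i) < threshold]

# toy stand-in LM: tokens in a small "known" set are likely,
# everything else is assigned a very low log-probability
def toy_logprob(tokens: List[str], i: int) -> float:
    return -1.0 if tokens[i] in {"བོད", "ཡིག"} else -9.0

flagged = flag_errors(["བོད", "xxx", "ཡིག"], toy_logprob)  # [(1, 'xxx')]
```

A real checker would also propose corrections, e.g. by rescoring candidate replacements with the same LM.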
  • Information Extraction and Text Mining
    HAN Pengyu, GAO Shengxiang, YU Zhengtao, HUANG Yuxin, GUO Junjun
    2020, 34(5): 56-63,73.
    The summarization of public opinion news on a judicial case aims to condense the important information of public comments on the case into a short summary. Compared with open-domain text summarization, this kind of summary usually involves specific case elements that provide substantial guidance during summary generation. Therefore, a case-related news text summarization method is proposed within a deep learning framework. First, a dataset of public opinion news summaries is collected and the case elements are defined. Then, through the attention mechanism, the case element information is integrated into the two-layer encoding of words and sentences in the news text to generate a text representation that contains the case element information. Finally, a multi-feature classification layer is used to classify the sentences. Experiments on the public opinion news summary dataset show that the proposed method outperforms the baseline model.
  • Information Extraction and Text Mining
    LUO Zhunchen, ZHAO He, YE Yuming, LIU Xiaopeng
    2020, 34(5): 64-73.
    The hyperlinked resources in scientific literature include data, code, documents, and web pages, and the citation context of these resources reflects the relationship between research subjects and scientific resources in research activities. Based on a fine-grained analysis of citation information in the literature, this paper proposes a novel knowledge modeling method to characterize resource categories and resource citation purposes, and conducts an empirical evaluation on large-scale scientific literature datasets. Through a detailed analysis of the utilization of scientific resources at home and abroad, we explore the importance, development directions, and usage risks of such resources. This framework can be used to track the progress of advanced technologies at home and abroad, and can further provide scientific evidence for assessing critical resources in China's scientific research activities.
  • Machine Reading Comprehension and Text Generation
    TAN Hongye, SUN Xiuqin, YAN Zhen
    2020, 34(5): 74-81.
    Text-based question generation produces related questions from a given sentence or paragraph. In recent years, sequence-to-sequence neural models have been used to generate questions for sentences containing answers. However, these methods have two limitations: (1) the generated interrogatives do not match the answer type; and (2) the relevance between the question and the answer is weak. This paper proposes a question generation model based on the answer and its contextual information. The model first determines interrogatives that match the answer type according to the relationship between the answer and the context. Then, it uses the answer and the context to determine question-related words, so that questions reuse words from the original text as much as possible. Finally, it combines answer features, interrogatives, question-related words, and the original sentence as inputs to generate a question. Experiments show that the proposed model significantly outperforms the baseline systems.
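The first step, choosing an interrogative that matches the answer type, can be approximated by a type-to-question-word lookup. The type labels and table below are hypothetical placeholders, not the paper's actual inventory:

```python
# hypothetical answer-type -> interrogative table; a real system would
# derive the answer type from NER or from answer-context features
INTERROGATIVES = {
    "PERSON": "who",
    "LOCATION": "where",
    "TIME": "when",
    "NUMBER": "how many",
}

def pick_interrogative(answer_type: str) -> str:
    """Return a question word matching the answer type, with a
    generic fallback for unknown types."""
    return INTERROGATIVES.get(answer_type, "what")

qword = pick_interrogative("PERSON")  # 'who'
```

The selected interrogative is then fed to the generator alongside answer features and question-related words.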
  • Machine Reading Comprehension and Text Generation
    ZHOU Qi’an, LI Zhoujun
    2020, 34(5): 82-90.
    Natural language understanding in task-oriented dialog systems parses sentences entered by the user in natural language and extracts structured information for subsequent processing. This paper proposes an improved natural language understanding model that uses BERT as the encoder and builds the decoder with an LSTM and an attention mechanism. Furthermore, this paper proposes two tuning techniques for this model: a training method with fixed model parameters, and the use of a case-sensitive pretrained model. Experiments show that the improved model and tuning techniques achieve sentence-level accuracies of 0.8833 on ATIS and 0.9251 on Snips.
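The core of such a decoder is a single attention step: the LSTM state attends over the encoder outputs to form a context vector. The numpy sketch below shows plain dot-product attention as a simplified stand-in for the paper's decoder; the exact scoring function used in the paper is not specified here.

```python
import numpy as np

def attention_step(dec_state: np.ndarray, enc_states: np.ndarray):
    """One dot-product attention step of an LSTM decoder over
    encoder (e.g. BERT) outputs.

    dec_state:  (dim,)          current decoder hidden state
    enc_states: (seq_len, dim)  encoder output for each input token
    Returns the context vector and the attention weights.
    """
    scores = enc_states @ dec_state                  # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax
    context = weights @ enc_states                   # (dim,)
    return context, weights

ctx, w = attention_step(np.ones(4), np.eye(4))
```

The context vector is concatenated with the decoder state to predict, for example, the slot label of the current token.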
  • Information Retrieval and Question Answering
    ZHAO Yixin, NIU Shuzi, JI Chunyan, LU Fei, XU Rui
    2020, 34(5): 91-99.
    Entity retrieval from knowledge graphs is of substantial significance, given the emergence of large-scale knowledge graphs and the industrial demand for effectively managing domain knowledge graphs. Given a knowledge graph and a user query, entity retrieval aims to obtain a ranked list of entities from the knowledge graph according to their relevance to the query. Treating the task as matching between the query and entities, traditional entity retrieval models map both user queries and entities into the word feature space; however, this fails when the words in an entity name are assumed to be independent. In this paper, we propose to project both user queries and entities into a dual feature space, namely the entity-word feature space. First, we represent entities as multiple fields and extract ranking features from them. Then, learning-to-rank models are employed to train a ranking model in this dual feature space. Experimental results on benchmark datasets show that our proposed method outperforms state-of-the-art baselines significantly.
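A dual (entity-word) feature space pairs word-level signals with entity-level ones before learning to rank. The sketch below is illustrative only: the feature names, the overlap formula, and the entity-linking input are assumptions, not the paper's feature set.

```python
from typing import Dict, List, Set

def dual_features(query_words: List[str], entity_name: List[str],
                  entity_id: str, query_entities: Set[str]) -> Dict[str, float]:
    """Extract one word-level and one entity-level ranking feature
    for a (query, entity) pair; a learning-to-rank model would be
    trained on vectors like this."""
    overlap = len(set(query_words) & set(entity_name))
    return {
        # word space: fraction of the entity name covered by the query
        "word_overlap": overlap / max(len(entity_name), 1),
        # entity space: whether an entity linker found this entity
        # in the query (treats the name as one unit, not independent words)
        "entity_match": 1.0 if entity_id in query_entities else 0.0,
    }

feats = dual_features(["barack", "obama", "wife"],
                      ["barack", "obama"], "Q76", {"Q76"})
```

The entity-level feature is what distinguishes this from a purely word-based matcher: "barack obama" is matched as a unit rather than as two independent words.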
  • Information Retrieval and Question Answering
    WANG Jiaqian, GONG Zihan, XUE Yun, PANG Shiguan, GU Donghong
    2020, 34(5): 100-110.
    Aspect-based sentiment analysis aims to determine the sentiment polarity expressed by the context towards a given target. At present, most methods, such as recurrent neural networks or attention mechanisms, cannot fully capture semantic information over long distances and ignore the importance of position information. Given that the semantic and positional information of sentences and multi-level information fusion are crucial to this task, we propose a model based on hybrid multi-head attention and capsule networks. First, multi-head self-attention is used to encode long sentences based on position word vectors, and target words are encoded with a Bi-GRU. Then, a capsule network is used to encode positions based on the interactive concatenation of semantic information. Finally, on the basis of the original semantic information, the results are obtained by fusing the context with the target entity using multi-head interactive attention. Experiments on SemEval 2014 Task 4 and ACL 14 Twitter show that the model significantly outperforms classical deep learning and popular attention-based methods.
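The sentence encoder's building block, multi-head self-attention, can be sketched in a few lines of numpy. This is a structural illustration only: it uses identity projections instead of learned Q/K/V matrices, so it shows the head-split / attend / concatenate pattern rather than the paper's trained model.

```python
import numpy as np

def multi_head_self_attention(x: np.ndarray, heads: int = 2) -> np.ndarray:
    """Minimal multi-head self-attention over a sentence matrix x of
    shape (seq_len, dim). Identity projections for brevity: each head
    attends within its own slice of the embedding dimension."""
    seq_len, dim = x.shape
    assert dim % heads == 0, "dim must be divisible by the head count"
    outputs = []
    for h in np.split(x, heads, axis=1):            # (seq_len, dim/heads)
        scores = h @ h.T / np.sqrt(h.shape[1])      # scaled dot product
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
        outputs.append(w @ h)                       # attend within head
    return np.concatenate(outputs, axis=1)          # (seq_len, dim)

y = multi_head_self_attention(np.random.default_rng(1).normal(size=(5, 4)))
```

In the paper's model, position-aware word vectors are fed into this kind of encoder before the capsule and interactive-attention layers.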