Journal of Chinese Information Processing

Survey

A Survey on Stance Detection

LIU Wei, PENG Xin, LI Chao, WANG Pin, WANG Lihong

2020, 34(12): 1-8.

Stance detection aims to identify the attitude (i.e., in favor of, against, or none) expressed towards a given target, such as an event, a product, a policy, a person, or a service. Mining users' stances on social media is important for public opinion monitoring and information recommendation. This paper surveys stance detection: it introduces the concept of stance detection, summarizes various learning-based methods, and describes the available datasets. Finally, it discusses future directions for stance detection.

Survey

A Survey to Knowledge Graph and Its Military Application

LIN Wangqun, WANG Miao, WANG Wei, WANG Chongnan, JIN Songchang

2020, 34(12): 9-16.

A knowledge graph describes concepts, entities, and their relationships in the form of a semantic network. In this paper, we formally describe the basic concepts and the hierarchical architecture of knowledge graphs. We then review the state-of-the-art technologies for information extraction, knowledge fusion, schema, and knowledge management. Finally, we probe into the application of knowledge graphs in the military field, revealing the challenges and trends of future development.

Language Analysis and Calculation

Quantitative Analysis of Chinese Vocabulary Comprehensive Complexity Based on AHP

ZHANG Yinbing, SONG Jihua, PENG Weiming, GUO Dongdong, SONG Tianbao

2020, 34(12): 17-29.

In international Chinese language teaching, quantitative analysis of the comprehensive complexity of Chinese vocabulary benefits tasks such as determining the vocabulary acquisition order for learners of Chinese as a second language and selecting vocabulary when compiling textbooks. Based on an analysis of the Chinese character attributes, general attributes, and statistical attributes of words, this study constructs a quantitative model of Chinese word difficulty using the analytic hierarchy process (AHP). The validity of the model is verified by comparing its consistency with the vocabulary grading of the existing syllabus. The model provides a possible solution for vocabulary grading, text difficulty analysis, text simplification, and similar tasks.
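
As a rough illustration of how AHP turns pairwise judgments into attribute weights, the following sketch uses the row geometric-mean approximation and Saaty's consistency ratio. The comparison matrix, the per-word scores, and the function names are hypothetical and are not taken from the paper; only the three attribute groups come from the abstract.

```python
import math

# Saaty's random consistency index, keyed by matrix size.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Return CR = CI / RI; values below about 0.1 are usually accepted."""
    n = len(matrix)
    if n <= 2:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    weighted = [sum(matrix[i][j] * weights[j] for j in range(n))
                for i in range(n)]
    lambda_max = sum(weighted[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


def word_difficulty(scores, weights):
    """Weighted sum of normalized attribute scores for one word."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical pairwise judgments over the three attribute groups:
# character attributes, general attributes, statistical attributes.
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(comparisons)
cr = consistency_ratio(comparisons, weights)

# Hypothetical normalized scores (0 = easy, 1 = hard) for a single word.
difficulty = word_difficulty([0.7, 0.35, 0.7], weights)
```

The geometric-mean method is a common stand-in for the principal eigenvector and matches it exactly when the judgments are perfectly consistent, as in this toy matrix.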

Language Analysis and Calculation

Recognizing Chinese Macro Discourse Nuclearity Based on Discourse Topic

SUN Zhenhua, ZHOU Yi, ZHU Qiaoming, JIANG Feng, LI Peifeng

2020, 34(12): 30-38.

Discourse analysis is a hot topic in the field of natural language processing. Discourse nuclearity recognition, a subtask of discourse analysis, focuses on recognizing the main and secondary content of a discourse in order to better understand and grasp its core content. This paper focuses on the task of macro Chinese discourse nuclearity recognition and proposes a recognition method based on discourse topic. The method introduces the semantic interaction between different discourse units, and between each discourse unit and its topic, to identify nuclearity. Moreover, it applies a selection mechanism over the discourse topic to further improve the performance of nuclearity recognition. Experimental results on MCDTB show that the proposed method outperforms the state-of-the-art baselines.

Ethnic Language Processing and Cross Language Processing

Bilingual Topic Word Embedding for Chinese-Korean Cross-lingual Text Classification

WANG Qi, TIAN Mingjie, CUI Rongyi, ZHAO Yahui

2020, 34(12): 39-47.

A bilingual topical word embedding model is proposed for the Chinese-Korean cross-lingual text classification task. The model combines a topic model with bilingual word embeddings to reduce the impact of polysemy-induced ambiguity on the accuracy of cross-lingual text classification. First, word embeddings for bilingual words are trained on large-scale word-aligned parallel sentence pairs. Second, the dataset for the classification task is processed and represented by the topic model, and the topic words in both languages are obtained. Finally, the word embeddings of these topic words are fed into a traditional text classifier and a deep learning text classifier. The experimental results show that the accuracy reaches 91.76% on the Chinese-Korean cross-lingual text classification task.

Ethnic Language Processing and Cross Language Processing

Tibetan Character Error Detection Based on Neural Network

SECHA Jia, CIZHEN Jiacuo, CAIRANG Jia, HUAGUO Cairang

2020, 34(12): 48-53,64.

This paper treats Tibetan character error detection as a classification problem. First, a syllable confusion subset is built from linguistic knowledge, and noise is added to each Tibetan sentence. Then a deep bidirectional representation based on BERT is applied in the classification model. Two baseline models and test sets from different domains are then constructed. The experimental results show that this method is superior to the two baseline models. The sentence classification accuracy of the method reaches 93.74%, and 83.6% on tests from different fields. At the syllable level, the result for true negatives is 74.53% and for false negatives is 2.30%.

Information Extraction and Text Mining

Radical-Aware Named Entity Recognition for Chinese Medical Records

LI Dan, XU Tong, ZHENG Yi, WANG Zhefeng, CHEN Enhong

2020, 34(12): 54-64.

General-purpose named entity recognition fails to capture the features of Chinese characters and of Chinese medical records. In this paper, we integrate BERT into a joint model of bidirectional long short-term memory and conditional random fields (CRF) for better performance. Considering the distinctive role of radicals in medical entities, we encode radical information into the word vector and then modify the scoring function of the CRF layer. Experiments on two real-world electronic medical record datasets show that the proposed method outperforms the state-of-the-art baseline methods, especially for disease-related named entities.

Information Extraction and Text Mining

Fine-Grained Entity Typing with Prototypical Networks

REN Quan

2020, 34(12): 65-72.

As an extension of the named entity recognition task, fine-grained entity typing aims to assign more fine-grained types to entities according to their mentions and contexts. Because annotating fine-grained types is costly and error-prone, we study fine-grained entity typing with only a small number of samples. This paper first proposes a feature extraction method that extracts entity information at the word level and the character level, respectively. Then, combined with a prototypical network, the method transforms the multi-class classification task into a single-class classification task and performs fine-grained entity classification by calculating the distances to prototypes in a metric space. Tested on the public dataset FIGER (GOLD) under few-shot and zero-shot learning settings, the proposed method achieves good results. Under the few-shot setting, the proposed method outperforms the baseline on all metrics; in particular, the macro-F₁ is increased by 2.4%.
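
The distance-to-prototype idea can be sketched as follows. This is a generic nearest-prototype classifier under assumed inputs, not the paper's model: the two-dimensional embeddings, the type labels, and the function names are hypothetical, and the word-level and character-level feature extractor and the single-class reformulation described in the abstract are not reproduced.

```python
from collections import defaultdict


def build_prototypes(support):
    """Average the support embeddings of each type to get one prototype per type.

    `support` is a list of (embedding, label) pairs with equal-length embeddings.
    """
    grouped = defaultdict(list)
    for vector, label in support:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_type(query, prototypes):
    """Return the type whose prototype is closest to the query embedding."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical few-shot support set: two labelled mentions per type.
support_set = [
    ([1.0, 0.0], "/person/athlete"),
    ([0.8, 0.2], "/person/athlete"),
    ([0.0, 1.0], "/organization/company"),
    ([0.2, 0.8], "/organization/company"),
]
prototypes = build_prototypes(support_set)
predicted = nearest_type([0.7, 0.3], prototypes)
```

In a trained system the embeddings would come from a learned encoder, so that distances in the metric space reflect type similarity.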

Question Answering and Dialogue System

Question Answering for Overview Questions Based on CFN and Discourse Topic

YANG Zhizhuo, LI Chunzhuan, ZHANG Hu, QIAN Yili, LI Ru

2020, 34(12): 73-81.

Reading comprehension question answering for the Chinese-language College Entrance Examination is highly challenging because the questions are more abstract. In addition to question similarity analysis, the extraction of candidate answer sentences should pay more attention to topic and opinion sentences. This paper proposes to extract candidate answer sentences through frame semantic matching and frame semantic relations. By identifying the discourse topic sentences, the topic and opinion sentences related to the questions are generated. The top six candidate answers are then selected based on the ranking results. In the experiment, the recall of the method on the Beijing College Entrance Examination papers of the most recent twelve years is 68.69%, which verifies the effectiveness of the method.

NLP Application

Neural Network-Based Poetry Retrieval

LIANG Jiannan, SUN Maosong, YI Xiaoyuan

2020, 34(12): 82-91.

Chinese classical poetry, with its long history, is one of the representatives of Chinese classical literature and a treasure of traditional Chinese culture. Poetry retrieval compares the content of poems to find those that are similar in semantics and artistic conception, which requires an in-depth understanding of the content and mood of the whole poem. This paper applies a recurrent neural network (RNN) to automatically learn semantic representations of ancient poems. A variety of methods are designed to automatically calculate the correlation between two poems and, from it, the semantic distance between them, thereby enabling poetry recommendation. The results of automatic and manual evaluation show that the model produces good-quality poetry retrieval results.

NLP Application

Distributed Representation of Fictional Characters and Its Applications

JIA Yuxiang, WANG Lu, LIU Pengcheng, WANG Qian, ZHANG Yue, ZAN Hongying

2020, 34(12): 92-99.

The novel is a literary genre that centers on character creation, depicting social life through complete plots and specific environmental descriptions. Modeling fictional characters is essential for literary text understanding and literary text mining. In this paper, we construct a large-scale novel corpus and extract characters and their dependency features. We propose a skip-gram based model to train character embeddings, with the character as the target and the dependency features as the contexts. Based on the trained character embeddings, we further investigate the tasks of character similarity computation, character clustering, and character profiling. The experimental results show that the distributed representation of fictional characters performs well in these tasks.

NLP Application

Construction of Clinic Indicator Terminology Base and Its Application in Medical Record Mining

ZHANG Zhixing, ZHANG Jiaying, GAO Daqi, RUAN Tong, WANG Jun, HE Ping, YAO Huayan

2020, 34(12): 100-110.

On the Shanghai Regional Health Platform, which holds electronic medical record data from 38 tertiary hospitals, the diversity and ambiguity of clinic indicators have seriously affected medical data mining. In this paper, we propose a semi-automatic terminology base construction solution based on four steps: schema design, information extraction, knowledge fusion, and knowledge verification. We first build a standard indicator sub-base according to the medical insurance standard provided by the Shanghai Municipal Health Commission. We then use a BERT-based clinical indicator alignment model to integrate indicators from the 38 hospitals into the standard as synonyms. The constructed terminology base contains 23,495 entities and 47,746 factual triples, with potential applications in medical data cleaning, medical record retrieval, and other tasks. Experiments show that the F₁-score of our alignment model reaches 95.78%, and that its application in a colorectal cancer data mining task can improve the record up to 94%. In addition, the part of this terminology base related to colorectal cancer has been published at dcazb.ecustnlplab.com.

2020 Volume 34 Issue 12 Published: 20 January 2021