2020 Volume 34 Issue 4 Published: 01 June 2020
  

  • Language Analysis and Calculation
    CHENG Ning, LI Bin, GE Sijia, HAO Xingyue, FENG Minxuan
    2020, 34(4): 1-9.
The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging and named entity recognition. To avoid error accumulation in pipeline processing, this paper proposes a joint approach to sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is evaluated on four cross-era test sets to verify its generalization ability and the effect of different label granularities on sentence segmentation and lexical analysis. Experiments show that the joint model improves the F1-scores of sentence segmentation, word segmentation and part-of-speech tagging: 78.95% for sentence segmentation (an average increase of 3.5%), 85.73% for word segmentation (an average increase of 0.18%), and 72.65% for part-of-speech tagging (an average increase of 0.35%).
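A joint approach of this kind typically tags each character with a composite label that encodes word boundary, part of speech and sentence boundary at once, so a single sequence labeler handles all three tasks. A minimal Python sketch of such a joint tag scheme (the label format and the example are illustrative, not the paper's exact scheme):

```python
# Hypothetical joint tag scheme: each character gets a composite label
# "<seg>-<pos>" with an extra "-EOS" flag on sentence-final characters,
# so one tagger jointly predicts segmentation, POS and sentence boundaries.

def joint_labels(sentence):
    """sentence: list of (word, pos) pairs forming one sentence.
    Returns one composite label per character."""
    labels = []
    for w_idx, (word, pos) in enumerate(sentence):
        for c_idx, _ in enumerate(word):
            if len(word) == 1:
                seg = "S"          # single-character word
            elif c_idx == 0:
                seg = "B"          # word-initial character
            elif c_idx == len(word) - 1:
                seg = "E"          # word-final character
            else:
                seg = "I"          # word-internal character
            last = (w_idx == len(sentence) - 1
                    and c_idx == len(word) - 1)
            labels.append(f"{seg}-{pos}" + ("-EOS" if last else ""))
    return labels

tags = joint_labels([("天下", "n"), ("大", "a"), ("同", "v")])
# → ['B-n', 'E-n', 'S-a', 'S-v-EOS']
```

Decoding the "-EOS" flag back out of the predicted label sequence recovers sentence boundaries without a separate segmentation pass.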
  • Language Analysis and Calculation
    LIU Yahui, YANG Haoping, LI Zhenghua, ZHANG Min
    2020, 34(4): 10-20.
As the main formalism of shallow semantic parsing, semantic role labeling is one of the hot research topics in natural language processing (NLP). There are three main problems in existing annotation guidelines (i.e., the PropBank annotation guideline and the Peking University guideline). First, the span-based argument representation complicates the annotation process. Second, it is difficult to define the frames of the predicates in the PropBank annotation guideline. Third, the Peking University guideline does not annotate omitted arguments. Through a thorough investigation of existing Chinese and English annotation guidelines, we develop a lightweight annotation guideline for Chinese semantic role labeling suitable for ordinary annotators, combining the advantages of existing guidelines and considering the real problems encountered during our annotation process. First, we choose the word-based argument representation to avoid determining span boundaries and thus reduce annotation difficulty. Second, annotators can directly annotate the arguments of a predicate word according to the sentential context, without pre-defining all semantic frames of the predicate. Third, we explicitly annotate omitted core arguments to describe the semantic information of sentences more precisely. Additionally, to ensure annotation consistency and improve annotation quality, the proposed guideline gives clear priority and difficulty analyses for various complex linguistic phenomena.
  • Language Analysis and Calculation
    DAI Yuling, DAI Rubing, FENG Minxuan, LI Bin, QU Weiguang
    2020, 34(4): 21-29.
Function words carry rich grammatical meanings and are crucial to sentence comprehension. Existing linguistic research on function words cannot be directly adopted in computational linguistics due to the lack of formal representation. In this paper, to represent their syntactic and semantic information, we align words and conceptual relations in the abstract meaning representation (AMR), which is based on concept graphs, so that function words correspond either to nodes or to arcs between conceptual nodes. Then, 8,587 sentences from PEP primary school Chinese textbooks are selected for AMR annotation. Among the 24,801 function word tokens in this corpus, 58.80% are prepositions, conjunctions and structural auxiliaries, which correspond to relations between concepts, and 41.20% are modal and aspectual words, which express concepts. This shows that AMR represents function words dynamically, providing better theory and resources for the syntactic and semantic analysis of whole sentences.
  • Language Resources Construction
    ZAN Hongying, HAN Yangchao, FAN Yaxin, NIU Chengzhi, ZHANG Kunli, SUI Zhifang
    2020, 34(4): 30-37.
Building a large-scale knowledge base is an essential task in the fields of artificial intelligence and natural language understanding. As an important basis for describing the subjective feelings of patients and diagnosing diseases, symptoms are important factors in optimizing tasks such as intelligent consultation and medical question answering. Building on existing research, this paper constructs an open Chinese symptom knowledge base according to the concept and characteristics of symptoms and their roles in medical diagnosis. The knowledge base describes relevant attributes such as the ontology taxonomy of symptoms, related diseases, related body parts and susceptible populations, covering a total of 146,631 attribute relationships over 8,772 symptoms. The constructed symptom knowledge base is an important part of the Chinese medical knowledge graph, providing a data foundation for applications such as KBQA, knowledge reasoning and decision support.
  • Language Resources Construction
    WU Ting, LI Mingyang, KONG Fang
    2020, 34(4): 38-46.
With the rapid development of the information age, data resources in the network show a spurt of growth. Building a usable knowledge base of a certain scale, by mining deep structured information from massive unordered data, is of great significance to natural language processing tasks. The hypernymy relation is a basic building block of a knowledge base, but most existing corpora are limited to the general domain and neglect hypernymy across sentences or discourses. This paper proposes a discourse-level hypernymy labeling strategy based on synonymous reasoning, and constructs a discourse-level corpus from news and technical literature in the field of defense science and technology. In total, we annotate 11,020 semantic relationships in 962 texts, and the consistency of the entity relationship labeling reaches 0.82. Our work lays a corpus foundation for research on hypernymy detection in the field of national defense science and technology.
  • Machine Translation
    MING Yuqin, XIA Tian, PENG Yanbing
    2020, 34(4): 47-54.
A subtle perturbation in the input can degrade the performance of Neural Machine Translation (NMT). This work proposes a neural machine translation method incorporating adversarial learning. Given a source sentence, we construct a new sequence by adding subtle noise to it, so that the two sequences have similar semantics. Both sequences are then fed to the encoder to generate their respective vector representations, which are passed to the generator and discriminator for further processing. Finally, we compare the translation performance before and after adding the noise. The results show that the model both improves translation performance and is robust to noisy input.
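As an illustration of the noise step, one simple way to build a semantics-preserving noisy copy of the source is to swap a few adjacent tokens. This is a hypothetical sketch, not the paper's exact perturbation procedure:

```python
import random

# Illustrative noise function: randomly swap adjacent tokens with a small
# probability, producing a noisy sequence with the same tokens (and hence
# close semantics) as the original source sentence.

def perturb(tokens, swap_prob=0.1, seed=0):
    rng = random.Random(seed)       # seeded for reproducibility
    noisy = list(tokens)
    i = 0
    while i < len(noisy) - 1:
        if rng.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2                  # do not move the same token twice
        else:
            i += 1
    return noisy

clean = ["the", "cat", "sat", "on", "the", "mat"]
noisy = perturb(clean, swap_prob=0.3, seed=42)
```

Both `clean` and `noisy` would then be encoded, and the discriminator trained to tell their representations apart while the encoder learns to make them indistinguishable.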
  • Ethnic Language Processing and Cross Language Processing
    LI Jin, GAO Jing, CHEN Junjie, WANG Yongjun
    2020, 34(4): 55-59,68.
Each morpheme of Mongolian has a different written form at different positions within a word, which makes the set of Mongolian script glyphs diverse and enormous. As a result, designing Mongolian fonts with computer-assisted or manual methods takes substantial manpower and material resources. This paper applies a conditional generative adversarial network model to Mongolian font style transfer. The model measures training progress with a generative loss and a discriminative loss, and the Adam optimizer automatically adjusts the learning rate, gradually reducing the difference until the generator and discriminator reach a Nash equilibrium. Experiments on a Mongolian font dataset show that new Mongolian fonts can be generated directly from the Mongolian title font, and the generated fonts are close to the real font styles.
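For paired glyph-to-glyph translation, the generator objective commonly combines an adversarial term with an L1 pixel term that keeps the generated glyph close to the target glyph. A small sketch under that assumption (the loss form and `lambda_l1` value are illustrative, not taken from the paper):

```python
import math

# Sketch of a pix2pix-style generator loss for font transfer:
# adversarial term (fool the discriminator) + weighted L1 pixel term
# (stay close to the target glyph). Images are flat lists of pixels in [0,1].

def generator_loss(d_fake, fake_img, real_img, lambda_l1=100.0):
    """d_fake: discriminator scores in (0,1] for generated glyphs."""
    adv = -sum(math.log(p + 1e-12) for p in d_fake) / len(d_fake)
    diffs = [abs(f - r) for f, r in zip(fake_img, real_img)]
    l1 = sum(diffs) / len(diffs)
    return adv + lambda_l1 * l1
```

When the generator fools the discriminator perfectly and reproduces the target glyph exactly, both terms vanish; the L1 term dominates early training and anchors the glyph shape.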
  • Information Extraction and Text Mining
    LIU Suwen, SHAO Yifan, QIAN Longhua
    2020, 34(4): 60-68.
Biomedical causality extraction is an evaluation task proposed by the BioCreative community to explore the rich semantic relationships between biomedical entities. Unlike traditional entity relation extraction, which focuses only on binary relationships, this task includes the identification of functions acting on one or more entities. Based on the idea of multi-task learning, a joint learning model is proposed in which binary relation extraction and unary function detection share decision-making. On top of shared word embeddings, LSTMs with a gating mechanism are employed to learn interactive representations between the two tasks, and the final predictions are made separately. Experimental results show that this method can exploit the information of both tasks, achieving a 45.3% F-score on the 2015 BC-V dataset.
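The gated cross-task interaction can be pictured as a per-dimension sigmoid gate that decides how much of the other task's hidden state leaks into each task's own representation. A toy sketch with plain lists and fixed gate logits (hypothetical, not the paper's exact architecture):

```python
import math

# Toy gated mixing of two task representations: a sigmoid gate per
# dimension interpolates between a task's own hidden vector and the
# other task's vector, letting the tasks exchange information.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_mix(own, other, gate_logits):
    """own/other: hidden vectors of the two tasks;
    gate_logits: per-dimension logits (learned in a real model)."""
    return [sigmoid(g) * o + (1.0 - sigmoid(g)) * t
            for o, t, g in zip(own, other, gate_logits)]
```

With large positive logits the gate keeps the task's own representation; with logits near zero it blends the two equally, which is how the model trades off task-specific and shared information.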
  • Information Extraction and Text Mining
    CHENG Yusi, SHI Yuntao
    2020, 34(4): 69-76.
Chinese person name recognition is restricted by the domain and size of existing annotated corpora and by class imbalance. Person name dictionaries and domain dictionaries are easier to obtain than manually annotated training corpora. This article incorporates dictionaries into bi-directional long short-term memory (Bi-LSTM) networks with a weighted conditional random field layer (WCRF). The model extracts the likelihood of family names and given names from person name dictionaries, while the domain dictionaries provide domain word information. The Bi-LSTM captures context information, and the weighted conditional random field improves the recall of person name recognition. Experiments on the People's Daily corpus and a construction law corpus show that, compared with the existing method based on hidden Markov models, the F1 value of person name recognition is improved by 18.34%; compared with the traditional Bi-LSTM-CRF model, the recall increases by 15.53% and the F1 value by 8.83%.
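The weighting idea behind a recall-oriented CRF layer can be illustrated by scaling per-token losses by tag class, so that mistakes on the rare person-name tags cost more than mistakes on the dominant O tag. A sketch with illustrative weights (the real WCRF applies the weighting inside CRF training, not token by token):

```python
# Class-weighted token loss: person-name tags are penalised more heavily
# than O tags, pushing the model toward higher recall on the rare class.
# The weight values are illustrative, not taken from the paper.

TAG_WEIGHTS = {"B-PER": 3.0, "I-PER": 3.0, "O": 1.0}

def weighted_loss(gold_tags, token_losses, weights=TAG_WEIGHTS):
    """gold_tags: gold label per token;
    token_losses: per-token negative log-likelihoods from the model."""
    return sum(weights.get(t, 1.0) * l
               for t, l in zip(gold_tags, token_losses))
```

Under this weighting, missing a `B-PER` token contributes three times as much loss as mislabeling an `O` token, which is the mechanism by which recall improves at some cost in precision.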
  • Machine Reading Comprehension and Text Generation
    ZHENG Jie, KONG Fang, ZHOU Guodong
    2020, 34(4): 77-84.
Ellipsis is a common linguistic phenomenon, especially in short texts such as question answering and dialogue. To better understand the semantic information of short texts, we propose a multi-attention fusion model for Chinese ellipsis recovery. The model combines context and text information through a gate mechanism, multi-attention and self-attention. Experiments on several short text corpora show that the model can efficiently detect ellipsis positions and recover elided content, facilitating better comprehension of short texts.
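One ingredient of such a fusion, self-attention over the short text, can be sketched in plain Python. This is unparameterized scaled dot-product attention; the actual model would add learned query/key/value projections:

```python
import math

# Minimal self-attention: every position attends to every position of the
# same sequence; each output vector is a softmax-weighted average of all
# input vectors, scaled by sqrt(d) as in standard dot-product attention.

def self_attention(vectors):
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        m = max(scores)                          # stabilise the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out
```

In the ellipsis-recovery setting, attention weights concentrated around a candidate position are one signal that elided material should be restored there.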
  • Machine Reading Comprehension and Text Generation
    TAN Hongye, LI Xuanying, LIU Bei
    2020, 34(4): 85-91.
Reading comprehension (RC) refers to automatically answering questions about a given text, and has become a popular topic in natural language processing. Many deep learning RC methods have been proposed, but they do not fully understand the question and the discourse, leading to poor model performance. To address this problem, this paper proposes a reading comprehension method based on external knowledge and hierarchical discourse representation. The method uses external knowledge and question types to enhance question comprehension, and utilizes a hierarchical discourse representation to improve the understanding of the discourse. Moreover, the two subtasks of question type prediction and answer prediction are jointly optimized in a unified framework. Experiments on the DuReader dataset show that the proposed method improves performance by up to 8.2%.
  • Sentiment Analysis and Social Computing
    CHENG Yan, ZHU Hai, XIANG Guoxiong, TANG Tianwei, ZHONG Linhui, WANG Guowei
    2020, 34(4): 92-100.
Text emotion classification is an active research task in natural language processing. To deal with imbalanced data, which hurts classification performance, this paper proposes an emotion classification method combining CNN and the EWC algorithm. First, the method uses random under-sampling to obtain multiple balanced datasets for training. Then it trains the CNN on each balanced dataset in sequence, introducing the EWC algorithm in the training process to overcome catastrophic forgetting in the CNN. Finally, the CNN model trained on the last dataset is treated as the final classifier. Experimental results show that the proposed method is superior to the ensemble learning framework based on under-sampling and multiple classification algorithms, and outperforms a multi-channel LSTM neural network by 1.9% and 2.1% in accuracy and G-mean, respectively.
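The EWC regularizer that counters catastrophic forgetting anchors parameters that were important on earlier balanced subsets, penalizing movement away from them in proportion to their (diagonal) Fisher information. A minimal sketch (the λ value is illustrative):

```python
# Elastic Weight Consolidation penalty: (lam / 2) * sum_i F_i * (p_i - p*_i)^2,
# where p* are the parameter values after the previous subset and F is the
# diagonal Fisher information estimating each parameter's importance.

def ewc_penalty(params, old_params, fisher, lam=0.5):
    """lam is the regularisation strength (illustrative value)."""
    return lam / 2.0 * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))
```

During training on the next balanced subset, this penalty is added to the CNN's classification loss, so parameters with high Fisher value stay near the solution found on earlier subsets.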
  • NLP Application
    CHENG Yong, XU Dekuan, DONG Jun
    2020, 34(4): 101-110.
Automatic grading of text reading difficulty aims to judge the difficulty level of a text automatically according to its features. In this paper, we propose a novel difficulty grading method based on multiple linguistic features and deep features. The method considers various linguistic features at the character, vocabulary and sentence levels, in terms of frequency, length, complexity, richness and coherence. In addition, a BERT-based pre-trained neural network model is used to extract deep features of the text's sentences. On this basis, an end-to-end neural network is constructed to fuse the linguistic features and deep features. Our method achieves good performance in automatic grading, outperforming methods based on traditional linguistic features and on popular neural networks.
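The shallow side of such a model can be illustrated with a few surface statistics at the character, vocabulary and sentence levels, concatenated with a deep sentence vector. The feature choices here are illustrative, not the paper's exact feature set, and the deep vector is a placeholder list standing in for a BERT embedding:

```python
# Toy readability features: text length (character level), type-token
# ratio (vocabulary richness) and average sentence length (sentence level),
# fused by simple concatenation with a deep feature vector.

def linguistic_features(text, sentence_sep="。"):
    sentences = [s for s in text.split(sentence_sep) if s]
    chars = [c for s in sentences for c in s]
    vocab = set(chars)
    return [
        len(chars),                                               # text length
        len(vocab) / max(len(chars), 1),                          # richness
        sum(len(s) for s in sentences) / max(len(sentences), 1),  # avg sent len
    ]

def fuse(text, deep_vector):
    """Concatenate shallow features with a deep sentence vector
    (placeholder for a BERT embedding)."""
    return linguistic_features(text) + list(deep_vector)
```

The fused vector would then feed a small classifier head that predicts the difficulty grade end to end.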