2021 Volume 35 Issue 5 Published: 20 May 2021
  

  • Language Analysis and Calculation
  • Language Analysis and Calculation
    HE Xiaowen, LUO Zhiyong, HU Zijuan, WANG Ruiqi
    2021, 35(5): 1-8.
    The grammatical structure of natural language text consists of words, phrases, sentences, clause complexes and texts. This paper re-examines the definition of the sentence in linguistics and the segmentation of sentences in natural language processing, and puts forward the task of Chinese sentence segmentation. Based on clause complex theory, a sentence is defined as the smallest topic-self-sufficient punctuation sequence, and a BERT-based sentence boundary recognition model is designed and implemented. Experimental results show that the accuracy and F1 value of the model are 88.37% and 83.73%, respectively, much better than those of mechanical segmentation by punctuation marks.
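    The abstract does not specify the model beyond "BERT-based"; below is a minimal, hypothetical sketch of how punctuation positions in a Chinese text could be classified as sentence boundaries with a pretrained Chinese BERT. The model name `bert-base-chinese`, the classification head and the example indices are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: classify whether each punctuation position ends a
# "topic self-sufficient" sentence, using a pretrained Chinese BERT encoder.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BoundaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)  # boundary / non-boundary

    def forward(self, input_ids, attention_mask, punct_positions):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Gather the hidden states at punctuation positions and classify each one.
        punct_states = hidden[torch.arange(hidden.size(0)).unsqueeze(1), punct_positions]
        return self.classifier(punct_states)          # (batch, n_puncts, 2)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BoundaryClassifier()
text = "他来了，我们开会。会议很短。"
enc = tokenizer(text, return_tensors="pt")
# Token indices of the candidate punctuation marks (，and 。), counting the [CLS] offset.
punct_positions = torch.tensor([[4, 9, 14]])
logits = model(enc["input_ids"], enc["attention_mask"], punct_positions)
print(logits.argmax(-1))   # label 1 = sentence boundary, 0 = intra-sentence punctuation
```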
  • Language Analysis and Calculation
    ZHANG Yiyang, WANG Zhimin, WU Di, ZHANG Xuan
    2021, 35(5): 9-16,26.
    Through the extraction of emotional vocabulary, this paper compares and analyses the emotional categories, parts of speech, polarity and intensity of the words used in four modern Chinese novels. Meanwhile, we measure Chinese as a second language (CSL) learners' familiarity with this affective vocabulary in terms of receptive vocabulary, and their output performance in terms of productive vocabulary, both examined with the Kohler-Rapp hypothesis test. We find that the use of emotional vocabulary in modern Chinese novels does not vary significantly with author, subject matter or content. Among the 21 types of emotional words, commendatory and derogatory words each account for half of the total vocabulary. CSL learners are unfamiliar with high-frequency emotional vocabulary and possess fewer words with strong feeling; as a result, they produce far fewer emotional verbs and adjectives, and far fewer words expressing sadness.
  • Language Analysis and Calculation
    LIANG Shichen, TANG Xuemei, HU Renfen, WU Jinshan, LIU Zhiying
    2021, 35(5): 17-26.
    Semantic representation is one of the characteristics that distinguish Chinese characters from phonetic scripts. As units of character construction, components are closely related to the meaning of Chinese characters, but how to measure the semantic contribution of components remains an open issue. In this paper, we focus on the components of Chinese characters and train multi-granularity Chinese word embeddings, which prove effective in intrinsic word embedding evaluation and in measuring the motivation of Chinese characters. Based on this model, we further put forward a formula to calculate the semantic ability of components, revealing that components have a certain but limited semantic ability. We then establish a grading system for components that takes their semantic ability into account. Finally, for the teaching of Chinese as a foreign language, we define the scope of component teaching and propose a teaching sequence for Chinese characters.
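    The paper's training objective and its formula for component semantic ability are not given in the abstract; the sketch below only illustrates, under assumed vocabularies and a simple averaging composition, how a multi-granularity lookup might fuse word-, character- and component-level vectors. One plausible measure of a component's semantic ability under such a representation would be the mean cosine similarity between its vector and the vectors of the characters containing it, though the paper's actual formula may differ.

```python
# Hypothetical sketch of a multi-granularity embedding in which a word
# representation mixes word-, character- and component-level vectors.
# The averaging composition and the vocabularies are assumptions.
import torch
import torch.nn as nn

class MultiGranularityEmbedding(nn.Module):
    def __init__(self, n_words, n_chars, n_components, dim=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.char_emb = nn.Embedding(n_chars, dim)
        self.comp_emb = nn.Embedding(n_components, dim)

    def forward(self, word_id, char_ids, comp_ids):
        w = self.word_emb(word_id)                  # (dim,)
        c = self.char_emb(char_ids).mean(dim=0)     # average over the word's characters
        p = self.comp_emb(comp_ids).mean(dim=0)     # average over the characters' components
        return (w + c + p) / 3                      # fused word representation

emb = MultiGranularityEmbedding(n_words=50000, n_chars=8000, n_components=500)
# e.g. the character 河 with components 氵 and 可; all ids below are illustrative
vec = emb(torch.tensor(102), torch.tensor([37]), torch.tensor([5, 88]))
print(vec.shape)   # torch.Size([200])
```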
  • Language Analysis and Calculation
    GENG Libo, YANG Li, FANG Jiaoyan, YANG Yiming
    2021, 35(5): 27-37,62.
    Whether and how human brains can master new grammar rules has been hotly debated in linguistic research, and there is a lack of consensus on the most important factors in grammar rule learning (e.g., age of acquisition and amount of input) and their influences. The current study utilized the Artificial Grammar Learning (AGL) paradigm and Event-Related Potentials (ERPs) to examine longitudinal changes in the neural mechanisms underlying the processing of artificial grammar among adult Mandarin native speakers. We manipulated the amount of input and created three artificial grammars, each featuring a different level of similarity to Mandarin Chinese grammar. The results showed that (a) within the framework of small-data learning, adults can use unsupervised learning to master new grammar rules; (b) different grammar rules can be acquired with a relatively small amount of input and processed to a native-like level; and (c) grammar rules are acquired through competitive interactions between brain mechanisms. These findings contribute to learning theories based on the AGL paradigm and inform future research on natural language processing.
  • Language Analysis and Calculation
    XIE Haihua, LI Aolin, LI Yabo, CHEN Zhiyou, CHENG Jing, LV Xiaoqing, TANG Zhi
    2021, 35(5): 38-45.
    Due to the variability and complexity of Chinese semantic expression, Chinese spelling checking and correction is a challenging task. This paper proposes an approach based on pre-trained language models, named CPLM-CSC, which significantly improves correction performance. In CPLM-CSC, a character-based pre-trained language model is employed for spelling checking, and a masked language model is applied for spelling correction. To further improve performance, CPLM-CSC filters the final results in several ways and applies data augmentation for certain special errors such as the misuse of “的”, “地” and “得”. Tested on the SIGHAN 2015 dataset, the proposed method achieves a state-of-the-art F1 score of 0.654.
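    As an illustration of the correction step only, the hypothetical sketch below masks a character flagged as erroneous and lets a Chinese masked language model propose replacements. The model name, the assumption that the error position is already known, and the lack of candidate filtering are all simplifications relative to CPLM-CSC.

```python
# Hypothetical sketch of the correction step: mask a detected error position
# and let a Chinese masked language model propose candidate characters.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")

def propose_corrections(text, error_index, top_k=5):
    chars = list(text)
    chars[error_index] = tokenizer.mask_token            # mask the suspicious character
    enc = tokenizer("".join(chars), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]          # (1, vocab_size)
    top = torch.topk(logits, top_k, dim=-1).indices[0]
    return tokenizer.convert_ids_to_tokens(top.tolist())

# Position 3 holds 的, which should likely be 地 in "他高兴地跳了起来".
print(propose_corrections("他高兴的跳了起来", 3))
```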
  • Knowledge Representation and Acquisition
  • Knowledge Representation and Acquisition
    PENG Min, HUANG Ting, TIAN Gang, ZHANG Ding, LUO Juan, YIN Yuan
    2021, 35(5): 46-54.
    Knowledge representation learning, which aims to encode entities and relations into a dense, real-valued and low-dimensional semantic space, has drawn massive attention in natural language processing tasks such as relation extraction and question answering. To better capture neighbor information, we propose TransE-NA (Neighborhood Aggregation on TransE), a model based on TransE which determines the number of neighbors according to the sparsity of each entity and then aggregates the most relevant attributes of neighbors according to the corresponding relations. Experimental results on link prediction and triplet classification show that our approach outperforms the baselines, alleviating the data sparsity issue and improving performance effectively.
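    A minimal sketch of the underlying ideas: the TransE translation score ||h + r − t|| plus a naive neighborhood aggregation that mixes an entity embedding with the mean of its neighbors' embeddings. The fixed mixing weight and uniform neighbor selection are assumptions; TransE-NA's sparsity-driven neighbor count and relation-specific aggregation are not reproduced here.

```python
# Hypothetical sketch: TransE scoring plus naive neighbor aggregation.
import torch
import torch.nn as nn

class TransENA(nn.Module):
    def __init__(self, n_entities, n_relations, dim=100, alpha=0.5):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.alpha = alpha    # how much of the entity's own embedding to keep (assumed fixed)

    def entity_repr(self, e_id, neighbor_ids):
        e = self.ent(e_id)
        if neighbor_ids.numel() == 0:
            return e
        agg = self.ent(neighbor_ids).mean(dim=0)          # aggregate neighbor information
        return self.alpha * e + (1 - self.alpha) * agg

    def score(self, h_id, r_id, t_id, h_neighbors, t_neighbors):
        h = self.entity_repr(h_id, h_neighbors)
        r = self.rel(r_id)
        t = self.entity_repr(t_id, t_neighbors)
        return torch.norm(h + r - t, p=1)                 # lower score = more plausible triple

model = TransENA(n_entities=10000, n_relations=200)
s = model.score(torch.tensor(3), torch.tensor(7), torch.tensor(42),
                torch.tensor([11, 12]), torch.tensor([], dtype=torch.long))
print(s.item())
```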
  • Knowledge Representation and Acquisition
    SHE Qixing, JIANG Tianwen, LIU Ming, QIN Bing
    2021, 35(5): 55-62.
    Attributes are an important part of entities, and attribute acquisition is a key step in knowledge graph construction. To complete the attributes of entities in the open-domain Chinese knowledge graph "BigCilin", this paper proposes to exploit, within a Bayesian network, the dependencies between 1) hypernym concepts and attributes and 2) entities and hypernym concepts in order to assign attributes to entities. Compared with similarity-based measures, the method proves valid and significantly improves the attribute coverage of "BigCilin".
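    A toy sketch of the attribute-propagation idea, assuming the factorization P(attribute | entity) ≈ Σ_hypernym P(attribute | hypernym) · P(hypernym | entity); the probabilities, the example entity and the exact structure of the paper's Bayesian network are illustrative only.

```python
# Hypothetical sketch: propagate attributes from hypernym concepts to an entity.
from collections import defaultdict

# Assumed P(hypernym | entity), e.g. for the entity 哈士奇 (husky).
p_hyper_given_entity = {"哈士奇": {"犬": 0.7, "宠物": 0.3}}

# Assumed P(attribute | hypernym), e.g. estimated from attribute co-occurrence counts.
p_attr_given_hyper = {
    "犬":   {"毛色": 0.6, "寿命": 0.3, "产地": 0.1},
    "宠物": {"毛色": 0.2, "寿命": 0.3, "价格": 0.5},
}

def attribute_scores(entity):
    scores = defaultdict(float)
    for hyper, p_h in p_hyper_given_entity.get(entity, {}).items():
        for attr, p_a in p_attr_given_hyper.get(hyper, {}).items():
            scores[attr] += p_h * p_a
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(attribute_scores("哈士奇"))
# roughly: [('毛色', 0.48), ('寿命', 0.30), ('价格', 0.15), ('产地', 0.07)]
```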
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    SONG Yawen, YANG Zhihao, LUO Ling, WANG Lei, ZHANG Yin, LIN Hongfei, WANG Jian
    2021, 35(5): 63-69.
    Mining mutation entities from massive biomedical literature is of great significance to the research of complex diseases. To improve on current solutions based on conditional random fields, this paper proposes a method based on a character-level convolutional neural network, CharCNN-CNN-CRF for short. In this method, a multi-window convolutional neural network is used to obtain character-level word representations, a multi-layer convolutional neural network encodes the context, and a conditional random field layer produces the label sequence. Experimental results show that the proposed method achieves state-of-the-art results on both the tmVar and MutationFinder datasets, with F-measures of 88.34% and 93.57%, respectively.
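    A hypothetical sketch of the character-level, multi-window CNN that produces word representations for mutation mentions (e.g., "p.V600E"); the sentence-level CNN encoder and the CRF layer are omitted, and all dimensions are assumptions.

```python
# Hypothetical sketch: multi-window character CNN producing one vector per word.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=128, char_dim=30, n_filters=30, windows=(2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, kernel_size=w, padding=w // 2) for w in windows
        )

    def forward(self, char_ids):                            # (n_words, max_word_len)
        x = self.char_emb(char_ids).transpose(1, 2)         # (n_words, char_dim, len)
        # One max-pooled feature vector per convolution window, then concatenate.
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)                      # (n_words, n_filters * len(windows))

encoder = CharCNNWordEncoder()
words = torch.randint(0, 128, (5, 12))                      # 5 words, up to 12 characters each
print(encoder(words).shape)                                 # torch.Size([5, 90])
```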
  • Information Extraction and Text Mining
    DING Zeyuan, YANG Zhihao, LUO Ling, WANG Lei, ZHANG Yin, LIN Hongfei, WANG Jian
    2021, 35(5): 70-76.
    In the field of biomedical text mining, biomedical named entity recognition and relation extraction are of great significance. This paper builds a Chinese biomedical entity relation extraction system based on deep learning. First, a Chinese biomedical entity relation corpus is constructed from publicly available English biomedical annotated corpora via translation and manual annotation. Then, ELMo (Embeddings from Language Models) trained on Chinese biomedical text is applied to a Bi-directional LSTM (BiLSTM) combined with conditional random fields (CRF) for Chinese entity recognition. Finally, relations between entities are extracted using a BiLSTM combined with an attention mechanism. Experimental results show that the system can accurately extract biomedical entities and inter-entity relations from Chinese text.
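    A minimal sketch of the relation-extraction stage only: a BiLSTM over the sentence followed by attention pooling and a softmax over relation types. The dimensions, the attention form and the label set are assumptions; the ELMo embeddings and the entity recognition stage are not shown.

```python
# Hypothetical sketch: BiLSTM + attention pooling for relation classification.
import torch
import torch.nn as nn

class BiLSTMAttnRE(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=100, hidden=128, n_relations=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_relations)

    def forward(self, token_ids):                                   # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))                       # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)    # attention over tokens
        pooled = torch.bmm(weights.unsqueeze(1), h).squeeze(1)      # weighted sum of states
        return self.out(pooled)                                     # relation logits

model = BiLSTMAttnRE()
print(model(torch.randint(0, 30000, (2, 20))).shape)                # torch.Size([2, 10])
```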
  • Information Extraction and Text Mining
    YAN Jinghui, XIANG Lu, ZHOU Yu, SUN Jian, CHEN Si, XUE Chen
    2021, 35(5): 77-85.
    Clinical entity normalization is an indispensable part of medical statistics. In practice, a standard clinical term has many colloquial and non-standardized mentions, and for applications such as clinical knowledge base construction, how to normalize these mentions is an issue that has to be addressed. This paper focuses on Chinese clinical entity normalization, i.e., linking non-standard Chinese clinical entities to standard terms in a given clinical terminology base. Specifically, we treat clinical entity normalization as a translation task and employ a deep learning model to generate the core semantics of a clinical mention and obtain a candidate set of standard terms. The final standard terms are obtained by re-ranking the candidate set with a BERT-based semantic similarity model. Experiments on the data of the 5th China Conference on Health Information Processing (CHIP 2019) achieve good results.
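    A hypothetical sketch of the "normalization as translation" framing: a small character-level encoder-decoder that would map a colloquial mention to the core semantics of a standard term, from which a candidate set could be retrieved. The architecture, sizes and random example inputs are assumptions, and the BERT-based re-ranking step is omitted.

```python
# Hypothetical sketch: encoder-decoder that "translates" a mention into a core term.
import torch
import torch.nn as nn

class MentionTranslator(nn.Module):
    def __init__(self, vocab_size=6000, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        h = self.transformer(self.emb(src_ids), self.emb(tgt_ids))
        return self.out(h)                       # next-character logits of the core term

model = MentionTranslator()
src = torch.randint(0, 6000, (1, 12))            # character ids of a colloquial mention
tgt = torch.randint(0, 6000, (1, 8))             # character ids of the partially decoded term
print(model(src, tgt).shape)                     # torch.Size([1, 8, 6000])
```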
  • Information Extraction and Text Mining
    CHONG Weifeng, LI Hui, LI Xue, REN He, YU Dong, WANG Yehan
    2021, 35(5): 86-90.
    The normalization of clinical terms is to assign, to any term written by a doctor, a corresponding term in a standard term set. The task is challenged by the large number of highly similar standard terms, as well as by insufficient training data (the zero-shot or few-shot setting). This paper designs and implements a clinical term normalization system based on BERT entailment ranking. The system consists of four modules: data preprocessing, BERT entailment scoring, BERT quantity prediction, and logistic regression based re-ranking. Tested on CHIP 2019 Track 1, "Evaluation of Chinese Clinical Term Normalization", it achieves a final accuracy of 0.94825, the top score in this campaign.
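    A hypothetical sketch of the entailment-scoring module: the input term and each candidate standard term are packed as a sentence pair and scored with a BERT sequence-pair classifier. The model name and label semantics are assumptions; the quantity-prediction and logistic-regression re-ordering modules are omitted.

```python
# Hypothetical sketch: score (input term, candidate standard term) pairs with BERT.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
scorer = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def rank_candidates(mention, candidates):
    enc = tokenizer([mention] * len(candidates), candidates,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = scorer(**enc).logits                    # (n_candidates, 2)
    scores = torch.softmax(logits, dim=-1)[:, 1]         # probability of "is the standard term"
    order = torch.argsort(scores, descending=True)
    return [(candidates[i], scores[i].item()) for i in order.tolist()]

# The classifier head here is untrained, so the scores are only illustrative.
print(rank_candidates("左膝髌骨骨折", ["髌骨骨折", "股骨骨折", "膝关节脱位"]))
```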
  • Information Extraction and Text Mining
    LI Jingjing, XIA Hongbin, LIU Yuan
    2021, 35(5): 91-100.
    Current collaborative filtering algorithms combined with deep learning models fail to consider that the multi-dimensional interactions in linked data change dynamically over time. This paper proposes LA-NTF, a tensor factorization recommendation model that combines time-aware interaction learning with attention-based long short-term memory (LSTM) networks. First, an LSTM network with an attention mechanism extracts the latent vector of an item from its text information. Second, the multi-dimensional interactions of user-item relational data over time are characterized by attention-based LSTM networks. Finally, the user-item-time 3D tensor is fed into a multi-layer perceptron to learn the non-linear structure between the different latent factors and predict the user's rating of the item. Experiments on two real-world datasets show that LA-NTF significantly outperforms neural-network-based factorization models and other traditional methods in terms of RMSE and MAE, indicating a substantial improvement in rating prediction on various dynamic relational data.
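    A minimal sketch of the final prediction stage only: user, item and time-bin embeddings are concatenated and passed through a multi-layer perceptron to predict a rating. The attention-based LSTM components that LA-NTF uses to build the item and interaction representations are omitted, and all sizes are assumptions.

```python
# Hypothetical sketch: user-item-time embeddings fed to an MLP rating predictor.
import torch
import torch.nn as nn

class TensorMLPPredictor(nn.Module):
    def __init__(self, n_users, n_items, n_time_bins, dim=64):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.time = nn.Embedding(n_time_bins, dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, u, i, t):
        x = torch.cat([self.user(u), self.item(i), self.time(t)], dim=-1)
        return self.mlp(x).squeeze(-1)            # predicted rating

model = TensorMLPPredictor(n_users=1000, n_items=5000, n_time_bins=52)
print(model(torch.tensor([3]), torch.tensor([42]), torch.tensor([17])))
```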
  • Question Answering and Dialogue System
  • Question Answering and Dialogue System
    ZHANG Yuyao, JIANG Yuru, ZHANG Yangsen
    2021, 35(5): 101-109.
    The character identification task aims to map the person mentions in a dialogue to specific person entities in multi-party dialogue scenarios. This paper proposes a method based on multi-scale self-attention enhancement, which uses self-attention at different scales to obtain better information representations. First, global dialogue information in the scene is captured through large-scope global attention. Then, small-scale local attention is computed over local regions of the dialogue to capture the relationships between close-range information. Finally, the information obtained at different scales is fused to enhance the encoded representation. Experimental results on SemEval 2018 Task 4 show the effectiveness of the method, which outperforms the current best system by 18.94% in F1.
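    A hypothetical sketch of the multi-scale idea: one full-range self-attention pass and one locally windowed pass (restricted by a distance mask) are fused with the input. The window size, the additive fusion and the dimensions are assumptions, not the authors' configuration.

```python
# Hypothetical sketch: fuse global self-attention with window-restricted local attention.
import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, window=5):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):                                   # (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Local mask: position i may only attend to positions within +/- window.
        idx = torch.arange(seq_len)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window
        g, _ = self.global_attn(x, x, x)
        l, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        return x + g + l                                    # fuse the two scales

layer = MultiScaleSelfAttention()
out = layer(torch.randn(2, 30, 256))
print(out.shape)                                            # torch.Size([2, 30, 256])
```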
  • Question Answering and Dialogue System
    ZHANG Jun, YANG Yan, HUO Pei, SUN Yuxiang, LI Chengfeng, LI Yong
    2021, 35(5): 110-117.
    An intelligent recommendation dialogue system communicates with users in rich, interactive ways, usually covering multiple types of dialogue such as question answering, chit-chat and recommendation. In contrast to current pipeline models, we propose a knowledge-aware dialogue generation model based on the Transformer to accomplish conversational recommendation over multi-type dialogues. We use a Transformer decoder to implicitly learn the dialogue goal path and generate a reply, and we introduce a knowledge encoder and a copy mechanism to enhance the model's ability to exploit knowledge. Experimental results on the DuRecDial dataset show that the proposed model improves over the baseline models by 59.08%, 110.00% and 66.14% in terms of F1, BLEU and Distinct, respectively. Our model ranked third in the 2020 Language and Intelligent Technology Competition: Conversational Recommendation task.
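    A minimal sketch of a copy mechanism of the kind described: the output distribution mixes the decoder's vocabulary distribution with an attention distribution over the tokens of the retrieved knowledge, weighted by a generation gate. The knowledge encoder, goal planning and the Transformer decoder itself are omitted; shapes and the gating form are assumptions.

```python
# Hypothetical sketch: mix a generation distribution with a copy distribution
# over knowledge tokens, weighted by a learned gate.
import torch
import torch.nn as nn

class CopyOutputLayer(nn.Module):
    def __init__(self, d_model=256, vocab_size=10000):
        super().__init__()
        self.vocab_proj = nn.Linear(d_model, vocab_size)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, dec_state, know_states, know_token_ids):
        # dec_state: (batch, d_model); know_states: (batch, k_len, d_model)
        p_vocab = torch.softmax(self.vocab_proj(dec_state), dim=-1)
        attn = torch.softmax(torch.bmm(know_states, dec_state.unsqueeze(-1)).squeeze(-1), dim=-1)
        p_gen = torch.sigmoid(self.gate(dec_state))                # generate vs. copy
        p_copy = torch.zeros_like(p_vocab).scatter_add_(1, know_token_ids, attn)
        return p_gen * p_vocab + (1 - p_gen) * p_copy              # final token distribution

layer = CopyOutputLayer()
dist = layer(torch.randn(2, 256), torch.randn(2, 8, 256), torch.randint(0, 10000, (2, 8)))
print(dist.sum(dim=-1))                                            # each row sums to ~1
```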
  • Sentiment Analysis and Social Computing
  • Sentiment Analysis and Social Computing
    CHENG Yan, SUN Huan, CHEN Haomai, LI Meng, CAI Yingying, CAI Zhuang
    2021, 35(5): 118-129.
    Text sentiment analysis is an important branch of natural language processing. This paper proposes a capsule model for text sentiment analysis that combines convolutional neural networks with bidirectional GRU networks. First, multi-head attention is used to learn the dependencies between words and capture the emotional words in the text. Then, a convolutional neural network and a bidirectional GRU network extract emotional features of different granularities. After feature fusion, global average pooling produces the instance feature representation of the text, and an attention mechanism generates a feature vector for each emotion category to construct emotion capsules. Finally, the emotion category of the text is judged by the capsule attributes. Tested on the MR, IMDB, SST-5 and Tan Songbo hotel review datasets, the proposed model achieves better classification performance than the baseline models.
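    A hypothetical sketch of the backbone: parallel CNN and bidirectional-GRU branches over word embeddings are fused, pooled, and mapped to one vector per sentiment class whose length serves as the class score (a capsule-style readout). The multi-head attention step, the exact capsule construction and all sizes are assumptions.

```python
# Hypothetical sketch: CNN + BiGRU features with a capsule-style class readout.
import torch
import torch.nn as nn

class CNNBiGRUCapsule(nn.Module):
    def __init__(self, vocab=20000, emb=128, hidden=64, n_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.capsules = nn.Linear(3 * hidden, n_classes * 16)   # one 16-dim vector per class
        self.n_classes = n_classes

    def forward(self, token_ids):                               # (batch, seq_len)
        x = self.emb(token_ids)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)        # local n-gram features
        g, _ = self.gru(x)                                      # sequential features
        fused = torch.cat([c, g], dim=-1).mean(dim=1)           # global average pooling
        caps = self.capsules(fused).view(-1, self.n_classes, 16)
        return caps.norm(dim=-1)                                # class score = capsule length

model = CNNBiGRUCapsule()
print(model(torch.randint(0, 20000, (2, 40))).shape)            # torch.Size([2, 5])
```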
  • Sentiment Analysis and Social Computing
    ZHU Shucheng, SU Qi, LIU Pengyuan
    2021, 35(5): 130-140.
    Gender bias is a hot topic in sociology, and has recently attracted even more attention because machine learning algorithms learn such bias from data. Based on markedness theory, this paper examines unconscious gender bias towards 63 occupations in the BCC and DCC corpora from both synchronic and diachronic perspectives. First, the gender preference of the 63 occupations among different age and gender groups is investigated via questionnaires; the questionnaire results show a significant positive correlation with the occupational gender bias word-frequency indicators in the BCC corpus. Then, from the synchronic perspective, most occupations show a growing gender bias against women in the different domains of the BCC corpus and in the 2018 newspaper texts of the 31 provincial administrative units in the DCC corpus, and occupational gender bias also differs across regions. Finally, from the diachronic perspective, statistical analysis of the DCC newspaper texts from 2005 to 2018 shows that unconscious occupational gender bias exhibits an overall weakening trend.
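    A toy sketch of one possible markedness-based word-frequency indicator: for an occupation word, count how often it appears explicitly marked for each gender (e.g., "女医生" vs. "男医生") relative to its total frequency. The toy corpus and the ratio below are illustrative only and are not the indicators actually computed over the BCC and DCC corpora.

```python
# Hypothetical sketch: a markedness-style gender-marking ratio for an occupation word.
def gender_marking_ratio(corpus, occupation):
    female = corpus.count("女" + occupation)
    male = corpus.count("男" + occupation)
    total = corpus.count(occupation)
    return {"female_marked": female / total if total else 0.0,
            "male_marked": male / total if total else 0.0}

toy_corpus = "这位女医生很耐心。男护士也很专业。医生和护士一起值班。女护士回来了。"
print(gender_marking_ratio(toy_corpus, "护士"))
# {'female_marked': 0.33..., 'male_marked': 0.33...}
```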