Journal of Chinese Information Processing

Select

Language Analysis and Calculation

Automatic Partition of Gan Dialect in Jiangxi Province Based on Spectrogram

YAN Weizhi, WANG Mingwen, XU Fan, DAN Yangjie, LUO Jian

2021, 35(4): 1-7,15.

Abstract ( ) PDF ( )

Knowledge map

Save

Chinese dialect partition is a vital issue in linguistics. In contrast to the traditional manual dialect partition according to the vocabulary and the grammar, this paper studies how to effectively use the features of speech itself to automatically partition the dialect. This paper first constructs 1,223 speech corpora of 1,500 minutes from the 11 municipalities and 91 county-level administrative regions in Jiangxi Province. Then a deep learning feature extraction model based on CNN self-encoding dimension reduction spectrogram has been put forward. The k-means clustering, Gaussian mixture clustering and hierarchical clustering are examined, respectively. The results revealed that, according to the cluster performance metrics DBI and DI index, the proposed language spectrogram features significantly outperform traditional MFCC features. Under the 16-dimension, the clustering effect of the concatenation of the spectrogram feature and the MFCC feature is found to be close to that of the traditional artificial dialect partition.

Select

Knowledge Representation and Acquisition

Knowledge Representation and Sentence Segmentation of Ancient Chinese Based on Deep Language Models

HU Renfen, LI Shen, ZHU Yuchen

2021, 35(4): 8-15.

Abstract ( ) PDF ( )

Knowledge map

Save

Sentence segmentation of ancient Chinese texts is a very difficult task even for experts in this area, since it not only relies on the sentence meaning and the contextual information, but also requires historical and cultural knowledge. This paper proposes to build knowledge representation of ancient Chinese with BERT, a deep language model, and then construct the sentence segmentation model with Conditional Random Field and Convolutional Neural Networks. Our model achieves significant improvements in all of the three ancient text styles. It achieves 99%, 95% and 92% F₁ scores for poems, lyrics and prose texts, respectively, out-performing Bi-GRU by 10% in lyrics and proses which are more difficult to segment. In further case studies, the method achieves good results in the difficult cases in published ancient books.

Select

Knowledge Representation and Acquisition

An Improved Learning Method for GloVe Word Vector Representation

SHI Junfeng, LI Jihong, WANG Ruibo

2021, 35(4): 16-22.

Abstract ( ) PDF ( )

Knowledge map

Save

GloVe model is a popular word vector model, which is revealed a better performance with the increase of word vector dimensions. To avoid intolerable time cost in training high-dimension word vector, we propose an improved method of GloVe which is easily implemented in a parallel manner. We first construct a co-occurrence matrix with symmetrical windows and a co-occurrence matrix with asymmetrical windows on a corpus respectively, and apply the original GloVe model for two low-dimension word vectors. Then we concatenate the word vectors as the final high-dimension vectors. Tested in large-scale corpus, the 600-dimension word vector trained by the proposed method achieves better performance in the Chinese and English word analogy task and word clustering task, compared with the 600-dimension word vector trained directly by GloVe model.

Select

Knowledge Representation and Acquisition

Research on Consistency Check of Sememe Annotations in HowNet

LIU Yangguang, QI Fanchao, LIU Zhiyuan, SUN Maosong

2021, 35(4): 23-34.

Abstract ( ) PDF ( )

Knowledge map

Save

Sememes are defined as the minimum semantic units of human languages that cannot be subdivided. The meaning of a word can be defined by a combination of multiple sememes. Sememe-based linguistic knowledge bases(KBs), in which words are manually annotated with sememes, have been successfully constructed and utilized in many NLP tasks. However, the manual annotation of sememes is time-consuming and labor-intensive, and person bias will be inevitably introduced, which prejudices annotation consistency and accuracy. In this paper, we for the first time propose a method to conduct automatic consistency check of sememe annotations in HowNet. Experimental results demonstrate the effectiveness of out method, which show that our method can be applied to the annotation consistency check and extension of HowNet.

Select

Knowledge Representation and Acquisition

Attribute Alignment Based on Multi-Similarity Measure and Set Encoding

WU Jiahao, CHEN Bo , HAN Xianpei, SUN Le

2021, 35(4): 35-43.

Abstract ( ) PDF ( )

Knowledge map

Save

The goal of attribute alignment is to find the corresponding relationship which representing the same concept in heterogeneous knowledge graph. It is one of the key technologies to knowledge fusion. The existing models based on rules and word embedding are defected in incomplete similarity measurement and insufficient using of attribute instance information. To address this issue, this paper proposes an attribute alignment model based on multi similarity measures. We design similarity measures from multiple perspectives, and use machine learning model to aggregate this kind of features. At the same time, this paper proposes the attribute instance set representation learning algorithm. We extract the topic similarity between sets by encoding the attribute instance set as vectors, so as to assist attribute alignment. Experiments prove the validity of the model, and show that the set representation learning algorithm can effectively capture the subject feature of attribute instances and significantly improve the attribute alignment results.

Select

Information Extraction and Text Mining

A Method for Surgery Term Normalization Based on Text Similarity and BERT Model

YANG Feihong, SUN Haixia, LI Jiao

2021, 35(4): 44-50.

Abstract ( ) PDF ( )

Knowledge map

Save

To explore the method for surgery term normalization, this paper proposes a method of combining text similarity and BERT model. The model scheme is the text similarity ranking + BERT sentence pair matching model. This paper also analyzes the characteristics of the normalized surgery terms, and provides the related methods of clinical term normalization. In the CHIP2019 surgical term normalization task, the accuracy of this method on the verification set is 88.35%, and the accuracy on the test set is 88.51%, and the system based on this method ranked 5th among all participating teams.

Select

Information Extraction and Text Mining

Named Entity Recognition Based on Partially Labelled Data and Empirical Distribution

SONG Yexuan, CHEN Zhao, WU Gang

2021, 35(4): 51-57.

Abstract ( ) PDF ( )

Knowledge map

Save

In recent years, data-driven named entity recognition(NER) methods have achieved great success in many fields such as news, biomedical and so on. In order to reduce the cost of labeling for a new domain, a NER method based on partially labelled data and empirical distribution is proposed. We describe the modeling method based on partially labelled data, and then introduce the hypothesis of label empirical distribution. By adding the empirical distribution to the model, the noise in the data is effectively reduced. Tested on the datasets of plant diseases &insect pests and Youku video, and the results show that the proposed method is better than other methods.

Select

Information Extraction and Text Mining

Chemical Protein Relation Extraction Based on Shortest Dependency Path and Ensemble Learning

CHENG Wei, SHAO Yifan, QIAN Longhua, ZHOU Guodong

2021, 35(4): 58-65.

Abstract ( ) PDF ( )

Knowledge map

Save

The extraction of interaction between chemical and protein plays an important role in the research of precision medicine and drug discovery. This paper proposes a Bi-LSTM model based on the shortest dependency path and attention mechanism, and applies it to chemical protein relation extraction. In terms of features, part-of-speech, position and dependency type on the shortest dependent path are considered. Experiments on the BioCreative VI CHEMPROT task show that the proposed method achieves better F₁-value performance than systems based on dependency information. At the same time, the ensemble method further improves the performance of chemical protein relation extraction.

Select

Information Extraction and Text Mining

The Construction of the Eventic Graph for the Political Field

BAI Lu, ZHOU Ziya, LI Binyang, LIU Yuhan, SHAO Zhixuan, WU Huarui

2021, 35(4): 66-74,82.

Abstract ( ) PDF ( )

Knowledge map

Save

The Eventic Graph(EG) is a directed graph that describes the logic relationship between events, such as continuation and causality, etc. In contrast to the current researches focusing on the event extraction on the open domain. this paper aims at the construction of the eventic graph for the political field. We establish an annotation scheme for political events and construct an event corpus for the political field. Moreover, we present a character embedding based neural network by integrating the attention mechanism and a BERT+BiLSTM framework for political event extraction as the pipeline and the joint model, respectively. Experiments on our constructed corpus show that the porposed method could achieve significant improvement on event classification and argument classification in terms of F₁-score compared with previous neural network based methods.

Select

Information Extraction and Text Mining

Clinical Term Normalization Based on BERT

SUN Yuejun, LIU Zhiqiang, YANG Zhihao, LIN Hongfei

2021, 35(4): 75-82.

Abstract ( ) PDF ( )

Knowledge map

Save

The diversity of clinical terms in electronic medical records hinder the analysis and utilization of medical data. To address this issue, this paper proposes a method of clinical term normalization based on BERT. The method uses Jaccard similarity to select the candidate words from the standard term set, and matches the original words and candidate words based on BERT model to obtain standardized results. Evaluated on the dataset of CHIP2019 clinical term normalization evaluation task, the method obtains 90.04% accuracy.

Select

Information Extraction and Text Mining

Bridge Inspection Named Entity Recognition Based on Transformer-BiLSTM-CRF

LI Ren, LI Tong, YANG Jianxi, MO Tianjin, JIANG Shixin, LI Dong

2021, 35(4): 83-91.

Abstract ( ) PDF ( )

Knowledge map

Save

The information extraction for bridge inspection reports is a less addressed issue, which contain a large amount of key business information such as structural component parameters and inspection description. Clarifying the task of named entity recognition in this field, this paper also reveals the characteristics of the entities to be identified, such as location name or route name nesting, character ambiguity, context location correlation and direction sensitivity. A bridge inspection named entity recognition approach is then proposed based on Transformer-BiLSTM-CRF. First, the Transformer encoder is used to model the long-distance position-dependent features of text sequences, and the BiLSTM network is adopted to further capture the direction-sensitive features. Finally, the labeled sequence prediction is implemented via the CRF model. The experimental results show that, compared with the mainstream named entity recognition models, the proposed model achieves better performance.

Select

Question-answering and Dialogue

Deep Active Learning Method for Question Intention Recognition

FU Yuwen, MA Zhirou, LIU jie, BAI Lin, BO Manhui, YE Dan

2021, 35(4): 92-99,109.

Abstract ( ) PDF ( )

Knowledge map

Save

Deep learning has achieved best performance in many natural language processing tasks on the basis of large amount of annotation data. To reduce the cost of corpus annotation, this paper combines the active learning and deep learning to identify the corpus of question intent. To minimize the iteration of retraining in active learning, a lightweight architecture suitable for question intent recognition task is proposed by using a deep learning model consisting of a two-layer CNN structure. At the same time, in order to better evaluate the value of the sample, a multi-criteria active learning method is designed by considering the information, representativeness and diversity of samples. Finally, experiments on the civil aviation customer service corpus show that the method can reduce the annotation workload by about 50%, which is also validated by the public dataset TREC question classification corpus.

Select

Question-answering and Dialogue

A Neural Network for Long Answer Selection with Coarse and Fine-grained Information

SUN Yuan, WANG Jian, ZHANG Yijia, QIAN Lingfei, LIN Hongfei

2021, 35(4): 100-109.

Abstract ( ) PDF ( )

Knowledge map

Save

The long answer selection plays an important role in non-factoid question answering systems such as community question answering and open-domain question answering systems. To improve the performance of long answer selection, we propose a novel model which combines coarse (sentence-level) and fine-grained (word-level) information. Our model also alleviates the following two issues: ① not all the important information in a long sequence can be modeled by a single vector, and ② the failure to capture global information under the compare-aggregate framework. Besides, our model uses fine-grained information without extra training parameters. The experiments on InsuranceQA dataset show that the proposed model outperforms the state-of-the-art sequence models by 3.30% in accuracy.

Select

Question-answering and Dialogue

Joint Question Type and Penalty Mechanism for Question Generation

WU Kaili, ZHU Mengmeng, ZHU Hongyu, ZHANG Yitian, HONG Yu

2021, 35(4): 110-119.

Abstract ( ) PDF ( )

Knowledge map

Save

Question Generation aims to understand the semantics of the input and generate questions automatically. This paper focuses on answer-aware question generation, i.e. the input sentence is the target answer of the generated question. We propose a question generation model which integrates the question type and a penalty mechanism. We first fine-tune a pre-trained model BERT to get a question type classifier. Then we use the gate mechanism in the encoder to fuse the source representation and question type information, and obtain the question-type-aware representation. In addition, to alleviate the repeated words in the generated question, we propose a penalty mechanism of generating word from the target answer into the loss. Experimental results show the proposed method can effectively improve the accuracy of question type, and reduce the generated words from target answer to a certain extent. In the SQuAD dataset, the BLEU-4 achieves 18.52%, and the accuracy of question type reaches 93.46%.

Select

Question-answering and Dialogue

Machine Reading Comprehension Based on Sequence and Graph Structure

CHEN Zheng, REN Jiankun, YUAN Haorui

2021, 35(4): 120-128.

Abstract ( ) PDF ( )

Knowledge map

Save

Machine Reading Comprehension(MRC) is an essential and challenging task in Natural Language Processing(NLP). As the state-of-the-art solution, the BERT-based reading comprehension model, however, is defected in long-distance and global semantic construction owing to the structure and scale of the sequence model,. This paper proposes a new Machine Reading Comprehension method combining sequence and graph structure. First, named entities are extracted from the text, and sentence co-occurrence and sliding window co-occurrence are used to construct named entity co-occurrence diagram. Then a spatial-based Graph Convolutional Neural Network is designed to learn the embedded representation of the named entities. And the entity embedded representation obtained by graph structure is fused with the text embedded representation obtained by the sequence structure. In the end, the answer for the Machine Reading Comprehension question is decided by segment extraction. The experimental results show that, compared with the sequential structure-based reading comprehension model based on BERT, our model has achieved 7.8％ improvements in EM, and 6.6％ in F₁.

Select

Multimodal Natural Language Processing

Chinese Image Captioning Based on Middle-Level Visual-Semantic Composite Attributes

XIAO Yuhan, JIANG Aiwen, WANG Mingwen, JIE Anquan

2021, 35(4): 129-138.

Abstract ( ) PDF ( )

Knowledge map

Save

Image captioning is a multi-modal information processing task in the cross domain of computer vision, natural language processing and machine learning. In contrast to the existing studies on English image captioning, this paper proposes an image Chinese image captioning algorithm by extracting multi-level visual semantic attributes for content representation. Experiments are performed on the AI Challenger 2017, the largest Chinese image captioning dataset at present, and the Flick8k-CN Chinese image captioning dataset. Compared with mainstream image description algorithms, the algorithm has a significant improvement of about 3%-30%.

Please choose a citation manager

Content to export

2021 Volume 35 Issue 4 Published: 07 May 2021