2020 Volume 34 Issue 6 Published: 15 July 2020
  

  • Language Analysis and Calculation
  • Language Analysis and Calculation
    YU Jingsong, WEI Yi, ZHANG Yongwei, YANG Hao
    2020, 34(6): 1-8.
    All the Chinese characters in ancient Chinese texts are written continuously, without obvious segmentation marks between words. This brings great challenges to text understanding and even cultural inheritance. To deal with word segmentation for ancient Chinese texts, we propose Multi-Stage Iterative Training (MSIT) for unsupervised word segmentation, combining non-parametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers). It achieves an F1 score of 93.28% on the Zuozhuan (an ancient Chinese history book) dataset. After adding only 500 ground-truth sentences, which can be considered weakly supervised learning, the F1 score reaches 95.55%, outperforming the previous best result, which was trained on 6/7 of the Zuozhuan dataset (about 36,000 ground-truth sentences). When using the same training set, our method achieves an F1 score of 97.40%, a state-of-the-art result. The proposed method not only outperforms traditional sequence labeling algorithms, including the BERT model, but is also shown by experiments to have better generalization ability. The model and related code are available online.
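Both the Bayesian model and the BERT tagger described above rely on the standard reduction of word segmentation to per-character BMES labeling. The sketch below illustrates that interface only; the helper names are illustrative and not from the authors' code.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character
    BMES tags: S for a single-character word, B/M/E for the beginning,
    middle, and end of a multi-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover word boundaries from a BMES tag sequence."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("S", "E"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

The two functions are inverses of each other, which is what lets a segmenter be trained and evaluated purely as a character tagger.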
  • Language Analysis and Calculation
    SUN Kaili, DENG Dunhua, LI Yuan, LI Miao, LI Yang
    2020, 34(6): 9-17,26.
    Compound sentence relation recognition aims to identify the semantic relation between clauses, a key task in the semantic analysis of compound sentences. The task is difficult due to the implicit relations in non-saturated compound sentences. To capture this implicit semantic information, a multi-channel CNN based on an inner-attention mechanism is proposed in this paper. The inner-attention mechanism is built on a Bi-LSTM, which enables it to learn bidirectional semantic features and the associated features between clauses. At the same time, a CNN is used to model the sentence representation and obtain local features. Experimental results on the CCCS and TCT corpora show that the proposed method reaches a macro-F1 score of 85.61% and an average recall of 84.87%, relative improvements of 6.08% and 3.05% over previous results, respectively.
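A minimal NumPy sketch of the kind of inner-attention pooling described above, under the simplifying assumption that each Bi-LSTM time step is scored against a mean-state query (the function name and the query choice are illustrative, not the paper's exact formulation):

```python
import numpy as np

def inner_attention(H):
    """Self-attentive pooling over Bi-LSTM hidden states H (T x d):
    score each time step against a sentence-level mean-state query,
    softmax over time, and return the weighted sum."""
    query = H.mean(axis=0)               # sentence-level query vector (d,)
    scores = H @ query                   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over time steps
    return weights @ H                   # attended clause representation (d,)
```

In the full model, one such attended representation per clause would feed the multi-channel CNN that extracts local features.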
  • Language Analysis and Calculation
    WANG Shaojing, LIU Pengfei, QIU Xipeng
    2020, 34(6): 18-26.
    Aiming at the problem of assigning multiple sequences of labels to the same sentence, we propose a new sequence graph model. The model captures two main kinds of dependencies: the relationships among words along the time-series dimension, and the dependencies of each word across different tasks. We adopt an LSTM or a Transformer-like structure to model information interactions along the time-series dimension, and use an attention mechanism at each step to model the interactions between different tasks and obtain a better representation of each word. Experimental results show that our model not only achieves better performance on OntoNotes 5.0, but can also recover interpretable structures between different task labels.
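One way to picture the per-step cross-task interaction is scaled dot-product attention among the task-specific representations of a single word. The sketch below is a generic illustration of that idea, not the paper's architecture:

```python
import numpy as np

def cross_task_attention(X):
    """X: (K, d) representations of one word under K tagging tasks.
    Each task representation attends over all K tasks via scaled
    dot-product attention, yielding refined per-task representations."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)            # (K, K) task-to-task affinities
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax
    return A @ X                             # (K, d) refined representations
```

The attention matrix `A` is also what would make recovered inter-task structures inspectable.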
  • Language Resources Construction
  • Language Resources Construction
    GE Shili, SONG Rou
    2020, 34(6): 27-35.
    The English-Chinese clause alignment corpus serves the study and application of grammatical structure correspondence between English and Chinese clauses, which is of great significance to linguistic theory and language translation (both human and machine translation). Previous work on grammatical theory and corpora lacks sufficient research on the definitions of the clause and the clause complex, making it theoretically defective and insufficient to support natural language processing applications. This paper first makes theoretical preparations for the construction of an English-Chinese clause alignment corpus. Starting from the theory of the Chinese clause complex put forward in recent years, it defines the concept of component sharing, and further defines the English clause and clause complex based on naming sharing and quotation sharing, which endows the clause and clause complex with integrity and unity. Based on this study, an English-Chinese clause alignment annotation system is designed, covering English NT clause tagging and Chinese translation generation and combination. The corpus annotation shows that, at the clause complex level, the components involved in structural transformation in English-Chinese translation can be limited to English clauses and the related namings and tellings, without involving the internal structure of namings and tellings. The resulting English-Chinese clause-aligned corpus provides research samples for linguistic research, English-Chinese language comparison, and English-Chinese machine translation.
  • Language Resources Construction
    ZHANG Kunli, ZHAO Xu, GUAN Tongfeng, SHANG Baiyu, LI Yumeng, ZAN Hongying
    2020, 34(6): 36-44.
    Medical text is an important data foundation for implementing intelligent healthcare. As semi-structured or unstructured data, medical text needs to be annotated with entities and entity relations, paving the way for text structuring, named entity recognition, and automatic relation extraction. Aiming at the construction of a Chinese medical knowledge graph, a semi-automated entity and relation annotation platform is designed, integrating multiple algorithms for pre-labeling, schedule control, quality control, and data analysis. Entity and relation annotation for the medical knowledge graph is carried out on this platform. The results show that the platform can control the labeling process during the construction of text resources, ensure labeling quality, and improve labeling efficiency.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    REN Ming, XU Guang, WANG Wenxiang
    2020, 34(6): 45-54.
    In order to organize genealogy resources efficiently, it is necessary to extract entities and their relationships from unstructured genealogy text and build a structured representation. The extraction of entities and relationships is often transformed into a sequence tagging task. Given the high density of entities and relationships and the presence of overlapping relations, this paper proposes a conceptual model to guide the extraction. Commonly used deep learning models for sequence tagging are then tested and compared on a real dataset. Experimental results show that BERT-BiLSTM-CRF outperforms the others in terms of precision, recall, and F1 score, and that the proposed method is effective in extracting entities and relationships from genealogy text.
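The CRF layer at the top of a BERT-BiLSTM-CRF tagger decodes the best tag sequence with the Viterbi algorithm. A minimal NumPy sketch of that decoding step (illustrative, not the paper's implementation; emission and transition scores are assumed to come from the trained model):

```python
import numpy as np

def viterbi(emissions, transitions):
    """CRF decoding: emissions (T x L) are per-token label scores from
    the encoder, transitions (L x L) are label-to-label scores.
    Returns the highest-scoring label index sequence."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]  # (L, L)
        back[t] = total.argmax(axis=0)   # best previous label per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With strongly negative off-diagonal transition scores, the decoder resists label changes even when emissions prefer them, which is exactly what lets a CRF enforce tag-sequence consistency.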
  • Information Extraction and Text Mining
    YANG Yifan, CHEN Wenliang
    2020, 34(6): 55-63.
    At present, the Internet contains a large amount of entity introduction text, which provides a resource basis for the construction of entity knowledge. An alias, as an entity attribute, is an alternative expression of an entity's official name and is of great significance in knowledge graphs. In this paper, introduction texts of tourist attractions are used as the corpus, and an alias annotation strategy is proposed by combining different alias description patterns. Alias extraction can be divided into two subtasks: entity recognition and relation classification. This paper proposes a deep learning based joint model for scenic entity alias extraction that completes the two subtasks simultaneously. Experimental results on the dataset constructed in this paper show that the performance of the joint model is significantly improved compared with the pipelined model.
  • Information Extraction and Text Mining
    MA Jin, YANG Yifan, CHEN Wenliang
    2020, 34(6): 64-72.
    Attribute recognition aims to obtain the attribute values of entities from unstructured text. Extracting person attributes from text usually requires a large amount of annotated data, which is not yet available. To address this issue, we use the Infobox of encyclopedia web pages to construct tuples of person attributes, and then apply distant supervision to obtain large-scale, multi-category annotated datasets for person attributes, avoiding the tedious process of manual annotation. Additionally, we present two models based on CRF and BiLSTM-CRF for person attribute recognition as baseline systems. Experimental results show that BiLSTM-CRF performs better than CRF on this newly built dataset.
  • Information Extraction and Text Mining
    XIAN Yantuan, XIANG Yan, YU Zhengtao, WEN Yonghua, WANG Hongbin, ZHANG Yafei
    2020, 34(6): 73-80,88.
    Text classification is a fundamental issue in natural language processing. Based on prototypical networks, this paper proposes a mean prototype network that integrates the prototype vectors of different time steps through a moving average, and combines it with a simple RNN to form a novel text classification model. The model uses a single-layer RNN to learn the vector representation of a text, and learns category vector representations via the mean prototype network. The distance between the text vector and the prototype vectors is used both to train the model and to predict the text category. Compared with existing neural text classification methods, the model features a shallower architecture, fewer parameters, and the use of inter-sample similarity during training and prediction. The proposed method achieves state-of-the-art results on five benchmark text classification datasets.
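A minimal NumPy sketch of the two ideas named above: averaging per-time-step prototype vectors into one mean prototype per class, and classifying by distance to the nearest prototype (illustrative shapes and names, not the paper's code):

```python
import numpy as np

def mean_prototypes(step_protos):
    """step_protos: (T, C, d) class prototype vectors produced at each
    of T RNN time steps. A simple moving average collapses them into
    one mean prototype per class, shape (C, d)."""
    return step_protos.mean(axis=0)

def classify(x, prototypes):
    """Predict the class whose mean prototype is nearest to x
    in Euclidean distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return int(dists.argmin())
```

Training would push each text vector toward its own class prototype and away from the others, so the same distance drives both learning and prediction.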
  • Machine Reading Comprehension
  • Machine Reading Comprehension
    TAN Hongye, QU Baoxing
    2020, 34(6): 81-88.
    Machine reading comprehension (MRC) requires a machine to read a given passage and then answer relevant questions. A number of datasets and models have been proposed for specific types of questions, without dealing with the diversity of questions in the real world. In this paper, we propose a multi-task reading comprehension model based on BERT. It uses an attention mechanism to obtain multiple representations of questions and passages and then classifies the questions. The model then utilizes the classification results to answer the various types of questions. Experiments on the Chinese public machine reading comprehension dataset CAIL2019-CJRC show that our system achieves better results than all the baseline models.
  • Machine Reading Comprehension
    ZHANG Zhaobin, WANG Suge, CHEN Xin, ZHAO Linling, WANG Dian
    2020, 34(6): 89-96,105.
    In the Chinese reading comprehension section of the college entrance examination, opinion questions are rich in abstract viewpoint expressions. In order to obtain answer information related to a question from the reading material, the abstract words in the question need to be expanded, resulting in an expansion of the opinion question. This paper proposes a question expansion modeling method based on a multi-task hierarchical Long Short-Term Memory network (Multi-HLSTM). First, the reading material and the question are connected via an attention mechanism. At the same time, the two tasks of question prediction and answer prediction are modeled jointly to further expand the question. Finally, the expanded question and the original question are both applied to extract candidate answer sentences. Experimental results on opinion-question reading comprehension datasets from the Chinese college entrance examination and its simulation tests, as well as the description and opinion subsets of DuReader, show that the proposed question expansion model is effective for candidate sentence extraction.
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    YUAN Tao, NIU Shuzi, LI Huiyuan
    2020, 34(6): 97-105.
    Sequential recommendation attempts to use the historical interaction sequence between users and items to predict the next item a user will interact with. A multi-scale temporal dynamic model for sequential recommendation based on the Clockwork RNN is proposed to address the uncertainty of whether a recommended item depends on the user's long-term global interest, medium-term interest, or short-term local interest. First, a CW-RNN layer is introduced to extract the user's multi-scale temporal interest features from the historical user-item interaction sequence. A CNN convolution over the time-scale dimension is then used to learn the user's interest dependencies at different time scales and to generate a unified interest representation. Finally, a fully connected layer models the interaction between the unified multi-scale user interest representation and the item embedding representations. Experiments are carried out on two public datasets, MovieLens-1M and Amazon Movies and TV. The results show that the proposed model improves accuracy by 3.80% and 8.63%, respectively, compared with the best existing sequential recommendation algorithms.
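The defining trick of the Clockwork RNN is that different parts of the hidden state update at different clock periods, so slow units retain long-term interest while fast units track recent behavior. A minimal single-step sketch (per-unit periods for simplicity; names and shapes are illustrative, not the paper's model):

```python
import numpy as np

def cwrnn_step(h, x, W, U, periods, t):
    """One Clockwork RNN step: hidden unit i updates only when t is a
    multiple of periods[i] (units sharing a period form a module);
    otherwise it keeps its previous value."""
    h_new = np.tanh(W @ h + U @ x)                  # candidate update
    active = np.array([t % p == 0 for p in periods])
    return np.where(active, h_new, h)               # freeze inactive units
```

Iterating this step over the interaction sequence yields hidden units that evolve at different time scales, which is what the CNN over the time-scale dimension then aggregates.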
  • NLP Application
  • NLP Application
    WANG Chencheng, YANG Liner, WANG Yingying, DU Yongping, YANG Erhong
    2020, 34(6): 106-114.
    Grammatical error correction is an important task in natural language processing that has attracted wide attention in recent years. This paper treats grammatical error correction as a translation task that translates wrong text into correct text. We use the Transformer model with multi-head attention as the framework, and propose a dynamic residual structure that dynamically combines the outputs of different neural blocks to better capture semantic information. To address the lack of training corpora, we propose a data augmentation method that generates parallel data by corrupting a monolingual corpus. Experimental results show that the proposed method, based on dynamic residuals and data augmentation, significantly improves error correction performance, achieving the best result on the NLPCC 2018 Chinese grammatical error correction task.
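Corruption-based augmentation pairs each clean monolingual sentence with a synthetically noised copy, giving (wrong, right) parallel data for free. A toy sketch of one such corruption pass (the noise types and rates here are illustrative assumptions, not the paper's recipe):

```python
import random

def corrupt(sentence, rng, p=0.3):
    """Generate a noisy source for GEC training by randomly deleting,
    duplicating, or swapping characters of a clean sentence.
    p is the total per-character corruption probability."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p / 3:                        # delete this character
            pass
        elif r < 2 * p / 3:                  # duplicate it
            out.extend([chars[i], chars[i]])
        elif r < p and i + 1 < len(chars):   # swap with the next one
            out.extend([chars[i + 1], chars[i]])
            i += 1
        else:                                # keep it unchanged
            out.append(chars[i])
        i += 1
    return "".join(out)
```

Each (corrupt(s), s) pair can then be fed to the Transformer exactly like a translation example.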