2022 Volume 36 Issue 10 Published: 30 December 2022
  

  • Survey
    HUANG Zhenya, LIU Qi, CHEN Enhong, LIN Xin, HE Liyang, LIU Jiayu, WANG Shijin
    2022, 36(10): 1-16.
    One of the important research directions in integrating artificial intelligence into pedagogy is analyzing the meaning of educational questions and simulating how humans solve problems. In recent years, a large number of educational question resources have been collected, which provides data support for the related research. Leveraging big data analysis and natural language processing techniques, researchers have proposed many text analysis methods for educational questions, which are of great significance for exploring the cognitive abilities underlying how humans master knowledge. In this paper, we summarize several representative topics, including question quality analysis, machine reading comprehension, math problem solving, and automated essay scoring. Moreover, we introduce the relevant public datasets and open-source toolkits. Finally, we conclude by anticipating several future directions.
  • Language Analysis and Calculation
    DOU Zujun, HONG Yu, LI Xiao, ZHOU Guodong
    2022, 36(10): 17-26.
    Implicit discourse relation recognition is to determine the semantic relations between arguments in the absence of explicit connectives. The challenge lies in the small scale of the existing training data and the relatively limited semantic diversity it contains. To address this issue, this paper proposes a novel discourse relation recognition method based on an interactive-attention-based mask language model. The motivations are: ① the mask language model acquires local language generation ability during self-supervised learning, i.e., the ability to reconstruct the semantic representation of the masked region from an understanding of the contextual semantics; ② mask reconstruction acts as data augmentation (potentially automatic data expansion) and improves the robustness of discourse relation recognition. Technically, the method calculates interactive-attention weights between the arguments and then selects keywords to mask according to these weights. Experiments on the Penn Discourse Treebank 2.0 (PDTB 2.0) show that the proposed method increases the F1 score by 3.21%, 6.46%, 2.74%, and 6.56% for the four top-level relations (Comparison, Contingency, Expansion, and Temporal), respectively.
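    The masking strategy described in the abstract can be sketched roughly as follows: score each token of one argument by the attention it receives from the other argument, then mask the top-scoring tokens. This is a simplified illustration with toy vectors, not the paper's implementation.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mask_by_interactive_attention(arg1_tokens, arg1_vecs, arg2_vecs, k=1):
    """Accumulate, for each token of argument 1, the attention weight it
    receives from every token of argument 2, then mask the top-k tokens."""
    weight = [0.0] * len(arg1_vecs)
    for v2 in arg2_vecs:
        # dot-product attention from this arg2 token over all arg1 tokens
        dots = [sum(a * b for a, b in zip(v1, v2)) for v1 in arg1_vecs]
        for i, w in enumerate(softmax(dots)):
            weight[i] += w
    top = sorted(range(len(weight)), key=lambda i: -weight[i])[:k]
    return ["[MASK]" if i in top else t for i, t in enumerate(arg1_tokens)]
```

With toy 2-d vectors, the token most aligned with the other argument is the one replaced by "[MASK]".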
  • Language Analysis and Calculation
    CHEN Hongbin, ZHANG Yujie, XU Jin'an, CHEN Yufeng
    2022, 36(10): 27-37,44.
    Parsing is a key technology in natural language processing. Neural network based parsing models require large-scale annotated data, so data augmentation technology is needed to extend the existing treebank. This paper proposes a data augmentation approach for parsing based on a lexicalized tree adjoining grammar. To generate sentences with varied expressions and correct syntactic structure, we design and implement a lexicalized tree extraction algorithm and a parse tree synthesis algorithm, in which "adjoining" and "substitution" operations are utilized to derive new syntactic trees. To keep the derived sentences semantically correct, we use a language model to evaluate them. Experiments on 20% of the Chinese treebank CTB5 show that models trained with the derived data improve dependency and constituency parsing accuracy by 1.39% and 2.14%, respectively.
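    As a toy illustration of the "substitution" operation used to derive new trees: the actual algorithm operates on lexicalized TAG elementary trees, but the idea of swapping in a subtree of the same category can be sketched with simple (label, children) tuples (the representation here is an assumption).

```python
def substitute(tree, category, new_subtree):
    """Replace the first subtree whose label matches `category` with
    `new_subtree`; trees are (label, children) tuples, leaves are strings."""
    label, children = tree
    out, done = [], False
    for child in children:
        if not done and isinstance(child, tuple) and child[0] == category:
            out.append(new_subtree)          # substitution point found
            done = True
        elif not done and isinstance(child, tuple):
            replaced = substitute(child, category, new_subtree)
            done = replaced != child         # did the recursion substitute?
            out.append(replaced)
        else:
            out.append(child)                # leaf or already done
    return (label, out)
```

Substituting a different NP into a parse tree yields a new sentence with the same syntactic structure, which the language model then scores for semantic plausibility.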
  • Language Resources Construction
    WANG Xiaohui, LI Ru, WANG Zhiqiang, CHAI Qinghua, HAN Xiaoqi
    2022, 36(10): 38-44.
    Frame semantic role labeling is a semantic analysis task based on FrameNet. Semantic role labeling usually has a strong dependence on syntax. Most current semantic role labeling models are based on Bi-LSTM, which can capture long-distance dependency information in sentences but cannot capture syntactic information well. In this paper, we introduce the self-attention mechanism into the semantic role labeling model. Experimental results show that the F1 score of the model on the CFN (Chinese FrameNet) dataset is improved, which proves that the self-attention mechanism can improve the performance of the Chinese frame semantic role labeling model.
  • Language Resources Construction
    ZHANG Kunli, REN Xiaohui, ZHUANG Lei, ZAN Hongying, ZHANG Weicong, SUI Zhifang
    2022, 36(10): 45-53.
    A medicine knowledge base with a complete classification system and comprehensive drug information can provide a basis and support for clinical decision-making and rational drug use. Taking multiple domestic medical resources as references and data sources, this paper establishes the knowledge description system and classification system of the knowledge base, standardizes the classification of drugs, forms detailed knowledge descriptions, and constructs a multi-source Chinese Medicine Knowledge Base (CMKB). The classification of CMKB includes 27 first-level and 119 second-level categories, and describes 14,141 drugs at multiple levels such as indications, dosage, and administration. Furthermore, the BiLSTM-CRF and T-BiLSTM-CRF models are used to extract disease entities from unstructured descriptions, yielding structured extraction of drug attributes and establishing knowledge associations between drug entities and the automatically extracted disease entities. The constructed CMKB can be connected with Chinese medical knowledge graphs to expand drug information, and can provide a knowledge basis for intelligent diagnosis and medical question answering.
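    Both extraction models end in a CRF decoding step. A minimal Viterbi decoder over per-token tag scores looks like the sketch below; the tag set and all scores are toy values, not the paper's trained parameters.

```python
def viterbi(emissions, transitions):
    """emissions: per-token dicts tag -> score (e.g. from a BiLSTM);
    transitions: dict (prev_tag, tag) -> score, default 0;
    returns the highest-scoring tag path."""
    tags = list(emissions[0])
    score = dict(emissions[0])
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in tags:
            # best previous tag for this position/tag pair
            prev = max(tags, key=lambda p: score[p] + transitions.get((p, cur), 0.0))
            new_score[cur] = score[prev] + transitions.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        score = new_score
        back.append(pointers)
    # follow back-pointers from the best final tag
    best = max(tags, key=score.get)
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With BIO tags, the decoded path directly yields disease entity spans in a drug description.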
  • Knowledge Representation and Acquisition
    XU Yao, HE Shizhu, LIU Kang, ZHANG Chi, JIAO Fei, ZHAO Jun
    2022, 36(10): 54-62.
    In recent years, embedding models for deterministic knowledge graphs have made great progress in tasks such as knowledge graph completion. However, how to design and train embedding models for uncertain knowledge graphs remains an important challenge. Unlike in deterministic knowledge graphs, each fact triple in an uncertain knowledge graph has a corresponding confidence, so an uncertain knowledge graph embedding model needs to accurately estimate the confidence of each triple. Existing uncertain knowledge graph embedding models have relatively simple structures, can only deal with symmetric relations, and cannot handle the false-negative problem well. To solve these problems, we first propose a unified framework for training uncertain knowledge graph embedding models, which uses a multi-model based semi-supervised learning method. To reduce the noise in semi-supervised samples, we use Monte Carlo Dropout to calculate the model's uncertainty about its outputs and effectively filter noisy samples according to this uncertainty. In addition, to better represent the uncertainty of entities and relations and to handle more complex relations, we propose UBetaE, an uncertain knowledge graph embedding model based on the Beta distribution, which represents both entities and relations as sets of mutually independent Beta distributions. Experimental results on public datasets show that the combination of the semi-supervised learning method and the UBetaE model not only greatly alleviates the false-negative problem but also significantly outperforms current SOTA uncertain knowledge graph embedding models such as UKGE on multiple tasks.
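    The Monte Carlo Dropout filtering idea can be sketched as: run the stochastic forward pass several times per sample and keep only samples whose predicted confidence is stable. The `predict` callable, pass count, and threshold below are hypothetical placeholders, not values from the paper.

```python
import statistics

def mc_dropout_filter(predict, samples, passes=20, max_std=0.1):
    """predict(x) is a forward pass with dropout left active, so repeated
    calls return different confidence scores; keep samples whose scores
    vary little across passes, paired with their mean score."""
    kept = []
    for x in samples:
        scores = [predict(x) for _ in range(passes)]
        if statistics.pstdev(scores) <= max_std:
            kept.append((x, statistics.mean(scores)))
    return kept
```

Samples whose score fluctuates wildly across dropout passes are treated as noise and dropped from the semi-supervised training set.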
  • Ethnic Language Processing and Cross Language Processing
    YU Tao, Nyima Ciren, YONG Cuo, Nyima Trashi
    2022, 36(10): 63-72.
    Entity relation extraction aims to recognize the relations between entities in sentences. This paper proposes a Tibetan entity relation extraction method combining Graph Sampling and Aggregation (GraphSAGE) with the ALBERT pre-trained model. The pre-trained language model is used to obtain high-quality sentence features, and the input to the GraphSAGE model is generated by the graph-structured data construction and representation method designed in this paper. Experimental results show that our method is effective and superior to the baseline methods.
  • Ethnic Language Processing and Cross Language Processing
    Thupten Tsering, Rinchen Dhondub, Nyima Tashi, Pema Tashi, CAI Zangtai
    2022, 36(10): 73-80.
    Sentence segmentation is an essential task in Tibetan processing. According to the structural characteristics of Tibetan sentences, this paper proposes a deep Tibetan sentence segmentation model that integrates Tibetan dependency syntax. The model first encodes the input sequence into word embeddings and Tibetan dependency syntax embeddings, which are then concatenated and fed into a bidirectional LSTM to capture the sequential context. A final CRF layer predicts the segmentation. Experimental results show that the F1 value of this model reaches 99.4%.
  • Ethnic Language Processing and Cross Language Processing
    WU Huijuan, FAN Daoerji, BAI Fengshan, Tengda, PAN Yuecai
    2022, 36(10): 81-87.
    One major feature of Mongolian is the seamless connection of characters within a word, so a Mongolian word has multiple character division methods. A multi-scale Mongolian offline handwriting recognition method is proposed, in which one image of a handwritten Mongolian word is mapped to multiple target sequences to train the model. This paper distinguishes three candidate character division methods: "Twelve Prefix" code, presentation form code, and grapheme code. The multi-scale model processes the handwritten image sequence with a Bidirectional Long Short-Term Memory network, whose outputs are fed into Connectionist Temporal Classification (CTC) layers to map the image to the "Twelve Prefix" code sequence, the presentation form code sequence, and the grapheme code sequence, respectively. The sum of the three CTC losses is used as the total loss function of the model. Experiments show that the model achieves the best performance on the public Mongolian offline handwriting dataset MHW, with 66.22% and 63.97% accuracy on test sets I and II, respectively.
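    Each CTC branch maps per-frame predictions to a label sequence; the standard greedy CTC decoding rule (collapse repeated symbols, then remove blanks) can be sketched as:

```python
def ctc_collapse(path, blank="-"):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)
```

The blank symbol lets CTC emit genuinely repeated characters: a blank between two identical symbols prevents them from being merged.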
  • Ethnic Language Processing and Cross Language Processing
    YANG Zhenping, MAO Cunli, LEI Xiongli, GAO Shengxiang, LU Shan, ZHANG Yongbing
    2022, 36(10): 88-96.
    Cross-border national cultural entities are usually composed of domain words that describe national cultural characteristics. This paper proposes a cross-border national cultural entity recognition method using word set information obtained from a domain lexicon. Firstly, a cross-border national cultural domain lexicon is constructed to obtain the word set information. Secondly, the weights between the word set vectors are obtained through an attention mechanism, and positional encoding is adopted. Finally, the word set information is incorporated into the feature extraction layer to enhance domain entity boundary information and alleviate the loss of word information caused by using only character features. Experimental results show that the F1 value of the proposed method is improved by 2.71% compared with the baseline method.
  • Information Extraction and Text Mining
    ZHANG Hongkuan, SONG Hui, XU Bo, WANG Shuyi
    2022, 36(10): 97-106.
    Document-level event extraction aims at discovering events with their arguments and roles from texts. This paper proposes an end-to-end model for domain-specific document-level event extraction based on BERT. We feed the embeddings of event types and entity nodes into the subsequent layer for event argument and role identification, representing the relations among events, arguments, and roles to improve the accuracy of classifying multi-event arguments. Using the title and the embedding of the event quintuple, we identify principal and subordinate events and fuse elements across multiple events. Experimental results show that our model outperforms the baselines.
  • Information Extraction and Text Mining
    CHEN Min, Wu Fan, LI Peifeng, WANG Zhongqing, ZHU Qiaoming
    2022, 36(10): 107-115.
    Event argument extraction is usually formulated as a multi-classification or sequence labeling task that identifies argument mentions by entities in the sentence, where argument role categories are represented as vectors without considering their prior information. In fact, the semantics of an argument role category is closely related to the argument itself. Therefore, this paper proposes to regard argument extraction as machine reading comprehension: each argument role is described as a natural language question, and arguments are extracted by answering these questions based on the context. This method makes better use of the prior information in argument role categories, and its effectiveness is shown in experiments on the Chinese corpus of ACE 2005.
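    The reading-comprehension formulation can be sketched as: turn each argument role into a question, then pick the best answer span from the context. The question template and the span scores below are invented for illustration; a real system would obtain the scores from a QA model.

```python
def build_query(event_type, role):
    # hypothetical question template for an argument role
    return f"In this {event_type} event, who or what is the {role}?"

def best_span(tokens, start_scores, end_scores):
    """Return the context span (start <= end) maximising start + end score."""
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return tokens[span[0]:span[1] + 1]
```

Because the question verbalizes the role, its prior semantics ("attacker", "victim", "instrument") is available to the encoder rather than hidden behind a class index.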
  • Information Extraction and Text Mining
    LI Zhiqiang, GUO Yi, WANG Zhihong
    2022, 36(10): 116-125.
    Multi-label text classification assigns the most relevant labels to each document from a huge label set. This paper proposes a parameter-adaptive model under a multi-strategy attention mechanism (MSAPA) for multi-label text classification. The MSAPA model first extracts global and local keyword features with a self-attention mechanism and a label attention mechanism, respectively. It then adopts a multi-parameter adaptive strategy to dynamically assign weights to the two types of attention, so as to learn a better text representation for classification. Experiments on the AAPD and RCV1 benchmark datasets validate the superiority of the MSAPA model.
  • Sentiment Analysis and Social Computing
    ZHOU Min, WANG Zhongqing, LI Shoushan, ZHOU Guodang
    2022, 36(10): 126-134.
    The lack of annotated samples has become a major challenge for aspect-level sentiment classification. This paper proposes a combined multi-task pre-training BERT model to alleviate this issue. A large amount of unlabeled document-level sentiment classification data is employed to train a variety of classification tasks over a pre-trained model with shared parameters, so as to transfer the useful semantic and grammatical information shared between aspect-level and document-level comments. Experiments on the SemEval-14 dataset show that, compared with a series of baseline models, the proposed model effectively improves the accuracy of aspect-level sentiment classification.
  • Sentiment Analysis and Social Computing
    YAN Jinfeng, SHAO Xinhui
    2022, 36(10): 135-144.
    Aspect-level sentiment analysis is a fundamental subtask of fine-grained sentiment analysis that predicts the sentiment polarities of given aspects or entities in text. Semantic information, syntactic information, and their interactive information are crucial to aspect-level sentiment analysis. This paper proposes a CA-GCN model based on graph convolution and attention. The model is divided into two main parts. First, it integrates the rich feature representation obtained by CNN and Bi-LSTM with the aspect-oriented features obtained through graph convolution. Second, it applies two multi-head interactive attention modules to integrate the aspect, the context, and the features obtained by graph convolution, which are then fed into a multi-head self-attention module to learn the dependency relationships among words in the sentence. Compared with the ASGCN model, the accuracy of the model on the Twitter, Lap14, and Rest14 datasets is improved by 1.06%, 1.62%, and 0.95%, and the F1 score is improved by 1.07%, 2.60%, and 1.98%, respectively.
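    The multi-head interactive and self-attention modules mentioned above build on scaled dot-product attention; for a single query vector and a single head (toy dimensions, no learned projections), it reduces to:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query over lists of vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # softmax over the key scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # weighted sum of value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
```

A query strongly aligned with one key pulls the output almost entirely toward that key's value vector, which is how aspect words attend to their relevant context words.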
  • Sentiment Analysis and Social Computing
    WANG Jinghao, LIU Zhen, LIU Tingting , WANG Yuanyi, CHAI Yanjie
    2022, 36(10): 145-154.
    Existing methods for sentiment analysis in social media usually deal with single-modal data, failing to capture the relations between multimodal information. This paper proposes to treat the hierarchical structural relations between texts and images in social media as complementary, and designs a multi-level feature fusion attention network to capture both the ‘images-text’ and the ‘text-images’ relations to perceive users’ sentiments in social media. Experimental results on the Yelp and MultiZOL datasets show that this method can effectively improve sentiment classification accuracy for multimodal data.
  • Sentiment Analysis and Social Computing
    CHENG Xiaomin, CHEN Gang, CHEN Jianpeng, SHE Xiangrong, MAO Jian
    2022, 36(10): 155-166,172.
    The 12345 hotline has become a typical representation of the digital transformation of local governments in recent years. A reinforced-association-based event allocation (RAVA) method is proposed to address the low efficiency of manual allocation. Firstly, an event portrait is constructed to obtain event encoding vectors via a pointer-generator network. Then, an association reinforced attention (ARA) mechanism is used to capture the correlation within the concatenated 'event-three-responsibility' vector and decide the corresponding result. These results are concatenated with the department description vectors and input into a classifier. Finally, the candidate answers are re-ranked to decide the department to which a hotline event is ultimately allocated. Experimental results show that the RAVA method achieves better results than several baseline methods on the Wuhu 12345 hotline datasets.
  • Speech Processing
    LIANG Renfeng, YU Zhengtao, GAO Shengxiang, HUANG Yuxin, GUO Junjun , XU Shuli
    2022, 36(10): 167-172.
    In contrast to phonetic languages such as English and French, for which Automatic Speech Recognition (ASR) achieves good performance, Chinese is a logographic language whose written form has no direct association with its pronunciation. To exploit Pinyin, the phonetic symbol system for Chinese, we propose an ASR method that uses Pinyin as a decoding constraint within a multi-task learning framework. We introduce both Pinyin and Chinese character supervision signals to enhance the Chinese speech representation ability of the shared encoder, with Chinese-character-target ASR as the primary task and Pinyin-target ASR as the auxiliary task. Experiments show that the proposed model achieves better recognition results, with a 2.24% reduction in the word error rate (WER).
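    The reported metric, word error rate, is the edit distance between the reference and hypothesis token sequences divided by the reference length; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over tokens / len(ref)."""
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For Chinese ASR the "words" are usually characters, so one substituted character in a three-character reference gives a WER of 1/3.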