2022 Volume 36 Issue 12 Published: 17 January 2023
  

  • Survey
    ZHANG Xuan, LI Baobin
    2022, 36(12): 1-15.
    Social bots in microblog platforms significantly impact information dissemination and public opinion. This paper reviews recent research on social bot account detection in microblogs, especially Twitter and Weibo. Popular methods for data acquisition and feature extraction are reviewed. Various bot detection algorithms are summarized and evaluated, including approaches based on statistical methods, classical machine learning, and deep learning. Finally, directions for future research are suggested.
  • Language Analysis and Calculation
    YANG Jincai, CHEN Xuesong, HU Quan, CAI Xuxun
    2022, 36(12): 16-26.
    The compound sentence relation refers to the semantic relation between clauses. Among the current classification systems for compound sentences, the compound sentence trichotomy and HIT-CDTB are the most popular. Based on pre-trained language models such as ERNIE-Gram and TinyBERT, as well as PCA (principal component analysis), we propose a three-stage model to recognize compound sentence relations. Experiments reveal 77.60% accuracy for relation conversion from the compound sentence trichotomy to HIT-CDTB, and 89.17% vice versa.
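    As an illustration of the PCA step mentioned above (a minimal sketch with hypothetical dimensions, not the authors' implementation), pre-trained sentence embeddings can be reduced before classification:

    ```python
    import numpy as np

    def pca_reduce(X, k):
        """Project row vectors X (n_samples x dim) onto the top-k principal components."""
        Xc = X - X.mean(axis=0)                  # center each feature dimension
        # right singular vectors of the centered matrix are the principal axes
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                     # shape: (n_samples, k)

    # hypothetical 768-dim sentence embeddings reduced to 64 dims
    emb = np.random.rand(100, 768)
    reduced = pca_reduce(emb, 64)
    ```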
  • Language Analysis and Calculation
    XIONG Kai, DU Li, DING Xiao, LIU Ting, QIN Bing, FU Bo
    2022, 36(12): 27-35.
    Although pre-trained language models have achieved high performance on a large number of natural language processing tasks, the knowledge contained in some pre-trained language models is insufficient to support more efficient textual inference. Focusing on using rich knowledge to enhance the pre-trained language model for textual inference, we propose a textual inference framework that integrates knowledge graphs and graph structures into the pre-trained language model. Experiments on two subtasks of textual inference indicate that our framework outperforms a series of baseline methods.
  • Language Analysis and Calculation
    XIE Haihua, CHEN Zhiyou, CHENG Jing, LYU Xiaoqing, TANG Zhi
    2022, 36(12): 36-43.
    Due to the complexity of Chinese grammar and insufficient training data, Chinese grammar error diagnosis (CGED) is a challenging task without applicable approaches in practice. In this paper, we propose a CGED model, APM-CGED, with data augmentation, a pre-trained language model, and linguistic-feature-based multi-task learning. Data augmentation effectively expands the training set, and pre-trained language models are rich in semantic information helpful to grammatical analysis. Meanwhile, the linguistic-feature-based multi-task learning enables the language model to learn linguistic features useful for grammatical error diagnosis. The proposed method achieves better results on the CGED dataset than the compared models.
  • Language Resources Construction
    SUN Yuefan, YANG Liang, LIN Yuan, XU Kan, LIN Hongfei
    2022, 36(12): 44-51.
    This paper proposes a Chinese reading comprehension dataset, Restaurant (Res), for a specific field (the catering industry). The data are collected from the Dianping application, with user reviews from the catering industry. The annotators provide questions and annotate the answers according to the data. There are currently two versions of the Res dataset: Res_v1 contains only questions with answers in the user comments, and Res_v2 includes additional questions without answers in the comments. We apply the mainstream BiDAF, QANet and BERT models to the dataset, achieving accuracy as high as 73.78%, lagging far behind the human performance of 91.03%.
  • Language Resources Construction
    SONG Heng, CAO Cungen, WANG Ya, WANG Shi
    2022, 36(12): 52-66,73.
    Semantic roles play an important role in natural language understanding, but most existing semantic-role training datasets are relatively rough or even misleading in labeling semantic roles. To facilitate fine-grained semantic analysis, an improved taxonomy of Chinese semantic roles is proposed by investigating a real-world corpus. Focusing on a corpus formed of sentences with only one pivotal semantic role, we propose a semi-automatic method for fine-grained Chinese semantic role dataset construction. A corpus of 9,550 sentences has been labeled with 9,423 pivotal semantic roles, 29,142 principal peripheral semantic roles and 3,745 auxiliary peripheral semantic roles. Among them, 172 sentences are double-labeled with semantic roles and 104 sentences are labeled with semantic roles of uncertain semantic events. With a Bi-LSTM+CRF model, we compare the dataset against the Chinese Proposition Bank and reveal differences in the recognition of principal peripheral semantic roles, which provide clues for further improvement.
  • Knowledge Representation and Acquisition
    LI Wenhao, LIU Wenchang, SUN Maosong, YI Xiaoyuan
    2022, 36(12): 67-73.
    Existing Chinese knowledge graphs are derived from Wikipedia and Baidu Baike by leveraging the information in entity infoboxes and the categorical system. In contrast, this article proposes a Chinese knowledge graph with probabilistic links by treating the hyperlinks in these resources as entity relations, weighted by the TF-IDF value of the mention frequency of the target entity in the entry article of the source entity. A link screening algorithm is further designed to remove occasional links and make the knowledge graph more reliable. Based on the above methods, this article constructs a reliable Chinese knowledge graph with probabilistic associations named "Wenmai", which is publicly available on GitHub as a support for knowledge-guided natural language processing.
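    The link-weighting idea can be sketched as follows; a minimal illustration assuming tf is the mention frequency of the target entity within the source entry and idf is computed over all entries (the counts below are hypothetical):

    ```python
    import math

    def tfidf_link_weight(mention_count, total_terms, num_entries, entries_mentioning):
        """TF-IDF weight for a hyperlink from a source entry to a target entity."""
        tf = mention_count / total_terms                        # frequency in the source entry
        idf = math.log(num_entries / (1 + entries_mentioning))  # rarity across all entries
        return tf * idf

    # target entity mentioned 5 times in a 200-term entry; corpus of 10,000 entries,
    # 50 of which mention the target
    w = tfidf_link_weight(5, 200, 10_000, 50)
    ```

    Links whose weight falls below a screening threshold would then be treated as occasional and dropped.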
  • Knowledge Representation and Acquisition
    LIU Zhenguo, ZHU Yu, ZHAO Haixing, WANG Xiaoying, HUANG Jianqiang
    2022, 36(12): 74-84.
    In contrast to ordinary networks with only pairwise relationships between nodes, hypernetworks also contain complex tuple relationships (i.e., hyperedges) among nodes. However, most existing network representation learning methods cannot effectively capture such tuple relationships. To resolve this issue, a heterogeneous hypernetwork representation learning method with a translation constraint (HRTC) is proposed. First, the method combines clique expansion and star expansion to transform a heterogeneous hypernetwork, abstracted as a hypergraph, into a heterogeneous network abstracted as a 2-section graph plus an incidence graph. Second, a meta-path walk method aware of the semantic relevance of nodes (SRwalk) is proposed to capture semantic relationships between nodes. Finally, while the pairwise relationships between nodes are trained, the tuple relationships among nodes are captured by introducing the translation mechanism from knowledge representation learning. Experimental results show that on the link prediction task, the performance of the proposed method is close to that of the best baseline methods. On the hypernetwork reconstruction task, the proposed method outperforms the best baselines on the drug dataset when the hyperedge reconstruction ratio exceeds 0.6, and its average performance exceeds the best baselines by 16.24% on the GPS dataset.
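    Clique expansion and star expansion can be sketched on a toy hypergraph; a minimal illustration of the two transformations, not the HRTC implementation:

    ```python
    from itertools import combinations

    def clique_expansion(hyperedges):
        """2-section graph: each hyperedge becomes a clique over its member nodes."""
        edges = set()
        for he in hyperedges:
            edges.update(frozenset(p) for p in combinations(sorted(he), 2))
        return edges

    def star_expansion(hyperedges):
        """Incidence graph: each hyperedge becomes a new node linked to its members."""
        return {(f"e{i}", v) for i, he in enumerate(hyperedges) for v in sorted(he)}

    # toy hypergraph with two hyperedges
    hes = [{"a", "b", "c"}, {"c", "d"}]
    pairwise = clique_expansion(hes)   # 4 pairwise edges
    incidence = star_expansion(hes)    # 5 node-hyperedge links
    ```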
  • Ethnic Language Processing and Cross Language Processing
    AN Bo, LONG Congjun
    2022, 36(12): 85-93.
    Tibetan text classification is a fundamental task in Tibetan natural language processing. The current mainstream text classification approach is a large-scale pre-trained model plus fine-tuning. However, Tibetan lacks open-source large-scale corpora and pre-trained language models, so this approach cannot be verified on Tibetan text classification tasks. This paper crawls a large Tibetan text dataset to solve this problem and trains a Tibetan pre-trained language model (BERT-base-Tibetan) on it. Experimental results show that the pre-trained language model significantly improves the performance of Tibetan text classification (the F1 value increases by 9.3% on average), verifying the value of pre-trained language models for Tibetan text classification tasks.
  • Ethnic Language Processing and Cross Language Processing
    KONG Chunwei, LYU Xueqiang, ZHANG Le
    2022, 36(12): 94-103,114.
    Focusing on Tibetan news texts, this paper proposes a hybrid representation-based subjective and objective sentence classification model (HRTNSC). The input layer is enriched by fusing syllable-level and word-level features, and a BiLSTM+CNN network is applied to classify sentences as subjective or objective. The experimental results show that the HRTNSC model achieves an optimal F1 value of 90.84%, outperforming the benchmark models.
  • Information Extraction and Text Mining
    GUO Shiwei, MA Bo, MA Yupeng, YANG Yating
    2022, 36(12): 104-114.
    Due to the lack of global topic information, short text entity linking can rely only on local short text information and the knowledge base. This paper proposes the concept of a short text interaction graph (STIG) and a two-stage training strategy. BERT is used to extract multi-granularity features between the local short text and candidate entities, and a graph convolution mechanism is applied to the short text interaction graph. To alleviate the degradation of graph convolution caused by mean pooling, a method is further proposed to compress the node and edge features of the interaction graph into a dense vector. Experiments on the CCKS2020 entity linking dataset show the effectiveness of the proposed method.
  • Information Extraction and Text Mining
    LIN Zhi, LI Yuan, WANG Qinglin
    2022, 36(12): 115-122.
    Event extraction methods usually use the small-scale open-domain event extraction corpus of ACE 2005, which is difficult to apply deep learning to. A semi-supervised domain event argument extraction method is proposed to automatically annotate a cultural event corpus from the official websites of Chinese public libraries using templates and a domain dictionary. Manual annotation is then applied to ensure label accuracy. To resolve the problem of polysemy in the word embedding layer, an improved method using the BERT model and a positional character embedding layer is proposed for the BiLSTM-CRF model. Experiments demonstrate an F1 value of 84.9% for the proposed event argument extraction method, which is superior to classical event argument recognition methods.
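    Template-based automatic annotation of this kind can be sketched with a regular expression; the template and sentence below are hypothetical, not drawn from the paper's corpus:

    ```python
    import re

    # hypothetical announcement template: "《EVENT》将于TIME在PLACE举行"
    # ("EVENT will be held at PLACE on TIME")
    TEMPLATE = re.compile(r"《(?P<event>[^》]+)》将于(?P<time>\S+?日)在(?P<place>\S+?)举行")

    def auto_annotate(sentence):
        """Extract event arguments (event name, time, place) by template matching."""
        m = TEMPLATE.search(sentence)
        return m.groupdict() if m else None

    args = auto_annotate("《人工智能讲座》将于2023年1月5日在一楼报告厅举行")
    ```

    Sentences matched by such templates receive automatic argument labels, which are then checked manually as described above.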
  • Information Extraction and Text Mining
    TANG Yan, CHEN Yi, ZHANG Zuowei
    2022, 36(12): 123-132.
    Focusing on improving event evolutionary graph construction and enriching the event representation, this paper proposes an event prediction model based on the event evolutionary graph and the Graph Convolutional Network (GCN). The model applies an event extraction model and redefines the edge weights of the event evolutionary graph by combining frequency and mutual information. The representation of the event context is learned by a BiLSTM and a memory network, and is fed into the GCN under the guidance of the event evolutionary graph. The final event prediction is jointly completed by event-relationship-aware, context-aware, and neighbor-aware event embeddings. Experimental results on the Gigaword benchmark show that the proposed model outperforms six advanced models in event prediction accuracy, with a 5.55% increase over the latest SGNN method.
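    One plausible reading of "combining frequency and mutual information" into an edge weight, sketched with hypothetical co-occurrence counts (the exact combination used by the paper may differ):

    ```python
    import math

    def edge_weight(count_ab, count_a, count_b, total, alpha=0.5):
        """Combine co-occurrence frequency and pointwise mutual information
        into a single edge weight; alpha balances the two terms."""
        freq = count_ab / total                                  # relative co-occurrence frequency
        pmi = math.log((count_ab * total) / (count_a * count_b)) # pointwise mutual information
        return alpha * freq + (1 - alpha) * pmi

    # events A and B co-occur 30 times; A occurs 100 times, B 60 times,
    # out of 10,000 event pairs in total
    w = edge_weight(count_ab=30, count_a=100, count_b=60, total=10_000)
    ```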
  • Information Extraction and Text Mining
    YE Junyao, SU Jingyong, WANG Yaowei, XU Yong
    2022, 36(12): 133-138,148.
    Spaced repetition is a common mnemonic method in language learning. To decide proper review intervals for a desired memory effect, it is necessary to predict learners' long-term memory. This paper proposes a long-term memory prediction model for language learning via LSTM. We extract statistical features and sequence features from the memory behavior history of learners. The LSTM is used to learn the memory behavior sequence, and the half-life regression model is applied to predict the probability that foreign language learners recall words. On 9 billion pieces of real memory behavior data collected for evaluation, the sequence features prove more informative than the statistical features. Compared with state-of-the-art models, the error of the proposed LSTM-HLR model is significantly reduced, by 50%.
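    Half-life regression models recall probability as an exponential decay in the time since the last review; a minimal sketch of the standard formula p = 2^(-Δ/h), where the half-life h would here be estimated by the LSTM:

    ```python
    def recall_probability(delta_days, half_life_days):
        """Half-life regression: probability of recalling an item delta_days
        after the last review, given an estimated memory half-life."""
        return 2 ** (-delta_days / half_life_days)

    # with a 7-day half-life, recall probability is 0.5 after exactly 7 days
    p = recall_probability(7, 7)
    ```

    Review intervals can then be scheduled by solving for the delay at which p drops to a target recall level.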
  • Information Extraction and Text Mining
    LUO Yixiong, LYU Xueqiang, YOU Xindong
    2022, 36(12): 139-148.
    Patent efficacy is one of the key pieces of information in patent text. To identify patent efficacy phrases, a multi-feature approach is proposed that combines character-level and word-level features. The character-level features include characters, character pinyin, and character wubi; the word-level features correspond to the collection of words containing those characters. Character-level features are vectorized by word2vec or BERT, and an attention mechanism is used to fuse the word-level feature vectors in the input sequence. All feature vectors are concatenated as the input of a BiLSTM (or Transformer)+CRF. Experiments on new energy vehicle patents demonstrate that the best F1 value of 91.15% is achieved by BiLSTM+CRF with the combination of word2vec character vectors, BERT character vectors, wubi feature vectors and word feature vectors.
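    The per-character feature concatenation can be sketched as follows, with hypothetical dimensions and a simple attention-weighted fusion of word vectors standing in for the paper's attention mechanism:

    ```python
    import numpy as np

    def build_char_input(char_vec, pinyin_vec, wubi_vec, word_vecs, attn_weights):
        """Concatenate character-level features with an attention-weighted
        fusion of the vectors of words containing that character."""
        attn = np.asarray(attn_weights, dtype=float)
        attn = attn / attn.sum()                   # normalize attention weights
        word_feat = attn @ np.asarray(word_vecs)   # weighted sum of word vectors
        return np.concatenate([char_vec, pinyin_vec, wubi_vec, word_feat])

    # hypothetical dims: 100-d char, 30-d pinyin, 30-d wubi, three 50-d word vectors
    x = build_char_input(np.zeros(100), np.zeros(30), np.zeros(30),
                         np.ones((3, 50)), [0.2, 0.5, 0.3])
    ```

    The resulting per-character vectors form the input sequence to the BiLSTM (or Transformer)+CRF tagger.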
  • Sentiment Analysis and Social Computing
    TAN Xizi, ZHU Suyang, LI Shoushan, ZHOU Guodong
    2022, 36(12): 149-158.
    In recent years, emotion analysis has experienced rapid development. As one of its tasks, emotion regression is more general and less affected by the classification taxonomy, though it lacks sufficient corpora. In this paper, we propose a multi-dimensional emotion regression method that uses dimension-label information to predict the scores of an input text in three dimensions (Valence, Arousal, Dominance). The method conducts emotion regression via the probabilities of emotion classification predictions, with an objective that maximizes the distance between two texts with different emotion labels. Experimental results on EMOBANK show that the proposed method achieves significant improvement in mean square error and Pearson correlation coefficient, especially in the Valence and Arousal dimensions.
  • Sentiment Analysis and Social Computing
    ZENG Biqing, XU Mayi, YANG Jianhao, PEI Fenghua, GAN Zibang, DING Meirong, CHENG Lianglun
    2022, 36(12): 159-172.
    Aspect-level sentiment classification aims to analyze the sentiment polarity of different aspect words in a sentence. To realize aspect-word-aware contextual representations, this paper proposes a double-channel semantic difference network (DCSDN) based on the notion of semantic difference. The DCSDN captures the contextual feature information of different aspects in the same text with its double-channel architecture, and extracts the semantic features of the texts in both channels via a semantic extraction network. It employs semantic difference attention to enhance the attention to key information. Experiments on the Laptop and Restaurant datasets (SemEval 2014) and the Twitter dataset (ACL) demonstrate accuracies of 81.35%, 86.34% and 78.18%, respectively.
  • Sentiment Analysis and Social Computing
    CHEN Chen, ZHOU Xiabing, WANG Zhongqing, ZHANG Min
    2022, 36(12): 173-181.
    Dialogue sentiment analysis aims to classify the sentiment of each sentence in a dialogue, considering both each speaker's personal emotion and the emotion transmission between speakers. To model this with the Transformer, this paper proposes a multi-party attention mechanism to better model the interaction between different speakers and simulate dialogue scenes. Experiments show that, compared with other SOTA models, the Dialogue Transformer has a simpler implementation, faster running speed, and a significantly increased Weighted-F1 value.