ISSN 1003-0077   CN 11-2325/N   CODEN ZXXHAU  
2022 Vol. 36, No. 12
Published: 2023-01-17

Information Extraction and Text Mining
Language Resources Construction
Sentiment Analysis and Social Computing
Language Analysis and Calculation
Knowledge Representation and Acquisition
Ethnic Language Processing and Cross Language Processing
1 Social Bot Account Detection on Microblog: A Survey
ZHANG Xuan, LI Baobin
Social bots in microblog platforms significantly impact information dissemination and public opinion stance. This paper reviews recent research on social bot account detection in microblogs, especially Twitter and Weibo. Popular methods for data acquisition and feature extraction are reviewed. Various bot detection algorithms are summarized and evaluated, including approaches based on statistical methods, classical machine learning, and deep learning. Finally, suggestions for future research are offered.
2022 Vol. 36 (12): 1-15 [Abstract] ( 226 ) HTML (1 KB)  PDF  (4493 KB)  ( 335 )
       Language Analysis and Calculation
16 Compound Sentence Relation Conversion Based on ERNIE-Gram and TinyBERT
YANG Jincai, CHEN Xuesong, HU Quan, CAI Xuxun
The compound sentence relation refers to the semantic relation between clauses. Among the current classification systems for compound sentences, the compound sentence trichotomy and HIT-CDTB are the most popular. Based on the pre-trained language models ERNIE-Gram and TinyBERT, as well as PCA (principal component analysis), we propose a three-stage model to recognize compound sentence relations. Experiments reveal 77.60% accuracy for relation conversion from the compound sentence trichotomy to HIT-CDTB, and 89.17% vice versa.
2022 Vol. 36 (12): 16-26 [Abstract] ( 141 ) HTML (1 KB)  PDF  (8500 KB)  ( 191 )
27 Knowledge Enhanced Pre-trained Language Model for Textual Inference
XIONG Kai , DU Li, DING Xiao , LIU Ting, QIN Bing, FU Bo
Although pre-trained language models have achieved high performance on a large number of natural language processing tasks, the knowledge contained in pre-trained language models can hardly support more efficient textual inference. Focused on using a wealth of knowledge to enhance the pre-trained language model for textual inference, we propose a textual inference framework that integrates knowledge graphs and graph structures into the pre-trained language model. Experiments on two subtasks of textual inference indicate that our framework outperforms a series of baseline methods.
2022 Vol. 36 (12): 27-35 [Abstract] ( 148 ) HTML (1 KB)  PDF  (2149 KB)  ( 271 )
36 Chinese Grammar Error Detection Based on Data Enhancement and Multi-task Learning
XIE Haihua , CHEN Zhiyou , CHENG Jing , LYU Xiaoqing , TANG Zhi
Due to the complexity of Chinese grammar and insufficient training data, Chinese grammar error diagnosis (CGED) is a challenging task without applicable approaches in practice. In this paper, we propose a CGED model, APM-CGED, combining data augmentation, a pre-trained language model and linguistic-feature-based multi-task learning. Data augmentation effectively expands the training set, and pre-trained language models are rich in semantic information helpful for grammatical analysis. Meanwhile, the linguistic-feature-based multi-task learning enables the language model to learn linguistic features useful for grammatical error diagnosis. The proposed method achieves better results on the CGED dataset than the compared models.
2022 Vol. 36 (12): 36-43 [Abstract] ( 106 ) HTML (1 KB)  PDF  (4967 KB)  ( 188 )
       Language Resources Construction
44 A Domain Specific Chinese Reading Comprehension Data Set
SUN Yuefan, YANG Liang, LIN Yuan, XU Kan, LIN Hongfei
This paper proposes a Chinese reading comprehension dataset, Restaurant (Res), for a specific domain (the catering industry). The data are collected from the Dianping application, consisting of user reviews in the catering industry. Annotators provide questions and annotate the answers according to the data. There are currently two versions of the Res dataset: Res_v1 contains only questions whose answers appear in the user comments, while Res_v2 adds questions without answers in the comments. We apply the mainstream BiDAF, QANet and BERT models to the dataset, achieving at most 73.78% accuracy, lagging far behind the human performance of 91.03%.
2022 Vol. 36 (12): 44-51 [Abstract] ( 99 ) HTML (1 KB)  PDF  (3362 KB)  ( 158 )
52 Construction of a Finely-Grained Training Dataset for Chinese Semantic-Role Labeling
SONG Heng, CAO Cungen, WANG Ya , WANG Shi
Semantic roles play an important role in natural language understanding, but most existing semantic-role training datasets are relatively coarse or even misleading in labeling semantic roles. To facilitate fine-grained semantic analysis, an improved taxonomy of Chinese semantic roles is proposed by investigating a real-world corpus. Focusing on a corpus of sentences with only one pivotal semantic role, we propose a semi-automatic method for constructing a fine-grained Chinese semantic role dataset. A corpus of 9,550 sentences has been labeled with 9,423 pivotal semantic roles, 29,142 principal peripheral semantic roles and 3,745 auxiliary peripheral semantic roles. Among them, 172 sentences are double-labeled with semantic roles and 104 sentences are labeled with semantic roles of uncertain semantic events. With a Bi-LSTM+CRF model, we compare the dataset against the Chinese Proposition Bank and reveal differences in the recognition of principal peripheral semantic roles, providing clues for further improvement.
2022 Vol. 36 (12): 52-66,73 [Abstract] ( 90 ) HTML (1 KB)  PDF  (4689 KB)  ( 166 )
       Knowledge Representation and Acquisition
67 Wenmai—A Probabilistic-Like Association Reliable Chinese Knowledge Graph
LI Wenhao, LIU Wenchang, SUN Maosong, YI Xiaoyuan
Existing Chinese knowledge graphs are derived from Wikipedia and Baidu Baike by leveraging the information in entity infoboxes and the categorical system. Differently, this article proposes a Chinese knowledge graph with probabilistic links by treating the hyperlinks in these resources as entity relations, weighted by the TF-IDF value of the mention frequency of the target entity in the entry article of the source entity. A reliable-link screening algorithm is further designed to remove occasional links and make the knowledge graph more reliable. Based on the above methods, this article constructs a probabilistic-like association reliable Chinese knowledge graph named "Wenmai", which is publicly available on GitHub as a support for knowledge-guided natural language processing.
2022 Vol. 36 (12): 67-73 [Abstract] ( 93 ) HTML (1 KB)  PDF  (2680 KB)  ( 188 )
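The Wenmai abstract above weights each hyperlink by the TF-IDF of the target entity's mentions in the source entity's entry article. The paper's exact formula is not reproduced here; the following is a minimal illustrative sketch, assuming a hypothetical input that maps each entity name to the list of entity mentions found in its entry article:

```python
import math
from collections import Counter

def link_weights(entry_articles):
    """Weight each hyperlink (source -> target) by the TF-IDF of the target
    entity's mentions inside the source entity's entry article.

    `entry_articles` maps an entity name to the list of entity mentions
    (hyperlink anchors) found in that entity's entry article.
    """
    n_docs = len(entry_articles)
    # Document frequency: in how many entry articles each entity is mentioned.
    df = Counter()
    for mentions in entry_articles.values():
        df.update(set(mentions))
    weights = {}
    for source, mentions in entry_articles.items():
        tf = Counter(mentions)  # term frequency of each target in this article
        for target, count in tf.items():
            if target == source:
                continue  # skip self-links
            weights[(source, target)] = count * math.log(n_docs / df[target])
    return weights
```

A frequently mentioned but ubiquitous entity thus gets a low weight, while an entity mentioned often in one specific entry article gets a high one, which is the intuition behind using TF-IDF to separate reliable links from occasional ones.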
74 Heterogeneous Hypernetwork Representation Learning with the Translation Constraint
LIU Zhenguo, ZHU Yu, ZHAO Haixing, WANG Xiaoying, HUANG Jianqiang
In contrast to ordinary networks with only pairwise relationships between nodes, hypernetworks also contain complex tuple relationships (i.e., hyperedges) among nodes. However, most existing network representation learning methods cannot effectively capture such tuple relationships. To resolve this issue, a heterogeneous hypernetwork representation learning method with a translation constraint (HRTC) is proposed. Firstly, the method combines clique expansion and star expansion to transform a heterogeneous hypernetwork, abstracted as a hypergraph, into a heterogeneous network abstracted as a 2-section graph plus an incidence graph. Secondly, a meta-path walk method aware of the semantic relevance of nodes (SRwalk) is proposed to capture semantic relationships between nodes. Finally, while the pairwise relationships between nodes are trained, the tuple relationships among nodes are captured by introducing the translation mechanism from knowledge representation learning. Experimental results show that on the link prediction task, the proposed method performs close to the best baseline methods; on the hypernetwork reconstruction task, it outperforms the best baselines on the drug dataset when the hyperedge reconstruction ratio exceeds 0.6, and its average performance exceeds the best baselines by 16.24% on the GPS dataset.
2022 Vol. 36 (12): 74-84 [Abstract] ( 91 ) HTML (1 KB)  PDF  (2609 KB)  ( 172 )
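The first step of HRTC described above, transforming a hypergraph into a 2-section graph plus an incidence graph, relies on the standard clique and star expansions. A generic sketch (not the paper's code), with hyperedges given as lists of node names:

```python
from itertools import combinations

def clique_expansion(hyperedges):
    """2-section graph: connect every pair of nodes that share a hyperedge."""
    edges = set()
    for he in hyperedges:
        edges.update(combinations(sorted(he), 2))
    return edges

def star_expansion(hyperedges):
    """Incidence graph: one fresh node per hyperedge, linked to its members.
    The fresh node names ("e0", "e1", ...) are arbitrary placeholders."""
    return {(f"e{i}", node) for i, he in enumerate(hyperedges) for node in he}
```

Clique expansion preserves pairwise co-occurrence but loses which tuple a pair came from; star expansion keeps the tuple membership but introduces auxiliary nodes. Combining both, as HRTC does, retains both kinds of information.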
       Ethnic Language Processing and Cross Language Processing
85 Pre-trained Language Model Based Tibetan Text Classification
AN Bo, LONG Congjun
Tibetan text classification is a fundamental task in Tibetan natural language processing. The current mainstream text classification approach is a large-scale pre-trained model plus fine-tuning. However, Tibetan lacks an open-source large-scale corpus and pre-trained language model, so this approach cannot be verified on the Tibetan text classification task. This paper crawls a large Tibetan text dataset to solve this problem and trains a Tibetan pre-trained language model (BERT-base-Tibetan) on this dataset. Experimental results show that the pre-trained language model significantly improves the performance of Tibetan text classification (the F1 value increases by 9.3% on average), verifying the value of pre-trained language models for Tibetan text classification tasks.
2022 Vol. 36 (12): 85-93 [Abstract] ( 94 ) HTML (1 KB)  PDF  (1390 KB)  ( 197 )
94 HRTNSC: Hybrid Representation-based Subjective and Objective Sentence Classification for Tibetan News
KONG Chunwei, LYU Xueqiang, ZHANG Le
Focused on Tibetan news texts, this paper proposes a hybrid representation-based subjective and objective sentence classification model (HRTNSC). The input layer is enriched by fusing syllable-level features and word-level features. The BiLSTM+CNN network is applied to the subjective and objective classification of sentences. The experimental results show that the HRTNSC model achieves an optimal F1 value of 90.84%, which is better than the benchmark model.
2022 Vol. 36 (12): 94-103,114 [Abstract] ( 63 ) HTML (1 KB)  PDF  (2056 KB)  ( 175 )
       Information Extraction and Text Mining
104 Chinese Short Text Entity Linking Based on BERT and GCN
GUO Shiwei, MA Bo, MA Yupeng, YANG Yating
Short text entity linking can rely only on local short text information and the knowledge base, due to the lack of global topic information. This paper proposes the concept of a short text interaction graph (STIG) and a two-stage training strategy. BERT is used to extract multi-granularity features between the local short text and candidate entities, and a graph convolution mechanism is applied to the short text interaction graph. To alleviate the degradation of graph convolution caused by mean pooling, a method is further proposed to compress the node and edge features of the interaction graph into a dense vector. Experiments on the CCKS2020 entity linking dataset show the effectiveness of the proposed method.
2022 Vol. 36 (12): 104-114 [Abstract] ( 100 ) HTML (1 KB)  PDF  (3453 KB)  ( 209 )
115 An Improved Argument Extraction Method for Cultural Events Based on BERT
LIN Zhi, LI Yuan, WANG Qinglin
Event extraction methods usually use the small-scale open-domain event extraction corpus of ACE 2005, which is insufficient for deep learning. A semi-supervised domain event argument extraction method is proposed to automatically annotate a cultural event corpus from the official websites of Chinese public libraries by using templates and a domain dictionary; manual annotation is then applied to ensure label accuracy. To resolve the problem of polysemy in the word embedding layer, an improved method using the BERT model and a positional character embedding layer is proposed for the BiLSTM-CRF model. Experiments demonstrate an F1 value of 84.9% for the proposed event argument extraction method, which is superior to classical event argument recognition methods.
2022 Vol. 36 (12): 115-122 [Abstract] ( 104 ) HTML (1 KB)  PDF  (1916 KB)  ( 189 )
123 Event Prediction Based on Event Evolutionary Graph and GCN
Focusing on improving evolutionary graph construction and enriching the event representation, this paper proposes an event prediction model based on the event evolutionary graph and a Graph Convolutional Network (GCN). The model applies an event extraction model and redefines the edge weights of the event evolutionary graph by combining frequency and mutual information. The representation of the event context is learned by a BiLSTM and a memory network, and fed as input into the GCN under the guidance of the event evolutionary graph. The final event prediction is jointly completed by the resulting event-relationship-aware, context-aware, and neighbor-aware event embeddings. Experimental results on the Gigaword benchmark show that the proposed model outperforms six advanced models in event prediction accuracy, with a 5.55% increase over the latest SGNN method.
2022 Vol. 36 (12): 123-132 [Abstract] ( 88 ) HTML (1 KB)  PDF  (3975 KB)  ( 165 )
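The abstract above redefines edge weights by combining frequency and mutual information, without specifying the exact combination here. A minimal sketch, assuming a simple linear mix of relative co-occurrence frequency and pointwise mutual information over directed event pairs (the mixing weight `alpha` is a hypothetical parameter, not from the paper):

```python
import math
from collections import Counter

def edge_weights(event_pairs, alpha=0.5):
    """Weight directed edges (a -> b) of an event evolutionary graph by a
    linear mix of relative co-occurrence frequency and PMI."""
    n = len(event_pairs)
    pair_count = Counter(event_pairs)
    event_count = Counter()
    for a, b in event_pairs:
        event_count[a] += 1
        event_count[b] += 1
    weights = {}
    for (a, b), c in pair_count.items():
        freq = c / n                        # relative co-occurrence frequency
        p_ab = c / n
        p_a = event_count[a] / (2 * n)      # each pair contributes two mentions
        p_b = event_count[b] / (2 * n)
        pmi = math.log(p_ab / (p_a * p_b))  # pointwise mutual information
        weights[(a, b)] = alpha * freq + (1 - alpha) * pmi
    return weights
```

Frequency alone favors edges between very common events; PMI corrects for that by rewarding pairs that co-occur more often than chance, which is why the two signals are complementary.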
133 A Long-term Memory Prediction Model for Language Learning via LSTM
YE Junyao, SU Jingyong, WANG Yaowei, XU Yong
Spaced repetition is a common mnemonic method in language learning. In order to set proper review intervals for a desired memory effect, it is necessary to predict learners' long-term memory. This paper proposes a long-term memory prediction model for language learning via LSTM. We extract statistical features and sequence features from learners' memory behavior history. The LSTM is used to learn the memory behavior sequence, and the half-life regression model is applied to predict the probability that a foreign language learner recalls a word. Evaluated on 9 billion real memory behavior records, the sequence features prove more informative than the statistical features. Compared with state-of-the-art models, the error of the proposed LSTM-HLR model is reduced by 50%.
2022 Vol. 36 (12): 133-138,148 [Abstract] ( 125 ) HTML (1 KB)  PDF  (3022 KB)  ( 249 )
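Half-life regression, mentioned above, models recall probability as an exponential decay in the time elapsed since the last review; the learned features (here, those produced by the LSTM) predict the half-life itself. A minimal sketch of the decay formula only (the model's actual feature-to-half-life mapping is not shown):

```python
def recall_probability(delta_days, half_life_days):
    """Half-life regression decay: the probability of recalling an item
    `delta_days` after the last review, given a predicted memory half-life,
    is p = 2 ** (-delta / h)."""
    return 2.0 ** (-delta_days / half_life_days)
```

By construction, recall probability is 1.0 immediately after a review and 0.5 after exactly one half-life, so scheduling the next review at the predicted half-life targets a 50% recall threshold.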
139 Patent Efficacy Phrase Recognition Based on Multiple Features
LUO Yixiong, LYU Xueqiang, YOU Xindong
Patent efficacy is one of the key pieces of information in a patent text. To identify patent efficacy phrases, a multi-feature approach is proposed to combine character-level and word-level features. The character-level features include the characters themselves, their pinyin, and their Wubi codes; the word-level features correspond to the collection of words containing those characters. Character-level features are vectorized by word2vec or BERT, and an attention mechanism is used to fuse the word-level feature vectors in the input sequence. All feature vectors are concatenated as the input of a BiLSTM (or Transformer) + CRF. Experiments on new energy vehicle patents demonstrate that the best F1 value of 91.15% is achieved by BiLSTM+CRF with the combination of word2vec character vectors, BERT character vectors, Wubi feature vectors and word feature vectors.
2022 Vol. 36 (12): 139-148 [Abstract] ( 71 ) HTML (1 KB)  PDF  (2110 KB)  ( 133 )
       Sentiment Analysis and Social Computing
149 Multi-dimensional Emotion Regression via Dimension-Label Information
TAN Xizi, ZHU Suyang, LI Shoushan, ZHOU Guodong
In recent years, emotion analysis has developed rapidly. As one of its tasks, emotion regression is more general and less affected by the classification taxonomy, though it lacks sufficient corpora. In this paper, we propose a multi-dimensional emotion regression method via dimension-label information to predict text scores in three dimensions (Valence, Arousal, Dominance). The method conducts emotion regression through the probabilities of emotion classification predictions, with an objective that maximizes the distance between two texts with different emotion labels. Experimental results on EMOBANK show that the proposed method achieves significant improvements in mean square error and Pearson correlation coefficient, especially in the Valence and Arousal dimensions.
2022 Vol. 36 (12): 149-158 [Abstract] ( 86 ) HTML (1 KB)  PDF  (1661 KB)  ( 150 )
159 Aspect-level Sentiment Classification Based on Double Channel Semantic Difference Network
ZENG Biqing, XU Mayi, YANG Jianhao, PEI Fenghua, GAN Zibang, DING Meirong, CHENG Lianglun
Aspect-level sentiment classification aims to analyze the sentiment polarity of different aspect words in a sentence. To realize aspect-word-aware contextual representations, this paper proposes a double-channel semantic difference network (DCSDN) based on the notion of semantic difference. The DCSDN captures the contextual feature information of different aspects in the same text with a double-channel architecture, and extracts the semantic features of the texts in the two channels via a semantic extraction network. It employs semantic difference attention to enhance the attention to key information. Experiments on the Laptop and Restaurant datasets (SemEval 2014) and the Twitter dataset (ACL) demonstrate accuracies of 81.35%, 86.34% and 78.18%, respectively.
2022 Vol. 36 (12): 159-172 [Abstract] ( 93 ) HTML (1 KB)  PDF  (10749 KB)  ( 134 )
173 Dialog Sentiment Analysis with Multi-party Attention
CHEN Chen, ZHOU Xiabing, WANG Zhongqing, ZHANG Min
Dialog sentiment analysis aims to classify the sentiment of each sentence in a dialogue, considering both the speaker's personal emotion and the emotion transmission between speakers. To model this with a Transformer, this paper proposes a multi-party attention mechanism to better model the interaction between different speakers and simulate dialogue scenes. Experiments show that, compared with other state-of-the-art models, the Dialogue Transformer has a simpler implementation, faster running speed, and a significantly higher Weighted-F1 value.
2022 Vol. 36 (12): 173-181 [Abstract] ( 102 ) HTML (1 KB)  PDF  (2613 KB)  ( 194 )