2022 Volume 36 Issue 12 Published: 17 January 2023
  

  • Survey
    ZHANG Xuan, LI Baobin
    2022, 36(12): 1-15.
    Social bots in microblog platforms significantly impact information dissemination and public opinion. This paper reviews recent research on social bot account detection in microblogs, especially Twitter and Weibo. Popular methods for data acquisition and feature extraction are reviewed. Various bot detection algorithms are summarized and evaluated, including approaches based on statistical methods, classical machine learning, and deep learning. Finally, directions for future research are suggested.
  • Language Analysis and Calculation
    YANG Jincai, CHEN Xuesong, HU Quan, CAI Xuxun
    2022, 36(12): 16-26.
    The compound sentence relation refers to the semantic relation between clauses. Among the current classification systems for compound sentences, the compound sentence trichotomy and HIT-CDTB are the most popular. Based on pre-trained language models such as ERNIE-Gram and TinyBERT, as well as PCA (principal component analysis), we propose a three-stage model to recognize compound sentence relations. Experiments reveal 77.60% accuracy for relation conversion from the compound sentence trichotomy to HIT-CDTB, and 89.17% vice versa.
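    As an illustration of the PCA step mentioned above (a minimal sketch with hypothetical dimensions, not the authors' implementation), pre-trained sentence embeddings can be reduced before classification:

    ```python
    import numpy as np

    def pca_reduce(X, k):
        """Project row vectors X (n_samples x dim) onto the top-k principal components."""
        Xc = X - X.mean(axis=0)                  # center each feature dimension
        # right singular vectors of the centered matrix are the principal axes
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                     # shape: (n_samples, k)

    # hypothetical 768-dim sentence embeddings reduced to 64 dims
    emb = np.random.rand(100, 768)
    reduced = pca_reduce(emb, 64)
    ```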
  • Language Analysis and Calculation
    XIONG Kai, DU Li, DING Xiao, LIU Ting, QIN Bing, FU Bo
    2022, 36(12): 27-35.
    Although pre-trained language models have achieved high performance on a large number of natural language processing tasks, the knowledge contained in some pre-trained language models is insufficient to support more efficient textual inference. Focusing on using rich knowledge to enhance the pre-trained language model for textual inference, we propose a textual inference framework that integrates knowledge graphs and graph structures into the pre-trained language model. Experiments on two subtasks of textual inference indicate that our framework outperforms a series of baseline methods.
  • Language Analysis and Calculation
    XIE Haihua, CHEN Zhiyou, CHENG Jing, LYU Xiaoqing, TANG Zhi
    2022, 36(12): 36-43.
    Due to the complexity of Chinese grammar and insufficient training data, Chinese grammar error diagnosis (CGED) is a challenging task without applicable approaches in practice. In this paper, we propose a CGED model, APM-CGED, with data augmentation, a pre-trained language model, and linguistic-feature-based multi-task learning. Data augmentation effectively expands the training set, and pre-trained language models are rich in semantic information helpful to grammatical analysis. Meanwhile, the linguistic-feature-based multi-task learning enables the language model to learn linguistic features useful for grammatical error diagnosis. The proposed method achieves better results on the CGED dataset than the compared models.
  • Language Resources Construction
    SUN Yuefan, YANG Liang, LIN Yuan, XU Kan, LIN Hongfei
    2022, 36(12): 44-51.
    This paper proposes a Chinese reading comprehension dataset, Restaurant (Res), for a specific field (the catering industry). The data are collected from the Dianping application, with user reviews from the catering industry. The annotators provide questions and annotate the answers according to the data. There are currently two versions of the Res dataset: Res_v1 contains only questions with answers in the user comments, and Res_v2 includes additional questions without answers in the comments. We apply the mainstream BiDAF, QANet and BERT models to the dataset, achieving accuracy as high as 73.78%, lagging far behind the human performance of 91.03%.
  • Language Resources Construction
    SONG Heng, CAO Cungen, WANG Ya, WANG Shi
    2022, 36(12): 52-66,73.
    Semantic roles play an important role in natural language understanding, but most existing semantic-role training datasets are relatively rough or even misleading in labeling semantic roles. To facilitate fine-grained semantic analysis, an improved taxonomy of Chinese semantic roles is proposed by investigating a real-world corpus. Focusing on a corpus formed of sentences with only one pivotal semantic role, we propose a semi-automatic method for fine-grained Chinese semantic role dataset construction. A corpus of 9,550 sentences has been labeled with 9,423 pivotal semantic roles, 29,142 principal peripheral semantic roles and 3,745 auxiliary peripheral semantic roles. Among them, 172 sentences are double-labeled with semantic roles and 104 sentences are labeled with semantic roles of uncertain semantic events. With a Bi-LSTM+CRF model, we compare the dataset against the Chinese Proposition Bank and reveal differences in the recognition of principal peripheral semantic roles, which provide clues for further improvement.
  • Knowledge Representation and Acquisition
    LI Wenhao, LIU Wenchang, SUN Maosong, YI Xiaoyuan
    2022, 36(12): 67-73.
    Existing Chinese knowledge graphs are derived from Wikipedia and Baidu Baike by leveraging the information in entity infoboxes and the categorical system. In contrast, this article proposes a Chinese knowledge graph with probabilistic links by treating the hyperlinks in these resources as entity relations, weighted by the TF-IDF value of the mention frequency of the target entity in the entry article of the source entity. A link screening algorithm is further designed to remove occasional links and make the knowledge graph more reliable. Based on the above methods, this article constructs a reliable Chinese knowledge graph with probabilistic associations named "Wenmai", which is publicly available on GitHub as a support for knowledge-guided natural language processing.
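    The link-weighting idea can be sketched as follows; a minimal illustration assuming tf is the mention frequency of the target entity within the source entry and idf is computed over all entries (the counts below are hypothetical):

    ```python
    import math

    def tfidf_link_weight(mention_count, total_terms, num_entries, entries_mentioning):
        """TF-IDF weight for a hyperlink from a source entry to a target entity."""
        tf = mention_count / total_terms                        # frequency in the source entry
        idf = math.log(num_entries / (1 + entries_mentioning))  # rarity across all entries
        return tf * idf

    # target entity mentioned 5 times in a 200-term entry; corpus of 10,000 entries,
    # 50 of which mention the target
    w = tfidf_link_weight(5, 200, 10_000, 50)
    ```

    Links whose weight falls below a screening threshold would then be treated as occasional and dropped.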
  • Knowledge Representation and Acquisition
    LIU Zhenguo, ZHU Yu, ZHAO Haixing, WANG Xiaoying, HUANG Jianqiang
    2022, 36(12): 74-84.
    In contrast to ordinary networks with only pairwise relationships between nodes, hypernetworks also contain complex tuple relationships (i.e., hyperedges) among nodes. However, most existing network representation learning methods cannot effectively capture such tuple relationships. To resolve this issue, a heterogeneous hypernetwork representation learning method with a translation constraint (HRTC) is proposed. First, the method combines clique expansion and star expansion to transform a heterogeneous hypernetwork, abstracted as a hypergraph, into a heterogeneous network abstracted as a 2-section graph plus an incidence graph. Second, a meta-path walk method aware of the semantic relevance of nodes (SRwalk) is proposed to capture semantic relationships between nodes. Finally, while the pairwise relationships between nodes are trained, the tuple relationships among nodes are captured by introducing the translation mechanism from knowledge representation learning. Experimental results show that on the link prediction task, the performance of the proposed method is close to that of the best baseline methods. On the hypernetwork reconstruction task, the proposed method outperforms the best baselines on the drug dataset when the hyperedge reconstruction ratio exceeds 0.6, and its average performance exceeds the best baselines by 16.24% on the GPS dataset.
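    Clique expansion and star expansion can be sketched on a toy hypergraph; a minimal illustration of the two transformations, not the HRTC implementation:

    ```python
    from itertools import combinations

    def clique_expansion(hyperedges):
        """2-section graph: each hyperedge becomes a clique over its member nodes."""
        edges = set()
        for he in hyperedges:
            edges.update(frozenset(p) for p in combinations(sorted(he), 2))
        return edges

    def star_expansion(hyperedges):
        """Incidence graph: each hyperedge becomes a new node linked to its members."""
        return {(f"e{i}", v) for i, he in enumerate(hyperedges) for v in sorted(he)}

    # toy hypergraph with two hyperedges
    hes = [{"a", "b", "c"}, {"c", "d"}]
    pairwise = clique_expansion(hes)   # 4 pairwise edges
    incidence = star_expansion(hes)    # 5 node-hyperedge links
    ```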
  • Ethnic Language Processing and Cross Language Processing
    AN Bo, LONG Congjun
    2022, 36(12): 85-93.
    Tibetan text classification is a fundamental task in Tibetan natural language processing. The current mainstream text classification approach is a large-scale pre-trained model plus fine-tuning. However, Tibetan lacks open-source large-scale corpora and pre-trained language models, so this approach cannot be verified on Tibetan text classification tasks. This paper crawls a large Tibetan text dataset to solve this problem and trains a Tibetan pre-trained language model (BERT-base-Tibetan) on it. Experimental results show that the pre-trained language model significantly improves the performance of Tibetan text classification (the F1 value increases by 9.3% on average), verifying the value of pre-trained language models for Tibetan text classification tasks.
  • Ethnic Language Processing and Cross Language Processing
    KONG Chunwei, LYU Xueqiang, ZHANG Le
    2022, 36(12): 94-103,114.
    Focusing on Tibetan news texts, this paper proposes a hybrid representation-based subjective and objective sentence classification model (HRTNSC). The input layer is enriched by fusing syllable-level and word-level features, and a BiLSTM+CNN network is applied to classify sentences as subjective or objective. The experimental results show that the HRTNSC model achieves an optimal F1 value of 90.84%, outperforming the benchmark models.
  • Information Extraction and Text Mining
    GUO Shiwei, MA Bo, MA Yupeng, YANG Yating
    2022, 36(12): 104-114.
    Due to the lack of global topic information, short text entity linking can rely only on local short text information and the knowledge base. This paper proposes the concept of a short text interaction graph (STIG) and a two-stage training strategy. BERT is used to extract multi-granularity features between the local short text and candidate entities, and a graph convolution mechanism is applied to the short text interaction graph. To alleviate the degradation of graph convolution caused by mean pooling, a method is further proposed to compress the node and edge features of the interaction graph into a dense vector. Experiments on the CCKS2020 entity linking dataset show the effectiveness of the proposed method.
  • Information Extraction and Text Mining
    LIN Zhi, LI Yuan, WANG Qinglin
    2022, 36(12): 115-122.
    Event extraction methods usually use the small-scale open-domain event extraction corpus of ACE 2005, which is difficult to apply deep learning to. A semi-supervised domain event argument extraction method is proposed to automatically annotate a cultural event corpus from the official websites of Chinese public libraries using templates and a domain dictionary. Manual annotation is then applied to ensure label accuracy. To resolve the problem of polysemy in the word embedding layer, an improved method using the BERT model and a positional character embedding layer is proposed for the BiLSTM-CRF model. Experiments demonstrate an F1 value of 84.9% for the proposed event argument extraction method, which is superior to classical event argument recognition methods.
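    Template-based automatic annotation of this kind can be sketched with a regular expression; the template and sentence below are hypothetical, not drawn from the paper's corpus:

    ```python
    import re

    # hypothetical announcement template: "《EVENT》将于TIME在PLACE举行"
    # ("EVENT will be held at PLACE on TIME")
    TEMPLATE = re.compile(r"《(?P<event>[^》]+)》将于(?P<time>\S+?日)在(?P<place>\S+?)举行")

    def auto_annotate(sentence):
        """Extract event arguments (event name, time, place) by template matching."""
        m = TEMPLATE.search(sentence)
        return m.groupdict() if m else None

    args = auto_annotate("《人工智能讲座》将于2023年1月5日在一楼报告厅举行")
    ```

    Sentences matched by such templates receive automatic argument labels, which are then checked manually as described above.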
  • Information Extraction and Text Mining
    TANG Yan, CHEN Yi, ZHANG Zuowei
    2022, 36(12): 123-132.
    Focusing on improving event evolutionary graph construction and enriching the event representation, this paper proposes an event prediction model based on the event evolutionary graph and the Graph Convolutional Network (GCN). The model applies an event extraction model and redefines the edge weights of the event evolutionary graph by combining frequency and mutual information. The representation of the event context is learned by a BiLSTM and a memory network, and is fed into the GCN under the guidance of the event evolutionary graph. The final event prediction is jointly completed by event-relationship-aware, context-aware, and neighbor-aware event embeddings. Experimental results on the Gigaword benchmark show that the proposed model outperforms six advanced models in event prediction accuracy, with a 5.55% increase over the latest SGNN method.
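    One plausible reading of "combining frequency and mutual information" into an edge weight, sketched with hypothetical co-occurrence counts (the exact combination used by the paper may differ):

    ```python
    import math

    def edge_weight(count_ab, count_a, count_b, total, alpha=0.5):
        """Combine co-occurrence frequency and pointwise mutual information
        into a single edge weight; alpha balances the two terms."""
        freq = count_ab / total                                  # relative co-occurrence frequency
        pmi = math.log((count_ab * total) / (count_a * count_b)) # pointwise mutual information
        return alpha * freq + (1 - alpha) * pmi

    # events A and B co-occur 30 times; A occurs 100 times, B 60 times,
    # out of 10,000 event pairs in total
    w = edge_weight(count_ab=30, count_a=100, count_b=60, total=10_000)
    ```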
  • Information Extraction and Text Mining
    YE Junyao, SU Jingyong, WANG Yaowei, XU Yong
    2022, 36(12): 133-138,148.
    Spaced repetition is a common mnemonic method in language learning. To decide proper review intervals for a desired memory effect, it is necessary to predict learners' long-term memory. This paper proposes a long-term memory prediction model for language learning via LSTM. We extract statistical features and sequence features from the memory behavior history of learners. The LSTM is used to learn the memory behavior sequence, and the half-life regression model is applied to predict the probability that foreign language learners recall words. On 9 billion pieces of real memory behavior data collected for evaluation, the sequence features prove more informative than the statistical features. Compared with state-of-the-art models, the error of the proposed LSTM-HLR model is significantly reduced, by 50%.
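    Half-life regression models recall probability as an exponential decay in the time since the last review; a minimal sketch of the standard formula p = 2^(-Δ/h), where the half-life h would here be estimated by the LSTM:

    ```python
    def recall_probability(delta_days, half_life_days):
        """Half-life regression: probability of recalling an item delta_days
        after the last review, given an estimated memory half-life."""
        return 2 ** (-delta_days / half_life_days)

    # with a 7-day half-life, recall probability is 0.5 after exactly 7 days
    p = recall_probability(7, 7)
    ```

    Review intervals can then be scheduled by solving for the delay at which p drops to a target recall level.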
  • Information Extraction and Text Mining
    LUO Yixiong, LYU Xueqiang, YOU Xindong
    2022, 36(12): 139-148.
    Patent efficacy is one of the key pieces of information in patent text. To identify patent efficacy phrases, a multi-feature approach is proposed that combines character-level and word-level features. The character-level features include characters, character pinyin, and character wubi; the word-level features correspond to the collection of words containing those characters. Character-level features are vectorized by word2vec or BERT, and an attention mechanism is used to fuse the word-level feature vectors in the input sequence. All feature vectors are concatenated as the input of a BiLSTM (or Transformer)+CRF. Experiments on new energy vehicle patents demonstrate that the best F1 value of 91.15% is achieved by BiLSTM+CRF with the combination of word2vec character vectors, BERT character vectors, wubi feature vectors and word feature vectors.
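    The per-character feature concatenation can be sketched as follows, with hypothetical dimensions and a simple attention-weighted fusion of word vectors standing in for the paper's attention mechanism:

    ```python
    import numpy as np

    def build_char_input(char_vec, pinyin_vec, wubi_vec, word_vecs, attn_weights):
        """Concatenate character-level features with an attention-weighted
        fusion of the vectors of words containing that character."""
        attn = np.asarray(attn_weights, dtype=float)
        attn = attn / attn.sum()                   # normalize attention weights
        word_feat = attn @ np.asarray(word_vecs)   # weighted sum of word vectors
        return np.concatenate([char_vec, pinyin_vec, wubi_vec, word_feat])

    # hypothetical dims: 100-d char, 30-d pinyin, 30-d wubi, three 50-d word vectors
    x = build_char_input(np.zeros(100), np.zeros(30), np.zeros(30),
                         np.ones((3, 50)), [0.2, 0.5, 0.3])
    ```

    The resulting per-character vectors form the input sequence to the BiLSTM (or Transformer)+CRF tagger.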
  • Sentiment Analysis and Social Computing
    TAN Xizi, ZHU Suyang, LI Shoushan, ZHOU Guodong
    2022, 36(12): 149-158.
    In recent years, emotion analysis has experienced rapid development. As one of its tasks, emotion regression is more general and less affected by the classification taxonomy, though it lacks sufficient corpora. In this paper, we propose a multi-dimensional emotion regression method that uses dimension-label information to predict the scores of an input text in three dimensions (Valence, Arousal, Dominance). The method conducts emotion regression via the probabilities of emotion classification predictions, with an objective that maximizes the distance between two texts with different emotion labels. Experimental results on EMOBANK show that the proposed method achieves significant improvement in mean square error and Pearson correlation coefficient, especially in the Valence and Arousal dimensions.
  • Sentiment Analysis and Social Computing
    ZENG Biqing, XU Mayi, YANG Jianhao, PEI Fenghua, GAN Zibang, DING Meirong, CHENG Lianglun
    2022, 36(12): 159-172.
    Aspect-level sentiment classification aims to analyze the sentiment polarity of different aspect words in a sentence. To realize aspect-word-aware contextual representations, this paper proposes a double-channel semantic difference network (DCSDN) based on the notion of semantic difference. The DCSDN captures the contextual feature information of different aspects in the same text with its double-channel architecture, and extracts the semantic features of the texts in both channels via a semantic extraction network. It employs semantic difference attention to enhance the attention to key information. Experiments on the Laptop and Restaurant datasets (SemEval 2014) and the Twitter dataset (ACL) demonstrate accuracies of 81.35%, 86.34% and 78.18%, respectively.
  • Sentiment Analysis and Social Computing
    CHEN Chen, ZHOU Xiabing, WANG Zhongqing, ZHANG Min
    2022, 36(12): 173-181.
    Dialogue sentiment analysis aims to classify the sentiment of each sentence in a dialogue, considering both each speaker's personal emotion and the emotion transmission between speakers. To model this with the Transformer, this paper proposes a multi-party attention mechanism to better model the interaction between different speakers and simulate dialogue scenes. Experiments show that, compared with other SOTA models, the Dialogue Transformer has a simpler implementation, faster running speed, and a significantly increased Weighted-F1 value.