2022 Volume 36 Issue 9 Published: 01 November 2022
  

  • Survey
    Li Yunhan, Shi Yunmei, Li Ning, Tian Ying'ai
    2022, 36(9): 1-18,27.
Text correction, an important research field in Natural Language Processing (NLP), is of great application value in fields such as news, publication, and text input. This paper provides a systematic overview of automatic error correction technology for Chinese texts. Errors in Chinese texts are divided into spelling errors, grammatical errors, and semantic errors, and error correction methods for these three types are reviewed. Moreover, datasets and evaluation methods for automatic error correction of Chinese texts are summarized. Finally, prospects for automatic error correction of Chinese texts are discussed.
  • Language Analysis and Calculation
    YANG Jincai, CAO Yuan, HU Quan
    2022, 36(9): 19-27.
The classification of relation categories of Chinese complex sentences aims to identify the semantic relation between clauses. The automatic classification of complex sentence relation categories has important research value in linguistic studies and Chinese information processing. This paper explores the relation classification of marked generalized causal complex sentences with two clauses, which are the most frequently used complex sentences in Chinese texts. LTP (Language Technology Platform) is used for dependency syntax analysis to obtain features such as part of speech, the word order of the dependency parent node, and the dependency relation with the parent node. Different combinations of these features are fused with pre-trained word vectors to obtain new vectors, which are input into a DPCNN model to classify the relations of causal complex sentences. Experimental results show that, compared with the model without additional features, the fusion of sentence features makes the DPCNN model more effective. Among the feature combinations, POS feature fusion achieves the highest accuracy and F1 score, 98.41% and 98.28% respectively.
  • Language Analysis and Calculation
    HE Chunhui, HU Shengze, ZHANG Chong, GE Bin
    2022, 36(9): 28-37.
The Chinese sentence similarity measure has a wide range of applications in the field of text mining. This paper proposes a Chinese sentence similarity measure combining deep semantic features and explicit features. BERT and a feed-forward network are used to obtain deep semantic vectors, which are then concatenated with explicit features, and the final similarity score is produced by a classifier. Experimental results show that the proposed method outperforms all baseline methods on three public Chinese datasets.
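The fusion step described in this abstract can be sketched as follows. This is a minimal illustration: the particular explicit features (Jaccard word overlap and a length ratio) are assumptions made for the example, not necessarily the features used in the paper, and the deep semantic vector stands in for a BERT encoding.

```python
def explicit_features(s1: str, s2: str) -> list:
    """Hand-crafted surface features for a sentence pair (illustrative choices)."""
    t1, t2 = set(s1.split()), set(s2.split())
    overlap = len(t1 & t2) / max(len(t1 | t2), 1)          # Jaccard word overlap
    len_ratio = min(len(s1), len(s2)) / max(len(s1), len(s2), 1)
    return [overlap, len_ratio]

def fuse(deep_vec: list, s1: str, s2: str) -> list:
    """Concatenate a deep semantic vector with explicit features,
    producing the joint representation fed to the final classifier."""
    return deep_vec + explicit_features(s1, s2)
```

The concatenated vector would then be passed to any binary classifier to produce the similarity decision.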
  • Language Resources Construction
    ZHU Hongyu, JIN Zhiling, HONG Yu, SU Yulan, ZHANG Min
    2022, 36(9): 38-45.
The purpose of Question Paraphrase Identification is to find "homogeneous and heterogeneous" question pairs (questions with the same meaning but different semantic expressions) and to discard semantically independent noise questions. Existing pre-trained language models are widely used for semantic encoding of natural texts, but do not perform well in Question Paraphrase Identification. We propose a Directional Data Augmentation (DDA) method based on a generation model to fine-tune the pre-trained language model. DDA uses directional labels to guide the neural generation network, so as to automatically generate a variety of "paraphrase and non-paraphrase" pairs as augmentations of the training set. In addition, we design a model-ensemble voting mechanism to correct potential label errors in the augmented samples. Results on LCQMC show that, compared with traditional data augmentation methods, DDA produces higher-quality samples with more diversified semantic expressions.
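The label-correction idea can be sketched with a simple majority vote. This is a minimal sketch under the assumption that several fine-tuned models each predict a label for every augmented pair; the paper's actual voting mechanism may weight or threshold differently.

```python
from collections import Counter

def majority_vote(preds: list) -> int:
    """Majority label among an ensemble of classifier predictions."""
    return Counter(preds).most_common(1)[0][0]

def correct_labels(aug_samples: list, ensemble_preds: list) -> list:
    """Replace each augmented sample's generated label with the ensemble vote,
    correcting potential label errors introduced by the generator."""
    return [(text, majority_vote(p))
            for (text, _), p in zip(aug_samples, ensemble_preds)]
```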
  • Language Resources Construction
    PI Zhou, XI Xuefeng, CUI Zhiming, ZHOU Guodong
    2022, 36(9): 46-56.
Data augmentation increases the amount of training data without directly collecting new data. To address the lack of data, this paper proposes EMDAM (Extract-Merge Data Augmentation Method), a data augmentation method based on the CogLTX framework for automatic long-text summarization. EMDAM consists of two core steps: extracting and merging. First, short sentences are extracted from the original long-text dataset; then these short sentences are combined into long texts in a predefined order to form the augmented dataset. Compared with the baseline model, this augmentation strategy significantly improves performance on the PubMED_Min, CNN/DM_Min, and news2016zh_Min datasets. On the SLCTDSets, the final ROUGE score improves by nearly 2 points over the model without the augmentation strategy.
  • Machine Translation
    WU Lin, CHEN Hangying, LI Ya, YU Zhengtao, YANG Xiaoxia, WANG Zhenhan
    2022, 36(9): 57-66.
Model degeneration appears when reproducing the unsupervised neural machine translation method proposed by Facebook AI Research on Chinese-English language pairs. We propose three simple yet effective methods to solve this problem. First, we mask the non-target-language words in the translation. Second, we use a dictionary to translate the degenerated machine translations into the target language. Third, we add a small parallel corpus (100k parallel sentences) to the training process. Experimental results show that all three methods effectively prevent the model from degenerating. In the fully unsupervised setting, the second method performs best with a BLEU score of 7.87; with 100k parallel sentences, the first method performs best with a BLEU score of 14.28.
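The first two remedies can be sketched as token-level post-processing. This is a minimal sketch under the assumption that a target-language vocabulary set and a bilingual dictionary are available; the paper's actual masking symbol and lookup procedure are not specified here.

```python
def mask_non_target(tokens: list, target_vocab: set, mask: str = "<unk>") -> list:
    """Method 1 (sketch): replace words outside the target-language
    vocabulary with a mask token."""
    return [t if t in target_vocab else mask for t in tokens]

def dictionary_repair(tokens: list, bilingual_dict: dict, target_vocab: set) -> list:
    """Method 2 (sketch): translate degenerated source-language words into
    the target language via a bilingual dictionary, leaving unknown words as-is."""
    return [t if t in target_vocab else bilingual_dict.get(t, t)
            for t in tokens]
```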
  • Machine Translation
    CHEN Linqing, LI Junhui, GONG Zhengxian
    2022, 36(9): 67-75.
How to effectively use textual context information is a challenge in document-level neural machine translation (NMT). This paper proposes a hierarchical global context derived from the entire document to improve document-level NMT models. The proposed model captures the dependencies between the words in the current sentence and all other sentences, as well as those among all words, and combines the dependencies of different levels into a global context containing hierarchical contextual information. To take advantage of parallel sentence pairs in training, a two-step training strategy is employed: a sentence-level model is first trained with the Transformer and then fine-tuned on a document-level corpus. Experiments on several benchmark datasets show that the proposed model significantly improves translation quality compared with strong baseline models.
  • Information Extraction and Text Mining
    ZHANG Zefeng, MAO Cunli, YU Zhengtao, HUANG Yuxin, LIU Yiyang
    2022, 36(9): 76-83,92.
There has been little research on sensitive information identification for judicial public opinion, which is challenged by nonstandard descriptions, abundant redundant information, and numerous domain words. To address these issues, we propose a recognition model for judicially sensitive information that integrates a domain terminology dictionary. First, a bi-directional recurrent neural network and a multi-head attention mechanism are used to encode the public opinion text. Second, the domain terminology dictionary serves as guiding knowledge for classification: a similarity matrix is constructed between the dictionary terms and the public opinion text representation to derive a judicially sensitive text representation. Furthermore, a convolutional neural network encodes local information, and a multi-head attention mechanism derives weight-aware local features. Finally, sensitive information in the judicial field is identified. Experimental results show that, compared with the Bi-LSTM Attention baseline model, the F1 score increases by 8%.
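The dictionary-guided similarity step can be sketched as follows. This is a minimal illustration under the assumption that token and term vectors live in the same embedding space and that a dot product serves as the similarity; scaling each token by its best term match is one plausible reading of "deriving the judicially sensitive text representation", not the paper's exact formula.

```python
def dot(u: list, v: list) -> float:
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(text_reprs: list, term_reprs: list) -> list:
    """S[i][j]: similarity between token i of the text and domain term j."""
    return [[dot(t, d) for d in term_reprs] for t in text_reprs]

def term_guided_repr(text_reprs: list, term_reprs: list) -> list:
    """Scale each token vector by its maximum similarity to any domain term,
    emphasizing tokens close to the terminology dictionary."""
    sim = similarity_matrix(text_reprs, term_reprs)
    return [[max(row) * x for x in t] for row, t in zip(sim, text_reprs)]
```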
  • Information Extraction and Text Mining
    SONG Wei, ZHOU Junhao
    2022, 36(9): 84-92.
Most current Chinese named entity recognition methods use either character-level or word-level feature-aware networks, so the advantages of the two levels cannot be combined, making it difficult to obtain adequate information about Chinese character glyphs or word semantics. This paper proposes a Chinese named entity recognition method based on a multi-level feature-aware network. First, a two-channel gated convolutional neural network is proposed to perceive character-level features, which alleviates the OOV (out-of-vocabulary) problem and captures the glyph information of Chinese characters. To assign higher weights to entities, a self-attention mechanism is used to perceive word semantics with position information. The character-level and word-level information is then fused, with a Highway network based on a gating mechanism filtering out redundant information. Finally, a Conditional Random Field learns the constraints within the sentence. Experimental results show that the proposed method outperforms current mainstream Chinese named entity recognition algorithms.
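The Highway gating used in the fusion step is a standard construction, y = g * H(x) + (1 - g) * x, where the gate g decides how much of the transformed signal passes through versus the untouched input. A minimal element-wise sketch (the learned transform H and gate logits are passed in as plain lists here):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def highway(x: list, transformed: list, gate_logits: list) -> list:
    """Highway layer: y = g * H(x) + (1 - g) * x, with gate g = sigmoid(logit).
    The gate filters redundant information when fusing character-level
    and word-level features."""
    return [sigmoid(g) * h + (1.0 - sigmoid(g)) * xi
            for xi, h, g in zip(x, transformed, gate_logits)]
```

With a strongly positive gate logit the output follows the transform; with a zero logit it is an even blend of transform and input.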
  • Information Extraction and Text Mining
    DENG Qiuyan, XIE Songxian, ZENG Daojian, ZHENG Fei, CHENG Chen, PENG Lihong
    2022, 36(9): 93-101.
There is a large amount of text data in the field of public security. How to extract case-related information from texts of different sources and formats is an important issue for public security information processing. This paper proposes an event extraction method that combines trigger-free event detection with event argument role classification based on reading comprehension. The method first performs event detection without triggers, and then, based on the detection results, classifies event argument roles through reading comprehension. Experiments show that the proposed method achieves effective event extraction performance in the public security domain.
  • Information Retrieval
    LIU Shudong, ZHANG Ke, CHEN Xu
    2022, 36(9): 102-111.
News recommendation has gradually become one of the core techniques in the field and attracts much attention worldwide. Focusing on the unbalanced data issue, this paper proposes a news recommendation model with long- and short-term user preferences based on users' multi-dimensional interests. We divide long-term user interests into several dimensions and utilize an attention mechanism to distinguish the importance of different dimensions. In addition, we combine a CNN and an attention network to learn news representations, and use a GRU to capture users' short-term preferences from their recent reading history. Experiments on a real-world news dataset show that the proposed model outperforms state-of-the-art news recommendation methods in terms of AUC, MRR, and NDCG.
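The attention over interest dimensions can be sketched as a softmax-weighted sum. This is a minimal illustration: in the actual model the scores would come from a learned scoring function over the candidate news and the interest vectors, whereas here they are passed in directly.

```python
import math

def softmax(scores: list) -> list:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(interest_vectors: list, scores: list) -> list:
    """Attention-weighted sum of per-dimension interest vectors,
    producing a single long-term user preference vector."""
    w = softmax(scores)
    dim = len(interest_vectors[0])
    return [sum(w[i] * v[d] for i, v in enumerate(interest_vectors))
            for d in range(dim)]
```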
  • Information Retrieval
    JI Deqiang, WANG Hairong, LI Mingliang, ZHONG Weixing
    2022, 36(9): 112-119.
Most existing knowledge-graph-based recommendation methods use information from either the user side or the item side. We propose an improved recommendation method incorporating the neighbor entities of both users and items. The model uses TransR for entity propagation in the knowledge graph to obtain user embedding representations, and uses a GCN to aggregate candidate items with their neighbor entities in the knowledge graph to obtain item embedding representations. Experiments on the MovieLens-20M, Book-Crossing, and Last-FM datasets show that the average AUC and ACC of this method increase by 8.75% and 7.10%, respectively, compared with 10 other methods such as Wide&Deep, RippleNet, and KGAT.
  • Sentiment Analysis and Social Computing
    FU Xiangling, YAN Chenwei, ZHAO Pengya, SONG Meiqi, WU Weiqiang
    2022, 36(9): 120-128,138.
Fraud detection in consumer finance is an important issue in both academia and industry. With the emergence of group fraud, classical machine learning methods do not work well due to the small number of fraudulent users and insufficient feature data. Since group fraudulent users are closely related, this paper constructs a user-relation network from phone call data between users. User features in the graph are extracted through network statistical indicators and the DeepWalk algorithm, making full use of the topological structure and neighborhood information. This information, together with the users' inherent characteristics, is input to a LightGBM model. Experimental results show that with the graph representation learning method, AUC improves by 7.3% compared with using only inherent features.
  • Sentiment Analysis and Social Computing
    GE Xiaoyi, ZHANG Mingshu, WEI Bin, LIU Jia
    2022, 36(9): 129-138.
Rumor identification is of substantial research value. Current deep learning-based solutions achieve excellent results, but fail to capture the relationship between emotion and semantics or to provide emotional explanations. This paper proposes a dual emotion-aware method for interpretable rumor detection, aiming to provide a reasonable explanation from an emotional point of view via co-attention weights. Compared with baseline models, accuracy increases by 3.9%, 3.3%, and 4.4% on the public Twitter15, Twitter16, and Weibo20 datasets, respectively.
  • Natural Language Understanding and Generation
    LI Mengmeng, JIANG Aiwen, LONG Yuzhong, NING Ming, PENG Hu, WANG Mingwen
    2022, 36(9): 139-148.
Visual storytelling is a cross-modal task derived from image captioning, with substantial academic significance and wide application in fields such as the automatic generation of travel notes and education. Current methods are challenged by insufficient description of fine-grained image content, low correlation between images and the generated story, and a lack of richness in language. This paper proposes a visual storytelling algorithm based on fine-grained visual features and a knowledge graph. To fully mine and enhance the representation of image content, we design a fine-grained visual feature generator and a semantic concept generator. The first applies graph convolution over the scene graph to embed entity relationships into fine-grained visual information; the second integrates an external knowledge graph to enrich high-level semantic associations between adjacent images. Comprehensive and detailed representations of the image sequence are thereby obtained. Compared with several state-of-the-art methods on the VIST dataset, the proposed algorithm shows clear advantages on the Distinct-N and TTR metrics, as well as in story-image correlation, story logic, and word diversity.
  • Natural Language Understanding and Generation
    ZHANG Shi’an, XIONG Deyi
    2022, 36(9): 149-158.
Coreference is a common and essential language phenomenon. With coreference, repeated occurrences of complex expressions can be avoided, making sentences concise and coherent. In multi-turn spoken dialogue, using pronouns to refer to entities improves communication efficiency; however, the frequent use of pronouns makes it difficult for machines to understand utterances, which in turn affects the quality of machine-generated responses. This article suggests that the quality of dialogue generation can be improved by resolving pronouns: a coreference resolution model identifies all pronouns and noun phrases in a multi-turn dialogue that refer to the same entity, grouping them into coreference clusters. Two methods are proposed to apply coreference clusters to improve the dialogue model: (1) using coreference clusters to recover the complete semantics of a query, reducing the difficulty of machine language understanding; (2) using a graph convolutional network to encode the coreference clusters into the dialogue model, improving its language understanding ability. The two methods are tested on RiSAWOZ, a large-scale public dialogue dataset. Experimental results show that both methods significantly improve the performance of dialogue generation.
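Method (1), recovering a query's full semantics from coreference clusters, can be sketched as pronoun substitution. This is a minimal illustration under the assumption that each cluster's first mention is its representative and that tokens match mentions exactly; real resolution works over spans, not single tokens.

```python
def resolve_query(query_tokens: list, clusters: list) -> list:
    """Replace each pronoun in the query with the representative (first)
    mention of its coreference cluster, recovering the full semantics."""
    resolved = []
    for tok in query_tokens:
        # Find a cluster containing this token; fall back to the token itself.
        rep = next((c[0] for c in clusters if tok in c), tok)
        resolved.append(rep)
    return resolved
```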
  • Natural Language Understanding and Generation
    ZHU Shuai, CHEN Jianwen, ZHU Ming
    2022, 36(9): 159-168.
The main factor restricting the performance of multi-turn dialogue systems is the insufficient use of context information. One important solution is to rewrite the user's input based on the preceding dialogue, with pronoun resolution and ellipsis recovery as the core tasks. We propose SPDR (Span Prediction for Dialogue Rewrite), a BERT-based model that rewrites multi-turn dialogue by predicting the start and end positions of the span to insert before each token of the user's input. A new metric is also proposed to evaluate the quality of rewrite results. Compared with the traditional pointer-generator network, the inference speed of our model improves by about 100% without performance degradation, and our RoBERTa-wwm-based model outperforms the pointer-generator network on five metrics.
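The rewrite-by-span-insertion idea can be sketched as follows. This is a minimal illustration of applying predicted spans, with the span predictions themselves assumed given (in SPDR they would come from the BERT heads); the exact span encoding in the paper may differ.

```python
def rewrite(tokens: list, context: list, spans: dict) -> list:
    """Apply span predictions to rewrite an utterance: spans[i] is an
    optional (start, end) index pair into the dialogue context, and the
    referenced context tokens are inserted before token i of the input."""
    out = []
    for i, tok in enumerate(tokens):
        if i in spans:
            s, e = spans[i]
            out.extend(context[s:e])   # insert the recovered span
        out.append(tok)
    return out
```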