Most Read

  • Survey
    CEN Keting, SHEN Huawei, CAO Qi, CHENG Xueqi
    Journal of Chinese Information Processing. 2023, 37(5): 1-21.
    As a self-supervised deep learning paradigm, contrastive learning has achieved remarkable results in computer vision and natural language processing. Inspired by this success, researchers have extended contrastive learning to graph data, promoting the development of graph contrastive learning. To provide a comprehensive overview of graph contrastive learning, this paper summarizes recent works under a unified framework to highlight the development trends. It also catalogues the popular datasets and evaluation metrics for graph contrastive learning, and concludes with possible future directions of the field.
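    At the heart of most graph contrastive methods surveyed is an InfoNCE-style objective that pulls two views of the same node together and pushes other nodes apart. A minimal pure-Python sketch (the toy embeddings and temperature are illustrative, not taken from the survey):

```python
import math

def infonce_loss(anchor, positive, negatives, tau=0.5):
    """InfoNCE loss for one anchor-positive pair (illustrative sketch).

    anchor, positive, negatives are plain lists of floats (embeddings);
    similarity is the dot product scaled by temperature tau.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    pos = math.exp(dot(anchor, positive) / tau)
    negs = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + negs))

# The loss shrinks as the anchor aligns with its positive view
close = infonce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
far = infonce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

    In practice the views are produced by graph augmentations (edge dropping, feature masking) and the sum runs over all in-batch negatives.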
  • Language Resources Construction
    WANG Chengwen, DONG Qingxiu, SUI Zhifang, ZHAN Weidong,
    CHANG Baobao, WANG Haitao
    Journal of Chinese Information Processing. 2023, 37(2): 26-40.
    Public NLP datasets form the bedrock of NLP evaluation tasks, and their quality has a fundamental impact on the development of evaluation tasks and the application of evaluation metrics. In this paper, we analyze and summarize eight types of problems in publicly available mainstream Natural Language Processing (NLP) datasets. Inspired by the quality assessment of testing in the education community, we propose a series of evaluation metrics and evaluation methods combining computational and operational approaches, aiming to provide a reference for the construction, selection and utilization of natural language processing datasets.
  • Language Resources Construction
    XIE Chenhui, HU Zhengsheng, YANG Lin'er, LIAO Tianxin, YANG Erhong
    Journal of Chinese Information Processing. 2023, 37(2): 15-25.
    The sentence pattern structure treebank is developed according to the theory of sentence-based grammar, which is of great significance to Chinese teaching. To further expand such treebanks from Chinese-as-a-second-language textbooks and Chinese textbooks to other domains, we propose a rule-based method to convert a phrase structure treebank, the Penn Chinese Treebank (CTB), into a sentence pattern structure treebank, so as to increase the size of the existing treebank. The experimental results show that our proposed method is effective.
  • Survey
    CHEN Jinpeng, LI Haiyang, ZHANG Fan, LI Huan, WEI Kaimin
    Journal of Chinese Information Processing. 2023, 37(3): 1-17,26.
    In recent years, session-based recommendation methods have attracted extensive attention from academics. With the continuous development of deep learning techniques, different model structures have been used in session-based recommendation, such as Recurrent Neural Networks, Attention Mechanisms, and Graph Neural Networks. This paper conducts a detailed analysis, classification, and comparison of these models, and expounds on the target problems and shortcomings of each method. In particular, this paper first compares session-based recommendation methods with traditional recommendation methods, and surveys their main advantages and disadvantages. Subsequently, this paper details how complex data and information are modeled in session-based recommendation models, as well as the problems that these models can solve. Finally, this paper discusses and identifies the challenges and potential research directions in session-based recommendation.
  • Survey
    XUE Siyuan, ZHOU Jianshe, REN Fuji
    Journal of Chinese Information Processing. 2023, 37(2): 1-14.
    This paper summarizes the research on automated essay scoring, including the development of automated essay scoring systems. It also examines the tasks, public datasets and popular metrics of automated essay scoring. The main techniques and models for automated essay scoring are reviewed, as well as the challenges posed by both native and non-native Chinese speakers. Finally, the prospects for future automated essay scoring are discussed.
  • Survey
    FAN Zipeng, ZHANG Peng, GAO Hui
    Journal of Chinese Information Processing. 2023, 37(1): 1-15.
    Quantum natural language processing, as a cross-disciplinary field of quantum mechanics and natural language processing, has gradually attracted the attention of the community, and a large number of quantum natural language processing models and algorithms have been proposed. As a review of these works, this paper briefly summarizes the problems of current classical algorithms and the two research ideas for combining quantum mechanics with natural language processing. It also explains the role of quantum mechanics in natural language processing from three aspects: semantic space, semantic modeling and semantic interaction. By analyzing the differences in storage resources and computational complexity between quantum and classical computing platforms, it reveals the necessity of deploying quantum natural language processing algorithms on quantum computing platforms. Finally, the current quantum natural language processing algorithms are enumerated, and future research directions in this field are outlined.
  • Information Extraction and Text Mining
    SUN Xianghui, MIAO Deqiang, DOU Chenxiao,
    YUAN Long, MA Baochang, DENG Yong, ZHANG Lulu, LI Xiangang
    Journal of Chinese Information Processing. 2023, 37(2): 119-128.
    "Intent Recognition" and "Slot Filling" are two core tasks in intelligent human-computer interaction, which have received extensive attention from academia and industry. Most state-of-the-art models perform far worse on few-shot learning tasks than on many-shot learning tasks. In this paper, we propose a novel joint model based on semi-supervised and transfer learning for intent recognition and slot filling. Semi-supervised learning is used to identify few-shot intents, requiring no additional labelled data. Transfer learning is used to exploit prior knowledge learned from large samples to acquire a slot-filling model for small samples. The proposed method won first place in the National Information Retrieval Challenge Cup (CCIR Cup) track of the 2021 CCF Big Data & Computing Intelligence Contest (CCF-BDCI), jointly held by the CCF-BDCI Organizing Committee and the Chinese Information Processing Society of China (CIPS).
  • Information Extraction and Text Mining
    HU Jie, HE Wei, ZENG Zhangfan
    Journal of Chinese Information Processing. 2023, 37(2): 107-118.
    Current event extraction models based on graph neural networks cannot properly handle long-distance dependencies, and the relationships between entities are not considered in graph construction. This paper proposes a document-level Chinese financial event extraction model based on RoBERTa and a global graph neural network. Firstly, the pre-trained RoBERTa model is used to encode documents, outputting the feature representations of all sentences and the embedded representation of document context. Then a global graph neural network including document nodes and entity nodes is constructed to strengthen the relationships between documents and entities. Finally, the global interactions between them are captured by a graph convolutional network to obtain an entity-level graph, and an improved path reasoning mechanism is applied to capture long-distance context-aware representations and cross-sentence argument distributions. The experimental results on the CFA dataset show that the proposed model achieves higher F1 scores than other models.
  • Knowledge Representation and Acquisition
    CHEN Yuehe, TAN Chuanyuan, CHEN Wenliang, JIA Yonghui, HE Zhengqiu
    Journal of Chinese Information Processing. 2023, 37(1): 54-63.
    In recent years, knowledge graph completion has attracted more and more attention from researchers. This paper presents a comparative analysis of the Chinese and English knowledge graph completion tasks, with a focus on the errors in Chinese knowledge graphs. It further proposes MER-Tuck, a link prediction method combining the embeddings of entities and relations with the embeddings of the texts describing them. This method enhances the learning ability of the matrix decomposition via external semantic information. A dataset is constructed in this paper for the Chinese knowledge graph completion task, and experiments on this dataset show that the proposed method is effective.
  • Question-answering and Dialogue
    ZHANG Zhilin, CHEN Wenliang
    Journal of Chinese Information Processing. 2023, 37(1): 121-131.
    The doctor-patient dialogue understanding is a typical task in intelligent medical community, which is challenged by entity representation and state determination. This paper proposes an information-enhanced doctor-patient dialogue understanding model. The model emphasizes the role features and symptom features, and integrates the semantics of symptom entities and reading comprehension semantics to enrich doctor-patient dialogue representation. On the first Intelligent Dialogue Diagnostic Assessment-Doctor-Patient Dialogue Understanding test set, the proposed model achieved 91.7% F1 for named entity recognition and 73.7% F1 for symptom state recognition.
  • Information Extraction and Text Mining
    WEN Qinghua, ZHU Hongyin, HOU Lei, LI Juanzi
    Journal of Chinese Information Processing. 2023, 37(1): 88-96.
    Open relation extraction aims to obtain knowledge from massive texts, which is a challenging task in the natural language processing community. With little annotated data and complex sentences, Chinese open relation extraction faces even more difficulties. This paper proposes a multi-strategy open relation extraction method, which comprehensively uses the knowledge graph to improve the coverage of entity recognition, realizes relation extraction from the entity context, obtains all-element triples via dependency parsing, and extracts entity attributes from the text. Experiments show that the proposed method has high accuracy for various types of relationships.
  • Ethnic Language Processing and Cross Language Processing
    KONG Chunwei, LYU Xueqiang, ZHANG Le, ZHAO Haixing
    Journal of Chinese Information Processing. 2023, 37(2): 53-61.
    Aiming at the demand for public opinion analysis in Tibetan, this paper proposes a hot event detection method based on multi-feature fusion. Firstly, the characteristics of hot news events are studied by analyzing term frequency, term frequency growth rate and website influence. A heat measurement method is then put forward, and the hot word set is obtained by heat filtering. Secondly, the distribution of event word pairs is analyzed, a word pair generation model and a semantic gravity model are designed, and the hot word pair set is obtained by heat filtering. Finally, a hierarchical clustering algorithm is introduced to detect hot events by clustering the mixed hot words and word pairs. The experimental results show that the optimal F value is 0.6000, which is better than the benchmark methods.
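    The abstract does not give the exact heat measure; a hypothetical sketch of how term frequency, its growth rate and site influence might be fused into a single heat score (the weights alpha/beta/gamma and the inputs are assumptions, not the paper's formula):

```python
def heat_score(freq_today, freq_yesterday, site_weight,
               alpha=0.5, beta=0.3, gamma=0.2):
    """Toy heat measure combining term frequency, its growth rate, and
    source-site influence; the linear weights are purely illustrative."""
    growth = (freq_today - freq_yesterday) / max(freq_yesterday, 1)
    return alpha * freq_today + beta * growth + gamma * site_weight

hot = heat_score(120, 30, 0.9)   # surging term on an influential site
cold = heat_score(5, 6, 0.2)     # flat, low-frequency term
```

    Terms whose score clears a threshold would then form the hot word set that seeds the clustering step.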
  • Knowledge Representation and Acquisition
    YE Hongbin, ZHANG Ningyu, CHEN Huajun, DENG Shumin, BI Zhen, CHEN Xiang
    Journal of Chinese Information Processing. 2023, 37(1): 46-53.
    A knowledge graph is a large-scale semantic network that uses a graph model to describe knowledge. The concept knowledge graph is a special knowledge graph with a wide range of applications in semantic search, question answering, and other scenarios. In this paper, we propose a concept graph construction approach that can automatically build a fine-grained Chinese concept hierarchy from massive texts. We also release an open and fine-grained Chinese concept graph called OpenConcepts, including 4.4 million concept instances, more than 50 000 fine-grained concepts, and 13 million concept-instance triples, with APIs to access the data.
  • Sentiment Analysis and Social Computing
    LIN Yuan, LI Jiaping, YANG Liang, ZHAO Xinhang, QIN Xue, XU Kan, LIN Hongfei
    Journal of Chinese Information Processing. 2023, 37(2): 129-137.
    Sentiment analysis refers to the classification of sentiment tendencies in a text. This paper defines the classification task as a comparison problem, and proposes a classification model based on comparative learning (Comparing to Learn, C2L). The goal of C2L is to score sentences by comparing them with labeled samples; in effect, classification by comparison is more effective than training an overly complex model. The experimental results on two commonly used datasets show that the performance of C2L is better than many existing models.
  • Sentiment Analysis and Social Computing
    FENG Renjie,WANG Zhongqing
    Journal of Chinese Information Processing. 2023, 37(1): 144-152.
    In recent years, with the rapid development of e-commerce platforms, more and more people choose to shop online and review products. For longer reviews, a summary can give users a quick idea of the advantages and disadvantages of a product. At present, most mainstream generative summarization models only consider the sequential information of the text, although attribute and emotional information are also very important. To model this information, this paper presents a generative summarization model that combines the attribute and emotional information in comments, integrating them by embedding attributes and emotions into the encoding layer of the model. Experiments show that this method generates higher quality summaries, with clear improvements on the ROUGE evaluation metrics.
  • Question-answering and Dialogue
    SUN Bin, CHANG Kaizhi, LI Shutao
    Journal of Chinese Information Processing. 2023, 37(1): 112-120.
    In intelligent medical services, current QA systems cannot deal with complex questions carrying multiple intentions. This paper proposes an intelligent understanding method for complex questions based on semantic analysis and deep learning. Medical entity extraction and dependency parsing are first performed on the input question. Then, a syntax standardization method is proposed to decompose the multi-intention input question into several simple questions about attributes or relations. Finally, intent understanding of the whole sentence is accomplished by classifying each simple question with a deep neural network. To validate the effectiveness of the proposed method, this paper builds a medical KG containing about 140,000 entities of 6 typical categories. A retrieval query is generated from the core entities and the relational predicates in the question intention, and the knowledge retrieved from the KG is used to generate the answer. The results on real medical consultation questions show that the proposed method can effectively recognize the multiple intentions in complex questions, and the corresponding QA system can produce comprehensive and accurate answers.
  • Language Resources Construction and Application
    SONG Heng, CAO Cungen, WANG Ya, WANG Shi
    Journal of Chinese Information Processing. 2023, 37(1): 16-32.
    Semantic roles play an important role in natural language understanding, but most existing semantic-role training datasets are relatively rough or even misleading in labeling semantic roles. To facilitate fine-grained semantic analysis, an improved taxonomy of Chinese semantic roles is proposed by investigating a real-world corpus. Focusing on a corpus formed of sentences with only one pivotal semantic role, we propose a semi-automatic method for fine-grained Chinese semantic role dataset construction. A corpus of 9550 sentences has been labeled with 9423 pivotal semantic roles, 29142 principal peripheral semantic roles and 3745 auxiliary peripheral semantic roles. Among them, 172 sentences are double-labeled with semantic roles and 104 sentences are labeled with semantic roles of uncertain semantic events. With a Bi-LSTM+CRF model, we compare the dataset against the Chinese Proposition Bank and reveal differences in the recognition of principal peripheral semantic roles, which provide clues for further improvement.
  • Language Analysis and Calculation
    LI Zhifeng, BAI Yan, HONG Yu, LIU Dong, ZHU Mengmeng
    Journal of Chinese Information Processing. 2023, 37(3): 18-26.
    Paraphrase identification is to decide whether two sentences express the same meaning. It is relatively easy for general-domain paraphrase identification to understand and judge the relationship between two sentences. To improve paraphrase identification in specific domains, we propose a paraphrase identification method based on domain knowledge fusion: we retrieve knowledge from the knowledge base and integrate it into the model. Experiments on the PARADE dataset (in the computer science domain) show our method reaches a 73.9% F1 score, outperforming the baseline by 3.1%.
  • Language Resources Construction and Application
    JIANG Jingchi, GUAN Changhe, LIU Jie, GUAN Yi, KE Shanfeng
    Journal of Chinese Information Processing. 2023, 37(1): 33-45.
    As data sources written by experts, agricultural books and network knowledge bases contain a large amount of agricultural common knowledge and experience, which are characterized by high reliability, rich knowledge and standard structure. In order to mine agricultural knowledge from multi-source data, this paper discusses issues related to agricultural named entities and entity relations, and proposes an agricultural knowledge labeling schema combining active learning and crowdsourcing. Under the guidance and participation of agricultural experts, a multi-source agricultural knowledge annotated corpus is constructed, which contains 9 categories of entities, 15 categories and 37 subcategories of semantic relations, totaling 48 000 entities and 50 000 entity relations. In the experiment, we demonstrate that active learning can save the annotation cost and improve the model training from the aspects of entity recognition and relation extraction.
  • Language Resources Construction
    LI Bin, YUAN Yiguo, LU Jingya, FENG Minxuan, XU Chao, QU Weiguang, WANG Dongbo
    Journal of Chinese Information Processing. 2023, 37(3): 46-53,64.
    Automatic word segmentation and part-of-speech (POS) tagging of ancient texts are basic tasks of ancient Chinese information processing. The lack of large-scale vocabularies and annotated corpora has slowed the development of ancient Chinese processing technology. This paper summarizes the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff, which provides a manually annotated corpus as unified training data, along with a basic test set and a blind test set. The bakeoff also distinguishes open and closed test modes according to whether external resources are used. The bakeoff was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), in the context of the 13th Edition of the Language Resources and Evaluation Conference (LREC). A total of 14 teams participated. On the basic test set, the F1-scores of word segmentation and POS tagging reach 96.16% and 92.05%, respectively, in the closed test, and 96.34% and 92.56%, respectively, in the open test. On the blind test set, the F1-scores of word segmentation and POS tagging reach 93.64% and 87.77%, respectively, in the closed test, and 95.03% and 89.47%, respectively, in the open test. Out-of-vocabulary words remain the main barrier to ancient Chinese lexical analysis, while deep learning and pre-trained models effectively improve the performance of automatic ancient Chinese processing.
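    The segmentation F1-scores reported above are conventionally computed by matching predicted word spans against gold word spans. A small illustration (the tokens are toy examples, not bakeoff data):

```python
def seg_spans(tokens):
    """Convert a token sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans

def seg_f1(gold_tokens, pred_tokens):
    """Span-level F1: a predicted word counts only if its exact span
    appears in the gold segmentation."""
    gold, pred = seg_spans(gold_tokens), seg_spans(pred_tokens)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

    For example, gold ["ab", "c", "de"] against prediction ["ab", "cd", "e"] shares only the first span, so precision and recall are both 1/3.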
  • Information Extraction and Text Mining
    SHAN Wenqi, WANG Bo, HUANG Qingsong, LIU Lijun, HUANG Mian
    Journal of Chinese Information Processing. 2023, 37(1): 97-103.
    To capture the semistructured information and the complex semantic relations in the medical record texts, this article proposes a disease prediction method based on a weighted hierarchical attention mechanism. The weighted accumulation method is designed to convert ordinary sentence vectors into structurally weakly related sentence vectors. A hierarchical structure attention mechanism is formed for the word, sentence, and document levels to improve the model. In addition, a supervision layer is constructed to alleviate the learning bias problem. Experiments on the real data set show the proposed model outperforms current deep learning models.
  • Information Extraction and Text Mining
    ZHANG Yun, HUANG Cheng, ZHANG Yuyao, HUANG Jingwei, ZHANG Yude,
    HUANG Liya, LIU Yan, DING Keke, WANG Xiumei
    Journal of Chinese Information Processing. 2023, 37(3): 101-111.
    The lack of training data is a typical problem in named entity recognition today. To apply the TMN model, which requires labeled triggers, to Chinese, a new automatic annotation method named GLDM-TMN is proposed. This method introduces the Mogrifier LSTM structure, the Dice loss function and various attention mechanisms to enhance the accuracy of trigger matching and entity annotation. Experiments on two publicly available datasets show that, with the same small amount of labeled data, GLDM-TMN improves the F1 score by 0.0133 to 0.034 over the TMN model. Meanwhile, the proposed method with 20% of the training data outperforms a BiLSTM-CRF model with 40% of the training data.
  • NLP Application
    SHAN Haocong, ZHOU Qiang
    Journal of Chinese Information Processing. 2023, 37(1): 169-178.
    Given a Chinese sentence group that contains a theme sentence, the internal structure label of the sentence group is based on the results of linguistic analysis. The main work of this paper is the information fusion and internal relevance analysis of the structure labels and the eye movement traces of reading the sentence group, which reflect human psychological cognition. A classification model based on support vector machines and recursive feature elimination is used to predict, from the corresponding eye movement features, whether a punctuation clause segment is key information containing the thematic content. By analyzing the distribution characteristics of eye movement data on the key segments, eye movement features that discriminate well for the thematic information of the sentence group are extracted, and a final accuracy of 0.76 is achieved.
  • Sentiment Analysis and Social Computing
    WU Jiaming, LIN Hongfei, YANG Liang, XU Bo
    Journal of Chinese Information Processing. 2023, 37(5): 135-142,172.
    Current humor detection is focused on textual humor recognition rather than carrying out this task on multimodal data. This paper proposes a modal fusion approach to humor detection based on the attention mechanism. Firstly, the model encodes each single-modal context to obtain the feature vector, and then the hierarchical attention mechanism is applied on feature sequences to capture the correlation of multi-modal information in the paragraph context. Tested on the UR-FUNNY public data set, the proposed model achieves an improvement of 1.37% in accuracy compared to the previous best result.
  • NLP Application
    LI Wenbiao, WU Yunfang
    Journal of Chinese Information Processing. 2023, 37(2): 158-168.
    Readability assessment is to automatically determine the reading difficulty of a given document. Focusing on Chinese readability assessment, this paper proposes a CNN + LSTM difficulty classification model with the variable-length convolutional layer and block structure. Extensive experiments on school textbooks and a manual-constructed test set show that the proposed method achieves 75.4% accuracy on 5-level difficulty prediction, which is superior to the existing models.
  • Sentiment Analysis and Social Computing
    LIANG Bin, LIN Zijie, XU Ruifeng, QIN Bing
    Journal of Chinese Information Processing. 2023, 37(2): 138-147,157.
    Existing research on sarcasm detection is focused on identifying sentence-level sarcastic expressions, ignoring the influence between satirical objects and sarcastic expressions. This paper proposes a new topic-oriented sarcasm detection task, which helps understand and model sarcastic expressions by introducing topics as satirical objects. A new dataset for topic-oriented sarcasm detection is constructed, consisting of 707 topics and 4871 topic-comment pairs. Then, a topic-based prompt learning model, built on a large-scale pre-trained language model and prompt learning, is proposed for the topic-oriented sarcasm detection task. Experimental results on the proposed dataset show that the model outperforms the baseline models. In-depth analysis further shows that the proposed topic-oriented sarcasm detection task is more challenging than traditional sentence-level sarcasm detection. The dataset and code are available at
  • Machine Translation
    LIU Yuan, LI Maoxi, XIANG Qingyu, LI Yihan
    Journal of Chinese Information Processing. 2023, 37(3): 89-100.
    Machine translation evaluation plays an important role in promoting the development and application of machine translation. The latest neural methods for evaluating machine translation use pretrained contextual embeddings to extract different deep semantic features, then simply concatenate them and feed them into a multi-layer neural network to predict translation quality. We propose to introduce middle-stage and late-stage information fusion into machine translation evaluation. Specifically, we use embrace fusion to interactively fuse different features in the middle stage; in the late stage, we fuse sentence mover's distance and sentence cosine similarity based on fine-grained accurate matching. Experimental results on the WMT'21 Metrics Task show that the proposed method achieves performance competitive with the best metrics in the evaluation campaign.
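    The late-stage fusion described above combines several similarity signals into one quality score. A sketch of cosine similarity plus a weighted late fusion (the fusion weights and inputs are illustrative, not the paper's learned parameters):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def late_fuse(scores, weights):
    """Late-stage fusion: a weighted combination of per-metric scores."""
    return sum(s * w for s, w in zip(scores, weights))

# e.g. fuse a mover's-distance-based score with a cosine-based score
fused = late_fuse([0.8, cosine([1.0, 2.0], [2.0, 4.0])], [0.5, 0.5])
```

    In the paper the fused signals are sentence mover's distance and sentence cosine similarity over contextual embeddings; here plain vectors stand in for those embeddings.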
  • Information Extraction and Text Mining
    CAO Biwei, CAO Jiuxin, GUI Jie, TAO Rui, GUAN Xin, GAO Qingqing
    Journal of Chinese Information Processing. 2023, 37(5): 88-100.
    Entity relation extraction aims to extract structured relation triples between entities from unstructured or semi-structured natural language texts. Character relation extraction is a finer-grained branch of entity relation extraction. Focusing on character relation extraction in Chinese literature, we present the MF-CRC character relation extraction model. We first introduce an adversarial learning framework to build a sentence-level noise classifier that filters noise in the dataset. Then BERT and BiLSTM are employed, and feature representations of Chinese surnames, gender and relations are designed. The character relation extraction model is finally established by integrating these multi-dimensional features. Experiments on three Chinese classics show that the proposed method outperforms SOTA models by 1.92% and 2.14% in micro-F1 and macro-F1, respectively.
  • Ethnic Language Processing and Cross Language Processing
    ZHU Yulei, DEJI Kazhuo, QUN Nuo, NYIMA Tashi
    Journal of Chinese Information Processing. 2023, 37(2): 71-79.
    To further improve the deep learning methods for Tibetan sentiment analysis, this paper proposes a Tibetan sentiment analysis model combining graph neural network and pre-training model for Tibetan short texts. Firstly, the word vector is constructed using the Albert pre-training model for Tibetan text. Then, the Tibetan sentiment words annotated in the corresponding sentences are converted into word vectors, which are fused with the sentiment word features. Finally, the fused features are constructed as graph data and fed to the graph neural network model for classification. The experimental results show that the proposed model reaches 98.60% accuracy, which is better than other baseline models. The dataset for this article is publicly available at: https: //
  • Information Extraction and Text Mining
    WANG Qiqi, LI Peifeng
    Journal of Chinese Information Processing. 2023, 37(5): 80-87.
    In contrast to existing relation triple extraction focused on written texts, this paper proposes a GCN (Graph Convolutional Network) based approach to model dialogue scenarios. Compared with the entity relations in written text, those in dialogues emphasize the relationships among people and are more colloquial. To address this, our method regards dialogue sentences as nodes and assigns weighted edges between sentences according to sentence distance. With the dialogue scene graph thus constructed, we then apply a GCN to model the relationships between dialogue turns. Experimental results on DialogRE show that our model outperforms the existing state-of-the-art baselines.
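    The distance-weighted dialogue graph can be sketched as an adjacency matrix whose edge weights decay with sentence distance, followed by one propagation step. The decay factor is an assumption, and a real GCN layer adds adjacency normalization and learned weight matrices:

```python
def dialogue_adjacency(n_sentences, decay=0.5):
    """Weighted adjacency: sentences are nodes, and the edge weight between
    two sentences decays geometrically with their distance."""
    return [[decay ** abs(i - j) for j in range(n_sentences)]
            for i in range(n_sentences)]

def gcn_step(A, H):
    """One (unnormalized) graph-convolution propagation: H' = A @ H,
    mixing each node's features with its distance-weighted neighbors."""
    return [[sum(A[i][k] * H[k][j] for k in range(len(H)))
             for j in range(len(H[0]))]
            for i in range(len(A))]

A = dialogue_adjacency(3)
H1 = gcn_step(A, [[1.0], [0.0], [0.0]])  # propagate a one-hot node feature
```

    After one step, the first sentence's feature has spread to its neighbors with weights 0.5 and 0.25, which is the locality bias the sentence-distance edges encode.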
  • Information Extraction and Text Mining
    YU Shujuan, MAO Xintao, ZHANG Yun, HUANG Liya
    Journal of Chinese Information Processing. 2023, 37(3): 112-122.
    Named entity recognition is a fundamental task of natural language processing. Lexicon-based methods are a popular approach to enhance the representation of semantic and boundary information for Chinese named entity recognition. To utilize glyphs, which contain rich entity information, we propose a novel Chinese named entity recognition model based on lexicon and glyph features. Specifically, the model enriches semantic information through SoftLexicon and optimizes character representations through improved radical-level embeddings, which are fed into a gated convolutional network. Experiments on four benchmark datasets show that the proposed model achieves significant improvements over existing models.
  • Information Extraction and Text Mining
    SUN Hong, WANG Zhe
    Journal of Chinese Information Processing. 2023, 37(3): 123-134.
    Current named entity recognition algorithms feature word enhancement, introducing external lexicon information to determine word boundaries. This paper proposes a multi-granularity information fusion strategy for named entity recognition. By encoding each component of a Chinese character with attention over the word sequence, the model captures Chinese glyph information. The experimental results on multiple named entity recognition datasets show that the algorithm has clear advantages in both accuracy and inference speed.
  • Ethnic Language Processing and Cross Language Processing
    LIU Wanyue, AISHAN Wumaier, LI Zhe, HAN Yue, ZHANG Daren, YI Nian
    Journal of Chinese Information Processing. 2023, 37(2): 87-96,106.
    In neural machine translation, BPE (Byte Pair Encoding) is a popular method to segment words into subword sequences, addressing the problem of rare and out-of-vocabulary words. However, BPE can only segment a word into a unique subword sequence. For morphologically rich languages, the same word has many possible decompositions, and a single subword sequence prevents the model from learning the different combination characteristics of a word. Instead of relying on a single subword sequence, this paper proposes a method of tagging and fusing multiple subword sequences: different BPE parameters are applied to segment the same training data, yielding different subword sequences with corresponding tags assigned. Experiments show the proposed method improves the BLEU score by more than 0.5 for both morphologically rich and non-inflectional language pairs. In addition, the less overlap there is between the different subword sequences, the better the translation quality that can be achieved.
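    Producing multiple subword views of one word can be illustrated with greedy longest-match segmentation against two hypothetical subword vocabularies (real BPE applies learned merge operations in order; the vocabularies and tags here are made up):

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of `word` against a subword vocab;
    falls back to single characters so it always terminates."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Two hypothetical vocabularies (e.g. BPE trained with different merge
# counts) yield two subword views of the same word, each tagged by source.
v_small = {"un", "break", "able"}
v_large = {"unbreak", "able"}
views = [(tag, segment("unbreakable", v))
         for tag, v in [("bpe-small", v_small), ("bpe-large", v_large)]]
```

    Feeding both tagged views into training exposes the model to different decompositions of the same word, which is the effect the paper exploits.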
  • Language Resources Construction
    HAN Ziyi, WANG Wei, XUAN Shichang
    Journal of Chinese Information Processing. 2023, 37(2): 41-52.
    Recent studies have shown that feeding DNNs with adversarial samples, i.e., samples containing small perturbations, can easily corrupt their output. Chinese adversarial sample generation faces the challenge of achieving both a high attack success rate and good sample readability. In this paper, we propose an adversarial attack method named MCGC that constrains the visual and semantic similarity of adversarial samples at different stages of generation. The generated adversarial samples have good readability and achieve a success rate of around 90% in targeted and untargeted attacks against multiple models such as Text-CNN, Bi-LSTM, and BERT-Chinese. We also study the differences in robustness between masked language models (MLM), represented by BERT, and traditional natural language processing models.
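One ingredient of visually-constrained attacks of this kind can be sketched generically: substituting characters with visually similar ones keeps the perturbed text readable to humans while changing the model's input tokens. The substitution map and budget below are invented for illustration and are not MCGC's candidate set or search procedure.

```python
# Hypothetical visual-similarity map (invented example pairs).
VISUAL_MAP = {"0": "O", "1": "l", "o": "0", "l": "1"}

def perturb(text, budget=1):
    """Substitute up to `budget` characters with visually similar ones."""
    out, used = list(text), 0
    for i, ch in enumerate(out):
        if used >= budget:
            break
        if ch in VISUAL_MAP:
            out[i] = VISUAL_MAP[ch]
            used += 1
    return "".join(out)
```

A real attack would rank candidate substitutions by their effect on the victim model's loss and additionally check semantic similarity; this sketch only shows why such samples stay readable.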
  • Language Analysis and Calculation
    CAI Kunzhao, ZENG Biqing, CHEN Pengfei
    Journal of Chinese Information Processing. 2023, 37(3): 27-35.
    In natural language processing, gradient-based adversarial training is an effective method for improving the robustness of neural networks. This paper proposes an initialization strategy based on a global perturbation vocabulary to deal with the low efficiency of existing adversarial training algorithms, improving training efficiency while ensuring effective initialization of the perturbations. To keep tokens independent and avoid training being dominated by a few samples, we propose a normalization strategy based on global equal weights. Finally, we propose a multifaceted perturbation strategy to improve the robustness of pre-trained language models. Experimental results show that these strategies effectively improve the performance of neural networks.
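The gradient-based adversarial training that this paper builds on follows an FGM-style recipe: perturb the input embedding along the normalized loss gradient, then also train on the perturbed input. The toy logistic model below (with an analytic gradient) is a generic sketch of that baseline, not the paper's proposed strategies.

```python
import numpy as np

def loss_and_grad_x(x, w, y):
    """Logistic loss and its gradient w.r.t. the input embedding x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w                      # dL/dx for the logistic loss
    return loss, grad_x

def fgm_perturb(x, grad_x, epsilon=0.1):
    """Move x by epsilon along the normalized gradient direction."""
    norm = np.linalg.norm(grad_x)
    return x if norm == 0 else x + epsilon * grad_x / norm

rng = np.random.default_rng(0)
x, w = rng.normal(size=4), rng.normal(size=4)
loss, g = loss_and_grad_x(x, w, y=1)
x_adv = fgm_perturb(x, g)                     # adversarial embedding
adv_loss, _ = loss_and_grad_x(x_adv, w, y=1)  # loss rises under perturbation
```

Training then minimizes the loss on both `x` and `x_adv`; the paper's contributions concern how the perturbations are initialized, normalized, and combined.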
  • Information Extraction and Text Mining
    JIA Baolin, YIN Shiqun, WANG Ningchao
    Journal of Chinese Information Processing. 2023, 37(3): 143-151.
    Extracting entities and relations from unstructured text is a crucial task in natural language processing. We propose an end-to-end joint entity and relation extraction model based on an SGM module. In our model, word-level and character-level embeddings are fed into the SGM module to obtain an efficient semantic representation. We then employ span-attention to fuse contextual and sentence-level information into a specific span representation, and finally use a fully connected layer to classify entities and relations. Without introducing other complicated external features, the model obtains rich semantics and takes full advantage of the association between entities and relations. Experimental results show that on the NYT10 and NYT11 datasets, the F1 of the proposed model on the relation extraction task reaches 70.6% and 68.3%, respectively, substantially outperforming other models.
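The span-attention step can be sketched as attention pooling over the tokens of a candidate span, scored against a sentence-level vector and concatenated with it. This is an illustrative reading of the mechanism, not the paper's SGM architecture; the sentence vector here is a crude mean of token vectors.

```python
import numpy as np

def span_representation(token_vecs, start, end, sent_vec):
    """Pool token vectors in [start, end) with softmax attention weights."""
    span = token_vecs[start:end]               # (span_len, d)
    scores = span @ sent_vec                   # relevance of each token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the span
    pooled = weights @ span                    # (d,) attention-pooled span
    return np.concatenate([pooled, sent_vec])  # span + sentence context

rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 8))     # 6 tokens, dimension 8
sentence = tokens.mean(axis=0)       # stand-in sentence-level vector
rep = span_representation(tokens, 1, 4, sentence)
```

The resulting vector would then feed the fully connected classification layer described in the abstract.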
  • Natural Language Understanding and Generation
    ZHAO Zhichao, YOU Jinguo, HE Peilei, LI Xiaowu
    Journal of Chinese Information Processing. 2023, 37(3): 164-172.
    To address the challenge that Chinese NL2SQL (Natural Language to SQL) methods require large amounts of annotated data, this paper introduces a dual learning NL2SQL model, DualSQL, for weakly supervised learning on a small amount of labelled data to generate SQL statements from Chinese queries. Specifically, natural-language-to-SQL translation and its inverse are trained simultaneously as dual tasks, so that the model learns the dual constraints between the tasks and obtains more relevant semantic information. To verify the effectiveness of dual learning on the NL2SQL parsing task, we use different proportions of unlabelled data during training. Experimental results show that the accuracy of the proposed model is at least 2.1% higher than benchmark models such as Seq2Seq, Seq2Tree, Seq2SQL, SQLNet, and their -dual variants on Chinese and English datasets including ATIS, GEO, and TableQA, and execution accuracy is at least 5.3% higher on the Chinese TableQA dataset. Further, we show that using only 60% of the labelled data achieves results similar to supervised learning with 90% of the labelled data.
  • Information Retrieval
    HUANG Sisi, KE Wenjun, ZHANG Hang, FANG Zhi, YU Zengwen, WANG Peng, WANG Qingli
    Journal of Chinese Information Processing. 2023, 37(5): 122-134.
    The data sparsity issue in recommendation can be alleviated by incorporating explicit information from a knowledge graph. Most existing knowledge graph-based methods capture user behaviors solely through entity relationships, ignoring implicit cues between users and candidate items. To this end, this paper proposes a recommendation approach that combines a knowledge graph with prompt learning. In particular, the knowledge graph is employed to propagate user preferences and produce the corresponding dynamic behaviors, while the implicit information absent from the knowledge graph is captured by feeding static user features to a pre-trained language model (PLM) under the prompt learning setting. Finally, the probability of the template token within the PLM vocabulary is taken as the likelihood of the recommendation. Experiments on the MovieLens-1M, Book-Crossing, and Last.FM datasets show that the proposed technique outperforms state-of-the-art baselines by 6.4%, 4.0%, and 3.6% in AUC, and 6.0%, 1.8%, and 3.2% in F1, respectively.
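The prompt-learning step can be sketched as filling a template with user features and reading off the PLM's mask probabilities for verbalizer tokens. The template wording, the "yes"/"no" verbalizer, and the probability values below are all hypothetical illustrations; the paper's actual templates and PLM are not reproduced here.

```python
# Hypothetical recommendation prompt with a masked slot for the PLM to fill.
TEMPLATE = "The user likes {history}. Will the user like {item}? [MASK]"

def prompt_score(fill_probs):
    """Map MLM probabilities of the verbalizer tokens at [MASK] to a score."""
    # fill_probs: probabilities the PLM assigns to "yes"/"no" at the mask.
    return fill_probs["yes"] / (fill_probs["yes"] + fill_probs["no"])

prompt = TEMPLATE.format(history="science-fiction films", item="Blade Runner")
score = prompt_score({"yes": 0.6, "no": 0.2})   # stand-in PLM outputs
```

In the full system these template probabilities are combined with the preferences propagated over the knowledge graph to rank candidate items.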
  • Knowledge Representation and Acquisition
    XIE Xiaoxuan, E Haihong, KUANG Zemin, TAN Ling, ZHOU Gengxian,
    LUO Haoran, LI Jundi, SONG Meina
    Journal of Chinese Information Processing. 2023, 37(3): 65-78.
    Traditional knowledge modeling methods have long been plagued by the high complexity of hypertension knowledge, and triples fail to represent it accurately. In this paper, we propose a Triple-view Hypertension Hyper-relational Knowledge Graph (THH-KG). It builds a three-layer graph architecture comprising a calculation layer, a concept layer, and an instance layer, on which the joint expression of multiple medical logic rules, conceptual knowledge, and patient knowledge is realized. Additionally, we propose a general method for storing a hyper-relational knowledge graph in a common graph database, on which a Hypertension Knowledge Graph Reasoning Engine (HKG-RE) is established. In a medication decision experiment, it achieves a 97.2% positive rate on 108 patients with hypertension.
  • Question-answering and Dialogue
    JIN Zhiling, ZHU Hongyu, SU Yulan, TANG Hongxuan, HONG Yu, ZHANG Min
    Journal of Chinese Information Processing. 2023, 37(1): 104-111,120.
    Pre-trained language models such as BERT have been widely used in many natural language processing tasks for their unified semantic representation of “text pairs” via the self-attention mechanism. However, directly using BERT for answer selection has two limitations: 1) BERT fails to perceive the independent semantic representations of word chunks, phrases, and clauses, so the matching process lacks information at different granularities; 2) the multi-head attention mechanism in BERT cannot calculate the correlation between semantic structures of different granularities. To address these issues, we propose a BERT-based multi-granularity interactive inference network. The method encodes the linguistic information of questions and answers through multi-granularity convolution to construct a high-order interaction tensor, which enriches the semantic information and the interactivity of questions and answers. In addition, we propose a sentence-level loss to emphasize key sentences in paragraph-level answers. Experiments on the WPQA dataset show that the proposed method effectively improves answer selection for non-factoid questions.
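The multi-granularity interaction idea can be sketched as follows: n-gram windows of question and answer token vectors are averaged, and their pairwise dot products form one interaction matrix per granularity. This is a minimal reading of the mechanism with mean-pooled windows and random stand-in vectors, not the paper's convolutional encoder.

```python
import numpy as np

def ngram_vectors(tokens, n):
    """Mean-pool every length-n window of token vectors: (L-n+1, d)."""
    return np.stack([tokens[i:i + n].mean(axis=0)
                     for i in range(len(tokens) - n + 1)])

def interaction_tensors(q, a, grains=(1, 2, 3)):
    """Per-granularity similarity matrices between question and answer."""
    return {n: ngram_vectors(q, n) @ ngram_vectors(a, n).T for n in grains}

rng = np.random.default_rng(2)
q = rng.normal(size=(5, 8))   # question: 5 tokens, dimension 8
a = rng.normal(size=(7, 8))   # answer: 7 tokens, dimension 8
tensors = interaction_tensors(q, a)
```

Stacking these matrices across granularities yields the high-order interaction tensor that downstream layers would consume.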