Most Read
  • Survey
    LUO Wen, WANG Houfeng
    Journal of Chinese Information Processing. 2024, 38(1): 1-23.
    Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks, showing potential for achieving general language intelligence. However, their expanding application necessitates more accurate and comprehensive evaluations. Existing evaluation benchmarks and methods still have many shortcomings, such as unreasonable evaluation tasks and uninterpretable evaluation results. With increasing attention to robustness, fairness and so on, the demand for holistic, interpretable evaluations is pressing. This paper delves into the current landscape and challenges of LLM evaluation, summarizes existing evaluation paradigms, analyzes their limitations, introduces pertinent evaluation metrics and methodologies for LLMs, and discusses ongoing advancements and future directions in the evaluation of LLMs.
  • Survey
    CUI Hongzhen, ZHANG Longhao, PENG Yunfeng, WU Wen
    Journal of Chinese Information Processing. 2024, 38(2): 1-14,24.
    Keyword extraction is a key research issue in natural language processing, knowledge graphs, dialogue systems, etc. In this paper, we analyze the keyword extraction process across existing keyword extraction algorithms, and sort out in detail the computational features and application cases of existing methods. We analyze supervised, unsupervised, and semi-supervised extraction methods in terms of feature extraction, representative papers, model algorithms, and method descriptions, summarizing the research progress, algorithm mechanisms, advantages, limitations, and application scenarios as well. Keyword extraction evaluation strategies are given, the application prospects of semi-supervised keyword extraction methods are discussed, and the research directions and possible challenges in feature fusion, domain knowledge, and graph construction are outlined.
  • Computational Model for Language Recognition
    ZHU Junhui, WANG Mengyan, YANG Erhong, NIE Jinran, YANG Lin'er, WANG Yujie
    Journal of Chinese Information Processing. 2024, 38(4): 17-27.
    Recent advancements in artificial intelligence have led to significant strides in language generation technologies, with chatbots like ChatGPT demonstrating proficiency in conversation and question answering. This paper investigates the differences between machine-generated language and human language by analyzing responses to 3,293 open-domain Chinese questions from humans and ChatGPT. The analysis examines 161 linguistic features in five dimensions: descriptive characteristics, word frequency, lexical diversity, syntactic complexity, and discourse cohesion. Classification algorithms are employed to assess the efficacy of these features in distinguishing between the two types of language. The results reveal significant differences in 77 linguistic features across descriptive characteristics, word frequency, and lexical diversity. Human language tends to exhibit higher readability, lower argument overlap, a more colloquial style, a richer vocabulary, and greater interactivity compared to machine-generated language.
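    A minimal sketch of the general approach, assuming a feature-based classifier: a few shallow stand-in features (the paper uses 161 features across five dimensions) are fed to a scikit-learn classifier to separate human from machine responses. The feature set and data layout below are illustrative assumptions, not the authors' pipeline.

      # Hypothetical illustration: shallow linguistic features + logistic regression.
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      def shallow_features(text):
          tokens = text.split()                        # crude tokenization for illustration
          n = max(len(tokens), 1)
          ttr = len(set(tokens)) / n                   # type-token ratio (lexical diversity)
          avg_len = sum(len(t) for t in tokens) / n    # mean token length
          return [len(tokens), ttr, avg_len]

      def evaluate(texts, labels):                     # labels: 1 = human, 0 = machine (assumed)
          X = [shallow_features(t) for t in texts]
          clf = LogisticRegression(max_iter=1000)
          return cross_val_score(clf, X, labels, cv=5).mean()   # 5-fold accuracy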
  • Survey
    ZHANG Hongyi, LI Ren, YANG Jianxi, YANG Xiaoxia, XIAO Qiao, JIANG Shixin, WANG Di
    Journal of Chinese Information Processing. 2024, 38(4): 1-16.
    Table question answering (Table QA) obtains answers directly from table data through natural language, and is one of the main forms of intelligent question answering. Recently, researchers have paid great attention to resolving this task via semantic parsing. In this paper, we divide Table QA tasks into three types: single-table single-turn, multi-table single-turn, and multi-table multi-turn. This paper provides a systematic introduction to datasets and representative methods for each type of Table QA task. It also summarizes the data construction, input encoding, and pre-training objectives of table pre-training models. Finally, we explore the strengths and weaknesses of current work, and discuss the future prospects and challenges of Table QA.
  • Survey
    CAO Hang, HU Chi, XIAO Tong, WANG Chenglong, ZHU Jingbo
    Journal of Chinese Information Processing. 2023, 37(11): 1-14.
    Most current machine translation systems adopt the autoregressive method for decoding, which leads to low inference efficiency. The non-autoregressive method significantly improves inference speed through parallel decoding, attracting increasing research interest. We conduct a systematic survey of recent efforts to narrow the translation quality gap between Non-Autoregressive Machine Translation (NART) and Autoregressive Machine Translation (ART). We categorize NART methods by how they capture the dependencies of target sequences, and discuss the open challenges of NART research.
  • NLP Application
    WANG Yaqiang, YANG Xiao, ZHU Tao, HAO Xuechao, SHU Hongping, CHEN Guo
    Journal of Chinese Information Processing. 2024, 38(1): 156-165.
    Postoperative risk prediction has a positive effect on clinical resource planning, emergency plan preparation, and the reduction of postoperative risk and mortality. To exploit the rich semantic information in unstructured preoperative diagnoses, this paper proposes a postoperative risk prediction model enhanced by unstructured data representation. The model utilizes self-attention to fuse the structured data with the unstructured preoperative diagnosis. Compared with the baseline methods, the proposed model improves F1-score by an average of 9.533% on the tasks of pulmonary complication risk prediction, ICU admission risk prediction and cardiovascular adverse event risk prediction.
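    A minimal sketch, assuming the fusion works roughly as described: the structured preoperative variables are projected into the text encoder's space and self-attention runs over the joint sequence. All dimensions and names below are assumptions, not the authors' code.

      import torch
      import torch.nn as nn

      class FusionRiskModel(nn.Module):
          """Fuse structured features with a diagnosis-text embedding via self-attention."""
          def __init__(self, d_model=256, n_heads=4, n_struct=32):
              super().__init__()
              self.struct_proj = nn.Linear(n_struct, d_model)   # embed structured fields
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.head = nn.Linear(d_model, 1)                 # risk score

          def forward(self, struct_x, text_emb):
              # struct_x: (B, n_struct); text_emb: (B, T, d_model) from a text encoder
              s = self.struct_proj(struct_x).unsqueeze(1)       # (B, 1, d_model)
              seq = torch.cat([s, text_emb], dim=1)             # joint modality sequence
              fused, _ = self.attn(seq, seq, seq)               # self-attention over both
              return torch.sigmoid(self.head(fused[:, 0]))     # read risk off the fused slot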
  • Sentiment Analysis and Social Computing
    XU Rui, ZENG Cheng, CHENG Shijie, ZHANG Haifeng, HE Peng
    Journal of Chinese Information Processing. 2024, 38(1): 135-145.
    The rapid development of pre-trained models has brought breakthroughs in sentiment classification. However, the massive data provided by the Internet contains a large amount of semantically ambiguous and confusing text, which restricts the effect of most current classification models. To address this issue, a double triplet network for sentiment classification (DTN4SC) is proposed. This method improves the construction of triplet sample combinations by extracting and weighting two kinds of triplet samples from straightforward text and ordinary text, respectively, which captures the similarity between texts of the same category and the differences between texts of confusable categories. During training, the confusing text in one batch is added to the next batch for further training. Experimental results on nlpcc2014, waimai_10k and ChnSentiCorp show that the proposed method outperforms existing sentiment classification methods for confusing text in accuracy and F1 value, with improvements of 3.16%, 2.35% and 2.5%, respectively.
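    The triplet component at the heart of such models is standard; a sketch under the usual formulation (the margin value and distance choice are assumptions, and the paper's weighting of the two triplet kinds is omitted):

      import torch.nn.functional as F

      def triplet_loss(anchor, positive, negative, margin=1.0):
          d_pos = F.pairwise_distance(anchor, positive)   # same-category distance
          d_neg = F.pairwise_distance(anchor, negative)   # confusable-category distance
          return F.relu(d_pos - d_neg + margin).mean()    # pull positives in, push negatives out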
  • Sentiment Analysis and Social Computing
    ZHU Jie, LIU Suwen, LI Junhui, GUO Lifan, ZENG Haifeng, CHEN Feng
    Journal of Chinese Information Processing. 2023, 37(11): 151-157.
    Interpretable sentiment analysis aims to judge the polarity of text while giving evidence for the judgements or predictions. Most existing sentiment analysis methods are black-box models, and interpretability evaluation remains an open problem. This paper proposes an interpretable sentiment analysis method based on UIE. According to the characteristics of interpretable sentiment tasks, this method uses techniques such as few-shot learning and text clustering to improve the rationality and faithfulness of the model. The experimental results show that this method won first place in the "2022 Language and Intelligence Technology Competition: Sentiment Interpretability Evaluation" task.
  • Information Extraction and Text Mining
    WANG Haochang, ZHENG Guanyu, ZHAO Tiejun
    Journal of Chinese Information Processing. 2024, 38(2): 87-98.
    Extracting fine-grained entities such as the parties, basic contract information, and contract terms from contract text can effectively improve the efficiency of contract review and empower automated contract management. To address the complexity and subtlety of entities in contracts, this paper proposes a new fine-grained entity recognition model named BLBC-CFER based on lexicon enhancement. It combines character-level enhancements provided by pre-trained language models, word-level enhancements provided by character-plus-word embeddings, and word-level enhancements provided by lexical set structure embeddings, and then obtains the optimal tag sequence through deep neural networks. Experiments on a self-constructed fine-grained entity corpus of business contracts and two public datasets demonstrate the superior performance of the proposed method.
  • Survey
    REN Fanghui, GUO Xitong, PENG Xin, YANG Jinfeng
    Journal of Chinese Information Processing. 2024, 38(1): 24-35.
    As the first step in a task-oriented dialogue system (TOD), Spoken Language Understanding (SLU) governs the overall system performance. The past few years have witnessed great progress in SLU owing to the huge success of Large Language Models (LLMs). This paper investigates the SLU task (in contrast to written language understanding) with a focus on the medical field. Specifically, this paper illustrates the difficulties and challenges of medical SLU, and summarizes the progress and shortcomings of existing research from the perspectives of datasets, algorithms and applications. Finally, in light of the latest progress in generative LLMs, this paper outlines new research directions in this field.
  • Language Analysis and Computation Model
    SUN Yu, YAN Hang, QIU Xipeng, WANG Ding, MU Xiaofeng, HUANG Xuanjing
    Journal of Chinese Information Processing. 2024, 38(1): 74-85.
    Currently, research on Large Language Models (LLMs), such as InstructGPT, is primarily focused on free-form generation tasks, while their exploration on structured extraction tasks has been overlooked. To gain a deep understanding of LLMs on structured extraction tasks, this paper analyzes InstructGPT's performance on named entity recognition (NER), one of the fundamental structured extraction tasks, in both zero-shot and few-shot settings. To ensure the reliability of the findings, the experiments cover flat and nested datasets from both the biomedical and general domains. The results demonstrate that InstructGPT's zero-shot NER performance reaches 11% to 56% of that of a fine-tuned small-scale model. To explore why InstructGPT struggles with NER, this paper examines the outputs and finds that 50% of them are invalid generations. Besides, the occurrence of both "false-negative" and "false-positive" predictions makes it difficult to improve performance by addressing invalid generation alone. Therefore, in addition to ensuring the validity of generated outputs, further research should focus on finding effective ways of using InstructGPT in this area.
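    For concreteness, a hypothetical zero-shot NER prompt in the spirit of this setup; the paper's exact prompt and decoding settings are not given here, so everything below is an assumption.

      # Hypothetical prompt template; output is requested as JSON to reduce the
      # invalid generations the paper reports.
      PROMPT = (
          "Extract all disease entities from the sentence below.\n"
          "Answer with a JSON list of strings; answer [] if there are none.\n"
          "Sentence: {sentence}\nEntities:"
      )

      def build_prompt(sentence):
          return PROMPT.format(sentence=sentence)

      print(build_prompt("The patient was diagnosed with type 2 diabetes."))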
  • Computational Argumentation
    Journal of Chinese Information Processing. 2023, 37(10): 106-107.
    Argumentation takes the process of human logical reasoning as its object of study and is a research field spanning logic, philosophy, linguistics, rhetoric, computer science, and education. In recent years, argumentation research has attracted the attention of computational linguists and given rise to a new research field, Computational Argumentation. Researchers attempt to combine human cognitive models of logical argumentation with computational models to improve the automatic reasoning capability of artificial intelligence. According to the number of participants in the argumentation process, research in computational argumentation falls into two categories: Monological Argumentation and Dialogical Argumentation. Monological argumentation studies argumentative texts with a single participant, such as argumentative essays and keynote speeches; related research problems include argument unit detection, argument structure prediction, argumentation strategy classification, and essay scoring. Dialogical argumentation studies argumentation processes in which views are exchanged on a specific topic, generally with multiple participants; related research problems include argumentation outcome prediction, interactive argument-pair extraction, and argument logic chain extraction.
  • Ethnic Language Processing and Cross Language Processing
    AN Bo, ZHAO Weina, LONG Congjun
    Journal of Chinese Information Processing. 2024, 38(2): 70-78.
    Text classification is one of the fundamental tasks in natural language processing. The lack of labeled data has always been an important factor limiting the development of natural language processing technologies for Tibetan and other minority languages, as traditional deep learning models have high requirements on the scale of labeled data. To address this issue, this paper implements low-resource Tibetan text classification using prompt learning based on pre-trained language models, conducting Tibetan text classification experiments with different Tibetan pre-trained language models and prompt templates. The experimental results show that, with reasonably designed prompt templates and related methods, prompt learning can improve the effectiveness of Tibetan text classification (48.3%) when training data is insufficient, preliminarily verifying the value and potential of prompt learning in minority language processing. However, the results also indicate that the prompt learning model may underperform on specific categories, suggesting there is still room for improvement in the Tibetan pre-trained language model.
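    A minimal prompt-learning sketch, assuming a masked-LM-style Tibetan checkpoint: the class is read off from the verbalizer word the model ranks highest at the mask position. The model name, template, and verbalizer below are placeholders, and each verbalizer word is assumed to be a single vocabulary token.

      import torch
      from transformers import AutoTokenizer, AutoModelForMaskedLM

      MODEL = "hypothetical/tibetan-bert"                   # placeholder checkpoint name
      tok = AutoTokenizer.from_pretrained(MODEL)
      mlm = AutoModelForMaskedLM.from_pretrained(MODEL)
      verbalizer = {"politics": "politics", "sports": "sports"}  # label -> answer token (assumed)

      def classify(text):
          prompt = f"{text} This text is about {tok.mask_token}."
          inputs = tok(prompt, return_tensors="pt")
          with torch.no_grad():
              logits = mlm(**inputs).logits
          mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
          scores = {lbl: logits[0, mask_pos, tok.convert_tokens_to_ids(w)].item()
                    for lbl, w in verbalizer.items()}
          return max(scores, key=scores.get)                # best-scoring label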
  • Information Extraction and Text Mining
    ZHANG Xin, YUAN Jingling, LI Lin, LIU Jia
    Journal of Chinese Information Processing. 2023, 37(11): 49-59.
    Recent studies show that visual information can help text achieve more accurate named entity recognition. However, most existing work treats an image as a collection of visual objects and attempts to explicitly align visual objects with entities in text, failing to cope with modal bias when the visual objects and the entities are quantitatively and semantically inconsistent. To deal with this problem, we propose a debiased contrastive learning approach (DebiasCL) for multimodal named entity recognition. We utilize visual-object density to guide visual-context-rich sample mining, which enhances debiased contrastive learning to achieve better implicit alignment by optimizing the latent semantic space between visual and textual representations. Empirical results show that DebiasCL achieves F1-values of 75.04% and 86.51%, with gains of 5.23% and 5.2% on the "PER" and "MISC" entity types in Twitter-2015 and Twitter-2017, respectively.
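    The contrastive core can be illustrated with a standard InfoNCE objective over paired text and image embeddings; the paper's debiasing and density-guided sample mining are omitted in this sketch.

      import torch
      import torch.nn.functional as F

      def info_nce(text_emb, image_emb, temperature=0.07):
          text_emb = F.normalize(text_emb, dim=-1)          # (B, d)
          image_emb = F.normalize(image_emb, dim=-1)        # (B, d)
          logits = text_emb @ image_emb.t() / temperature   # pairwise similarities
          targets = torch.arange(text_emb.size(0))          # matched pairs on the diagonal
          return F.cross_entropy(logits, targets)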
  • Survey
    YIN Hua, LU Yiliang, JI Yuelei, WU Zihao, PENG Ya'nan
    Journal of Chinese Information Processing. 2024, 38(3): 1-23.
    Abstract Meaning Representation (AMR), with the ability to accurately abstract the complete meaning of a sentence, realizes domain-independent semantic representation of entire sentences. AMR parsing affects the performance of downstream NLP tasks and has become a popular research topic both domestically and internationally in recent years. We first employ the CiteSpace tool to analyze the overall research landscape of AMR, revealing far less research on Chinese AMR parsing than on English. Then we discuss the development of AMR corpora and the difficulties of concept recognition, relation recognition, alignment, and the integration of structural information in AMR parsing. We categorize AMR parsing into four types and trace the evolution of AMR parsing methods. Finally, we select 21 English AMR parsers and 7 Chinese AMR parsers, and compare them on various experimental metrics, including Smatch.
  • Computational Model for Language Recognition
    SHEN Zhenqian, LI Wenqiang, REN Tiantian, WANG Yao, ZHAO Huijuan
    Journal of Chinese Information Processing. 2024, 38(4): 38-49.
    Electroencephalogram (EEG)-based attention state detection is of great significance for expanding the applications of brain-computer interfaces. In this paper, a classification approach is presented to improve the accuracy of EEG-based attention state classification via a Convolutional Neural Network and Nested Long Short-Term Memory (CNN-NLSTM) model. First, the power spectral density of the EEG signals is obtained by the Welch method and represented as a two-dimensional grayscale image. Then, the CNN is used to learn features that represent attention states from the grayscale images, and the resulting features are fed into the NLSTM network to obtain attention characteristics sequentially over all time steps. Finally, the two networks are connected to build a deep learning framework for attention state classification. The experimental results show that the proposed model, evaluated by repeated 5-fold cross-validation, outperforms other models with an average accuracy of 89.26% and a maximum accuracy of 90.40%.
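    The preprocessing step is standard signal processing; a sketch with SciPy, assuming a (n_channels, n_samples) array and a sampling rate (the paper's exact Welch parameters are not given here).

      import numpy as np
      from scipy.signal import welch

      def eeg_to_image(eeg, fs=250):
          # eeg: (n_channels, n_samples) raw signal; fs: sampling rate in Hz (assumed)
          psds = [welch(ch, fs=fs, nperseg=fs)[1] for ch in eeg]     # PSD per channel
          img = np.stack(psds)                                       # (n_channels, n_freqs)
          return (img - img.min()) / (img.max() - img.min() + 1e-8)  # grayscale in [0, 1]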
  • Information Extraction and Text Mining
    ZHU Jizhao, ZHAO Yilin, ZHANG Jiaxin, HUANG Youpeng, FAN Chunlong
    Journal of Chinese Information Processing. 2024, 38(2): 99-108.
    Entity and relation extraction is a key technology for automatically building large-scale knowledge graphs from massive text data. Considering the effect of entities on the discrimination of relation types, this paper proposes a joint entity and relation extraction model based on an entity-pair-specific attention mechanism (EPSA). First, entity recognition is completed based on Bi-directional Long Short-Term Memory (Bi-LSTM) and Conditional Random Fields (CRF). Then the extracted entities are combined into entity pairs and transformed into a unified embedding. The sentence representation is obtained by the entity-pair-specific attention mechanism together with the entity-pair embedding. Finally, relation extraction is completed by a classification process. Experimental results on the NYT and WebNLG datasets show that the proposed method outperforms the baselines, achieving F1 values of 84.5% and 88.5%, respectively.
  • Language Analysis and Calculation Model
    WANG Chao, LYU Guoying, LI Ru, CHAI Qinghua, LI Jinrong
    Journal of Chinese Information Processing. 2024, 38(2): 25-35.
    Chinese frame semantic role labeling plays an important role in Chinese frame semantic analysis. At present, Chinese frame semantic role labeling is mainly aimed at verb frames. This paper constructs a Chinese adverb frame and dataset, and classifies the words in the frame according to their semantic strength. Then, this paper proposes a semantic role labeling model based on BERT feature fusion and dilated convolution. The model comprises four layers: a BERT layer to represent the rich semantic information of sentences, an attention layer to dynamically weight the information from each BERT layer, an iterated dilated convolution (IDCNN) layer to extract features, and a CRF layer to predict tags. The model performs well on three adverb frame datasets, achieving F1 values of 82% or higher. In addition, the model achieves an 88.29% F1 value on the CFN dataset, 4% above the baseline model.
  • Language Analysis and Calculation Model
    WANG Yu, YUAN Yulin
    Journal of Chinese Information Processing. 2024, 38(2): 36-45.
    The double negation structure is a special structure that "expresses a positive meaning through two negations", in which the two negations have an important impact on semantic analysis and sentiment classification in natural language processing. Taking "¬¬P ⇒ P" as the prototype, this paper examines the "negation word + negation word" structures in modern Chinese, and divides them into 3 categories, 25 sub-categories and 132 constructions in total. This paper then proposes three conditions for the establishment of the double negation structure, and a rule-based method to identify double negation. The method recognizes the double negation structure with 98.80% accuracy, 98.90% recall, and a 98.95% F1 value, and identified 8,640 sentences from 96,281 sentences, of which 99.20% are true double negation structures.
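    A toy version of the rule-based idea (the paper distinguishes 132 constructions and three establishment conditions; the word list and window size below are illustrative assumptions):

      import re

      NEG_WORDS = ["不", "没有", "没", "无", "非", "未"]   # sample negation words (assumed)
      PATTERN = re.compile("|".join(NEG_WORDS))

      def maybe_double_negation(sentence, window=6):
          """Flag sentences where two negation words fall within a short window."""
          hits = [m.start() for m in PATTERN.finditer(sentence)]
          return any(b - a <= window for a, b in zip(hits, hits[1:]))

      print(maybe_double_negation("我们不得不承认这个结果。"))   # True: 不…不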
  • NLP Application
    LUO Wenbing, LUO Kaiwei, HUANG Qi, WANG Mingwen
    Journal of Chinese Information Processing. 2024, 38(4): 143-155.
    Annotation of mathematical exercises with topics is an essential task for building a structured exercise bank and realizing personalized learning. Due to the particularity of mathematical exercise texts, existing annotation models cannot capture deep key information well, and commonly suffer from insufficient introduction of key knowledge, overly direct fusion methods, and a lack of effective information screening. This paper proposes MKAGated, a model for automatic annotation of mathematical exercise topics. The model first uses a pre-trained model to represent the original exercise and two kinds of refined subject-knowledge texts. Then, an attention mechanism is adopted to capture the interactions between the exercise and the two subject-knowledge texts as deep representations. Finally, a gated mechanism is applied to implicitly fuse the average pooling of the two deep representations so as to preserve the effective semantic features of the original exercise representation. On a self-built junior middle school mathematics exercise dataset, the proposed method outperforms the baseline by 1.99%, 2.99% and 2.12% in micro-F1, macro-F1 and weighted-F1, respectively.
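    The gated fusion step admits a compact generic sketch (dimensions and names are assumptions; the paper gates pooled knowledge representations into the exercise representation):

      import torch
      import torch.nn as nn

      class GatedFusion(nn.Module):
          def __init__(self, dim):
              super().__init__()
              self.gate = nn.Linear(2 * dim, dim)

          def forward(self, exercise, knowledge):
              # sigmoid gate decides how much knowledge to mix into the exercise vector
              g = torch.sigmoid(self.gate(torch.cat([exercise, knowledge], dim=-1)))
              return g * exercise + (1 - g) * knowledge   # element-wise interpolation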
  • Information Extraction and Text Mining
    YU Zhengtao, GUAN Xin, HUANG Yuxin, ZHANG Siqi, ZHAO Qingjue
    Journal of Chinese Information Processing. 2024, 38(1): 115-123.
    Sensitive information recognition refers to identifying sensitive messages on the Internet related to pornography, drugs, cults, violence and other sensitive topics. A few-shot sensitive information recognition method based on prototype network fine-tuning is proposed in this paper. The proposed method employs fast adaptation under the few-shot learning framework to bridge the domain gap between the meta-training dataset and the meta-testing dataset. Specifically, the model is trained on the general news domain in the meta-training stage with a two-stage gradient update mechanism to obtain a group of initial parameters. In the meta-testing stage, the model freezes part of its parameters and is rapidly fine-tuned on the sensitive text dataset. The experimental results show that the performance of the proposed model on the sensitive information recognition task is significantly improved compared to a strong few-shot baseline.
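    The prototypical-network step this builds on is standard; a sketch assuming already-embedded support and query sets:

      import torch

      def prototype_predict(support, support_labels, query, n_classes):
          # support: (N, d) embeddings; query: (M, d) embeddings
          protos = torch.stack([support[support_labels == c].mean(0)
                                for c in range(n_classes)])   # class prototype = support mean
          dists = torch.cdist(query, protos)                   # (M, C) Euclidean distances
          return dists.argmin(dim=1)                           # nearest-prototype label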
  • Information Extraction and Text Mining
    QU Wei, ZHOU Dong, ZHAO Wenyu, CAO Buqing
    Journal of Chinese Information Processing. 2023, 37(11): 81-90.
    Code summarization aims to automatically generate natural language descriptions of source code snippets, which facilitates software maintenance and program understanding. Recent studies have shown that popular Transformer-based methods ignore external semantic information such as API documents. Therefore, we propose an automatic code summary generation method based on an improved Transformer that integrates multiple semantic features. This method uses three independent encoders to extract multiple semantic features of the source code (text, structure, and external API documentation), and a non-parametric Fourier transform replaces the self-attention layer in the encoder, reducing the computation time and memory usage of the Transformer structure via a linear transformation. Experimental results on open datasets prove the effectiveness of the method.
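    Replacing self-attention with a parameter-free Fourier mixing layer follows the FNet recipe; a sketch (whether the paper uses exactly this variant is an assumption):

      import torch

      def fourier_mixing(x):
          # x: (batch, seq_len, hidden); mix tokens with a 2-D FFT, keep the real part
          return torch.fft.fft2(x, dim=(-2, -1)).real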
  • Ethnic Language Processing and Cross Language Processing
    XU Zehui, ZHU Jie, XU Zezhou, WANG Chao, YAN Songsi, LIU Yashan
    Journal of Chinese Information Processing. 2023, 37(11): 23-28.
    Named entity recognition is a key task in Tibetan processing. This paper proposes a cascaded BiLSTM-CRF method combining three Tibetan pre-training models (Word2Vec, ELMo, ALBERT). Cascaded Tibetan named entity recognition treats the task as two sub-tasks: entity boundary delineation and entity class determination. Experiments show that the proposed model decreases training time by 28.30% compared with the BiLSTM-CRF model, and that combining the pre-training techniques achieves better recognition results.
  • Sentiment Analysis and Social Computing
    GAO Zhun, DAN Zhiping, DONG Fangmin, ZHANG Yanke, ZHANG Hongzhi
    Journal of Chinese Information Processing. 2024, 38(2): 142-154.
    Current rumor detection research focuses on the directional characteristics of rumor propagation. To exploit the potential structural features of rumors, this paper proposes a multi-level dynamic propagation attention network (MDPAN) for rumor detection. The method learns the contributions of all connecting edges in the propagation graph through node-level attention, dynamically focusing on propagation relationships useful for identifying rumors. Graph convolutional networks extract propagation features, diffusion features, and global structural features of rumors at different levels, which are then fused via attention-based pooling. Compared with the EBGCN model on the Twitter15, Twitter16 and Weibo16 datasets, the proposed method increases overall accuracy by 2.1%, 0.7% and 1.7%, respectively.
  • Language Analysis and Calculation Model
    LI Zixuan, GUAN Saiping, JIN Xiaolong, BAI Long, GUO Jiafeng, CHENG Xueqi
    Journal of Chinese Information Processing. 2024, 38(2): 46-53.
    Temporal knowledge graphs integrate temporal information into traditional knowledge graphs, describing dynamic event knowledge as sequences of knowledge graphs with timestamps. The temporal knowledge graph reasoning task aims to predict future events based on historical event quadruples (subject entity, relation (event type), object entity, timestamp). To characterize the evolution of historical events comprehensively, this paper proposes a two-stage model, MENet (Multi-sequence Evolution Network), based on joint evolutionary modeling of multiple history sequences. In the first stage, a candidate entity selection strategy is designed via heuristic rules, effectively reducing the number of entities to be modeled. In the second stage, the model combines the long-term historical sequences of multiple entities into a graph sequence, and models the evolution of entities by capturing the structural dependencies of concurrent events, the time value information of events, and the temporal dependencies across different timestamps. Experimental results on three standard datasets show that the proposed model outperforms state-of-the-art ones.
  • Information Extraction and Text Mining
    LI Zheng, TU Gang, WANG Hansheng
    Journal of Chinese Information Processing. 2024, 38(4): 86-98,107.
    Existing research on nested named entity recognition treats the task as span classification via fine-tuned pre-trained models. This paper proposes a multi-head model based on knowledge embedding (MKE) to further improve this task. The method introduces domain-specific knowledge in the form of entity matrices, allowing background knowledge to be embedded without loss. It also transforms named entity recognition into a multi-head selection process, followed by scoring the candidate spans with an attention score model. The experimental results show that the proposed method achieves state-of-the-art performance on seven nested and flat named entity recognition datasets.
  • Sentiment Analysis and Social Computing
    YOU Peiwen, WANG Jingjing, GAO Xiaoya, LI Shoushan
    Journal of Chinese Information Processing. 2024, 38(4): 134-142.
    This paper proposes a new cross-modal speech sentiment classification task, which aims to leverage text-modal data as the source side to classify speech-modal data on the target side. This paper designs a cross-modal sentiment classification model based on knowledge distillation, which distills the prior pre-training knowledge learned by the text-modal sentiment classification model (teacher model) into the speech-modal sentiment classification model (student model). The proposed model is distinguished by its capability to analyze raw speech data directly without relying on speech recognition technology, which is crucial for large-scale deployment in practical speech emotion analysis scenarios. Experimental results show that the proposed method can effectively use the experience of text-modal sentiment classification to improve the effect of speech-modal sentiment classification.
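    The distillation objective presumably follows the standard recipe; a sketch with a temperature-softened KL term plus the usual hard-label term (the temperature T and mixing weight alpha are assumptions):

      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
          soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                          F.softmax(teacher_logits / T, dim=-1),
                          reduction="batchmean") * T * T       # soft-target (teacher) term
          hard = F.cross_entropy(student_logits, labels)       # ground-truth term
          return alpha * soft + (1 - alpha) * hard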
  • Language Analysis and Computation Model
    YAN Zhichao, LI Ru, SU Xuefeng, LI Xinjie, CHAI Qinghua, HAN Xiaoqi, ZHAO Yunxiao
    Journal of Chinese Information Processing. 2024, 38(1): 86-96.
    Frame Identification (FI), which aims to find the proper frame to activate for a target word in a given sentence, is an important prerequisite for labeling frame semantic roles. Generally, FI is regarded as a classification task, applying sequence modeling to learn the contextual representation of target words. To further capture the structural information of the target words themselves, this paper proposes a model that fuses the contextual and structural information of target words. Specifically, BERT and GCN are utilized to model the contextual information of target words of different parts of speech and the structural information of target words in PropBank roles or dependency syntax, respectively. This paper also analyzes the structural differences in the dependency information of target words with different parts of speech, and employs an ensemble learning approach to account for these differences. Experiments on the FN1.7 and CFN datasets show that our model outperforms the SOTA.
  • Sentiment Analysis and Social Computing
    CHENG Yan, HU Jiansheng, ZHAO Songhua, LUO Pin, ZOU Haifeng, FU Yan, LIU Chunlei
    Journal of Chinese Information Processing. 2024, 38(2): 155-168.
    Aspect term extraction is a core task in aspect-level sentiment analysis. Due to the scarcity of publicly available aspect datasets, this paper proposes to learn fine-grained aspect terms in low-resource domains from coarse-grained aspect categories in rich domains. To alleviate inter-domain granularity inconsistencies and feature mismatches, the paper proposes a dual memory interactive network that iterates continuously to obtain a correlation vector for each word by interacting the local memory of each word with the global aspect term and aspect category memories. This method captures the interconnections between aspect terms and aspect categories, as well as the internal correlations among aspect terms or aspect categories themselves. Experiments on the Laptop, Restaurant and Device datasets show that the proposed method outperforms multiple baseline models.
  • Information Extraction and Text Mining
    WANG Yaqiang, LI Kailun, SHU Hongping, JIANG Yongguang
    Journal of Chinese Information Processing. 2024, 38(2): 121-131.
    Four-diagnostic description extraction from clinical records has clinical application in improving the practice of traditional Chinese medicine. As the first exploration of this extraction task, we first construct a clinical four-diagnostic description extraction corpus and then fine-tune a general-domain pre-trained language model on unlabeled clinical records of traditional Chinese medicine. We train the proposed extraction model on a small labeled dataset through a well-designed batch data oversampling algorithm. The experimental results show that the proposed method outperforms the compared methods, with an average F1 improvement of 2.13% on rare classes.
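    One plausible reading of batch-level oversampling, sketched below; the paper's actual algorithm may differ, and the per-class floor is an assumption.

      import random
      from collections import defaultdict

      def oversample_batch(batch, min_per_class=4):
          """Duplicate rare-class examples until each class in the batch reaches a floor."""
          by_class = defaultdict(list)
          for example, label in batch:
              by_class[label].append((example, label))
          out = list(batch)
          for label, items in by_class.items():
              while sum(1 for _, l in out if l == label) < min_per_class:
                  out.append(random.choice(items))    # resample a rare-class example
          random.shuffle(out)
          return out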
  • Ethnic Language Processing and Cross Language Processing
    CAI Zhijie, SAN Maocuo, CAIRANG Zhuoma
    Journal of Chinese Information Processing. 2023, 37(11): 15-22.
    A test set for text proofreading evaluation is the basis of spell checking research; such test sets are either traditional or standard. A traditional proofreading test set is obtained by artificially forging errors into correct data according to subjective experience, whereas a standard test set is derived from real data and is therefore more reliable. Based on an analysis of the construction methods of English and Chinese proofreading test sets, and combined with the characteristics of the Tibetan language, this paper studies test set construction for Tibetan text proofreading and completes a standard proofreading test set with a statistical analysis of error types and their distribution. The validity and usability of the test set are verified.
  • Language Analysis and Calculation
    ZHANG Shenglong, LIU Ying, MA Yanjun
    Journal of Chinese Information Processing. 2024, 38(3): 24-32.
    Metaphor is a special phenomenon in human languages. For metaphor detection in Chinese, we propose SaGE (Syntax-aware GCN with ELECTRA), a method inspired by linguistics. SaGE utilizes ELECTRA and a Transformer encoder to extract the semantic features of a sentence, and a GCN to extract syntactic features from a graph constructed from dependency parsing results. The model concatenates the two features to detect metaphors. SaGE obtains an 85.22% macro-F1 score on the CCL 2018 Chinese Metaphor Detection Task dataset, a substantial improvement over the best previously reported score.
  • Information Retrieval
    LI Chi, YOU Xiaoyu, ZHANG Mi
    Journal of Chinese Information Processing. 2023, 37(11): 131-141.
    Graph convolutional network (GCN) based recommender models represent the state of the art in collaborative recommendation, though they suffer from high computation costs. This paper proposes a collaborative filtering recommendation model based on a decoupled graph convolutional network (DeGCF). For parameter initialization, DeGCF utilizes negative-sample-enhanced graph convolution to explicitly inject local and global graph structure features into the initial embeddings of users and items. For model training, DeGCF uses only the inner product of user and item embedding vectors as the model output, thereby decoupling graph convolution from the training process. In addition, DeGCF trains model parameters with an inverse-propensity-score reweighted loss function. Experiments on three benchmark datasets demonstrate that the proposed model not only outperforms state-of-the-art GCN models, but also achieves a more than 13x speedup over LightGCN on large-scale datasets such as Amazon-book.
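    The reweighted objective can be sketched as inverse-propensity-scored BCE; the propensity estimator below (popularity raised to a power eta) is an assumption, not the paper's exact estimator.

      import torch
      import torch.nn.functional as F

      def ips_bce(scores, labels, item_popularity, eta=0.5):
          # scores, labels, item_popularity: (B,) tensors; labels are 0/1 floats
          propensity = item_popularity.clamp(min=1).float() ** eta   # assumed estimator
          return F.binary_cross_entropy_with_logits(
              scores, labels, weight=1.0 / propensity, reduction="mean")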
  • Information Extraction and Text Mining
    ZHOU Mengjia, LI Fei, JI Donghong
    Journal of Chinese Information Processing. 2024, 38(1): 97-106.
    Dialog-level relation extraction is characterized by casual language, low information density and abundant personal pronouns. This paper proposes an end-to-end dialogue relation extraction model based on the TOD-BERT (Task-Oriented Dialogue BERT) pre-trained language model. It adopts an attention mechanism to capture the interactions between different words and different relations, and applies co-reference information related to personal pronouns to enrich entity features. Validated on DialogRE, a recent dialog-level relation extraction dataset, the proposed model reaches a 63.77 F1 score, significantly better than the baseline models.
  • Language Analysis and Calculation Model
    LIU Guang, TU Gang, LI Zheng, LIU Yijian
    Journal of Chinese Information Processing. 2024, 38(2): 15-24.
    At present, most syntactic dependency analysis is conducted via supervised learning on top of word segmentation results, a practice challenged by complex label schemes and nesting structures that are difficult to parse. This paper proposes a phrase window model together with a dependency syntax labeling rule based on the phrase window. The labeling rule divides sentences into 7 types of nestable phrases and annotates the syntactic dependencies between phrases. Inspired by the idea of object detection in computer vision, the phrase window model detects the beginning and end positions of phrases, realizing synchronous recognition of nested phrases and their syntactic dependencies. Experimental results show that on the self-built Chinese Phrase Window Dataset (CPWD), the phrase window model outperforms the traditional end-to-end model by more than 1 point. The corresponding method won first place in the CCL2018 Chinese Metaphor Sentiment Analysis Competition, improving on the baseline by more than 1 point.
  • NLP Application
    HUANG Sijia, PENG Yanbing
    Journal of Chinese Information Processing. 2024, 38(1): 146-155.
    To address such issues as the poor interpretability of current legal intelligence systems, the unsatisfactory prediction of low-frequency and confusing legal causes, and the insufficient research on civil disputes, an interpretable hierarchical legal cause prediction model (IHLCP) is proposed, taking the hierarchical dependence between legal causes as the source of interpretability. In IHLCP, the fact description is encoded by capturing the semantic differences of cases, and an improved attention-based seq2seq model is used to predict the cause path. Further, the internal text information of the cause is used to filter out noise in the fact description. Experiments show that IHLCP achieves state-of-the-art performance on three large-scale datasets: CIVIL (ACC: 91.0%, Pre: 67.5%, Recall: 57.9%, F1: 62.3%), FSC (ACC: 94.9%, Pre: 78.8%, Recall: 75.9%, F1: 77.3%) and CAIL (ACC: 92.3%, Pre: 90.9%, Recall: 89.7%, F1: 90.3%), boosting ACC and F1 by up to 6.6% and 13.4%, respectively. The results show that this model can help the system understand legal causes, make up for the shortcomings of current legal intelligence systems in predicting low-frequency and confusing causes, and improve the interpretability of the model.
  • Information Extraction and Text Mining
    WANG Runzhou, ZHANG Xinsheng, WANG Minghu
    Journal of Chinese Information Processing. 2024, 38(3): 113-129.
    The knowledge distillation technique compresses knowledge from large-scale models into lightweight models, improving the efficiency of text classification. This paper introduces a text classification model that combines a dynamic mask attention mechanism and multi-teacher, multi-feature knowledge distillation. It leverages knowledge sources from various teacher models, including Roberta and Electra, while considering semantic information across different feature layers. The dynamic mask attention mechanism adapts to varying data lengths, reducing interference from irrelevant padding. Experimental results on four publicly available datasets demonstrate that the student model (TinyBERT) distilled by the proposed method outperforms other benchmark distillation strategies. Remarkably, with only 1/10 of the teacher model's parameters and approximately half the average runtime, it achieves classification results comparable to the two teacher models, with only a marginal decrease in accuracy (4.18% and 3.33%) and F1 value (2.30% and 2.38%). The attention heat map indicates that the dynamic mask attention mechanism enhances focus on the effective information of the data.
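    The multi-teacher side reduces to blending teacher distributions before distilling into the student; a sketch (uniform weights and the temperature are assumptions):

      import torch
      import torch.nn.functional as F

      def blended_soft_targets(teacher_logits_list, weights=None, T=2.0):
          n = len(teacher_logits_list)
          weights = weights or [1.0 / n] * n                   # uniform by default
          probs = [w * F.softmax(t / T, dim=-1)
                   for w, t in zip(weights, teacher_logits_list)]
          return torch.stack(probs).sum(0)                     # target distribution for the student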
  • Sentiment Analysis and Social Computing
    LIU Ye, LIU Shixin, ZENG Xueqiang, ZUO Jiali
    Journal of Chinese Information Processing. 2024, 38(4): 120-133.
    With the rise of Internet-based social media, Emoji have become widely used in daily communication thanks to their graphical expression of emotion. Existing emotion recognition models simply convert Emoji into word vectors, without directly capturing their correlation with the target emotions. This paper proposes to construct an emotion distribution vector directly associated with the target emotions through soft labels, and to fuse the Emoji emotion distribution information with text semantic information via a pre-trained model; the method is named EIFER (Emoji emotion distribution Information Fusion for multi-label Emotion Recognition). On top of the classical binary cross-entropy loss, EIFER models the correlation between emotion labels by introducing a label-correlation-aware loss. EIFER is an end-to-end model composed of a semantic information module, an Emoji information module and a multi-loss prediction module. Experimental results on the SemEval2018 English dataset show that the proposed method outperforms existing methods.
  • Sentiment Analysis and Social Computing
    YE Shiren, DING Li, ALI MD Rinku
    Journal of Chinese Information Processing. 2024, 38(1): 124-134.
    In fine-grained sentiment and emotion analysis tasks, label correlation and imbalanced label distribution are common among samples. Inspired by circle loss in computer vision, we develop a loss function that handles these issues through gradient decay, pair optimization and margins. This loss function is easily adapted to pre-trained networks without modifying their backbone structures. Compared with the current state-of-the-art results, our loss function improves the Jaccard similarity coefficient, micro-F1, and macro-F1 by 1.9%, 2%, and 1.9%, respectively, on the SemEval18 dataset, and by 2.6%, 1.9%, and 3.6%, respectively, on the GoEmotions dataset.
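    A circle-loss-style multi-label objective in this spirit can be written with two log-sum-exp terms around a zero threshold, so gradients decay as pairs become well separated; whether this matches the paper's exact variant is an assumption.

      import torch

      def multilabel_circle_loss(logits, labels):
          # logits: (B, C); labels: (B, C) in {0, 1}
          pos = torch.where(labels.bool(), -logits, torch.full_like(logits, -1e12))
          neg = torch.where(labels.bool(), torch.full_like(logits, -1e12), logits)
          zeros = torch.zeros_like(logits[:, :1])              # the zero-threshold term
          pos_loss = torch.logsumexp(torch.cat([pos, zeros], dim=-1), dim=-1)
          neg_loss = torch.logsumexp(torch.cat([neg, zeros], dim=-1), dim=-1)
          return (pos_loss + neg_loss).mean()                  # positives above 0, negatives below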
  • Information Extraction and Text Mining
    PENG Shiya, LIU Chang, YU Dong, DENG Yayue
    Journal of Chinese Information Processing. 2024, 38(2): 132-141,154.
    Compared with English, the study of textual moral identification for Chinese is less developed. Due to differences in theory and modes of thinking, the study of moral recognition cannot be transferred from English to Chinese directly. To address these issues, this paper proposes the task of Chinese moral sentence recognition. We first construct a large Chinese moral sentence dataset on the scale of 100,000 sentences using manual annotation and a machine-assisted approach. Then we apply several popular machine learning methods to the task, and use external knowledge to further improve performance.