Journal of Chinese Information Processing

Survey

A Survey on Graph Contrastive Learning

CEN Keting, SHEN Huawei, CAO Qi, CHENG Xueqi

Journal of Chinese Information Processing. 2023, 37(5): 1-21.

Abstract (1516) PDF (1010)

As a self-supervised deep learning paradigm, contrastive learning has achieved remarkable results in computer vision and natural language processing. Inspired by this success, researchers have extended contrastive learning to graph data, driving the development of graph contrastive learning. To provide a comprehensive overview of graph contrastive learning, this paper summarizes recent works under a unified framework to highlight the development trends. It also catalogues the popular datasets and evaluation metrics for graph contrastive learning, and concludes with possible future directions for the field.
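For readers unfamiliar with the contrastive objective, the sketch below shows one common formulation, the InfoNCE loss, for a single anchor embedding. It is an illustration only, not the framework used in the survey: the function name `info_nce`, the cosine similarity, and the temperature value are assumptions chosen for the example. In a graph setting the anchor and positive would typically be embeddings of the same node or graph under two augmented views.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor vector (illustrative sketch).

    anchor, positive: lists of floats (embeddings of two views).
    negatives: list of embeddings treated as negative samples.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit
```

The loss is small when the anchor is closer to its positive than to the negatives, and grows as the positive becomes harder to distinguish from the negatives.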

Survey

Review on Session-based Recommendation Methods

CHEN Jinpeng, LI Haiyang, ZHANG Fan, LI Huan, WEI Kaimin

Journal of Chinese Information Processing. 2023, 37(3): 1-17,26.

Abstract (607) PDF (707)

In recent years, session-based recommendation methods have attracted extensive attention from academia. With the continuous development of deep learning techniques, different model structures have been used in session-based recommendation methods, such as recurrent neural networks, attention mechanisms, and graph neural networks. This paper conducts a detailed analysis, classification, and comparison of these models, and expounds on the target problems and shortcomings of these methods. In particular, this paper first compares session-based recommendation methods with traditional recommendation methods, and sets out the main advantages and disadvantages of session-based recommendation methods. Subsequently, this paper details how complex data and information are modeled in session-based recommendation models, as well as the problems that these models can solve. Finally, this paper identifies and discusses the challenges and potential research directions in session-based recommendation.

Survey

Evaluating Large Language Models: A Survey of Research Progress

LUO Wen, WANG Houfeng

Journal of Chinese Information Processing. 2024, 38(1): 1-23.

Abstract (515) PDF (593)

Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks, offering the potential to achieve general language intelligence. However, their expanding application necessitates more accurate and comprehensive evaluations. Existing evaluation benchmarks and methods still have many shortcomings, such as unreasonable evaluation tasks and uninterpretable evaluation results. With increasing attention to robustness, fairness, and related properties, the demand for holistic, interpretable evaluations is pressing. This paper delves into the current landscape and challenges of LLM evaluation, summarizes existing evaluation paradigms, analyzes their limitations, introduces pertinent evaluation metrics and methodologies for LLMs, and discusses the ongoing advancements and future directions in the evaluation of LLMs.

Information Extraction and Text Mining

Character Relation Extraction from Chinese Literature

CAO Biwei, CAO Jiuxin, GUI Jie, TAO Rui, GUAN Xin, GAO Qingqing

Journal of Chinese Information Processing. 2023, 37(5): 88-100.

Abstract (345) PDF (376)

Entity relation extraction aims to extract structured relation triples between entities from unstructured or semi-structured natural language texts. Character relation extraction is a finer-grained branch of entity relation extraction. Focusing on character relation extraction in Chinese literature, we present MF-CRC, a character relation extraction model. We first introduce an adversarial learning framework to build a sentence-level noise classifier that filters noise from the dataset. Then BERT and BiLSTM are employed, and feature representations of Chinese surnames, gender, and relation are designed. The character relation extraction model is finally established by integrating these multi-dimensional features. Experiments on three Chinese classics show that the proposed method outperforms SOTA models by 1.92% and 2.14% in micro-F₁ and macro-F₁, respectively.
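Micro-F₁ and macro-F₁ average per-class results in different ways: micro-F₁ pools the true positives, false positives, and false negatives of all classes before computing one score, while macro-F₁ computes F₁ per class and takes the unweighted mean. The sketch below illustrates the difference for single-label classification. The function names are hypothetical and nothing here is taken from the paper's evaluation code.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label lists."""
    labels = sorted(set(gold) | set(pred))
    counts = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        counts[label] = (tp, fp, fn)

    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)

    # Micro: pool the counts over all classes, then compute a single F1.
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    return micro, macro
```

With made-up labels `gold = ["friend", "friend", "friend", "spouse"]` and `pred = ["friend", "friend", "spouse", "spouse"]`, the per-class F₁ scores are 0.8 and about 0.667, so macro-F₁ is about 0.733, while the pooled counts give a micro-F₁ of 0.75.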

Information Extraction and Text Mining

A Multi-granularity Approach to Named Entity Recognition

SUN Hong, WANG Zhe

Journal of Chinese Information Processing. 2023, 37(3): 123-134.

Abstract (340) PDF (321)

Current named entity recognition algorithms typically rely on word enhancement, introducing external vocabulary information to determine word boundaries. This paper proposes a multi-granularity information fusion strategy for named entity recognition. By encoding each word component in Chinese characters with attention to the word sequence, the model is able to capture Chinese glyph information. The experimental results on multiple named entity recognition datasets show that the algorithm has clear advantages in model accuracy and inference speed.

Sentiment Analysis and Social Computing

Multimodal Humor Detection Based on Attention Mechanism

WU Jiaming, LIN Hongfei, YANG Liang, XU Bo

Journal of Chinese Information Processing. 2023, 37(5): 135-142,172.

Abstract (379) PDF (293)

Current humor detection is focused on textual humor recognition rather than carrying out this task on multimodal data. This paper proposes a modal fusion approach to humor detection based on the attention mechanism. Firstly, the model encodes each single-modal context to obtain the feature vector, and then the hierarchical attention mechanism is applied on feature sequences to capture the correlation of multi-modal information in the paragraph context. Tested on the UR-FUNNY public data set, the proposed model achieves an improvement of 1.37% in accuracy compared to the previous best result.

Information Extraction and Text Mining

Chinese Named Entity Recognition with Few Labeled Data

ZHANG Yun, HUANG Cheng, ZHANG Yuyao, HUANG Jingwei, ZHANG Yude, HUANG Liya, LIU Yan, DING Keke, WANG Xiumei

Journal of Chinese Information Processing. 2023, 37(3): 101-111.

Abstract (364) PDF (287)

The lack of training data is a typical problem in named entity recognition today. To apply the TMN model, which requires labelled triggers, to Chinese, a new automatic annotation method, GLDM-TMN, is proposed. This method introduces the Mogrifier LSTM structure, the Dice loss function, and various attention mechanisms to enhance the accuracy of trigger matching and entity annotation. Simulated experiments on two publicly available datasets show that GLDM-TMN improves the F₁ value by 0.0133 to 0.034 over the TMN model with the same small amount of labeled data. Meanwhile, the proposed method with 20% of the training data outperforms the BiLSTM-CRF model with 40% of the training data.

Information Extraction and Text Mining

A GCN-based Approach to Entity Relation Extraction from Multi-party Dialogues

WANG Qiqi, LI Peifeng

Journal of Chinese Information Processing. 2023, 37(5): 80-87.

Abstract (328) PDF (275)

In contrast to existing relation triple extraction, which focuses on written texts, this paper proposes a GCN (Graph Convolutional Network) based approach to model dialogue scenarios. Compared with the entity relations in written text, those in dialogues emphasize the relationships among humans and are more colloquial. To address this issue, our method regards dialogue sentences as nodes, and assigns weighted edges between sentences according to sentence distance. With the dialogue scene graph constructed in this way, we then apply GCN to model the relationships between dialogue sentences. Experimental results on DialogRE show that our model outperforms the existing state-of-the-art baselines.
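The graph construction described above can be sketched roughly as follows. The abstract does not give the edge-weight formula or the GCN variant, so everything specific here is an assumption made for illustration: the weight 1 / (1 + distance), the `window` cutoff, the row normalisation (a simplification of the symmetric normalisation often used in GCNs), and the function names.

```python
def distance_weighted_adjacency(num_sentences, window=2):
    """Adjacency matrix over dialogue sentences (hypothetical weighting).

    Sentences i and j are linked when |i - j| <= window, with weight
    1 / (1 + |i - j|); the diagonal therefore holds self-loops of weight 1.
    """
    adj = [[0.0] * num_sentences for _ in range(num_sentences)]
    for i in range(num_sentences):
        for j in range(num_sentences):
            distance = abs(i - j)
            if distance <= window:
                adj[i][j] = 1.0 / (1 + distance)
    return adj


def gcn_layer(adj, features, weight):
    """One propagation step: ReLU(row_normalised(adj) @ features @ weight)."""
    n = len(adj)
    in_dim = len(features[0])
    out_dim = len(weight[0])
    output = []
    for i in range(n):
        row_sum = sum(adj[i])
        # Weighted average of neighbouring sentence features.
        aggregated = [
            sum(adj[i][j] / row_sum * features[j][k] for j in range(n))
            for k in range(in_dim)
        ]
        # Linear transform followed by ReLU.
        output.append([
            max(0.0, sum(aggregated[k] * weight[k][m] for k in range(in_dim)))
            for m in range(out_dim)
        ])
    return output
```

Nearby sentences contribute more to each node's updated representation than distant ones, which is the intuition behind weighting edges by sentence distance.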

Ethnic Language Processing and Cross Language Processing

Advances in Hindi Natural Language Processing

WANG Lianxi, LIN Nankai, JIANG Shengyi, DENG Zhiyan

Journal of Chinese Information Processing. 2023, 37(5): 53-69.

Abstract (229) PDF (255)

Compared with Western languages, Hindi, a major language of South Asia, is a low-resource language. Due to the lack of corpora, annotation specifications, and computational modeling practices, Hindi natural language processing has not been well studied. This paper reviews the research progress in Hindi natural language processing in terms of resource construction, part-of-speech tagging, named entity recognition, syntactic analysis, word sense disambiguation, as well as information retrieval, machine translation, sentiment analysis, and automatic summarization. This paper also identifies the issues and challenges in Hindi natural language processing, and outlines future development trends.

Information Extraction and Text Mining

Medical Entity Standardization Method Based on Collaborative Ensemble Learning

JIANG Jingchi, HOU Junyi, LI Xue, GUAN Yi, GUAN Changhe

Journal of Chinese Information Processing. 2023, 37(3): 135-142.

Abstract (312) PDF (247)

Medical entity standardization aims to map non-standardized terms in texts (e.g. electronic medical records and patient complaints) to unified and standardized medical entities. Given the small scale and poor standardization of annotated corpora for medical texts, this paper proposes a multi-model collaborative ensemble learning framework for medical entity standardization. By establishing a "cooperation and competition" mechanism among multiple models, we can combine the advantages of different standardization methods at the character level and the semantic level. Specifically, collaborative learning implemented through knowledge distillation can extract effective features from each model. The diversity of candidate sets is ensured by integrating the entity standardization results of each model in a competition-aware manner. In the CHIP-CDN 2021 task of medical entity standardization, the proposed method achieved an F₁ value of 73.985% on the blind test set, ranking second among 255 teams including Baidu BDKG, Ant-Financial Antins, and AISpeech. Experimental results also show that this method can effectively standardize terms in medical texts.

Information Extraction and Text Mining

Chinese Named Entity Recognition Based on Lexicon and Glyph Features

YU Shujuan, MAO Xintao, ZHANG Yun, HUANG Liya

Journal of Chinese Information Processing. 2023, 37(3): 112-122.

Abstract (349) PDF (237)

Named entity recognition is a fundamental task of natural language processing. Lexicon-based methods are a popular approach to enhancing the representation of semantic and boundary information for Chinese named entity recognition. To utilize glyphs, which contain rich entity information, we propose a novel Chinese named entity recognition model based on lexicon and glyph features. Specifically, the model enriches the semantic information through SoftLexicon and optimizes character representation through an improved radical-level embedding, which is fed into a gated convolutional network. The experiments on four benchmark datasets show that the proposed model achieves significant improvements over existing models.

Machine Translation

Automatic Evaluation of Neural Machine Translation Based on Multiple Information Fusion

LIU Yuan, LI Maoxi, XIANG Qingyu, LI Yihan

Journal of Chinese Information Processing. 2023, 37(3): 89-100.

Abstract (330) PDF (234)

Machine translation evaluation plays an important role in promoting the development and application of machine translation. The latest neural methods of evaluating machine translation use pretrained contextual embeddings to extract different deep semantic features, and then simply concatenate them and feed them into a multi-layer neural network to predict translation quality. We propose to introduce middle-stage information fusion and late-stage information fusion into the evaluation of machine translation. More specifically, we propose to use embrace fusion to interactively fuse different features in the middle stage. In the late stage, we fuse sentence mover's distance and sentence cosine similarity based on fine-grained accurate matching. Experimental results on the WMT'21 Metrics Task show that the proposed method achieves performance competitive with the best metrics in the evaluation campaign.

Language Analysis and Calculation

Enhancing Paraphrase Identification by Injecting Specific Domain Knowledge

LI Zhifeng, BAI Yan, HONG Yu, LIU Dong, ZHU Mengmeng

Journal of Chinese Information Processing. 2023, 37(3): 18-26.

Abstract (368) PDF (220)

Paraphrase identification is the task of deciding whether two sentences express the same meaning. It is relatively easy for general-domain paraphrase identification to understand and judge the relationship between two sentences. To improve paraphrase identification in specific domains, we propose a paraphrase identification method based on domain knowledge fusion. We retrieve knowledge from the knowledge base and integrate it into the model. Experiments on the PARADE dataset (in the computer science domain) show that our method reaches a 73.9% F₁ score, outperforming the baseline by 3.1%.

Information Extraction and Text Mining

An End-to-End Joint Extraction of Entity and Relation Based on MLPs with Gating

JIA Baolin, YIN Shiqun, WANG Ningchao

Journal of Chinese Information Processing. 2023, 37(3): 143-151.

Abstract (321) PDF (207)

Extracting entities and relations from unstructured text has become a crucial task in natural language processing. We propose an end-to-end joint entity and relation extraction model based on the SGM module. In our model, word-level and character-level embeddings are passed to the SGM module to obtain an efficient semantic representation. Then we employ span-attention to fuse the contextual information and sentence-level information to obtain the specific span representation. Finally, we use a fully connected layer to classify the entities and relations. Without introducing other complicated external features, this model obtains rich semantics and takes full advantage of the association between entity and relation. The experimental results show that on the NYT10 and NYT11 datasets, the F₁ of the proposed model in the relation extraction task reaches 70.6% and 68.3%, respectively, which is much better than other models.

Language Analysis and Calculation

GAT: Global-Based Adversarial Training for Natural Language Understanding

CAI Kunzhao, ZENG Biqing, CHEN Pengfei

Journal of Chinese Information Processing. 2023, 37(3): 27-35.

Abstract (317) PDF (191)

In natural language processing, gradient-based adversarial training is an effective method to improve the robustness of neural networks. This paper proposes an initialization strategy based on a global perturbation vocabulary to deal with the low efficiency of existing adversarial training algorithms, improving the efficiency of training neural networks while ensuring the effectiveness of the initialized perturbations. To keep tokens independent and avoid training being dominated by a few samples, we propose a normalization strategy based on global equal weighting. Finally, we propose a multifaceted perturbation strategy to improve the robustness of pretrained language models. The experimental results show that the strategies can effectively improve the performance of neural networks.

Information Retrieval

Incorporating Knowledge Propagation and Prompt Learning for Recommendation

HUANG Sisi, KE Wenjun, ZHANG Hang, FANG Zhi, YU Zengwen, WANG Peng, WANG Qingli

Journal of Chinese Information Processing. 2023, 37(5): 122-134.

Abstract (318) PDF (191)

The data sparsity issue in recommendation can be alleviated by including explicit information from a knowledge graph. Most existing knowledge graph-based methods capture user behaviors solely through entity relationships, ignoring the implicit cues between users and the items to recommend. To this end, this paper proposes a novel recommendation approach incorporating the knowledge graph and prompt learning. In particular, the knowledge graph is employed to propagate user preferences and produce corresponding dynamic behaviors. The implicit insights absent from the knowledge graph are absorbed by feeding a pre-trained language model (PLM) with static user features under the prompt learning setting. Finally, the template probability within the PLM vocabulary is taken as the probability of the recommendation. Experiments on the MovieLens-1M, Book-Crossing, and Last.FM datasets show that our technique outperforms state-of-the-art baselines by 6.4%, 4.0%, and 3.6% in AUC, and 6.0%, 1.8%, and 3.2% in F₁ value, respectively.

Language Analysis and Calculation

Lexical Substitution Based on Paraphrase Modeling

QIANG Jipeng, CHEN Yu, LI Yang, LI Yun, WU Xindong

Journal of Chinese Information Processing. 2023, 37(5): 22-31,43.

Abstract (243) PDF (178)

Lexical substitution (LS) aims at finding an appropriate substitute for a target word in a sentence. In contrast to BERT-based LS, this paper proposes a method that generates substitution candidates based on paraphrasing, in order to utilize existing large-scale paraphrase corpora, which contain a large number of word substitution rules. Specifically, we first employ a paraphrase dataset to train a neural paraphrase model. Then, we propose a special decoding method that focuses only on the variation of the target word to extract substitute candidates. Finally, we rank the substitute candidates using text generation evaluation metrics to choose the most appropriate substitution without modifying the meaning of the original sentence. Experimental results show that, compared with existing state-of-the-art methods, our proposed method achieves the best results on two widely used benchmarks (LS07 and CoInCo).

Language Resources Construction

Review of the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff

LI Bin, YUAN Yiguo, LU Jingya, FENG Minxuan, XU Chao, QU Weiguang, WANG Dongbo

Journal of Chinese Information Processing. 2023, 37(3): 46-53,64.

Abstract (399) PDF (172)

Automatic word segmentation and part-of-speech tagging of ancient texts are the basic tasks of ancient Chinese information processing. The lack of large-scale vocabularies and annotated corpora has slowed the development of ancient Chinese processing technology. This paper summarizes the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff, which provides a manually annotated corpus as unified training data, together with a basic test set and a blind test set. The bakeoff also distinguishes open and closed test modes according to whether external resources are used. The bakeoff was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), in the context of the 13th Edition of the Language Resources and Evaluation Conference (LREC). A total of 14 teams participated in the bakeoff. On the basic test set, the F₁ scores of word segmentation and POS tagging reach 96.16% and 92.05%, respectively, in the closed test, and 96.34% and 92.56%, respectively, in the open test. On the blind test set, the F₁ scores of word segmentation and POS tagging reach 93.64% and 87.77%, respectively, in the closed test, and 95.03% and 89.47%, respectively, in the open test. Out-of-vocabulary words remain a barrier to ancient Chinese lexical analysis, while deep learning and pre-trained models effectively improve the performance of automatic ancient Chinese processing.

Multimodal Natural Language Processing

Question Recommendation Method Based on Multimodal Semantic Analysis

WANG Shijin, WANG Chengcheng, ZHANG Dan, WEI Si, WANG Yuan

Journal of Chinese Information Processing. 2023, 37(5): 165-172.

Abstract (233) PDF (163)

The multi-source, diverse, and multimodal nature of educational resources poses enormous challenges for educational resource recommendation. To address this issue, this paper proposes a method that recommends practice questions based on multimodal semantic analysis. First, we extract the multimodal features and the semantic relationships between different modalities to construct a representation structure of multimodal educational resources. Then, we model the knowledge map with an algorithm pre-trained on multimodal video features and question features. Finally, fine-tuned on pre-collected video-question features, the model can extract more robust feature representations to recommend practice questions that are highly related to the lecture videos. Experiments show that this method outperforms current methods.

Question-answering and Dialogue

Question Answering in Reading Comprehension of College Entrance Examination Based on Heterogeneous Graph Neural Network

YANG Zhizhuo, LI Moqian, ZHANG Hu, LI Ru

Journal of Chinese Information Processing. 2023, 37(5): 101-111.

Abstract (220) PDF (156)

Question answering for college entrance examination reading comprehension has been an important challenge in reading comprehension in recent years. This paper proposes an answer sentence extraction model based on a heterogeneous graph neural network. Rich relationships (frame semantics and discourse topic relationships) between nodes (sentences and words) are introduced into the graph neural network. Questions can therefore interact with candidate answer sentences through word nodes as well as through frame semantics and discourse topic relationships. The results show that the proposed model outperforms the baseline model, achieving an F₁ value of 78.08%.

Knowledge Representation and Acquisition

Triple-view Hyper-relational Knowledge Graph for Hypertension

XIE Xiaoxuan, E Haihong, KUANG Zemin, TAN Ling, ZHOU Gengxian, LUO Haoran, LI Jundi, SONG Meina

Journal of Chinese Information Processing. 2023, 37(3): 65-78.

Abstract (328) PDF (155)

Traditional knowledge modeling methods have long been plagued by the high complexity of hypertension knowledge, which triples alone cannot represent accurately. In this paper, we propose a Triple-view Hypertension Hyper-relational Knowledge Graph (THH-KG). It builds a three-layer graph architecture containing a calculation layer, a concept layer, and an instance layer, on the basis of which the joint expression of multiple medical logic rules, conceptual knowledge, and patient knowledge is realized. Additionally, we propose a general method for storing hyper-relational knowledge graphs in common graph databases, on which a Hypertension Knowledge Graph Reasoning Engine (HKG-RE) is also established. Results of a medication decision experiment show a 97.2% positive rate among 108 patients with hypertension.

Sentiment Analysis and Social Computing

Aspect Level Sentiment Analysis Based on Syntactic Structure and Mixed Attention Mechanism

LI Weijiang, WU Yuchen

Journal of Chinese Information Processing. 2023, 37(5): 143-156.

Abstract (241) PDF (150)

Current methods of aspect-level sentiment classification are mostly based on recurrent neural networks or single-layer attention mechanisms, while ignoring the influence of location information on the sentiment polarity of specific aspect words. This paper proposes an aspect-level sentiment analysis model based on syntactic structure and a mixed attention mechanism. It takes the position vector based on the syntactic structure tree as auxiliary information, and adopts mixed attention to extract the sentiment polarity of a sentence with respect to a given aspect word. Specifically, it constructs a positional attention mechanism and an interactive multi-head attention mechanism to obtain semantic information related to the sentence and the aspect words, respectively. Experiments on the Restaurant and Laptop datasets from SemEval 2014 and on the ACL14 Twitter dataset show that, in most cases, the model performs better than the related baseline models and can effectively identify the sentiment polarity of different aspects.

Computational Argumentation

Journal of Chinese Information Processing. 2023, 37(10): 106-107.

Abstract (118) PDF (138)

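Argumentation takes the human process of logical reasoning as its object of study and is a research field spanning logic, philosophy, language, rhetoric, computer science, and education. In recent years, argumentation research has attracted the attention of computational linguists and has given rise to a new research field, computational argumentation. Researchers seek to combine human cognitive models of logical argument with computational models in order to improve the automated reasoning ability of artificial intelligence. According to the number of participants in the argumentation process, research in computational argumentation falls into two categories: monological argumentation and dialogical argumentation. Monological argumentation studies argumentative texts with a single participant, such as argumentative essays and keynote speeches. Related research problems include argument unit detection, argument structure prediction, argumentation strategy classification, and essay scoring. Dialogical argumentation studies the argumentation process in which views on a specific topic are exchanged, usually among multiple participants. Related research problems include argumentation outcome prediction, interactive argument pair extraction, and argumentation logic chain extraction.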

Survey

A Survey of Non-Autoregressive Neural Machine Translation

CAO Hang, HU Chi, XIAO Tong, WANG Chenglong, ZHU Jingbo

Journal of Chinese Information Processing. 2023, 37(11): 1-14.

Abstract (125) PDF (133)

Most current machine translation systems adopt autoregressive decoding, which leads to low inference efficiency. The non-autoregressive method significantly improves inference speed through parallel decoding, attracting increasing research interest. We conduct a systematic survey of recent efforts to narrow the translation quality gap between Non-Autoregressive Machine Translation (NART) and Autoregressive Machine Translation (ART). We categorize NART methods by how they capture the dependencies of target sequences. We also discuss the challenges of NART research.

Ethnic Language Processing and Cross Language Processing

TASSM_BS: Tibetan Automatic Sentence Segmentation Method Based on Bi-LSTM and Self-Attention

CAI Rangsanzhi, Dolha, GESANG Duojie, LOUSANG Gadeng, RENZENG Duojie

Journal of Chinese Information Processing. 2023, 37(5): 44-52.

Abstract (227) PDF (133)

Sentence boundary identification is an essential task in natural language processing. Because of issues such as concurrent ending words and data sparsity, the existing Tibetan sentence boundary identification methods based on dictionaries or statistical models are less effective. This paper proposes an automatic Tibetan sentence boundary identification method based on Bi-LSTM and Self-Attention. Experiments reveal that this method outperforms other methods, achieving 97.7%, 98.06%, and 97.88% in terms of macro accuracy, macro recall, and macro F₁, respectively. The experimental results also demonstrate that front-end truncation for fixed sentence length and skip-gram syllable word representations are more effective.

Natural Language Understanding and Generation

Profession Oriented Text Generation Based on Reference Specifications

HU Yu, WANG Jian, SUN Yuqing

Journal of Chinese Information Processing. 2023, 37(3): 152-163.

Abstract (291) PDF (130)

Reference specifications are text descriptions of professional knowledge points, which are used to guide text generation. In this paper, we propose a profession-oriented text generation model based on an adversarial architecture (PT-GAN), using several independent generators for texts with different degrees of matching to the knowledge points. Each generator is an auto-encoder, where the encoder is used to extract the features of reference specifications, and the decoder is used to generate text. We use two discriminators to guide the text generation with respect to both linguistic norms and professional knowledge. The linguistic discriminator guides coherence, and the profession discriminator controls professional attributes. Experiments on national professional qualification examination datasets show that the proposed model significantly improves over other methods in coherence, relevance to reference specifications, and matching of knowledge points.

Natural Language Understanding and Generation

Generating SQL Statement from Chinese Query Based on Dual Learning

ZHAO Zhichao, YOU Jinguo, HE Peilei, LI Xiaowu

Journal of Chinese Information Processing. 2023, 37(3): 164-172.

Abstract (328) PDF (117)

To address the need of current Chinese NL2SQL (natural language to SQL) methods for large amounts of annotated data, this paper introduces a dual learning NL2SQL model, DualSQL, which uses weakly supervised learning on a small amount of training data to generate SQL statements from Chinese queries. Specifically, two dual tasks, natural language to SQL and SQL to natural language, are trained simultaneously, so that the model learns the dual constraints between the tasks and obtains more relevant semantic information. To verify the effectiveness of dual learning on the NL2SQL parsing task, we use different proportions of unlabeled data during training. Experimental results show that the percentage accuracy of the proposed model is increased by at least 2.1% compared with the benchmark models such as Seq2Seq, Seq2Tree, Seq2SQL, SQLNet, -dual etc., in different Chinese and English datasets including ATIS, GEO, and TableQA, and execution accuracy by at least 5.3% on the Chinese TableQA dataset. Further, we show that using only 60% of the labelled data can achieve effects similar to those of supervised learning with 90% of the labelled data.

Machine Translation

A Semantic Connection Enhanced Cross-language Pre-trained Model for MT Quality Estimation

YE Heng, GONG Zhengxian

Journal of Chinese Information Processing. 2023, 37(3): 79-88.

Abstract (275) PDF (111)

Quality Estimation (QE) of Machine Translation (MT) automatically estimates the quality of MT outputs without references. Due to the lack of manually annotated data, current QE systems with neural network architectures still have problems in automatically detecting translation errors. To utilize the vast amount of unlabeled parallel data, this paper proposes a translation knowledge transfer method. First, the cross-lingual pre-trained model XLM-R is used to construct a neural quality estimation baseline system; then we propose three pre-training strategies to enhance the bilingual semantic connection ability of XLM-R. The proposed method reaches new SOTA performance on both the WMT2017 and WMT2019 quality estimation datasets.

Information Extraction and Text Mining

An Approach for Table Classification in Long Financial Disclosures

LUO Xiaoqing, JIA Wang, LI Jiajing, YAN Hongfei, FENG Ke

Journal of Chinese Information Processing. 2023, 37(5): 70-79.

Abstract (242) PDF (107)

To address the challenging issue of table acquisition in long financial disclosures, this paper proposes a context feature fusion approach. A table classification dataset is first constructed by preprocessing these long financial disclosures and extracting tables with their contexts in the document. Then different multiscale Convolutional Neural Networks (CNNs) are used for feature extraction according to the characteristics of table information and context information. Compared with the baseline experiments, the Micro-F₁ and Macro-F₁ scores improve by over 0.37% and 1.24%, respectively.

Sentiment Analysis and Social Computing

Automatic Classification of Illegal Business Types for Clues to Case Sources

FAN Qin, LI Bing, WEN Liqiang, LI Weiping

Journal of Chinese Information Processing. 2023, 37(5): 157-164.

Case source clue management is the initial step in industrial and commercial administration and law enforcement. To deal with the sharply increasing number of case source clues, this paper explores deep learning models for the automatic recognition of illegal business types. After model selection and empirical research, the overall classification accuracy meets actual business needs. The experiment on data from a first-tier city shows that the proposed model can effectively realize the automatic classification of case source clues.

Knowledge Representation and Acquisition

Knowledge Representation Combining Quaternion Path Integration and Atrous Circular Convolution

CHEN Xinyuan, ZHOU Zhongmei, CHEN Qingqiang, GAO Meichun, SHI Daya

Journal of Chinese Information Processing. 2023, 37(3): 54-64.

Knowledge models endeavor to improve representation and feature extraction capabilities in order to model complex relation patterns in knowledge graphs. Existing approaches based on hypercomplex embeddings do not utilize path semantics between entity pairs. This paper proposes a fast computation method, which treats the merging of quaternion relation sequences between entity pairs as a multiple rotational blending problem, and adopts the attention mechanism to integrate the path semantics. Then, an atrous circular convolution framework is set up for better feature extraction. Experiments including Link Prediction and Path Query are conducted on benchmark datasets to demonstrate the advantage of our model over state-of-the-art models like Rotate3D.

Question-answering and Dialogue

A Machine Reading Comprehension Method Guided by Long-Short Answers Classification

YANG Jianxi, XIANG Fangyue, LI Ren, LI Dong, JIANG Shixin, ZHANG Luyi, XIAO Qiao

Journal of Chinese Information Processing. 2023, 37(5): 112-121.

Existing machine reading comprehension models are deficient in capturing the boundary information of the answer, leading to incomplete long answers and redundant short answers. This paper proposes a strategy that guides machine reading comprehension through classification of answer length features. With the question and the document encoded by the RoBERTa_wwm_ext pre-trained model, the questions are classified according to the predicted length of the answer. The result of the question classification is used to guide the answer prediction module in reading comprehension, where the beginning and end positions of all answers are finally obtained by multi-task learning. Compared with the baseline models, the experimental results on the CMRC2018 dataset, the self-built Chinese bridge inspection question-answering dataset and the traditional Chinese dataset DRCD all confirm the superior performance of the proposed method in terms of both EM value and F₁ value.
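To make the two metrics concrete, here is a minimal sketch of exact match (EM) and a character-overlap F₁ for a single predicted answer span. It assumes a SQuAD-style bag-of-characters overlap with whitespace stripping only; the official CMRC2018 and DRCD evaluation scripts may normalize and match text differently, and the function names and example strings are hypothetical.

```python
from collections import Counter


def exact_match(pred, gold):
    """1.0 if the predicted span equals the gold span, else 0.0."""
    return float(pred.strip() == gold.strip())


def char_f1(pred, gold):
    """Character-overlap F1 between a predicted and a gold answer span."""
    pred_chars, gold_chars = list(pred.strip()), list(gold.strip())
    # Multiset intersection counts each shared character at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


# An incomplete long answer: the prediction stops before the gold boundary.
gold_answer = "2018年5月"
pred_answer = "2018年"
print(exact_match(pred_answer, gold_answer), round(char_f1(pred_answer, gold_answer), 4))
```

A truncated prediction like this one scores zero on EM but keeps partial credit on F₁, which is why boundary errors show up differently in the two metrics.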

Language Analysis and Calculation

Brain Mechanism of Speech Comprehension in Complex Sound Environments

GENG Libo, XUE Zixuan, CAI Wenpeng, ZHAO Xinyu, MA Yong, YANG Yiming

Journal of Chinese Information Processing. 2023, 37(5): 32-43.

By means of ERPs, this paper explores the neural mechanism of semantic processing under information masking by comparing the processing of Chinese sentences in a quiet condition, a white noise condition, a Chinese noise condition and an English noise condition. It is found that the waveforms of N400, LPC and other ERPs induced by the different noise conditions differ, which provides evidence for several conclusions. Firstly, the language information in speech masking occupies the cognitive and psychological resources required for processing the target sound, and this resource competition reduces the listener's ability to identify the target signals, resulting in information masking in the form of language interference. Secondly, the speech intelligibility of the masker plays a more critical role for difficult semantic processing in speech masking. The masking effect on semantic processing is smaller when the masking language is either very familiar or completely unfamiliar, while the masking effect may be stronger when the masking noise is a non-native language to which the listener has been exposed. Finally, listener-comprehensible semantic content contained in unfamiliar speech noise that appears less frequently is more likely to trigger a shift in the listener's attention if it conflicts with the listener's expectations, which, in turn, increases the intensity of information masking.

Information Extraction and Text Mining

Document-level Named Entity Recognition for Literary Texts

JIA Yuxiang, CHAO Rui, ZAN Hongying, DOU Huayi, CAO Shuai, XU Shuo

Journal of Chinese Information Processing. 2023, 37(11): 100-109.

Named entity recognition is essential to the intelligent analysis of literary works. We annotate over 50 thousand named entities of four types in about 1.8 million words from two of Jin Yong’s novels. According to the characteristics of novel text, this paper proposes a document-level named entity recognition model with a dictionary to record the historical state of Chinese characters. We use confidence estimation to fuse the BiGRU-CRF and Transformer models. The experimental results show that the proposed method can effectively improve the performance of named entity recognition.

Sentiment Analysis and Social Computing

Interpretable Sentiment Analysis Based on UIE

ZHU Jie, LIU Suwen, LI Junhui, GUO Lifan, ZENG Haifeng, CHEN Feng

Journal of Chinese Information Processing. 2023, 37(11): 151-157.

Interpretable sentiment analysis aims to judge the polarity of a text and, at the same time, give evidence for its judgements or predictions. Most existing sentiment analysis methods are black-box models, and interpretability evaluation is still a problem to be solved. This paper proposes an interpretable sentiment analysis method based on UIE. According to the characteristics of interpretable sentiment tasks, this method uses techniques such as few-shot learning and text clustering to improve the rationality and faithfulness of the model. The experimental results show that this method won first place in the task of “2022 language and intelligent technology competition: sentiment interpretable evaluation”.

Information Extraction and Text Mining

Generative Biomedical Event Extraction Based on Controllable Decoding

SU Fangfang, LI Fei, JI Donghong

Journal of Chinese Information Processing. 2023, 37(11): 68-80.

This paper presents a generative biomedical event extraction model based on the framework of the pre-trained language model T5, which allows the joint modeling of the three subtasks of trigger recognition, relation extraction and argument combination. The model employs a trie-based constrained decoding algorithm, which regulates sequence generation and reduces the search space for argument roles. Finally, a curriculum learning algorithm is used in training, which familiarizes T5 with biomedical corpora and with hierarchically structured events. The model obtains a 62.40% F₁-score on Genia 2011 and a 54.85% F₁-score on Genia 2013, demonstrating the feasibility of a generative approach to biomedical event extraction.
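The general idea of trie-based constrained decoding can be sketched as follows: the allowed output sequences are stored in a prefix tree, and at each step the decoder may only choose among tokens that extend the current prefix. This is a toy illustration, not the paper's implementation. The role tokenization (`To` + `Loc`, `At` + `Loc`, `Theme`), the `<eos>` marker, the scores, and all function names are hypothetical, and a real system would score candidates with T5 rather than a lookup table.

```python
END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(sequences):
    """Store each allowed token sequence in a nested-dict prefix tree."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a valid sequence may stop here
    return root


def allowed_next(trie, prefix):
    """Return the tokens that may follow `prefix` (empty if prefix is invalid)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node)


def constrained_greedy(trie, score_fn, max_len=10):
    """Greedy decoding that only considers tokens permitted by the trie."""
    prefix = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        best = max(options, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break
        prefix.append(best)
    return prefix


# Hypothetical argument-role vocabulary, split into tokens for illustration.
role_trie = build_trie([["Theme"], ["To", "Loc"], ["At", "Loc"]])

# Stand-in for model scores; a real decoder would use the model's logits.
toy_scores = {"To": 0.9, "Theme": 0.5, "At": 0.1, "Loc": 0.3, END: 0.2}
decoded = constrained_greedy(role_trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
print(decoded)
```

Because the candidate set at each step is limited to the children of the current trie node, the decoder cannot emit a role string outside the allowed set, which is the search-space reduction the abstract refers to.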

Language Analysis and Computation Model

Performance and Challenges of InstructGPT in Named Entity Recognition

SUN Yu, YAN Hang, QIU Xipeng, WANG Ding, MU Xiaofeng, HUANG Xuanjing

Journal of Chinese Information Processing. 2024, 38(1): 74-85.

Currently, research on Large Language Models (LLMs), such as InstructGPT, is primarily focused on free-form generation tasks, while structured extraction tasks have been overlooked. To gain a deeper understanding of how LLMs perform on structured extraction tasks, this paper analyzes InstructGPT's performance on named entity recognition (NER), one of the fundamental structured extraction tasks, in both zero-shot and few-shot settings. To ensure the reliability of the findings, the experiments cover common and nested datasets from both the biomedical domain and the general domain. The results demonstrate that InstructGPT's zero-shot NER performance reaches 11% to 56% of that of a fine-tuned small-scale model. To explore why InstructGPT struggles with NER, this paper examines the outputs and finds that 50% of them are invalid generations. Besides, the occurrence of both "false-negative" and "false-positive" predictions makes it difficult to improve performance by only addressing the invalid generation. Therefore, in addition to ensuring the validity of generated outputs, further research should still focus on finding effective ways of using InstructGPT in this area.

Language Resources Construction

The Construction of Pre-Qin Ancient Chinese WordNet and A Contrastive Study with Ancient Sanskrit WordNet

LU Xuehui, XU Huidan, LI Bin, CHEN Siyu

Journal of Chinese Information Processing. 2023, 37(3): 36-45.

Pre-Qin ancient Chinese plays an important role in the history of the Chinese language. However, there are no well-structured lexical resources for Pre-Qin ancient Chinese, although such resources are essential for ancient language processing and cross-language comparison. This paper summarizes the construction methods of WordNet, a well-formed semantic hierarchy developed for dozens of human languages, with a special focus on WordNets for ancient languages and for Chinese. This paper then presents the construction and data checking process of the WordNet for Pre-Qin ancient Chinese (PQAC-WN), which covers 43 591 words, 61 227 senses and 17 975 synsets. Through cross-language comparison with the ancient Sanskrit WordNet, this paper analyzes the lexical similarities and differences of the two ancient languages, thus preliminarily verifying the application of the PQAC-WN.

Information Extraction and Text Mining

Debiased Contrastive Learning for Multimodal Named Entity Recognition

ZHANG Xin, YUAN Jingling, LI Lin, LIU Jia

Journal of Chinese Information Processing. 2023, 37(11): 49-59.

Recent studies show that visual information can help text achieve more accurate named entity recognition. However, most of the existing work treats an image as a collection of visual objects and attempts to explicitly align visual objects with entities in text, and thus fails to cope well with modal bias when the visual objects and the entities are quantitatively and semantically inconsistent. To deal with this problem, we propose a debiased contrastive learning approach (DebiasCL) for multimodal named entity recognition. We utilize the visual object density to guide visual context-rich sample mining, which enhances debiased contrastive learning to achieve better implicit alignment by optimizing the latent semantic space learning between visual and textual representations. Empirical results show that DebiasCL achieves F₁-values of 75.04% and 86.51%, with increases of 5.23% and 5.2% on the "PER" and "MISC" entity types, on Twitter-2015 and Twitter-2017, respectively.

Information Extraction and Text Mining

Continual Relation Extraction via Supervised Contrastive Replay

ZHAO Jiteng, LI Guozheng, WANG Peng, LIU Yanhe

Journal of Chinese Information Processing. 2023, 37(11): 60-67,80.

Continual relation extraction is used to solve the catastrophic forgetting caused by retraining models on new relations. To address the task-recency bias issue, this paper proposes a continual relation extraction method based on supervised contrastive replay. Specifically, for each new task, the model first uses the encoder to learn new sample embeddings, and then uses samples of the same and of different relation categories as positive and negative sample pairs to continually learn an embedding space with strong discrimination ability. At the same time, relation prototypes are added to the supervised contrastive loss to prevent the model from overfitting. Finally, the nearest class mean classifier is used for classification. The experimental results show that the proposed method can effectively alleviate the catastrophic forgetting issue in continual relation extraction, and achieves state-of-the-art performance on the FewRel and TACRED datasets.
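The final classification step named in this abstract, the nearest class mean (NCM) classifier, can be illustrated with a minimal sketch: each relation is represented by the mean of its stored sample embeddings, and a query is assigned to the relation whose mean is closest. The sketch assumes Euclidean distance and toy two-dimensional embeddings; the relation names, vectors, and function names are hypothetical, and the paper's encoder and distance choice may differ.

```python
import math


def class_means(embeddings, labels):
    """Average the embeddings of each class to obtain one mean vector per class."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def ncm_predict(query, means):
    """Return the class whose mean is nearest to the query (Euclidean distance)."""
    return min(means, key=lambda lab: math.dist(query, means[lab]))


# Toy replayed memory samples; relation names are hypothetical.
memory_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
memory_labels = ["founded_by", "founded_by", "born_in", "born_in"]

means = class_means(memory_embeddings, memory_labels)
prediction = ncm_predict([0.7, 0.4], means)
print(prediction)
```

Since the decision depends only on distances in the embedding space and not on a trained output layer, a classifier of this kind has no per-task output weights that could favor the most recently learned relations.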
