2022 Volume 36 Issue 8 Published: 26 September 2022
  

  • Survey
    AN Zhenwei, LAI Yuxuan, FENG Yansong
    2022, 36(8): 1-11.
    In recent years, legal artificial intelligence has attracted increasing attention for its efficiency and convenience. Among others, legal text is the most common manifestation in legal practice; thus, using natural language understanding methods to automatically process legal texts is an important direction for both academia and industry. In this paper, we provide a gentle survey summarizing recent advances in natural language understanding for legal texts. We first introduce the popular task setups, including legal information extraction, legal case retrieval, legal question answering, legal text summarization, and legal judgment prediction. We further discuss the main challenges from three perspectives: understanding the differences between legal-domain and open-domain language, understanding the rich argumentative texts in legal documents, and incorporating legal knowledge into existing natural language processing models.
  • Language Analysis and Calculation
    ZHANG Zhonglin, YU Wei, YAN Guanghui, YUAN Chenyu
    2022, 36(8): 12-19,28.
    At present, most existing Chinese word segmentation models are based on recurrent neural networks, which capture the overall features of a sequence while ignoring its local features. This paper combines the attention mechanism, convolutional neural networks and conditional random fields, and proposes the Attention Convolutional Neural Network CRF (ACNNC) model. A self-attention layer replaces the recurrent neural network to capture the global features of the sequence, while the convolutional neural network captures its local features. The features are combined in a fusion layer and then input into a conditional random field for decoding. Experimental results on the SIGHAN Bakeoff 2005 corpora show that the proposed model achieves F1 values of 96.2%, 96.4%, 96.1% and 95.8% on the PKU, MSR, CITYU and AS test sets, respectively.
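The final decoding step the abstract describes, a conditional random field over the fused features, is typically run with the Viterbi algorithm at inference time. Below is a minimal stdlib-only sketch; the tag set, score dictionaries and the `viterbi_decode` helper are illustrative assumptions, not the paper's code:

```python
import math

def viterbi_decode(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.

    emissions: list of per-position score dicts {tag: score}
    transitions: dict {(prev_tag, tag): score}; missing pairs are forbidden.
    Returns the highest-scoring tag sequence (e.g. BMES segmentation tags).
    """
    tags = list(emissions[0])
    # scores[t] = best score of any path ending in tag t at the current step
    scores = {t: emissions[0][t] for t in tags}
    back = []  # back[i][t] = best previous tag for tag t at step i+1
    for emit in emissions[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            prev, s = max(
                ((p, scores[p] + transitions.get((p, t), -math.inf)) for p in tags),
                key=lambda x: x[1],
            )
            new_scores[t] = s + emit[t]
            pointers[t] = prev
        scores, back = new_scores, back + [pointers]
    # follow back-pointers from the best final tag
    best = max(scores, key=scores.get)
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In the model described above, the emission scores would come from the fusion layer and the transition scores would be learned CRF parameters.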
  • Language Analysis and Calculation
    QIAN Qingqing, WANG Chengwen, WANG Guirong, RAO Gaoqi, XUN Endong
    2022, 36(8): 20-28.
    This paper proposes a Chinese chunk-based dependency grammar (CCDG), which focuses on the chunks governed by predicates within and between sentences. As an effort to establish a syntactic analysis framework at the sentence-group level, CCDG proposes a novel idea of enlarging the linguistic granularity of leaf nodes. It captures logical structure knowledge at the micro level and lays a foundation for meso-level argument knowledge and macro-level textual knowledge. This paper presents the concept, representation, analysis method and characteristics of CCDG, as well as the development of a corresponding treebank. By August 2020, the treebank had been scaled up to 1.87 million tokens (including 40,000 complex sentences and 100,000 sub-sentences), consisting of 67% news texts and 32% encyclopedia texts.
  • Language Resources Construction
    YUAN Yulin, CAO Hong
    2022, 36(8): 29-36,45.
    This paper introduces the contents and structure of The Syntactic-Semantic Knowledge-Base of Chinese Verbs (KB@verb). First, the structural system and theoretical foundation of KB@verb are introduced. Secondly, KB@verb classifies verbs into eight sub-classes and defines 22 semantic roles, which are configured into dozens of syntactic formats. These syntactic formats and their examples in real-world texts are also included. In addition, KB@verb identifies nine major grammatical functions of verbs and their degrees of membership. Finally, the retrieval system and the hardcopy of KB@verb are illustrated.
  • Language Resources Construction
    CHANG Hongyang, ZAN Hongying, MA Yutuan, ZHANG Kunli
    2022, 36(8): 37-45.
    This paper discusses the annotation of named entities and entity relations in Chinese electronic medical records of stroke disease, and proposes an annotation scheme and norms suitable for the content and characteristics of such records. Guided by this scheme, after multiple rounds of annotation and correction, we completed the annotation of entities and relations in more than 1.5 million words of stroke electronic medical records (the Stroke Electronic Medical Record entity and entity relation Corpus, SEMRC). The constructed corpus contains 10,594 named entities and 14,597 entity relations. Inter-annotator consistency reached 0.8516 for named entities and 0.9416 for entity relations.
  • Ethnic Language Processing and Cross Language Processing
    HUANKE You, HUAQUE Cairang, CAIRANG Dangzhi, DUOJIE Cairang
    2022, 36(8): 46-53.
    This paper examines the Gesar epic texts, which are rich in entities, and classifies them into six types of named entities. A named entity recognition method combining Tibetan syllable features and deep learning is proposed for the Gesar epic. The precision, recall and F1 value of named entity recognition reach 92.01%, 91.96% and 91.99%, respectively, on a dataset of more than 100,000 manually annotated named entities.
  • Ethnic Language Processing and Cross Language Processing
    YAN Xiaodong, XIE Xiaoqing
    2022, 36(8): 54-61.
    Sentence ordering is one of the core technologies in natural language processing, with wide applications in multi-document summarization, question answering and text generation. Considering the rich morphological changes of Korean words, this paper puts forward a Korean sentence ordering model based on sub-word level vectors and a pointer network. A morpheme-split based word vector training method (MorV) is presented, and the resulting Korean word vectors are compared with those of a sub-word n-gram vector training method (SG). Two sentence vector methods, a convolutional neural network (CNN) and a long short-term memory network (LSTM), are explored in the pointer network. The results show that the combination of MorV and LSTM better captures the semantic and logical relationships between sentences and improves sentence ordering.
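The sub-word n-gram method (SG) that the abstract compares against follows the fastText idea of representing a word through its character n-grams, so that morphological variants share sub-word units. A stdlib-only sketch of the n-gram extraction step; the `char_ngrams` helper, the boundary markers and the parameter values are illustrative assumptions, not the paper's code:

```python
def char_ngrams(word, n_min=2, n_max=4):
    """Collect character n-grams of a word, fastText-style.

    Boundary markers '<' and '>' distinguish prefixes and suffixes.
    A word vector is then typically the average of its n-gram vectors,
    letting morphologically related Korean surface forms share parameters.
    """
    w = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams
```

The MorV variant in the paper instead splits on morpheme boundaries before training vectors; a morpheme analyzer would replace the sliding window above.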
  • Ethnic Language Processing and Cross Language Processing
    CHEN Long, GUO Junjun, ZHANG Yafei, GAO Shengxiang, YU Zhengtao
    2022, 36(8): 62-72.
    Current event detection models based on deep learning rely on labeled data, but the scarcity of annotated data for Vietnamese events and the ambiguity of event types pose challenges to Vietnamese event detection. Taking advantage of the fact that sentences expressing the same viewpoint in different languages usually have the same or similar semantic components, this paper proposes a Vietnamese event detection framework that combines Chinese information and Vietnamese syntax. First, a shared encoder strategy and a cross-attention network are applied to integrate Chinese semantic information into Vietnamese. Then a graph convolutional network is used to obtain Vietnamese representations based on Vietnamese dependency syntax. Finally, Vietnamese semantic representations informed by Chinese event type information are extracted through an event type perception network to realize Vietnamese news event detection. Experimental results show that the proposed method achieves good results.
  • Information Extraction and Text Mining
    YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong
    2022, 36(8): 73-80.
    Automatic extraction technology for medical record texts is becoming increasingly important. At present, distant supervision is a popular solution to the lack of labeled corpora. To alleviate the unlabeled entity issue caused by distant supervision, this paper proposes a combined strategy of data augmentation, negative sampling and global optimal node set selection for a span-level named entity recognition model. Experiments show that data augmentation and global optimal node set selection each yield a stable improvement of about 0.5%, and the negative sampling method brings a 5% to 10% improvement.
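The negative sampling idea in this abstract rests on a simple observation: under distant supervision some true entities are unlabeled, so treating every non-positive span as a negative injects false negatives, and sampling only a fraction of them reduces that risk. A minimal stdlib-only sketch; the `sample_negative_spans` helper and its parameters are illustrative assumptions, not the paper's implementation:

```python
import random

def sample_negative_spans(seq_len, positive_spans, ratio=0.3, max_width=4, seed=0):
    """Sample a subset of non-entity spans to use as negatives.

    positive_spans: set of (start, end) inclusive spans with distant labels.
    ratio: fraction of candidate negative spans to keep.
    max_width: longest span considered, to bound the candidate set.
    """
    rng = random.Random(seed)
    candidates = [
        (i, j)
        for i in range(seq_len)
        for j in range(i, min(i + max_width, seq_len))
        if (i, j) not in positive_spans
    ]
    k = max(1, int(len(candidates) * ratio))
    return rng.sample(candidates, k)
```

In a span-level NER model, the sampled spans would be scored alongside the positives, while unsampled spans contribute nothing to the loss.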
  • Information Extraction and Text Mining
    LI Zhongqiu, HONG Yu, WANG Jie, ZHOU Guodong
    2022, 36(8): 81-91.
    Event detection aims to automatically identify and classify event triggers in unstructured texts. Entity representation (or entity profiling) is deemed helpful for event detection, based on the hypothesis that the features of entities often imply the type of events they participate in (for example, “police” often participate in “Arrest-Jail” events). In this paper, we propose an entity representation enhancement method based on document-level context and entity interactions. We use a graph attention network to capture document-level information about entities, taking into account the local interaction information of other related entities. In particular, an attention mask model based on the entity co-occurrence graph is developed to reduce noise. We finally combine the entity representation enhancement network, a BERT semantic encoding network, and a GAT aggregation network to form the overall event detection model. Experiments on ACE2005 demonstrate that the proposed method achieves a 76.2% F1 score on the trigger classification task, outperforming the baseline model by 2.2%.
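The attention mask described here restricts which entities may attend to each other: only pairs that co-occur in the document (plus each entity itself) are left unmasked, which suppresses noise from unrelated entities. A stdlib-only sketch of building such a mask; the `cooccurrence_mask` helper and its boolean-matrix representation are illustrative assumptions, not the paper's code:

```python
def cooccurrence_mask(num_entities, edges):
    """Build a symmetric boolean attention mask from an entity
    co-occurrence graph.

    edges: iterable of (i, j) pairs of entities that co-occur.
    mask[i][j] is True iff entity i may attend to entity j.
    Self-attention (i == j) is always allowed.
    """
    mask = [[i == j for j in range(num_entities)] for i in range(num_entities)]
    for i, j in edges:
        mask[i][j] = mask[j][i] = True
    return mask
```

In a graph attention layer, masked positions would receive a large negative logit before the softmax so their weights vanish.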
  • Information Extraction and Text Mining
    MA Shikun, TENG Chong, LI Fei, JI Donghong
    2022, 36(8): 92-100.
    Text classification is a fundamental task in the natural language processing community. However, current text classification is usually domain-specific and suffers from insufficient annotated training data. We propose to leverage the similarity of data across different domains to address the limited labeled training data issue. Under the multi-task learning framework proposed in this paper, we extract domain-invariant and domain-specific features using a shared encoder and multiple private encoders, respectively. Latent information from different domains can thus be captured, which benefits multi-domain text classification. Besides, we apply an orthogonal projection operation to make the shared and private feature spaces inherently disjoint and thereby refine the shared features, and then design a gate mechanism to fuse the shared and private features. Experiments on Amazon review and FDU-MTL show that the proposed model achieves average accuracies of 86.04% and 89.2% on the two datasets, respectively, significantly better than multiple baseline models.
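The orthogonal projection mentioned here can be illustrated on a single pair of vectors: the component of the private feature that lies along the shared feature is removed, p' = p - ((p·s)/(s·s))·s, leaving the private vector orthogonal to the shared one. A stdlib-only sketch; the `remove_shared_component` helper is an illustrative single-vector version, not the paper's batched implementation:

```python
def remove_shared_component(private, shared):
    """Project a private feature vector onto the orthogonal complement
    of a shared feature vector: p' = p - ((p.s)/(s.s)) * s.

    After this step the domain-specific (private) features carry no
    component along the domain-invariant (shared) direction.
    """
    dot_ps = sum(p * s for p, s in zip(private, shared))
    dot_ss = sum(s * s for s in shared)
    coef = dot_ps / dot_ss
    return [p - coef * s for p, s in zip(private, shared)]
```

In training this would typically appear as an orthogonality loss or a projection layer applied per example; the gate mechanism then mixes the disentangled shared and private features.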
  • Sentiment Analysis and Social Computing
    LEI Pengbin, QIN Bin, WANG Zhili, WU Yufan, LIANG Siyi, CHEN Yu
    2022, 36(8): 101-108.
    This paper proposes a new method for text sentiment classification based on pre-trained models. A BiLSTM network is applied to dynamically adjust the output weights of each Transformer layer of the pre-trained model, and the layered text representation vectors are further filtered by networks such as BiLSTM and BiGRU. Using this model, we achieved third place in the Netizen Emotion Recognition during the Epidemic track of the CCF 2020 Science and Technology for Epidemic · Big Data Charity Challenge. The F1 value on the final test set is 0.74537, only 0.0001 lower than the first-place model while using 67% fewer parameters.
  • Sentiment Analysis and Social Computing
    ZHU Qinglin, LIANG Bin, XU Ruifeng, LIU Yuhan, CHEN Yi, MAO Ruibin
    2022, 36(8): 109-117.
    To address entity-level sentiment analysis of financial texts, this paper builds a multi-million-scale corpus for sentiment analysis of financial domain entities and labels more than five thousand financial sentiment words as a financial domain sentiment dictionary. We further propose an attention-based recurrent network combined with the financial lexicon, called FinLexNet. The FinLexNet model uses one LSTM to extract category-level information based on the financial domain sentiment dictionary and another LSTM to extract word-level semantic information. In addition, to pay more attention to financial sentiment words, an attention mechanism based on the financial domain sentiment dictionary is proposed. Finally, experiments on the dataset we constructed show that our model achieves better performance than the baseline models.
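One simple way to make an attention mechanism favor lexicon words, in the spirit of the dictionary-based attention described here, is to boost the attention logits of tokens flagged by the sentiment dictionary before the softmax. A stdlib-only sketch; the `lexicon_attention` helper and the additive `boost` are illustrative assumptions, not FinLexNet's actual formulation:

```python
import math

def lexicon_attention(scores, is_lexicon_word, boost=2.0):
    """Softmax attention with extra weight on sentiment-lexicon tokens.

    scores: raw attention logits per token.
    is_lexicon_word: per-token flags from the (hypothetical) financial
    sentiment dictionary lookup.
    """
    logits = [s + (boost if flag else 0.0)
              for s, flag in zip(scores, is_lexicon_word)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```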
  • Natural Language Understanding and Generation
    ZHU Zhanbiao, HUANG Peijie, ZHANG Yexing, LIU Shudong, ZHANG Hualin, HUANG Junyao, LIN Piyuan
    2022, 36(8): 118-126.
    Joint models for intent detection and slot filling have boosted the state of the art in spoken language understanding (SLU). However, the presence of rarely seen or unseen mentions degrades model performance. Earlier research shows that sequence labeling tasks can benefit from dependency tree structures for inferring the existence of slot tags. In Chinese spoken language understanding, the dominant models for slot filling are character-based, so word-level dependency tree structures cannot be integrated into the model directly. In this paper, we propose a dependency-guided character-based slot filling (DCSF) model, which provides a concise way to resolve the conflict of incorporating word-level dependency tree structures into a character-level model for Chinese. Our DCSF model integrates dependency tree information into the character-level model while preserving word-level context and segmentation information by modeling different types of relationships between the Chinese characters in an utterance. Experimental results on the public benchmark corpora SMP-ECDT and CrossWOZ show that our model outperforms the compared models by a large margin, especially in low-resource and unseen slot mention scenarios.
  • Natural Language Understanding and Generation
    MA Tianyu, QIN Jun, LIU Jing, TIE Jun, HOU Qi
    2022, 36(8): 127-134.
    Intent classification and slot filling are two basic sub-tasks of spoken language understanding. A joint model of intent classification and slot filling based on BERT is proposed. Through an association network, the two tasks establish a direct connection and share information. BERT is introduced to enhance the semantic representation of word vectors, which effectively alleviates the issue of small training datasets. Experiments on the ATIS and Snips datasets show that the proposed model significantly improves the accuracy of intent classification and the F1 value of slot filling.
  • Natural Language Understanding and Generation
    YIN Baosheng, AN Pengfei
    2022, 36(8): 135-143,153.
    Abstractive document summarization algorithms based on the sequence-to-sequence model have achieved good performance. Given the rich local contextual information contained in Chinese n-grams, this paper proposes NgramSum, a framework that integrates n-gram information into existing neural models. The framework takes an existing neural model as the backbone, extracts n-gram information from the local corpus, and applies it to augment the local context via a gate module. Experimental results on the dataset of the NLPCC 2017 shared task 3 show that the framework effectively enhances strong sequence-to-sequence baselines, improving LSTM, Transformer and pre-trained models by an average of 2.76%, 3.25% and 3.10%, respectively, in ROUGE-1/2/L scores.
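The first stage described in this abstract, extracting n-gram information from a local corpus, can be sketched as collecting the frequent character n-grams; frequency is one plausible criterion for deciding which n-grams carry useful local context. A stdlib-only sketch; the `extract_ngrams` helper and its thresholds are illustrative assumptions, not NgramSum's actual extraction procedure:

```python
from collections import Counter

def extract_ngrams(sentences, n_max=3, min_count=2):
    """Collect character n-grams (n >= 2) that occur at least
    min_count times in the local corpus.

    The surviving n-grams approximate the rich local context that a
    gate module could then inject into the encoder.
    """
    counts = Counter()
    for sent in sentences:
        for n in range(2, n_max + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {g for g, c in counts.items() if c >= min_count}
```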
  • Natural Language Understanding and Generation
    GUAN Mengyu, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2022, 36(8): 144-153.
    Existing dialogue systems tend to generate meaningless general replies such as “OK” and “I don't know”. In daily dialogue, every utterance usually has obvious emotional and intentional tendencies. This paper therefore proposes a response generation model based on dialogue constraints: building on the Seq2Seq model, it jointly recognizes the topic, sentiment and intent of utterances. The method constrains the topic, sentiment and intent of the generated responses, producing responses with reasonable sentiment and intent tendencies that are closely related to the topic of the conversation. Experiments show that the proposed method effectively improves the quality of generated responses.
  • Natural Language Understanding and Generation
    ZENG Biqing, PEI Fenghua, XU Mayi, DING Meirong
    2022, 36(8): 154-162,174.
    Paragraph-level question generation aims to generate one or more related questions from a given paragraph. Current sequence-to-sequence neural network approaches fail to filter out redundant information or focus on key sentences. To solve this issue, this paper proposes a dual-attention model for paragraph-level question generation. The model first applies attention mechanisms to the paragraph and to the sentence where the answer is located. Then it uses a gating mechanism to dynamically assign weights and merge the two into context information. Finally, it improves the pointer-generator network to combine the context vector and the attention distribution to generate questions. Experimental results show that this model outperforms existing models on the SQuAD dataset.
  • NLP Application
    SONG Li, LIU Ying, MA Yanjun
    2022, 36(8): 163-174.
    It is a classical dispute whether Nai'an Shi authored Water Margin alone, and whether he had any partnership with Guanzhong Luo. We first collect the controversies over the authorship of Water Margin and summarize the following five hypotheses: (1) written by Shi alone; (2) written by Luo alone; (3) originally written by Shi and continued by Luo; (4) originally written by Luo and continued by someone else; (5) originally written by Shi and adapted by Luo. Taking Luo's Quelling the Demons' Revolt as a reference, we then investigate the writing style of Water Margin via hypothesis testing, text clustering, text classification, rolling stylometry, and text content analysis. The results suggest that the fourth hypothesis is most probably true: the first 70 sections were written by Luo and the rest were written by others.
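Rolling stylometry, one of the methods named in this abstract, slides a window over consecutive sections of a text and computes a stylistic profile (commonly function-word frequencies) per window; an authorship change shows up as a shift in these profiles. A stdlib-only sketch; the `rolling_function_word_profile` helper, the whitespace tokenization and the parameter values are illustrative assumptions, not the paper's procedure:

```python
def rolling_function_word_profile(chapters, function_words, window=5, step=1):
    """Compute a rolling stylistic profile over consecutive chapters.

    chapters: list of chapter texts (tokenized here by whitespace,
    for illustration only; Chinese text would need real segmentation).
    Returns (start_index, {function_word: relative_frequency}) pairs,
    one per window position.
    """
    profiles = []
    for start in range(0, len(chapters) - window + 1, step):
        tokens = " ".join(chapters[start:start + window]).split()
        total = len(tokens) or 1
        profiles.append(
            (start, {w: tokens.count(w) / total for w in function_words})
        )
    return profiles
```

Comparing each window's profile against reference profiles of the candidate authors (e.g. from Quelling the Demons' Revolt) is what lets the method localize a stylistic break such as the one reported around chapter 70.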