2022 Volume 36 Issue 8 Published: 26 September 2022
  

  • Survey
    AN Zhenwei, LAI Yuxuan, FENG Yansong
    2022, 36(8): 1-11.
    In recent years, legal artificial intelligence has attracted increasing attention for its efficiency and convenience. Among others, legal text is the most common manifestation in legal practice; thus, using natural language understanding methods to automatically process legal texts is an important direction for both academia and industry. In this paper, we provide a gentle survey summarizing recent advances in natural language understanding for legal texts. We first introduce the popular task setups, including legal information extraction, legal case retrieval, legal question answering, legal text summarization, and legal judgment prediction. We further discuss the main challenges from three perspectives: understanding the differences between legal-domain and open-domain language, understanding the rich argumentative texts in legal documents, and incorporating legal knowledge into existing natural language processing models.
  • Language Analysis and Calculation
    ZHANG Zhonglin, YU Wei, YAN Guanghui, YUAN Chenyu
    2022, 36(8): 12-19,28.
    At present, most existing Chinese word segmentation models are based on recurrent neural networks, which capture the overall features of a sequence while ignoring its local features. This paper combines the attention mechanism, convolutional neural networks and conditional random fields, and proposes the Attention Convolutional Neural Network CRF (ACNNC) model. A self-attention layer replaces the recurrent neural network to capture the global features of the sequence, while the convolutional neural network captures its local features. The features are combined in a fusion layer and then input into a conditional random field for decoding. Experimental results on the SIGHAN Bakeoff 2005 corpora show that the proposed model achieves F1 values of 96.2%, 96.4%, 96.1% and 95.8% on the PKU, MSR, CITYU and AS test sets, respectively.
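The final decoding step the abstract describes, a conditional random field over the fused features, is typically run with the Viterbi algorithm at inference time. Below is a minimal stdlib-only sketch; the tag set, score dictionaries and the `viterbi_decode` helper are illustrative assumptions, not the paper's code:

```python
import math

def viterbi_decode(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.

    emissions: list of per-position score dicts {tag: score}
    transitions: dict {(prev_tag, tag): score}; missing pairs are forbidden.
    Returns the highest-scoring tag sequence (e.g. BMES segmentation tags).
    """
    tags = list(emissions[0])
    # scores[t] = best score of any path ending in tag t at the current step
    scores = {t: emissions[0][t] for t in tags}
    back = []  # back[i][t] = best previous tag for tag t at step i+1
    for emit in emissions[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            prev, s = max(
                ((p, scores[p] + transitions.get((p, t), -math.inf)) for p in tags),
                key=lambda x: x[1],
            )
            new_scores[t] = s + emit[t]
            pointers[t] = prev
        scores, back = new_scores, back + [pointers]
    # follow back-pointers from the best final tag
    best = max(scores, key=scores.get)
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In the model described above, the emission scores would come from the fusion layer and the transition scores would be learned CRF parameters.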
  • Language Analysis and Calculation
    QIAN Qingqing, WANG Chengwen, WANG Guirong, RAO Gaoqi, XUN Endong
    2022, 36(8): 20-28.
    This paper proposes a Chinese chunk-based dependency grammar (CCDG), which focuses on the chunks governed by predicates within and between sentences. As an effort to establish a syntactic analysis framework at the sentence-group level, CCDG proposes a novel idea of enlarging the linguistic granularity of leaf nodes. It captures logical structure knowledge at the micro level and lays a foundation for meso-level argument knowledge and macro-level textual knowledge. This paper presents the concept, representation, analysis method and characteristics of CCDG, as well as the development of a corresponding treebank. By August 2020, the treebank had been scaled up to 1.87 million tokens (including 40,000 complex sentences and 100,000 sub-sentences), consisting of 67% news texts and 32% encyclopedia texts.
  • Language Resources Construction
    YUAN Yulin, CAO Hong
    2022, 36(8): 29-36,45.
    This paper introduces the contents and structure of The Syntactic-Semantic Knowledge-Base of Chinese Verbs (KB@verb). First, the structural system and theoretical foundation of KB@verb are introduced. Secondly, KB@verb classifies verbs into eight sub-classes and defines 22 semantic roles, which are configured into dozens of syntactic formats. These syntactic formats and their examples in real-world texts are also included. In addition, KB@verb identifies nine major grammatical functions of verbs and their degrees of membership. Finally, the retrieval system and the hardcopy of KB@verb are illustrated.
  • Language Resources Construction
    CHANG Hongyang, ZAN Hongying, MA Yutuan, ZHANG Kunli
    2022, 36(8): 37-45.
    This paper discusses the annotation of named entities and entity relations in Chinese electronic medical records of stroke disease, and proposes an annotation scheme and norms suitable for the content and characteristics of such records. Guided by this scheme, after multiple rounds of annotation and correction, we completed the annotation of entities and relations in more than 1.5 million words of stroke electronic medical records (the Stroke Electronic Medical Record entity and entity relation Corpus, SEMRC). The constructed corpus contains 10,594 named entities and 14,597 entity relations. Inter-annotator consistency reached 0.8516 for named entities and 0.9416 for entity relations.
  • Ethnic Language Processing and Cross Language Processing
    HUANKE You, HUAQUE Cairang, CAIRANG Dangzhi, DUOJIE Cairang
    2022, 36(8): 46-53.
    This paper examines the Gesar epic texts, which are rich in entities, and classifies them into six types of named entities. A named entity recognition method combining Tibetan syllable features and deep learning is proposed for the Gesar epic. The precision, recall and F1 value of named entity recognition reach 92.01%, 91.96% and 91.99%, respectively, on a dataset of more than 100,000 manually annotated named entities.
  • Ethnic Language Processing and Cross Language Processing
    YAN Xiaodong, XIE Xiaoqing
    2022, 36(8): 54-61.
    Sentence ordering is one of the core technologies in natural language processing, with wide applications in multi-document summarization, question answering and text generation. Considering the rich morphological changes of Korean words, this paper puts forward a Korean sentence ordering model based on sub-word level vectors and a pointer network. A morpheme-split based word vector training method (MorV) is presented, and the resulting Korean word vectors are compared with those of a sub-word n-gram vector training method (SG). Two sentence vector methods, a convolutional neural network (CNN) and a long short-term memory network (LSTM), are explored in the pointer network. The results show that the combination of MorV and LSTM better captures the semantic and logical relationships between sentences and improves sentence ordering.
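The sub-word n-gram method (SG) that the abstract compares against follows the fastText idea of representing a word through its character n-grams, so that morphological variants share sub-word units. A stdlib-only sketch of the n-gram extraction step; the `char_ngrams` helper, the boundary markers and the parameter values are illustrative assumptions, not the paper's code:

```python
def char_ngrams(word, n_min=2, n_max=4):
    """Collect character n-grams of a word, fastText-style.

    Boundary markers '<' and '>' distinguish prefixes and suffixes.
    A word vector is then typically the average of its n-gram vectors,
    letting morphologically related Korean surface forms share parameters.
    """
    w = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams
```

The MorV variant in the paper instead splits on morpheme boundaries before training vectors; a morpheme analyzer would replace the sliding window above.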
  • Ethnic Language Processing and Cross Language Processing
    CHEN Long, GUO Junjun, ZHANG Yafei, GAO Shengxiang, YU Zhengtao
    2022, 36(8): 62-72.
    Current event detection models based on deep learning rely on labeled data, but the scarcity of annotated data for Vietnamese events and the ambiguity of event types pose challenges to Vietnamese event detection. Taking advantage of the fact that sentences expressing the same viewpoint in different languages usually have the same or similar semantic components, this paper proposes a Vietnamese event detection framework that combines Chinese information and Vietnamese syntax. First, a shared encoder strategy and a cross-attention network are applied to integrate Chinese semantic information into Vietnamese. Then a graph convolutional network is used to obtain Vietnamese representations based on Vietnamese dependency syntax. Finally, Vietnamese semantic representations informed by Chinese event type information are extracted through an event type perception network to realize Vietnamese news event detection. Experimental results show that the proposed method achieves good results.
  • Information Extraction and Text Mining
    YANG Yifan, SHI Miaoyuan, MIAO Qingliang, LI Maolong
    2022, 36(8): 73-80.
    Automatic extraction technology for medical record texts is becoming increasingly important. At present, distant supervision is a popular solution to the lack of labeled corpora. To alleviate the unlabeled entity issue caused by distant supervision, this paper proposes a combined strategy of data augmentation, negative sampling and global optimal node set selection for a span-level named entity recognition model. Experiments show that data augmentation and global optimal node set selection each yield a stable improvement of about 0.5%, and the negative sampling method brings a 5% to 10% improvement.
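The negative sampling idea in this abstract rests on a simple observation: under distant supervision some true entities are unlabeled, so treating every non-positive span as a negative injects false negatives, and sampling only a fraction of them reduces that risk. A minimal stdlib-only sketch; the `sample_negative_spans` helper and its parameters are illustrative assumptions, not the paper's implementation:

```python
import random

def sample_negative_spans(seq_len, positive_spans, ratio=0.3, max_width=4, seed=0):
    """Sample a subset of non-entity spans to use as negatives.

    positive_spans: set of (start, end) inclusive spans with distant labels.
    ratio: fraction of candidate negative spans to keep.
    max_width: longest span considered, to bound the candidate set.
    """
    rng = random.Random(seed)
    candidates = [
        (i, j)
        for i in range(seq_len)
        for j in range(i, min(i + max_width, seq_len))
        if (i, j) not in positive_spans
    ]
    k = max(1, int(len(candidates) * ratio))
    return rng.sample(candidates, k)
```

In a span-level NER model, the sampled spans would be scored alongside the positives, while unsampled spans contribute nothing to the loss.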
  • Information Extraction and Text Mining
    LI Zhongqiu, HONG Yu, WANG Jie, ZHOU Guodong
    2022, 36(8): 81-91.
    Event detection aims to automatically identify and classify event triggers in unstructured texts. Entity representation (or entity profiling) is deemed helpful for event detection, based on the hypothesis that the features of entities often imply the type of events they participate in (for example, “police” often participate in “Arrest-Jail” events). In this paper, we propose an entity representation enhancement method based on document-level context and entity interactions. We use a graph attention network to capture document-level information about entities, taking into account the local interaction information of other related entities. In particular, an attention mask model based on the entity co-occurrence graph is developed to reduce noise. We finally combine the entity representation enhancement network, a BERT semantic encoding network, and a GAT aggregation network to form the overall event detection model. Experiments on ACE2005 demonstrate that the proposed method achieves a 76.2% F1 score on the trigger classification task, outperforming the baseline model by 2.2%.
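The attention mask described here restricts which entities may attend to each other: only pairs that co-occur in the document (plus each entity itself) are left unmasked, which suppresses noise from unrelated entities. A stdlib-only sketch of building such a mask; the `cooccurrence_mask` helper and its boolean-matrix representation are illustrative assumptions, not the paper's code:

```python
def cooccurrence_mask(num_entities, edges):
    """Build a symmetric boolean attention mask from an entity
    co-occurrence graph.

    edges: iterable of (i, j) pairs of entities that co-occur.
    mask[i][j] is True iff entity i may attend to entity j.
    Self-attention (i == j) is always allowed.
    """
    mask = [[i == j for j in range(num_entities)] for i in range(num_entities)]
    for i, j in edges:
        mask[i][j] = mask[j][i] = True
    return mask
```

In a graph attention layer, masked positions would receive a large negative logit before the softmax so their weights vanish.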
  • Information Extraction and Text Mining
    MA Shikun, TENG Chong, LI Fei, JI Donghong
    2022, 36(8): 92-100.
    Text classification is a fundamental task in the natural language processing community. However, current text classification is usually domain-specific and suffers from insufficient annotated training data. We propose to leverage the similarity of data across different domains to address the limited labeled training data issue. Under the multi-task learning framework proposed in this paper, we extract domain-invariant and domain-specific features using a shared encoder and multiple private encoders, respectively. Latent information from different domains can thus be captured, which benefits multi-domain text classification. Besides, we apply an orthogonal projection operation to make the shared and private feature spaces inherently disjoint and thereby refine the shared features, and then design a gate mechanism to fuse the shared and private features. Experiments on Amazon review and FDU-MTL show that the proposed model achieves average accuracies of 86.04% and 89.2% on the two datasets, respectively, significantly better than multiple baseline models.
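The orthogonal projection mentioned here can be illustrated on a single pair of vectors: the component of the private feature that lies along the shared feature is removed, p' = p - ((p·s)/(s·s))·s, leaving the private vector orthogonal to the shared one. A stdlib-only sketch; the `remove_shared_component` helper is an illustrative single-vector version, not the paper's batched implementation:

```python
def remove_shared_component(private, shared):
    """Project a private feature vector onto the orthogonal complement
    of a shared feature vector: p' = p - ((p.s)/(s.s)) * s.

    After this step the domain-specific (private) features carry no
    component along the domain-invariant (shared) direction.
    """
    dot_ps = sum(p * s for p, s in zip(private, shared))
    dot_ss = sum(s * s for s in shared)
    coef = dot_ps / dot_ss
    return [p - coef * s for p, s in zip(private, shared)]
```

In training this would typically appear as an orthogonality loss or a projection layer applied per example; the gate mechanism then mixes the disentangled shared and private features.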
  • Sentiment Analysis and Social Computing
    LEI Pengbin, QIN Bin, WANG Zhili, WU Yufan, LIANG Siyi, CHEN Yu
    2022, 36(8): 101-108.
    This paper proposes a new method for text sentiment classification based on pre-trained models. A BiLSTM network is applied to dynamically adjust the output weights of each Transformer layer of the pre-trained model, and the layered text representation vectors are further filtered by networks such as BiLSTM and BiGRU. Using this model, we achieved third place in the Netizen Emotion Recognition during the Epidemic track of the CCF 2020 Science and Technology for Epidemic · Big Data Charity Challenge. The F1 value on the final test set is 0.74537, only 0.0001 lower than the first-place model while using 67% fewer parameters.
  • Sentiment Analysis and Social Computing
    ZHU Qinglin, LIANG Bin, XU Ruifeng, LIU Yuhan, CHEN Yi, MAO Ruibin
    2022, 36(8): 109-117.
    To address entity-level sentiment analysis of financial texts, this paper builds a multi-million-scale corpus for sentiment analysis of financial domain entities and labels more than five thousand financial sentiment words as a financial domain sentiment dictionary. We further propose an attention-based recurrent network combined with the financial lexicon, called FinLexNet. The FinLexNet model uses one LSTM to extract category-level information based on the financial domain sentiment dictionary and another LSTM to extract word-level semantic information. In addition, to pay more attention to financial sentiment words, an attention mechanism based on the financial domain sentiment dictionary is proposed. Finally, experiments on the dataset we constructed show that our model achieves better performance than the baseline models.
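One simple way to make an attention mechanism favor lexicon words, in the spirit of the dictionary-based attention described here, is to boost the attention logits of tokens flagged by the sentiment dictionary before the softmax. A stdlib-only sketch; the `lexicon_attention` helper and the additive `boost` are illustrative assumptions, not FinLexNet's actual formulation:

```python
import math

def lexicon_attention(scores, is_lexicon_word, boost=2.0):
    """Softmax attention with extra weight on sentiment-lexicon tokens.

    scores: raw attention logits per token.
    is_lexicon_word: per-token flags from the (hypothetical) financial
    sentiment dictionary lookup.
    """
    logits = [s + (boost if flag else 0.0)
              for s, flag in zip(scores, is_lexicon_word)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```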
  • Natural Language Understanding and Generation
    ZHU Zhanbiao, HUANG Peijie, ZHANG Yexing, LIU Shudong, ZHANG Hualin, HUANG Junyao, LIN Piyuan
    2022, 36(8): 118-126.
    Joint models for intent detection and slot filling have boosted the state of the art in spoken language understanding (SLU). However, the presence of rarely seen or unseen mentions degrades model performance. Earlier research shows that sequence labeling tasks can benefit from dependency tree structures for inferring the existence of slot tags. In Chinese spoken language understanding, the dominant models for slot filling are character-based, so word-level dependency tree structures cannot be integrated into the model directly. In this paper, we propose a dependency-guided character-based slot filling (DCSF) model, which provides a concise way to resolve the conflict of incorporating word-level dependency tree structures into a character-level model for Chinese. Our DCSF model integrates dependency tree information into the character-level model while preserving word-level context and segmentation information by modeling different types of relationships between the Chinese characters in an utterance. Experimental results on the public benchmark corpora SMP-ECDT and CrossWOZ show that our model outperforms the compared models by a large margin, especially in low-resource and unseen slot mention scenarios.
  • Natural Language Understanding and Generation
    MA Tianyu, QIN Jun, LIU Jing, TIE Jun, HOU Qi
    2022, 36(8): 127-134.
    Intent classification and slot filling are two basic sub-tasks of spoken language understanding. A joint model of intent classification and slot filling based on BERT is proposed. Through an association network, the two tasks establish a direct connection and share information. BERT is introduced to enhance the semantic representation of word vectors, which effectively alleviates the issue of small training datasets. Experiments on the ATIS and Snips datasets show that the proposed model significantly improves the accuracy of intent classification and the F1 value of slot filling.
  • Natural Language Understanding and Generation
    YIN Baosheng, AN Pengfei
    2022, 36(8): 135-143,153.
    Abstractive document summarization algorithms based on the sequence-to-sequence model have achieved good performance. Given the rich local contextual information contained in Chinese n-grams, this paper proposes NgramSum, a framework that integrates n-gram information into existing neural models. The framework takes an existing neural model as the backbone, extracts n-gram information from the local corpus, and applies it to augment the local context via a gate module. Experimental results on the dataset of the NLPCC 2017 shared task 3 show that the framework effectively enhances strong sequence-to-sequence baselines, improving LSTM, Transformer and pre-trained models by an average of 2.76%, 3.25% and 3.10%, respectively, in ROUGE-1/2/L scores.
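The first stage described in this abstract, extracting n-gram information from a local corpus, can be sketched as collecting the frequent character n-grams; frequency is one plausible criterion for deciding which n-grams carry useful local context. A stdlib-only sketch; the `extract_ngrams` helper and its thresholds are illustrative assumptions, not NgramSum's actual extraction procedure:

```python
from collections import Counter

def extract_ngrams(sentences, n_max=3, min_count=2):
    """Collect character n-grams (n >= 2) that occur at least
    min_count times in the local corpus.

    The surviving n-grams approximate the rich local context that a
    gate module could then inject into the encoder.
    """
    counts = Counter()
    for sent in sentences:
        for n in range(2, n_max + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {g for g, c in counts.items() if c >= min_count}
```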
  • Natural Language Understanding and Generation
    GUAN Mengyu, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2022, 36(8): 144-153.
    Existing dialogue systems tend to generate meaningless general replies such as “OK” and “I don't know”. In daily dialogue, every utterance usually has obvious emotional and intentional tendencies. This paper therefore proposes a response generation model based on dialogue constraints: building on the Seq2Seq model, it jointly recognizes the topic, sentiment and intent of utterances. The method constrains the topic, sentiment and intent of the generated responses, producing responses with reasonable sentiment and intent tendencies that are closely related to the topic of the conversation. Experiments show that the proposed method effectively improves the quality of generated responses.
  • Natural Language Understanding and Generation
    ZENG Biqing, PEI Fenghua, XU Mayi, DING Meirong
    2022, 36(8): 154-162,174.
    Paragraph-level question generation aims to generate one or more related questions from a given paragraph. Current sequence-to-sequence neural network approaches fail to filter out redundant information or focus on key sentences. To solve this issue, this paper proposes a dual-attention model for paragraph-level question generation. The model first applies attention mechanisms to the paragraph and to the sentence where the answer is located. Then it uses a gating mechanism to dynamically assign weights and merge the two into context information. Finally, it improves the pointer-generator network to combine the context vector and the attention distribution to generate questions. Experimental results show that this model outperforms existing models on the SQuAD dataset.
  • NLP Application
    SONG Li, LIU Ying, MA Yanjun
    2022, 36(8): 163-174.
    It is a classical dispute whether Nai'an Shi authored Water Margin alone, and whether he had any partnership with Guanzhong Luo. We first collect the controversies over the authorship of Water Margin and summarize the following five hypotheses: (1) written by Shi alone; (2) written by Luo alone; (3) originally written by Shi and continued by Luo; (4) originally written by Luo and continued by someone else; (5) originally written by Shi and adapted by Luo. Taking Luo's Quelling the Demons' Revolt as a reference, we then investigate the writing style of Water Margin via hypothesis testing, text clustering, text classification, rolling stylometry, and text content analysis. The results suggest that the fourth hypothesis is most probably true: the first 70 sections were written by Luo and the rest were written by others.
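Rolling stylometry, one of the methods named in this abstract, slides a window over consecutive sections of a text and computes a stylistic profile (commonly function-word frequencies) per window; an authorship change shows up as a shift in these profiles. A stdlib-only sketch; the `rolling_function_word_profile` helper, the whitespace tokenization and the parameter values are illustrative assumptions, not the paper's procedure:

```python
def rolling_function_word_profile(chapters, function_words, window=5, step=1):
    """Compute a rolling stylistic profile over consecutive chapters.

    chapters: list of chapter texts (tokenized here by whitespace,
    for illustration only; Chinese text would need real segmentation).
    Returns (start_index, {function_word: relative_frequency}) pairs,
    one per window position.
    """
    profiles = []
    for start in range(0, len(chapters) - window + 1, step):
        tokens = " ".join(chapters[start:start + window]).split()
        total = len(tokens) or 1
        profiles.append(
            (start, {w: tokens.count(w) / total for w in function_words})
        )
    return profiles
```

Comparing each window's profile against reference profiles of the candidate authors (e.g. from Quelling the Demons' Revolt) is what lets the method localize a stylistic break such as the one reported around chapter 70.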