Journal of Chinese Information Processing

Language Analysis and Calculation

Abstract Meaning Representation Based Annotation and Analysis of Chinese Construction

HUANG Tong, LI Bin, YAN Peiyi, DAI Yuling, QU Weiguang

2020, 34(10): 1-9,18.

As a structure with no direct correspondence to its literal meaning, a construction differs considerably from regular sentences and strongly affects parser accuracy. To facilitate the automatic analysis of constructions, a construction corpus is needed for studying their internal structure. In this paper, Abstract Meaning Representation (AMR) is used to annotate the semantic structure of constructions. Across 1,057 annotated constructions, 61.2% can be described by the principle of compositionality in Chinese AMR. For the remaining 38.8%, which fall outside the principle of compositionality (lack of concepts, components that are difficult to separate, and rhetorical meaning that is difficult to express), this paper proposes either labeling the whole structure as a word or annotating only its surface meaning. The completed Chinese construction corpus provides data for both theoretical study and automatic analysis of the meaning of constructions.

Language Analysis and Calculation

AMR-to-Text Generation Based on Transformer

ZHU Jie, LI Junhui

2020, 34(10): 10-18.

Abstract Meaning Representation to text (AMR-to-Text) generation is the task of generating text with the same meaning as a given AMR graph. This task can be viewed as a translation task from the source AMR graph to the target sentence. In contrast to the state-of-the-art graph-to-sequence (graph2seq) models, this paper proposes a direct and effective AMR-to-Text generation method based on the Transformer under the seq2seq framework. Byte pair encoding (BPE) and a shared vocabulary are introduced to deal with the out-of-vocabulary (OOV) issue. On two English benchmark datasets, the experimental results show that the proposed method achieves the best performance among those reported in the literature.
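The abstract names byte pair encoding without describing it. As a rough illustration only, the following is a minimal sketch of how BPE merge rules are typically learned from word counts. The function name `learn_bpe`, the end-of-word marker `</w>` and the tie-breaking rule are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merge rules from a {word: count} mapping."""
    # Start from single characters plus an assumed end-of-word marker "</w>".
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        merged_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + freq
        vocab = merged_vocab
    return merges, vocab
```

Rare words end up split into frequent subword units, which is why BPE helps with out-of-vocabulary tokens. A shared vocabulary would correspond to learning the merges over source and target text together, although the abstract does not say how the authors set this up.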

Language Analysis and Calculation

Difficulties in Annotating the Discourse Dependency Structure of Chinese Texts and the Solution

FENG Wenhe, XU Yuyi, LI Qingchun

2020, 34(10): 19-26.

The discourse dependency structure is generally expressed as the dominant relationship between the minimum discourse units (clauses). Compared with the rhetorical structure, it can effectively depict the direct relationship between the minimum discourse units and its concentric nature. Based on the practice of annotating a Chinese discourse dependency structure corpus, this paper focuses on the annotation difficulties and their possible solutions, covering clause segmentation, clause relevance, dependency head and other important analysis tasks. These difficulties also challenge automatic annotation. The solutions benefit both the construction of high-quality corpora and research on parsing.

Language Analysis and Calculation

A Fine-tuning Method Based on Tongyi Cilin and Pre-trained Word Embedding

SHE Qixing, WANG Bicong, LIU Ming, QIN Bing, WANG Lifeng

2020, 34(10): 27-32.

Synonym discovery is a typical task in natural language processing that aims to predict whether a given word is a synonym of another word. With the recent emergence of pre-trained word embeddings, a simple and effective distributional approach is available that exploits the similarity between word embeddings. To further incorporate external knowledge such as synonym tuples, this paper proposes a word embedding fine-tuning approach based on the synonym tuples in Tongyi Cilin, so as to enhance the semantic representation of the embeddings. Our experiments show that this approach is effective for predicting synonyms.
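The abstract does not specify the fine-tuning objective. The sketch below is therefore not the authors' method. It shows one simple, generic way to inject synonym pairs into pre-trained vectors: repeatedly nudging the two vectors of each pair toward each other, then comparing cosine similarity before and after. The names `cosine` and `pull_synonyms_together`, the `rate` and `epochs` parameters, and the toy vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pull_synonyms_together(embeddings, synonym_pairs, rate=0.1, epochs=10):
    """Return a copy of the embeddings with synonym pairs moved closer."""
    # Copy so that the pre-trained vectors are left untouched.
    tuned = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(epochs):
        for a, b in synonym_pairs:
            va, vb = tuned[a], tuned[b]
            for i in range(len(va)):
                # Move both vectors a small step toward each other.
                step = rate * (vb[i] - va[i])
                va[i] += step
                vb[i] -= step
    return tuned


# Toy vectors (hypothetical values, not real embeddings).
pretrained = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [-1.0, 0.0]}
tuned = pull_synonyms_together(pretrained, [("happy", "glad")])
```

Words that appear in no synonym tuple keep their original vectors, so the update only changes the parts of the space covered by the lexicon.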

Ethnic Language Processing and Cross Language Processing

Construction of Knowledge Base of Semantic Similar Tibetan Words Based on Word Vectors

LONG Congjun, ZHOU Maoke, LIU Huidan

2020, 34(10): 33-38,50.

Word vectors play an important role in various fields of natural language processing. This paper tries to reveal the relationship between word vector technology and linguistic theory. Based on the features of word vectors, this paper proposes an approach to constructing a knowledge base of semantically similar Tibetan words. Based on the Chinese Cilin dictionary published by Harbin Institute of Technology, we compute the differences between every word vector and the average word vector of its atomic word group. With the help of a Chinese-Tibetan bilingual dictionary, we use these differences to select similar words from the word vectors of Tibetan words and Tibetan syllables, respectively. A comparison with manual verification shows that the results of word vector computing are consistent with human language intuition. This approach may improve the efficiency of constructing a Tibetan knowledge base of semantically similar words.

Information Extraction and Text Mining

Research on Categorization of Events Based on Event Attributes

WANG Ya, CAO Cungen

2020, 34(10): 39-50.

Aiming at the categorization of events with procedure, we propose an event semantic categorization method based on event attributes. We extract the characteristic attributes of an event from its definition and assign a weight to each characteristic attribute. We adopt frame semantics to represent an event, which consists of characteristic attributes and private attributes. This paper uses the class of "dissemination events" as an example to demonstrate our categorization method. We show that a clear semantic categorization structure of events can be obtained with this method. We use description logics to formalize the events and the relationships between them. With this event classification system, we can effectively acquire commonsense knowledge related to event attributes.

Information Extraction and Text Mining

Named Entity Recognition Based on JCWA-DLSTM for Legal Instruments

WANG Dexian, WANG Suge, PEI Wensheng, LI Deyu

2020, 34(10): 51-58.

Correctly extracting entities such as the evidence name, proof content and file number from a judgment can improve case handling. However, unlike common entities, these entities are multi-character and strongly correlated. Therefore, this paper proposes a legal entity recognition method based on JCWA-DLSTM. The proposed method uses a character-level language model to obtain the word-level representation. At the same time, a self-attention mechanism is used to calculate the weight of each word in the input sentence, so as to obtain the internal representation of the sentence. On this basis, a bidirectional LSTM is adopted to encode the concatenation of the internal representation of the sentence with the word vector and the concatenation of the character-level vector. Finally, the semantic representation of the sentence is decoded by a CRF to obtain the optimal tag sequence. The experimental results show that the proposed method can effectively improve the recognition results.

Information Extraction and Text Mining

Mine Disaster Event Detection Model Based on A Novel Convolutional Neural Network

LIU Peng, WEI Huizi, LU Xiaolong, LIU Mingming

2020, 34(10): 59-68.

Event detection (ED) is one of the core tasks in natural language processing, with state-of-the-art solutions based on Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models. By combining the Iterated Dilated CNN (IDCNN) and the Highway Network (HN), this paper proposes the Highway Iterated Dilated CNN (HIDCNN). Meanwhile, a mixed feature construction method is presented. The experimental results indicate that the proposed model achieves superior detection performance, better convergence and higher training efficiency.
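The two building blocks named in the abstract can be illustrated in isolation. The sketch below shows a 1-D dilated convolution and a highway gate of the standard form y = t * h + (1 - t) * x, with t = sigmoid(g). How the paper actually wires these into HIDCNN is not stated in the abstract, so the function names, the zero padding and the toy inputs are assumptions for illustration only.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with zero padding; output length equals len(x)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Kernel taps are spaced `dilation` positions apart.
            j = i + (k - center) * dilation
            if 0 <= j < len(x):
                total += weight * x[j]
        out.append(total)
    return out


def highway(x, transformed, gate_logits):
    """Blend the transformed signal with the raw input through a sigmoid gate."""
    out = []
    for xi, hi, g in zip(x, transformed, gate_logits):
        t = 1.0 / (1.0 + math.exp(-g))  # transform gate in (0, 1)
        out.append(t * hi + (1.0 - t) * xi)  # carry gate is 1 - t
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
features = dilated_conv1d(signal, [1.0, 0.0, 1.0], dilation=2)
mixed = highway(signal, features, [0.0] * len(signal))
```

Increasing the dilation widens the receptive field without adding parameters, and the gate lets a layer pass its input through unchanged when the transformation is not useful, which is commonly credited with easing the training of deeper stacks.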

Information Extraction and Text Mining

An Approach to Document-Level Event Factuality Identification Based on Sentence-Level Representations

ZHANG Liumin, ZHANG Yun, LI Peifeng

2020, 34(10): 69-75,84.

Event factuality denotes the factual nature of events in texts, indicating whether an event is a fact, a possibility, or an impossible situation. Event factuality identification is an important semantic task in natural language processing, but existing studies focus on the sentence level. Based on a convolutional neural network, this paper proposes a document-level event factuality identification approach that introduces sentence-level features in the text, including the semantics, grammar and clues of each sentence. Experimental results on both the Chinese and English corpora show that 1) the micro-average F₁ is increased by 3.51% and 6.02%, respectively; 2) the macro-average F₁ is increased by 4.63% and 9.97%, respectively. The training speed of this method is also four times faster than the baseline.
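Because the results are reported as both micro- and macro-averaged F₁, a small worked sketch of the difference may help. It assumes single-label classification; the function name `f1_scores` and the toy labels are made up for this example and are unrelated to the paper's data.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 written as 2TP / (2TP + FP + FN); defined as 0 when undefined.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro: pool the counts over classes, then compute one F1.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy labels: "fact" is the frequent class, "possible" the rare one.
gold = ["fact", "fact", "fact", "possible"]
pred = ["fact", "fact", "possible", "possible"]
micro_f1, macro_f1 = f1_scores(gold, pred)
```

Micro-averaging is dominated by frequent classes, while macro-averaging weights every class equally, so the two can move by different amounts when a method mainly helps rare factuality values.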

Information Retrieval and Question Answering

Unsupervised Question Template Extraction Based on Improved Apriori Algorithm

KE Wenjun, GAO Jinhua, SHEN Huawei, LIU Yue, CHENG Xueqi

2020, 34(10): 76-84.

For domain-specific question answering (QA) systems, question retrieval via template matching has proved to be effective and stable. However, existing template extraction methods usually work in a supervised manner, resulting in heavy dependence on manually annotated data and poor extensibility across domains. To address this issue, this paper proposes an unsupervised template extraction method based on an improved Apriori algorithm. For given samples of question utterances, the frequently occurring phrases are first extracted in order as frame words of candidate templates. The information contained in candidate templates is measured via TF-IDF, and candidates with low information are filtered out. In particular, to allow longer templates, an adaptive updating mechanism for the support threshold is proposed. Finally, NER methods are adopted to locate slots, and question templates are obtained by combining frame words and the corresponding slots. Experimental results show that our method can effectively extract question templates for specific domains and obtain better results than baseline models.
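The first step described above, extracting frequent ordered phrases, can be sketched with plain Apriori-style pruning: a phrase of length n is counted only if both of its length n-1 sub-phrases were frequent. This is a simplified illustration, not the paper's improved algorithm. It uses a fixed support threshold instead of the adaptive one, and it omits the TF-IDF filtering and NER slot steps. The name `frequent_phrases`, whitespace tokenization and the sample questions are assumptions.

```python
from collections import Counter


def frequent_phrases(questions, min_support, max_len=4):
    """Return {phrase_tuple: support} for contiguous phrases seen often enough."""
    tokenized = [q.split() for q in questions]
    frequent = {}
    previous = None
    for n in range(1, max_len + 1):
        counts = Counter()
        for tokens in tokenized:
            seen = set()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Apriori pruning: skip candidates with an infrequent sub-phrase.
                if n > 1 and (gram[:-1] not in previous or gram[1:] not in previous):
                    continue
                seen.add(gram)
            # Support counts questions, not occurrences, so dedupe per question.
            counts.update(seen)
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        previous = level
    return frequent


samples = [
    "how to reset password",
    "how to reset router",
    "where to buy router",
]
phrases = frequent_phrases(samples, min_support=2)
```

Here `("how", "to", "reset")` survives as a candidate frame, while the varying last word would be a natural slot position. With a fixed threshold, longer phrases tend to fall below the support cutoff, which is the limitation the paper's adaptive threshold is meant to address.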

NLP Application

Medical Knowledge Graph Based Auxiliary Diagnosis of Complications

LIU Kan, ZHANG Yaquan

2020, 34(10): 85-93,104.

Aiming at the accurate and rapid diagnosis of complications, this paper proposes an auxiliary diagnosis model based on a knowledge graph, a representation model and a deep neural network. First, a medical knowledge graph is constructed, and each entity and relation in it is represented by a vector. Then, according to the chief complaints of the patients, the symptom entities are detected and likewise represented by vectors. Finally, these two kinds of vectors are fed into the CNN-DNN classification model jointly with the index representation to diagnose the complications. The experiments cover three complications of diabetes: hypertension, diabetic nephropathy and diabetic retinopathy. Compared with classical machine learning methods, the accuracy of the proposed model is improved by 5%, 5% and 14%, respectively, and it is 27%, 6% and 9% higher than that of the previous DNN model.

NLP Application

A Comparative Study on Three Genres of Lu Xun Based on Multidimensional Analysis

FAN Chulin, LIU Ying

2020, 34(10): 94-104.

We collected 376 linguistic features from Lu Xun's letters, fiction and essays, and used random forests and k-means clustering to select 58 features that could effectively distinguish the three genres. We used Biber's multidimensional analysis to perform factor analysis on these features and extracted 7 important factors. Based on the linguistic features with significant factor loadings, we interpreted 4 factors as dimensions and 3 factors as feature combinations. The results show that letters and fiction are similar in interaction, while letters tend to be more argumentative, classical and detailed, and fiction tends to be more descriptive, colloquial and brief. Letters and essays are similar in argumentation and detailed structure, while letters tend to be more interactive. Fiction and essays lack similar dimensions.

NLP Application

Intrusion Detection Model Based on Principle Component Analysis and Recurrent Neural Network

LIU Jinghao, SUN Xiaowei, JIN Jie

2020, 34(10): 105-112.

To address the high feature dimensionality of network data in intrusion detection, this paper proposes an intrusion detection method based on PCA (principal component analysis) and RNN (recurrent neural network). PCA is used to perform feature dimension reduction and noise reduction on the data, identifying the subset of principal component features that carries the most information. An RNN is then used to classify the processed data. Experiments on the NSL-KDD data set show that the proposed intrusion detection algorithm can effectively improve detection accuracy compared with popular intrusion detection techniques based on machine learning and deep learning methods.
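The PCA step can be illustrated without any library. The sketch below finds only the first principal component by power iteration on the sample covariance matrix and projects the data onto it. This is a teaching-sized stand-in, not the paper's pipeline: a real system would keep several components and use a numerical library. The function names, the iteration count and the toy rows are assumptions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (feature_means, unit_vector) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the dominant
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Reduce each row to a single score along the given component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]


# Toy data lying exactly on the line y = 2x (hypothetical values).
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
```

In this toy case the two correlated features collapse into one score per row with no loss of information, which is the effect PCA is relied on for before the reduced features are passed to a classifier.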

2020 Volume 34 Issue 10 Published: 23 November 2020