2020 Volume 34 Issue 10 Published: 23 November 2020
  

  • Language Analysis and Calculation
    HUANG Tong, LI Bin, YAN Peiyi, DAI Yuling, QU Weiguang
    2020, 34(10): 1-9,18.
    As structures whose meaning does not correspond directly to their literal form, constructions differ considerably from regular sentences and strongly affect parser accuracy. To facilitate the automatic analysis of constructions, it is necessary to build a construction corpus for studying their internal structure. In this paper, Abstract Meaning Representation (AMR) is used to annotate the semantic structure of constructions. Based on 1,057 annotated constructions, it is found that 61.2% of constructions can be described by the principle of compositionality in Chinese AMR. For the remaining 38.8%, which fall outside the principle of compositionality (missing concepts, components that are difficult to separate, and rhetorical meaning that is difficult to express), this paper proposes either labeling the whole structure as a single word or annotating only its surface meaning. The completed Chinese construction corpus provides data for both theoretical study and automatic analysis of construction meaning.
  • Language Analysis and Calculation
    ZHU Jie, LI Junhui
    2020, 34(10): 10-18.
    Abstract Meaning Representation to text (AMR-to-Text) generation is the task of generating text with the same meaning as a given AMR graph. This task can be viewed as translation from a source AMR graph to a target sentence. In contrast to the state-of-the-art graph-to-sequence (graph2seq) models, this paper proposes a direct and effective AMR-to-Text generation method based on the Transformer under the seq2seq framework. Byte pair encoding (BPE) and a shared vocabulary are introduced to deal with the out-of-vocabulary (OOV) issue. On two English benchmark datasets, experimental results show that the proposed method achieves the best performance among those reported in the literature.
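As a rough illustration of how byte pair encoding reduces out-of-vocabulary tokens, the sketch below learns merge rules from a toy word-frequency table. It is a minimal, generic BPE learner and not the authors' implementation; the function name `learn_bpe`, the `</w>` end-of-word marker and the toy data are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} mapping (toy version)."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy usage: hypothetical word counts, two merge operations.
merges, vocab = learn_bpe({"ab": 3, "abc": 2, "b": 1}, 2)
print(merges)
```

Rare words end up split into frequent subword units, so a model with a shared subword vocabulary rarely meets a token it has never seen.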
  • Language Analysis and Calculation
    FENG Wenhe, XU Yuyi, LI Qingchun
    2020, 34(10): 19-26.
    The discourse dependency structure is generally expressed as dominance relations between minimal discourse units (clauses). Compared with rhetorical structure, it can effectively depict the direct relations between minimal discourse units and their concentric nature. Based on the practice of annotating a Chinese discourse dependency structure corpus, this paper analyzes the annotation difficulties and their possible solutions, covering clause segmentation, clause relevance, dependency head selection and other important analysis tasks. These difficulties also challenge automatic annotation. The solutions benefit both the construction of high-quality corpora and research on parsing.
  • Language Analysis and Calculation
    SHE Qixing, WANG Bicong, LIU Ming, QIN Bing, WANG Lifeng
    2020, 34(10): 27-32.
    Synonym discovery is a typical task in natural language processing that aims to predict whether a given word is a synonym of another word. With the recent emergence of pre-trained word embeddings, a simple and effective distribution-based approach is to exploit the similarity between word embeddings. To further incorporate external knowledge such as synonym tuples, this paper proposes a word embedding fine-tuning approach based on synonym tuples in Tongyi Cilin, so as to enhance the semantic representation of the embeddings. Our experiments show that this approach is effective for predicting synonyms.
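The similarity-based baseline mentioned in this abstract can be sketched as follows. This is only an illustration of comparing embeddings by cosine similarity, not the paper's fine-tuning method; the three-dimensional toy vectors, the `is_synonym` helper and the 0.8 threshold are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_synonym(embeddings, w1, w2, threshold=0.8):
    """Predict synonymy when the similarity reaches a (hypothetical) threshold."""
    return cosine(embeddings[w1], embeddings[w2]) >= threshold


# Toy embeddings invented for the example; real ones would be pre-trained.
toy = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}
print(is_synonym(toy, "happy", "glad"), is_synonym(toy, "happy", "table"))
```

Fine-tuning on synonym tuples would aim to pull the vectors of known synonyms closer together so that a check of this kind becomes more reliable.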
  • Ethnic Language Processing and Cross Language Processing
    LONG Congjun, ZHOU Maoke, LIU Huidan
    2020, 34(10): 33-38,50.
    Word vectors play an important role in various fields of natural language processing. This paper tries to reveal the relationship between word vector technology and linguistic theory. Based on the features of word vectors, it proposes an approach to constructing a knowledge base of semantically similar Tibetan words. Using the Chinese Cilin dictionary published by Harbin Institute of Technology, we compute the difference between each word vector and the average word vector of its atomic word group. With the help of a Chinese-Tibetan bilingual dictionary, we use these differences to select similar words from word vectors trained on Tibetan words and on Tibetan syllables, respectively. Comparison with manual verification shows that the results of word vector computation are consistent with human language intuition. This approach may improve the efficiency of constructing a Tibetan knowledge base of semantically similar words.
  • Information Extraction and Text Mining
    WANG Ya, CAO Cungen
    2020, 34(10): 39-50.
    To categorize events that involve procedures, we propose an event semantic categorization method based on event attributes. We extract the characteristic attributes of an event from its definition and assign a weight to each characteristic attribute. We adopt frame semantics to represent an event, which consists of characteristic attributes and private attributes. This paper uses the class of "dissemination events" as an example to demonstrate the categorization method, and shows that it yields a clear semantic categorization structure of events. We use description logics to formalize the events and the relationships between them. With this event classification system, commonsense knowledge related to event attributes can be acquired effectively.
  • Information Extraction and Text Mining
    WANG Dexian, WANG Suge, PEI Wensheng, LI Deyu
    2020, 34(10): 51-58.
    Correctly extracting entities such as the evidence name, proof content and file number from a judgment can improve case handling. However, unlike common entities, these entities are long (multi-character) and strongly correlated. Therefore, this paper proposes a legal entity recognition method based on JCWA-DLSTM. The proposed method uses a character-level language model to obtain word-level representations. At the same time, a self-attention mechanism is used to calculate the weight of each word in the input sentence, so as to obtain the internal representation of the sentence. On this basis, a bidirectional LSTM is adopted to encode the concatenation of the internal sentence representation with the word vector and the character-level vector. Finally, the semantic representation of the sentence is decoded by a CRF to obtain the optimal tag sequence. The experimental results show that the proposed method can effectively improve the recognition results.
  • Information Extraction and Text Mining
    LIU Peng, WEI Huizi, LU Xiaolong, LIU Mingming
    2020, 34(10): 59-68.
    Event detection (ED) is one of the core tasks in natural language processing, and state-of-the-art solutions are based on Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models. By combining the Iterated Dilated CNN (IDCNN) and the Highway Network (HN), this paper proposes the Highway Iterated Dilated CNN (HIDCNN). A mixed feature building method is also presented. The experimental results indicate that the proposed model achieves better detection performance, better convergence and higher training efficiency.
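For readers unfamiliar with highway networks, the sketch below shows the gating idea in isolation: the output is y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid gate. It uses a dense tanh transform as a stand-in for H; in HIDCNN the transform would presumably be a dilated convolution, which is not reproduced here. All names and weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x, b):
    """Compute W @ x + b for a list-of-lists matrix and list vectors."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway(x, W_h, b_h, W_t, b_t):
    """One highway layer: gate t mixes the transform h with the raw input x."""
    h = [math.tanh(z) for z in matvec(W_h, x, b_h)]   # candidate transform H(x)
    t = [sigmoid(z) for z in matvec(W_t, x, b_t)]     # transform gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# With a strongly negative gate bias the layer simply carries its input through.
x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(highway(x, zeros, [0.0, 0.0], zeros, [-100.0, -100.0]))
```

The carry path is what lets gradients bypass the transform, which is one plausible reason for the better convergence the abstract reports.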
  • Information Extraction and Text Mining
    ZHANG Liumin, ZHANG Yun, LI Peifeng
    2020, 34(10): 69-75,84.
    Event factuality denotes the factual nature of events in texts, indicating whether an event is a fact, a possibility, or an impossible situation. Although it is an important semantic task in natural language processing, existing studies on event factuality identification focus on the sentence level. Based on a convolutional neural network, this paper proposes a document-level event factuality identification method that introduces sentence-level features from the text, including the semantics, grammar and clues of each sentence. Experimental results on both Chinese and English corpora show that 1) the micro-average F1 is increased by 3.51% and 6.02%, respectively; 2) the macro-average F1 is increased by 4.63% and 9.97%, respectively. The training speed of this method is also four times faster than the baseline.
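The two reported metrics differ in how they aggregate over classes: micro-averaging pools the counts of all classes before computing F1, while macro-averaging computes F1 per class and takes the unweighted mean. The sketch below illustrates this on invented labels; the label names and data are assumptions and have no connection to the paper's corpora.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro-F1, macro-F1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    counts = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / len(counts)
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro


# Hypothetical factuality labels for five events.
gold = ["fact", "fact", "fact", "counterfact", "possible"]
pred = ["fact", "fact", "counterfact", "counterfact", "fact"]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

Because rare classes weigh as much as frequent ones in the macro average, a larger macro-F1 gain usually points to improvement on the minority factuality values.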
  • Information Retrieval and Question Answering
    KE Wenjun, GAO Jinhua, SHEN Huawei, LIU Yue, CHENG Xueqi
    2020, 34(10): 76-84.
    For domain-specific question answering (QA) systems, question retrieval via template matching has proved effective and stable. However, existing template extraction methods usually work in a supervised manner, resulting in heavy dependence on manually annotated data and poor extensibility across domains. To address this issue, this paper proposes an unsupervised template extraction method based on an improved Apriori algorithm. Given samples of question utterances, frequently occurring phrases are first extracted, in order, as the frame words of candidate templates. The information contained in candidate templates is measured via TF-IDF, and candidates with low information are filtered out. In particular, to allow longer templates, an adaptive updating mechanism for the support threshold is proposed. Finally, NER methods are adopted to locate slots, and question templates are obtained by combining frame words with the corresponding slots. Experimental results show that our method can effectively extract question templates for specific domains and obtains better results than baseline models.
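The frame-word extraction step can be pictured with a plain Apriori-style search for frequent contiguous phrases, sketched below. This is the textbook level-wise idea with a fixed support threshold, not the paper's improved algorithm: the adaptive threshold, TF-IDF filtering and NER slot filling are omitted, and `frequent_phrases`, `min_support`, `max_len` and the sample questions are assumptions.

```python
from collections import Counter


def frequent_phrases(questions, min_support, max_len=4):
    """Find contiguous word n-grams occurring in at least min_support questions."""
    tokenized = [q.split() for q in questions]
    frequent = {}
    prev = None  # frequent (n-1)-grams from the previous level
    for n in range(1, max_len + 1):
        counts = Counter()
        for tokens in tokenized:
            seen = set()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Apriori pruning: both (n-1)-gram sub-phrases must be frequent.
                if prev is not None and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                seen.add(gram)
            counts.update(seen)  # each question counts at most once per phrase
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        prev = set(level)
    return frequent


# Hypothetical question utterances.
questions = [
    "how do i reset my password",
    "how do i change my email",
    "where is my order",
]
for phrase, support in sorted(frequent_phrases(questions, 2).items()):
    print(" ".join(phrase), support)
```

With a fixed threshold, support can only shrink as phrases grow, which is why long templates tend to be lost; this is the limitation the adaptive support threshold in the abstract is meant to address.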
  • NLP Application
    LIU Kan, ZHANG Yaquan
    2020, 34(10): 85-93,104.
    Aiming at the accurate and rapid diagnosis of complications, this paper proposes an auxiliary diagnosis model based on a knowledge graph, a representation model and a deep neural network. First, a medical knowledge graph is constructed, in which each entity and relation is represented by a vector. Then, symptom entities are detected from the patients' chief complaints and are likewise represented by vectors. Finally, the two kinds of vectors are fed, jointly with the index representation, into the CNN-DNN classification model to diagnose the complications. The experiments cover three complications of diabetes: hypertension, diabetic nephropathy and diabetic retinopathy. The accuracy of the proposed model is 5%, 5% and 14% higher, respectively, than that of classical machine learning methods, and 27%, 6% and 9% higher than that of the previous DNN model.
  • NLP Application
    FAN Chulin, LIU Ying
    2020, 34(10): 94-104.
    We collected 376 linguistic features from Lu Xun's letters, fiction and essays and used random forests and k-means clustering to select 58 features that could effectively distinguish the three genres. We used Biber's multidimensional analysis to perform factor analysis on these features and extracted 7 important factors. Based on the linguistic features with significant factor loadings, we interpreted 4 factors as dimensions and 3 factors as feature combinations. The results show that letters and fiction are similar in interaction, while letters tend to be more argumentative, classical and detailed, and fiction tends to be more descriptive, colloquial and brief. Letters and essays are similar in argumentation and detailed structure, while letters tend to be more interactive. Fiction and essays share no similar dimensions.
  • NLP Application
    LIU Jinghao, SUN Xiaowei, JIN Jie
    2020, 34(10): 105-112.
    To address the high feature dimensionality of network data in intrusion detection, this paper proposes an intrusion detection method based on PCA (principal component analysis) and RNN (recurrent neural network). PCA is used to reduce the feature dimension and the noise of the data, yielding the subset of principal component features that carries the most information. An RNN is then used to classify the processed data. Experiments on the NSL-KDD data set show that the proposed intrusion detection algorithm effectively improves detection accuracy compared with popular intrusion detection techniques based on machine learning and deep learning.