2020 Volume 34 Issue 11 Published: 09 December 2020
  

  • Language Resources Construction
    XING Dan, RAO Gaoqi, XUN Endong, WANG Chengwen
    2020, 34(11): 1-8.
    Prepositional structures are of great significance to natural language processing and language teaching research. This paper constructs a high-quality collocation base of prepositional structures from a large-scale corpus. First, we determine the classification scheme for prepositions and collect preposition collocations. Then, we design and acquire rules for prepositional structure collocations from large-scale data. Finally, we test and analyze the extracted results.
  • Language Resources Construction
    LIU Lei, HE Ben, SUN Le
    2020, 34(11): 9-18,48.
    Reading ancient Chinese poems of appropriate difficulty benefits readers' literary appreciation skills. However, the automatic analysis of the readability of ancient Chinese poetry has received little attention owing to the lack of a large-scale, high-quality corpus. This paper provides a collection of 1 915 ancient Chinese poems with manually annotated readability levels. We provide three readability classes in the initial APRD dataset. We then refine the readability levels at two granularities and provide the APRD+ dataset with six readability classes. The Spearman correlations of the gold standard with APRD and APRD+ are 0.786 and 0.804, respectively. The SVM and random forest algorithms are applied to classify the difficulty levels of the poems, and the experimental results are provided.
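The agreement figures in the abstract above are Spearman rank correlations. As a minimal sketch of how such a coefficient is computed (Pearson correlation over average ranks), the following uses only the standard library. The function names and the six sample readability levels are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical readability levels for six poems (not data from the paper).
gold = [1, 1, 2, 2, 3, 3]
annotated = [1, 2, 2, 2, 3, 3]
print(round(spearman(gold, annotated), 3))
```

Average ranks are used because readability levels are heavily tied: with only three or six classes, many poems share the same level.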
  • Language Resources Construction
    XIAO Dan, YANG Erhong, ZHANG Minghui, LU Tianying, YANG Liner
    2020, 34(11): 19-28,36.
    Chinese interlanguage arises alongside international Chinese education. With the growth of Chinese language learning around the world, the scale of Chinese interlanguage has been expanding. Because of its distinctive usage, it has become a unique resource for language information processing and intelligent language learning assistance. English interlanguage already has corpora annotated with dependency grammar, whereas current Chinese interlanguage corpora do not even have an annotation guideline for dependency syntax. Aiming to construct a dependency-annotated corpus of Chinese interlanguage, this paper develops a new dependency annotation guideline for Chinese interlanguage based on Universal Dependencies. Finally, a corpus of Chinese interlanguage annotated with dependency structure is built, taking the characteristics of interlanguage into account.
  • Knowledge Representation and Acquisition
    LIU Junnan, LIU Haiyan, CHEN Xiaohui, GUO Xuan, ZHU Xinming
    2020, 34(11): 29-36.
    With the rapid development of 3S technology, geo-spatial data is growing explosively. Constructing a knowledge graph from geo-spatial data, so as to realize the transformation from data to knowledge, has become an urgent scientific problem. In general knowledge graphs, geo-spatial knowledge is represented only through attributes or semantic relationships, and spatial relationships are missing. This paper first designs a representation method for spatial relationships. Then, it proposes a technical roadmap for knowledge graph construction based on spatial relationships, focusing on spatial relationship extraction and multi-source geographic data fusion. We also discuss the application direction of knowledge graphs in the geo-spatial field: promoting the integration of geo-spatial data and semantic web technologies.
  • Knowledge Representation and Acquisition
    LIU Xuli, JIN Jihao, RUAN Tong, GAO Daqi, YIN Yichao, GE Xiaoling
    2020, 34(11): 37-48.
    Clinical research based on observational data from electronic medical records has become a hot topic. In this paper, a new RDF-based representation model for medical observation data is proposed. The model can clearly represent multiple event types, such as clinical examination, diagnosis and treatment, as well as the temporal relationships between events. Based on electronic medical records from hospitals, clinical event graphs are constructed in four steps: data preprocessing, RDF format conversion, time sequence construction and knowledge fusion. Specifically, using the electronic medical records of three first-class hospitals in Shanghai, we constructed a medical dataset covering three specialties, with 173 395 medical events and 501 335 temporal relationships between events, linked to 5 313 concepts in the knowledge base. This paper further provides 40 sample queries for clinical retrospective research, including etiology analysis and treatment analysis, and contrasts them with a traditional database in terms of query formulation and retrieval process. The dataset follows the Open Link Standard and is published on OpenKG with an online SPARQL site (https://peg.ecustnlplab.com/dataset.html).
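To make the idea of typed events connected by temporal relationships more concrete, here is a minimal sketch that stores events as subject-predicate-object triples and follows temporal edges. It is an illustration only: the `ev:` and `ex:` identifiers, the `ex:before` predicate, and both helper functions are hypothetical and do not reflect the schema of the published dataset or any RDF library.

```python
# Triples are (subject, predicate, object); all identifiers are hypothetical.
triples = [
    ("ev:exam1", "rdf:type", "ex:Examination"),
    ("ev:diag1", "rdf:type", "ex:Diagnosis"),
    ("ev:treat1", "rdf:type", "ex:Treatment"),
    ("ev:exam1", "ex:before", "ev:diag1"),
    ("ev:diag1", "ex:before", "ev:treat1"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def events_after(triples, event):
    """Collect every event reachable from `event` through ex:before edges."""
    seen = []
    frontier = [event]
    while frontier:
        current = frontier.pop()
        for _, _, nxt in match(triples, s=current, p="ex:before"):
            if nxt not in seen:
                seen.append(nxt)
                frontier.append(nxt)
    return seen


print(events_after(triples, "ev:exam1"))
```

A retrospective question such as "which events followed this examination?" becomes a traversal over temporal edges, which is the kind of query that is awkward to phrase against flat relational tables.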
  • Knowledge Representation and Acquisition
    LIU Peng, YE Shuai, SHU Ya, LU Xiaolong, LIU Mingming
    2020, 34(11): 49-59.
    Driven by new network technologies such as big data and artificial intelligence, coal mining enterprises are moving beyond informatization into the era of intelligence. In this paper, the knowledge graph is introduced into the domain of coal mine safety. The domain knowledge concepts are first classified, stored in a graph database, and visualized together with their relations. Then, to facilitate querying this knowledge graph, a question classification approach is implemented to identify the best query type for a specific question. The experimental results show that the proposed entity extraction method achieves higher recall and precision, and that the Spark-based parallel question classification algorithm significantly improves efficiency while maintaining accuracy.
  • Machine Translation
    MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan
    2020, 34(11): 60-66.
    Bilingual parallel corpora are a key resource for improving the quality of machine translation. We propose a Chinese-Burmese parallel sentence pair extraction method based on a CNN-CorrNet network. Specifically, we first use BERT to obtain vector representations of Chinese and Burmese words, and use a convolutional neural network to represent Chinese and Burmese sentences and capture their important features. Then, to maximize the correlation between the cross-language representations of the two languages, existing Chinese-Burmese parallel sentence pairs are used as constraints, and CorrNet (Correlational Neural Networks) is applied to map the Chinese and Burmese sentence representations into a common semantic space. Finally, the distance between Chinese and Burmese sentences in the common semantic space is calculated to identify true bilingual sentence pairs. The experimental results show that, compared with the maximum entropy model and the Siamese network model, the F1 value of the proposed method increases by 13.3% and 5.1%, respectively.
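The final step described above, comparing sentences in a shared semantic space to pick parallel pairs, can be sketched as follows. This is a toy illustration under stated assumptions: cosine similarity stands in for the unspecified distance measure, the 0.9 threshold is arbitrary, and the two-dimensional vectors are made up rather than produced by BERT, a CNN, or CorrNet.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def extract_pairs(zh_vectors, my_vectors, threshold=0.9):
    """For each Chinese sentence, keep its most similar Burmese sentence
    if the similarity reaches the (assumed) threshold."""
    pairs = []
    for zh_id, zh_vec in zh_vectors.items():
        best_id, best_score = None, -1.0
        for my_id, my_vec in my_vectors.items():
            score = cosine_similarity(zh_vec, my_vec)
            if score > best_score:
                best_id, best_score = my_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((zh_id, best_id, best_score))
    return pairs


# Made-up sentence vectors that are assumed to live in the shared space already.
zh_vectors = {"zh1": [1.0, 0.0], "zh2": [0.0, 1.0], "zh3": [1.0, 1.0]}
my_vectors = {"my1": [0.9, 0.1], "my2": [0.1, 0.9]}

for zh_id, my_id, score in extract_pairs(zh_vectors, my_vectors):
    print(zh_id, my_id, round(score, 3))
```

In this toy data, `zh3` is equally far from both Burmese vectors and falls below the threshold, so it is left unpaired rather than forced into a wrong match.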
  • Machine Translation
    CIZHEN Jiacuo, SANGJIE Duanzhu, SUN Maosong, ZHOU Maoxian, SE Chajia
    2020, 34(11): 67-73,83.
    Tibetan-Chinese machine translation is one of the most important research topics in Tibetan natural language processing. Because the available Tibetan-Chinese parallel corpus is limited, this paper focuses on improving Tibetan-Chinese machine translation by addressing the low-resource issue. Based on the Transformer architecture, we apply an iterative back-translation strategy and an automatic translation filtering mechanism. In experiments with only 1.641M monolingual Tibetan sentences, we achieve improvements of 6.7 and 9.8 BLEU points over the baseline model, respectively.
  • Information Extraction and Text Mining
    LIU Yuhan, LIU Changjian, XU Ruifeng, LUO Wangda, CHEN Yi, JI Zhongsheng, YING Nengtao
    2020, 34(11): 74-83.
    To address Chinese named entity recognition in the finance domain, this paper presents a novel neural network model that combines glyph features and iterative learning. Based on the framework of bidirectional long short-term memory networks and conditional random fields, the model encodes the Wubi input codes of Chinese characters for information enhancement and uses iterative learning to continuously update the prediction results. We manually annotate a large-scale financial named entity corpus named HITSZ-Finance, which includes 31 210 sentences and 4 types of entities. Experimental results on the HITSZ-Finance corpus demonstrate the effectiveness of the model.
  • Information Extraction and Text Mining
    WU Gaobo, WANG Liming, CHAI Yumei, LIU Zhen
    2020, 34(11): 84-95.
    Text classification is a widely applied research focus. First, this paper optimizes the NMF-SVM classification method to address the lack of hierarchical features in text classification, yielding a hierarchical classification model. Second, to capture the different influences of keywords and non-keywords on the classification result, we introduce the SEAN attention mechanism to obtain the attention between different words and to detect four types of entities: time, place, person and event. Finally, to handle differences in the strength of sentence connections, an inter-sentence affinity model is proposed for news, novels, reading comprehension texts and Weibo posts, which are rich in the above four entity types. On the news data set, the model achieves better classification results than the deep learning text classification model and the hybrid model with an attention mechanism.
  • Information Extraction and Text Mining
    SONG Wei, ZHU Fuxin
    2020, 34(11): 96-103.
    Relation extraction aims to identify the relationships between entities in large amounts of unstructured data. This paper proposes a relation extraction method based on a global and local feature-aware network. The method first adopts a self-attention mechanism and a recurrent neural network to obtain the correlated sequence features of each word. Then, a multi-branch feature-aware convolutional neural network is constructed to obtain global and local features without mutual interference. Finally, the two sets of features are concatenated to fully represent the important semantic features of the sentence. Experimental results show that the proposed method outperforms state-of-the-art methods based on convolutional and recurrent neural networks, with F1 scores of 86.1% and 64.9% on the standard SemEval-2010 Task 8 and KBP37 datasets, respectively.
  • Question Answering and Dialogue System
    WANG Siyu, QIU Jiangtao, HONG Chuanyang, JIANG Ling
    2020, 34(11): 104-112.
    In general, a Question Answering System (QAS) for commodities is built mainly through intention identification and answer configuration. However, configuring answers to questions depends on manual labor, which easily results in poor answer quality. With the introduction and development of Knowledge Graph (KG) technology, KG-based QAS has gradually become a hot research topic. At present, KG-based QAS for commodities is mainly implemented by using rules to transform questions into queries over the KG. Although this reduces the manual configuration work, the performance of the QAS is limited by the quality and quantity of the rules. To solve these problems, this paper proposes a question answering method for online commodities based on KG and rule reasoning. The main contributions are: (1) we build an LSTM-based property attention network named SiameseATT (Siamese Attention Network) for attribute selection; (2) we use the KG to infer rules and thereby generate a large number of triples, so that more questions can be answered. Finally, experiments on the NLPCC-ICCPOL 2016 dataset show that the model performs well. Our QAS is more suitable for e-commerce applications.