Journal of Chinese Information Processing

Language Resources Construction

Large-scale Corpus Based Preposition Structure Collocation Base

XING Dan, RAO Gaoqi, XUN Endong, WANG Chengwen

2020, 34(11): 1-8.

Preposition structures are of great significance to natural language processing and language teaching research. This paper constructs a high-quality preposition structure collocation base from a large-scale corpus. First, we determine a classification scheme for prepositions and collect preposition collocations. Then, we design and acquire rules for prepositional structure collocations from large-scale data. Finally, we test and analyze the extracted results.

Language Resources Construction

A Corpus of Ancient Chinese Poetry Annotated with Readability

LIU Lei, HE Ben, SUN Le

2020, 34(11): 9-18,48.

Reading ancient Chinese poems of appropriate difficulty benefits readers' literary appreciation skills. However, the automatic analysis of the readability of ancient Chinese poetry has received little attention owing to the lack of a large-scale, high-quality corpus. This paper provides a collection of 1,915 ancient Chinese poems with manually annotated readability levels. We provide three readability classes in the initial APRD dataset. We then refine the readability levels at two granularities and provide the APRD+ dataset with six readability classes. The Spearman correlations of the gold standard with APRD and APRD+ are 0.786 and 0.804, respectively. SVM and random forest algorithms are applied to classify the difficulty levels of the poems, and the experimental results are reported.
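
The agreement figures above are Spearman rank correlations. As a minimal sketch of how such a coefficient can be computed, the following pure-Python code ranks two label sequences (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the sample labels are illustrative assumptions, not part of the APRD release.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical readability levels for five poems from two annotation sources.
gold_levels = [1, 1, 2, 3, 3]
annotated_levels = [1, 2, 2, 3, 3]
rho = spearman(gold_levels, annotated_levels)
```

Averaging tied ranks matters here because readability labels take only a few distinct values, so ties are the norm rather than the exception.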

Language Resources Construction

Dependency Annotation Guideline for Chinese Inter-language

XIAO Dan, YANG Erhong, ZHANG Minghui, LU Tianying, YANG Liner

2020, 34(11): 19-28,36.

Chinese inter-language has emerged alongside international Chinese education. With the growing development of Chinese language learning around the world, the scale of Chinese inter-language has been expanding. Owing to its distinctive usage, it has become a unique resource for language information processing and intelligent language-learning assistance. Whereas English inter-language already has corpora annotated with dependency grammar, current Chinese inter-language corpora lack even an annotation guideline for dependency syntax. Aiming to construct a dependency-annotated corpus of Chinese inter-language, this paper develops a new dependency annotation guideline for Chinese inter-language based on Universal Dependencies. Finally, a corpus of Chinese inter-language annotated with dependency structure is built with the characteristics of inter-language taken into account.

Knowledge Representation and Acquisition

Construction of Knowledge Graph Based on Geo-Spatial Data

LIU Junnan, LIU Haiyan, CHEN Xiaohui, GUO Xuan, ZHU Xinming

2020, 34(11): 29-36.

With the rapid development of 3S technology, geo-spatial data has grown explosively. Constructing a knowledge graph from geo-spatial data, so as to transform data into knowledge, has become an urgent scientific problem. In general-purpose knowledge graphs, geo-spatial knowledge is represented only through attributes or semantic relationships, and spatial relationships are missing. This paper first designs a representation method for spatial relationships. It then proposes a technical roadmap for knowledge graph construction based on spatial relationships, focusing on spatial relationship extraction and multi-source geographic data fusion. We also discuss an application direction for knowledge graphs in the geo-spatial field: promoting the integration of geo-spatial data and Semantic Web technologies.

Knowledge Representation and Acquisition

Construction of An Open Dataset for Clinical Event Graph

LIU Xuli, JIN Jihao, RUAN Tong, GAO Daqi, YIN Yichao, GE Xiaoling

2020, 34(11): 37-48.

Clinical research based on observational data from electronic medical records has become a hot topic. In this paper, a new RDF-based representation model for medical observation data is proposed. The model can clearly represent multiple event types, such as clinical examination, diagnosis and treatment, as well as the temporal relationships between events. Based on electronic medical records from hospitals, clinical event graphs are constructed in four steps: data preprocessing, RDF format conversion, time sequence construction and knowledge fusion. Specifically, using the electronic medical records of three first-class hospitals in Shanghai, we constructed a medical dataset that covers three specialties, contains 173,395 medical events and 501,335 temporal relationships between events, and links to 5,313 concepts in the knowledge base. This paper further provides 40 sample queries for clinical retrospective research, including etiology analysis and treatment analysis, and compares them with a traditional database in terms of query formulation and retrieval process. The dataset follows the Open Link Standard and is published on OpenKG with an online SPARQL site (https://peg.ecustnlplab.com/dataset.html).
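
To illustrate the idea of typed clinical events connected by temporal relationships, here is a minimal sketch that stores events as subject-predicate-object triples in plain Python and answers a retrospective question of the form "which treatments came after this diagnosis?". The identifiers and predicate names (`type`, `concept`, `before`) are hypothetical and are not taken from the published dataset or its RDF schema; a real deployment would query the SPARQL endpoint instead.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); the predicate names are
# illustrative and are not taken from the published dataset.
TRIPLES = [
    ("event:1", "type", "Diagnosis"),
    ("event:1", "concept", "concept:hypertension"),
    ("event:2", "type", "Examination"),
    ("event:3", "type", "Treatment"),
    ("event:3", "concept", "concept:drug_a"),
    ("event:1", "before", "event:2"),
    ("event:2", "before", "event:3"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        index[s][p].add(o)
    return index


def events_after(index, start):
    """Return every event reachable from `start` through `before` edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in index[node]["before"]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def later_events_of_type(index, start, event_type):
    """Events that follow `start` in time and carry the given type."""
    return sorted(
        e for e in events_after(index, start) if event_type in index[e]["type"]
    )


index = build_index(TRIPLES)
treatments = later_events_of_type(index, "event:1", "Treatment")
```

Following `before` edges transitively is what lets a single query span several intermediate events, which is awkward to express as joins over a flat relational table.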

Knowledge Representation and Acquisition

Coalmine Safety: Knowledge Graph Construction and Its QA Approach

LIU Peng, YE Shuai, SHU Ya, LU Xiaolong, LIU Mingming

2020, 34(11): 49-59.

Motivated by new network technologies such as big data and artificial intelligence, coal mining enterprises are moving beyond informatization into the era of intelligence. In this paper, a knowledge graph is introduced into the domain of coalmine safety. Domain concepts are first classified and stored in a graph database, and the relations between concepts are presented visually. Then, to facilitate querying over this knowledge graph, a question classification approach is implemented to identify the best query type for a given question. The experimental results show that the proposed entity extraction method achieves higher recall and precision, and that the Spark-based parallel question classification algorithm significantly improves efficiency while maintaining accuracy.

Machine Translation

Chinese-Burmese Parallel Sentence Pair Extraction Based on CNN-CorrNet

MAO Cunli, WU Xia, ZHU Junguo, YU Zhengtao, LI Yunlong, WANG Zhenhan

2020, 34(11): 60-66.

A bilingual parallel corpus is a key resource for improving the quality of machine translation. We propose a Chinese-Burmese parallel sentence pair extraction method based on a CNN-CorrNet network. Specifically, we first use BERT to obtain vector representations of Chinese and Burmese words, and use a convolutional neural network to represent Chinese and Burmese sentences so as to capture their important features. Then, to maximize the correlation between the cross-language representations of the two languages, existing Chinese-Burmese parallel sentence pairs are used as constraints, and CorrNet (Correlational Neural Networks) is applied to map the Chinese and Burmese sentence representations into a common semantic space. Finally, the distance between Chinese and Burmese sentences in the common semantic space is calculated to determine the true bilingual sentence pairs. The experimental results show that, compared with the maximum entropy model and the Siamese network model, the F₁ value of the proposed method is increased by 13.3% and 5.1%, respectively.
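
The final step, deciding which sentences are parallel from their positions in the common semantic space, can be sketched as follows. This assumes the sentence vectors have already been produced by some encoder, uses cosine similarity as the closeness measure, and applies a fixed threshold; the abstract specifies neither the distance function nor a threshold, so both are assumptions, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(zh_vectors, my_vectors, threshold=0.8):
    """Pair each Chinese sentence with its closest Burmese sentence.

    Both arguments map a sentence id to its vector in the shared space.
    A pair is kept only when its similarity reaches `threshold`, which is
    an illustrative value rather than one reported in the abstract.
    """
    pairs = []
    for zh_id, zh_vec in zh_vectors.items():
        best_id, best_sim = None, -1.0
        for my_id, my_vec in my_vectors.items():
            sim = cosine_similarity(zh_vec, my_vec)
            if sim > best_sim:
                best_id, best_sim = my_id, sim
        if best_id is not None and best_sim >= threshold:
            pairs.append((zh_id, best_id))
    return pairs
```

The threshold trades precision against recall: raising it discards more borderline candidates, which is the kind of trade-off an F₁ comparison summarizes.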

Machine Translation

Research on Tibetan-Chinese Machine Translation Method with Iterative Back Translation Strategy

CIZHEN Jiacuo, SANGJIE Duanzhu, SUN Maosong, ZHOU Maoxian, SE Chajia

2020, 34(11): 67-73,83.

Tibetan-Chinese machine translation is one of the most important research topics in Tibetan natural language processing. Because the available Tibetan-Chinese parallel corpora are limited, this paper focuses on improving Tibetan-Chinese machine translation by addressing the low-resource problem. Based on the Transformer architecture, we apply an iterative back-translation strategy and an automatic translation filtering mechanism. In experiments with only 1.641M monolingual Tibetan sentences, we achieve improvements of 6.7 and 9.8 in BLEU score over the baseline model, respectively.

Information Extraction and Text Mining

Utilizing Glyph Feature and Iterative Learning for Named Entity Recognition in Finance Text

LIU Yuhan, LIU Changjian, XU Ruifeng, LUO Wangda, CHEN Yi, JI Zhongsheng, YING Nengtao

2020, 34(11): 74-83.

To address Chinese named entity recognition in the finance domain, this paper presents a novel neural network model that combines glyph features and iterative learning. Based on the framework of bidirectional long short-term memory networks and conditional random fields, the model encodes the Wubi input codes of Chinese characters for information enhancement and uses iterative learning to continuously update the prediction results. We manually annotate a large-scale financial named entity corpus named HITSZ-Finance, which includes 31,210 sentences and 4 types of entities. Experimental results on the HITSZ-Finance corpus demonstrate the effectiveness of the model.

Information Extraction and Text Mining

Text Classification Based on Hierarchical Model and Attention Mechanism

WU Gaobo, WANG Liming, CHAI Yumei, LIU Zhen

2020, 34(11): 84-95.

Text classification is a research focus with wide applications. First, this paper optimizes the NMF-SVM classification method to address the lack of hierarchical features in text classification, yielding a hierarchical classification model. Second, to capture the different influences of keywords and non-keywords on the classification result, we introduce the SEAN attention mechanism to obtain the attention between different words, in order to detect four types of entities: time, place, person and event. Finally, to handle differences in the strength of connections between sentences, an inter-sentence affinity model is proposed for news, novels, reading comprehension texts, and Weibo posts, which are rich in these four entity types. On the news dataset, the model achieves better classification results than the deep learning text classification model and the hybrid model with an attention mechanism.

Information Extraction and Text Mining

Global and Local Feature-Aware Network for Relation Extraction

SONG Wei, ZHU Fuxin

2020, 34(11): 96-103.

Relation extraction aims to identify the relationships between entities in large amounts of unstructured data. This paper proposes a relation extraction method based on a global and local feature-aware network. The method first adopts a self-attention mechanism and a recurrent neural network to obtain the correlated sequence features of each word. Then, a multi-branch feature-aware convolutional neural network is constructed to obtain global and local features without mutual interference. Finally, the two kinds of features are concatenated to fully represent the important semantic features of the sentence. Experimental results show that the proposed method performs better than state-of-the-art methods based on convolutional neural networks and recurrent neural networks, with F₁ scores of 86.1% and 64.9% on the standard SemEval-2010 Task 8 and KBP37 datasets, respectively.

Question Answering and Dialogue System

Online Commodity KBQA Based on Knowledge Graph

WANG Siyu, QIU Jiangtao, HONG Chuanyang, JIANG Ling

2020, 34(11): 104-112.

In general, a Question Answering System (QAS) for commodities is built mainly through intention identification and answer configuration. However, configuring answers to questions depends on manual labor, which easily results in poor answer quality. With the introduction and development of Knowledge Graph (KG) technology, KG-based QAS has gradually become a hot research topic. At present, KG-based QAS for commodities is implemented mainly by employing rules to transform questions into KG queries. Although this reduces manual configuration work, the performance of the QAS is limited by the quality and quantity of the rules. To solve these problems, this paper proposes a question answering method for online commodities based on KG and rule reasoning. The main contributions are: (1) we build an LSTM-based property attention network named SiameseATT (Siamese Attention Network) for attribute selection; (2) we use the KG to infer rules and thereby generate a large number of triples to answer more questions. Finally, experiments on the NLPCC-ICCPOL 2016 dataset show that the model obtains good performance. Our QAS is more suitable for e-commerce applications.
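
As a rough illustration of how a rule can generate additional triples from a knowledge graph, the sketch below applies one chain rule of the form "if (x, first, y) and (y, second, z) then (x, derived, z)". The rule shape, the function name, and the sample commodity triples are all hypothetical; the abstract does not describe the form of the rules the authors infer.

```python
def apply_chain_rule(triples, first, second, derived):
    """Return new triples (x, derived, z) wherever (x, first, y) and (y, second, z) hold."""
    triples = set(triples)
    # Map each subject of a `second` edge to the objects it points at.
    second_targets = {}
    for s, p, o in triples:
        if p == second:
            second_targets.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        if p == first:
            for z in second_targets.get(o, ()):
                inferred.add((s, derived, z))
    # Report only triples that were not already in the graph.
    return inferred - triples


# Hypothetical commodity knowledge graph.
kg = {
    ("phone_x", "brand", "brand_a"),
    ("brand_a", "headquartered_in", "city_b"),
    ("phone_y", "brand", "brand_a"),
}
new_triples = apply_chain_rule(kg, "brand", "headquartered_in", "brand_headquarters")
```

Each derived triple lets the system answer a question directly about the commodity (for example, where its brand is based) without a hand-written answer for that question.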

2020 Volume 34 Issue 11 Published: 09 December 2020