2019 Volume 33 Issue 11 Published: 11 November 2019
  

  • Survey
    LIN Qian, LIU Qing, SU Jinsong, LIN Huan, YANG Jing, LUO Bin
    2019, 33(11): 1-14.
    Machine translation is the process of attempting to convert text from one language to another using computers, and it has become a research issue of great importance in artificial intelligence. With the rapid growth of deep learning research and applications, neural machine translation has become the mainstream approach to machine translation. This paper first introduces the influence of neural machine translation in academia and industry over the past year, then reviews the research progress of neural machine translation, and finally outlines the outlook for its future development.
  • Survey
    ZHANG Chenxin, RAO Yuan, FAN Xiaobing, WANG Shuo
    2019, 33(11): 15-30.
    Event summarization technology based on social media plays an important role in the study of emergency detection, event trend analysis, public opinion analysis and many other areas. Based on a large body of recent research, this paper summarizes the key technologies in the core steps of event summarization, and puts forward four key technical problems and challenges in the process of event context mining and analysis: how to generate event summaries under multimodal information fusion; how to mine events and generate summaries under cross-media heterogeneous data collaboration; how to map the hierarchical and multi-granularity relations of complex events; and how to recognize events and generate summaries under real-time conditions. Meanwhile, this paper discusses the related theories, research progress and research trends, providing new research clues and directions for event summarization mining technology based on social media.
  • Language Analysis and Calculation
    RAO Gaoqi, LI Yuming
    2019, 33(11): 31-38.
    In the evolution of the Chinese language, the use of words is significantly affected by time, resulting in varied diachronic distributions of the lexicon. In this paper, we employ TF-IDF to hierarchically classify the lexicon of a 70-year corpus according to its diachronic distribution. Diachronic text classification, the distribution of part of speech and word length, corpus coverage, and the distribution of usage over time are analyzed, upon which we propose a diachronic hierarchy of the Chinese lexicon.
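    The abstract does not specify the implementation details; as a minimal sketch of the underlying idea, each time slice of the corpus (e.g., a decade) can be treated as one "document" and every word scored by TF-IDF over the slices, so that words concentrated in a few periods stand out. The decade slices and tokenization below are hypothetical.

```python
# Minimal sketch (assumption): score each word by TF-IDF across decade slices of the corpus;
# words with a high maximum score are tied to particular periods, low maxima suggest
# period-independent vocabulary. The slices here are made-up stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer

decade_docs = {                       # hypothetical pre-tokenized decade slices
    "1949-1958": "土改 合作社 互助组",
    "2009-2018": "互联网 微博 高铁",
}

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(decade_docs.values())
vocab = vectorizer.get_feature_names_out()

max_scores = tfidf.max(axis=0).toarray().ravel()
for word, score in sorted(zip(vocab, max_scores), key=lambda x: -x[1]):
    print(word, round(score, 3))
```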
  • Language Analysis and Calculation
    WANG Xingjin, ZHOU Lanjiang, ZHANG Jianan, ZHOU Feng
    2019, 33(11): 39-45.
    At present, research on Lao part-of-speech tagging is in its infancy, with limited tagged corpora available. In particular, Lao has absorbed a variety of foreign words, resulting in a large number of rare words. This paper studies the structural characteristics of Lao words and proposes a multi-task Lao part-of-speech tagging model that combines the part-of-speech tagging loss with a main-consonant auxiliary loss. To capture the rich affixes that provide part-of-speech clues in Lao, the model also uses character-level word vectors. In addition, an attention mechanism is employed to deal with the long sentence patterns of Lao. The experimental results show that the proposed method achieves an accuracy of 93.24%.
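    The exact architecture is not given in the abstract; the sketch below only illustrates the multi-task idea of combining a main tagging loss with an auxiliary main-consonant loss through a weighting coefficient. The layer sizes and the weight are assumptions, and the character-level vectors and attention mechanism from the paper are omitted.

```python
# Minimal multi-task loss sketch (assumption): a shared encoder feeds two heads, one
# predicting POS tags and one predicting the main consonant; the two cross-entropy
# losses are combined with a hypothetical trade-off weight lam.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, n_pos, n_consonant, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)          # main task head
        self.cons_head = nn.Linear(2 * hidden, n_consonant)   # auxiliary task head

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.pos_head(h), self.cons_head(h)

def multitask_loss(pos_logits, cons_logits, pos_gold, cons_gold, lam=0.3):
    ce = nn.CrossEntropyLoss()
    pos_loss = ce(pos_logits.flatten(0, 1), pos_gold.flatten())
    cons_loss = ce(cons_logits.flatten(0, 1), cons_gold.flatten())
    return pos_loss + lam * cons_loss  # lam is an assumed weight, not from the paper
```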
  • Language Analysis and Calculation
    WANG Rui, LI Bicheng, DU Wenqian
    2019, 33(11): 46-56.
    To exploit both the global and local features of an entity, an entity disambiguation method based on context word vectors and a topic model is proposed. Firstly, a context direction vector is added to the traditional word vector model to represent word order, and the model is used to train topic vectors based on the topic model. Secondly, the entity context similarity, the category topic similarity based on the entity topic, and the entity topic similarity based on the topic vector are calculated, respectively. Finally, the three similarities are merged, and the candidate entity with the highest merged similarity is taken as the target entity. The experimental results show that the new method is effective compared with state-of-the-art methods.
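    The merging scheme is not specified in the abstract; a minimal sketch, assuming a weighted sum of the three similarities with hypothetical weights, is shown below.

```python
# Minimal sketch (assumption): merge the three similarity scores with a weighted sum
# (weights are hypothetical) and pick the candidate entity with the highest score.
def disambiguate(candidates, w=(0.5, 0.3, 0.2)):
    """candidates: list of (entity, context_sim, category_topic_sim, topic_vector_sim)."""
    def merged(c):
        return w[0] * c[1] + w[1] * c[2] + w[2] * c[3]
    return max(candidates, key=merged)[0]

# Usage with made-up scores:
print(disambiguate([("Apple Inc.", 0.82, 0.65, 0.71), ("apple (fruit)", 0.40, 0.30, 0.25)]))
```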
  • Language Analysis and Calculation
    YU Jingsong, WEI Yi, ZHANG Yongwei
    2019, 33(11): 57-63.
    Ancient Chinese differs from modern Chinese in vocabulary and grammar. Since there are no explicit sentence boundaries in ancient Chinese texts, today's readers find them hard to understand, and segmenting ancient texts is difficult and requires expertise in a variety of fields. We investigate automatic text segmentation and punctuation based on recent deep learning techniques. By pre-training a BERT (Bidirectional Encoder Representations from Transformers) model on ancient Chinese texts ourselves, we obtain the current state-of-the-art results on both tasks via fine-tuning. Compared with traditional statistical methods and the current BiLSTM+CRF solution, our approach significantly outperforms them, achieving F1-scores of 89.97% and 91.67% on a small-scale single-category corpus and a large-scale multi-category corpus, respectively. In particular, our approach shows good generalization ability, achieving an F1-score of 88.76% on an entirely new Taoist corpus. On the punctuation task, our method reaches an F1-score of 70.40%, exceeding the baseline BiLSTM+CRF model by 12.15%.
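    The paper pre-trains its own ancient-Chinese BERT; as a minimal sketch of the fine-tuning setup, sentence segmentation can be cast as per-character token classification (label 1 marks a character after which a sentence boundary follows). The checkpoint name, example text and labels below are illustrative stand-ins, not the authors' data.

```python
# Minimal fine-tuning sketch (assumption): a generic Chinese BERT stands in for the
# authors' ancient-Chinese pre-trained model; segmentation is per-character labeling.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")   # placeholder checkpoint
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=2)

text = "天命之谓性率性之谓道修道之谓教"                     # unpunctuated ancient text
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]      # gold boundaries (illustrative)

enc = tokenizer(list(text), is_split_into_words=True, return_tensors="pt")
gold = torch.tensor([[-100] + labels + [-100]])              # ignore [CLS]/[SEP] positions

loss = model(**enc, labels=gold).loss                        # fine-tuning objective
loss.backward()
```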
  • Knowledge Representation and Acquisition
    FENG Xiaolan, ZHAO Xiaobing
    2019, 33(11): 64-72.
    Tourism is one of the main economic sources in the Tibetan region. However, there is no intelligent Tibetan tourism information service system on the Internet, and introductory texts on Tibetan attractions are also rare. In contrast, Chinese tourism websites contain a large amount of information covering many attractions. To facilitate access to attraction-related knowledge, this paper first uses a BLSTM neural network model to acquire 11 kinds of attribute knowledge about scenic spots in the Chinese tourism domain. Through a Chinese-Tibetan tourism dictionary, the acquired Chinese knowledge is then transferred to Tibetan, with a translation coverage of 70.44%. Finally, a Chinese-Tibetan bilingual tourism knowledge graph is constructed.
  • Knowledge Representation and Acquisition
    ZHU Yanli, YANG Xiaoping, WANG Liang, ZHANG Zhiyu
    2019, 33(11): 73-82.
    Knowledge graph embedding maps entities and relations into low-dimensional vector spaces. Existing embedding methods have two major drawbacks in modeling knowledge graphs with asymmetric characteristics. First, they do not consider the asymmetry between head and tail entities, assuming that head and tail entities in knowledge graphs come from the same semantic space. Second, they equip each relation with a set of unique projection matrices, ignoring the intrinsic correlations among relations, which hinders the sharing of knowledge between projection matrices and causes poor generalization ability. This paper proposes a novel embedding approach named TransRD to deal with these two issues. TransRD adopts different projection matrices for head and tail entities, respectively, and applies the ADADELTA algorithm to adjust the learning rate adaptively. It then uses the same pair of transfer matrices for similar relations to improve the performance of knowledge representation. Empirical results of link prediction on WN18 and FB15K (public knowledge graph datasets) and MPBC_20 (a subset of the Knowledge Graph of Breast Cancer) show that TransRD achieves remarkable improvements in various aspects compared with existing models.
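    The abstract does not spell out TransRD's scoring function; the rough sketch below only illustrates the idea of projecting head and tail entities with two different relation-specific matrices before a TransE-style translation score, and should not be read as the paper's exact formulation.

```python
# Rough sketch (assumption): head/tail-specific projections followed by a translation
# score; all vectors and matrices are random stand-ins, and the functional form is
# only illustrative of the idea described in the abstract.
import numpy as np

def score(h, r, t, M_head, M_tail):
    """Lower is better: || M_head·h + r - M_tail·t ||_2 (illustrative form only)."""
    return np.linalg.norm(M_head @ h + r - M_tail @ t)

dim = 4
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, dim))
M_head, M_tail = rng.normal(size=(2, dim, dim))   # could be shared across similar relations
print(round(score(h, r, t, M_head, M_tail), 3))
```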
  • Knowledge Representation and Acquisition
    ZHAO Yu, TAN Haining, LIU Zhifang, WU Chao
    2019, 33(11): 83-94.
    Due to the abundant structural and semantic information in heterogeneous information networks and their wide application, network representation learning for heterogeneous information networks has become a vital research issue. Current representation learning models for heterogeneous information networks can be divided into generative-model-based and discriminative-model-based methods. In this paper, we propose a representation learning model for heterogeneous information networks called HINGAN, which integrates a generative adversarial network into the representation learning process to improve the learned representations. The model first builds a weighted homogeneous information network under the guidance of meta-paths. Then, through the adversarial minimax game, it updates the parameters of the generator and discriminator constructed on the weighted network. Experimental results on AMiner and DBLP show that HINGAN outperforms current mainstream network representation methods on both multi-label classification and visualization. At the same time, HINGAN can be applied to scalable representation and efficient computation of heterogeneous network data.
  • Information Extraction and Text Mining
    YIN Zhangzhi, LI Xinzi, HUANG Degen, LI Jiuyi
    2019, 33(11): 95-100,106.
    Named entity recognition (NER) plays an important role in natural language processing. In order to obtain better results without manual features, this paper proposes an NER method based on an ensemble of BiLSTM models. Firstly, we train BiLSTM-CRF models on the data, obtaining the character-based model Char-NER and the word-based model Word-NER, respectively. Then the score vectors produced by the two models are merged as the input to an SVM model. Without hand-crafted features, this method achieves F-scores of 94.04%, 92.15% and 87.05% for person, location and organization names on the 1998 People's Daily corpus, and 91.73%, 93.20% and 83.15% on the MSRA corpus, respectively.
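    As a minimal sketch of the ensemble step, the per-token score vectors from the two BiLSTM-CRF models can be concatenated and fed to an SVM that makes the final decision. The score arrays and tag inventory below are random stand-ins, not the paper's data.

```python
# Minimal ensemble sketch (assumption): concatenate Char-NER and Word-NER score vectors
# and train an SVM on the merged features; all arrays here are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_tokens, n_tags = 200, 7
char_scores = rng.normal(size=(n_tokens, n_tags))   # scores from the character-based model
word_scores = rng.normal(size=(n_tokens, n_tags))   # scores from the word-based model
tags = rng.integers(0, n_tags, size=n_tokens)       # gold tags (illustrative)

features = np.hstack([char_scores, word_scores])    # merged score vectors
clf = SVC(kernel="rbf").fit(features[:150], tags[:150])
print("held-out accuracy:", clf.score(features[150:], tags[150:]))
```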
  • Information Extraction and Text Mining
    LIN Siqi, YU Zhengtao, GUO Junjun, GAO Shengxiang
    2019, 33(11): 101-106.
    This paper proposes a Chinese-Vietnamese bilingual news perspective sentence extraction method that incorporates multiple features. Firstly, to address the problem of unbalanced resources between Chinese and Vietnamese, the method constructs a Chinese-Vietnamese bilingual word embedding model, using rich Chinese annotation resources to make up for the lack of Vietnamese annotation resources. Then, the sentiment, topic and position features of sentences are integrated into the word vectors and the attention mechanism, respectively. Experiments show that this method can effectively improve the accuracy of Vietnamese news perspective sentence extraction.
  • Information Extraction and Text Mining
    YIN Hong, CHEN Yan, LI Ping
    2019, 33(11): 107-114.
    Key-phrase extraction aims to automatically identify important key phrases in documents. Most existing methods focus on the importance of words and the relations between words. Considering that key phrases should be closely related to the article's topics, we propose an improved method based on topic entropy. Our work first uses Latent Dirichlet Allocation to learn the topic distributions of documents and words, and combines them to obtain a word's topic distribution within a specific document. Then the words' topic entropy is computed to represent their importance. Finally, we run a random walk on the word co-occurrence graph to calculate the score of each candidate phrase. Experimental results show that the proposed method improves the F1 score by 2.61%-6.98% compared with existing methods.
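    A minimal sketch of the topic entropy computation is shown below, taking it as the Shannon entropy of a word's document-specific topic distribution; the distributions are made up, and reading lower entropy as higher topical importance is an interpretation of the abstract rather than the paper's stated weighting.

```python
# Minimal sketch (assumption): topic entropy of a word as the Shannon entropy of its
# topic distribution; a peaked distribution (low entropy) marks a topically focused word.
import numpy as np

def topic_entropy(p):
    """p: topic distribution of a word within a document, summing to 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(topic_entropy([0.9, 0.05, 0.05]))          # topically focused word -> low entropy
print(topic_entropy([0.25, 0.25, 0.25, 0.25]))   # topic-neutral word -> high entropy
```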
  • Information Extraction and Text Mining
    WEI Wancheng, HUANG Wenming, WANG Jing, DENG Zhenrong
    2019, 33(11): 115-124.
    This paper proposes a novel multi-task learning model for the automatic generation of classical poetry and couplets, using an encoder-decoder structure with an attention mechanism. The encoder consists of two BiLSTMs, one for the keyword input and the other for the classical poetry and couplet input. The decoder consists of two LSTMs, one for the poetry output and the other for the couplet output. In the multi-task learning model, the encoder parameters are shared while the decoder parameters are not, so the encoder learns the common features of classical poetry and couplets while each decoder learns their unique features. As a result, the generalization ability of the model is enhanced, and its performance is much better than that of single-task models. At the same time, this paper introduces keyword information into the model, so that the generated poetry and couplets are consistent with the user's intention. Finally, automatic and manual evaluations are used to verify the effectiveness of the method.
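    The sketch below only illustrates the parameter-sharing pattern described in the abstract, i.e. one shared encoder with two task-specific decoders; the attention mechanism and the separate keyword encoder are omitted, and all dimensions are assumptions.

```python
# Minimal multi-task seq2seq sketch (assumption): shared BiLSTM encoder, two LSTM decoders
# (poetry vs. couplet); a crude mean-pooled context replaces the paper's attention.
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, vocab, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)  # shared
        self.dec_poem = nn.LSTM(emb + 2 * hidden, hidden, batch_first=True)        # task 1
        self.dec_couplet = nn.LSTM(emb + 2 * hidden, hidden, batch_first=True)     # task 2
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt, task="poem"):
        enc, _ = self.encoder(self.embed(src))
        ctx = enc.mean(dim=1, keepdim=True).expand(-1, tgt.size(1), -1)   # pooled context
        dec_in = torch.cat([self.embed(tgt), ctx], dim=-1)
        decoder = self.dec_poem if task == "poem" else self.dec_couplet
        h, _ = decoder(dec_in)
        return self.out(h)

model = SharedEncoderMTL(vocab=5000)
logits = model(torch.randint(0, 5000, (2, 8)), torch.randint(0, 5000, (2, 7)), task="couplet")
print(logits.shape)  # (2, 7, 5000)
```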
  • Information Retrieval and Question Answering
    ZHAO Chang, LI Huiying
    2019, 33(11): 125-133.
    Entity linking for knowledge base question answering links the entity mention in a natural language question to the target entity in the knowledge base. This paper employs the candidate entity's types, relations and neighboring entities as the candidate entity representation, so as to alleviate the problem of insufficient descriptive information about entities in the knowledge base. At the same time, similar entity mentions obtained from the training corpus are treated as the mention's background knowledge. Finally, the proposed features are combined with an entity popularity feature to resolve entity ambiguity. The experimental results show that a linear combination of all the above features performs better than any single feature.
  • Information Retrieval and Question Answering
    TAN Hongye, WU Zepeng, LU Yu, DUAN Qinglong, LI Ru, ZHANG Hu
    2019, 33(11): 134-142.
    Automatic short answer grading (ASAG) is a key issue in intelligent tutoring systems. The main challenges in ASAG are that 1) the reference answer for a given question cannot cover the diverse student answers, and 2) the similarity between a student answer and the reference is hard to estimate. This paper applies clustering and maximum similarity to select representative answers, constructing a reference answer set that covers the variety of student answers. Then, a deep neural network model based on the attention mechanism is employed to estimate the similarity between a student answer and the reference answer set. Experimental results show that the proposed model effectively improves the accuracy of automatic scoring.
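    As a minimal sketch of the representative-answer selection step, student-answer embeddings can be clustered and the answer closest to each centroid kept as a representative for the expanded reference set. The embeddings and the number of clusters below are hypothetical, and the attention-based similarity model is not sketched here.

```python
# Minimal sketch (assumption): k-means over stand-in answer embeddings, then take the
# answer nearest each centroid as a representative for the reference answer set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)
answer_vecs = rng.normal(size=(300, 64))            # stand-in embeddings of student answers

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(answer_vecs)
rep_idx = pairwise_distances_argmin(kmeans.cluster_centers_, answer_vecs)

print("indices of representative answers:", rep_idx)   # one representative per cluster
```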