2019 Volume 33 Issue 5 Published: 17 May 2019
  

  • Survey
    HOU Shengluan, ZHANG Shuhan, FEI Chaoqun
    2019, 33(5): 1-16.
    Text summarization has become an essential way of acquiring knowledge from the mass of text documents on the Internet. Existing surveys of text summarization focus mostly on methods and do not review the experimental datasets. This survey concentrates on evaluation datasets, summarizing public and private datasets together with the corresponding approaches. For public datasets we record the data source, language and means of access; for private datasets we record the scale, access and annotation methods. In addition, the formal definition of text summarization implied by each public dataset is provided. We analyze the experimental results of classical and recent text summarization methods on one specific dataset, and conclude with the current state of existing datasets and methods and some open issues concerning them.
  • Language Analysis and Calculation
    ZHAO Haoxin, YU Jingsong, LIN Jie
    2019, 33(5): 17-23.
    Chinese characters have a two-dimensional structure that spreads both horizontally and vertically. Most studies on Chinese word embedding work at the character level without considering the stroke sequence. This paper proposes a novel Stroke2Vec model that generates character embeddings from stroke sequences. The model extends the CBOW architecture of Word2Vec, replacing the embedding matrix with a CNN and an attention module. Stroke2Vec aims to model the structural rules of strokes in Chinese characters and to produce better character embeddings from stroke sequences alone. Compared with Word2Vec and GloVe on a NER task, our model achieves an F1-score of 81.49%, outperforming Word2Vec by 1.21% and GloVe by roughly 0.21%; combining Stroke2Vec and Word2Vec embeddings further raises the F1-score to 81.55%.
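    As a rough illustration of the architecture described above, the following PyTorch sketch encodes a character from a sequence of stroke IDs with a 1-D convolution and attention pooling, and plugs the resulting character vectors into a CBOW-style prediction objective. All names, dimensions and the stroke inventory size are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class StrokeCharEncoder(nn.Module):
            """Encode a character from its stroke-ID sequence (CNN + attention pooling)."""
            def __init__(self, n_strokes=35, dim=128):
                super().__init__()
                self.stroke_emb = nn.Embedding(n_strokes, dim, padding_idx=0)
                self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
                self.attn = nn.Linear(dim, 1)

            def forward(self, strokes):                  # strokes: (batch, max_len)
                x = self.stroke_emb(strokes)             # (batch, len, dim)
                h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
                w = torch.softmax(self.attn(h).squeeze(-1), dim=-1)   # attention weights
                return (w.unsqueeze(-1) * h).sum(dim=1)  # (batch, dim) character vector

        class Stroke2VecCBOW(nn.Module):
            """CBOW-style objective: average the context-character vectors, predict the target."""
            def __init__(self, vocab_size, n_strokes=35, dim=128):
                super().__init__()
                self.encoder = StrokeCharEncoder(n_strokes, dim)
                self.out = nn.Linear(dim, vocab_size)

            def forward(self, context_strokes):          # (batch, window, max_len)
                b, w, l = context_strokes.shape
                ctx = self.encoder(context_strokes.reshape(b * w, l)).reshape(b, w, -1)
                return self.out(ctx.mean(dim=1))         # logits over target characters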
  • Language Analysis and Calculation
    CHEN Bo, SUN Le, HAN Xianpei
    2019, 33(5): 24-30.
    Current semantic parsers are mainly based on compositional semantics and depend heavily on a lexicon. A lexicon is a set of entries that map words or phrases in natural language sentences to predicates in the knowledge base ontology. To deal with the low coverage of the lexicon, this paper proposes a bridge-based lexicon learning method for semantic parsing built on existing work. The method introduces new entries during training and learns a new lexicon with high coverage. Furthermore, this paper designs a new word-predicate feature template and uses voting to obtain core entries for a more accurate lexicon. Experimental results on two benchmarks, WebQuestions and Free917, show that our method learns new entries that improve lexicon coverage and, in turn, parsing performance.
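    The voting step can be pictured with the small sketch below: candidate lexicons proposed in different training runs (or by different feature templates) vote for (phrase, predicate) pairs, and only pairs with enough votes are kept as core entries. The function name and the vote threshold are hypothetical, intended only to illustrate the idea.

        from collections import Counter

        def core_lexicon_by_voting(candidate_lexicons, min_votes=2):
            """Keep (phrase, predicate) pairs proposed by at least `min_votes` candidate lexicons."""
            votes = Counter()
            for lexicon in candidate_lexicons:          # each lexicon: iterable of (phrase, predicate) pairs
                votes.update(set(lexicon))              # one vote per lexicon, even if a pair repeats
            return {entry for entry, n in votes.items() if n >= min_votes}

        # Example: two of three candidate lexicons agree on ("born in", "PlaceOfBirth").
        lex1 = {("born in", "PlaceOfBirth"), ("wife of", "Spouse")}
        lex2 = {("born in", "PlaceOfBirth")}
        lex3 = {("lives in", "PlaceOfResidence")}
        print(core_lexicon_by_voting([lex1, lex2, lex3]))   # {('born in', 'PlaceOfBirth')}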
  • Language Analysis and Calculation
    LU Xin, LI Yang, WANG Suge
    2019, 33(5): 31-38.
    Irony is common in Weibo comments but has received little attention in the sentiment analysis community. To improve the accuracy of Weibo sentiment analysis, we study irony recognition in this paper. By analyzing the characteristics of the Chinese language and social networks, we summarize a set of irony-related linguistic features. Combining these linguistic features with a convolutional neural network (CNN), a novel method is proposed for recognizing irony: the irony feature representation and the word-embedding-based sentence representation are combined as the input of the convolutional network. Experimental results indicate that the proposed method is superior to classical machine learning methods for irony recognition.
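    A minimal PyTorch sketch of the combination described above: a CNN sentence representation built from word embeddings is concatenated with a handcrafted irony-feature vector before the classifier. The feature dimensionality, filter sizes and layer names are assumptions for illustration, not the authors' exact configuration.

        import torch
        import torch.nn as nn

        class IronyCNN(nn.Module):
            """Sentence CNN over word embeddings, concatenated with handcrafted irony features."""
            def __init__(self, vocab_size, emb_dim=100, n_filters=100, n_irony_feats=10):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
                self.convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.fc = nn.Linear(3 * n_filters + n_irony_feats, 2)  # irony / non-irony

            def forward(self, tokens, irony_feats):
                # tokens: (batch, seq_len); irony_feats: (batch, n_irony_feats)
                x = self.emb(tokens).transpose(1, 2)                    # (batch, emb, len)
                pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
                sent = torch.cat(pooled, dim=1)                         # sentence representation
                return self.fc(torch.cat([sent, irony_feats], dim=1))   # class logits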
  • Language Analysis and Calculation
    WANG Tishuang, LI Peifeng, ZHU Qiaoming
    2019, 33(5): 39-46.
    Discourse parsing is the basis of natural language understanding. As one of its important subtasks, nuclearity recognition in Chinese discourse is still an emerging issue. In this paper, we propose a method based on a gated memory network (GMN) to recognize nuclearity in Chinese discourse. The method first uses a Bi-LSTM and a CNN to capture both the long-distance and the local information of each discourse unit. The information of the two elementary discourse units is then merged to build a gated unit. Finally, the gated unit selects the relatively important feature representations from the elementary discourse units to identify the nucleus. Experimental results on the Chinese Discourse Treebank (CDTB) show that the proposed method improves both macro-F1 and micro-F1 compared with state-of-the-art systems.
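    The gating idea can be sketched as follows in PyTorch, assuming each elementary discourse unit is encoded once with a Bi-LSTM (long-distance information) and a CNN (local information), and a sigmoid gate decides how much of each unit's representation feeds the nuclearity classifier. Dimensions and the three-way label set are illustrative assumptions.

        import torch
        import torch.nn as nn

        class EDUEncoder(nn.Module):
            """Encode one elementary discourse unit with Bi-LSTM (long-distance) + CNN (local)."""
            def __init__(self, emb_dim=100, hidden=100):
                super().__init__()
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
                self.conv = nn.Conv1d(emb_dim, 2 * hidden, kernel_size=3, padding=1)

            def forward(self, x):                        # x: (batch, len, emb_dim)
                h, _ = self.lstm(x)
                remote = h.max(dim=1).values             # long-distance features
                local = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
                return remote + local                    # (batch, 2*hidden)

        class GatedNuclearity(nn.Module):
            """Gate decides how much of each EDU's representation feeds the classifier."""
            def __init__(self, dim=200, n_classes=3):    # e.g. NS / SN / multi-nuclear
                super().__init__()
                self.encoder = EDUEncoder()
                self.gate = nn.Linear(2 * dim, dim)
                self.fc = nn.Linear(dim, n_classes)

            def forward(self, edu1, edu2):
                r1, r2 = self.encoder(edu1), self.encoder(edu2)
                g = torch.sigmoid(self.gate(torch.cat([r1, r2], dim=1)))
                return self.fc(g * r1 + (1 - g) * r2)    # class logits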
  • Language Analysis and Calculation
    TIAN Wenhong, GAO Yinquan, HUANG Houwen, LI Zaiwan, ZHANG Zhaoyang
    2019, 33(5): 47-53.
    Implicit discourse relation recognition is an important issue in discourse relation recognition. Existing corpora of implicit discourse relations do not provide enough information for good results. To exploit the fact that implicit and explicit discourse sentences are related in semantics and other aspects, this paper adopts a multi-task learning method for the recognition task. A bidirectional long short-term memory (Bi-LSTM) network is applied to learn the relevant features of the sentences, and word vectors are merged with prior knowledge. Experimental results on HIT-CDTB show that the average F1 score of our method reaches 53% (about a 13% relative improvement) and the average recall reaches 51% (about a 9% relative improvement).
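    A simplified sketch of the multi-task setup, assuming a Bi-LSTM encoder shared between the implicit and explicit relation tasks with one classification head per task; the sizes, the number of relation classes and the pooling choice are illustrative assumptions.

        import torch
        import torch.nn as nn

        class SharedDiscourseEncoder(nn.Module):
            """Bi-LSTM encoder shared by the implicit and explicit relation tasks."""
            def __init__(self, emb_dim=100, hidden=128):
                super().__init__()
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

            def forward(self, arg1, arg2):               # each: (batch, len, emb_dim)
                h1, _ = self.lstm(arg1)
                h2, _ = self.lstm(arg2)
                return torch.cat([h1.max(dim=1).values, h2.max(dim=1).values], dim=1)

        class MultiTaskRelationModel(nn.Module):
            """Shared encoder, separate heads: explicit data helps train the shared parameters."""
            def __init__(self, n_relations=4, hidden=128):
                super().__init__()
                self.encoder = SharedDiscourseEncoder(hidden=hidden)
                self.implicit_head = nn.Linear(4 * hidden, n_relations)
                self.explicit_head = nn.Linear(4 * hidden, n_relations)

            def forward(self, arg1, arg2, task="implicit"):
                feats = self.encoder(arg1, arg2)
                head = self.implicit_head if task == "implicit" else self.explicit_head
                return head(feats)                       # relation logits for the chosen task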
  • Language Analysis and Calculation
    WANG Ruibo, WANG Yu, LI Jihong
    2019, 33(5): 54-65.
    When building models on text datasets, cross-validation is a commonly used method for feature selection and model comparison. Many studies have revealed that the estimated performance of models on text datasets is sensitive to the data partitioning used in cross-validation: unreasonable partitioning leads to less reliable performance estimates and to experimental results that other researchers cannot reproduce. This paper aims to improve the estimation and comparison of performance by constructing a regularized m×2 cross-validation method (abbreviated as m×2 BCV). The method performs m rounds of two-fold cross-validation partitioning and simultaneously constrains the divergence between the distributions of the training set and the validation set. Specifically, the chi-square statistic is employed to measure the divergence between the training-set and validation-set distributions, and this measurement is used to construct regularization conditions on the data partitioning. Furthermore, aiming to maximize the signal-to-noise ratio of the performance estimate, the data partitioning of m×2 BCV is built by retaining only the partitions that satisfy all the preset regularization conditions. In the experiments, models for the semantic role labeling task of Chinese FrameNet are used to compare different cross-validation methods, and all results validate the effectiveness of the proposed m×2 BCV method.
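    The regularization step can be illustrated with the sketch below: for each random two-fold split of a bag-of-words corpus, a chi-square statistic measures how far the two folds' word distributions diverge, and only splits under a preset threshold are kept until m partitions are collected. This simplified filter stands in for the paper's full construction (which also maximizes the signal-to-noise ratio of the estimate); function names and the threshold are assumptions.

        import numpy as np

        def chi_square_divergence(counts_a, counts_b):
            """Chi-square statistic between two word-frequency vectors (one per fold)."""
            total = counts_a + counts_b
            expected_a = total * counts_a.sum() / total.sum()
            mask = expected_a > 0
            return float(((counts_a - expected_a)[mask] ** 2 / expected_a[mask]).sum())

        def regularized_m_by_2_cv(doc_word_counts, m=3, threshold=50.0, seed=0, max_tries=10000):
            """Collect m two-fold partitions whose fold distributions are close (m x 2 BCV idea)."""
            rng = np.random.default_rng(seed)
            n = doc_word_counts.shape[0]
            partitions = []
            for _ in range(max_tries):
                if len(partitions) == m:
                    break
                perm = rng.permutation(n)
                fold1, fold2 = perm[: n // 2], perm[n // 2:]
                div = chi_square_divergence(doc_word_counts[fold1].sum(axis=0),
                                            doc_word_counts[fold2].sum(axis=0))
                if div <= threshold:                     # keep only well-balanced partitions
                    partitions.append((fold1, fold2))
            return partitions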
  • Knowledge Representation and Acquisition
    MENG Qingsong, ZHANG Xiang, HE Shizhu, LIU Kang, ZHAO Jun
    2019, 33(5): 66-74.
    Automatic generation of entity descriptions is beneficial to applications of knowledge graphs. Good descriptions are written in fluent language, which is an important indicator of text quality. This paper proposes to utilize multi-hop facts in a knowledge graph to generate entity descriptions that match the writing style of human editors and improve text fluency. Specifically, the paper adopts the encoder-decoder framework and proposes an end-to-end neural network model that encodes multi-hop facts and attends over them in the decoding phase. Experiments show that, compared with the baseline, the model trained with multi-hop facts obtains promising improvements of 8.9% in BLEU-2 and 7.3% in ROUGE-L, respectively.
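    The decoding-phase attention over facts might look like the following PyTorch sketch: at each step a GRU decoder attends over the encoded one-hop and multi-hop fact vectors, mixes them into a context vector, and predicts the next word of the description. Dimensions and module names are illustrative assumptions rather than the authors' architecture.

        import torch
        import torch.nn as nn

        class FactAttentionDecoder(nn.Module):
            """GRU decoder that attends over encoded (multi-hop) facts at every step."""
            def __init__(self, vocab_size, emb_dim=128, hidden=256, fact_dim=256):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                self.gru = nn.GRUCell(emb_dim + fact_dim, hidden)
                self.attn = nn.Linear(hidden + fact_dim, 1)
                self.out = nn.Linear(hidden, vocab_size)

            def forward(self, facts, prev_token, state):
                # facts: (batch, n_facts, fact_dim) encoded 1-hop and multi-hop triples
                # prev_token: (batch,)   state: (batch, hidden)
                n_facts = facts.size(1)
                query = state.unsqueeze(1).expand(-1, n_facts, -1)
                scores = self.attn(torch.cat([query, facts], dim=-1)).squeeze(-1)
                weights = torch.softmax(scores, dim=-1)              # attention over facts
                context = (weights.unsqueeze(-1) * facts).sum(dim=1)
                state = self.gru(torch.cat([self.emb(prev_token), context], dim=-1), state)
                return self.out(state), state                        # word logits, next state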
  • Machine Translation
    CAI Zilong, XIONG Deyi
    2019, 33(5): 75-81.
    Neural machine translation currently achieves the state-of-the-art results in practical applications. Introducing external linguistic knowledge, such as part-of-speech and dependency syntax tags, into a neural machine translation system has proved effective for improving translation quality. Unlike purely phonetic scripts, Chinese characters are semantic-phonetic compounds: they encode not only pronunciation but also semantic information. We propose a new method for incorporating glyph features into an end-to-end model, building on the work of Marta R. et al., and apply it to Chinese-English translation. Compared with the benchmark system, this method achieves a significant average gain of 1.1 points on the NIST evaluation sets, demonstrating that the glyph features of Chinese characters can effectively improve a neural machine translation model.
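    One plausible way to realize the idea, sketched below under the assumption that each character is rendered as a 32x32 glyph bitmap: a small CNN turns the bitmap into a glyph vector that is concatenated with the ordinary token embedding before the sequence enters the NMT encoder. This is an illustrative reading of "glyph features", not necessarily the feature extraction used in the paper.

        import torch
        import torch.nn as nn

        class GlyphAugmentedEmbedding(nn.Module):
            """Token embedding concatenated with a CNN feature of the character's glyph bitmap."""
            def __init__(self, vocab_size, emb_dim=256, glyph_dim=64):
                super().__init__()
                self.tok_emb = nn.Embedding(vocab_size, emb_dim)
                self.glyph_cnn = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),                                  # 32x32 -> 16x16
                    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(32, glyph_dim))

            def forward(self, tokens, glyph_images):
                # tokens: (batch, len); glyph_images: (batch, len, 1, 32, 32) rendered characters
                b, l = tokens.shape
                glyph = self.glyph_cnn(glyph_images.reshape(b * l, 1, 32, 32)).reshape(b, l, -1)
                return torch.cat([self.tok_emb(tokens), glyph], dim=-1)  # feeds the NMT encoder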
  • Information Extraction and Text Mining
    ZHANG Jingli, ZHOU Wenxuan, HONG Yu, YAO Jianmin, ZHOU Guodong, ZHU Qiaoming
    2019, 33(5): 82-92,131.
    Event detection is an important research issue in information extraction. Current event detection methods generally suffer from data sparseness, imbalanced class distribution and ambiguity. This paper proposes to construct a correspondence between event types and frames in FrameNet (FN), so as to obtain additional samples for training supervised detection models. FN contains rich examples of events annotated with frame-semantic tags, and many of its frames are highly similar to those in ACE: e.g., the lexical units and sets of frame elements inherently correspond to the event triggers and arguments in the ACE corpus, and many FN frames can represent certain types of events. Experimental results show that the proposed method performs well in both trigger identification and event type recognition.
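    The conversion of FN annotations into extra ACE-style training samples can be pictured with the sketch below; the frame-to-event-type mapping shown is hypothetical, and in practice the correspondences must be curated as the paper describes.

        # Hypothetical frame-to-event-type mapping; the real correspondences are built manually.
        FRAME_TO_ACE_EVENT = {
            "Attack": "Conflict.Attack",
            "Arrest": "Justice.Arrest-Jail",
            "Getting": "Transaction.Transfer-Ownership",
        }

        def framenet_to_ace_samples(fn_sentences):
            """Turn frame-annotated FN sentences into extra ACE-style training samples."""
            samples = []
            for sent in fn_sentences:
                # sent = {"tokens": [...], "frame": "Attack", "target_index": 4, "frame_elements": {...}}
                event_type = FRAME_TO_ACE_EVENT.get(sent["frame"])
                if event_type is None:
                    continue                                  # frame has no ACE counterpart
                samples.append({
                    "tokens": sent["tokens"],
                    "trigger_index": sent["target_index"],    # FN lexical unit -> ACE trigger
                    "event_type": event_type,
                    "arguments": sent["frame_elements"],      # frame elements -> ACE arguments
                })
            return samples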
  • Information Extraction and Text Mining
    CHE Lei, YANG Xiaoping, WANG Liang, LIANG Tianxin, HAN Zhenyuan
    2019, 33(5): 93-102,112.
    To better utilize text logical structure features and text organizational structure features in topic classification, this paper proposes a text-structure-oriented hybrid hierarchical attention network for the task. The logical structure typically includes information such as the title and the body, and the organizational structure includes character, word and sentence layers. The model integrates titles and bodies to strengthen the role of logical structure features in topic classification, and uses attention mechanisms at the char-sentence and word-sentence levels to strengthen the role of organizational structure features. Experimental results on four datasets show that the proposed model improves the accuracy of topic classification.
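    A compressed PyTorch sketch of the idea (dropping the full sentence-level hierarchy for brevity): the title and the body are each encoded from a char-level and a word-level view with attention pooling, and the four pooled vectors are concatenated for topic classification. Layer names and dimensions are assumptions, not the paper's exact network.

        import torch
        import torch.nn as nn

        class AttentivePool(nn.Module):
            """Attention pooling: weight each position and sum (used at char and word levels)."""
            def __init__(self, dim):
                super().__init__()
                self.score = nn.Linear(dim, 1)

            def forward(self, h):                         # h: (batch, len, dim)
                w = torch.softmax(self.score(h).squeeze(-1), dim=-1)
                return (w.unsqueeze(-1) * h).sum(dim=1)

        class HybridHAN(nn.Module):
            """Title and body encoded from char-level and word-level views, classified jointly."""
            def __init__(self, char_vocab, word_vocab, dim=128, n_classes=10):
                super().__init__()
                self.char_emb = nn.Embedding(char_vocab, dim, padding_idx=0)
                self.word_emb = nn.Embedding(word_vocab, dim, padding_idx=0)
                self.char_pool = AttentivePool(dim)
                self.word_pool = AttentivePool(dim)
                self.fc = nn.Linear(4 * dim, n_classes)   # title(char+word) + body(char+word)

            def encode(self, chars, words):
                return torch.cat([self.char_pool(self.char_emb(chars)),
                                  self.word_pool(self.word_emb(words))], dim=-1)

            def forward(self, title_chars, title_words, body_chars, body_words):
                title = self.encode(title_chars, title_words)
                body = self.encode(body_chars, body_words)
                return self.fc(torch.cat([title, body], dim=-1))  # topic logits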
  • Information Extraction and Text Mining
    WANG Xianfa, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi
    2019, 33(5): 103-112.
    Faced with large-scale heterogeneous web pages, web extraction methods based on visual features tend to have poor generality and low extraction efficiency. To address the generality issue, this paper proposes WEMLVF, a web page information extraction framework based on visual features and supervised machine learning, and validates its versatility through experiments on forum sites and news review sites. Then, to address the efficiency issue, two methods are proposed that use WEMLVF to automatically generate information extraction templates, one based on XPath and one based on SoftMealy (a wrapper induction algorithm). Both methods use visual features only when generating the templates, so that subsequent extraction no longer requires them; this makes full use of visual features while significantly improving extraction efficiency, as verified empirically.
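    The XPath half of the template idea can be illustrated with lxml: once the visual model has located an example node, its absolute XPath is stored as a template and replayed on new pages without any visual processing. The SoftMealy-based variant is omitted, and the function names are illustrative.

        from lxml import html

        def xpath_template_from_example(example_element):
            """Derive a reusable absolute XPath from an element the visual model identified once."""
            return example_element.getroottree().getpath(example_element)

        def extract_with_template(page_source, xpath_template):
            """Apply the stored template to new pages; no visual rendering is needed any more."""
            tree = html.fromstring(page_source)
            return [node.text_content().strip() for node in tree.xpath(xpath_template)]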
  • Question Answering, Dialogue System and Machine Reading Comprehension
    WU Bangyu, ZHOU Yue, ZHAO Qunfei, ZHANG Pengzhu
    2019, 33(5): 113-121.
    Conversation is an important research field in natural language processing with wide applications. However, when training a Chinese conversation model, we face the problem of excessively high model complexity due to the large vocabulary. To deal with this issue, this paper proposes to convert the Chinese input into Pinyin and split it into three parts, initials, finals and tones, thereby reducing the vocabulary size. The Pinyin information is then combined into an image-like form via embedding, and Pinyin features are extracted with a fully convolutional network (FCN) and a bidirectional long short-term memory (LSTM) network. Finally, a 4-layer gated recurrent unit (GRU) network decodes the Pinyin features, alleviating the long-term dependency problem, and produces the output of the conversation model. On this basis, an attention mechanism is added in the decoding stage so that the output corresponds better with the input. In the experiments, we build a conversation corpus in the medical domain and use BLEU and ROUGE_L as evaluation metrics to test our model on it.
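    The decomposition step can be illustrated with a few lines of plain Python, assuming tone-numbered Pinyin input (e.g. "zhong1"); the initial list and the function name are illustrative, and the real system additionally needs the image-style embedding and the FCN/LSTM/GRU stack described above.

        # Hypothetical decomposition of tone-numbered Pinyin syllables into
        # initial / final / tone, shrinking the output vocabulary of the dialogue model.
        INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

        def split_pinyin(syllable):
            """'zhong1' -> ('zh', 'ong', '1'); syllables without an initial keep it empty."""
            tone = syllable[-1] if syllable[-1].isdigit() else "0"
            base = syllable[:-1] if syllable[-1].isdigit() else syllable
            for init in INITIALS:                       # multi-letter initials are listed first
                if base.startswith(init):
                    return init, base[len(init):], tone
            return "", base, tone

        print(split_pinyin("zhong1"))   # ('zh', 'ong', '1')
        print(split_pinyin("an4"))      # ('', 'an', '4')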
  • Question Answering, Dialogue System and Machine Reading Comprehension
    YANG Zhiming, WANG Laiqi, WANG Yong
    2019, 33(5): 122-131.
    Human-machine conversation technology has received extensive attention from academia and industry in recent years. Classifying the intent of users' questions is a key issue that directly affects the quality of human-machine dialogue. In this paper, we propose an intent classification dual-channel convolutional neural network (ICDCNN): we first extract semantic features by using Word2vec and an embedding layer to train the word vectors; then, two separate channels are used for convolution, one over character-level vectors and the other over word-level vectors; the fine-grained character-level vectors are combined with the word-level vectors to mine deeper semantic information in natural language questions; finally, convolution kernels of different sizes are used to learn deeper abstract features within the questions. Experimental results show that the algorithm achieves high accuracy on a Chinese dataset and has certain advantages over other methods.
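    A minimal PyTorch sketch of the dual-channel structure: one channel convolves character-level vectors and the other word-level vectors, each with kernels of several sizes, and the pooled features are concatenated for intent classification. Vocabulary sizes, filter counts and the number of intents are illustrative assumptions.

        import torch
        import torch.nn as nn

        class ICDCNNSketch(nn.Module):
            """Dual-channel CNN: one channel over char-level vectors, the other over word-level vectors."""
            def __init__(self, char_vocab, word_vocab, emb_dim=100, n_filters=64, n_intents=10):
                super().__init__()
                self.char_emb = nn.Embedding(char_vocab, emb_dim, padding_idx=0)
                self.word_emb = nn.Embedding(word_vocab, emb_dim, padding_idx=0)
                self.char_convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.word_convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.fc = nn.Linear(6 * n_filters, n_intents)

            @staticmethod
            def channel(x, convs):
                x = x.transpose(1, 2)                    # (batch, emb_dim, len)
                return torch.cat([torch.relu(c(x)).max(dim=2).values for c in convs], dim=1)

            def forward(self, chars, words):             # (batch, char_len), (batch, word_len)
                char_feats = self.channel(self.char_emb(chars), self.char_convs)
                word_feats = self.channel(self.word_emb(words), self.word_convs)
                return self.fc(torch.cat([char_feats, word_feats], dim=1))  # intent logits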
  • Sentiment Analysis and Social Computing
    LIN Huaiyi, LIU Zhen, CHAI Yumei, LIU Tingting, CHAI Yanjie
    2019, 33(5): 132-142.
    The main methods for dealing with class imbalance in deep learning focus on the cost function and on sampling techniques. Based on word vector transfer, this paper proposes a pre-training task selection method that initializes the target model with pre-trained word vectors that facilitate the differentiation of minority classes. Combined with balanced oversampling, the sample information is used to maintain the model's accuracy on majority classes, so that the text features extracted by the model are balanced. Compared with plain oversampling, experimental results show that the proposed method achieves a better balancing effect in most text emotion classification settings without serious over-fitting; when serious over-fitting occurs, the method still yields a significant balancing effect in the three-class task. Experiments also verify that the pre-training method can be combined with cost-sensitive methods to further improve the balance performance.
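    The balanced oversampling component, in isolation, can be sketched in a few lines of Python: minority-class samples are duplicated at random until every class reaches the size of the largest class. This is only the sampling half of the method; the pre-trained-word-vector selection is not shown, and the function name is illustrative.

        import random
        from collections import defaultdict

        def balanced_oversample(samples, labels, seed=0):
            """Duplicate minority-class samples until every class matches the largest class."""
            rng = random.Random(seed)
            by_label = defaultdict(list)
            for s, y in zip(samples, labels):
                by_label[y].append(s)
            target = max(len(group) for group in by_label.values())
            out_samples, out_labels = [], []
            for y, group in by_label.items():
                picked = group + [rng.choice(group) for _ in range(target - len(group))]
                out_samples.extend(picked)
                out_labels.extend([y] * target)
            return out_samples, out_labels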