2019 Volume 33 Issue 5 Published: 17 May 2019
  

  • Survey
    HOU Shengluan, ZHANG Shuhan, FEI Chaoqun
    2019, 33(5): 1-16.
    Text summarization has become an essential way of acquiring knowledge from the mass of text documents on the Internet. Existing surveys of text summarization focus mostly on methods and do not review the experimental datasets. This survey concentrates on evaluation datasets, summarizing public and private datasets together with the corresponding approaches. For public datasets we record the data source, language and means of access; for private datasets we record the scale, access and annotation methods. In addition, the formal definition of text summarization implied by each public dataset is provided. We analyze the experimental results of classical and recent text summarization methods on one specific dataset, and conclude with the current state of existing datasets and methods and some open issues concerning them.
  • Language Analysis and Calculation
    ZHAO Haoxin, YU Jingsong, LIN Jie
    2019, 33(5): 17-23.
    Chinese characters have a two-dimensional structure that spreads both horizontally and vertically. Most studies on Chinese word embedding work at the character level without considering the stroke sequence. This paper proposes a novel Stroke2Vec model that generates character embeddings from stroke sequences. The model extends the CBOW architecture of Word2Vec, replacing the embedding matrix with a CNN and an attention module. Stroke2Vec aims to model the structural rules of strokes in Chinese characters and to produce better character embeddings from stroke sequences alone. Compared with Word2Vec and GloVe on a NER task, our model achieves an F1-score of 81.49%, outperforming Word2Vec by 1.21% and GloVe by roughly 0.21%; combining Stroke2Vec and Word2Vec embeddings further raises the F1-score to 81.55%.
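    As a rough illustration of the architecture described above, the following PyTorch sketch encodes a character from a sequence of stroke IDs with a 1-D convolution and attention pooling, and plugs the resulting character vectors into a CBOW-style prediction objective. All names, dimensions and the stroke inventory size are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class StrokeCharEncoder(nn.Module):
            """Encode a character from its stroke-ID sequence (CNN + attention pooling)."""
            def __init__(self, n_strokes=35, dim=128):
                super().__init__()
                self.stroke_emb = nn.Embedding(n_strokes, dim, padding_idx=0)
                self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
                self.attn = nn.Linear(dim, 1)

            def forward(self, strokes):                  # strokes: (batch, max_len)
                x = self.stroke_emb(strokes)             # (batch, len, dim)
                h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
                w = torch.softmax(self.attn(h).squeeze(-1), dim=-1)   # attention weights
                return (w.unsqueeze(-1) * h).sum(dim=1)  # (batch, dim) character vector

        class Stroke2VecCBOW(nn.Module):
            """CBOW-style objective: average the context-character vectors, predict the target."""
            def __init__(self, vocab_size, n_strokes=35, dim=128):
                super().__init__()
                self.encoder = StrokeCharEncoder(n_strokes, dim)
                self.out = nn.Linear(dim, vocab_size)

            def forward(self, context_strokes):          # (batch, window, max_len)
                b, w, l = context_strokes.shape
                ctx = self.encoder(context_strokes.reshape(b * w, l)).reshape(b, w, -1)
                return self.out(ctx.mean(dim=1))         # logits over target characters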
  • Language Analysis and Calculation
    CHEN Bo, SUN Le, HAN Xianpei
    2019, 33(5): 24-30.
    Current semantic parsers are mainly based on compositional semantics and depend heavily on a lexicon. A lexicon is a set of entries that map words or phrases in natural language sentences to predicates in the knowledge base ontology. To deal with the low coverage of the lexicon, this paper proposes a bridge-based lexicon learning method for semantic parsing built on existing work. The method introduces new entries during training and learns a new lexicon with high coverage. Furthermore, this paper designs a new word-predicate feature template and uses voting to obtain core entries for a more accurate lexicon. Experimental results on two benchmarks, WebQuestions and Free917, show that our method learns new entries that improve lexicon coverage and, in turn, parsing performance.
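    The voting step can be pictured with the small sketch below: candidate lexicons proposed in different training runs (or by different feature templates) vote for (phrase, predicate) pairs, and only pairs with enough votes are kept as core entries. The function name and the vote threshold are hypothetical, intended only to illustrate the idea.

        from collections import Counter

        def core_lexicon_by_voting(candidate_lexicons, min_votes=2):
            """Keep (phrase, predicate) pairs proposed by at least `min_votes` candidate lexicons."""
            votes = Counter()
            for lexicon in candidate_lexicons:          # each lexicon: iterable of (phrase, predicate) pairs
                votes.update(set(lexicon))              # one vote per lexicon, even if a pair repeats
            return {entry for entry, n in votes.items() if n >= min_votes}

        # Example: two of three candidate lexicons agree on ("born in", "PlaceOfBirth").
        lex1 = {("born in", "PlaceOfBirth"), ("wife of", "Spouse")}
        lex2 = {("born in", "PlaceOfBirth")}
        lex3 = {("lives in", "PlaceOfResidence")}
        print(core_lexicon_by_voting([lex1, lex2, lex3]))   # {('born in', 'PlaceOfBirth')}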
  • Language Analysis and Calculation
    LU Xin, LI Yang, WANG Suge
    2019, 33(5): 31-38.
    Irony is common in Weibo comments but has received little attention in the sentiment analysis community. To improve the accuracy of Weibo sentiment analysis, we study irony recognition in this paper. By analyzing the characteristics of the Chinese language and social networks, we summarize a set of irony-related linguistic features. Combining these linguistic features with a convolutional neural network (CNN), a novel method is proposed for recognizing irony: the irony feature representation and the word-embedding-based sentence representation are combined as the input of the convolutional network. Experimental results indicate that the proposed method is superior to classical machine learning methods for irony recognition.
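    A minimal PyTorch sketch of the combination described above: a CNN sentence representation built from word embeddings is concatenated with a handcrafted irony-feature vector before the classifier. The feature dimensionality, filter sizes and layer names are assumptions for illustration, not the authors' exact configuration.

        import torch
        import torch.nn as nn

        class IronyCNN(nn.Module):
            """Sentence CNN over word embeddings, concatenated with handcrafted irony features."""
            def __init__(self, vocab_size, emb_dim=100, n_filters=100, n_irony_feats=10):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
                self.convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.fc = nn.Linear(3 * n_filters + n_irony_feats, 2)  # irony / non-irony

            def forward(self, tokens, irony_feats):
                # tokens: (batch, seq_len); irony_feats: (batch, n_irony_feats)
                x = self.emb(tokens).transpose(1, 2)                    # (batch, emb, len)
                pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
                sent = torch.cat(pooled, dim=1)                         # sentence representation
                return self.fc(torch.cat([sent, irony_feats], dim=1))   # class logits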
  • Language Analysis and Calculation
    WANG Tishuang, LI Peifeng, ZHU Qiaoming
    2019, 33(5): 39-46.
    Discourse parsing is the basis of natural language understanding. As one of its important subtasks, nuclearity recognition in Chinese discourse is still an emerging issue. In this paper, we propose a method based on a gated memory network (GMN) to recognize nuclearity in Chinese discourse. The method first uses a Bi-LSTM and a CNN to capture both the long-distance and the local information of each discourse unit. The information of the two elementary discourse units is then merged to build a gated unit. Finally, the gated unit selects the relatively important feature representations from the elementary discourse units to identify the nucleus. Experimental results on the Chinese Discourse Treebank (CDTB) show that the proposed method improves both macro-F1 and micro-F1 compared with state-of-the-art systems.
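    The gating idea can be sketched as follows in PyTorch, assuming each elementary discourse unit is encoded once with a Bi-LSTM (long-distance information) and a CNN (local information), and a sigmoid gate decides how much of each unit's representation feeds the nuclearity classifier. Dimensions and the three-way label set are illustrative assumptions.

        import torch
        import torch.nn as nn

        class EDUEncoder(nn.Module):
            """Encode one elementary discourse unit with Bi-LSTM (long-distance) + CNN (local)."""
            def __init__(self, emb_dim=100, hidden=100):
                super().__init__()
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
                self.conv = nn.Conv1d(emb_dim, 2 * hidden, kernel_size=3, padding=1)

            def forward(self, x):                        # x: (batch, len, emb_dim)
                h, _ = self.lstm(x)
                remote = h.max(dim=1).values             # long-distance features
                local = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
                return remote + local                    # (batch, 2*hidden)

        class GatedNuclearity(nn.Module):
            """Gate decides how much of each EDU's representation feeds the classifier."""
            def __init__(self, dim=200, n_classes=3):    # e.g. NS / SN / multi-nuclear
                super().__init__()
                self.encoder = EDUEncoder()
                self.gate = nn.Linear(2 * dim, dim)
                self.fc = nn.Linear(dim, n_classes)

            def forward(self, edu1, edu2):
                r1, r2 = self.encoder(edu1), self.encoder(edu2)
                g = torch.sigmoid(self.gate(torch.cat([r1, r2], dim=1)))
                return self.fc(g * r1 + (1 - g) * r2)    # class logits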
  • Language Analysis and Calculation
    TIAN Wenhong, GAO Yinquan, HUANG Houwen, LI Zaiwan, ZHANG Zhaoyang
    2019, 33(5): 47-53.
    Implicit discourse relation recognition is an important issue in discourse relation recognition. Existing corpora of implicit discourse relations do not provide enough information for good results. To exploit the fact that implicit and explicit discourse sentences are related in semantics and other aspects, this paper adopts a multi-task learning method for the recognition task. A bidirectional long short-term memory (Bi-LSTM) network is applied to learn the relevant features of the sentences, and word vectors are merged with prior knowledge. Experimental results on HIT-CDTB show that the average F1 score of our method reaches 53% (about a 13% relative improvement) and the average recall reaches 51% (about a 9% relative improvement).
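    A simplified sketch of the multi-task setup, assuming a Bi-LSTM encoder shared between the implicit and explicit relation tasks with one classification head per task; the sizes, the number of relation classes and the pooling choice are illustrative assumptions.

        import torch
        import torch.nn as nn

        class SharedDiscourseEncoder(nn.Module):
            """Bi-LSTM encoder shared by the implicit and explicit relation tasks."""
            def __init__(self, emb_dim=100, hidden=128):
                super().__init__()
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

            def forward(self, arg1, arg2):               # each: (batch, len, emb_dim)
                h1, _ = self.lstm(arg1)
                h2, _ = self.lstm(arg2)
                return torch.cat([h1.max(dim=1).values, h2.max(dim=1).values], dim=1)

        class MultiTaskRelationModel(nn.Module):
            """Shared encoder, separate heads: explicit data helps train the shared parameters."""
            def __init__(self, n_relations=4, hidden=128):
                super().__init__()
                self.encoder = SharedDiscourseEncoder(hidden=hidden)
                self.implicit_head = nn.Linear(4 * hidden, n_relations)
                self.explicit_head = nn.Linear(4 * hidden, n_relations)

            def forward(self, arg1, arg2, task="implicit"):
                feats = self.encoder(arg1, arg2)
                head = self.implicit_head if task == "implicit" else self.explicit_head
                return head(feats)                       # relation logits for the chosen task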
  • Language Analysis and Calculation
    WANG Ruibo, WANG Yu, LI Jihong
    2019, 33(5): 54-65.
    When building models on text datasets, cross-validation is a commonly used method for feature selection and model comparison. Many studies have revealed that the estimated performance of models on text datasets is sensitive to the data partitioning used in cross-validation: unreasonable partitioning leads to less reliable performance estimates and to experimental results that other researchers cannot reproduce. This paper aims to improve the estimation and comparison of performance by constructing a regularized m×2 cross-validation method (abbreviated as m×2 BCV). The method performs m rounds of two-fold cross-validation partitioning and simultaneously constrains the divergence between the distributions of the training set and the validation set. Specifically, the chi-square statistic is employed to measure the divergence between the training-set and validation-set distributions, and this measurement is used to construct regularization conditions on the data partitioning. Furthermore, aiming to maximize the signal-to-noise ratio of the performance estimate, the data partitioning of m×2 BCV is built by retaining only the partitions that satisfy all the preset regularization conditions. In the experiments, models for the semantic role labeling task of Chinese FrameNet are used to compare different cross-validation methods, and all results validate the effectiveness of the proposed m×2 BCV method.
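    The regularization step can be illustrated with the sketch below: for each random two-fold split of a bag-of-words corpus, a chi-square statistic measures how far the two folds' word distributions diverge, and only splits under a preset threshold are kept until m partitions are collected. This simplified filter stands in for the paper's full construction (which also maximizes the signal-to-noise ratio of the estimate); function names and the threshold are assumptions.

        import numpy as np

        def chi_square_divergence(counts_a, counts_b):
            """Chi-square statistic between two word-frequency vectors (one per fold)."""
            total = counts_a + counts_b
            expected_a = total * counts_a.sum() / total.sum()
            mask = expected_a > 0
            return float(((counts_a - expected_a)[mask] ** 2 / expected_a[mask]).sum())

        def regularized_m_by_2_cv(doc_word_counts, m=3, threshold=50.0, seed=0, max_tries=10000):
            """Collect m two-fold partitions whose fold distributions are close (m x 2 BCV idea)."""
            rng = np.random.default_rng(seed)
            n = doc_word_counts.shape[0]
            partitions = []
            for _ in range(max_tries):
                if len(partitions) == m:
                    break
                perm = rng.permutation(n)
                fold1, fold2 = perm[: n // 2], perm[n // 2:]
                div = chi_square_divergence(doc_word_counts[fold1].sum(axis=0),
                                            doc_word_counts[fold2].sum(axis=0))
                if div <= threshold:                     # keep only well-balanced partitions
                    partitions.append((fold1, fold2))
            return partitions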
  • Knowledge Representation and Acquisition
    MENG Qingsong, ZHANG Xiang, HE Shizhu, LIU Kang, ZHAO Jun
    2019, 33(5): 66-74.
    Automatic generation of entity descriptions is beneficial to applications of knowledge graphs. Good descriptions are written in fluent language, which is an important indicator of text quality. This paper proposes to utilize multi-hop facts in a knowledge graph to generate entity descriptions that match the writing style of human editors and improve text fluency. Specifically, the paper adopts the encoder-decoder framework and proposes an end-to-end neural network model that encodes multi-hop facts and attends over them in the decoding phase. Experiments show that, compared with the baseline, the model trained with multi-hop facts obtains promising improvements of 8.9% in BLEU-2 and 7.3% in ROUGE-L, respectively.
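    The decoding-phase attention over facts might look like the following PyTorch sketch: at each step a GRU decoder attends over the encoded one-hop and multi-hop fact vectors, mixes them into a context vector, and predicts the next word of the description. Dimensions and module names are illustrative assumptions rather than the authors' architecture.

        import torch
        import torch.nn as nn

        class FactAttentionDecoder(nn.Module):
            """GRU decoder that attends over encoded (multi-hop) facts at every step."""
            def __init__(self, vocab_size, emb_dim=128, hidden=256, fact_dim=256):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                self.gru = nn.GRUCell(emb_dim + fact_dim, hidden)
                self.attn = nn.Linear(hidden + fact_dim, 1)
                self.out = nn.Linear(hidden, vocab_size)

            def forward(self, facts, prev_token, state):
                # facts: (batch, n_facts, fact_dim) encoded 1-hop and multi-hop triples
                # prev_token: (batch,)   state: (batch, hidden)
                n_facts = facts.size(1)
                query = state.unsqueeze(1).expand(-1, n_facts, -1)
                scores = self.attn(torch.cat([query, facts], dim=-1)).squeeze(-1)
                weights = torch.softmax(scores, dim=-1)              # attention over facts
                context = (weights.unsqueeze(-1) * facts).sum(dim=1)
                state = self.gru(torch.cat([self.emb(prev_token), context], dim=-1), state)
                return self.out(state), state                        # word logits, next state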
  • Machine Translation
    CAI Zilong, XIONG Deyi
    2019, 33(5): 75-81.
    Neural machine translation currently achieves the state-of-the-art results in practical applications. Introducing external linguistic knowledge, such as part-of-speech and dependency syntax tags, into a neural machine translation system has proved effective for improving translation quality. Unlike purely phonetic scripts, Chinese characters are semantic-phonetic compounds: they encode not only pronunciation but also semantic information. We propose a new method for incorporating glyph features into an end-to-end model, building on the work of Marta R. et al., and apply it to Chinese-English translation. Compared with the benchmark system, this method achieves a significant average gain of 1.1 points on the NIST evaluation sets, demonstrating that the glyph features of Chinese characters can effectively improve a neural machine translation model.
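    One plausible way to realize the idea, sketched below under the assumption that each character is rendered as a 32x32 glyph bitmap: a small CNN turns the bitmap into a glyph vector that is concatenated with the ordinary token embedding before the sequence enters the NMT encoder. This is an illustrative reading of "glyph features", not necessarily the feature extraction used in the paper.

        import torch
        import torch.nn as nn

        class GlyphAugmentedEmbedding(nn.Module):
            """Token embedding concatenated with a CNN feature of the character's glyph bitmap."""
            def __init__(self, vocab_size, emb_dim=256, glyph_dim=64):
                super().__init__()
                self.tok_emb = nn.Embedding(vocab_size, emb_dim)
                self.glyph_cnn = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),                                  # 32x32 -> 16x16
                    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(32, glyph_dim))

            def forward(self, tokens, glyph_images):
                # tokens: (batch, len); glyph_images: (batch, len, 1, 32, 32) rendered characters
                b, l = tokens.shape
                glyph = self.glyph_cnn(glyph_images.reshape(b * l, 1, 32, 32)).reshape(b, l, -1)
                return torch.cat([self.tok_emb(tokens), glyph], dim=-1)  # feeds the NMT encoder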
  • Information Extraction and Text Mining
    ZHANG Jingli, ZHOU Wenxuan, HONG Yu, YAO Jianmin, ZHOU Guodong, ZHU Qiaoming
    2019, 33(5): 82-92,131.
    Event detection is an important research issue in information extraction. Current event detection methods generally suffer from data sparseness, imbalanced class distribution and ambiguity. This paper proposes to construct a correspondence between event types and frames in FrameNet (FN), so as to obtain additional samples for training supervised detection models. FN contains rich examples of events annotated with frame-semantic tags, and many of its frames are highly similar to those in ACE: e.g., the lexical units and sets of frame elements inherently correspond to the event triggers and arguments in the ACE corpus, and many FN frames can represent certain types of events. Experimental results show that the proposed method performs well in both trigger identification and event type recognition.
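    The conversion of FN annotations into extra ACE-style training samples can be pictured with the sketch below; the frame-to-event-type mapping shown is hypothetical, and in practice the correspondences must be curated as the paper describes.

        # Hypothetical frame-to-event-type mapping; the real correspondences are built manually.
        FRAME_TO_ACE_EVENT = {
            "Attack": "Conflict.Attack",
            "Arrest": "Justice.Arrest-Jail",
            "Getting": "Transaction.Transfer-Ownership",
        }

        def framenet_to_ace_samples(fn_sentences):
            """Turn frame-annotated FN sentences into extra ACE-style training samples."""
            samples = []
            for sent in fn_sentences:
                # sent = {"tokens": [...], "frame": "Attack", "target_index": 4, "frame_elements": {...}}
                event_type = FRAME_TO_ACE_EVENT.get(sent["frame"])
                if event_type is None:
                    continue                                  # frame has no ACE counterpart
                samples.append({
                    "tokens": sent["tokens"],
                    "trigger_index": sent["target_index"],    # FN lexical unit -> ACE trigger
                    "event_type": event_type,
                    "arguments": sent["frame_elements"],      # frame elements -> ACE arguments
                })
            return samples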
  • Information Extraction and Text Mining
    CHE Lei, YANG Xiaoping, WANG Liang, LIANG Tianxin, HAN Zhenyuan
    2019, 33(5): 93-102,112.
    To better utilize text logical structure features and text organizational structure features in topic classification, this paper proposes a text-structure-oriented hybrid hierarchical attention network for the task. The logical structure typically includes information such as the title and the body, and the organizational structure includes character, word and sentence layers. The model integrates titles and bodies to strengthen the role of logical structure features in topic classification, and uses attention mechanisms at the char-sentence and word-sentence levels to strengthen the role of organizational structure features. Experimental results on four datasets show that the proposed model improves the accuracy of topic classification.
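    A compressed PyTorch sketch of the idea (dropping the full sentence-level hierarchy for brevity): the title and the body are each encoded from a char-level and a word-level view with attention pooling, and the four pooled vectors are concatenated for topic classification. Layer names and dimensions are assumptions, not the paper's exact network.

        import torch
        import torch.nn as nn

        class AttentivePool(nn.Module):
            """Attention pooling: weight each position and sum (used at char and word levels)."""
            def __init__(self, dim):
                super().__init__()
                self.score = nn.Linear(dim, 1)

            def forward(self, h):                         # h: (batch, len, dim)
                w = torch.softmax(self.score(h).squeeze(-1), dim=-1)
                return (w.unsqueeze(-1) * h).sum(dim=1)

        class HybridHAN(nn.Module):
            """Title and body encoded from char-level and word-level views, classified jointly."""
            def __init__(self, char_vocab, word_vocab, dim=128, n_classes=10):
                super().__init__()
                self.char_emb = nn.Embedding(char_vocab, dim, padding_idx=0)
                self.word_emb = nn.Embedding(word_vocab, dim, padding_idx=0)
                self.char_pool = AttentivePool(dim)
                self.word_pool = AttentivePool(dim)
                self.fc = nn.Linear(4 * dim, n_classes)   # title(char+word) + body(char+word)

            def encode(self, chars, words):
                return torch.cat([self.char_pool(self.char_emb(chars)),
                                  self.word_pool(self.word_emb(words))], dim=-1)

            def forward(self, title_chars, title_words, body_chars, body_words):
                title = self.encode(title_chars, title_words)
                body = self.encode(body_chars, body_words)
                return self.fc(torch.cat([title, body], dim=-1))  # topic logits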
  • Information Extraction and Text Mining
    WANG Xianfa, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi
    2019, 33(5): 103-112.
    Faced with large-scale heterogeneous web pages, web extraction methods based on visual features tend to have poor generality and low extraction efficiency. To address the generality issue, this paper proposes WEMLVF, a web page information extraction framework based on visual features and supervised machine learning, and validates its versatility through experiments on forum sites and news review sites. Then, to address the efficiency issue, two methods are proposed that use WEMLVF to automatically generate information extraction templates, one based on XPath and one based on SoftMealy (a wrapper induction algorithm). Both methods use visual features only when generating the templates, so that subsequent extraction no longer requires them; this makes full use of visual features while significantly improving extraction efficiency, as verified empirically.
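    The XPath half of the template idea can be illustrated with lxml: once the visual model has located an example node, its absolute XPath is stored as a template and replayed on new pages without any visual processing. The SoftMealy-based variant is omitted, and the function names are illustrative.

        from lxml import html

        def xpath_template_from_example(example_element):
            """Derive a reusable absolute XPath from an element the visual model identified once."""
            return example_element.getroottree().getpath(example_element)

        def extract_with_template(page_source, xpath_template):
            """Apply the stored template to new pages; no visual rendering is needed any more."""
            tree = html.fromstring(page_source)
            return [node.text_content().strip() for node in tree.xpath(xpath_template)]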
  • Question Answering, Dialogue System and Machine Reading Comprehension
    WU Bangyu, ZHOU Yue, ZHAO Qunfei, ZHANG Pengzhu
    2019, 33(5): 113-121.
    Conversation is an important research field in natural language processing with wide applications. However, when training a Chinese conversation model, we face the problem of excessively high model complexity due to the large vocabulary. To deal with this issue, this paper proposes to convert the Chinese input into Pinyin and split it into three parts, initials, finals and tones, thereby reducing the vocabulary size. The Pinyin information is then combined into an image-like form via embedding, and Pinyin features are extracted with a fully convolutional network (FCN) and a bidirectional long short-term memory (LSTM) network. Finally, a 4-layer gated recurrent unit (GRU) network decodes the Pinyin features, alleviating the long-term dependency problem, and produces the output of the conversation model. On this basis, an attention mechanism is added in the decoding stage so that the output corresponds better with the input. In the experiments, we build a conversation corpus in the medical domain and use BLEU and ROUGE_L as evaluation metrics to test our model on it.
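    The decomposition step can be illustrated with a few lines of plain Python, assuming tone-numbered Pinyin input (e.g. "zhong1"); the initial list and the function name are illustrative, and the real system additionally needs the image-style embedding and the FCN/LSTM/GRU stack described above.

        # Hypothetical decomposition of tone-numbered Pinyin syllables into
        # initial / final / tone, shrinking the output vocabulary of the dialogue model.
        INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

        def split_pinyin(syllable):
            """'zhong1' -> ('zh', 'ong', '1'); syllables without an initial keep it empty."""
            tone = syllable[-1] if syllable[-1].isdigit() else "0"
            base = syllable[:-1] if syllable[-1].isdigit() else syllable
            for init in INITIALS:                       # multi-letter initials are listed first
                if base.startswith(init):
                    return init, base[len(init):], tone
            return "", base, tone

        print(split_pinyin("zhong1"))   # ('zh', 'ong', '1')
        print(split_pinyin("an4"))      # ('', 'an', '4')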
  • Question Answering, Dialogue System and Machine Reading Comprehension
    YANG Zhiming, WANG Laiqi, WANG Yong
    2019, 33(5): 122-131.
    Human-machine conversation technology has received extensive attention from academia and industry in recent years. Classifying the intent of users' questions is a key issue that directly affects the quality of human-machine dialogue. In this paper, we propose an intent classification dual-channel convolutional neural network (ICDCNN): we first extract semantic features by using Word2vec and an embedding layer to train the word vectors; then, two separate channels are used for convolution, one over character-level vectors and the other over word-level vectors; the fine-grained character-level vectors are combined with the word-level vectors to mine deeper semantic information in natural language questions; finally, convolution kernels of different sizes are used to learn deeper abstract features within the questions. Experimental results show that the algorithm achieves high accuracy on a Chinese dataset and has certain advantages over other methods.
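    A minimal PyTorch sketch of the dual-channel structure: one channel convolves character-level vectors and the other word-level vectors, each with kernels of several sizes, and the pooled features are concatenated for intent classification. Vocabulary sizes, filter counts and the number of intents are illustrative assumptions.

        import torch
        import torch.nn as nn

        class ICDCNNSketch(nn.Module):
            """Dual-channel CNN: one channel over char-level vectors, the other over word-level vectors."""
            def __init__(self, char_vocab, word_vocab, emb_dim=100, n_filters=64, n_intents=10):
                super().__init__()
                self.char_emb = nn.Embedding(char_vocab, emb_dim, padding_idx=0)
                self.word_emb = nn.Embedding(word_vocab, emb_dim, padding_idx=0)
                self.char_convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.word_convs = nn.ModuleList(
                    [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
                self.fc = nn.Linear(6 * n_filters, n_intents)

            @staticmethod
            def channel(x, convs):
                x = x.transpose(1, 2)                    # (batch, emb_dim, len)
                return torch.cat([torch.relu(c(x)).max(dim=2).values for c in convs], dim=1)

            def forward(self, chars, words):             # (batch, char_len), (batch, word_len)
                char_feats = self.channel(self.char_emb(chars), self.char_convs)
                word_feats = self.channel(self.word_emb(words), self.word_convs)
                return self.fc(torch.cat([char_feats, word_feats], dim=1))  # intent logits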
  • Sentiment Analysis and Social Computing
    LIN Huaiyi, LIU Zhen, CHAI Yumei, LIU Tingting, CHAI Yanjie
    2019, 33(5): 132-142.
    The main methods for dealing with class imbalance in deep learning focus on the cost function and on sampling techniques. Based on word vector transfer, this paper proposes a pre-training task selection method that initializes the target model with pre-trained word vectors that facilitate the differentiation of minority classes. Combined with balanced oversampling, the sample information is used to maintain the model's accuracy on majority classes, so that the text features extracted by the model are balanced. Compared with plain oversampling, experimental results show that the proposed method achieves a better balancing effect in most text emotion classification settings without serious over-fitting; when serious over-fitting occurs, the method still yields a significant balancing effect in the three-class task. Experiments also verify that the pre-training method can be combined with cost-sensitive methods to further improve the balance performance.
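    The balanced oversampling component, in isolation, can be sketched in a few lines of Python: minority-class samples are duplicated at random until every class reaches the size of the largest class. This is only the sampling half of the method; the pre-trained-word-vector selection is not shown, and the function name is illustrative.

        import random
        from collections import defaultdict

        def balanced_oversample(samples, labels, seed=0):
            """Duplicate minority-class samples until every class matches the largest class."""
            rng = random.Random(seed)
            by_label = defaultdict(list)
            for s, y in zip(samples, labels):
                by_label[y].append(s)
            target = max(len(group) for group in by_label.values())
            out_samples, out_labels = [], []
            for y, group in by_label.items():
                picked = group + [rng.choice(group) for _ in range(target - len(group))]
                out_samples.extend(picked)
                out_labels.extend([y] * target)
            return out_samples, out_labels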