2019 Volume 33 Issue 3 Published: 15 March 2019
  

  • Language Analysis and Calculation
    ZHOU Yi, CHU Xiaomin, ZHU Qiaoming, JIANG Feng, LI Peifeng
    2019, 33(3): 1-7,24.
    Macro discourse analysis aims to analyze the semantic relations between adjacent paragraphs or paragraph groups, a fundamental yet less-addressed task in natural language processing. This paper proposes a classification model to identify macro-level discourse relations. The model introduces a distributed representation of macro discourse semantics built on word vectors, together with a set of structural features, to improve performance. Experimental results on the Macro Chinese Discourse Tree Bank (MCDTB) show that the F1 value of our model reaches 68.22%, a 4.17% improvement.
  • Language Analysis and Calculation
    HUO Huan, XUE Yaohuan, HUANG Junyang, JIN Xuancheng, ZOU Yiting
    2019, 33(3): 8-16.
    Current methods of combining constituent trees with LSTM (C-TreeLSTM) suffer from low accuracy in text modeling because they do not compute word information in the hidden states of internal nodes. This paper proposes a hybrid neural network model, SC-TreeLSTM, based on the constituent tree structure. The model enhances the nodes' memory of text semantics by injecting, during encoding, the semantic vector of the phrase covered by each node. Experimental results show that SC-TreeLSTM achieves excellent performance in both sentiment classification and machine reading comprehension tasks.
  • Language Resources Construction
    LI Jing, ZHANG Haisong, SONG Yan
    2019, 33(3): 17-24.
    This paper presents a large-scale corpus for non-task-oriented dialogue systems, containing over 27K distinct prompts with more than 82K responses collected from social media. To annotate this corpus, we define a 5-grade rating scheme (bad, mediocre, acceptable, good, and excellent) with respect to relevance, coherence, informativeness, interestingness, and the potential to move a conversation forward. To test the validity and usefulness of the produced corpus, we compare various unsupervised and supervised models for response selection. Experimental results confirm that the proposed corpus is helpful in training response selection models.
  • Knowledge Representation and Acquisition
    ZHANG Minghua, WU Yunfang, LI Weikang, ZHANG Yangsen
    2019, 33(3): 25-32.
    To learn distributed representations of text sequences, previous methods have focused on complex recurrent neural networks or supervised learning. In this paper, we propose a gated mean-max autoencoder for both Chinese and English text representation. Our model relies solely on the multi-head self-attention mechanism to construct the encoder and decoder. In encoding, we propose a mean-max strategy that applies both mean and max pooling operations over the hidden vectors to capture diverse information from the input. To let this information steer the reconstruction process, the decoder employs an element-wise gate to select dynamically between the mean and max representations. By training our model on large amounts of unlabelled Chinese and English data respectively, we obtain high-quality text encoders that are made publicly available. Experimental results on reconstructing coherent long texts from the encoded representations demonstrate the superiority of our model over traditional recurrent neural networks, in terms of both performance and complexity.
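The gated selection between mean and max pooling described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the gate parameters `W` and `b`, the dimensions, and the sigmoid gate form are all assumptions for the sketch (the actual model builds its encoder from multi-head self-attention).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_mean_max(H, W, b):
    """Combine mean- and max-pooled sentence vectors with an
    element-wise gate, in the spirit of the gated mean-max idea.

    H: (seq_len, d) matrix of encoder hidden vectors.
    W: (d, 2*d) gate weights, b: (d,) gate bias (hypothetical shapes).
    """
    mean_vec = H.mean(axis=0)   # (d,) mean pooling over time
    max_vec = H.max(axis=0)     # (d,) max pooling over time
    # Element-wise gate decides, per dimension, how much of each view to keep.
    g = sigmoid(W @ np.concatenate([mean_vec, max_vec]) + b)
    return g * mean_vec + (1.0 - g) * max_vec

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))   # 5 time steps, hidden size 8
W = rng.standard_normal((8, 16))
b = np.zeros(8)
v = gated_mean_max(H, W, b)
print(v.shape)  # (8,)
```

Because the gate is a sigmoid, each output dimension is a convex combination of the mean-pooled and max-pooled values, so the result always lies between the two views.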
  • Knowledge Representation and Acquisition
    ZHU Jingwen, YANG Yuji, XU Bin, LI Juanzi
    2019, 33(3): 33-41.
    HowNet is a large-scale, high-quality cross-lingual commonsense knowledge base containing a wealth of semantic information. This paper disassembles HowNet's complex structure and obtains HownetGraph in the form of a knowledge graph. Network Representation Learning and Knowledge Representation Learning methods are then applied to obtain cross-lingual vector representations of different semantic units, i.e., word, sense, DEF_CONCEPT and sememe. Two series of experiments (word similarity and word analogy) are conducted on Chinese and English datasets, and the results show that the proposed method achieves the best results.
  • Machine Translation
    LI Bei, WANG Qiang, XIAO Tong, JIANG Yufan, ZHANG Zheyang, LIU Jiqiang, ZHANG Li, YU Qing
    2019, 33(3): 42-51.
    Ensemble learning has proved effective in machine translation evaluation campaigns, but sub-model selection and integration strategies remain under-addressed. This paper examines two kinds of ensemble learning methods, parameter averaging and model fusion, in machine translation tasks, and investigates the impact of model diversity and quantity on system performance from the perspectives of data and model. Experimental results show that the best configuration yields an improvement of 3.19 BLEU points over a strong Transformer baseline on WMT Chinese-English MT tasks.
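Of the two ensemble methods mentioned, parameter averaging is the simpler: the weights of several checkpoints of the same architecture are averaged element-wise into a single model. A minimal sketch, with checkpoints represented as dicts of NumPy arrays (the parameter names and shapes here are illustrative, not the paper's):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    checkpoints: list of dicts mapping parameter name -> array,
    all sharing the same keys and shapes.
    """
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return avg

# Two toy "checkpoints" of a one-parameter model.
ckpts = [{"w": np.array([1.0, 3.0])},
         {"w": np.array([3.0, 5.0])}]
print(average_checkpoints(ckpts)["w"])  # [2. 4.]
```

Model fusion, by contrast, keeps the sub-models separate and combines their output distributions at decoding time, which is more expensive but can exploit more diverse sub-models.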
  • Machine Translation
    ZHANG Nan, LI Xiang, JIN Xiaoning, CHEN Wei
    2019, 33(3): 52-58.
    To produce case-sensitive English words in machine translation, this paper proposes a method for jointly predicting English words and their case in neural machine translation. Words and their case are predicted in the same decoder, employing information from both the source and the target corpus. The method not only decreases the vocabulary size and the number of model parameters, but also improves translation quality. Compared to the baseline system on the WMT 2017 Chinese-English news translation task test set, the proposed method achieves a 0.97 BLEU improvement in case-insensitive evaluation and 1.01 in case-sensitive evaluation.
  • Ethnic Language and Cross Language Information Processing
    BI Yude, JIANG Bowen
    2019, 33(3): 59-63,101.
    Sentence similarity computation is a fundamental task in natural language processing; for instance, it directly affects translation quality in an EBMT (Example-Based Machine Translation) system. Focusing on Korean, this paper puts forward a method for computing Korean sentence structure similarity according to the characteristics of Korean sentences. The method first extracts the skeleton of a Korean sentence and then further processes the skeleton with the transformation rules designed in this paper. The final sentence similarity is measured in this structure space, and experiments validate the feasibility and efficiency of the method.
  • Ethnic Language and Cross Language Information Processing
    WANG Lulu, AISHAN Wumaier, TUERGEN Yibulayin, MAIHEMUTI Maimaiti, KAHAERJIANG Abiderexiti
    2019, 33(3): 64-70.
    Current Uyghur named entity recognition methods are based on statistical learning methods such as conditional random fields, which depend heavily on manual feature engineering and domain knowledge. To resolve this issue, this paper proposes a method based on deep neural networks and introduces different feature vector representations for Uyghur named entity recognition. Word embeddings and character-level embeddings are combined by an attention-based method and then fed into a Bi-LSTM-CRF. Experimental results show that the proposed method achieves an F-value of 90.13% for Uyghur named entity recognition.
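One simple way to realize an attention-based combination of word and character-level embeddings is a softmax over two learned scores; a sketch under assumptions, since the paper's exact attention formulation is not given here: the scoring vector `v` is hypothetical, and both embeddings are assumed to share a dimension.

```python
import numpy as np

def attend(word_vec, char_vec, v):
    """Mix a word embedding with a character-level embedding using
    softmax attention weights; the mixed vector would then feed the
    Bi-LSTM-CRF. v is a (hypothetical) learned scoring vector."""
    scores = np.array([v @ word_vec, v @ char_vec])
    a = np.exp(scores - scores.max())
    a /= a.sum()                       # softmax attention weights
    return a[0] * word_vec + a[1] * char_vec

w = np.array([2.0, 0.0])               # toy word embedding
c = np.array([0.0, 2.0])               # toy character-level embedding
mixed = attend(w, c, np.zeros(2))      # zero scores -> equal weights
print(mixed)  # [1. 1.]
```

Compared with plain concatenation, such a gate lets the model lean on character evidence for out-of-vocabulary or morphologically complex words, which matters for an agglutinative language like Uyghur.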
  • Ethnic Language and Cross Language Information Processing
    HE Junqing, HUANG Xian, ZHAO Xuemin, ZHANG Keliang
    2019, 33(3): 71-78.
    This paper aims to identify similar languages, such as Uyghur and Kazakh, from short conversational texts. To alleviate the severe data imbalance resulting from low-resource Kazakh, we leverage a compensation strategy and an assimilation method by selecting appropriate out-of-domain data. We then construct a maximum entropy (MaxEnt) classifier based on morphological features to discriminate between the two languages and investigate the contribution of each feature. Experimental results suggest that the MaxEnt classifier effectively discriminates between Uyghur and Kazakh on the test set with an accuracy of 95.7%, outperforming the champion of the VarDial’2016 DSL shared task on test sets B1 and B2 by 0.6% and 1.2%.
  • Information Extraction and Text Mining
    SHEN Zizhuo, YANG Ying, SHAO Yanqiu
    2019, 33(3): 79-86.
    This paper presents an investigation into the BaYin, the eight classical musical instruments depicted in ancient Chinese poetry. Specifically, we apply LDA and NMF to establish Author-Topic-Model based author similarity. From the corpora of Tang Poetry and Song Poetry, this paper delivers a panoramic view of the poems, the poets, the topics, and the verbs related to the BaYin.
  • Information Extraction and Text Mining
    ZHAO Yun, WU Fan, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2019, 33(3): 87-93.
    With the development of social media, the network of relationships between users has greatly aided social media analysis. Focusing on predicting user relationships from the text users publish, this paper proposes a friendship prediction model based on an attention mechanism and Long Short-Term Memory (LSTM), which separates the comments between friends and determines whether a friend relationship exists by analyzing the users’ comments. The model takes as input the concatenated comments of the two users and applies the attention mechanism to the output of the LSTM. Experiments show that the proposed model achieves an accuracy of 77% without any additional non-text features.
  • Information Extraction and Text Mining
    XUAN Zhenyu, JIANG Shengyi, ZHANG Liming, BAO Rui
    2019, 33(3): 94-101.
    Person names in movie reviews often appear as abbreviations and neologisms, which degrades the performance of classical models (e.g., CRF). To deal with this issue, this paper proposes a novel person name recognition method named the Multi-Feature Bi-LSTM-CRF Model. The model extracts relevant character-level features using external corpora and unlabeled reviews, then applies Bi-LSTM-CRF to identify person name sequences. Experimental results show that our model can effectively identify different forms of person names in movie reviews.
  • Question Answering, Dialogue System and Machine Reading Comprehension
    TAN Hongye, LIU Bei, WANG Yuanlong
    2019, 33(3): 102-109.
    This paper explores solutions to description-type questions in reading comprehension using a QU-NNs model whose framework comprises an Embedding layer, an Encoding layer, an Interaction layer, a Prediction layer, and an answer Post-processing layer. To handle the high degree of semantic generalization in questions, we integrate three question features (question type, question topic, and question focus) into the Encoding and Interaction layers of the model to better understand the question. Specifically, the question type is identified by a convolutional neural network, and the question topic and question focus are obtained through syntactic analysis. Further, a heuristic method is designed to identify noise and redundant information in the answer. Experiments show that adding question features and removing redundant information improves performance by 2% to 10%.
  • NLP Application
    ZHANG Kai, LI Junhui, ZHOU Guodong
    2019, 33(3): 110-117.
    Thanks to publicly available large-scale image datasets with manually labeled English captions, most studies on image captioning aim at generating captions in a single language (e.g., English). In this paper, we explore zero-resource image captioning, generating Chinese captions via English as the pivot language. Specifically, we propose and compare two approaches that take advantage of recent advances in neural machine translation. The first, a pipeline approach, generates an English caption for a given image and then translates the English caption into Chinese. The second, a pseudo-training-set approach, first translates all English captions in the training and development sets into Chinese to obtain image-Chinese-caption datasets, and then directly trains a model to generate Chinese captions for a given image. Experimental results show that the second approach, i.e., the character-based Chinese caption generation model trained on the pseudo training set, is superior to the pipeline approach.
  • NLP Application
    ZHANG Chenlin, WANG Mingwen, TAN Yiming, CHEN Zhiming, ZUO Jiali, LUO Yuansheng
    2019, 33(3): 118-125,135.
    As one of the Four Great Classical Novels, Journey to the West leaves much foreshadowing open to interpretation. In this paper, we conduct a case study on the Monkey King using sentiment analysis. We apply NLP techniques, automatic word segmentation and sentiment lexicon construction, to calculate the sentiment of the Monkey King. By examining the changes in the Monkey King’s sentiment before and after the episode of “Real and Fake Monkey King”, we propose such points as: the Monkey King was not killed by Rulai, the supreme Buddha, yet he bent to authority after the episode. This paper makes a tentative exploration of sentiment analysis for literary studies.
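The lexicon-based scoring step behind such an analysis can be illustrated with a toy example. The lexicon below is hypothetical and tiny; the paper's lexicon would be far larger and tailored to classical Chinese.

```python
# Toy sentiment lexicon (word -> polarity); purely illustrative.
LEXICON = {"喜": 1.0, "乐": 1.0, "怒": -1.0, "悲": -1.0}

def sentiment_score(tokens):
    """Average lexicon polarity over an already-segmented passage;
    tokens absent from the lexicon are treated as neutral."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

# One positive word, one negative word, one neutral word -> 0.0
print(sentiment_score(["喜", "怒", "人"]))  # 0.0
```

Scoring consecutive chapter segments this way yields a sentiment trajectory for a character, which is what allows a before/after comparison around a given episode.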
  • NLP Application
    LIANG Jiannan, SUN Maosong, YI Xiaoyuan, YANG Cheng, CHEN Huimin, LIU Zhenghao
    2019, 33(3): 126-135.
    Jiju poetry is a special kind of Chinese classical poetry in which each line is selected from an existing poem. As a form of artistic re-creation, the assembled poem should not only obey structural and phonological constraints, but also have an original theme, integrated content, and coherence. In this paper, we propose a novel automatic Jiju poetry generation model based on neural networks. We apply a Recurrent Neural Network (RNN) to learn a vector representation of each poetry line, and then investigate different methods to measure the coherence of two adjacent lines. Both automatic and human evaluation results show that our model can generate high-quality Jiju poems, significantly outperforming the baseline models.
  • NLP Application
    YIN Heju, ZAN Hongying, CHEN Junyi, ZHAI Xinli
    2019, 33(3): 136-144.
    This article investigates automatic judgment of “traffic accident” civil cases in the legal field. A total of 14,000 samples are collected from the “China Judgment Document Network”. Three models are examined, i.e., an SVM-based model, a BI-GRU-based model, and an Attention+BI-GRU-based model, to classify the cases into four classes and eight classes, respectively. The experimental results show that the Attention+BI-GRU model ranks top with an 80.26% F1 on the four-class task, while the BI-GRU model achieves 48.59% F1 on the eight-class task.