2019 Volume 33 Issue 1 Published: 21 January 2019
  

  • Language Analysis and Calculation
    CHEN Long, ZHAN Weidong
    2019, 33(1): 1-9.
    Semantic role labeling, grounded in the theory of semantic roles, is an important task in natural language processing. It requires that a verb's semantic roles be labeled consistently across different sentences. We analyzed the semantic distribution of agents in a semantic role labeling corpus and found that 85 verbs fail to meet this consistency requirement, which may be explained by their being causative rather than volitive verbs. This analysis strategy can be extended to other consistency issues in semantic role labeling.
  • Language Analysis and Calculation
    LIU Hang, LIU Mingtong, ZHANG Yujie, XU Jinan, CHEN Yufeng
    2019, 33(1): 10-17.
    In character-based Chinese dependency parsing, it is crucial to make the best use of the intermediate results of word segmentation, POS tagging, and dependency parsing. To fully exploit dependency subtree information, this paper proposes a novel Stack-Tree LSTM, a character-based neural joint model that integrates subtree and POS features in addition to N-gram features. Experiments on Penn Chinese Treebank 5 show that our model is comparable to the best reported results, outperforming other neural joint models.
  • Language Analysis and Calculation
    ZHANG Wenjing, ZHANG Huimeng, YANG Liner, XUN Endong
    2019, 33(1): 18-24.
    Chinese word segmentation is crucial to Chinese information processing. To achieve multi-grained word segmentation, we propose a model based on the lattice-LSTM. Compared with the traditional character-based LSTM model, our method incorporates multi-granularity dictionary information. With the help of the lattice structure, our model can capture word segmentation standards of different granularities without being confined to any single standard. Experiments show that the proposed method reaches state-of-the-art performance in multi-granularity Chinese word segmentation.
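The lattice input described above can be sketched as follows: every dictionary word matching a span of the character sequence becomes an edge, alongside the single characters, so no single segmentation standard is imposed. This is a minimal illustrative sketch, not the paper's implementation; the function name and toy dictionary are assumptions.

```python
def build_lattice(chars, dictionary):
    """Return (start, end, word) edges for every dictionary word
    matching a span of the character sequence.

    Single characters are always edges, so the lattice contains at
    minimum the plain character sequence.
    """
    edges = []
    n = len(chars)
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = "".join(chars[i:j])
            # Keep single characters unconditionally; longer spans
            # only if the dictionary licenses them.
            if j == i + 1 or word in dictionary:
                edges.append((i, j, word))
    return edges
```

A lattice-LSTM then consumes all of these edges at once, letting the model weigh competing granularities instead of committing to one segmentation.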
  • Language Analysis and Calculation
    SHEN Longxiang, ZOU Bowei, YE Jing, ZHOU Guodong, ZHU Qiaoming
    2019, 33(1): 25-34.
    Negative expressions are common in natural language text and play a critical role in various applications of natural language processing, such as sentiment analysis and information extraction. Negation focus identification is a finer-grained negative semantic analysis task that aims at identifying the text fragment modified and emphasized by a negation keyword. Treating negation focus identification as a sequence labeling task, we propose a bidirectional Long Short-Term Memory network with a Conditional Random Field layer (BiLSTM-CRF). It not only learns contextual information from both directions, but also learns the dependencies between output tags through the CRF layer. Experimental results on the *SEM2012 dataset show that our approach achieves an accuracy of 69.58%, a 2.44% improvement over the state-of-the-art methods.
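The CRF layer's decoding step, finding the best tag sequence given per-token emission scores and tag-transition scores, can be sketched with standard Viterbi decoding. This is a generic sketch, not the authors' code; the tag names and score values in the usage example are illustrative.

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence for one sentence.

    emissions: list of dicts, emissions[t][tag] = score of tag at position t
    transitions: dict, transitions[(prev_tag, tag)] = transition score
    """
    # Initialize with the first token's emission scores.
    score = {tag: emissions[0][tag] for tag in tags}
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for tag in tags:
            # Best previous tag leading into this tag at position t.
            prev = max(tags, key=lambda p: score[p] + transitions[(p, tag)])
            new_score[tag] = score[prev] + transitions[(prev, tag)] + emissions[t][tag]
            ptr[tag] = prev
        score, backptr = new_score, backptr + [ptr]
    # Trace the best path backwards from the highest final score.
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In the BiLSTM-CRF setting, the emission scores come from the BiLSTM outputs and the transition scores are learned parameters of the CRF layer.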
  • Language Analysis and Calculation
    GU Bo, WANG Ruibo, LI Jihong, LI Guochen
    2019, 33(1): 35-45.
    We construct a binary Chinese treebank of 30 000 sentences based on the Chinese syntactic theory proposed by Zhu DeXi and Lu JianMin, in which each parse is a full binary tree represented by Huffman coding for simplicity. For parsing this treebank, we propose a sequence labeling model (RNN-Interval, abbreviated RNN-INT) based on a recurrent neural network (RNN) that tags the intervals between words. We compared RNN-INT with plain RNN, LSTM, and CRF models using the m×2 cross-validated sequential t-test. The experimental results show that the proposed model achieves the best performance with window size 1 in terms of constituency F1 and sentence accuracy, i.e. 71.25% and 43%, respectively.
  • Language Analysis and Calculation
    LIU Yutong, WU Bin, XIE Tao, WANG Bai
    2019, 33(1): 46-55.
    New word detection, a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. In this work, we present an AP-LSTM-CRF model to discover new words in ancient Chinese literature. The model consists of three steps. First, a parallelized improved Apriori algorithm, implemented on Apache Spark (a distributed parallel computing framework), generates candidate character sequences from a large-scale raw corpus. Second, a segmentation model combining a recurrent neural network and a conditional random field generates segmentation sequences with probabilities. Third, a rule-based filter removes noise words from the candidate character sequences. Experimental results demonstrate that the method detects new words in large-scale ancient Chinese corpora effectively, with F1 reaching 89.68% and 81.13% on the Song Poetry and History of the Song Dynasty datasets, respectively.
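The Apriori-style candidate-generation step can be sketched as frequency counting with support-based pruning: a length-n candidate is extended only from a frequent length-(n-1) prefix. This single-machine sketch stands in for the parallel Spark implementation; the function name and thresholds are illustrative assumptions.

```python
from collections import Counter

def frequent_ngrams(text, max_len, min_support):
    """Apriori-style frequent substring mining over a character string.

    A length-n candidate is counted only if its length-(n-1) prefix is
    already frequent, mirroring the Apriori pruning property.
    """
    frequent = {}
    prev = {""}  # frequent candidates of length n-1 (empty string seeds n=1)
    for n in range(1, max_len + 1):
        counts = Counter(
            text[i:i + n]
            for i in range(len(text) - n + 1)
            if text[i:i + n - 1] in prev  # prefix must be frequent
        )
        prev = {g for g, c in counts.items() if c >= min_support}
        frequent.update({g: counts[g] for g in prev})
        if not prev:
            break
    return frequent
```

The surviving character sequences would then be passed to the RNN-CRF segmenter and the rule-based filter described in the abstract.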
  • Language Analysis and Calculation
    LIANG Yongshi, HUANG Peijie, HUANG Peisong, DU Zefeng
    2019, 33(1): 56-67.
    Word vector representations are the basis for various natural language processing (NLP) systems. Studies have shown that word vectors trained on large corpora can be refined by semantic constraints from lexical taxonomies. Based on lexicon-vector interaction and the interaction among heterogeneous taxonomies, we present a method for extracting reliable lexical semantic constraints to better refine word vector representations. In this method, the word class knowledge from lexical taxonomies is assessed for reliability via word vector calculations. Experimental results on PKU 500 from the NLPCC-ICCPOL 2016 shared task on Chinese word similarity measurement show that the proposed method achieves a Spearman score of 0.649 7, a 25.4% improvement over the best result in the shared task.
  • Language Analysis and Calculation
    WEN Zhi, LI Yang, WANG Suge, LIAO Jian, CHEN Xin
    2019, 33(1): 68-76.
    The rhetorical question is a kind of expression with strong emotion. To automatically identify rhetorical questions, this paper proposes a CNN method combined with the sentence structure of rhetorical questions. First, candidate rhetorical questions are selected from microblogs according to feature words and sequence pattern features (>70% confidence). Then, word vectors and the features of the rhetorical questions are extracted, and representations are generated by multiple convolution kernels. Finally, a softmax classifier is used to classify the sentences. The experimental results show that the proposed method achieves 89.5%, 84.2% and 86.7% in terms of accuracy, recall, and F-measure, respectively.
  • Language Analysis and Calculation
    LIU Kai, WANG Hongling
    2019, 33(1): 77-84.
    To improve the readability of automatic summaries, this article applies discourse rhetorical structure information to Chinese automatic summarization. First, abstracts are extracted based on the rhetorical structure of Chinese texts. Then LSTM-based methods are adopted to evaluate the coherence of the abstracts. The experimental results show that automatic summarization based on discourse rhetorical structure achieves better ROUGE scores than traditional methods, and the coherence evaluation results show that discourse structure information helps the system extract the subject of an article automatically.
  • Machine Translation
    REN Zhong, HOU Hongxu, JI Yatu, WU Ziyu, BAI Tiangang, LEI Ying
    2019, 33(1): 85-92.
    In Mongolian-Chinese neural machine translation, data sparseness substantially affects translation quality. This paper applies sub-word granularity segmentation in the Mongolian-Chinese neural machine translation model. The Byte Pair Encoding (BPE) algorithm is adopted to alleviate data sparseness by decomposing low-frequency words into relatively high-frequency sub-word units. Experiments show that the sub-word segmentation technique improves Mongolian-Chinese neural machine translation, achieving BLEU score improvements of 4.81 and 2.96, respectively.
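The core of Byte Pair Encoding is simple enough to sketch: starting from characters, repeatedly merge the most frequent adjacent symbol pair across a word-frequency table. This is a minimal sketch of the standard BPE learning loop, not the paper's code; the toy vocabulary in the test is illustrative.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn Byte Pair Encoding merges from a word-frequency table.

    Each word starts as a tuple of characters; at every step the most
    frequent adjacent symbol pair is merged into one new symbol.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

Applying the learned merges to new text produces the sub-word units that replace rare full words, which is what alleviates the data sparseness the abstract describes.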
  • Machine Translation
    LIN Ye, JIANG Yufan, XIAO Tong, LI Hengyu
    2019, 33(1): 93-102.
    Model storage compression aims to significantly reduce storage cost by removing redundant model parameters without quality loss. Previous efforts are mostly devoted to computer vision tasks, leaving neural machine translation less explored. In this paper, we compare model compression methods including pruning, quantization, and low-precision compression on Transformer and RNN models. Combining these methods, we achieve 5.8× and 11.7× compression ratios on the Transformer and RNN models, respectively, while maintaining the same BLEU score.
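The two main operations compared above can be sketched on a flat weight list: magnitude pruning zeroes out the smallest-magnitude fraction of weights, and uniform quantization maps each weight to one of 2^bits discrete levels. This is a generic sketch under assumed thresholds and bit widths; it ignores the storage format actually used for sparse or quantized weights.

```python
def prune(weights, ratio):
    """Magnitude pruning: zero out the smallest-magnitude fraction of
    weights. Ties at the threshold may prune slightly more than ratio."""
    k = int(len(weights) * ratio)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize(weights, bits=8):
    """Uniform quantization: map each weight to one of 2**bits levels
    spanning [min, max]. Reconstruct a weight as code * scale + lo."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid zero scale for constant weights
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo
```

Combining the two, as the paper does with low-precision storage, multiplies their savings: pruned weights need no code at all, and the survivors shrink from 32-bit floats to small integer codes.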
  • Ethnic Language Processing and Cross Language Processing
    TURGHUN Osman, YANG Yating, WANG Lei, ZHOU Xi, CHENG Li
    2019, 33(1): 103-110.
    This paper presents a novel approach to constructing a Uyghur dependency treebank using a Chinese-Uyghur bilingual corpus. The Chinese dependency information is mapped onto Uyghur sentences through word alignment, and the Uyghur dependency information is then further optimized by morphological constraints. Experimental results on the CoNLL 2017 Shared Task dataset show that the proposed Uyghur parsing model achieves 34.38% and 52.53% improvements in terms of LAS and UAS accuracy, respectively.
  • Ethnic Language Processing and Cross Language Processing
    SE Chajia, GONG Baocairang, CAI Rangjia
    2019, 33(1): 111-117.
    Spelling check of Tibetan syllables is a basic task in Tibetan natural language processing. This paper proposes a syllable vectorization method, called the syllable matrix, designed for the structure of Tibetan syllables. A CNN model for spelling checking is then trained on 1 364 880 Tibetan syllables. The final test on 68 244 Tibetan syllables shows that the CNN model outperforms the TSRM, RNN and LSTM models, achieving 99.52%, 99.30% and 99.41% in terms of accuracy, recall and F value, respectively.
  • Question-answering and Dialogue System
    FENG Wenzheng, TANG Jie
    2019, 33(1): 118-124.
    Answer selection is one of the key tasks in question answering systems. Its main purpose is to rank the candidate answers according to their similarity to the question and select the more relevant answers for users; it can be seen as a text pair matching problem. In this paper, we use deep learning models such as word embeddings, bidirectional LSTMs, and 2D neural networks to extract semantic matching features for question-answer pairs, and incorporate them into a ranking model together with traditional NLP features. Experiments on the Qatar Living community question answering data show that the answer selection model with deep matching features achieves a MAP value about 5% higher than using traditional features only.
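The ranking step, ordering candidates by their similarity to the question, can be sketched with cosine similarity over bag-of-words vectors. This is only a baseline sketch: the paper replaces such surface features with learned deep matching features, and all names and example strings here are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_answers(question, candidates):
    """Return candidate answers sorted by similarity to the question."""
    q = Counter(question.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), c) for c in candidates]
    return [c for score, c in sorted(scored, key=lambda x: -x[0])]
```

Swapping the bag-of-words vectors for neural sentence representations, while keeping the same ranking loop, is essentially what the deep matching features buy.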
  • Question-answering and Dialogue System
    WANG Lu, ZHANG Lu, LI Shoushan, ZHOU Guodong
    2019, 33(1): 125-132.
    To deal with informal texts in social media, where a question text may contain several questions and the answer text several answers, we propose a new task named QA pairing, which aims to identify the answer sentence(s) for each question. First, we build a novel QA pairing corpus of informal text collected from a product review website. Then, to handle the noise in informal text, we propose a novel QA pairing approach: a contextual QA pairing method based on an attention network. Empirical studies demonstrate the effectiveness of the proposed approach to QA pairing.
  • Sentiment Analysis and Social Computing
    CHENG Yan, YE Ziming, WANG Mingwen, ZHANG Qiang, ZHANG Guanghe
    2019, 33(1): 133-142.
    Text sentiment orientation analysis is a fundamental problem in natural language processing. To further improve deep learning based models for this task, this paper proposes a new model named C-HAN, a Chinese sentiment classification model based on a convolutional neural network and a hierarchical attention network. It utilizes a convolution layer to extract a sequence of higher-level phrase representations, which are then fed into a hierarchical attention network to obtain the final representations. On a Chinese sentiment analysis corpus, the character-level C-HAN achieves a sentiment prediction accuracy of 92.34%, slightly better than the word-level C-HAN's 91.96%.
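The building block of a hierarchical attention network is attention pooling: each phrase vector is weighted by a softmax over its score against a query vector, and the weighted sum forms the next-level representation. The sketch below shows only this pooling step with plain lists in place of tensors; the vectors and query in the test are illustrative, not from the paper.

```python
import math

def attention_pool(vectors, query):
    """Attention pooling: weight each vector by softmax(dot(query, vector))
    and return the weighted sum as a single pooled vector."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(weights[i] * vectors[i][d] for i in range(len(vectors)))
            for d in range(dim)]
```

In C-HAN this pooling would be applied first over the CNN's phrase representations within a sentence and then over sentence representations, with the query vectors learned during training.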