2020 Volume 34 Issue 2 Published: 15 April 2020
  

  • Survey
    PENG Xiaoya, ZHOU Dong
    2020, 34(2): 1-15,26.
    With the increasing demand for multilingual information on the Internet, cross-lingual word embedding has become an important basic tool, successfully applied to natural language processing tasks such as machine translation, information retrieval and text sentiment analysis. Cross-lingual word embedding is a natural extension of monolingual word embedding: it transfers knowledge among different languages by mapping them into a shared low-dimensional vector space, so as to accurately capture the meaning of words in multilingual scenarios. In recent years there have been many research achievements on cross-lingual word embedding models. This paper reviews the existing literature and comprehensively discusses the development of cross-lingual word embedding models, methods and technologies. According to how the embeddings are trained, we divide existing methods into three categories: supervised, unsupervised and semi-supervised learning. Finally, we summarize the evaluation and application of cross-lingual word embeddings, and analyze the challenges and future development directions.
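    The supervised family of methods typically learns a linear map between two pre-trained monolingual embedding spaces from a seed translation dictionary. A minimal sketch of the classic orthogonal Procrustes solution (a generic illustration with synthetic vectors, not the procedure of any specific paper in this issue):

```python
import numpy as np

# Toy setup: X holds source-language vectors, Y the target-language vectors
# of their translations (synthetic here; real use would take a seed dictionary).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
W_true = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # unknown orthogonal map
Y = X @ W_true

# Orthogonal Procrustes: the best orthogonal W minimizing ||XW - Y|| is U V^T,
# where U and V come from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

assert np.allclose(X @ W, Y, atol=1e-6)   # noiseless case: recovered exactly
```

    Restricting W to be orthogonal preserves distances within each space, which is why this closed-form solution is a common building block in both supervised and unsupervised mapping methods.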
  • Language Resources Construction
    YU Dong, WU Siyuan, GENG Zhaoyang, TANG Yuling
    2020, 34(2): 16-26.
    We propose a crowd-sourcing annotation approach based on pairwise comparison. With this approach, non-expert annotators' comparative judgements lead to labelled data with a uniform standard. We construct a textbook-based corpus of 18,411 Chinese sentences and use it to train a machine learning model capable of predicting both the difficulty of individual sentences and the relative difficulty of sentence pairs. We also explore the impact of multi-level linguistic features on the two difficulty prediction tasks, in which our model achieves 63.37% and 67.95% accuracy, respectively. The results show that Chinese character-level features are the most predictive among all the features in both tasks.
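    One standard way to turn such pairwise "harder than" judgements into a single difficulty scale (a sketch of the general idea, not necessarily the authors' exact procedure) is the Bradley-Terry model, fitted here with minorization-maximization updates:

```python
def bradley_terry(pairs, n_items, n_iter=500):
    """Turn pairwise (winner, loser) judgements into per-item latent
    scores via Bradley-Terry minorization-maximization updates."""
    wins = [0] * n_items
    opponents = [[] for _ in range(n_items)]
    for winner, loser in pairs:
        wins[winner] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)
    p = [1.0] * n_items
    for _ in range(n_iter):
        new = []
        for i in range(n_items):
            denom = sum(1.0 / (p[i] + p[j]) for j in opponents[i])
            new.append(wins[i] / denom if denom else p[i])
        total = sum(new)
        p = [x * n_items / total for x in new]   # renormalize
    return p

# Sentence 2 is judged harder than 1 and 0; sentence 1 harder than 0.
scores = bradley_terry([(2, 1), (2, 0), (1, 0), (2, 1), (1, 0), (2, 0)], 3)
assert scores[2] > scores[1] > scores[0]
```

    The recovered scores are consistent across annotators because each judgement is only a local comparison, which is the appeal of the pairwise protocol for non-experts.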
  • Machine Translation
    CAO Yichao, GAO Yi, LI Miao, FENG Tao, WANG Rujing, FU Sha
    2020, 34(2): 27-32,37.
    To improve Mongolian-Chinese neural machine translation, this paper proposes a method based on monolingual corpora and word embedding alignment. First, the Mongolian and Chinese word embedding spaces are aligned to initialize the embedding layers of the model. Second, joint training is employed to train Mongolian-to-Chinese and Chinese-to-Mongolian translation at the same time. Finally, Mongolian and Chinese monolingual corpora are used to train the model as a denoising autoencoder. Experimental results show that the proposed method outperforms the baseline and improves the performance of Mongolian-Chinese neural machine translation.
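    The denoising-autoencoder objective reconstructs a sentence from a corrupted version of itself. A common corruption function in unsupervised NMT work (a sketch with placeholder tokens, not taken from this paper) combines word dropout with local shuffling:

```python
import random

def add_noise(tokens, drop_prob=0.1, k=3, seed=0):
    """Corrupt a sentence for denoising-autoencoder training: drop each
    word with probability drop_prob, then shuffle words locally so that
    each surviving word moves at most k-1 positions."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    # Assign each word a jittered sort key; sorting yields a local shuffle.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

sent = "ta nar sain baina uu".split()   # placeholder tokens
noisy = add_noise(sent)
assert set(noisy) <= set(sent)          # only original words survive
```

    Training the model to undo this noise forces the decoder to learn a fluent language model of each side, which is what makes monolingual-only training useful.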
  • Ethnic Language Processing and Cross Language Processing
    CAI Rangzhuoma, CAI Zhijie
    2020, 34(2): 33-37.
    Word segmentation is a classical topic in natural language processing. This paper proposes a strategy and algorithm for Tibetan word segmentation based on part-of-speech rules. Compared with traditional methods, this method not only deals with ambiguity effectively, but also achieves better performance on unknown words in Tibetan.
  • Information Extraction and Text Mining
    TANG Chao, NUO Minghua, HU Yan
    2020, 34(2): 38-45.
    The main purpose of relation extraction is to transform unstructured or semi-structured text into structured data, focusing on identifying entities in text and, especially, extracting the semantic relationships between them. This paper explores a hybrid model of ResNet and BiGRU. Based on the characteristics of ResNet, we combine a residual-learning CNN with an RNN for the entity relation extraction task. The residual blocks, RNN and attention mechanism are used jointly for weakly supervised relation extraction. Experimental results on the NYT-Freebase dataset indicate that the P@N results are improved by 2.9% compared with the single ResNet.
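    The residual connection at the heart of ResNet can be sketched in a few lines (a toy dense version for illustration; the model above uses convolutional residual blocks):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Minimal residual unit: two linear layers with ReLU compute F(x),
    and the identity shortcut adds x back, so the layers only need to
    learn the residual F(x) = H(x) - x rather than the full mapping."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(relu(x @ W1) @ W2 + x)

x = np.array([[1.0, 2.0, 3.0]])
d = x.shape[1]
# With zero weights F(x) = 0, so the block reduces to the identity.
assert np.allclose(residual_block(x, np.zeros((d, d)), np.zeros((d, d))), x)
```

    Because the shortcut makes the identity trivially representable, stacking such blocks keeps gradients flowing in deep feature extractors, which is the property the hybrid model borrows.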
  • Information Extraction and Text Mining
    CHANG Chen, CAO Jianjun, LV Guojun, ZHENG Qibin, WENG Nianfeng
    2020, 34(2): 46-55.
    To discover the truth in text data, a truth discovery method based on a Bi-GRU with an attention mechanism is proposed. According to the characteristics of text data (the multifactorial nature of text answers, the diversity of word usage, and the sparseness of the data), fine-grained user answers are taken as the input of the network, and the Bi-GRU is used to extract the semantic information of the answers. Moreover, keyword reliability and answer reliability are learned with the attention mechanism. Finally, the context vector for each question can be learned without supervision according to the general hypothesis of truth discovery. The experimental results show that the proposed algorithm outperforms retrieval-based methods and other traditional truth discovery methods.
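    The keyword-reliability weighting can be illustrated with a generic softmax attention pooling step (a sketch with made-up vectors, not the paper's trained parameters):

```python
import numpy as np

def attention_pool(H, q):
    """Aggregate token vectors H (n x d) into one answer representation,
    weighting each token by its dot-product relevance to a context
    vector q, softmax-normalized so the weights sum to one."""
    scores = H @ q
    w = np.exp(scores - scores.max())   # stable softmax
    w = w / w.sum()
    return w @ H, w

H = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
q = np.array([1.0, 0.0])
vec, w = attention_pool(H, q)
assert abs(w.sum() - 1.0) < 1e-9
assert w.argmax() == 0   # the token most aligned with q dominates
```

    In the truth-discovery setting, q plays the role of the question's context vector, so reliable keywords receive larger weights in the pooled answer representation.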
  • Information Extraction and Text Mining
    HONG Zhuangzhuang, HUANG Zhaohua, WAN Zhongbao, ZHANG Wei, GAO Mengxi
    2020, 34(2): 56-62.
    Domain texts are characterized by complex structure, high similarity and dynamic change. With a mixture of continuous and discrete data types, existing knowledge discovery methods are limited in the efficiency of mining text rules. To deal with this issue, this paper proposes a text rule mining method based on a GMM and Rough Set. Firstly, the method constructs an information table according to the attribute types of the target data. Then, a Gaussian Mixture Model (GMM) clustering algorithm is applied to cluster the continuous data, by which the data is discretized, the states are reduced, and a decision table is generated. Finally, rough set theory is used to reduce the attributes of the decision table, and decision rules are extracted from the reduced table. The experimental results show that the proposed method has higher precision and stronger attribute reduction ability, achieving an average precision and F-score of 95.0% and 95.7%, respectively.
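    The GMM discretization step can be sketched with a small one-dimensional EM loop (a self-contained illustration on synthetic data; the method above would apply a full multivariate GMM):

```python
import numpy as np

def gmm_discretize(x, k=2, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM and label each value with its most
    likely component, discretizing a continuous attribute so it can enter
    a rough-set decision table."""
    mu = np.linspace(x.min(), x.max(), k)        # deterministic init
    sigma = np.full(k, x.std() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-6
        pi = n / len(x)
    return dens.argmax(axis=1)

x = np.concatenate([np.linspace(0.0, 0.5, 20), np.linspace(9.5, 10.0, 20)])
labels = gmm_discretize(x)
assert labels[0] != labels[-1]   # the two clusters get distinct symbols
```

    Each continuous value is replaced by its component index, so a column of reals collapses into a few symbolic states that rough-set attribute reduction can handle.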
  • Information Extraction and Text Mining
    ZHOU Yeheng, SHI Jiahan, XU Ruifeng
    2020, 34(2): 63-72.
    For the text matching task, this paper proposes a method that incorporates a large-scale pre-trained model and an external language knowledge base. On top of the pre-trained model, the method introduces external linguistic knowledge by generating a synonym-antonym knowledge learning task and a phrase-collocation knowledge learning task based on WordNet. The two newly generated tasks are then jointly trained with the MT-DNN multi-task learning model to further improve performance. Finally, annotated text matching data is used for fine-tuning. The experimental results on two open datasets, MRPC and QQP, show that introducing external language knowledge for joint training, within the framework of large-scale pre-training and fine-tuning, effectively improves text matching performance.
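    Generating the auxiliary task data amounts to emitting labelled word pairs from lexical relations. A sketch using a hand-made miniature lexicon (a hypothetical stand-in for real WordNet lookups):

```python
# Miniature hand-made lexicon standing in for WordNet (hypothetical entries).
LEXICON = {
    "happy": {"synonyms": {"glad", "joyful"}, "antonyms": {"sad"}},
    "fast":  {"synonyms": {"quick", "rapid"}, "antonyms": {"slow"}},
}

def make_pair_examples(lexicon):
    """Generate labelled word pairs for an auxiliary synonym-antonym
    classification task, to be trained jointly with the main matching model."""
    examples = []
    for word, rels in sorted(lexicon.items()):
        for syn in sorted(rels["synonyms"]):
            examples.append((word, syn, "synonym"))
        for ant in sorted(rels["antonyms"]):
            examples.append((word, ant, "antonym"))
    return examples

pairs = make_pair_examples(LEXICON)
assert ("happy", "sad", "antonym") in pairs
```

    Because these examples are generated rather than annotated, the auxiliary tasks add lexical knowledge at essentially no labelling cost, which is what makes joint training with MT-DNN attractive.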
  • Information Extraction and Text Mining
    HUANG Peisong, HUANG Peijie, DING Jiande, AI Wencheng, ZHANG Jinchuan
    2020, 34(2): 73-79.
    Attention-based bidirectional long short-term memory (BiLSTM) models have recently shown promising results in text classification tasks. However, when the amount of training data is restricted, or the distribution of the test data differs substantially from the training data, some potentially informative words are hard to capture in training. In this work, we propose a new method to learn co-attention for domain classification. Unlike past attention mechanisms guided only by the domain tags of the training data, we leverage the latent topics in the dataset to learn a topic attention mechanism and apply it to the BiLSTM. The co-attention is then obtained by combining the topic attention and the network attention. Experiments on the SMP-ECDT benchmark corpus show that the proposed co-attention mechanism outperforms the soft attention, hard attention and topic attention mechanisms in domain classification, with accuracy improvements of 2.85%, 1.86% and 1.74%, respectively.
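    The combination step can be illustrated generically (additive score fusion followed by softmax is one plausible combination, not necessarily the paper's exact formula):

```python
import numpy as np

def co_attention(net_scores, topic_scores):
    """Fuse the network's attention scores with topic-model attention
    scores into a single distribution over tokens: add the two score
    vectors, then softmax-normalize."""
    s = np.asarray(net_scores, float) + np.asarray(topic_scores, float)
    w = np.exp(s - s.max())   # stable softmax
    return w / w.sum()

w = co_attention([2.0, 0.5, 0.1], [1.5, 0.2, 2.0])
assert abs(w.sum() - 1.0) < 1e-9
assert w.argmax() == 0   # the token scored high by both mechanisms wins
```

    The fused weights let topic evidence promote informative words that the label-guided attention alone would miss when training data is scarce.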
  • Sentiment Analysis and Social Computing
    SONG Shuangyong, WANG Chao, CHEN Chenglong, ZHOU Wei, CHEN Haiqing
    2020, 34(2): 80-95.
    AliMe is a recently developed chatbot focused on the intelligent customer service domain. Emotion analysis technologies have been successfully applied in many of its modules. The technical details of these emotion-analysis-based modules are presented, including user emotion detection, user emotion comfort, emotional generative chatting, customer service quality control, session satisfaction prediction and an intelligent entrance for manual customer service. Furthermore, some user-interface examples of these emotional modules are introduced to illustrate their effects.
  • Sentiment Analysis and Social Computing
    LI Yang, WU Zhuojia, WANG Suge, LIANG Jiye
    2020, 34(2): 96-104.
    A rhetorical question expresses strong emotion in the form of an interrogative sentence, and its effective identification can provide technical support for sentiment analysis in natural language processing. This paper proposes an identification method for rhetorical questions based on the automatic acquisition of language features. Firstly, a data-driven feature extraction model is established with a label attention mechanism to obtain word, syntactic-structure, symbolic and topic features from a sentence. Secondly, the target sentence and the corresponding language features are represented by a Bi-LSTM model. On this basis, the interactive attention between the two is used to obtain attention weight vectors for the words and symbolic flags in the target sentence. By applying these attention weights to the Bi-LSTM representation of the target sentence, a language-feature-strengthened rhetorical question identification model is established. Compared with previous work on a Chinese microblog dataset, the experimental results show that the proposed method significantly improves the performance of rhetorical question identification.
  • Sentiment Analysis and Social Computing
    HU Shengwei, LI Bicheng, LIN Kongjie, XIONG Yao
    2020, 34(2): 105-112.
    This paper proposes an unsupervised text sentiment transfer solution based on a MaskAE model. Firstly, the sentiment words in a sentence are detected with a dictionary and replaced by a mask token. The MaskAE model then regenerates the masked sentiment words while leaving the other words unchanged, which alleviates the text attribute entanglement problem. During training, a sentiment discriminator controls the sentiment of the generated sentences, avoiding dependence on a parallel corpus. The experimental results show that our model improves on both automatic and human evaluation metrics, and the generated sentences also perform better in grammatical and semantic correctness.
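    The dictionary-based masking step is simple to sketch (the word list and mask token here are hypothetical placeholders, not the paper's lexicon):

```python
# Hypothetical sentiment dictionary; a real system would use a curated lexicon.
SENTIMENT_WORDS = {"love", "hate", "great", "terrible"}

def mask_sentiment(tokens, lexicon=SENTIMENT_WORDS, mask="<mask>"):
    """Replace dictionary-matched sentiment words with a mask token; the
    autoencoder then refills only the masked slots, leaving content
    words untouched to limit attribute entanglement."""
    return [mask if t.lower() in lexicon else t for t in tokens]

assert mask_sentiment("I love this great movie".split()) == \
       ["I", "<mask>", "this", "<mask>", "movie"]
```

    Because only the masked slots are regenerated, the content of the sentence is preserved by construction, and the discriminator only has to steer what fills the slots.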