2019 Volume 33 Issue 12 Published: 16 December 2019
  

  • Survey
    ZHUANG Chuanzhi, JIN Xiaolong, ZHU Weijian, LIU Jingwei, BAI Long, CHENG Xueqi
    2019, 33(12): 1-18.
    Information Extraction (IE) is a natural language processing task that extracts structured information from plain unstructured text, and Relation Extraction (RE) is a crucial component of IE. Recently, researchers have paid great attention to deep learning, producing a variety of methods in this field. Starting from the basic concepts of relation extraction, this paper groups existing methods from different perspectives, introduces the popular datasets, and outlines the deep learning frameworks for relation extraction. It then analyzes and reviews the details of data preprocessing and model design in these methods. Finally, future research directions are discussed.
  • Language Analysis and Calculation
    HUAN Min, CHENG Haoyi, LI Peifeng
    2019, 33(12): 19-27.
    Event coreference resolution is a challenging task in natural language processing. Since the semantics of an event is mainly conveyed by its trigger and arguments, this paper introduces a structured representation of events into a neural network model, GAN-SR (gated attention network with structured representation), for document-level Chinese event coreference resolution. Firstly, we apply semantic role labeling and dependency parsing to analyze the shallow semantics of events, and then use a quintuple to represent their structures. Secondly, we use a GRU to encode the various kinds of event information and apply a multi-head attention mechanism to mine the important features of events and event pairs. Experimental results on the ACE2005 Chinese corpus show that GAN-SR outperforms the state-of-the-art baselines.
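The abstract does not spell out the quintuple's fields, so the sketch below assumes a hypothetical (trigger, subject, object, time, location) structure and shows how field-level matches between two events could feed a coreference model:

```python
from collections import namedtuple

# Hypothetical five-field event structure: the abstract does not name the
# quintuple's fields, so (trigger, subject, object, time, location) is an
# assumption for illustration.
Event = namedtuple("Event", ["trigger", "subject", "object", "time", "location"])

def overlap_features(e1, e2):
    """One binary match feature per field (None never matches)."""
    return [int(a is not None and a == b) for a, b in zip(e1, e2)]

e1 = Event("attack", "army", "city", "Monday", None)
e2 = Event("attack", "army", "town", None, None)
print(overlap_features(e1, e2))  # [1, 1, 0, 0, 0]
```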
  • Language Resources Construction
    ZHANG Wenmin, LI Huayong, SHAO Yanqiu
    2019, 33(12): 28-36.
    Chinese compound noun phrases are characterized by their wide range of use, unique syntactic structure, and complex internal semantics, which have long made them an important research object in linguistic analysis and Chinese information processing. We extend the existing study of noun-only Chinese compound noun phrases to compound noun phrases containing verbs, and construct a corpus of Chinese compound noun phrases with semantic relations. A total of 27,007 sentences are collected from various fields, and the boundaries of the compound noun phrases and their internal semantic relationships are annotated. The corpus is the first to provide context information for Chinese compound noun phrases, and it formulates a new semantic relation system to describe them. In addition to a detailed analysis of the corpus, the automatic identification of Chinese compound noun phrases and their relations is investigated with a BERT+BiLSTM+CRF framework. The experimental results reveal the challenges of this task, and possible solutions are discussed.
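Phrase-boundary plus relation annotation of this kind is commonly serialized as BIO labels for a BiLSTM+CRF tagger; a minimal sketch with invented English tokens and an invented relation tag `MOD` (the corpus's actual label set is not given in the abstract):

```python
# BIO encoding of compound-noun-phrase boundaries, with the internal semantic
# relation attached to the label. Tokens and the "MOD" relation are invented.
def bio_encode(tokens, spans):
    """spans: list of (start, end, relation) phrase annotations, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, rel in spans:
        labels[start] = "B-" + rel
        for i in range(start + 1, end):
            labels[i] = "I-" + rel
    return labels

toks = ["economic", "development", "plan", "was", "approved"]
print(bio_encode(toks, [(0, 3, "MOD")]))
# ['B-MOD', 'I-MOD', 'I-MOD', 'O', 'O']
```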
  • Machine Translation
    YANG Feiyang, ZHAO Yahui, CUI Rongyi, YI Zhiwei
    2019, 33(12): 37-44.
    To achieve multi-language word alignment, an improved multi-language word relevance measure based on PMI and translation probability is proposed. Firstly, it is proved that the PMI measure of the correlation strength between words can be simplified to translation probability for ordinary-frequency words obeying Zipf's law. Secondly, after preprocessing Chinese, English, and Korean parallel corpora, the translation probabilities between words are calculated, and the top-k words with the highest translation probability are chosen as candidate translations. Further optimization is then applied to improve word alignment accuracy. Experimental results show that this method achieves more than 94% accuracy on small-scale corpora, providing a solution for word alignment in low-resource languages.
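As a rough illustration of the PMI-based relevance measure, the sketch below estimates sentence-level co-occurrence PMI on a tiny invented corpus and picks the top-k candidate translations; the real method operates on preprocessed Chinese-English-Korean parallel corpora:

```python
import math
from collections import Counter

# Tiny invented sentence-aligned corpus; the real corpora are Chinese,
# English, and Korean.
pairs = [
    (["cat", "sleeps"], ["mao", "shui"]),
    (["cat", "eats"],   ["mao", "chi"]),
    (["dog", "eats"],   ["gou", "chi"]),
    (["dog", "sleeps"], ["gou", "shui"]),
]

src_cnt, tgt_cnt, co_cnt = Counter(), Counter(), Counter()
for src, tgt in pairs:
    for s in set(src):
        src_cnt[s] += 1
        for t in set(tgt):
            co_cnt[s, t] += 1
    for t in set(tgt):
        tgt_cnt[t] += 1
n = len(pairs)

def pmi(s, t):
    # PMI(s, t) = log[ p(s, t) / (p(s) p(t)) ], with probabilities
    # estimated from sentence-level co-occurrence counts
    return math.log((co_cnt[s, t] / n) / ((src_cnt[s] / n) * (tgt_cnt[t] / n)))

def top_k(s, k=1):
    cands = [t for (s2, t) in co_cnt if s2 == s]
    return sorted(cands, key=lambda t: pmi(s, t), reverse=True)[:k]

print(top_k("cat"), top_k("sleeps"))  # ['mao'] ['shui']
```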
  • Machine Translation
    LI Jingyu, FENG Yang
    2019, 33(12): 45-53.
    Neural machine translation (NMT) has achieved remarkable progress in recent years. However, how to model contextual information when translating a document remains an open problem. The traditional attention mechanism considers all the words in the source sentences, whereas translating a sentence within a large document actually depends on only a sparse set of tokens. For this purpose, we introduce a co-attention mechanism that captures context by combining hard and soft attention. Specifically, hard attention is first used to select the historical source words related to the sentence currently being translated, and soft attention then extracts the context information needed for the current translation. Experiments show that our method leads to strong improvements in translation quality, greatly outperforming the baseline models.
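A minimal numeric sketch of the hard-then-soft idea, with plain dot-product scoring standing in for the paper's learned attention:

```python
import math

def soft_attn(query, keys):
    """Standard dot-product soft attention over the selected keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]

def co_attention(query, history, k=2):
    """Hard step: keep the k history vectors with the highest dot-product
    score; soft step: attend over the survivors only."""
    kept = sorted(history,
                  key=lambda h: sum(q * v for q, v in zip(query, h)),
                  reverse=True)[:k]
    return soft_attn(query, kept)

ctx = co_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print([round(v, 3) for v in ctx])  # [1.731, 0.0]
```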
  • Machine Translation
    SU Yila, GAO Fen, RENQING Daoerji
    2019, 33(12): 54-60.
    Most current NMT models apply words or sub-words as the units for learning embedded representations. To deal with errors at the word level, this paper applies sub-word segmentation to Mongolian, and sub-word and character segmentation to Chinese, on both LSTM and Transformer translation models. Experimental results show that both the Transformer and LSTM models with character segmentation achieve significant improvements in terms of BLEU.
  • Machine Translation
    CIZHEN Jiacuo, SANGJIE Duanzhu, SUN Maosong, SE Chajia, ZHOU Maoxian
    2019, 33(12): 61-66.
    To better utilize monolingual Tibetan texts in Tibetan-Chinese neural machine translation (NMT), we propose to pre-train a Tibetan neural language model and then integrate it into a Transformer-based Tibetan-Chinese NMT model. Experiments indicate that our approach boosts the Tibetan-Chinese BLEU score from 21.1 to 24.5, and the Chinese-Tibetan BLEU score from 18.6 to 23.3.
  • Machine Translation
    CHE Wanjin, YU Zhengtao, GUO Junjun, WEN Yonghua, YU Zhiqiang
    2019, 33(12): 67-75.
    In neural machine translation, unknown words caused by the limited vocabulary significantly degrade translation quality. Inspired by the integration of external knowledge, this paper improves RNNSearch-based NMT by incorporating a classification dictionary, and proposes a new hybrid network to deal with the unknown word problem in Chinese-Vietnamese neural machine translation. For each source sentence, the model scans the classification dictionary to determine candidate phrase pairs and their tags, and the decoder uses a hybrid network with both word-level and phrase-level components to generate the translation. Experiments on Chinese-Vietnamese, English-Vietnamese, and Mongolian-Chinese NMT show that this method significantly improves translation performance.
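The dictionary-scanning step could look like the following greedy longest-match sketch; the dictionary entries, tags, and diacritic-free Vietnamese tokens are invented for illustration:

```python
# Invented classification dictionary mapping source phrases to a
# (tag, candidate translation) pair.
PHRASE_DICT = {
    ("ha", "noi"): ("LOC", "Hanoi"),
    ("hoa", "hong"): ("FLORA", "rose"),
}

def scan(tokens, max_len=3):
    """Greedy longest-match scan; returns (span, tag, translation) candidates."""
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASE_DICT:
                tag, trans = PHRASE_DICT[key]
                hits.append(((i, i + n), tag, trans))
                i += n
                break
        else:
            i += 1
    return hits

print(scan(["toi", "den", "ha", "noi"]))  # [((2, 4), 'LOC', 'Hanoi')]
```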
  • Other Language in/around China
    ROU Te, CAI Rangjia
    2019, 33(12): 76-82.
    Sentence class recognition is essential to the study of syntax and sentence meaning, since Tibetan has no special punctuation marks at the end of sentences to indicate sentence classes. In this paper, a sentence-use classification scheme is proposed based on the context and functional features of sentences. Firstly, we introduce the classification and characteristics of Tibetan sentence classes in grammar. Secondly, we collect a large number of Tibetan sentences and manually label them. Finally, we use a recurrent convolutional neural network to automatically identify Tibetan sentence classes. The experiments show that the model is effective at recognizing Tibetan sentence classes.
  • Other Language in/around China
    ZHAO Xiaobing, BAO Wei, DONG Jian, BAO Wugedele
    2019, 33(12): 83-90.
    To alleviate the scarcity of Tibetan language corpora, this paper proposes data augmentation methods for Tibetan-Chinese bilingual paraphrase detection and Tibetan paraphrase detection. For the Tibetan-Chinese bilingual paraphrase detection task, this paper proposes to augment the available parallel corpora with monolingual Tibetan texts. When the training set is expanded to 200,000 pairs, the Pearson coefficient increases from the baseline's 0.3971 to 0.5476. For the Tibetan text paraphrase detection task, Tibetan syllable vectors are adopted to alleviate the insufficiency of training corpora for word vectors. Experimental results show that the Pearson correlation based on Tibetan syllable vectors reaches 0.6780, about 0.1 higher than the corresponding word-vector-based method.
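The Pearson coefficient reported above measures the correlation between system similarity scores and gold paraphrase judgments; a self-contained sketch with toy values:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy values: system similarity scores vs. binary human paraphrase judgments
r = pearson([0.1, 0.4, 0.8, 0.9], [0, 0, 1, 1])
print(round(r, 3))  # 0.937
```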
  • Other Language in/around China
    WANG Wenhui, BI Yude, LEI Shujie
    2019, 33(12): 91-100.
    For the Vietnamese chunk identification task, this paper proposes two ways to integrate an attention mechanism into the Bi-LSTM+CRF model. The first is to integrate the attention mechanism at the input layer, which allows the model to flexibly adjust the weights of word embeddings and POS feature embeddings. The second is to add a multi-head attention mechanism on top of the Bi-LSTM, which enables the model to learn a weight matrix over the Bi-LSTM outputs and selectively focus on important information. Experimental results show that integrating the attention mechanism at the input layer increases the F-value of Vietnamese chunk identification by 3.08%, and adding the multi-head attention mechanism on top of the Bi-LSTM improves it by 4.56%.
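A minimal sketch of the first scheme, assuming scalar attention scores for the word and POS channels; in the real model these scores are produced by a trained scoring layer:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend_input(word_emb, pos_emb, word_score, pos_score):
    """Scale the word and POS embeddings by softmax-normalized attention
    weights, then concatenate; word_score/pos_score stand in for the
    scores a trained scoring layer would produce."""
    a_word, a_pos = softmax([word_score, pos_score])
    return [a_word * v for v in word_emb] + [a_pos * v for v in pos_emb]

x = attend_input([0.2, -0.1], [1.0, 0.0], word_score=2.0, pos_score=0.5)
print([round(v, 3) for v in x])  # [0.164, -0.082, 0.182, 0.0]
```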
  • Information Retrieval and Question Answering
    QU Beijun, BAI Yu, CAI Dongfeng, CHEN Jianjun
    2019, 33(12): 101-109.
    WeChat official accounts have become one of the important sources of information for many people. Existing official account ranking methods mainly weigh indexes such as the total read count and total like count heuristically, ignoring the impact of article content on the selection of official accounts. In addition to these quantitative indicators, this paper proposes WeChat text features such as topic verticality, post-text stability, topic coverage, and topic relevance. The LambdaMART algorithm is applied, with feature selection performed by principal component analysis. The experimental results show that the proposed method is superior to existing methods.
  • Information Retrieval and Question Answering
    LUO Yang, XIA Hongbin, LIU Yuan
    2019, 33(12): 110-118.
    To better capture the latent representations of users and items and the semantic relationships between words, a collaborative filtering recommendation model combining auxiliary information and an attention-based LSTM is proposed. Firstly, an additional stacked denoising autoencoder is applied to extract the user latent vector from the rating information and the user auxiliary information. Secondly, an LSTM with an attention mechanism is utilized to extract the item latent vector from the item auxiliary information. Finally, the user and item latent vectors are used in probabilistic matrix factorization to predict user preferences. Experiments on two real datasets, MovieLens-100K and MovieLens-1M, show that the proposed model improves on other recommendation algorithms.
  • Sentiment Analysis and Social Computing
    LI Weijiang, QI Fang
    2019, 33(12): 119-128.
    Language knowledge and sentiment resources are not well utilized in current deep learning approaches to sentiment analysis. To address this issue, we propose a novel sentiment analysis model based on a multi-channel bidirectional long short-term memory network (Multi-Bi-LSTM), which generates different feature channels to fully learn the sentiment information in the text. Compared with a CNN, the Bi-LSTM used in this model takes into account the dependencies between words in a sequence and can capture the contextual semantic information of a sentence. Experiments on the Chinese COAE2014 dataset and the English MR and SST datasets reveal better performance of the proposed method than the classical Bi-LSTM, a CNN combined with sentiment-sequence features, and classical classifiers.
  • NLP Application
    MA Chuangxin, LIANG Shehui, CHEN Xiaohe
    2019, 33(12): 129-134.
    To determine the correlations between the pre-Qin schools of thought and to detect the characteristic words that represent each school's themes, this paper makes a quantitative investigation of the relations among the various schools, together with the subject characteristics of each school's ideas. It is revealed that the correlation between Confucianism and Taoism is the highest, the correlation between the School of the Military and Mohism is the lowest, and the mean correlation between Taoism and the other schools ranks at the top. This paper also selects the characteristic words representing each school's subject matter by analyzing the differences in the ranks of word types between schools.
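The rank-difference selection can be sketched as follows, with invented toy frequencies using Pinyin stand-ins such as `li` and `ren`:

```python
# Toy Pinyin word-frequency tables for two schools; the counts are invented.
confucian = {"ren": 50, "li": 40, "dao": 5, "bing": 2}
taoist    = {"dao": 50, "ren": 30, "bing": 20, "li": 4}

def rank(freq):
    """Map each word type to its frequency rank (0 = most frequent)."""
    ordered = sorted(freq, key=lambda w: -freq[w])
    return {w: i for i, w in enumerate(ordered)}

def characteristic_words(freq_a, freq_b, top=2):
    """Words whose rank in corpus A is highest relative to corpus B."""
    ra, rb = rank(freq_a), rank(freq_b)
    common = set(ra) & set(rb)
    return sorted(common, key=lambda w: ra[w] - rb[w])[:top]

print(characteristic_words(confucian, taoist))  # ['li', 'ren']
```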