2019 Volume 33 Issue 7 Published: 15 July 2019
  

  • Survey
    ZHANG Yangsen, DUAN Yuxiang, HUANG Gaijuan, JIANG Yuru
    2019, 33(7): 1-10,30.
    Social media such as Facebook, Twitter, and Sina Microblog have become the main channels through which people exchange information. To cope with the large volume, complex structure, and fast transmission of social media information, topic detection and tracking technology has emerged to produce concise, clear topic information. This paper reviews the work done on social media topic detection and tracking. First, it summarizes three types of topic detection methods, based on topic models, clustering algorithms, and multi-feature fusion, respectively. Second, it introduces research on topic tracking in two categories: non-adaptive topic tracking and adaptive topic tracking. Finally, it lists the problems in current topic detection and tracking technology and discusses the prospects for future research on social media.
  • Language Analysis and Calculation
    ZHANG Kejun, SHI Taimeng, LI Weinan, QIAN Rong
    2019, 33(7): 11-19.
  • Language Resources Construction
    HOU Shengluan, FEI Chaoqun, ZHANG Shuhan
    2019, 33(7): 20-30.
    Rhetorical Structure Theory (RST) is a widely used discourse structure theory that emphasizes rhetorical structure relations (RSRs). Based on English-oriented RST and the characteristics of Chinese text, this paper presents a hierarchical taxonomy and multiple definitions of Chinese-oriented RSRs. Moreover, an annotation method is proposed to deal with the problem of ambiguity. A Java GUI-based tagging tool called RST Tagger is designed and implemented as a bottom-up tagger, whose elementary tagging unit is a subject-predicate structure and whose tagging result is a full discourse structure tree. To validate the proposed tagging framework, 160 Chinese foreign trade texts were selected as the tagging corpus, from which 50 texts were randomly selected to be tagged by different annotators. The agreement between annotators was 76.63%.
  • Language Resources Construction
    DING Ying, LI Junhui, ZHOU Guodong
    2019, 33(7): 31-39.
    Sentence alignment provides high-quality parallel sentence pairs for cross-language natural language processing tasks. Inspired by the intuition that an aligned sentence pair contains a large number of aligned word pairs, this paper proposes a sentence alignment method that models the semantic interaction between word pairs in a neural network framework. In particular, this paper proposes a word-pair relevance network, which first captures the semantic interaction between word pairs from different perspectives and then uses that interaction to predict whether a sentence pair is aligned. Experimental results on monotonic and non-monotonic bitexts show that the proposed approach significantly improves the performance of sentence alignment.
  • Machine Translation
    HAN Dong, LI Junhui, ZHOU Guodong
    2019, 33(7): 40-45.
    Because it cannot fully learn the semantic details of source words, neural machine translation (NMT) tends to produce many wrong word translations in its output. This paper proposes to explicitly incorporate word translation into the NMT encoder. First, a dictionary-based method is used to find the corresponding word translation for each source word. Then two different ways are proposed to fuse the source word and its translation information: (1) Factored Encoder: words and their translation information are added directly; (2) Gated Encoder: the input of word translation information is controlled through a gate mechanism. Based on the state-of-the-art Transformer NMT framework with self-attention, experimental results on a Chinese-English translation task show that the proposed encoders significantly improve performance; in particular, the Gated Encoder achieves an improvement of 0.81 BLEU over the baseline system.
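The two fusion styles named in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact equations, so the gate form used here (a sigmoid over a linear map of the concatenated embeddings, scaling the translation embedding before it is added) is an assumption, and the names `factored_fuse`, `gated_fuse`, `weights`, and `bias` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def factored_fuse(src_vec, trans_vec):
    """Factored-style fusion: element-wise sum of the two embeddings."""
    return [s + t for s, t in zip(src_vec, trans_vec)]


def gated_fuse(src_vec, trans_vec, weights, bias):
    """Gated-style fusion (assumed form): a per-dimension gate in (0, 1)
    scales the translation embedding before it is added to the source
    embedding."""
    concat = src_vec + trans_vec  # list concatenation: [src; trans]
    fused = []
    for i, (s, t) in enumerate(zip(src_vec, trans_vec)):
        pre_activation = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(pre_activation)
        fused.append(s + gate * t)
    return fused


# Toy embeddings and randomly initialised (untrained) gate parameters.
random.seed(0)
dim = 4
src = [random.uniform(-1, 1) for _ in range(dim)]
trans = [random.uniform(-1, 1) for _ in range(dim)]
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim

print(factored_fuse(src, trans))
print(gated_fuse(src, trans, W, b))
```

With the gate fully open the gated form reduces to the factored sum, and with the gate closed it falls back to the source embedding alone, which is one way to read "controlling the input" of translation information.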
  • Machine Translation
    LI Xiang, LIU Yang, CHEN Wei, LIU Qun
    2019, 33(7): 46-55.
    This paper proposes to use a large, high-precision neural machine translation (NMT) model (the teacher model) to distill invisible bilingual knowledge from monolingual data in order to improve the translation quality of a small, low-precision NMT model (the student model). The paper first proposes a pseudo bilingual data method, in which the teacher model translates the monolingual data and the student model is trained on the resulting synthesized data. It then proposes a joint optimization approach that combines negative log-likelihood and knowledge distillation. In addition to the synthetic training data, the student model can be enhanced by using the probability distribution over target language words produced by the teacher model as knowledge under the knowledge distillation framework. Experiments on Chinese-English and German-English translation tasks show that the student model trained with the proposed approaches not only significantly outperforms the baseline student model in translation quality on in-domain test sets, but also generalizes better on an out-of-domain test set.
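A joint objective of this kind is commonly written as a weighted sum of the negative log-likelihood on the reference word and a cross-entropy against the teacher's word distribution. The sketch below follows that common form for a single target position; the interpolation weight `alpha` and the function names are assumptions, since the abstract does not state how the two terms are combined.

```python
import math


def nll(student_probs, gold_index):
    """Negative log-likelihood of the reference (gold) target word."""
    return -math.log(student_probs[gold_index])


def kd_cross_entropy(student_probs, teacher_probs):
    """Cross-entropy of the student distribution against the teacher's
    soft distribution over the target vocabulary."""
    return -sum(q * math.log(p)
                for p, q in zip(student_probs, teacher_probs) if q > 0)


def joint_loss(student_probs, teacher_probs, gold_index, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical weight."""
    return ((1 - alpha) * nll(student_probs, gold_index)
            + alpha * kd_cross_entropy(student_probs, teacher_probs))


# Toy distributions over a three-word target vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(joint_loss(student, teacher, gold_index=0, alpha=0.5))
```

When the teacher distribution is one-hot on the reference word, the distillation term equals the negative log-likelihood, so the joint loss reduces to ordinary training on the (pseudo) bilingual data.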
  • Machine Translation
    TAN Min, DUAN Xiangyu, ZHANG Min
    2019, 33(7): 56-64.
    Translation models trained by neural machine translation systems in resource-rich domains tend to perform poorly in resource-poor domains. This paper proposes domain adaptation based on domain features to improve the quality of neural machine translation in low-resource domains. Specifically, it builds a domain-sensitive network to obtain domain-specific features and a domain-insensitive network to obtain features common across domains. A domain discriminator is used to distinguish the domain. The domain-sensitive network is trained so that the domain discriminator can more easily make accurate judgements. At the same time, an adversarial mechanism is used so that the domain-insensitive network can deceive the domain discriminator. Finally, a system combination mechanism is proposed that combines the base neural translation network, the domain-sensitive network, and the domain-insensitive network for the domain adaptation task. Experimental results show that this method achieves significant improvements on a Chinese-English Broadcast Conversation translation task and an English-German Spoken Language translation task.
  • Machine Translation
    TANG Jian, HONG Yu, LIU Mengyi, YAO Liang, YAO Jianmin
    2019, 33(7): 65-74.
    Image description translation takes a source language description and translates it into the target language, a process that can be supported by information from the image. Because different images often depict different scenes, their descriptions differ markedly in topic distribution. This paper presents an image description translation method that integrates the topic information of the image. For each image and its descriptions, the method retrieves similar images from wiki and then uses the documents of the retrieved images to learn topic distributions. Finally, the topic distributions of all training images and their descriptions are used to relearn the topic distribution and obtain a topic-adapted translation model. Experimental results on the WMT16 test set show an improvement of 0.74 BLEU points over the baseline.
  • Ethnic Language Processing and Cross Language Processing
    LAMA Zhaxi, CAI Zhijie, BAN Mabao
    2019, 33(7): 75-80.
    Tibetan function words play an important role in resolving both syntactic and semantic ambiguity in the Tibetan language. This paper examines the Tibetan function words relevant to natural language processing and proposes a Tibetan function word recognition method combining rules and a Maximum Entropy Model. Experiments show that the precision, recall, and F1 value of the proposed method reach 98.39%, 98.75%, and 98.57%, respectively.
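The three figures reported here are mutually consistent if the first one is read as precision and F1 is the usual harmonic mean of precision and recall. A minimal check under that assumption (the function name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the abstract, in percent.
print(round(f1_score(98.39, 98.75), 2))  # 98.57
```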
  • Ethnic Language Processing and Cross Language Processing
    CAI Zhijie, SUN Maosong, CAI Rangzhuoma
    2019, 33(7): 81-87,100.
    Evaluating word embeddings, an essential research issue, can be done through intrinsic or extrinsic evaluation. Intrinsic evaluation, as the basic approach, usually requires an evaluation set describing the similarity or relevance among words. After examining how word embedding evaluation sets are constructed for English and Chinese, this paper investigates the construction of a Tibetan word embedding evaluation set according to the characteristics of Tibetan. The evaluation sets WordSim215 and TWordRel215 are constructed and analyzed for their effectiveness in evaluating the similarity and relevance of Tibetan word embeddings.
  • Information Extraction and Text Mining
    SONG Rui, CHEN Xin, HONG Yu
    2019, 33(7): 88-100.
    Slot filling extracts the value of a specific slot (the value is also called a filler) for a named entity. Provenance refers to the source of a filler, which is usually a passage or a sentence used to prove that the filler correctly reflects the slot type. Corpus analysis reveals considerable homogeneity among the provenances of slot fillers, a phenomenon referred to as paraphrase. Therefore, we combine the paraphrase technique with an existing knowledge base to explore provenance identification via a paraphrase identification model derived from a small set of seed "provenances". The results show that the paraphrase identification method based on a tree edit model can capture the relevant "provenance" of slot fillers well with little prior knowledge.
  • Information Extraction and Text Mining
    SHI Cunhui, MENG Jian, YU Xiaoming, LIU Yue, JIN Xiaolong, CHENG Xueqi
    2019, 33(7): 101-109.
    It is critical for a web crawler to identify new relevant content and expand its data collection targets in time. A web crawler based on the board-article structure can achieve this goal by frequently revisiting its target sites, but bombarding the sites in this way is not website-friendly. To address this issue, we propose an improved re-crawling strategy based on time series prediction. Experiments show that our method significantly reduces the number of visits required and makes the crawler friendlier to websites while still obtaining the data in time.
  • Information Extraction and Text Mining
    LU Yaojie, LIN Hongyu, HAN Xianpei, SUN Le
    2019, 33(7): 110-117.
    Deep learning methods recently applied to event detection are limited by the scarcity of annotated data and by instability during training. This paper proposes a data augmentation method based on linguistic perturbation for event detection, which generates pseudo data from both syntactic and semantic perspectives to improve the performance of event detection systems. To exploit the generated pseudo data effectively, this paper explores two training strategies: data addition and multi-instance learning. Experiments on the KBP 2017 event detection dataset demonstrate the effectiveness of the approach. Furthermore, empirical results on a manually constructed portion of the ACE2005 dataset show that the proposed method significantly improves model performance when training data is small.
  • Information Extraction and Text Mining
    WU Hao, ZHANG Weiqiang, ZHANG Pengzhu
    2019, 33(7): 118-127.
    The trajectory data generated from users' mobile access to base stations reflect their lifestyles and behavior patterns in both time and space. Based on the fact that temporal and spatial information are produced simultaneously, this paper proposes a TFT-IDFT method to extract semantic information from trajectories. First, the word2vec word embedding method is applied to build trajectory word vectors that include users' geometric and semantic information. Then, classification methods are applied to these vectors to discriminate user age groups. The results show that TFT-IDFT is better suited than TF-IDF to extracting semantic trajectories, and that word vectors based on this method perform better in the age classification task.
  • Sentiment Analysis and Social Computing
    WANG Rongbing, XU Hongyan, FENG Yong, AN Weikai
    2019, 33(7): 128-135.
    To recommend important users with similar interests to micro-blog users, an improved HITS method is used to classify users into categories based on an analysis of the micro-blog users' network structure. With user authority and centrality introduced into the micro-blog topic similarity calculation, micro-blog users are recommended according to user category. Experiments on crawled micro-blog data show that the proposed algorithm significantly outperforms traditional recommendation algorithms.
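For orientation, the standard HITS iteration on a follower graph can be sketched as follows. This is the textbook algorithm, not the improved variant the abstract refers to; the edge direction (follower to followee) and the toy user names are assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Standard HITS on a directed graph given as (source, target) pairs.

    Returns two dicts: authority scores and hub scores, each L2-normalised.
    """
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of nodes pointing at the node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub update: sum of authority scores of nodes the node points at.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub


# Toy follower graph: an edge (u, v) means user u follows user v.
follows = [("a", "c"), ("b", "c"), ("c", "d")]
authority, hubness = hits(follows)
print(max(authority, key=authority.get))
```

In this toy graph the user followed by the most good hubs ends up with the highest authority score, which is the kind of signal a recommender could use to surface important users.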
  • Script Processing
    XIONG Dan, LU Qin
    2019, 33(7): 136-142.
    The Chinese characters used in Hong Kong are listed in the H column of ISO/IEC 10646. This paper introduces further improvements to the extension scheme for characters in Hong Kong's Chinese computer systems and to the encoding scheme of the character resource references in the H column. Since the current glyphs for the Chinese characters in the H column do not accurately reflect the shapes of the glyphs commonly used in Hong Kong, the Reference Glyphs for Chinese Computer Systems in Hong Kong has been developed, and the principles behind this set of reference glyphs are presented.