2018 Volume 32 Issue 11 Published: 15 November 2018
  

  • Machine Translation
    REN Zhong, HOU Hongxu, WU Jing, WANG Hongbin, LI Jinting, FAN Wenting, SHEN Zhipeng
    2018, 32(11): 1-7.
    This paper investigates the statistical Mongolian-Chinese machine translation model and neural network-based machine translation models, including CNN and RNN translation models. To address the low-resource setting and rich morphology of Mongolian, this paper proposes several methods to improve the three translation models. For the top-performing CNN model, we apply a character and phrase joint-training method. For the RNN model, we guide the attention mechanism with Giza++ alignments. For the SMT model, we design a realignment method. Experiments indicate that these approaches significantly improve Mongolian-Chinese translation performance.
  • Ethnic Language and Cross Language Information Processing
    Mairehaba Aili, Tuergen Yibulayin, Jiamila Wushouer
    2018, 32(11): 8-15.
    As an agglutinative language, Uyghur has a complex word structure that complicates dependency analysis. This paper presents several important factors that should be considered when building a Uyghur dependency treebank, including (1) the granularity of dependency, (2) the dependency relations, (3) the annotation guidelines, and (4) the annotation tool. More than 3 400 Uyghur sentences are annotated manually according to the "Uyghur dependency treebank annotation manual". A statistical analysis of the resulting treebank is presented from three aspects.
  • Ethnic Language and Cross Language Information Processing
    Wang Lulu, Aishan Wumaier, Maihemuti Maimaiti, Kahaerjiang Abiderexiti, Tuergen Yibulayin
    2018, 32(11): 16-26,33.
    Research on Uyghur named entity recognition currently focuses on single entity types without exploiting the unsupervised semantic and structural information in unannotated data. A Uyghur named entity recognition method based on semi-supervised learning is proposed in the framework of conditional random fields (CRF). Lexical features, dictionary features and unsupervised learning features based on word embeddings are introduced and analyzed. The experimental results show that the F-score of Uyghur named entity recognition reaches 87.43%.
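As a rough illustration of how the three feature families named above (lexical, dictionary, and embedding-based) can be combined per token before training a CRF, the sketch below builds a feature dictionary for one token. The feature templates, the gazetteer, and the cluster map (e.g. obtained by clustering word embeddings) are illustrative assumptions, not the paper's exact setup.

```python
def token_features(tokens, idx, entity_dict, embed_cluster):
    """Feature dict for one token: lexical context, a gazetteer flag, and
    an unsupervised word-embedding cluster id (hypothetical feature set)."""
    w = tokens[idx]
    return {
        "word": w,
        "suffix2": w[-2:],                                    # lexical feature
        "prev": tokens[idx - 1] if idx > 0 else "<BOS>",      # left context
        "next": tokens[idx + 1] if idx < len(tokens) - 1 else "<EOS>",
        "in_dict": w in entity_dict,                          # dictionary feature
        "cluster": embed_cluster.get(w, "UNK"),               # embedding-derived
    }
```

Feature dicts of this shape can be fed to any CRF toolkit that accepts per-token attribute maps.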
  • Ethnic Language and Cross Language Information Processing
    Aizimaiti Ainiwaer, DONG Jun, LI Xiao
    2018, 32(11): 27-33.
    This paper introduces the features of Uyghur stem structure, affix structure and Uyghur phonetic harmony. Based on Uyghur phonetic harmony, a Uyghur affix variant collocation algorithm is proposed to cover both the basic and the special collocation rules. To verify the correctness and completeness of the stem and affix structure feature extraction, 500 noun stems and 300 verb stems are combined with affix variants, resulting in 9 000 nouns and 37 800 verbs, respectively. The examination results show that the overall accuracy reaches 96.86%, with 98.40% for nouns and 96.49% for verbs.
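The core of vowel-harmony-driven affix selection can be sketched as picking the affix variant that agrees with the stem's last vowel. The vowel classes and the example plural affix pair below are heavily simplified illustrations in Latin transcription; the paper's rule set covers many more cases and the special collocations.

```python
# Simplified, hypothetical vowel classes (Latin-script transcription);
# real Uyghur harmony involves more vowels and exception rules.
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("eöü")

def choose_variant(stem, back_variant, front_variant):
    """Attach the affix variant agreeing with the stem's last vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + back_variant
        if ch in FRONT_VOWELS:
            return stem + front_variant
    return stem + back_variant  # fallback when no vowel is found
```

Generating surface forms this way is how a rule-based collocation algorithm can be checked against manually verified word lists.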
  • Information Extraction and Text Mining
    LIU Wei, CHEN Hongchang, HUANG Ruiyang
    2018, 32(11): 34-40.
    Tree-Based CNN, a tree-structured neural network built on the syntax tree, is implemented to extract entity relations from natural language, enhancing the tree encoding of the original CNN model. Experiments in this paper indicate that Tree-Based CNN improves over CNN and LSTM by 3% and 5%, respectively, on the relation extraction task.
  • Information Extraction and Text Mining
    LI Na
    2018, 32(11): 41-48,61.
    Recently, the rapid development of digital libraries in China has provided a foundation for the deep mining and utilization of collection resources. Taking digitized ancient local chronicles as the research corpus, this paper analyzes the internal and external features of aliases based on full-text manual annotation, and proposes an automatic alias extraction model based on conditional random fields (CRFs). The accuracy of the model reaches 93.52%, indicating that the CRF model is suitable for content mining of ancient local chronicles.
  • Information Extraction and Text Mining
    ZHU Jin, HUAI Libo, CUI Rongyi, YIN Hui
    2018, 32(11): 49-54.
    This paper presents a text feature extraction method based on wavelet analysis, with the TF-IDF vector space as input. The KNN method is employed to compare text classification accuracy in the two spaces. The experimental results show that the wavelet transform reduces the vector space dimension by almost half while maintaining the classification accuracy of the classical vector space model, and that the proposed inverse wavelet transform achieves large dimension reductions for specific text categories, testifying to the correctness and rationality of the compressive sensing approach.
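The near-halving of the vector-space dimension can be illustrated with a single level of the Haar wavelet transform applied to a TF-IDF vector, keeping only the low-pass (approximation) coefficients. The choice of the Haar basis and a single decomposition level is an illustrative assumption; the paper's exact wavelet setup is not reproduced here.

```python
import math

def haar_approx(vec):
    """One level of the Haar wavelet transform, keeping only the
    approximation (low-pass) coefficients, which halves the dimension."""
    if len(vec) % 2:                 # pad odd-length vectors with a zero
        vec = vec + [0.0]
    return [(vec[i] + vec[i + 1]) / math.sqrt(2)
            for i in range(0, len(vec), 2)]
```

A classifier such as KNN can then be run on the shorter `haar_approx` vectors and compared against the original space.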
  • Information Extraction and Text Mining
    CHEN Shimei, WU Xing, TANG Fan
    2018, 32(11): 55-61.
    Negation recognition distinguishes positive from negative information in natural language, which is of substantial significance in information retrieval, text mining and sentiment analysis. This paper investigates cue detection and scope recognition for Chinese negative information by combining a BiLSTM (bidirectional long short-term memory network) with a CRF (conditional random field) as BiLSTM-CRF. Pre-trained word embeddings are input as features to detect cues, and the detected cue features are then added to identify the scope. On a Chinese negation and speculation corpus, cue detection reaches 91.03% in F1 value, and scope recognition reaches 73.91% (on the financial news sub-corpus only). The experimental results show that the proposed method is superior to the CRF model and the BiLSTM model in Chinese negative cue detection and scope recognition.
  • Information Extraction and Text Mining
    LIN Guanghe, ZHANG Shaowu, LIN Hongfei
    2018, 32(11): 62-71,78.
    Named entity recognition (NER) is a fundamental stage in natural language processing (NLP), and its performance has a marked impact on downstream pipelined NLP tasks such as relation extraction and semantic role labeling. Traditional statistical models require difficult feature engineering, their features adapt poorly across domains, and some neural network models neglect the morphological information of words. Aiming at these problems, this paper proposes a new end-to-end neural network model (Finger-BiLSTM-CRF) based on a fine-grained word representation for the named entity recognition task. First, we design Finger, a character-level word representation model based on the attention mechanism, to integrate morphological information with information from each character of the current token. Second, we combine Finger with BiLSTM-CRF for the named entity recognition task. The model, trained in an end-to-end fashion, achieves an F1 score of 91.09% on the CoNLL 2003 test dataset. The experimental results show that the Finger model significantly boosts the recall of the NER system, improving its recognition ability.
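Character-level attention pooling of the kind described above can be sketched as scoring each character vector against a query vector, softmax-normalizing the scores, and taking the weighted sum. This is a generic sketch of attention over character embeddings, not the exact Finger architecture; the query vector and dimensions are illustrative.

```python
import math

def char_attention_word_vec(char_vecs, query):
    """Attention-pooled word vector from its character vectors:
    dot-product scores -> softmax weights -> weighted sum."""
    scores = [sum(q * c for q, c in zip(query, cv)) for cv in char_vecs]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(char_vecs[0])
    return [sum(w * cv[d] for w, cv in zip(weights, char_vecs))
            for d in range(dim)]
```

The pooled vector can then be concatenated with a token-level embedding before the BiLSTM-CRF layers.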
  • Information Retrieval and Question Answering
    LIANG Shiwei, ZHANG Chenrui, CAO Lei, CHENG Junjun, XU Hongbo, CHENG Xueqi
    2018, 32(11): 72-78.
    News recommender systems are a popular research issue, in which the cold-start problem and the rich semantic information in the content challenge classical models. This paper proposes a collaborative joint embedding model to learn user and document vectors with semantic information simultaneously. Specifically, it combines a word-and-document embedding model with a matrix factorization based collaborative filtering model. Experiments on a real-world dataset show that the proposed model outperforms the baseline models.
  • Sentiment Analysis and Social Computing
    GAO Jinhua, SHEN Huawei, CHENG Xueqi, LIU Yue
    2018, 32(11): 79-85.
    Popularity prediction for news in online social networks is of substantial application value. In contrast to existing feature-based and process-based models, this paper presents a prediction approach based on similar historical tweets. For each tweet to be predicted, the K most similar historical tweets are selected for prediction. To measure the similarity between two tweets, an LDA model is utilized to learn tweet representations from cascade data. Experimental results show that the proposed model can successfully identify tweets with similar diffusion patterns, achieving better prediction performance.
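The K-most-similar-tweets step can be sketched as a cosine-similarity nearest-neighbour lookup over tweet representations, predicting the mean popularity of the K neighbours. The use of mean aggregation and plain cosine similarity is an illustrative assumption; the paper's exact similarity and aggregation over LDA representations may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (e.g. LDA topic mixtures)."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def predict_popularity(query_vec, history, k=3):
    """history: list of (representation_vector, final_popularity) pairs.
    Returns the mean popularity of the k most similar historical tweets."""
    ranked = sorted(history, key=lambda h: cosine(query_vec, h[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(p for _, p in top) / len(top)
```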
  • Sentiment Analysis and Social Computing
    SHANG Qi, ZENG Biqing, WANG Shengyu, ZHOU Caidong, ZENG Feng
    2018, 32(11): 86-96.
    The sparseness of rating data is one of the main factors limiting the prediction accuracy of recommender models. To exploit the advantages of convolutional neural networks in feature extraction and of attention mechanisms in feature selection, a probabilistic matrix factorization (PMF) model with an attention convolutional neural network (ACNN) is proposed, named attention convolutional model based matrix factorization (ACMF). Firstly, the ACMF model compresses high-dimensional, sparse word vectors into low-dimensional, dense feature vectors through word embedding. Then, it uses a local attention layer and a convolutional layer to learn features of review documents, and utilizes the latent factors of users and items to reconstruct the rating prediction matrix. Finally, the loss function is set as the root-mean-square error of the rating matrix. Compared with the best existing prediction model, PHD, the ACMF model increases the accuracy on the ML-100k, ML-1m, ML-10m and Amazon datasets by 3.57%, 1.25%, 0.37% and 0.16%, respectively.
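The matrix factorization backbone with a squared-error rating loss can be sketched with plain SGD, omitting the review-text CNN and attention layers of ACMF entirely. Hyperparameters (dimension, learning rate, regularization) are illustrative defaults, not the paper's.

```python
import math
import random

def train_mf(ratings, n_users, n_items, dim=2, lr=0.02, reg=0.02,
             epochs=1000, seed=0):
    """Plain SGD matrix factorization minimising squared rating error.
    ratings: list of (user_index, item_index, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][d] * Q[i][d] for d in range(dim))
            for d in range(dim):
                pu, qi = P[u][d], Q[i][d]
                P[u][d] += lr * (err * qi - reg * pu)
                Q[i][d] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error of the reconstructed rating matrix."""
    se = sum((r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return math.sqrt(se / len(ratings))
```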
  • NLP Application
    MA Chuangxin, CHEN Xiaohe
    2018, 32(11): 97-102.
    This paper puts forward an author identification method based on the rank correlation of high-frequency word types. Words in each corpus are ordered by frequency of occurrence to determine their ranks; the rank correlation of high-frequency word types between corpora is then calculated and used as a measure of similarity of language style. This method is compared with the word-intersection-based and token-intersection-based methods on 12 sub-corpora covering the 120 chapters of The Dream of Red Mansions. It is revealed that the correlation is rather high both among the first eight sub-corpora and among the last four sub-corpora, while the correlation decreases significantly between the former and the latter chapters. It is inferred that the first 80 chapters of The Dream of Red Mansions were written by one author, and the last 40 chapters by another.
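The rank-correlation comparison described above can be sketched with Spearman's rho over the frequency ranks of the word types shared by two corpora. The tokenization and the top-N cutoff are illustrative assumptions; the paper's exact word list is not reproduced here.

```python
from collections import Counter

def rank_of_top_words(tokens, n=5):
    """Map each of the n most frequent word types to its frequency rank."""
    ordered = [w for w, _ in Counter(tokens).most_common()]
    return {w: i + 1 for i, w in enumerate(ordered[:n])}

def spearman(ranks_a, ranks_b):
    """Spearman rank correlation over word types shared by both rank maps."""
    shared = sorted(set(ranks_a) & set(ranks_b))
    k = len(shared)
    if k < 2:
        return 0.0
    d2 = sum((ranks_a[w] - ranks_b[w]) ** 2 for w in shared)
    return 1 - 6 * d2 / (k * (k ** 2 - 1))
```

A high rho between two sub-corpora suggests similar word-frequency profiles, i.e. a similar language style.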
  • Machine Reading Comprehension
    LIU Jiahua, WEI Wan, CHEN Hao, DU Yantao
    2018, 32(11): 103-111.
    Machine reading comprehension (MRC) has become a popular issue in natural language processing (NLP). The 2018 NLP Challenge on Machine Reading Comprehension provides a large-scale, application-oriented dataset for Chinese machine reading comprehension, which is much more challenging than previous Chinese MRC datasets. To cope with these challenges, we present a system with improvements in all aspects, including the preprocessing strategy, feature representation, model design, loss function and training criterion. Our system achieves a ROUGE-L score of 63.38 and a BLEU-4 score of 59.23 on the final test set, ranking first among 105 participating teams.
  • Machine Reading Comprehension
    YIN Yichun, ZHANG Ming
    2018, 32(11): 112-116.
    This paper describes the model proposed by the ZWYC team in the 2018 NLP Challenge on Machine Reading Comprehension. Treating machine reading comprehension as extracting a text span from the documents, this paper proposes a feature-rich neural interaction network. To effectively use the information in the golden answers, the model first reconstructs the data so that all golden answer information can be integrated. A feature-rich semantic representation is then built for each word. Moreover, a simple but effective network is designed to obtain a question-aware representation of each document by capturing the interaction between questions and documents. The proposed model predicts the answer text from global representations of multiple candidate documents, finishing as runner-up among 105 teams.
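A standard decoding step shared by span-extraction readers of this kind is to pick the span maximising the product of start and end probabilities, subject to start <= end and a span-length bound. A minimal sketch (the length bound is an illustrative default, not the paper's):

```python
def best_span(start_probs, end_probs, max_len=10):
    """Return (start, end) indices maximising start_probs[i] * end_probs[j]
    with i <= j and span length at most max_len tokens."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = ps * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```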
  • Machine Reading Comprehension
    YANG Zhiming, SHI Yingcheng, WANG Yong, PAN Haojie, MAO Jintao
    2018, 32(11): 117-127.
    Exploiting deep neural network models for machine reading comprehension, this paper presents the RBiDAF model. Firstly, features beneficial to the model are extracted through data exploration on the DuReader dataset and data preprocessing. Then, based on the BiDAF model, a machine reading comprehension model with multi-document reranking, named RBiDAF, is proposed. This model adds a paragraph-ranking layer to the four-layer standard BiDAF model, in which a ParaRanking algorithm with multi-feature fusion is designed. Additionally, in order to predict a comprehensive answer, a multi-answer cross validation algorithm based on prior knowledge is proposed. The RBiDAF model showed good results in the 2018 NLP Challenge on Machine Reading Comprehension.
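Multi-feature fusion for paragraph ranking can be sketched as a weighted linear combination of per-paragraph feature scores. The feature names and weights below are purely illustrative, not the ParaRanking algorithm's actual features or learned weights.

```python
def para_score(features, weights):
    """Linear fusion of per-paragraph features into one ranking score."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

def rank_paragraphs(paragraphs, weights):
    """paragraphs: list of (paragraph_id, feature_dict) pairs.
    Returns paragraph ids ordered best-first by fused score."""
    return [pid for pid, feats in
            sorted(paragraphs,
                   key=lambda p: para_score(p[1], weights),
                   reverse=True)]
```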
  • Machine Reading Comprehension
    ZHENG Yukun, LI Dan, FAN Zhen, LIU Yiqun, ZHANG Min, MA Shaoping
    2018, 32(11): 128-134.
    This paper describes the approach and experimental results of THUIR at the 2018 NLP Challenge on Machine Reading Comprehension. We design a multi-task deep neural model with a self-attention mechanism. Self-attention over the passages within a document allows information to flow across passages, and a recurrent neural network further shares information across documents. Besides the distributed representations of questions and passages learned during model training, we also extract features denoting exact matches between questions and passages as model inputs. When predicting the answer span, we introduce passage ranking into the model via reinforcement learning to promote model performance. The proposed method ranked 8th among 105 teams on the final test set.
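The exact-match features mentioned above can be sketched as a binary indicator per passage token: does the token appear verbatim in the question? Lower-cased surface-form matching is an illustrative simplification; the system's full feature set is richer.

```python
def exact_match_features(question_tokens, passage_tokens):
    """Binary per-token feature: 1 if the passage token appears
    (lower-cased) among the question tokens, else 0."""
    qset = {t.lower() for t in question_tokens}
    return [1 if t.lower() in qset else 0 for t in passage_tokens]
```

These indicators are typically concatenated with the token embeddings before the encoder.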
  • Machine Reading Comprehension
    LAI Yuting, TSENG Yiying, LIN Pocheng, HSIAO Vincent, SHAO Chihchieh
    2018, 32(11): 135-142.
    This paper proposes a reading comprehension model based on the Bi-Directional Attention Flow (BiDAF) network. It predicts answers using complete paragraphs, and the results outperform the baseline system. fastText is applied to train word embeddings that include contextual information, and ensemble learning is adopted to improve performance and stability. Specifically, for yes/no questions, this paper ensembles two classification models based on an attention mechanism and a similarity mechanism, respectively. The model reaches a ROUGE-L score of 56.57 and a BLEU-4 score of 48.03 in the 2018 MRC challenge.
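The ROUGE-L metric used throughout this challenge track scores a candidate answer by the longest common subsequence (LCS) it shares with the reference. A minimal sketch of the F-measure form (the beta weighting is the commonly used default, which may differ from the official evaluation script):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j],
                                                           dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)
    r = lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```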