Journal of Chinese Information Processing

Select

Machine Translation

Research on Mongolian-Chinese MT Based on Statistical and Neural Network

REN Zhong, HOU Hongxu, WU Jing, WANG Hongbin, LI Jinting, FAN Wenting, SHEN Zhipeng

2018, 32(11): 1-7.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper investigates the statistical Mongolian-Chinese machine translation model and the neural networkbased machine translation model, including CNN and RNN translation models. To address the low-resource and rich morphology issue, this paper proposes multiple methods to improve the three translation models. For the top-performed CNN model, we apply the character and phrase joint-training method. We further augmented the improvement to RNN model with a Giza++ guided alignment to attention. We designed a realignment method to the SMT model. Experiments indicate the above approaches improve the Mongolian-Chinese translation performance significantly.

Select

Ethnic Language and Cross Language Information Processing

Construction of Uyghur Dependency Treebank and Its Statistical Analysis

Mairehaba Aili, Tuergen Yibulayin, Jiamila Wushouer

2018, 32(11): 8-15.

Abstract ( ) PDF ( )

Knowledge map

Save

As an agglutinative language, the complex word structure in Uyghur affects its dependent role. This paper presents a few important factors in Uyghur that should be considered when building Uyghur dependency treebank. These factors include (1) the granularity of dependency, (2) dependent relations, (3) the annotation guidelines, and (4) the annotation tool. More than 3 400 Uyghur sentences are annotated manually based on the “Uyghur dependency tree bank annotation manual“. A statistical analysis have been made on that Uyghur dependency Treebank for three aspects.

Select

Ethnic Language and Cross Language Information Processing

A Semi-supervised Approach to Uyghur Named Entity Recognition Based on CRF

Wang Lulu, Aishan Wumaier, Maihemuti Maimaiti, Kahaerjiang Abiderexiti, Tuergen Yibulayin

2018, 32(11): 16-26,33.

Abstract ( ) PDF ( )

Knowledge map

Save

Researches on Uyghur named entity recognition is currently focused on a single entity without using unsupervised semantic and structural information in un-annotated data. A Uyghur named entity recognition method based on semi-supervised learning is proposed in the framework of conditional random fields(CRF). The lexical features, dictionary features and unsupervised learning features based on word embedding are introduced and analyzed. The experimental results illustrate that the F-score of Uyghur named entity recognition reach 87.43%.

Select

Ethnic Language and Cross Language Information Processing

Rules and Algorithms for Uyghur Affix Variant Collocation

Aizimaiti Ainiwaer, DONG Jun, LI Xiao

2018, 32(11): 27-33.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper introduces the feature of Uyghur stem structure, affix structure and the Uyghur phonetic harmony. Based on Uyghur phonetic harmony, a Uyghur affix variant collocation algorithm is proposed to cover the basic and special collocation rules. To verify the correctness and completeness of stem and affix structure feature extraction, 500 noun stems and 300 verb stems are generated with affix variants, resulting 9 000 nouns and 37 800 verbs, respectively. The examination results show that the overall accuracy reaches 96.86%, with 98.40% and 96.49% for nouns and verbs, respectively.

Select

Information Extraction and Text Mining

Neural Relation Extraction Via Tree-based CNN

LIU Wei, CHEN Hongchang, HUANG Ruiyang

2018, 32(11): 34-40.

Abstract ( ) PDF ( )

Knowledge map

Save

Tree-Based CNN, a kind of tree-structured neural network based on syntax tree, is implemented to extract entities relations in natural language which is designed to enhance tree coding in original CNN model. Experiment in this paper indicates Tree-based has an improvement of 3% and 5% compared to CNN and LSTM, respectively, in the task of relation extraction.

Select

Information Extraction and Text Mining

Automatic Extraction of Alias in Ancient Local Chronicles Based on Conditional Random Fields

LI Na

2018, 32(11): 41-48,61.

Abstract ( ) PDF ( )

Knowledge map

Save

Recently, the rapid development of digital library in China provides a foundation for the deep excavation and utilization of collection resources. Taking the digital ancient local chronicles as research corpus, this paper analyzes the internal and external features of aliases based on full-text manual annotation, and proposes an automatic extraction model of alias based on conditional random field. The accuracy of the model reaches 93.52%, indicating that the CRFs model is suitable for the content mining of ancient local chronicles.

Select

Information Extraction and Text Mining

Feature Extraction for Text Classification Based on Wavelet Analysis

ZHU Jin, HUAI Libo, CUI Rongyi, YIN Hui

2018, 32(11): 49-54.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper presents a method for text feature extraction based on wavelet analysis with the TF-IDF vector space as the input. KNN method is employed to examine text classification accuracy in two spaces. The experiment results show that the wavelet transformation method reduces almost half of vector space dimensions while maintaining the same classification accuracy of classical vector space model and the proposed inverse wavelet transformation exhibits excellent advantages of large dimension reduction for specific text categories, which testify the correctness and rationality of the compressive sensing.

Select

Information Extraction and Text Mining

Chinese Negation Recognition Based on BiLSTM-CRF

CHEN Shimei, WU Xing, TANG Fan

2018, 32(11): 55-61.

Abstract ( ) PDF ( )

Knowledge map

Save

Negation recognition is to distinguish positive information from negative information in natural language, which is of substantial significance in information retrieval, text mining and emotion analysis. This paper investigates the cue detection and scope recognition in Chinese negative information by combining the BiLSTM (bidirectional long short-term memory network) and CRF (conditional random field) as BiLSTM-CRF. The pre-trained word embedding is input as features to detect the cue. And then the known cue features are added to define the scope. For Chinese negation and speculation information corpus, the cue detection reaches 91.03% in F1 value, and the scope recognition 73.91% (in the sub-corpus financial news only). The experimental results show that this proposed method is superior to the CRF model and the BiLSTM model in Chinese negative cue detection and scope recognition.

Select

Information Extraction and Text Mining

Named Entity Identification Based on Fine-Grained Word Representation

LIN Guanghe, ZHANG Shaowu, LIN Hongfei

2018, 32(11): 62-71,78.

Abstract ( ) PDF ( )

Knowledge map

Save

Named entity recognition(NER), whose performance has a highly marked impact on the following piped nature language processing(NLP) system such as relation extraction and semantic role labeling, is a fundamental stage in NLP. Traditional statistical models have difficulty in feature designing, whose features have poor cross-domain adaptability, and some neural network models neglect morphological information of the word.Aiming at the above problems, our paper proposes a new end-to-end neural network model(Finger-BiLSTM-CRF) based on a fine-grained word representation for named entity recognition task. First, we design Finger, a character-level word representation model based on the attention mechanism, for the integration of morphological information with information from each character of current token. Secondly, we combine Finger with BiLSTM-CRF for named entity recognition task. Finally, the model trained in an end-to-end fashion achieves a F1 score of 91.09% on test dataset for CoNLL 2003. The experimental results show that our Finger model significantly boosts the recall of the NER system , which results in performance improvement of recognition ability of the system.

Select

Information Retrieval and Question Answering

Collaborative Joint Embedding Based on Personalized News Recommendation

LIANG Shiwei, ZHANG Chenrui, CAO Lei, CHENG Junjun, XU hongbo, CHENG Xueqi

2018, 32(11): 72-78.

Abstract ( ) PDF ( )

Knowledge map

Save

The news recommender system is a popular research issue, in which the cold-start problem and the rich semantic information in the content challenges classical models. This paper proposes a collaborative joint embedding model to learn user and document vector with semantic information simultaneously. Specifically, it combines the word&doc embedding model with matrix factorization based collaborative filter mode. Experiment on real-world dataset shows that the proposed model outperforms other baseline models.

Select

Sentiment Analysis and Social Computing

GAO Jinhua, SHEN Huawei, CHENG Xueqi, LIU Yue

2018, 32(11): 79-85.

Abstract ( ) PDF ( )

Knowledge map

Save

Popularity prediction for news in online social networks is of substantial application value. In contrast to the existing feature-based models and process-based models, this paper presents a prediction approach based on similar historical tweets. For each tweet to be predicted, K most similar historical tweets are selected for prediction. To measure the similarity between two tweets, LDA model is utilized to learn tweet representations from cascading data. Experimental results show the proposed our model can successfully identify tweets with similar diffusion pattern, achieving better prediction performance.

Select

Sentiment Analysis and Social Computing

ACMF: Rating Prediction Based on Attention Convolutional Model

SHANG Qi, ZENG Biqing, WANG Shengyu, ZHOU Caidong, ZENG Feng

2018, 32(11): 86-96.

Abstract ( ) PDF ( )

Knowledge map

Save

The sparseness of rating data is one of the main factors that affect the recommender models prediction. To exploit the advantage of convolutional neural networks in feature extraction and attention mechanism in feature selection, a probability matrix factorization model (PMF) with attention convolutional neural network(ACNN) is proposed as attention convolutional model based matrix factorization (ACMF). Firstly, the ACMF model compresses the high dimensional and sparse word vectors into low dimensional and dense feature vectors through word embedding technique. Then, it uses the local attention layer and convolutional layer to learn the feature of review document, and utilizes the user and item’s latent models to reconstruct the rating prediction matrix. Finally, the loss function is set as the root-mean-square error of rating matrix. Compared with the best prediction model PHD, the ACMF model increases the accuracy rate on ML-100k, ML-1m, ML-10m and Amazon datasets by 3.57%, 1.25%, 0.37% and 0.16%, respectively.

Select

NLP Application

Author Identification of The Dream of Red Mansions Basedon the Rank Correlation of the High Frequency Words

MA Chuangxin, CHEN Xiaohe

2018, 32(11): 97-102.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper puts forward an author identification method based on rank correlation of high frequency word types. Words in each corpus are arranged according to the frequency of occurrence and the rank is determined, then the correlation degree between the high frequency word types among the corpus is calculated, which is applied as the similarity of the language style between corpus. This method is compared the word intersection based method and token intersection based method on 12 sub-divisions of total 120 chapters from The dream of Red mansions. It is revealed that the correlation is rather high either between the former 8 corpus or between the latter 4 corpus, while the correlation significantly decreases between the former and the latter chapters. It is inferred that the former 80 chapters of The dream of Red mansions were written by one author, and the latter 40 chapters by another one.

Select

Machine Reading Comprehension

Machine Reading Comprehension for Multi-document and Multi-answer

LIU Jiahua, WEI Wan, CHEN Hao, DU Yantao

2018, 32(11): 103-111.

Abstract ( ) PDF ( )

Knowledge map

Save

Machine Reading Comprehension (MRC) has become a popular issue in Natural Language Processing (NLP). The 2018 NLP Challenge on Machine Reading Comprehension provides a large-scale application-oriented dataset for Chinese Machine Reading Comprehension, which is much more challenging than previous Chinese MRC dataset. To cope with those challenges, we present a system with improvements in all aspects, including preprocessing strategy, feature expression, model design, loss function and training criterion. Our system achieves 63.38 in ROUGE-L score and 59.23 in BLEU-4 score on the final test set, ranking first among 105 participating teams.

Select

Machine Reading Comprehension

A Neural Machine Reading Comprehension Model Based on Relabeling and Rich Features

YIN Yichun, ZHANG Ming

2018, 32(11): 112-116.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper describes the model proposed in “2018 NLP Challenge on Machine Reading Comprehension” by ZWYC team. Treated the machine reading comprehension as extracting the text span from the documents, this paper proposes a feature-rich neural interaction network. In order to effectively use the information of golden answers, our model first reconstructs the data in detail so that all golden answer information can be integrated. Then a feature-rich semantic representation is built for each word. Moreover, a simple but effective network is designed for question-aware representation for each document by captuing the interaction between questions and documents. The proposed model predicts answer text based on global representations of multiple candidate documents, leading to the runner-up position among 105 teams.

Select

Machine Reading Comprehension

Reading Comprehension Model Based on BiDAF and Multi-document Reordering

YANG Zhiming, SHI Yingcheng, WANG Yong, PAN Haojie, MAO Jintao

2018, 32(11): 117-127.

Abstract ( ) PDF ( )

Knowledge map

Save

Exploiting the deep neural network model in machine reading comprehension, this paper presents the RBiDAF model. Firstly, by the data exploration to the DuReader dataset and the preprocessing of the data, the features beneficial to model are extracted. Then, based on the BiDAF model, a machine reading comprehension model for multi-document reranking is proposed, named RBiDAF. This model adds a paragraph-ranking-layer to the four-layer standard BiDAF model, in which we design the ParaRanking algorithm with multi-feature fusion. Additionally, in order to predict a comprehensive answer, we propose the multi-answer cross validation algorithm based on prior knowledge. Finally, the RBiDAF model has shown good results in the 2018 NLP Challenge on Machine Reading Comprehension

Select

Machine Reading Comprehension

T-Reader: A Multi-task Deep Reading Comprehension Model with Self-attention Mechanism

ZHENG Yukun, LI Dan, FAN Zhen, LIU Yiqun, ZHANG Min, MA Shaoping

2018, 32(11): 128-134.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper describes the approach and the experimental results of THUIR at 2018 NLP Challenge on Machine Reading Comprehension. We design a multi-task deep neural model with self-attention mechanism. The self-attention mechanism on passages within a document allows information to flow across passages, and the recurrent neural network further shares information across documents. Besides the distributed representation of questions and passages learned during model training, we also extract features denoting exact matching between questions and passages as the inputs of the model. When predicting the span of the answer, we introduce passage ranking into the model to promote the model performance via reinforcement learning. This proposed method ranks 8/105 on the final test set.

Select

Machine Reading Comprehension

D-Reader: A Reading Comprehension Model by Full-text Prediction

LAI Yuting, TSENG Yiying, LIN Pocheng, HSIAO Vincent, SHAO Chihchieh

2018, 32(11): 135-142.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper proposed a reading comprehension model based on Bi-Directional Attention Flow (BiDAF) network. It predicts the answers using complete paragraphs and the results outperformed baseline system. The fastText is applied to train word embedding to include contextual information. The ensemble learning is adopted to improve performance and stability. Specifically, for the Yes/No questions, this paper ensembles two classification models based on attention and similarity mechanism, respectively. The model reaches a ROUGE-L score of 56.57 and a BLEU-4 score of 48.03 in the MRC 2018.

Please choose a citation manager

Content to export

2018 Volume 32 Issue 11 Published: 15 November 2018