Journal of Chinese Information Processing

Machine Translation

Interactive Machine Translation Based on Bilingual Phrase Constraints

XU Ping, YE Na, WU Chuang, ZHANG Guiping

2018, 32(9): 1-10.

Interactive machine translation (IMT) is a technology that guides the MT decoder and improves output quality through interaction with the translator. This paper improves the IMT method in two respects: interaction mode and decoding algorithm. In terms of interaction mode, translators are allowed to select the correct translations of source phrases from the phrase table before translation. A re-ranking algorithm is proposed to improve the diversity of the phrase translation options, and an interaction interface is designed according to the cognitive process of the translator to improve the user experience. In terms of decoding algorithm, the bilingual phrases are taken as constraints together with the prefix to guide decoding and to improve the accuracy with which translation hypotheses are evaluated and filtered. Experiments with real users are conducted on the Chinese-English LDC corpora. Results show that, compared with the traditional IMT method, our method reduces the user's cognitive burden and the translation time, thus improving translation efficiency.

Machine Translation

A Novel MT Metric Based on the Hybrid Strategy

MA Qingsong, ZHANG Jinchao, LIU Qun

2018, 32(9): 11-19.

With the development of machine translation (MT) evaluation, various MT metrics have been proposed. Different metrics evaluate the quality of MT hypotheses from different perspectives. This paper proposes a novel MT metric, Blend, that combines the merits of a range of metrics. Our investigation covers several aspects: (1) Comparing the performance of combined metrics that use Direct Assessment (DA) human evaluation or Relative Ranking (RR) human evaluation to guide the training process. Experiments show that reliable DA human evaluation benefits the combined metric, Blend. (2) Comparing the performance of Blend using SVM or FFNN as the training algorithm. (3) Tentatively examining the contribution of the metrics incorporated in Blend, in order to find a trade-off between performance and efficiency. (4) Applying Blend to other language pairs, provided the incorporated metrics support the specific language pair. Experiments on the WMT16 and WMT17 Metrics tasks show that Blend achieves state-of-the-art performance.

Machine Translation

A Large-scale Uyghur-Chinese Neural Machine Translation Model Based on Multiple Encoders and Decoders

ZHANG Jinchao, Aishan Wumaier, Maihemuti Maimaiti, LIU Qun

2018, 32(9): 20-27.

To enhance Uyghur-Chinese translation, this paper proposes a large-scale neural machine translation system based on multiple encoders and decoders. Compared with a shallow encoder-decoder model, the proposed model uses multiple encoders to represent the source sentence from multiple perspectives and multiple decoders to extend its ability to generate the target sentence. Experiments on a large training corpus show that the translation quality of the proposed model surpasses that of the phrase-based statistical machine translation model and the basic neural machine translation model. The paper also investigates the granularity of the translation unit and reveals that using Byte Pair Encoding (BPE) units for Uyghur and character units for Chinese is effective: it avoids the need for a Chinese word segmenter and achieves performance comparable to BPE-BPE systems.
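
BPE units of the kind mentioned in this abstract are commonly learned by repeatedly merging the most frequent adjacent symbol pair in a word list. The following is a minimal sketch of that merge-learning loop, not the authors' pipeline; the function name `learn_bpe`, the toy input, and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words.

    Hypothetical helper for illustration: each word starts as a sequence
    of single characters, and the most frequent adjacent pair is merged
    in every round.
    """
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties are broken by the lexicographically largest pair.
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe(["aab", "aab", "ab"], 2)` first merges `a` and `b`, then `a` and `ab`.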

Ethnic Language Processing and Cross Language Processing

Mongolian Speech Recognition Based on TDNN-FSMN

WANG Yonghe, BAO Feilong, GAO Guanglai

2018, 32(9): 28-34.

To improve Mongolian speech recognition, the Time Delay Neural Network (TDNN) and the Feed-forward Sequential Memory Network (FSMN) are combined to model long sequences of speech frames. In addition, we investigate how information from the preceding and subsequent frames in the FSMN memory block influences performance. We compare the performance of the TDNN-LSTM with different numbers of hidden layers and nodes. The results show that the fusion of TDNN and FSMN performs better than DNN, TDNN and FSMN, reducing the word error rate (WER) by 22.2% compared with the DNN baseline.
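
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below follows that standard definition; the function name `word_error_rate` is an assumption, and the abstract does not say which scoring tool the authors used.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

With the reference `"a b c d"` and the hypothesis `"a x c"`, one substitution and one deletion give a WER of 0.5.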

Ethnic Language Processing and Cross Language Processing

Study on Recognition of Uzbek Noun Stems Based on Multi-strategy

Azhar, Zulkar, Azragul, Yusup Abaydula

2018, 32(9): 35-40.

Uzbek noun stem recognition aims to extract noun stems from sentences, thereby improving the efficiency of noun recognition. We first discuss the part-of-speech classification of Uzbek words and the morphological analysis of nouns, summarizing the affix and ambiguity resolution rules of the Uzbek language. We then put forward an algorithm for recognizing new Uzbek nouns, including the internal features, the feature selection, the parameter estimation, and the word dependency features. Taking Uzbek Web text as the test object, the noun stems are finally identified with additional statistical analysis.

Ethnic Language Processing and Cross Language Processing

Classification of Tibetan Phrases for Natural Language Processing

CAI Zangtai, SUO Nancairang, CAI Rangjia

2018, 32(9): 41-46.

Phrases are an important linguistic phenomenon. At present, research on Tibetan phrases for information processing has only just begun. This paper investigates the boundary between Tibetan phrases and Tibetan sentences and proposes a classification of Tibetan phrases for Tibetan information processing. The types of Tibetan phrases and an annotation scheme are also provided.

Ethnic Language Processing and Cross Language Processing

Vector Based Spelling Check for Tibetan Characters

CAI Zhijie, SUN Maosong, CAI Rangzhuoma

2018, 32(9): 47-55.

Automatic spelling checking is a challenging task in natural language processing, with broad application in corpus construction, text editing, speech recognition and OCR. Tibetan script is an alphabetic writing system in which each character is formed from 1 to 7 letters arranged horizontally and vertically. Non-real Tibetan characters appear frequently, and they are the focus of Tibetan spelling checking. Through an analysis of the character-formation rules in Tibetan grammar, this paper proposes a Tibetan character vector model that represents Tibetan characters by numbers (vectors) under rule constraints. A Tibetan spelling checking model is then established. Experiments show an average accuracy of 99.995% for the proposed method, at a speed of 1,060 words per second.

Information Extraction and Text Mining

Emergency Information Extraction Based on Style and Terminology

QIU Qizhi, ZHOU Sansan, LIU Changfa, CHEN Hui

2018, 32(9): 56-65,74.

With the development of Big Data, one necessity for management information systems is to structure large volumes of unstructured and semi-structured data. This paper proposes a solution for extracting the attributes of emergencies from Web pages. Based on a study of Web page structure and news style, the paper expands the existing terminology with Google Word2Vec and proposes different methods for different attributes of emergencies: terminology for classification; style for date/time and abstract; style and terminology for location, casualty and loss. Experimental results show that the solution's average accuracy was 87.89% and 91.29%, and its average recall was 81.76% and 87.91%, on the Web news set and the published emergency corpus, which is high enough to meet the requirements of emergency management. The information extraction approach proposed in this paper has practical value for free-text information extraction in other application fields.

Information Extraction and Text Mining

An Improved Entity Relation Extraction Algorithm — OptMultiR

YAN Haoran, JIN Xiaolong, JIA Yantao, CHENG Xueqi

2018, 32(9): 66-74.

As a key task in knowledge graph construction, relation extraction, the process of extracting relations between entities from massive natural language texts, has drawn increasing attention. In recent years, by aligning entities and relations with those in an existing knowledge base, distant supervision has made it possible to train on raw text automatically, without time-consuming manual annotation. To improve MultiR, a popular algorithm of this kind, this paper proposes OptMultiR. First, in extraction scoring, we consider that the potential relations extracted for the same entity pair may be correlated. We therefore introduce a relational weight matrix, which transforms the known relations into a weight vector during extraction to reduce the interference of individual text features and improve extraction accuracy. Second, in the probabilistic graph calculation, we replace the original greedy algorithm with a dynamic programming algorithm based on state compression, moving the solution from a local optimum to the global optimum. Experiments demonstrate that the proposed method significantly improves relation extraction performance.

Information Extraction and Text Mining

An Evolutionary Summarization System Based on Local-global Topic Relationship

WU Renshou, LIU Kai, WANG Hongling

2018, 32(9): 75-83.

Evolutionary timeline summarization (ETS) for Internet news events is a new task in natural language processing and is in essence a kind of multi-document summarization (MDS). In view of the dynamic evolution, content relevance and information redundancy of Internet news events, this paper puts forward an evolutionary summarization method based on local and global topic relations. First, the news event is divided into a number of sub-topics. Then, the time evolution and the topic evolution between sub-topics are taken into account. Finally, headlines are extracted as the summary. Experimental results show that the method is effective. In particular, using news headlines as inputs or outputs brings significant improvements in ROUGE evaluation compared with currently popular multi-document summarization and evolutionary summarization methods.

Information Extraction and Text Mining

Conflating Papers across Different Data Sources

ZHANG Fanjin, GU Xiaotao, YAO Peiran, TANG Jie

2018, 32(9): 84-92,131.

This paper studies the conflation of papers across different data sources. We propose two algorithms for paper matching. The first (MHash) employs a hashing technique to accelerate the matching process. The second (MCNN) leverages a convolutional neural network (CNN) to improve matching accuracy. Experimental results show that, by combining different attributes of papers, MHash executes the matching process quickly while yielding good accuracy (93%+). MCNN achieves higher accuracy (98%+). We also design an asynchronous search framework for the large-scale paper matching problem. Finally, we obtain 64,639,608 matching pairs of AMiner papers and MAG papers within 15 days. The matching results and all AMiner and MAG publication data have been published.
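
The abstract does not describe MHash's hash function, so the following is only a generic sketch of why hashing speeds up matching: records from one source are placed in buckets keyed by a normalized attribute combination, and each record from the other source is compared only against its own bucket. The key (normalized title plus year), the record fields, and the function names are all assumptions for illustration.

```python
import re
from collections import defaultdict


def paper_key(paper):
    """Build a hashable key from attributes (assumed: title and year)."""
    # Lowercase the title and collapse punctuation and spacing differences.
    title = re.sub(r"[^a-z0-9]+", " ", paper["title"].lower()).strip()
    return (title, paper["year"])


def match_papers(source_a, source_b):
    """Return (id_a, id_b) pairs whose keys collide, in source_a order."""
    # Index one source by key so each lookup is a single hash probe
    # instead of a scan over every record.
    buckets = defaultdict(list)
    for paper in source_b:
        buckets[paper_key(paper)].append(paper["id"])
    pairs = []
    for paper in source_a:
        for other_id in buckets.get(paper_key(paper), []):
            pairs.append((paper["id"], other_id))
    return pairs
```

An exact-key scheme like this one misses near-duplicates with differing titles, which is one reason a learned matcher such as the CNN model described in the abstract can reach higher accuracy.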

Sentiment Analysis and Social Computing

Feasibility of Detecting Depressive Users Using Quasi-private Social Text

LIU Dexi, QIU Jiahong, WAN Changxuan, LIU Xiping, ZHONG Minjuan, GUO Haifeng, DENG Song

2018, 32(9): 93-102.

The development of social networks has provided an innovative perspective for detecting depressive users. Little work has been done on detecting depressive users from private data drawn from relatively private social networks such as WeChat friends circle or QQ Zone. This paper discusses the feasibility of detecting depressive users from quasi-private social network data, including training samples, feature extraction, detection model, etc. The experimental results show that, to train an effective model and overcome the challenge of unbalanced samples, we should first select roughly equal numbers of positive and negative samples with the highest and lowest self-report test scores, corresponding to the most depressive and the most normal users. Second, the features should be quantified by Z-score standardized frequency, which is more powerful than the other two quantification methods, raw frequency and normalized frequency. Third, the SGD classifier performs better than other classifiers such as SVM. The results also show that, compared with other features such as bag-of-words or word-to-vector, topical features perform better, with a detection precision of 0.813 and an F-measure of 0.753.
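
Z-score standardization rescales each feature to zero mean and unit standard deviation across users, z = (x - mean) / std. A minimal sketch under stated assumptions: the function name `z_score_columns` is hypothetical, the population standard deviation is used (the abstract does not say whether the authors used the population or sample form), and constant features are mapped to 0.

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Standardize each feature column of a users-by-features table."""
    columns = list(zip(*rows))
    # One (mean, standard deviation) pair per feature column.
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        # Assumption: a constant column carries no signal, so it maps to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in rows
    ]
```

For instance, `z_score_columns([[1, 10], [3, 10]])` yields `[[-1.0, 0.0], [1.0, 0.0]]`: the first feature is centered and scaled, and the constant second feature becomes zero.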

Sentiment Analysis and Social Computing

User Activeness Determination in Microblog

ZHONG Zhaoman, DAI Hongwei, GUAN Yan

2018, 32(9): 103-112.

To determine user activeness, existing methods have mainly focused on the amount of information users post, without properly utilizing users' social relationships and behavior on microblogs. This paper proposes a systematic method for determining user activeness on microblogs. In this method, four indexes are introduced to determine users' activeness: user profile, social relationship, information quality and social behavior. We also present the workflow for determining user activeness and a computation model for the diversity between a user and the whole user set. From Sina microblog, we select 900 users as the test set from the domains of academic research, business management, education, culture and military. Precision, recall and F-value are used as evaluation indexes for experimental analysis and comparison among methods. The results show that our method improves the precision, recall and F-value of user activeness determination by 21%, 13% and 16%, respectively. When the proposed method is applied to user recommendation, the precision, recall and F-value increase by 5%, 2% and 3%, respectively.
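
The three evaluation indexes can be computed from the set of users a method flags as active and the set labelled active in the test data. The sketch below assumes the F-value is the balanced F1 score, which the abstract does not state; the function name and the example user IDs are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Score a predicted set of active users against the labelled set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of flagged users that are truly active.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of truly active users that were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    # Assumption: F-value is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With four flagged users of whom two are among three labelled active users, precision is 1/2, recall is 2/3, and F1 is 4/7.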

Sentiment Analysis and Social Computing

An Ensemble Learning Framework for Sentiment Classification of Chinese Online Reviews

HUANG Jiafeng, XUE Yun, LU Xin, LIU Zhihuang, WU Wei, HUANG Yingren, LI Wanli, CHEN Xin

2018, 32(9): 113-122.

We propose an ensemble learning framework for sentiment classification of Chinese online reviews. First, in view of the complicated characteristics of Chinese online reviews, we combine the POS pattern, the frequent word sequence pattern and the OPSM pattern as input features. Second, to deal with the massive number of features in the reviews, we use a random subspace method based on the information gain algorithm, which can enhance the base classifiers simultaneously. Finally, we design base classifiers for each product aspect so as to combine the sentiment information of each aspect in a review. The experimental results show that our framework leads to a significant improvement in sentiment classification of Chinese online reviews, with an accuracy of 90.3% using Logistic Regression.
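
Information gain scores a feature by how much knowing its value reduces the entropy of the class labels. The sketch below shows only that standard score for a discrete feature; how the authors use it to build random subspaces is not described in the abstract, and the function names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy given the feature."""
    total = len(labels)
    # Group the labels by the feature value observed with them.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

A feature that separates positive from negative reviews perfectly scores 1 bit on a balanced two-class sample, while a feature that is independent of the labels scores 0.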

Sentiment Analysis and Social Computing

Word Attention-based Convolutional Neural Networks for Sentiment Analysis

WANG Shengyu, ZENG Biqing, SHANG Qi, HAN Xuli

2018, 32(9): 123-131.

The sentiment classification task needs to capture sentiment features from a document and then combine them to construct the document representation. In this paper, we propose Word Attention-based Convolutional Neural Networks (WACNN). Compared with CNN, our model takes the document information as input. Specifically, we place a word attention layer after the word embedding layer and before the CNN layer. The attention layer enables our model to focus on certain parts of the input document and learn a weight for each word. We also add a convolution filter of size 1 in the convolution layer to extract the features of single words. To ensure that context exists for each word, we pad the input of the convolution layer. This method effectively extracts the local n-gram features of each word, avoiding the information loss caused by convolution. Compared with traditional CNN and machine learning methods, accuracy is improved by 0.5% and 2% on the MR5K and CR datasets, respectively.

Sentiment Analysis and Social Computing

Research on Community Detection Algorithm Based on Meta Path in Heterogeneous Information Network

ZHENG Yuyan, WANG Mingsheng, SHI Chuan, WANG Rui

2018, 32(9): 132-142.

Real networked data often contain different types of objects and relations, which can be better modeled with a heterogeneous information network. Although community detection in homogeneous information networks has been intensively studied, little work has been done on heterogeneous information networks. In this paper, we study the community detection problem in heterogeneous information networks and propose a novel meta-path-based method called HCD (heterogeneous community detection). The method consists of two parts: the HCD_sgl algorithm, based on a single meta path, and the HCD_all algorithm, which combines multiple meta paths. HCD_sgl decides the initial community labels and then detects the final community structures through an improved label propagation algorithm. HCD_all combines the results of multiple meta paths. Experiments on real and artificial datasets demonstrate that the proposed method can detect community structures in heterogeneous information networks effectively.
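
As background for the label propagation step, here is a sketch of plain label propagation on an ordinary undirected graph, not the improved variant or the meta-path handling from the paper. Every node starts with its own label and repeatedly adopts the most common label among its neighbours. The deterministic node order, the tie-breaking rule, and the function name are assumptions made so that the example is reproducible.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=100):
    """Assign a community label to each node of an undirected graph.

    adjacency maps each node to the set of its neighbours; every
    neighbour must also appear as a key.
    """
    labels = {node: node for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        # Assumption: nodes are visited in sorted order, not random order.
        for node in sorted(adjacency):
            neighbours = adjacency[node]
            if not neighbours:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in neighbours)
            top = max(counts.values())
            best = {label for label, count in counts.items() if count == top}
            if labels[node] in best:
                continue  # current label is already a most common one
            # Assumption: ties are broken by taking the smallest label.
            labels[node] = min(best)
            changed = True
        if not changed:
            break
    return labels
```

On a graph made of two disconnected triangles, the nodes of each triangle end up sharing one label and the two triangles receive different labels.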

2018 Volume 32 Issue 9 Published: 17 September 2018