Journal of Chinese Information Processing

Language Analysis and Calculation

A Multi-sense Word Embedding Method Based on Gated Convolution and Hierarchical Attention Mechanism

LIU Yang, JI Lixin, HUANG Ruiyang, ZHU Yuhang, LI Xing

2018, 32(7): 1-10,19.

Existing methods that map a word to a single vector do not account for polysemy, which may cause ambiguity. Rather than mapping a word to multiple vectors, this paper proposes a method for computing multi-sense word embeddings that 1) fuses a hierarchical attention mechanism with a non-residual encapsulated gated convolution mechanism in the sub-sense layer and the synthetic sense layer of the words in the selected context window, and 2) obtains the synthetic sense embedding of the target word under an asymmetric window to predict the target word. On a small-scale corpus, the proposed multi-sense word embedding achieves an increase of at most 1.42% in accuracy on the word analogy task, and an average improvement of 2.11% (up to 5.47%) on word similarity tasks including WordSim353, MC, RG, and RW. In addition, the method significantly improves language modeling performance compared with other methods that predict target words.

Language Analysis and Calculation

CNN and BiLSTM Based Chinese Textual Entailment Recognition

TAN Yongmei, LIU Shuwen, LV Xueqiang

2018, 32(7): 11-19.

A CNN and BiLSTM based Chinese textual entailment recognition method is presented. By using CNN and BiLSTM, the method can automatically extract relevant features, and then generate the initial result by a fully connected layer. The final result is further processed by semantic rules. Evaluated on the dataset of RITE-VAL in 2014, the method obtains 61.74%, outperforming the top-ranked 61.51% in that evaluation campaign.

Language Resources Construction

Construction and Application of Chinese Joke Corpus

REN Lu, YANG Liang, XU Linhong, FAN Xiaochao, DIAO Yufeng, LIN Hongfei

2018, 32(7): 20-29.

Jokes, as a national intangible cultural heritage, have long been part of daily life. Close as they are to people, they remain one of the challenging research issues in artificial intelligence. The large-scale Chinese joke corpus constructed in this paper provides necessary data resources for subsequent research. This paper summarizes the relevant theoretical basis of jokes on which the joke corpus is built, and then gives a detailed introduction to corpus construction, corpus analysis and so on. It also distinguishes jokes from stories, news, proverbs, and Weibo expressions, revealing such joke features as conciseness, episode and emotions. An experiment is carried out to validate jokes against negative samples of equal length.

Machine Translation

Data Augmentation for Neural Machine Translation

CAI Zilong, YANG Mingming, XIONG Deyi

2018, 32(7): 30-36.

Neural machine translation performs well on large-scale language pairs, but is less satisfactory for low-resource language pairs. This paper employs data augmentation to expand the training data for low-resource language pairs, which can enhance the generalization ability of neural machine translation. In experiments on the Tibetan-Chinese and Chinese-English language pairs, translation quality improves significantly for both tasks, with an increase of 4.0 BLEU points at a training scale of 100,000 sentence pairs.

Machine Translation

Analysis of Data Parallel Training of Neural Language Models via Multiple GPUs

LI Yinqiao, HAN Ambyer, XIAO Tong, BO Le, ZHU Jingbo, ZHANG Li

2018, 32(7): 37-43.

Data parallelism aims to reduce the time needed to train neural language models without changing the network structure. However, the result is often unsatisfactory due to frequent data transmission between multiple devices. In this paper, we compare the effect of gradient update strategies based on the All-Reduce algorithm and on a sampling-based approach to data transmission. On four NVIDIA TITAN X (Pascal) GPUs, they achieve an acceleration rate of 25% and 41%, respectively. We also discuss the applicability of data parallelism and the influence of the hardware connection mode.
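
The All-Reduce step mentioned in this abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it runs in plain Python on one machine, the function names `all_reduce_mean` and `sgd_step` are hypothetical, and it assumes the gradients are averaged across devices (the abstract does not say whether they are summed or averaged). It only shows why every replica ends up with identical parameters after synchronization.

```python
from typing import List


def all_reduce_mean(grads_per_device: List[List[float]]) -> List[List[float]]:
    """Simulate All-Reduce: every device receives the element-wise mean gradient."""
    n_devices = len(grads_per_device)
    n_params = len(grads_per_device[0])
    mean = [
        sum(grads[i] for grads in grads_per_device) / n_devices
        for i in range(n_params)
    ]
    # Each device gets its own copy of the reduced result.
    return [list(mean) for _ in range(n_devices)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied locally on one device."""
    return [p - lr * g for p, g in zip(params, grad)]


# Four simulated devices, each with a gradient computed on its own mini-batch.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_mean(local_grads)

# All replicas start from the same parameters and apply the same averaged gradient.
replicas = [sgd_step([0.0, 0.0], g, lr=0.5) for g in synced]
```

In a real multi-GPU setup the reduction itself is the communication cost that the abstract identifies as the bottleneck; the simulation hides that cost entirely.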

Ethnic Language and Cross Language Information Processing

Design and Implementation of Mongolian Information Retrieval System

WEN Zixiao, BAO Feilong, GAO Guanglai, WANG Yonghe, SU Xiangdong

2018, 32(7): 44-51,57.

This paper presents a fully functional information retrieval system for both traditional Mongolian and Cyrillic Mongolian. In web crawling, the MD5 algorithm is applied to improve crawler performance. In preprocessing, Mongolian documents undergo code conversion, affix analysis and proofreading. The retrieval module is built upon the Vector Space Model. In addition, a module for converting Cyrillic Mongolian to traditional Mongolian is developed to meet the application requirements.
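
The abstract does not say how MD5 is used in the crawler. One common use, assumed here purely for illustration, is to fingerprint URLs or page contents so that duplicates are skipped. The sketch below uses Python's standard `hashlib`; the `SeenFilter` class and the example URLs are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fixed-size digests instead of full URLs or page bodies."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, text: str) -> bool:
        digest = fingerprint(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


seen = SeenFilter()
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate, should be skipped
]
to_fetch = [u for u in urls if seen.is_new(u)]
```

Storing digests keeps the memory per entry constant regardless of URL or document length, which is one plausible reading of the performance gain the abstract mentions.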

Ethnic Language and Cross Language Information Processing

Research on Vowel Patterns of Kazakh Language

Dawel Abilhayer, Nurmamet Yolwas, LIU Yan

2018, 32(7): 52-57.

In Kazakh, the acoustic properties of the nine vowels play an important role in speech recognition. This paper investigates the vowel patterns in Kazakh polysyllabic words using the methods of experimental phonetics. For 1062 polysyllabic words selected from the phonetic library, we investigate the vowel patterns in initial, medial and final syllables using the Joos method, and summarize the vowel formant patterns of Kazakh polysyllabic words. The study enriches speech research on and applications of the Kazakh language.

Ethnic Language and Cross Language Information Processing

Disambiguation of Polyphonic Words in Tibetan

LhakpaDondrub, Ngodrup, ZU Yiqing, PEI Chunbao

2018, 32(7): 58-66.

With the development of Tibetan speech synthesis, polyphonic words have become the main barrier, and the issue remains little addressed. Unlike in Mandarin, a dictionary alone is not enough to deal with this issue in Tibetan. Based on the grammatical rules and phonetic features of the Tibetan language and the General Tibetan-Chinese Dictionary, this paper collects 465 polyphonic Tibetan words and conducts a statistical analysis over 372320 sentences. The formation, occurrence, and pronunciation rules of these words are investigated, and a disambiguation method is designed. Experimental results validate the method as promising front-end text analysis support for Tibetan speech synthesis systems.

Ethnic Language and Cross Language Information Processing

Layout Analysis for Historical Tibetan Documents Based on Convolutional Denoising Autoencoder

ZHANG Xiqun, MA Longlong, DUAN Lijuan, LIU Zeyu, WU Jian

2018, 32(7): 67-73,81.

The digitization of historical documents has attracted increasing research interest in recent years. Focusing on layout analysis, an essential step in digitizing historical documents, this paper proposes a convolutional denoising autoencoder approach for historical Tibetan documents. First, the document images are clustered into superpixel blocks. Then, a convolutional autoencoder is used to extract features from these blocks. Finally, the superpixel blocks are classified by an SVM classifier, so that the different parts of the historical Tibetan document are identified. Experiments on a dataset of historical Tibetan documents show that the method can effectively separate the different layout elements.

Information Extraction and Text Mining

Document Summarization Based on Penetrability of Key Words

REN Liyuan, XIE Zhenping, LIU Yuan

2018, 32(7): 74-81.

Automatic document summarization aims to extract brief and important information from massive texts. To explore novel features for text summarization, a knowledge network is introduced to model document information. Specifically, the key words of documents are viewed as network nodes, and sentences are represented as paths of sequential key words on the knowledge network. A feature model for the penetrability of key words is then proposed, in which the width and depth of penetrability are defined to measure each sentence. A maximum entropy based document summarization model is implemented with the proposed feature, and experiments validate its effectiveness.

Information Extraction and Text Mining

Protein-Protein Interaction Extraction from Biomedical Literature

ZHAO Zhehuan, YANG Zhihao, SUN Cong, LIN Hongfei

2018, 32(7): 82-90.

Protein-protein interaction extraction can be widely applied in life science research. Most machine learning based methods focus on binary relationship extraction for high precision, while rule based strategies can extract complex relations (“protein1, relational word, protein2”) but with low recall. This paper proposes a hybrid protein-protein interaction extraction method. In this method, machine learning methods are first applied to recognize protein entities and extract relational protein pairs. Then, syntactic patterns and a dictionary are employed to find the relational words that represent the relationship between two proteins. This method obtains an F-score of 40.18% on the AImed corpus, outperforming either of the two methods alone.

Information Extraction and Text Mining

Attribute Selection Algorithm Based on K-Means Clustering and Low Rank Constraint

YANG Changqing

2018, 32(7): 91-98.

Unsupervised attribute selection algorithms do not consider classification information or the low rank of attributes. To address this issue, this paper proposes an attribute selection algorithm combining K-means clustering and a low-rank constraint. The algorithm embeds the self-expression method into the framework of the linear regression model. At the same time, K-means clustering is used to generate pseudo-class labels to maximize the class spacing and obtain a sparser structure. The algorithm uses the l_{2,p}-norm instead of the traditional l_{2,1}-norm, so the sparsity of the result can be adjusted flexibly through the parameter p. It is also proved that the algorithm has the characteristics of linear discriminant analysis and that it converges. The experimental results show that the accuracy of the proposed algorithm is 17.04%, 13.95%, 3.6% and 9.39% higher than that of the NFS, LDA, RFS and RSR algorithms, respectively, with the lowest classification accuracy variance.
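
For readers unfamiliar with the norm named in this abstract, the sketch below computes the commonly used definition of the l_{2,p}-norm of a matrix: take the Euclidean norm of each row, raise it to the power p, sum over rows, and take the 1/p root. This is an assumption about the definition; the paper may use a variant (for example the p-th power without the final root). The function name `l2p_norm` and the example matrix are hypothetical.

```python
import math
from typing import List


def l2p_norm(W: List[List[float]], p: float) -> float:
    """(sum_i ||W_i||_2^p)^(1/p), where W_i is the i-th row of W."""
    if p <= 0:
        raise ValueError("p must be positive")
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in W]
    return sum(r ** p for r in row_norms) ** (1.0 / p)


# Rows have Euclidean norms 5, 0 and 10.
W = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]

l21 = l2p_norm(W, 1.0)     # p = 1 recovers the l_{2,1}-norm: 5 + 0 + 10
l2half = l2p_norm(W, 0.5)  # p < 1 penalizes non-zero rows more aggressively
```

With p = 1 the value is the plain sum of row norms. Choosing p below 1 makes the penalty favor solutions with more all-zero rows, which is the sense in which p adjusts sparsity.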

Information Retrieval and Question Answering

DQN-based Policy Learning for Open Domain Multi-turn Dialogues

SONG Haoyu, ZHANG Weinan, LIU Ting

2018, 32(7): 99-108,136.

Sustaining effective multi-turn dialogues is a challenge for open domain dialogue systems. Current neural dialogue generation models tend to fall into conversation black holes by generating safe responses, without considering future information. Inspired by the global view of reinforcement learning methods, we present an approach to learning a multi-turn dialogue policy with DQN (deep Q-network). We introduce a deep neural network to evaluate each candidate sentence and choose the sentence with the maximum future reward, instead of the highest generation probability, as the response. The results show that our method increases the average number of dialogue turns by 2 in the automatic evaluation and outperforms the baseline model by 45% in the human evaluation.
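
The selection rule described in this abstract, choosing the candidate with the highest estimated future reward rather than the highest generation probability, can be sketched as follows. This is an illustration only: `toy_q` is a hypothetical stand-in for the trained deep Q-network, and the candidate sentences and probabilities are made up.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (response text, generation probability)


def pick_by_probability(candidates: List[Candidate]) -> str:
    """Baseline: return the response the generator considers most likely."""
    return max(candidates, key=lambda c: c[1])[0]


def pick_by_q_value(
    context: str,
    candidates: List[Candidate],
    q_fn: Callable[[str, str], float],
) -> str:
    """Policy: return the response with the highest estimated future reward."""
    return max(candidates, key=lambda c: q_fn(context, c[0]))[0]


def toy_q(context: str, response: str) -> float:
    # Hypothetical scorer: penalize a known dead-end reply, otherwise
    # reward longer responses. A real system would use the learned network.
    if response == "I don't know.":
        return -1.0
    return float(len(response.split()))


candidates = [
    ("I don't know.", 0.6),
    ("Which city are you travelling to?", 0.3),
    ("Sounds fun, tell me more.", 0.1),
]
context = "I am planning a trip next week."

baseline_reply = pick_by_probability(candidates)
policy_reply = pick_by_q_value(context, candidates, toy_q)
```

Here the baseline returns the safe reply with the highest probability, while the Q-value rule prefers a reply that is more likely to keep the conversation going.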

Sentiment Analysis and Social Computing

Domain Specific High-quality Microblogging User Detection

YE Yongjun, LI Peng, ZHOU Meilin, WAN Yifang, WANG Bin

2018, 32(7): 109-115.

In today's microblogging systems such as Twitter or Weibo, finding high quality users to follow is essential for acquiring information. This paper focuses on the task of high quality user identification, i.e., given a domain query, returning a list of users ranked by quality. We divide the task into two sub-problems: user search and user ranking. For user search, we represent users according to their tags and propose a similarity-based retrieval approach using the Chinese Wikipedia, which is essentially an extension of the current ESA (explicit semantic analysis) method. For user ranking, we propose a graph-based ranking method called UBRank, which considers both the quantity and the quality of the published posts to measure user importance. Experiments indicate that using the Chinese Wikipedia is better than using other resources such as HowNet, and validate the efficiency and superiority of the ranking method.

Sentiment Analysis and Social Computing

Discovering Users' Expertise and Interests in Online Technology Communities

ZHANG Donglei, LIN Youfang, WAN Huaiyu, MA Yudan, LU Jinliang

2018, 32(7): 116-127.

Online communities are important platforms for technology enthusiasts and practitioners to exchange and share information. To capture the expertise and interests of each user simultaneously, this paper proposes an Author-Reader-Topic (ART) model based on the fact that a user in a community is both a producer (author) and a consumer (reader) of content. By linking the authors and the readers of documents, the model improves topic clustering and achieves more accurate author topic and reader topic distributions. We conducted an experimental comparison and analysis on a real data set collected from the CSDN community. The experiments show that the proposed model can effectively discover users' expertise and interests, outperforming existing methods significantly.

NLP Application

Identifying Novel Characters' Personality—An Example Study on Ordinary World

WU Yufeng, WU Shengtao, ZHU Tingshao, LIU Hongfei, JIAO Dongdong

2018, 32(7): 128-136.

Previous psychological investigations of novels have mainly analyzed characters qualitatively, relying on researchers' subjective personal experience, which is less stable and systematic than personality analysis of the characters. Using a machine learning model based on a Chinese psychology analysis system and the characters' dialogues in Ordinary World, we obtained big-five personality scores for four of the novel's characters. Then, by comparing the predicted scores with documents on the characters' psychological analysis and with descriptions in the novel, the validity of this method was established. The results revealed that the younger characters (Shaoping Sun and Xiaoxia Tian) had relatively high Openness, while the older ones (Shaoan Sun and Runye Tian) had relatively high Extraversion. In addition, Shaoping Sun and Runye Tian had relatively high Conscientiousness, Shaoan Sun and Xiaoxia Tian had relatively high Agreeableness, and Shaoan Sun and Shaoping Sun had relatively high Neuroticism. The results demonstrate the applicability of personality analysis to novel characters through an objective, systematic and intelligent approach.

NLP Application

Construction and Analysis of Rubbing-oriented Oracle Character Network

JIAO Qingju, GAO Feng, JIN Yuanyuan, XIONG Jing, LIU Yongge

2018, 32(7): 137-142.

Predicting the semantics of unknown oracle bone inscriptions is an important topic, and also a bottleneck for historians and computer scientists in oracle bone inscription research. Based on the massive accumulated oracle data, we first define the distance between two oracle characters by modeling large-scale rubbing information, then construct a network of oracle characters, and finally analyze the degree distribution, local link rate, clustering coefficient and modularity of the network. Experiments show that the constructed network has a strong module structure, reflects the feature of more monosyllabic words and fewer polysyllabic words, and captures the semantic units of rubbings.

2018 Volume 32 Issue 7 Published: 16 July 2018