Journal of Chinese Information Processing

Language Analysis and Calculation

Analysis of Parts-of-speech Correspondence Between DCC and GKB

QIU Likun, ZHAO Hui, YU Shiwen, ZHU Xuefeng

2017, 31(5): 1-7,20.

Part-of-speech annotation has attracted extensive attention in Chinese information processing, Chinese grammar research, and Chinese lexicography. Multiple part-of-speech systems have been proposed, and they differ significantly from one another. So far, little research has systematically compared large-scale part-of-speech annotations. Based on the part-of-speech annotations in the Dictionary of Contemporary Chinese and the Grammatical Knowledge-Base Dictionary, this paper proposes a mapping algorithm that automatically detects part-of-speech differences between the two dictionaries. We then analyze the differences and draw two conclusions: 1) about 83.5% of the part-of-speech annotations are identical; and 2) all the differences can be attributed to three causes: part-of-speech shifting, different part-of-speech annotation standards, and different senses.

Language Analysis and Calculation

Lexical Frequency Rank Difference Distributions Between Texts

LIU Rui, SUN Bize, LONG Yunfei, WANG Shan

2017, 31(5): 8-13.

Building on previous studies of word frequency and frequency rank, this paper analyzes the frequency rank difference (FRD) from the perspective of quantitative lexical analysis. It shows that, for the words common to two texts, the FRDs are distributed symmetrically and cluster around the median. The resulting curve is a “two-tailed distribution” that is flat in the middle and curved at both ends. Three lexical levels (middle, downward end, and upward end) are identified from the FRD distributions. The middle lexicon reflects the characteristics the two texts share, while the lexicon at either end reflects the distinctive features of each text. These features are linguistically significant because they reflect the thematic content and stylistic features of the texts.
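
A minimal sketch of how a frequency rank difference could be computed for the words shared by two texts. The function names, the whitespace tokenization, and the alphabetical tie-breaking rule are assumptions made for illustration; the abstract does not describe the authors' exact procedure.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Ties are broken alphabetically. This is an assumption, since the
    abstract does not say how ties are handled.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def rank_differences(tokens_a, tokens_b):
    """Return {word: rank in text A - rank in text B} for common words."""
    ranks_a = frequency_ranks(tokens_a)
    ranks_b = frequency_ranks(tokens_b)
    common = ranks_a.keys() & ranks_b.keys()
    return {word: ranks_a[word] - ranks_b[word] for word in common}


# Toy texts (hypothetical data, not from the paper).
text_a = "the cat sat on the mat the cat".split()
text_b = "the dog sat on the log and the dog sat".split()
frd = rank_differences(text_a, text_b)
```

In this toy example a word with an FRD near zero sits in the "middle" of the distribution, while large positive or negative values fall in the two tails.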

Language Analysis and Calculation

Towards a Rule-based Approach to Semantic Recognition Model of Shenme

NIU Changwei, CHENG Bangxiong

2017, 31(5): 14-20.

Wh-phrases in Mandarin Chinese have at least three interpretations: interrogative reference, existential reference, and universal reference. Taking shenme as an example, this paper proposes a rule-based approach to recognizing its interpretation in different syntactic contexts. After testing its preferred reference in complex syntactic contexts, we build a semantic recognition model of shenme and revise it through experiments.

Language Analysis and Calculation

Parsing of Double-Object Phrases Based on Concept Knowledge Tree

LIN Ziqi, NI Wancheng, ZHAO Meijing, YANG Yiping

2017, 31(5): 21-31,49.

This paper analyzes the double-object phrase, a special linguistic phenomenon, from syntactic and semantic perspectives, and presents a semantic expression model for double-object phrases based on the Conceptual Knowledge Tree (CKT). It also proposes a method for analyzing double-object phrases that automatically translates them into the semantic expression model. First, in a top-down manner, the method splits a double-object phrase into three syntactic parts: the double-object verb, the direct object, and the indirect object. Then, in a bottom-up manner, it uses the CKT to perform inference on these three parts and obtain their semantic expressions. An experiment on a dataset of 122 double-object verbs and 209 phrases, selected from authoritative literature and grammar dictionaries, yields an accuracy of 90.43%.

Language Analysis and Calculation

The Extraction of Chinese Sentence Pattern Instance Based on Diagrammatic Treebank

ZHU Shuqin, PENG Weiming, SONG Jihua, GUO Dongdong

2017, 31(5): 32-39.

For the purpose of international Chinese teaching, this paper introduces the sentence-focused Diagrammatic Treebank to preserve the integrity of sentence pattern structures in grammar teaching. Based on a thorough analysis of the Treebank structures, Chinese sentence pattern instances are summarized from each parse via a hierarchical extraction strategy. The result is a bank of Chinese sentence pattern instances consisting of basic sentence patterns and complex sentence patterns. This approach paves the way for developing Chinese sentence pattern instances from a small-scale Treebank, and enables the practical application of the Diagrammatic Treebank in international Chinese teaching.

Language Analysis and Calculation

A Study on Chinese Discourse Coherence Based on CFN

LV Guoying, SU Na, LI Ru, WANG Zhiqiang

2017, 31(5): 40-49.

Discourse coherence is an important issue in discourse analysis. Based on Chinese FrameNet (CFN), this paper presents a coherence description scheme for Chinese discourse. It establishes the relationship between frames and discourse units, and discusses how discourse coherence is achieved through frames and the semantic relationships between them. This provides a description mechanism and a computational basis for discourse coherence. Annotation of 160 articles selected from the People's Daily shows a kappa value above 0.8 for both discourse structure annotation and discourse relation annotation. This indicates that the proposed scheme supports highly consistent manual annotation, which is crucial to larger-scale discourse annotation.
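
For readers unfamiliar with the agreement measure, the sketch below computes Cohen's kappa for two annotators. It is an illustration only: the abstract does not say which kappa variant was used, and the relation labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical discourse relation labels from two annotators.
annotator_1 = ["cause", "cause", "contrast", "contrast"]
annotator_2 = ["cause", "cause", "contrast", "cause"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.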

Machine Translation

Domain Adaptation of Reordering Model via Topic Information: Word Order in Translated Text across Domains

LIU Mengyi, YAO Liang, HONG Yu, LIU Hao, YAO Jianmin

2017, 31(5): 50-58.

Research on domain adaptation (DA) for statistical machine translation (SMT) aims to adjust the translation model dynamically so that translation quality stays balanced and reliable across domains. Existing work on adapting the translation model has made remarkable progress but neglects the reordering issue. This paper investigates the translation samples in a large-scale source bilingual corpus and finds that 36.17% of the samples exhibit clear word order differences in phrase-level translation pairs. We therefore propose a domain-adaptive reordering model that fuses topic information to capture the reordering differences of phrases under different topic distributions. Experimental results show that translation systems with the adaptive reordering model yield clear performance improvements.

Other Language in/around China

Unit Selection Algorism for Corpus-based Tibetan Speech Synthesis

CAI Rangzhuoma, CAI Zhijie

2017, 31(5): 59-63.

In a corpus-based text-to-speech system, the choice of synthesis units directly affects the quality of the synthesized speech. By analyzing the features of the Tibetan language, this paper proposes a hybrid strategy that mixes components, characters, words, and sentences, together with a corpus-based unit selection algorithm for Tibetan speech synthesis. Subjective and objective evaluation results indicate that the algorithm is effective, and that the unit coverage and the synthesized speech are satisfactory and reach the expected targets.

Other Language in/around China

Online Handwritten Sample Generated Based on Component Combination for Tibetan-Sanskrit

WANG Weilan, LU Xiaobao, CAI Zhengqi, SHEN Wentao, FU Ji, CAIKE Zhaxi

2017, 31(5): 64-73.

The Tibetan-Sanskrit character set includes more than 500 Tibetan characters and more than 6,000 Sanskrit characters. Because it is a large character set, collecting online handwriting samples for it is a large and complex project. We present a method for generating online handwritten character samples for Tibetan-Sanskrit based on component combination. The method has four main parts: (1) determining the Tibetan-Sanskrit character set and component set; (2) obtaining location information for Tibetan-Sanskrit characters; (3) collecting online handwritten samples of the component set; and (4) generating a sample database of the online handwritten Tibetan-Sanskrit character set. This provides training and test sample sets for online handwritten Tibetan-Sanskrit characters.

Other Language in/around China

Grapheme Segmentation Based Mongolian Handwriting Recognition

FAN Daoerji, GAO Guanglai, WU Huijuan

2017, 31(5): 74-80.

Hidden Markov Models (HMMs) have strong modeling capabilities for sequence data and are widely used in speech recognition and handwriting recognition tasks. HMM-based Mongolian handwriting recognizers require the data to be analyzed sequentially. Mongolian word formation and writing style show that a Mongolian word consists of graphemes connected seamlessly from top to bottom. Selecting the graphemes and segmenting words into graphemes are preliminary steps for handwriting recognition that substantially affect recognition accuracy. In this paper, drawing on knowledge of syllables and coding, we collect a Mongolian letter set of 1,171 letters. A long grapheme set containing 378 graphemes is then extracted from the letter set by a correlation process and an HMM-based sorting method. A short grapheme set containing 50 shapes is extracted from the long grapheme set by decomposing long graphemes by hand. We present an algorithm that decomposes a word into graphemes through a two-layer mapping. Experimental results show that the short graphemes perform better than the long graphemes.

Other Language in/around China

On Zipf's Law in Korean Language

CUI Rongyi, ZHAO Xue

2017, 31(5): 81-84,91.

This paper aims to verify Zipf's law in the Korean language. First, the statistical distribution of two linguistic units, words and alphabet letters, is investigated on a massive Korean text corpus. Then the least squares method is used to fit the rank-frequency distribution curve of words in Korean text. Finally, the estimated values of the Zipf's law parameter are calculated. The experimental results show that the relationship between frequency and rank for both linguistic units follows Zipf's law in Korean.
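
The least squares step can be sketched as an ordinary linear regression in log-log space, assuming the common form f(r) = C / r^alpha. The function name and the synthetic frequencies are illustrative assumptions; the abstract does not give the authors' exact parameterization.

```python
import math


def fit_zipf(frequencies):
    """Fit f(r) = C / r**alpha by least squares on log f versus log r.

    Returns (alpha, C). Frequencies must be positive; they are sorted in
    descending order so that rank 1 is the most frequent unit.
    """
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2 or freqs[-1] <= 0:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals -alpha
    intercept = mean_y - slope * mean_x  # equals log C
    return -slope, math.exp(intercept)


# Synthetic frequencies that follow Zipf's law exactly (not Korean data).
synthetic = [1000.0 / rank for rank in range(1, 51)]
alpha, c = fit_zipf(synthetic)
```

On real corpus counts the fitted alpha would only approximate the ideal value of 1, and the quality of the fit would need to be checked separately.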

Other Language in/around China

Design and Implementation of Mongolian Fixed Phrase Recognition Algorithm

S Loglo

2017, 31(5): 85-91.

Automatic identification and annotation of fixed phrases are essential to Mongolian text processing. On the basis of the “Mongolian Fixed Phrase Grammatical Information Dictionary”, this paper designs and implements an algorithm for recognizing and labeling Mongolian fixed phrases using finite state automata and rules. Experiments reveal a recognition rate of more than 90% and an average processing speed of 0.005 milliseconds per word.
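
As a rough illustration of dictionary-driven phrase recognition with a finite state automaton, the sketch below builds a trie over word sequences and scans a token list for the longest matching phrase at each position. The phrase list, placeholder tokens, and longest-match policy are assumptions; the paper's rules and dictionary format are not described in the abstract.

```python
def build_trie(phrases):
    """Build a trie (a deterministic automaton) from word-sequence phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase:
            node = node.setdefault(word, {})
        node[None] = True  # the None key marks an accepting state
    return root


def find_phrases(tokens, trie):
    """Return (start, end) spans of the longest phrases, scanning left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        node = trie
        best_end = None
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                best_end = j  # remember the longest accepted prefix so far
        if best_end is None:
            i += 1
        else:
            spans.append((i, best_end))
            i = best_end
    return spans


# Placeholder tokens standing in for Mongolian words (hypothetical data).
phrase_dictionary = [("a", "b"), ("a", "b", "c"), ("d",)]
trie = build_trie(phrase_dictionary)
spans = find_phrases(["x", "a", "b", "c", "d", "a"], trie)
```

Each token is consumed at most a few times, which is consistent with the per-word processing cost the abstract reports being very small, although no timing claim is made for this sketch.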

Other Language in/around China

Anaphoricity Determination of Uyghur Noun Phrases

TAO Doudou, YU Long, TIAN Shengwei, ZHAO Jianguo, Turgun·Ibrahim, Askar·Hamdulla

2017, 31(5): 92-98,113.

Focusing on the task of Uyghur noun phrase coreference identification, this paper proposes a Stacked Nonnegative Constrained Autoencoder (SNCAE) for anaphoricity determination based on semantic features. By analyzing the linguistic phenomena of Uyghur noun phrases, 15 kinds of semantic features are extracted and then fed into the SNCAE to extract deep semantic features. Finally, a Softmax classifier completes the recognition task. Compared with the Support Vector Machine (SVM), the positive accuracy and negative accuracy increase by 8.259% and 4.158%, respectively; compared with the Stacked Autoencoder (SAE), they increase by 1.884% and 1.590%, respectively.

Other Language in/around China

A Weighted Semantic String-Based Approach to Uyghur Text Clustering

Turdi Tohti, Winira Musajan, Askar Hamdulla

2017, 31(5): 99-107.

This paper proposes an improved frequent pattern-growth approach to discover and extract the semantic strings that express key information in a text. It then assigns weights to them via a multi-feature fusion method and selects the most important semantic strings as features to represent the text. Experimental results with K-means clustering show that the text model built from semantic string features is more compact than the one built from word features: it greatly reduces the dimensionality of the feature space and also improves the performance of the clustering algorithm.

Other Language in/around China

The Study of Modern Uyghur Stems in Maths Textbook of Junior Middle School

Azragul, Azharjan, Yusup Abaydula, Zulkarjan, Mirxat

2017, 31(5): 108-113.

This study examines Uyghur stems in junior high school Uyghur mathematics textbooks. It studies the basic stems, the new stems, and the high-frequency stems in the textbooks. The results provide reference material for Uyghur language study, Uyghur mathematics teaching, and codification.

Information Extraction and Text Mining

Semi-automatic Construction of Chinese Relation Extraction Data Set Based on a Weakly Supervised Method

MA Chaoyi, XU Weiran

2017, 31(5): 114-119.

Relation extraction is a fundamental task in information extraction, with practical significance for information retrieval, question answering systems, knowledge mapping, and other applications. Existing relation extraction data sets are for English, contain very limited categories, and lack sentence-level annotations. This paper constructs a Chinese relation extraction data set using a weakly supervised, semi-automatic method. It first extracts a large number of relation pairs from Wikipedia, then extracts sentences that contain the entity pairs from the Sogou News and Baidu corpora, which completes the weakly supervised sentence extraction. These sentences are then scored by an RNN-based relation extraction system, and the higher-scoring sentences are selected for manual annotation. The Chinese relation extraction data set is complete once the manual annotation is finished.

Information Extraction and Text Mining

News Topic Sentence Extraction via Weighted Features

WAN Guo, ZHANG Guiping, BAI Yu, ZHU Yaohui

2017, 31(5): 120-126.

This paper proposes a topic sentence extraction method for news text. First, a location feature is derived from the distribution of news topic sentences in the text. Then, because the news title is closely related to the theme, the overlap ratio between each sentence and the title is calculated. To best estimate the relevance between the title and a candidate topic sentence, maximum matching on a weighted bipartite graph is applied. Finally, the topic sentence is selected according to the sentence rank score. Experimental results show that the proposed method reaches 75.9% in P@1 and 92.4% in P@3.

Information Extraction and Text Mining

A General Theme Information Extraction for Webpages

ZHANG Ruqing, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi

2017, 31(5): 127-137.

Most existing information extraction methods focus on a specific type of webpage and are not applicable to all webpages. In this paper, we propose a general framework based on a fusion mechanism that extracts the theme information of all webpages. The framework combines the automatic information extraction strategy and the template detection strategy through four steps: template matching, template-based extraction, webpage classification, and automatic extraction. Experiments show that the proposed strategy further improves extraction precision.

Information Extraction and Text Mining

Text Classification Method Based on TF-IDF and Cosine Similarity

WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan

2017, 31(5): 138-145.

Text classification is a fundamental task in text mining. Many text classification algorithms have been presented in the literature, such as KNN, Naive Bayes, Support Vector Machine, and some improved variants. The performance of these algorithms depends on the data set, and they lack a self-learning function. This paper proposes an effective approach to text classification with three key points: 1) extracting the keywords of category (KWC) from labeled texts using the TF-IDF approach; 2) classifying unlabeled text by the relevance between the category and the unlabeled text; and 3) improving performance by updating the KWC during classification. Simulation results show that the new approach raises the accuracy of text classification to 90%, and up to 95% when the data volume is large enough. The method automatically updates the keywords of category to improve the accuracy of the classifier.
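
The first two key points can be sketched as follows: build one TF-IDF weighted keyword vector per category, then assign an unlabeled text to the category with the highest cosine similarity. The smoothed IDF formula, the function names, and the toy data are assumptions for illustration, and the KWC update step (key point 3) is not shown.

```python
import math
from collections import Counter


def category_keyword_vectors(labeled_docs):
    """Build one TF-IDF vector per category.

    labeled_docs maps a category name to the tokens of its labeled texts.
    The smoothed IDF used here is one common variant, chosen as an assumption.
    """
    n = len(labeled_docs)
    doc_freq = Counter()
    for tokens in labeled_docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for category, tokens in labeled_docs.items():
        term_counts = Counter(tokens)
        vector = {}
        for term, count in term_counts.items():
            idf = math.log((1 + n) / (1 + doc_freq[term])) + 1
            vector[term] = (count / len(tokens)) * idf
        vectors[category] = vector
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(tokens, vectors):
    """Return the category whose keyword vector is closest to the text."""
    query = dict(Counter(tokens))
    return max(vectors, key=lambda category: cosine(query, vectors[category]))


# Toy labeled data (hypothetical, not from the paper).
labeled = {
    "sports": "team goal match team score".split(),
    "finance": "stock market bank stock price".split(),
}
vectors = category_keyword_vectors(labeled)
label = classify("stock price rises".split(), vectors)
```

A self-updating variant would fold the tokens of each newly classified text back into its category and rebuild the vectors, which is one plausible reading of the update step described above.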

Information Retrieval and Question Answering

A Search Engine Click Model Based on Deep Neural Network

XIE Xiaohui, WANG Chao, LIU Yiqun, ZHANG Min, MA Shaoping

2017, 31(5): 146-155.

With rich media introduced into the search interface, search engine result pages have become heterogeneous and two-dimensional in layout. To address this new challenge to traditional click models, we analyze the result pages of a popular commercial search engine and build a click model based on a deep neural network, aiming to reveal correlations between multimedia information and text information. The framework combines the characteristics of neural networks with the predictive ability of click models. Experiments demonstrate that our framework clearly improves on the original click model. However, because multimedia content is complex, even a deep neural network produces quite weak semantic correlations if it relies only on basic characteristics of multimedia results.

Information Retrieval and Question Answering

A Context-aware Deep Sentence Matching Model

FAN Yixing, GUO Jiafeng, LAN Yanyan, XU Jun, CHENG Xueqi

2017, 31(5): 156-162.

Traditional research on information retrieval focuses on document-level retrieval and neglects sentence-level retrieval, which is of great importance in applications such as search on mobile phones. Assuming that context sentences provide richer evidence for matching, this paper proposes a context-aware deep sentence matching model (CDSMM). Specifically, the model employs a bi-directional LSTM to capture the interior and exterior information of the sentence. Then, a matching matrix is constructed from the sentence representation and the query representation. Finally, a feed-forward neural network produces the matching score. Experimental results on the WebAP dataset show that our model significantly outperforms state-of-the-art models.

Information Retrieval and Question Answering

Rapid Start Top-k Query Based on Threshold

JIANG Yu, SONG Xingshen, YANG Yuexiang, JIANG Kun

2017, 31(5): 163-170.

Top-k query is a popular technique in search engines that returns the most relevant results to the user from massive data. Although Top-k query significantly improves system performance, its slow-start issue has not been effectively resolved. This paper extracts static Top-k information from the inverted index and then calculates an initial threshold in real time for each specific query. On this basis, it presents a rapid-start Top-k query algorithm for the current state-of-the-art methods MaxScore and WAND. Experimental results show that the proposed approach achieves better performance.
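
The slow-start issue can be illustrated with a simplified threshold-pruned top-k loop. This is not MaxScore or WAND; it is a minimal sketch in which each candidate carries a precomputed score upper bound, and the initial threshold is simply passed in as a parameter (how the paper derives it from the index is not shown). The initial threshold is assumed never to exceed the true k-th best score; otherwise results would be lost.

```python
import heapq


def top_k(candidates, k, initial_threshold=0.0):
    """Return (results, scored), where scored counts full evaluations.

    candidates yields (doc_id, upper_bound, score). Reading score stands
    in for an expensive full evaluation that pruning tries to avoid.
    """
    heap = []  # min-heap of (score, doc_id) holding the current best k
    threshold = initial_threshold
    scored = 0
    for doc_id, upper_bound, score in candidates:
        if upper_bound < threshold:
            continue  # cannot reach the top k, skip the full evaluation
        scored += 1
        if score < threshold:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
        if len(heap) == k:
            threshold = max(threshold, heap[0][0])
    return sorted(heap, reverse=True), scored


# Hypothetical candidates: (doc_id, score upper bound, exact score).
docs = [
    ("d1", 2.0, 1.5), ("d2", 3.0, 2.5), ("d3", 9.0, 8.0),
    ("d4", 4.0, 3.0), ("d5", 10.0, 9.0), ("d6", 1.0, 0.5),
]
cold_results, cold_scored = top_k(docs, k=2)
warm_results, warm_scored = top_k(docs, k=2, initial_threshold=7.0)
```

With a threshold of zero, the early low-scoring documents must all be evaluated before pruning becomes effective; a good initial threshold lets pruning start from the first candidate while returning the same results.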

Information Retrieval and Question Answering

Research on Question Classification via Bilingual Information

XU Jian, ZHANG Dong, LI Shoushan, WANG Hongling

2017, 31(5): 171-177.

Question classification is a basic task in question answering systems. Previous studies use only a monolingual corpus to train the question classification model and suffer from problems such as insufficient corpus data and short question texts. To solve these problems, we propose a new approach, a dual-channel LSTM model with bilingual information. First, we extend the Chinese corpus and the English corpus with their corresponding translated corpora. Second, the samples are represented by word vectors of the question text and of its translation. Finally, we build a question classifier using the dual-channel LSTM model. Experimental results demonstrate that our approach improves the performance of question classification.

Sentiment Analysis and Social Computing

A Cascaded Construction of Sentiment Classifier for Micro-blogs

ZHANG Yangsen, SUN Kuangyi, DU Cuilan, WANG Jian, TONG Lingling

2017, 31(5): 178-184.

This paper proposes a cascaded classifier for micro-blog sentiment analysis. The primary classifier is based on an emotion dictionary and a Sina micro-blog emoticon dictionary. The secondary classifier is based on orientation similarity, grouped by several key sentiment words. The third-level classifier is built using Naive Bayes. Micro-blogs are processed by the three classifiers in a pipeline. Experimental results show that the method is effective when compared against the NLPCC2014 micro-blog sentiment evaluation results.

Sentiment Analysis and Social Computing

Connecting Social Media to E-Commerce: Predicting Long-tail Purchase Behaviors using Multiple Additive Regression Tree

BAI Ting, WEN Jirong, ZHAO Xin, YANG Bohua

2017, 31(5): 185-193.

Long-tail products, each with low demand, together account for a significant share of total revenue. Analyzing long-tail purchase behaviors is challenging because the small number of purchases makes the data sparse. This paper proposes to leverage online social media information to predict long-tail purchase behaviors. Specifically, we build user profiles from social media information, including status text, following links, and temporal activity distributions, and predict purchases with weighted Multiple Additive Regression Trees (MART). Experiments on data from JingDong and SinaWeibo demonstrate the effectiveness of the proposed method and yield several interesting findings.

Sentiment Analysis and Social Computing

Potential Adverse Drug Reactions Discovery from Social Networks

ZHAO Mingzhen, LIN Hongfei, XU Bo, HAO Huihui

2017, 31(5): 194-202.

With the development of the Internet, social networks have accumulated large amounts of text data about health care. This paper presents an information entropy based method to recognize potential adverse drug reactions from user comments in health-related social networks. To recognize the potential adverse drug reactions, this paper also proposes a protein association function based on Word2vec and Skip-gram. Using this function, it tries to detect evidence linking drugs to their potential adverse drug reactions. The results show that this method is promising for providing an evidence chain for potential adverse drug reactions.

Sentiment Analysis and Social Computing

News Comments Clustering Based on WMD Distance and Affinity Propagation

GUAN Saiping, JIN Xiaolong, XU Xueke, WU Dayong, JIA Yantao, WANG Yuanzhuo, LIU Yue

2017, 31(5): 203-214.

With the rapid development of news websites, the number of news comments has increased sharply, and these comments are very important to public opinion analysis and news comment recommendation. This paper proposes a news comment clustering method, called EWMD-AP, to automatically mine the public's focuses on the news. The method employs Word Mover's Distance (WMD) with enhanced weight vectors to calculate the distances between news comments. It then adopts Affinity Propagation (AP) to cluster the comments, and finally obtains the clusters and their representative comments, which correspond to the focuses of the public. In particular, this paper replaces the traditional word frequency based weight vectors in WMD with enhanced weight vectors that consist of three components: the importance coefficient of words, the de-contextualization coefficient, and the traditional TF-IDF coefficient. Experimental results on 24 news comment datasets demonstrate that EWMD-AP performs much better than both traditional clustering methods (e.g., K-means and Mean Shift) and state-of-the-art ones (e.g., Density Peaks).

Sentiment Analysis and Social Computing

Next Basket Recommendation Based on Implicit User Feedback

LI Yumeng, LIAN Xubao, XU Bo, WANG Jian, LIN Hongfei

2017, 31(5): 215-222.

“Next basket” recommendation is a crucial task in the e-commerce field. Traditional algorithms can be divided into sequential recommenders and general recommenders, both of which neglect the impact of implicit feedback behavior and the time sensitivity of user preferences. This paper proposes a “next basket” recommendation framework based on implicit user feedback. We divide user behaviors into several time windows according to their timestamps, and model the user preference in different dimensions for each window. We then use a convolutional neural network to train a classifier. Compared with traditional linear models and tree models on a real dataset, the proposed model improves user satisfaction with the recommender system.
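
The time-window step might look like the sketch below, which buckets timestamped implicit-feedback events into fixed-size windows and counts each behavior type per window. The behavior names, the fixed window size, and the count-based features are assumptions for illustration; the abstract does not specify the feature dimensions, and the convolutional classifier is not shown.

```python
from collections import Counter


def window_features(events, start, window_size, num_windows, behaviors):
    """Count each behavior type per time window.

    events is an iterable of (timestamp, behavior). Returns one row per
    window, with counts ordered as in behaviors. Events outside the
    covered time range are ignored.
    """
    counts = [Counter() for _ in range(num_windows)]
    for timestamp, behavior in events:
        index = int((timestamp - start) // window_size)
        if 0 <= index < num_windows:
            counts[index][behavior] += 1
    return [[window[b] for b in behaviors] for window in counts]


# Hypothetical event log for one user: (timestamp, behavior type).
events = [(0, "click"), (5, "click"), (12, "cart"), (25, "buy"), (29, "click")]
features = window_features(
    events, start=0, window_size=10, num_windows=3,
    behaviors=("click", "cart", "buy"),
)
```

The resulting windows-by-behaviors matrix is the kind of grid-shaped input a convolutional network can consume, with one row per time window.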

2017 Volume 31 Issue 5 Published: 16 October 2017