Journal of Chinese Information Processing

Language Analysis and Calculation

Automatic Parsing of Chinese Concept Compound Chunk

WU Yongxu, LV Xueqiang, ZHOU Qiang, GUAN Xiaoda

2016, 30(2): 1-11.

In order to solve the problems of chunk boundary identification and intra-chunk structure analysis, this paper explores a new chunk parsing task based on the Chinese concept compound chunk (CCC) scheme. After making detailed comparisons with previous base chunk and functional chunk schemes, the main parsing difficulties for CCC chunking are revealed. Therefore, the paper proposes a CCC parsing method based on the “shift-reduce” model. The experiments on the CCC bank automatically extracted from Tsinghua Chinese Treebank (TCT) show the feasibility of the method for parsing some simple CCCs, which facilitates further syntactic and semantic parsing on complex CCCs.
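
For readers unfamiliar with the "shift-reduce" model named in the abstract above, the following is a minimal sketch of the general stack-and-buffer idea, not the parser described in the paper. The tag set `NOUN_LIKE` and the rule-based `choose_action` policy are hypothetical stand-ins for a trained action classifier, and the sketch only finds chunk boundaries without analysing intra-chunk structure.

```python
# Assumed toy tag set: "a" = adjective, "n" = noun (not taken from the paper).
NOUN_LIKE = {"a", "n"}


def choose_action(stack, buffer):
    """Hypothetical hand-written policy standing in for a learned classifier."""
    if not stack:
        return "shift"
    # Keep shifting while the next word can extend a noun-like chunk.
    if buffer and stack[-1][1] in NOUN_LIKE and buffer[0][1] in NOUN_LIKE:
        return "shift"
    return "reduce"


def shift_reduce_chunk(tagged_words):
    """Group (word, tag) pairs into chunks using shift and reduce actions."""
    buffer = list(tagged_words)
    stack, chunks = [], []
    while buffer or stack:
        if choose_action(stack, buffer) == "shift":
            stack.append(buffer.pop(0))        # shift: move next word onto the stack
        else:
            chunks.append([word for word, _ in stack])  # reduce: emit stack as a chunk
            stack.clear()
    return chunks
```

Calling `shift_reduce_chunk([("red", "a"), ("apple", "n"), ("falls", "v")])` groups the adjective and noun into one chunk and leaves the verb in its own chunk. A real system would replace `choose_action` with a statistical model over stack and buffer features.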

Language Analysis and Calculation

Chinese Base-Chunk Identification Using Hidden-Layer Feature of Segmentation

LI Guochen, LIU Zhanpeng, WANG Ruibo, LI Jihong

2016, 30(2): 12-17.

A character-based neural network model for Chinese base-chunk identification is constructed. The model combines a neural network model for the segmentation task with the base-chunk identification model by using the hidden-layer features of segmentation. The sentence-level likelihood function for the base-chunk identification task is employed as the optimization target, and the parameters of the two models are trained in turn. The experimental results show that: 1) the F-score of base-chunk identification with the sentence-level likelihood function is 1.33% higher than that with the character-level likelihood function, and, in particular, the recall for multi-character chunk identification is improved by as much as 4.68%; 2) the final result using the hidden-layer features of the segmentation task is 2.17% higher.

Language Analysis and Calculation

A PDTB-Based Automatic Explicit Discourse Parser

LI Sheng, KONG Fang, ZHOU Guodong

2016, 30(2): 18-25.

Automatic discourse processing is considered one of the most challenging NLP tasks and is helpful to many downstream NLP tasks, such as question answering, automatic summarization and natural language generation. Recently, the large-scale discourse corpus PDTB has been made available, providing a common platform for discourse researchers. On the basis of the PDTB corpus, the paper proposes an end-to-end explicit discourse parser with conditional random fields. The parser consists of three components joined in a sequential pipeline architecture: a connective classifier, an explicit relation classifier and a relation argument extractor. We report the performance of each component and, from an error-cascading perspective, analyze the parser's overall performance in detail.

Language Analysis and Calculation

A Chinese Expert Disambiguation Method Based on Feature Mapping

PAN Xiao, YU Zhengtao, GUO Jianyi, MAO Cunli, YANG Xiuzhen

2016, 30(2): 26-31.

A Chinese expert page disambiguation method based on feature mapping is proposed according to the characteristics of Chinese expert pages. Firstly, with the help of a CRF model, 12 predefined character attributes are extracted from the standard page and the candidate page, and their weights are decided by an ME classifier. Then, the page similarity is calculated to decide whether the candidate page attributes should be appended. Experiments on NLP and ML expert pages show the effectiveness of the proposed method in disambiguation.

Sentiment Analysis and Social Computing

A Multi-strategy Approach to Cross-Lingual Sentiment Analysis

ZHANG Peng, WANG Suge, LI Deyu

2016, 30(2): 32-40.

The rapid development of the Internet has produced a large number of online resources. This multilingual information comes from a diverse global environment. Considering the characteristics of cross-language sentiment identification, this paper proposes a multi-strategy approach to cross-language sentiment analysis. Linguistically consistent samples and a hybrid concept space are used to construct a bilingual cooperative framework and a sentiment feature mixture framework, respectively. The results of the two frameworks are then combined to decide the final sentiment label for a single sample. Experiments show that our strategy works well on cross-language sentiment analysis tasks.

Sentiment Analysis and Social Computing

Semi-supervised Sentiment Classification Based on Ensemble Learning with Voting

HUANG Wei, FAN Lei

2016, 30(2): 41-49.

Recently, sentiment classification has become a hot research topic in natural language processing. In this paper, we focus on semi-supervised approaches to this problem. In contrast to the traditional method based on co-training, this paper presents a semi-supervised sentiment classification method via voting-based ensemble learning. We construct a set of diversified sub-classifiers by choosing different training sets, feature parameters and classification methods. During each voting round, the samples with the highest confidence are picked out to double the size of the training set and then to update the model. This new method also allows sub-classifiers to share useful attribute sets. It has logarithmic time complexity and can be used for imbalanced corpora. Experiments show that this method achieves good results in the sentiment classification task on corpora of different languages, domains and sizes, both balanced and unbalanced.

Sentiment Analysis and Social Computing

Calculation and Prediction of Topic Popularity Based on Causal Model

DU Hui, GUO Yan, FAN Yixing, ZHANG Jin, YU Zhihua, CHENG Xueqi

2016, 30(2): 50-55.

The Internet, with its freedom and richness, has become the most important channel of information dissemination. Hot topic mining benefits both policy making for governments and business strategy adjustment for companies. This paper presents an objective method to calculate topic popularity based on a causal model by analyzing its influence factors. The data required by the algorithm are easy to obtain, and considering panel data makes the algorithm more effective. We then use a multi-Gaussian curve to fit the movement of topic popularity, which is useful for popularity prediction.

Sentiment Analysis and Social Computing

User Behavior Analysis of Person Tags in SNS

LIU Lie, XING Qianli, LIU Yiqun, ZHANG Min, MA Shaoping

2016, 30(2): 56-63.

With the popularity of social network sites (SNS) and the massive increase in SNS users, the behavior analysis of SNS users is of substantial importance in website maintenance, performance optimization and system upgrade. It is also a very important research area in network knowledge mining and information retrieval. For a better understanding of how users add tags for themselves in SNS, this paper analyses the distribution of user tags based on the data of about 2.63 million Weibo users. It investigates the macroscopic distribution characteristics of user tags and the relation between the tag distributions of a user and of the people he follows. We reveal that when Weibo users add tags for themselves, they initially tend to use tags that reflect their own characteristics, and then tend to select popular tags out of a herd mentality. We apply the research findings to a tag prediction algorithm based on following relationships, and the results show that the correlation analysis is a useful reference for tag recommendation in social networks.
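
To make the idea of tag prediction from following relationships concrete, here is a minimal sketch. It is an assumed frequency-based baseline, not the algorithm evaluated in the paper: it simply recommends the tags that occur most often among the accounts a user follows. The function name `predict_tags` and the dictionary layout are hypothetical.

```python
from collections import Counter


def predict_tags(user, follows, tags, k=2):
    """Recommend the k tags most common among the accounts `user` follows.

    follows: dict mapping a user to the list of users they follow (assumed layout)
    tags:    dict mapping a user to the set of tags on their profile (assumed layout)
    """
    own = tags.get(user, set())
    counts = Counter(
        tag
        for followee in follows.get(user, [])
        for tag in tags.get(followee, set())
        if tag not in own                      # do not recommend tags already used
    )
    # Sort by descending count, then alphabetically for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:k]]
```

For example, if a user follows three accounts of which three carry the tag "travel" and two carry "food", the sketch recommends "travel" first and "food" second.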

Sentiment Analysis and Social Computing

A Collaborative Filtering Algorithm Combining Location Information

LU Xiao, WANG Shuxin, WANG Bin, LU Kai

2016, 30(2): 64-73.

Recommendation system based on users consumption data is playing an increasingly large application value in e-commerce, And in these data, businesses location information which can effectively reflect the users personal geographical preference, would make an important significance on recommender system. Existing work generally use only users review data as well as the distance between locations, which cannot reflect the relationships between different locations, not to mention that user preferences in different locations should has different weight. This paper proceed from the perspective of geographical area, and study the users preferences within the area, as well as the impact of different area partition methods on recommend models. Then we explore to incorporate recommender systems with geographical information effectively, including the locations global effects and users regional preferences, proposing recommendation models, such as LGE, LGN and LRSVD. Experimental evaluation on Yelp dataset demonstrates that our models can effectively improve the prediction results comparing to the traditional methods.

Sentiment Analysis and Social Computing

Understanding Information Propagations via Influence Backbone Analysis on Social Networks

HUANG Junming, SHEN Huawei, CHENG Xueqi

2016, 30(2): 74-82.

Understanding the intrinsic mechanism of information propagation on social networks has attracted growing attention, including social network topology analysis and user behavior analysis. Due to the heterogeneity of links in social networks, only a portion of links contribute significantly to information propagation. The influence backbone of a social network, consisting of those links, might provide deeper insight into information propagation. Focusing on the influence backbone, we analyze the signs of links with social structural balance theory, and the roles of nodes with heterogeneous distributions of out-degrees, so as to find the roles played by links and nodes in information propagation at a microscopic level. Furthermore, we investigate the network connectivity and information spread efficiency of the influence backbone, finding that information propagations are more fragile and less effective.

Sentiment Analysis and Social Computing

Study on User Influence in Online Social Networks

XU Danqing, LIU Yiqun, ZHANG Min, MA Shaoping

2016, 30(2): 83-89.

Based on a large-scale social network dataset, this paper conducts a multi-feature statistical analysis of graph structure and finds that the indegree, outdegree and posts of social networks generally fit a power law distribution. The “small-world” property makes the strongly connected structure of the social network show a “spindle” shape. Furthermore, this paper incorporates users' posting behaviors, browsing behaviors and social community properties into social influence modeling. Experimental results show that the PTIM model, which combines users' behaviors and link relationships, performs stably in identifying the numbers of fans, authenticated users, the relative influence of user pairs and other indices.

Sentiment Analysis and Social Computing

Transfer with Shared Users: A Cross-platform

LI Chao, ZHOU Tao, HUANG Junming, CHENG Xueqi, SHEN Huawei

2016, 30(2): 90-98.

The wide use of personalized recommender systems on online shopping websites results in great profits and enhanced user experiences. However, since a user's behaviors are usually scattered across multiple websites, it becomes difficult to provide accurate recommendations when a recommender system sees only the portion of his behaviors on a single website. We propose a new recommendation algorithm that transfers behaviors across different websites to calculate similarities between users on different websites. Our algorithm overcomes the sparsity and cold-start problems in recommender systems with a significant accuracy improvement, outperforming traditional algorithms that are applied on a single website only.

Information Retrieval and Question Answering

A Parallel Query Correction Method for Mixed Language

ZHUAN Yue, XIONG Jinhua, MA Hongyuan, CHENG Shuyang, CHENG Xueqi

2016, 30(2): 99-106.

Queries in Chinese information retrieval systems often contain Chinese, Chinese phonetic alphabet, English, etc. Existing methods cannot handle mixed-language queries and long Chinese queries. To solve these problems, we propose a parallel query correction method for mixed language. The method establishes a language model with mixed language and builds a heterogeneous character dictionary tree according to the corresponding edit rules to process the query words. For long Chinese queries, we put forward a two-way parallel spelling correction model. For parallel processing, we put forward the concepts of a reverse character dictionary tree and a reverse language model. The training corpus used in the model is extracted from the user query log, click log, web links and other information. Experiments show that the parallel query correction method for mixed language increases the accuracy by 9%, reduces the recall by 3%, and, in particular, speeds up processing by 40% compared with single-pass query correction.
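
The abstract does not specify the dictionary tree or its edit rules, so the following is only a generic sketch of the underlying idea: a character trie searched with a bounded Levenshtein distance to find correction candidates. The names `build_trie` and `search`, the `"$"` end-of-word marker and the uniform edit costs are all assumptions; the paper's heterogeneous tree, reverse tree and language model are not reproduced here.

```python
def build_trie(words):
    """Build a nested-dict trie; "$" marks end of word (assumes no word contains "$")."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(trie, query, max_edits=1):
    """Return {word: distance} for dictionary words within max_edits of query."""
    results = {}
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # One row of the Levenshtein table per trie edge.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,         # skip a query character
                           prev_row[i] + 1,        # skip a dictionary character
                           prev_row[i - 1] + cost))  # match or substitute
        if "$" in node and row[-1] <= max_edits:
            results[node["$"]] = row[-1]
        if min(row) <= max_edits:                  # prune hopeless branches
            for next_ch, child in node.items():
                if next_ch != "$":
                    walk(child, next_ch, row)

    for ch, child in trie.items():
        if ch != "$":
            walk(child, ch, first_row)
    return results
```

With a dictionary of `["beijing", "being", "bing"]`, the misspelled query `"bejing"` yields `beijing` and `being` at distance 1, while `bing` is pruned. A full system would then rank such candidates with a language model.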

Information Retrieval and Question Answering

Information Retrieval Model Combining Sentence Level Retrieval

ZUO Jiali, WANG Mingwen, WU Shuixiu, WAN Jianyi

2016, 30(2): 107-112.

Models exploiting the position and proximity information of query terms in documents improve retrieval performance, but with a high computational complexity. The paper presents an approximation method that computes the degree of relevance of each sentence to the query, resulting in an information retrieval model combining sentence-level retrieval. Experimental results show that our model achieves better performance than the baseline models.

Information Retrieval and Question Answering

Exploration of Implicit Negative Feedback with Time Factor in Search Session

CHEN Zhenhong, YU Xiaoming, LIU Yue, CHENG Xueqi

2016, 30(2): 113-120.

Implicit relevance feedback has been used to improve the performance of retrieval systems. In contrast to most recent related work, which focuses on implicit positive feedback, this paper investigates the usefulness of combining both implicit positive and negative feedback with a time factor. The implicit negative feedback used in this paper is the unclicked document before the clicked document in the same search session. By estimating the time spent on the title and snippet of the unclicked document, the time factor is introduced to infer the relationships between users' interests and behaviors. Thus, a unified time factor model called TIPNF is proposed, which uses both implicit positive and negative feedback to improve the performance of the retrieval system. Experiments on TREC Session 2011 and 2012 verify the effectiveness and stability of TIPNF.

Information Retrieval and Question Answering

Query Clustering Based on Content and User Behavior

CHENG Shuyang, XIONG Jinhua, GONG Shuai, CHENG Xueqi

2016, 30(2): 121-127.

This paper proposes a probabilistic latent semantic indexing model based on the query graph (GPLSI) to learn query features for query clustering. Modeling query-word and query-query co-occurrence, GPLSI simulates the generation of query intent and its representation based on query text, click and session information, and learns the probability distribution of a query over different intents. Experimental results illustrate GPLSI's effectiveness in query similarity measurement and multi-intent query clustering.

Information Retrieval and Question Answering

Personalized Citation Recommendation Based on Users' Preference and Language Model

LIU Yaning, YAN Rui, YAN Hongfei

2016, 30(2): 128-135.

Automatic citation recommendation based on citation context is a highly valued research topic. The existing works all focus on content-based methods only. In this paper, we treat citation recommendation as content-based analysis combined with personalization. Using the user's publication and citation history as the user's profile, together with the language model, we propose a PCR (personalized citation recommendation) model. Experiments indicate a 71.01% improvement in recall@10 and a 70.23% improvement in MAP compared with the traditional language model.

Information Retrieval and Question Answering

Image Semantic Retrieval Based on Visual Saliency Computation

LIU Wei, CHEN Xu, LIANG Yongsheng

2016, 30(2): 136-141.

Internet labeling and tags have been used extensively to describe image contents on the Web. To understand and utilize these tags for image semantic retrieval, this paper introduces a visual saliency model to emphasize the salient information and then extracts visual features to describe the similarity between images. Finally, a novel random walk is proposed to balance the influences of the image contents and the tags. Experiments show the effectiveness and feasibility of the proposed method when applied to image understanding and retrieval.

Information Retrieval and Question Answering

Interactive Question Answering Based on Ontology and Semantic Grammar

WANG Dongsheng, WANG Shi, WANG Weimin, LIU Liangliang, FU Jianhui

2016, 30(2): 142-152.

In QA systems, user queries are usually not isolated, but correlated. This paper proposes an ontology and semantic grammar based method for interactive question answering, and we develop a QA system called OSG-IQAs based on an existing non-contextual question answering system. We first propose a discourse structure to maintain the semantic information (i.e., the understanding result) of questions, and then use an approach to recognize the specific type of relevancy between the previous question and the follow-up question. We then propose an algorithm which fuses different contextual information (recorded in the discourse structure) into the current follow-up question according to the specific relevancy type. Lastly, the transformed question is resubmitted to the non-contextual question answering system. We evaluate the proposed method on two real contextual QA data sets from two areas of different scales. The results show that the proposed method has better scalability; we achieved an overall performance better than a baseline system and almost the same performance as another comparison system whose contextual phenomena are manually resolved.

Information Retrieval and Question Answering

Wikipedia Entity to Enhanced Graph-based Multi-document Summarization

CHEN Weizheng, YAN Rui, YAN Hongfei, LI Xiaoming

2016, 30(2): 153-159.

This paper presents a novel method to enhance graph-based multi-document summarization by incorporating Wikipedia entities. The Wikipedia contents of high-frequency entities are extracted and arranged as the document collection's background knowledge. Then the PageRank algorithm is used to rank the sentences in the document collection, and an improved DivRank algorithm is applied to rank the sentences in both the document collection and the background knowledge. Finally, the summary sentences are chosen based on a linear combination of these two ranking results. Results of experiments on the data of the Document Understanding Conference (DUC) 2005 show that the proposed method can effectively make use of Wikipedia knowledge to improve summary quality.
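
As a rough illustration of two ingredients named in the abstract, the sketch below runs a plain power-iteration PageRank over a sentence similarity matrix and linearly combines two score lists. It is a simplified sketch under stated assumptions: the improved DivRank is not implemented (any second score list can be passed in), and the damping factor, iteration count and mixing weight `alpha` are illustrative values rather than the paper's settings.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph given as a square matrix.

    weights[i][j] is the edge weight from sentence i to sentence j.
    Rows that sum to zero simply pass on no score (a simplification).
    """
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_sums[i]
                for i in range(n)
                if out_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def combine(scores_a, scores_b, alpha=0.5):
    """Linear combination of two rankings: alpha * a + (1 - alpha) * b."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(scores_a, scores_b)]
```

In this sketch, sentences would be selected by sorting on the combined score; how the paper normalises the two rankings before combining them is not stated in the abstract.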

Information Retrieval and Question Answering

Protein-Protein Interaction Extraction Based on Transfer Learning

LI Lishuang, GUO Rui, HUANG Degen, ZHOU Huiwei

2016, 30(2): 160-167.

As an important branch of biomedical information extraction, Protein-Protein Interaction (PPI) extraction has great research significance. Currently, research on PPI mainly focuses on traditional machine learning, which requires large amounts of annotated corpora for training and makes it costly to label new data. This paper employs transfer learning to extract PPI with a small amount of labeled data from the target domain (in-domain), drawing support from annotated data from the source domain (out-of-domain). To avoid the negative transfer caused by large differences between the distributions of different domains, we adjust the weight of each instance from the source domain depending on its relative distribution. Experiments on the AIMed corpus and the IEPA corpus reveal the efficiency of our algorithms.

Information Retrieval and Question Answering

A Semi-supervised Chinese Event Extraction Method

XU Xia, LI Peifeng, ZHU Qiaoming

2016, 30(2): 168-174.

Currently, semi-supervised or unsupervised event extraction remains a challenge. According to the nature of the Chinese language, this paper proposes a dual-view-based bootstrapping approach to extract event patterns. Starting from a small set of seeds, it applies a cross filtering method to two views, document relevance and semantic similarity, and extracts new patterns in each iteration. Our experimental results show that our system outperforms existing systems.

Information Retrieval and Question Answering

An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling

LIN Hailun, XIONG Jinhua, WANG Bo, CHENG Xueqi

2016, 30(2): 175-181.

The Deep Web refers to Web database content hidden behind HTML forms, which can only be accessed by performing form submissions. Current web page collection technologies cannot cover these resources effectively by following hyperlinks alone. For this purpose, this paper proposes an approach to crawling the deep web based on domain knowledge sampling. Firstly, it creates a domain attribute set using open source directory services and assigns the attributes based on a confidence function; secondly, it uses the domain attribute set to select query interfaces and generate assignments; finally, it selects the assignment with the highest confidence as a query instance for deep web crawling, based on a greedy algorithm. Experiments show that our approach can effectively collect deep web resources.

Information Retrieval and Question Answering

FPC: Fast Incremental Clustering for Large Scale Web Pages

YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi

2016, 30(2): 182-188.

Structure-oriented web page clustering is one of the most important techniques in web data mining. Previous methods have not given a formal definition of the web page cluster center and have to calculate several point-wise similarities to obtain the similarity between a point and a cluster or between two clusters. These methods are much less efficient than clustering algorithms that use a cluster center; in particular, they cannot satisfy the needs of large-scale, fast incremental web page clustering. To solve these issues, this paper proposes a fast incremental clustering method, FPC (Fast Page Clustering). In our method, a new approach is given to calculate the similarity between two web pages, which is 500 times faster than the Simple Tree Matching algorithm; then a formal representation of the web page cluster center is described, and a Kmeans-like clustering algorithm, MKmeans (Merge-Kmeans), is applied for fast clustering. Moreover, we use a locality-sensitive hashing technique to quickly find the most similar cluster in a large-scale cluster set and improve the efficiency of incremental clustering.

Information Retrieval and Question Answering

Research on Reorganization of Text Clustering Results

CHEN Xiaorong, LIU Zuoguo

2016, 30(2): 189-195.

This paper presents a distance-oriented reorganization strategy in which clusters can be reorganized independently of the clustering process. The concept of Nearest Domain is proposed and the Nearest Domain rules are elaborated. Then a Gauss Weighing Algorithm is designed to re-weight a text by its distance from the cluster kernel. Finally, Nearest Domain Weights separate sparse clusters and adjust abnormal texts while combining similar ones. A clustering experiment shows that the reorganization process effectively improves the accuracy and recall rate and makes the results more reasonable by increasing the inner density of clusters.
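
The abstract does not give the weighting formula, so the sketch below only illustrates one common way to weight an item by its distance from a cluster center: a Gaussian kernel, w = exp(-d^2 / (2 * sigma^2)). Treating the cluster kernel as the centroid, using Euclidean distance, and the bandwidth `sigma` are all assumptions made for illustration.

```python
import math


def gaussian_weight(distance, sigma=1.0):
    """Gaussian kernel: 1.0 at distance 0, decaying smoothly as distance grows."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


def centroid(vectors):
    """Component-wise mean of equally sized vectors (assumed cluster kernel)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reweight(vectors, sigma=1.0):
    """Weight each text vector by its Euclidean distance from the centroid."""
    center = centroid(vectors)
    return [gaussian_weight(math.dist(v, center), sigma) for v in vectors]
```

Texts lying on the centroid receive weight 1.0, and texts far from it receive weights close to 0, which is one way outlying texts could be flagged for reassignment.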

Machine Translation

The Translation Selection of Anchor Text in Wikipedia Cross-Lingual Link Discovery

ZHENG Jianxi, BAI Yu, GUO Cheng, ZHANG Guiping

2016, 30(2): 196-201.

Wikipedia Cross-Lingual Link Discovery (CLLD) aims to automatically identify topic-related anchor texts in source-language Wikipedia articles and to recommend a set of relevant target-language links for each anchor text. It involves three key problems: anchor text identification, anchor text translation, and target link discovery. To deal with the multiple target translations of an anchor text, we propose a context-based translation selection method, which uses a voting method based on pointwise mutual information (PMI). Experiments on the translation selection of person names, terminology and abbreviations in Chinese and English Wikipedia articles show that the method achieves good performance.
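
A minimal sketch of PMI-based voting may help make the selection step concrete. PMI is computed from co-occurrence counts as log2(P(x, y) / (P(x) P(y))), and each context word votes for the candidate translation with which it has the highest PMI. The voting rule, the count dictionaries and the function names are assumptions for illustration; the abstract does not specify these details.

```python
import math


def pmi(x, y, unigram, pair, total):
    """Pointwise mutual information from raw counts; -inf if x and y never co-occur."""
    joint = pair.get(frozenset((x, y)), 0)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * total / (unigram[x] * unigram[y]))


def select_translation(candidates, context, unigram, pair, total):
    """Each context word votes for the candidate with the highest PMI."""
    votes = {c: 0 for c in candidates}
    for word in context:
        scored = [(pmi(c, word, unigram, pair, total), c) for c in candidates]
        best_score, best = max(scored, key=lambda item: item[0])
        if best_score > float("-inf"):         # ignore words unrelated to every candidate
            votes[best] += 1
    return max(candidates, key=lambda c: votes[c])
```

Here `unigram` maps each term to its occurrence count, `pair` maps a `frozenset` of two terms to their co-occurrence count, and `total` is the number of counting units (for example, documents). Ties are broken in favour of the earlier candidate.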

Other Language in/around China

Vector Space Models and Component Features Analysis of Tibetan Characters

CAI Zhi-jie, CAI rang-zhuoma

2016, 30(2): 202-206.

The properties of Tibetan characters are essential for Tibetan information processing and are of substantial significance in education and scientific research. Because a Tibetan character is written by arranging 1 to 7 components both horizontally and vertically, the properties of Tibetan characters include the structure, length and frequency of the characters and the locality features of each character. This paper establishes a vector model (VMTT) of Tibetan characters, and a vector model (VMTS) and a sparse-land model (SLM) of Tibetan character strings, and conducts a component feature analysis of Tibetan characters based on these models.

Other Language in/around China

Implementation and Comparative Analyses of Mono-space Based and Voiced/Unvoiced Phoneme Based Word Segmentation of Uyghur for Keyword Search

Muhetaer Shadike, Buheliqiguli Wasili, LI Xiao

2016, 30(2): 207-212.

In this paper, we introduce two word segmentation methods for Uyghur keyword search. They are implemented in MATLAB, and their performance is investigated under the same conditions. Finally, some ideas for optimization are given.

Other Language in/around China

Post-processing for Verbs in Chinese-Mongolian Machine Translation

Wangsiriguleng, Wang Chunrong, Siqintu, Arong, Yuxia

2016, 30(2): 213-216.

Mongolian is rich in morphological variation, especially for verbs. Based on a given Mongolian verb dictionary, we correct wrong verb forms that appear at the end of sentences produced by hierarchical phrase-based Chinese-Mongolian machine translation. The experiments show that this method can improve translation quality.

2016 Volume 30 Issue 2 Published: 20 April 2016