2016 Volume 30 Issue 5 Published: 15 October 2016
  

    Review
  • Review
    YUAN Shuhan, XIANG Yang
    2016, 30(5): 1-8.
    Constructing word representations that express semantic features is a key problem in natural language processing. In this paper, we first introduce lexical semantic representations based on the distributional hypothesis and on prediction models, and describe methods for evaluating word representations. We then review new applications built on the semantic information in word representations. Finally, we discuss the development directions and existing problems of lexical semantic representation.
  • Review
    GOU Chengcheng, DU Pan, LIU Yue, CHENG Xueqi
    2016, 30(5): 9-18.
    Emerging topic detection is one of the major research foci in social network analysis. The openness of social networks, microblogs in particular, provides unprecedentedly favorable conditions for topics to spread and break out. Emerging topics often accompany big news or events that are about to break out and have a significant social impact. Identifying these topics at an early stage is the central research problem of emerging topic detection. This paper reviews the main developments in the field in recent years and elaborates the relevant concepts, methods and theory. The detection methods are analyzed and discussed from the perspectives of bursty content features and information diffusion models. Finally, we conclude the paper with an exploration of future research directions.
  • Review
    MEI Lili, HUANG Heyan, ZHOU Xinyu, MAO Xianling
    2016, 30(5): 19-27.
    Sentiment analysis has developed rapidly in recent years and has great research and application value. Sentiment lexicon construction plays an increasingly important role in the task. This paper summarizes research progress on sentiment lexicon construction. First, four kinds of methods are summarized and analyzed: methods based on heuristic rules, on graphs, on word alignment models and on representation learning. Then, some popular corpora, dictionary resources and evaluation organizations are introduced. Finally, we conclude the topic and outline development trends in sentiment lexicon construction.
  • Review
    HU Yang, FENG Xupeng, HUANG Qingsong, FU Xiaodong, LIU Li, LIU Lijun
    2016, 30(5): 28-35.
    Short texts have peculiarities such as extreme sparsity and dispersed features, which lead to inferior sentiment classification. To solve this problem, we propose the feature polymeric topology model for short-text sentiment classification. The model integrates mutual information among features, similarity of sentiment orientation and difference in topic ascription into a sentiment feature correlation. This correlation is then used to build a topology polymeric graph, in which the strongly connected components are taken as the most similar sentiment features. Finally, the polymeric topology model supplements the training feature set with similar features from the unlabeled corpora, and reduces the dimension of the training space at the same time. In experiments, the proposed model improves precision and recall by 0.03 and 0.027, respectively.
  • Review
    WANG Yi, LIANG Xun, ZHOU Xiaoping
    2016, 30(5): 36-46.
    Online Social Networking (OSN) is a complex system in which both users and messages are fundamental objects for investigating network topology and information dissemination. To study the structural features and the rules of information propagation, this paper analyzes about 30,000 users, including their friendships and the most recent 200 posts. The main statistical results are: 1) the SINA network is not dense and the correlation density is almost linear; 2) during the dissemination of a single post, a “10-90 rule” occurs, that is, 10% of the users can affect the other 90%; and 3) four patterns can be identified by considering both life cycle and forwarding structure. These results may provide a basis for subsequent modeling, and benefit public opinion monitoring and cyber marketing.
  • Review
    QIAO Zhi, ZHOU Chuan, JI Xiancai, CAO Yanan, GUO Li
    2016, 30(5): 47-56.
    To improve user experience in event-based social network (EBSN) services, the event recommendation task has been studied in recent years. In this paper, the user motivation data of EBSN applications is analyzed, and a novel latent factor model unifying multiple data features is proposed. The method considers two new types of features, namely heterogeneous online and offline social relationships and the regional preferences of users, and applies them to event recommendation. Experimental results on real-world data sets show that our method performs better than some traditional methods.
  • Review
    WANG Yongqing, SHEN Huawei, CHENG Xueqi
    2016, 30(5): 57-64.
    In information propagation, users exhibit forwarding preferences when they receive the same message repeatedly. Modeling forwarding preference is fundamental to information propagation and related applications, e.g., influence analytics, cascade dynamics and social recommendation. In this paper, we suggest that forwarding preference is mainly affected by interpersonal influence, which is determined by the influence of the sender and the susceptibility of the receiver. We propose the Forwarding Preference Model to capture this user-specific latent influence and susceptibility. We compare our model with state-of-the-art forwarding preference models on a dataset from Weibo; the proposed model consistently outperforms the other methods on two evaluation measures.
  • Review
    JIANG Shengyi, YANG Bohong, YAO Juanna, WU Meiling, ZHANG Yusha
    2016, 30(5): 65-72.
    Microblogs are among the most popular online social media today. Identifying users' community structure on microblogs can help people understand that structure and users' behaviors, and even provide personalized services for users. Currently, most studies on microblog community detection algorithms focus on link information and ignore the content posted by users. To address this issue, a fast microblog community detection algorithm based on an augmented network is proposed. The algorithm constructs an augmented network by integrating users' link information and content, on which communities can be identified efficiently. Experimental results show that the proposed algorithm identifies the community structure of real microblog networks better than other algorithms.
  • Review
    WANG Pengfei, GUO Jiafeng, LAN Yanyan, YAN Xiaohui, CHENG Xueqi
    2016, 30(5): 73-79.
    In this paper, we propose a novel probabilistic transaction model (PTM) for brand recommendation in traditional shopping malls. Some existing algorithms, such as KNN-based recommendation, take only local information into consideration and suffer from the sparsity problem in offline transaction data. Others, such as matrix-factorization-based recommendation, treat all transactions of each user as a whole and fail to discriminate between inter- and intra-transaction co-occurrence. To address these two issues, the PTM is designed to learn latent representations of brands and transactions from all the brand co-occurrences in each transaction, from which a latent representation for each user can be derived for personalized recommendation. Experiments on real transaction data sets show that PTM-based recommendation outperforms the baselines.
  • Review
    CHEN Hongchao, LI Fei, ZHU Xinhua, MA Runcong
    2016, 30(5): 80-88.
    In this paper, we propose a word semantic similarity approach based on path and depth in CiLin. The approach uses the shortest path between two word senses and the depth of their lowest common parent node in the hierarchy tree to calculate the semantic similarity between the two senses. To make the path and depth calculations more reasonable, we assign different weights to the edges between different layers of the classification tree, and dynamically adjust the shortest path between two senses according to their branch interval under the lowest common parent node. Experiments show that the correlation coefficient between human judgments on the MC30 dataset and the measures computed by this approach is 0.856, which is higher than that of most current semantic similarity algorithms.
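The path-and-depth idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' formula: the abstract does not give the layer weights or the branch-interval adjustment, so the sketch below swaps in a plain Wu-Palmer-style measure with unweighted edges. The toy hierarchy and its node names are hypothetical and are not CiLin codes.

```python
# Toy hierarchy as child -> parent; node names are hypothetical, not CiLin codes.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(ancestors(node))

def lowest_common_parent(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def path_length(a, b):
    """Number of edges on the shortest path between a and b (unweighted)."""
    lcp = lowest_common_parent(a, b)
    return depth(a) + depth(b) - 2 * depth(lcp)

def similarity(a, b):
    """Wu-Palmer-style score: shorter paths and deeper common parents score higher."""
    d = depth(lowest_common_parent(a, b))
    return 2 * d / (path_length(a, b) + 2 * d)
```

With this toy tree, siblings such as "dog" and "cat" score higher than "dog" and "tree", whose only common parent is the root. Per-layer edge weights, as described in the abstract, would replace the unit edge cost inside `path_length`.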
  • Review
    ZHANG Yi, LI Zhijiang
    2016, 30(5): 89-93.
    Chinese word segmentation (CWS) is the foundation of Chinese information processing. This article proposes a contextual word length feature based on Gaussian noise. The experimental results indicate that this feature can improve on existing results.
  • Review
    XING Juanjuan
    2016, 30(5): 94-100.
    To identify fake reviews, we propose a fake review identification method based on Markov Logic Networks. First, the characteristics of fake review content and reviewer behavior are analyzed, and the review content features and reviewer behavior features are selected. Second, the predicates and formulas are defined from these features, and the weights and the inference are described. The experimental results show that the proposed method performs well.
  • Review
    LI Zhiyu, LIANG Xun, ZHOU Xiaopin
    2016, 30(5): 101-110.
    We propose a method for Word2vec training on short review texts that partitions the text according to topic. We examine three kinds of partition methods, namely Based on Whole-review (BWP), Based on sentence-Separator (BSP) and Based on Topic (BTP), to improve the result of Word2vec training. Our findings suggest that accuracy and similarity rates differ considerably between the None Partition Model (NP) and BWP, BSP and BTP, owing to the characteristics of short review texts. Experiments on various models and vector dimensions demonstrate that the word vectors trained by the Word2vec model are greatly improved by BTP.
  • Review
    ZHU Shanshan, HONG Yu, DING Siyuan, YAN Weirong, YAO Jianmin, ZHU Qiaoming
    2016, 30(5): 111-120.
    Implicit discourse relation recognition aims to automatically detect the relationship between two arguments without explicit connectives. Previous studies show that linguistic features are effective for this task. However, state-of-the-art accuracy is merely 40% because of the lack of sufficient training data. To address this problem, this paper presents a novel implicit discourse relation recognition method based on training data expansion. First, we take some original training data as seed samples and use them to mine semantically and relationally parallel data from external data resources by means of “argument vectors”. Second, we augment the original training data with the mined parallel data. Finally, we run implicit discourse relation classification experiments on the expanded data. Experimental results on the Penn Discourse Treebank (PDTB) show that our method outperforms the baseline system in classification accuracy by 8.41% overall and by 5.42% on average. Compared with the state-of-the-art system, we obtain a further improvement of 6.36%.
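    Keywords: implicit discourse relation; semantic vector; training data expansion; discourse analysis
    Received: 2014-12-25; Finalized: 2015-03-27
    Funding: National Natural Science Foundation of China (61373097, 61272259, 61272260, 90920004); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2009321110006, 20103201110021); Natural Science Foundation of Jiangsu Province (BK2011282); Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJA520003); Natural Science Foundation of Suzhou (SH201212)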
  • Review
    JIANG Dan, ZHOU Wenle, ZHU Ming
    2016, 30(5): 121-128.
    Traditional text clustering methods generally use the bag-of-words (BOW) model to construct document vectors, ignoring semantic information between words. In addition, centroid-based partitioning clustering tends to rigidly split clusters of closely related concepts, which makes it unsuitable for mining interesting topics. To address these issues, this paper proposes a text clustering method based on semantics and cliques. In comparisons involving three popular semantic models, experiments reveal that our method performs better than K-means on the semantic clustering task.
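    Keywords: text clustering method; complete sub-graph; semantic similarity; distributed representations of words in a vector space
    Received: 2015-04-07; Finalized: 2015-06-02
    Funding: Sea-Cloud Collaborative Real-Time Processing System for Massive Network Data Streams (sub-project) (XDA06011203); Development of an Operation Support System for the New Business Model of TV Commerce Complexes (2012BAH73F01)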
  • Review
    ZHENG Xin, LI Peifeng, ZHU Qiaoming
    2016, 30(5): 129-135.
    In recent years, more and more studies have been devoted to temporal relations between events, with a focus on improving pairwise classifiers, while ignoring the obvious inconsistencies that arise in the global space of events when misclassifications occur. In this paper, we use a global inference model to resolve this problem by treating temporal relation recognition as an integer linear program. We use many constraints, such as reflexivity, transitivity, event coreference, temporal conjunctions and pairs of event types. The experimental results show that the global inference model outperforms the local classifiers by 3.56% in F1.
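    Keywords: event; temporal relation; inference
    Received: 2014-02-29; Finalized: 2015-03-30
    Funding: National Natural Science Foundation of China (61472265); National Natural Science Foundation of China (61331011); Jiangsu Province Prospective Joint Research Project (BY2014059-08)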
  • Review
    XI Yahui
    2016, 30(5): 136-144.
    Domain-specific sentiment lexicons play an important role in sentiment analysis systems. Because of the huge number of product reviews in diverse domains, automatic construction of a domain-specific sentiment lexicon is a challenging task. This paper proposes a two-phase algorithm for automatically constructing a domain-specific sentiment lexicon. In the first phase, a constrained label propagation algorithm is applied to construct a base sentiment lexicon using PMI and contextual constraints. In the second phase, domain-specific sentiment words are extracted according to the frequency of sentiment conflict, and the domain-specific sentiment lexicon is refined according to the contextual constraints and the product feature modified by the sentiment word. Experiments on diverse real-life datasets show promising results.
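The abstract above relies on PMI (pointwise mutual information) as a word association signal. As a hedged illustration of that quantity only, and not of the authors' constrained label propagation, the following sketch computes PMI from sentence-level co-occurrence. The choice of the sentence as the co-occurrence window and the function name `pmi_table` are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Compute PMI (base 2) for every word pair that co-occurs in a sentence.

    Probabilities are estimated as the fraction of sentences containing
    the word (or both words of the pair).
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        words = sorted(set(sent))  # count each word once per sentence
        word_count.update(words)
        pair_count.update(combinations(words, 2))  # keys are sorted tuples
    table = {}
    for (a, b), c in pair_count.items():
        p_ab = c / n
        p_a = word_count[a] / n
        p_b = word_count[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table
```

A positive score means the two words co-occur more often than independence would predict. In a lexicon-building pipeline, such scores could serve as edge weights between candidate words and seed sentiment words, though how the paper combines them with contextual constraints is not stated in the abstract.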
  • Review
    YAO Liang, HONG Yu, LIU Hao, LIU Le, YAO Jianmin
    2016, 30(5): 145-152.
    Data selection aims to select the sentence pairs most relevant to the target domain from a large-scale general-domain bilingual corpus, so as to alleviate the lack of high-quality bi-text for statistical machine translation in the domain of interest. Instead of relying solely on traditional language models, we propose a novel approach that combines translation models with language models for data selection from the perspective of generative modeling. The approach can better measure the relevance between sentence pairs and the target domain, as well as the translation probability of a sentence pair. Experiments show that the optimized system trained on bi-text selected by our method outperforms the baseline system trained on the general-domain corpus by 3.5 BLEU points. In addition, we present an effective method based on sentence pair re-ranking to tune the weights of the different features used to evaluate the quality of general-domain texts. A machine translation system based on this method achieves a further improvement of 0.68 BLEU points.
    Keywords: bilingual data selection; generative modeling; translation model; language model; weight tuning
    Received: 2015-07-31; Finalized: 2016-01-25
    Funding: National Natural Science Foundation of China (61373097, 61272259, 61272260)
  • Review
    YAN Canxun
    2016, 30(5): 153-159.
    Bilingual sentence alignment can be modeled as properly pairing vertices in a bipartite graph. The vertex pairs in the bipartite graph can be weighted with an evaluation function based entirely on a bilingual dictionary, which scores the word correspondences between an English sentence and a Chinese sentence. In our approach, the globally-maximum-weighted vertex pairs are first chosen as temporary anchors. Then, based on the temporary anchors, the locally-maximum-weighted vertex pairs and the range of the ratio of English to Chinese sentence lengths, the mistakes in the original anchor vertex pairs are corrected and the missing vertex pairs are supplemented. Meanwhile, the sentences in the bipartite graph are grouped into minimal groups of corresponding sentences. Comparison experiments show that the vertex-pairing sentence alignment approach works better than the Champollion sentence alignment system.
  • Review
    LV Xueqiang, WU Yongxu, ZHOU Qiang, LIU Yin
    2016, 30(5): 160-168.
    Corpus resources are closely related to natural language processing. However, different research institutions use different rules and tags when constructing corpora, which prevents the creation of a unified large corpus. This paper investigates the different annotation schemes and presents a method for heterogeneous corpus integration. Experiments on part-of-speech mapping and disambiguation indicate an accuracy of 87% after the integration, showing the validity of the method.
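    Keywords: corpus construction; data fusion; word mapping; POS disambiguation
    Received: 2015-10-08; Finalized: 2016-05-25
    Funding: National Natural Science Foundation of China (61271304, 61671070); Beijing Advanced Innovation Center for Imaging Technology project (BAICIT-2016003); National Social Science Fund of China (14@ZH036)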
  • Review
    XUE Yuanhai, YU Xiaoming, LIU Yue, GUAN Feng, CHENG Xueqi
    2016, 30(5): 169-175.
    The main purpose of information retrieval technology is to satisfy users' information needs by drawing on massive information resources. In recent years, many techniques have increased average effectiveness relative to traditional simple models, but they often ignore the robustness issue. User satisfaction is significantly hurt by the degraded results of many queries. A query performance prediction method based on learning to rank is proposed to obtain robust ranking results. For each query, the performance of multiple ranking results generated by different models is predicted, and the best one is shown to the user. A series of experiments is conducted on three standard LETOR benchmark datasets: OHSUMED, MQ2008 and MSLR-WEB10K. The results show that, compared with LambdaMART, one of the state-of-the-art models, the ranking results obtained this way significantly reduce the number of queries whose performance is hurt with respect to the BM25 model, while improving average effectiveness to nearly the same degree.
  • Review
    YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan
    2016, 30(5): 176-186.
    Microblogging sites are among the most popular information sharing platforms today. However, among the large number of posts published every day, spam is seen everywhere: users employ spam posts to advertise, broadcast and boast about their own products, and to defame their competitors. Filtering spam tweets is therefore a critical and fundamental problem. In this paper, we propose a semi-supervised algorithm based on Expectation Maximization and a Naive Bayesian classifier (EM-NB), which can filter spam tweets effectively using only a small amount of labeled data. Experimental results on more than 140 thousand tweets from Sina Weibo show that our method achieves higher accuracy and F-score than the baselines.
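The general idea of combining Expectation Maximization with Naive Bayes for semi-supervised text classification can be sketched as follows. This is a generic textbook-style illustration under stated assumptions, not the authors' EM-NB implementation: it assumes a two-class multinomial Naive Bayes with Laplace smoothing, pre-tokenized documents, and a fixed number of EM iterations. All function names are hypothetical.

```python
import math
from collections import defaultdict

def train_nb(docs, weights, vocab, n_classes=2, alpha=1.0):
    """M-step: fit multinomial Naive Bayes from soft class weights per document."""
    class_mass = [alpha] * n_classes          # smoothed class totals
    word_mass = [defaultdict(float) for _ in range(n_classes)]
    total_mass = [0.0] * n_classes
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for tok in doc:
                word_mass[c][tok] += w[c]
                total_mass[c] += w[c]
    z = sum(class_mass)
    log_prior = [math.log(m / z) for m in class_mass]
    log_like = [
        {t: math.log((word_mass[c][t] + alpha) / (total_mass[c] + alpha * len(vocab)))
         for t in vocab}
        for c in range(n_classes)
    ]
    return log_prior, log_like

def posterior(doc, model):
    """E-step for one document: class posteriors under the current model."""
    log_prior, log_like = model
    scores = [lp + sum(ll[t] for t in doc if t in ll)
              for lp, ll in zip(log_prior, log_like)]
    top = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, unlabeled, n_iter=10):
    """Train on labeled (tokens, label) pairs, then refine with unlabeled token lists."""
    vocab = {t for doc, _ in labeled for t in doc} | {t for doc in unlabeled for t in doc}
    lab_docs = [doc for doc, _ in labeled]
    # Labeled documents keep hard (one-hot) weights throughout.
    lab_w = [[1.0 if c == y else 0.0 for c in range(2)] for _, y in labeled]
    model = train_nb(lab_docs, lab_w, vocab)
    for _ in range(n_iter):
        unl_w = [posterior(doc, model) for doc in unlabeled]   # E-step
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, vocab)  # M-step
    return model
```

Here label 1 stands for spam and label 0 for non-spam. After training, `posterior(tokens, model)` returns the two class probabilities for a new post; words that appear only in unlabeled posts can still acquire a class association through the EM iterations, which is what lets a small labeled set go further.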
  • Review
    CHEN Jinguang
    2016, 30(5): 187-194.
    This paper proposes a summarization unit selection method based on the cloud model. The cloud model is used to account for both randomness and fuzziness in the distribution of summarization units. To obtain the relevance between a summarization unit and the query, the relevance scores between the word and each query word are treated as cloud drops. According to the uncertainty of the cloud, a summarization unit that is more relevant to the query is given a higher score. The importance of a sentence within the document set is then also considered, to evaluate the sentence's ability to summarize the content of the document set. The similarities between a sentence and all sentences in the document set are treated as cloud drops. Together these cloud drops form a cloud, which indicates the sentence's ability to summarize the content of the document set. The effectiveness of the proposed method is demonstrated on a large-scale open benchmark corpus in English. The method was also evaluated at TAC (Text Analysis Conference) 2010 with satisfactory results.
    Keywords: cloud model; query-focused multi-document summarization; uncertainty
    Received: 2016-00-00; Finalized: 2016-00-00
    Funding: Ministry of Education Humanities and Social Sciences General Project (13YJCZH013); Huzhou University Humanities and Social Sciences Pre-research Project (KY27015A)
  • Review
    KANG Shize, MA Hong, HUANG Ruiyang
    2016, 30(5): 195-202.
    Sentence ordering is an important task in multi-document summarization. For this purpose, we first use a neural network model to incorporate five proposed criteria for sentence connection, namely chronology, probabilistic, topical-closeness, precedence and succession. Then, a sentence ordering method based on a Markov random walk model is proposed, which determines the final ordering of the sentences from the strength of the connections between them. Evaluated with a semi-automatic measure and a subjective measure, the proposed method achieves clearly better sentence orderings than the baseline algorithms in the experiments.
  • Review
    LONG Congjun, LIU Huidan, WU Jian
    2016, 30(5): 203-208.
    This paper applies rules and statistical methods to convert Tibetan texts to IPA texts. The conversion procedure includes word segmentation and the construction of mapping rules and patterns for consonants, vowels, tones, and monosyllables or multi-syllables. Experimental results show that the proposed system performs well in IPA conversion.
    Keywords: Tibetan; IPA; automatic conversion; word-segmentation
    Received: 2015-10-15; Finalized: 2016-04-25
  • Review
    LI Qingwu, MA Yunpeng, ZHOU Yan, ZHOU Liangji
    2016, 30(5): 209-215.
    Popular offline writer identification methods for handwritten Chinese characters usually work on specific characters and demand a huge number of training samples. In this paper, a writer identification method based on curvature detection of skeletal strokes is proposed. First, images of handwritten characters are preprocessed by mathematical morphology, and representative skeletal strokes are extracted in four directions: horizontal, vertical, left-falling and right-falling. Then, circle reconstruction is applied to the extracted skeletal strokes, and the curvatures of the stroke circles in the four directions are selected to form the handwriting features. Finally, the characters are identified according to angular similarity. Experimental results show that the proposed algorithm places no restrictions on the content of the characters to be identified and requires fewer training samples.
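The final matching step in the abstract above, identification by angular similarity, can be sketched as below. The abstract does not define the measure, so this example assumes the common definition of one minus the normalized angle between two feature vectors. The four-component vectors stand in for the curvature features in the four stroke directions; their values and the writer names are made up for illustration.

```python
import math

def angular_similarity(u, v):
    """Return 1 - angle(u, v) / pi, so identical directions give 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cos) / math.pi

def identify_writer(sample, references):
    """Pick the reference writer whose feature vector is most similar to the sample."""
    return max(references, key=lambda name: angular_similarity(sample, references[name]))
```

For example, with references `{"writer_a": [0.9, 0.2, 0.4, 0.1], "writer_b": [0.1, 0.8, 0.2, 0.7]}`, a sample of `[0.8, 0.3, 0.5, 0.1]` is assigned to `writer_a`. Because the measure depends only on direction, it is insensitive to a uniform scaling of the curvature values.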