2012 Volume 26 Issue 5 Published: 15 October 2012
  

    Review
  • Review
    MO Yi, LIU Shenghua, LIU Yue, CHENG Xueqi
    2012, 26(5): 1-7.
    Microblogging, as a new social medium, plays an increasingly important role in daily life because of its real-time nature, its trends and its rapid spread of information. Filtering tweets by a topic of interest in order to track its trends is of substantial significance to users. Because a tweet is extremely short and contains little information and few textual features, filtering short tweets is a challenge for which traditional text classification is no longer applicable. In this paper, we propose an entropy-based classification rule learning algorithm to filter tweets by topic. Experimental results on nearly 90 000 tweets and 3 000 officially labeled tweets from Sina Weibo and TREC 2011 show that our algorithm achieves a higher F-score in filtering tweets by topic than the CPAR and SVM algorithms.
    Key words: tweets filtering; rule mining; information entropy
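The abstract does not give the rule learning procedure itself. As a minimal sketch of the underlying quantity only, the following computes the Shannon entropy of the labels covered by a one-term rule; the helper names `entropy` and `rule_entropy` and the whitespace tokenization are assumptions made for illustration, not the authors' algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rule_entropy(tweets, labels, term):
    """Entropy of the labels of the tweets that contain `term`.

    Returns None when no tweet contains the term. A low value means the
    hypothetical rule "contains term" covers tweets of mostly one class.
    """
    covered = [lab for text, lab in zip(tweets, labels) if term in text.split()]
    return entropy(covered) if covered else None
```

A rule whose covered tweets all share one label has entropy 0, while a rule that splits its covered tweets evenly between two classes has entropy 1 bit.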
  • Review
    WANG Hao, YANG Liang, LIN Hongfei
    2012, 26(5): 7-14.
    At 14:46 on March 11th, a strong earthquake occurred in Japan, which led to heated discussion on Sina Micro-blog. This paper employs an emotion-based HITS algorithm to analyze a corpus crawled during the week following the earthquake. First, a bipartite graph is constructed from candidate topic words and emotion categories. The topic words are then selected by HITS score and frequency. Next, mutual information is used to choose the particular topic words that show how the topic changed during the week. At the same time, rule-based emotion classification is applied to judge the emotion of the micro-blogs that contain the topic words. This paper demonstrates the feasibility of the emotion-based HITS algorithm and finds that the topic in the corpus lasts only two days. It also finds that micro-blog users show emotions of praise and criticism, which express opinion, rather than emotions of happiness and sadness, which express feeling.
    Key words: hot event detection; opinion analysis; emotion based HITS; Japan earthquake
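For readers unfamiliar with HITS on a bipartite graph, here is a minimal sketch under stated assumptions: words act as hubs, emotion categories act as authorities, and edge weights are co-occurrence counts. The function name `bipartite_hits` and the weighting scheme are illustrative choices, not details taken from the paper.

```python
import math


def bipartite_hits(edges, iterations=50):
    """Run HITS on a weighted word-emotion bipartite graph.

    edges: dict mapping (word, emotion) -> positive weight.
    Returns (word_scores, emotion_scores), each L2-normalized.
    """
    words = {w for w, _ in edges}
    emotions = {e for _, e in edges}
    hub = {w: 1.0 for w in words}
    auth = {e: 1.0 for e in emotions}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values()))
        return {k: v / norm for k, v in scores.items()}

    for _ in range(iterations):
        # An emotion is scored by the words that point to it.
        auth = {e: 0.0 for e in emotions}
        for (w, e), weight in edges.items():
            auth[e] += weight * hub[w]
        auth = normalize(auth)
        # A word is scored by the emotions it points to.
        hub = {w: 0.0 for w in words}
        for (w, e), weight in edges.items():
            hub[w] += weight * auth[e]
        hub = normalize(hub)
    return hub, auth
```

Ranking words by the returned hub score, optionally combined with frequency, gives one plausible reading of the selection step described in the abstract.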
  • Review
    LI Piji1, MA Jun1, ZHANG Dongmei2, HAN Xiaohui1
    2012, 26(5): 14-20.
    There are usually millions of comments for an entity (e.g. a shop or a product). Extracting concise and useful information to describe the entity is a challenging issue. This paper proposes a method to extract tags without semantic redundancy. First, we perform word segmentation, POS tagging and dependency parsing on all the comments. Then, we extract tags according to the dependency relations and explicitly remove semantically duplicate tags. Finally, we map all the tags to an independent semantic space via K-Means and Latent Dirichlet Allocation (LDA), and rank the tag list according to topic confidence. The experimental results show that our method extracts tags accurately and with semantic independence.
    Key words: opinion mining; topic model; semantic independent; tag extraction; ranking
  • Review
    ZHAO Honggai, LV Xueqiang, SHI Shuicai, ZHENG Li
    2012, 26(5): 20-26.
    Correct identification of the phrases in a query log plays an important role in constructing a search-engine-oriented phrase dictionary and in improving search performance. The paper adopts conditional random fields to identify phrases of “N+V” structure and phrases of “N1+N2+V” structure in a search engine query log, namely the Sogou log. The features for the model are composed of word types, part-of-speech features and word length features. From these manually designed candidate feature sets, the effective features are selected to build the final conditional random fields. The experimental results of closed tests and open tests show that the approach identifies the two types of phrases well.
    Key words: conditional random fields; query logs; the phrases of “N1+N2+V” structure; the phrases of “N+V” structure; features templates
  • Review
    CAO Lei1,2, GUO Jiafeng1, BAI Lu1,2, CHENG Xueqi1
    2012, 26(5): 26-33.
    Named entity mining from a query log aims to mine a list of named entities of a specific type from the query log. Previous work proposed a seed-based method that ranks candidate entities by the similarity between the template distribution of the specified class and that of the entities. However, it does not take into account the ambiguity of named entities, the polysemy of templates or the unlabeled data. In this paper, we propose a semi-supervised topic model that leverages the relationship between templates (i.e. the co-occurrence between templates) to learn the template distribution of the specified class and thereby improve entity ranking. Experimental results show the effectiveness of the proposed method.
    Key words: query log; named entity mining; Semi-supervised Topic Model
  • Review
    ZHU Yadong1,2, ZHANG Cheng1, YU Xiaoming1, CHENG Xueqi1
    2012, 26(5): 33-40.
    Effective analysis of user query structure helps in understanding the user's intent and improving the performance of a Web search engine. This paper proposes a straightforward and effective method for analyzing user query structure based on PMI (pointwise mutual information). The method consists of an off-line training algorithm based on MapReduce and a bottom-up online building method for query analysis. The experimental results show that our approach achieves a high segmentation speed while maintaining segmentation performance comparable to that of other approaches. An experiment on the TREC WT10g dataset further validates the effectiveness of our method and shows that it improves the search results in terms of MAP, P@5 and P@10.
    Key words: query structure analysis; MapReduce; online query analysis tree
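To make the PMI idea concrete, the sketch below estimates PMI for adjacent word pairs from a small query list and splits a query wherever the PMI of a neighboring pair falls below a threshold. This is a simplified flat segmentation, not the MapReduce training or the bottom-up analysis tree described in the abstract; `pmi_table`, `segment` and the threshold value are assumptions for illustration.

```python
import math
from collections import Counter


def pmi_table(queries):
    """Estimate PMI (base 2) for each adjacent word pair seen in `queries`."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        tokens = query.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log2(
            (count / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni))
        )
        for (a, b), count in bigrams.items()
    }


def segment(query, pmi, threshold=0.0):
    """Split `query` between adjacent words whose PMI is below `threshold`.

    Pairs never seen in training are treated as having PMI of minus infinity.
    """
    tokens = query.split()
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for a, b in zip(tokens, tokens[1:]):
        if pmi.get((a, b), float("-inf")) >= threshold:
            segments[-1].append(b)
        else:
            segments.append([b])
    return [" ".join(seg) for seg in segments]
```

In a real system the counts would come from a large query log, and the threshold would be tuned on held-out segmentations.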
  • Review
    XIONG Daping, WANG Jian, LIN Hongfei
    2012, 26(5): 40-46.
    While traditional question answering (QA) systems find the answer to a question directly, without user interaction, community-based QA (CQA) systems draw on large available QA archives. The paper proposes a new retrieval framework based on LDA topics that finds similar questions according to statistical, semantic and theme information. Experiments on question-answer threads from Yahoo! Answers show that our method achieves good performance.
    Key words: questions similarity; LDA theme model; community question answer; similarity calculation
  • Review
    YANG Sichun1,2, GAO Chao3, QIN Feng2, DAI Xinyu1, CHEN Jiajun1
    2012, 26(5): 46-53.
    To alleviate the heavy computation cost of feature extraction for question classification, a new feature model is proposed in which basic features and bag-of-words binding features are integrated. First, the basic features (such as bag-of-words, part of speech and word sense) are extracted together with their corresponding binding features; these two types of features are then integrated into a more effective feature set. Experimental results with an SVM classifier on the Chinese question set provided by Harbin Institute of Technology indicate that the new feature model, which is simple and much cheaper to compute, effectively compensates for the syntactic and semantic insufficiency of basic features and binding features and further improves classification accuracy.
    Key words: question answering system; question classification; feature model; bag-of-words binding
  • Review
    NIU Shuzi, CHENG Xueqi, GUO Jiafeng
    2012, 26(5): 53-59.
    Learning to rank is one of the most attractive areas in information retrieval. Much attention has been paid to the robustness of ranking algorithms against the noise that is inevitable in the training set. Previous work observed that the ranking performance of the same algorithm can show totally different noise sensitivities. The performance degradation of ranking models comes down to the training set, so the underlying reason for the different sensitivities lies in some attribute of the training data. Experimental results on LETOR3.0 suggest that if the document pairs of a training set are more widely dispersed, the model learned from that training set is less influenced by erroneous document pairs, and the training set is thus less sensitive to noise.
    Key words: learning to rank; data quality; noise sensitivity
  • Review
    SONG Ling1, LV Qiang2, DENG Wei3, LV Xiaolin4
    2012, 26(5): 59-65.
    XML is a markup language that has emerged as the most relevant standardization effort for document representation and exchange on the Web. Similarity measures for XML documents play an important role in personalized recommendation and information retrieval. This paper proposes XMLSim, a novel approach to computing semantic and structural similarity between XML documents. First, a similarity between node tags is defined based on semantic similarity and string similarity. After the partial relationships among node tags are analyzed, path similarity is abstracted as a Maximal Similar Subsequence (MSS) problem. NPathSim is obtained by solving MSS with dynamic programming. Finally, XMLSim is the average of the best NPathSim values among the path sets.
    Key words: XML similarity; dynamic programming; semantics and structure
  • Review
    WANG Hongliang1, ZHAO Li1,2
    2012, 26(5): 65-72.
    In many complicated real-world applications of the traditional clonal selection algorithm (CLONALG), the affinity/fitness has to be approximated because of its heavy computation cost. To address this issue, we propose a Hoeffding clonal selection algorithm (H-CLONALG) based on the Hoeffding bound, which guarantees the optimal or a near-optimal solution with a determinate probability. The experimental results show that, at the same level of accuracy, the H-CLONALG algorithm finds the solution more efficiently than traditional CLONALG algorithms.
    Key words: clonal selection algorithm; Hoeffding bound; associative classification
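The abstract does not say how H-CLONALG applies the bound. As a sketch of the standard two-sided Hoeffding inequality only, assume an affinity is estimated as the mean of n independent scores bounded within a range of width R: then the estimate deviates from its expectation by more than epsilon with probability at most 2·exp(−2·n·epsilon²/R²). The function names below are illustrative.

```python
import math


def hoeffding_epsilon(n, delta, value_range=1.0):
    """Error bound that holds with probability at least 1 - delta after n samples."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(epsilon, delta, value_range=1.0):
    """Smallest n for which the two-sided Hoeffding bound gives error <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For example, `samples_needed(0.1, 0.05)` gives the number of bounded samples after which the mean is within 0.1 of its expectation with probability at least 0.95, which is how such a bound can justify stopping an expensive evaluation early.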
  • Review
    WANG Yingying1, BAI Yu1, DING Changlin1, DAI Jianyu2, CAI Dongfeng1
    2012, 26(5): 72-79.
    The construction of knowledge resources plays an important role in language information processing, and building a knowledge base for the fundamental theory of traditional Chinese medicine (TCM) is basic work for medical literature processing and semantic computing. This paper analyzes the characteristics of theoretical terms in TCM and proposes a method to construct a knowledge base of the basic theory of TCM with KDML, drawing on HowNet. It also describes the selection and acquisition of primitives in the process of knowledge building.
    Key words: knowledge base; KDML; primitive selection; relationship acquisition
  • Review
    LU Jun, HONG Yu, LU Jianjiang, YAO Jianmin, ZHU Qiaoming
    2012, 26(5): 79-88.
    Reviews reflect the value of things. From the customer’s point of view, we propose a novel method for automatically evaluating the quality of product reviews based on global user intent. We first divide reviews into two opposing groups, useful reviews and spam reviews, and on the basis of this definition attempt to realize a proactive approach. We use an SVM classifier to classify review quality, treating the task as a typical binary classification that takes three extra kinds of features into consideration: product popularity information, reviewers’ opinions and review credibility. We combine text structure features with these three kinds of features, which reflect global user intent, and test on a large-scale corpus of product reviews. The experimental results show a significant improvement in overall accuracy when diverse user intent features are included.
    Key words: quality of reviews; attribute extraction; opinion mining; review credibility
  • Review
    LIU Wenfei, LIN Hongfei
    2012, 26(5): 88-94.
    In computational advertising, returning more relevant ads for a web query is a fundamental issue. Because web queries are short and ads are also short, containing 15-20 bid phrases each on average, it is very difficult to return relevant ads that meet users' needs. In this paper, we propose a query expansion approach based on feature fusion to solve the problem. We use the web search results initially returned for the query to create a pool of relevant documents. To avoid the topic drift of ordinary query expansion algorithms, which rely on simple features and lack semantic information, we combine the co-occurrence of expansion terms and query terms in the web search results with the traditional TF feature and part-of-speech information. Results on an authentic ads dataset show that the query expansion approach based on feature fusion returns more relevant ads.
    Key words: sponsored search; query expansion; term co-occurrence
  • Review
    NIU Xiaofei1,2,3, MA Jun1, MA Shaoping3, ZHANG Dongmei1,2
    2012, 26(5): 94-101.
    Web spam detection is a challenging issue for web search engines. This paper proposes a Genetic Programming-based ensemble learning approach (GPENL) to detect web spam. First, the method obtains t different training sets by under-sampling the original training set. Then, c different classification algorithms are applied to the t training sets to obtain t*c base classifiers. Finally, Genetic Programming is used to find a way of integrating the t*c base classifiers. The new method not only merges under-sampling and ensemble learning to improve classification performance on imbalanced datasets, but also conveniently integrates different types of base classifiers. Experiments on WEBSPAM-UK2006 show that this method improves classification performance whether or not the base classifiers belong to the same type, and that in most cases heterogeneous classifier ensembles work better than homogeneous ones. GPENL also achieves a higher F-measure than AdaBoost, Bagging, RandomForest, Vote, the EDKC algorithm and the method based on Prediction Spamicity.
    Key words: web spam; ensemble learning; genetic programming; classification on the imbalanced dataset
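The first step above, drawing t training sets by under-sampling, can be sketched as follows. This assumes binary labels 0/1 and balanced subsets that keep every minority example; the abstract does not state the sampling ratio, so that choice and the name `undersample` are assumptions. Training the c classifiers and the Genetic Programming combination are not shown.

```python
import random


def undersample(samples, labels, t, seed=0):
    """Draw t balanced training sets from an imbalanced binary dataset.

    Each set keeps all minority-class examples and an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    subsets = []
    for _ in range(t):
        chosen = minority + rng.sample(majority, len(minority))
        subsets.append([(samples[i], labels[i]) for i in chosen])
    return subsets
```

Fitting each of c classification algorithms on each returned subset would then yield the t*c base classifiers that the abstract mentions.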
  • Review
    HE Zhiming1, WANG Lihong2, ZHANG Gang1, CHENG Xueqi1
    2012, 26(5): 101-107.
    A large amount of link-based spam, such as link farms, link exchanges and golden links, has a huge impact on the traditional PageRank algorithm. This paper proposes a new PageRank algorithm named the Three Stages PageRank algorithm (TSPageRank), which can resist link spam to a certain extent. Experiments show that TSPageRank improves on the result of PageRank by 59.4%. TSPageRank can increase the PR of useful and authoritative pages and decrease the PR of spam and rubbish pages.
    Key words: search engine spam; PageRank algorithm; link farm
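The three stages of TSPageRank are not described in the abstract. For reference, the baseline it modifies can be sketched as the standard power-iteration PageRank below; the damping factor of 0.85 and the uniform redistribution of rank from pages without out-links are conventional choices assumed here.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Standard PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict of scores that sum to 1.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, outs in links.items():
            for v in outs:
                new_rank[v] += damping * rank[u] / len(outs)
        rank = new_rank
    return rank
```

A link farm inflates a target page's score by adding many pages whose only role is to appear as `u` with the target as `v` in the inner loop, which is the weakness a spam-resistant variant has to address.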
  • Review
    YU Pingfang1,2, DU Jiali3,4
    2012, 26(5): 107-114.
    Natural language processing is a branch of computational linguistics that analyzes and understands natural language by means of computer technology. The NS flowchart has the structural features of choice-algorithm parsing, and the well-formed substring table (WFST) can store multiple structures during parsing. A garden path sentence is a special syntactic model in which a processing breakdown occurs during syntactic processing, forcing reconstruction of the original model. The NS algorithm-based WFST is useful for the syntactic parsing of special phenomena (e.g. garden path sentences) in natural language processing, which makes it possible to use this programming method in language applications.
    Key words: natural language processing; well-formed substring table; NS flowchart; computational linguistics; garden path sentence
  • Review
    JIANG Yuru1,3, SONG Rou1,2
    2012, 26(5): 114-120.
    Chinese machine translation and information extraction are still far from satisfactory. One important reason is that topics are often omitted at the head of a Chinese Punctuation Clause (abbreviated as PClause). Based on the Generalized Topic Theory, this paper proposes a novel method for topic clause identification from PClauses based on the characteristics of topic structure. In practice, the method consists of two tasks: topic clause identification from a single PClause and topic clause construction for a series of PClauses. For the first task, semantic generalization and edit distance are applied, and the accuracy on the open test is 12.51% higher than the baseline. The result demonstrates the effectiveness of the generalized topic theory in topic clause identification from a single PClause.
    Key words: punctuation clause; generalized topic; discourse structure; topic clause; topic clause identification
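The abstract names edit distance as one ingredient without saying how it is used. The sketch below is the standard Levenshtein distance with unit costs, written so that it works on strings or on token sequences; applying it to word or category sequences is an assumption made here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,             # delete x
                    cur[j - 1] + 1,          # insert y
                    prev[j - 1] + (x != y),  # substitute, free when equal
                )
            )
        prev = cur
    return prev[-1]
```

Replacing words by generalized semantic categories before comparison would let two clauses with different wording but similar structure come out close under this distance.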
  • Review
    ZHOU Qiaoli, LIU Xin, LANG Wenjing, CAI Dongfeng
    2012, 26(5): 120-129.
    Chunking comprises the identification and labeling of chunks; it reduces the difficulty of complete syntactic parsing by segmenting a sentence into small chunks. To reduce the complexity of chunking long sentences, this paper describes a divide-and-conquer strategy. The basic idea is to first recognize the maximal noun phrases (MNPs) in a full sentence, and then identify the chunks within the MNPs and in the frame of the sentence outside the MNPs. Experiments are carried out on UPenn Chinese Treebank-4 (CTB4), and the results show that the best overall F1 score for Chinese chunking is 91.79%, which is higher than that of state-of-the-art machine learning models.
    Key words: Chinese chunking; divide-and-conquer; complete syntactic parsing; maximal noun phrase; conditional random fields; support vector machines