2011 Volume 25 Issue 1 Published: 15 February 2011
  

  • Review
    KONG Weize, LIU Yiqun, ZHANG Min, MA Shaoping
    2011, 25(1): 3-9.
    Community Question Answering (CQA) is becoming increasingly important for Web users' information access. However, CQA content quality varies dramatically, from excellent to abusive and spam. This work investigates methods for answer quality analysis in CQA, focusing on Baidu Knows, the largest Chinese CQA portal on the Web. A large-scale corpus has been constructed by collecting data from the portal, and three new kinds of features are proposed: sequence-based features, features at the granularity of the question, and Baidu Knows-specific user-based features. To separate high-quality answers from the rest, a learning-based classification method combines the proposed features with traditional textual and link-based features. Experimental results show that the proposed features are effective in improving performance. Moreover, this answer quality analysis framework achieves high accuracy in predicting best answers.
    Key words: community question answering; quality analysis
  • Review
    JIANG Zaifan1,2, WANG Bin1
    2011, 25(1): 9-15.
    Personal Information Retrieval (PIR) is an important technology for users searching files on their own computers. Compared with Web retrieval, the information that can be used by PIR is very limited, which makes personal information retrieval a very difficult problem. In this paper, we collect user behavior information and use it to conduct an in-depth study of the ranking problem in PIR. The user behavior information includes the user's search information and file-access information. We use the search information to obtain training data and the file-access information to compute file weights, and then apply a statistical learning method to learn the ranking function; a minimal sketch of this pairwise ranking idea follows this entry. Experimental results show that our method performs better than the traditional TF-IDF ranking method.
    Key words: user behavior; personal information retrieval; statistical learning; ranking SVM
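The abstract above describes learning a ranking function from search and file-access behavior with a ranking SVM. A minimal sketch of the standard pairwise reduction behind that idea follows; the feature names (TF-IDF score, file-access weight, recency), the toy data, and the use of a plain linear SVM on difference vectors are illustrative assumptions, not the paper's exact setup.

```python
# Pairwise (ranking-SVM-style) reduction: turn graded relevance into a
# binary classification problem over feature-difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

def to_pairwise(X, y):
    """Build difference vectors for every pair with unequal relevance."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                Xp.append(X[i] - X[j]); yp.append(1)
                Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Each row: [TF-IDF score, file-access weight, recency] -- hypothetical features.
X = np.array([[0.9, 0.7, 0.2],
              [0.4, 0.1, 0.8],
              [0.6, 0.9, 0.5]])
y = np.array([2, 0, 1])  # relevance grades inferred from search behavior

Xp, yp = to_pairwise(X, y)
model = LinearSVC(C=1.0).fit(Xp, yp)
scores = X @ model.coef_.ravel()  # rank files by the learned linear score
print(scores.argsort()[::-1])     # indices in descending rank order
```

The pairwise trick is what lets an off-the-shelf linear classifier stand in for a dedicated ranking learner.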
  • Review
    LIU Quansheng, YAO Tianfang
    2011, 25(1): 15-20.
    In this paper, we present an opinion retrieval algorithm that retrieves opinions for a given topic according to the relevance between the topic and its expansions, between the topic and the sentiments, and so on. It is based on information retrieval theory, sentiment analysis, and other researchers' work on opinion retrieval. The algorithm uses relevance to measure how elements such as topic expansions, texts, and in-text sentiments relate to the topic when finding opinions, theoretically integrating the influence among all these elements. Experimental results on the COAE2008 datasets and queries show that the algorithm is effective and achieves a higher score than other opinion retrieval methods.
    Key words: opinion retrieval; relevance; text mining
  • Review
    CAO Peng1,2, LI Jingyuan1, MAN Tong1,2, LIU Yue1, CHENG Xueqi1
    2011, 25(1): 20-28.
    Microblogging is a fairly new Web 2.0 concept. The most important microblog system in use is Twitter, with more than 160 million users all over the world. Twitter is now one of the most influential voices on the globe, its users including celebrities, well-known politicians, and first-class companies. Twitter messages are short, and their contents are often informal in syntax or grammar. Moreover, Twitter does not strictly define the syntax of retweets, which leads to a great number of near-duplicate messages. These near-duplicate messages waste storage resources and can greatly degrade the Twitter user experience. In this paper, the syntax of retweet messages is analyzed, and a method is presented to remove retweet symbols from messages using the analysis results. In addition, two text-distance measures, character statistics and shortest edit distance, are proposed to cluster Twitter messages into groups of near duplicates; a minimal sketch of this grouping idea follows this entry. We also analyze the log-in method and the characteristics of Twitter messages. Through a series of experiments, we show that our methods are efficient, extensible, and easy to implement, and can be used to discover and filter near-duplicate messages in microblogs.
    Key words: microblog; Twitter; near duplicate message
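As a rough illustration of the pipeline above, the sketch below strips common retweet markers and then groups messages by a normalized similarity ratio (one minus a normalized edit distance). The regex, the threshold, and the greedy single-pass grouping are assumptions for illustration; the paper's actual retweet rules and distance measures may differ.

```python
# Strip retweet markers, then greedily group near-duplicate messages.
import re
from difflib import SequenceMatcher

RT_PATTERN = re.compile(r'(?:\bRT\b|\bvia\b)\s*@\w+[:,]?\s*', re.IGNORECASE)

def normalize(msg):
    """Remove assumed retweet syntax and case-fold."""
    return RT_PATTERN.sub('', msg).strip().lower()

def near_duplicates(messages, threshold=0.85):
    """Greedy single-pass grouping by similarity ratio."""
    groups = []
    for msg in map(normalize, messages):
        for group in groups:
            if SequenceMatcher(None, msg, group[0]).ratio() >= threshold:
                group.append(msg)
                break
        else:
            groups.append([msg])
    return groups

tweets = ["Breaking: big news today!",
          "RT @alice: Breaking: big news today!",
          "Completely different message."]
print(near_duplicates(tweets))  # the first two fall into one group
```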
  • Review
    MA Yunlong, LIN Yuan, LIN Hongfei
    2011, 25(1): 28-35.
    As an important technique in information retrieval, traditional query expansion uses pseudo-relevant documents as the candidate word set; however, some pseudo-relevant documents are not highly relevant. In our work, a query-click graph is built from the query log of a real search engine. A term relationship graph, obtained through several transformations, reflects the direct relationships among terms. We propose a weight-normalization-based SimRank approach, a revised SimRank algorithm, for query expansion; a minimal sketch of the underlying SimRank iteration follows this entry. To reduce the computational complexity of SimRank, strategies such as pruning are used to optimize the algorithm. Experiments on a large real AOL search engine query log and a standard TREC corpus show that our approach can discover high-quality expansion terms effectively. The MAP of our approach is 1.81% higher than query expansion based on pseudo relevance feedback, 5.44% higher on P@10, and 3.73% higher on P@20.
    Key words: search engine; query expansion; query logs; SimRank
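For reference, a minimal implementation of the classic SimRank iteration on a small term graph; the toy graph, the decay factor C, and the iteration count are illustrative, and the paper's weight-normalized variant and pruning optimizations are not reproduced here.

```python
# Classic SimRank: two nodes are similar if their in-neighbours are similar.
def simrank(graph, C=0.8, iters=5):
    """graph: node -> set of in-neighbours. Returns pairwise similarities."""
    nodes = list(graph)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ins_a, ins_b = graph[a], graph[b]
                if not ins_a or not ins_b:
                    new[a][b] = 0.0
                    continue
                total = sum(sim[x][y] for x in ins_a for y in ins_b)
                new[a][b] = C * total / (len(ins_a) * len(ins_b))
        sim = new
    return sim

# Toy term graph (in-neighbour sets): both terms are reached from "vehicle".
g = {'car': {'vehicle'}, 'auto': {'vehicle'}, 'vehicle': set()}
s = simrank(g)
print(round(s['car']['auto'], 2))  # high similarity via the shared in-neighbour
```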
  • Review
    FANG Qi, LIU Yiqun, ZHANG Min, RUN Liyun, MA Shaoping
    2011, 25(1): 35-41.
    A session in a Web access log denotes a continuous-time sequence of a user's Web browsing behavior. A topic of a session represents a hidden browsing intent of a Web user. Identifying topic-based log units within a session is a fundamental task. Existing work mainly focuses on detecting boundaries without considering the common situation in which different topics overlap within one session. In this paper, we first re-define the concepts of session and topic, and then propose the task of largest segmentation. We further design a session topic identification algorithm based on the crowd wisdom of Web users. The effectiveness of the algorithm is validated by experiments performed on large-scale real Web access logs.
    Key words: session topic identification; Web access log
  • Review
    DIAO Yufeng, YANG Liang, LIN Hongfei
    2011, 25(1): 41-48.
    As is well known, blogs have become one of the main information sources on the Internet, and opinion spam is also growing rapidly in the blogosphere. This paper focuses on identifying opinion spam. First, it adopts methods from email spam identification; considering the characteristics of blogs, it establishes comment rules to filter opinion spam. It then applies the Latent Dirichlet Allocation (LDA) model to extract topic information from blog text content. Finally, integrating the topic information, it judges whether an opinion is spam; a minimal sketch of this topic-similarity check follows this entry. Experiments show it can identify most spam opinions, effectively bringing users more accurate and efficient blog information.
    Key words: blog; blog content; LDA; topic; opinion spam
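A minimal sketch of the topic-similarity step: fit an LDA model over the post and its comments, then flag comments whose topic distribution diverges from the post's. The tiny corpus, the two-topic model, the cosine measure, and the 0.5 threshold are all illustrative assumptions layered on top of the rule-based filtering the abstract describes.

```python
# Flag comments whose LDA topic mixture diverges from the post's.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

post = "new phone camera review battery screen quality"
comments = ["battery life and screen are great",
            "cheap watches buy now best price discount"]

vec = CountVectorizer()
X = vec.fit_transform([post] + comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)  # per-document topic distributions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for comment, t in zip(comments, theta[1:]):
    sim = cosine(theta[0], t)  # compare comment topics with post topics
    print(f"{sim:.2f}", "spam?" if sim < 0.5 else "on-topic", "-", comment)
```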
  • Review
    LI Yuqin1, SUN Lihua2
    2011, 25(1): 48-54.
    Hot words are a network phenomenon that reflects popular feelings and topics at a particular time and place. In this paper, two key technologies of hot-word analysis are discussed: hot-word discovery and hot-word association. In the discovery phase, we first apply named entity recognition and high-frequency phrase statistics to mine candidate strings, and then compute hot-word weight on the basis of weight and weight fluctuations. For hot-word association, associations are derived from differences in hot-word weight values, and hot-word relationships are computed on the principle of co-occurrence rate. The technology has been successfully applied in the hot-word discovery module, which is part of the TRS public sentiment monitoring system.
    Key words: hot words; named entity identification; hot degree computing; weight fluctuations; word relationships
  • Review
    PENG Zeying1, YU Xiaoming1, XU Hongbo1, LIU Chunyang2
    2011, 25(1): 54-60.
    Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Many clustering algorithms have been proposed so far. With the rapid development of the Internet, short texts such as query logs and Twitter messages play a more and more important role in our daily life. Most existing clustering methods are difficult to apply to this kind of information due to the huge scale of the data. This paper reveals the long-tail distribution of such information and proposes an incomplete clustering algorithm; a minimal sketch of the idea follows this entry. The experimental results show that the proposed method can cluster short texts effectively and efficiently.
    Key words: short texts; clustering; incomplete clustering
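A minimal sketch of one reading of "incomplete clustering" under a long-tail distribution: head-like texts seed clusters and merge on token overlap, while tail texts are deliberately left unclustered. The overlap rule and length heuristic are assumptions, not the paper's algorithm.

```python
# Single-pass incomplete clustering: not every item is forced into a cluster.
def incomplete_cluster(texts, min_overlap=2):
    clusters = []   # each cluster: (token set of seed, list of member texts)
    leftovers = []  # tail texts never assigned to any cluster
    for text in texts:
        tokens = set(text.lower().split())
        for seed, members in clusters:
            if len(tokens & seed) >= min_overlap:
                members.append(text)
                break
        else:
            # Start a new cluster only for "head-like" (longer) texts;
            # very short tail texts are left out entirely.
            if len(tokens) >= 3:
                clusters.append((tokens, [text]))
            else:
                leftovers.append(text)
    return [m for _, m in clusters], leftovers

queries = ["cheap flight tickets to paris",
           "flight tickets paris deals",
           "weather", "news"]
groups, tail = incomplete_cluster(queries)
print(groups)   # the two flight queries merge
print(tail)     # singleton tail queries stay unclustered
```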
  • Review
    LIU Zhenlu1, WANG Daling1,2, FENG Shi1, ZHANG Yifei1,2, FANG Donghao1
    2011, 25(1): 60-66.
    This paper applies the LDA model to analyze the latent semantics of documents and partitions the semantic space into low-, middle-, and high-frequency subspaces. The semantics in the low-frequency subspace are used to detect outlier Web documents; those in the middle- and high-frequency subspaces serve as document features for clustering. The quality of the clustering results is improved by a mutual-action mechanism between document clusters and semantics. Compared with related work, this paper not only applies the LDA model to represent documents, but also analyzes the semantic distribution in depth and applies the results of this analysis to Web document clustering. Experiments show that the proposed clustering algorithm, based on the mutual action between LDA-based document clusters and semantics, achieves better document clustering results.
    Key words: LDA; latent semantics; semantic distribution; document clustering
  • Review
    HAN Zhongyuan1, LI Sheng1, QI Haoliang2, YANG Muyun1
    2011, 25(1): 66-71.
    Data sparseness is a non-trivial issue for language-model-based information retrieval methods. This paper proposes a Neighbourhood Language Model that alleviates the issue by employing the neighbour information of a document to smooth its word distribution; one plausible form of such smoothing is sketched below. Tested on the DOE and WSJ portions of the TREC data, the results show that the Neighbourhood Language Model can improve information retrieval performance.
    Key words: information retrieval; language model; neighbourhood information
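One plausible form of such neighbourhood smoothing, written as a Jelinek-Mercer-style interpolation; this exact formula and the symbols $\lambda$, $\mu$, and $N(d)$ are assumptions for illustration, not taken from the paper:

```latex
% Interpolate the document model with a neighbourhood model and the
% collection model (assumed form, not the paper's exact formula).
\[
  p(w \mid d) \;=\; \lambda\, p_{\mathrm{ml}}(w \mid d)
  \;+\; \mu\, \frac{1}{|N(d)|} \sum_{d' \in N(d)} p_{\mathrm{ml}}(w \mid d')
  \;+\; (1 - \lambda - \mu)\, p(w \mid C),
  \qquad \lambda + \mu \le 1,
\]
% where $p_{\mathrm{ml}}$ is the maximum-likelihood estimate, $N(d)$ is the
% set of neighbour documents of $d$, and $C$ is the whole collection.
```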
  • Review
    FENG Yanhui, HONG Yu, YAN Zhenxiang, YAO Jianmin, ZHU Qiaoming
    2011, 25(1): 71-79.
    A new approach is developed for acquiring bilingual web pages from search engine result pages, which involves two challenging tasks. The first is to automatically detect the Web records embedded in result pages via a clustering method applied to a sample page. Identifying these useful records through clustering allows highly effective features to be generated for the second task, high-quality bilingual web page acquisition, which is cast as a classification problem. One advantage of our approach is that it is independent of the search engine and the domain. Tested on 2 516 records extracted automatically from six search engines and annotated manually, the approach achieves a high precision of 81.3% and a recall of 94.93%. The experimental results indicate that our approach is very effective.
    Key words: web mining; bilingual web pages; parallel corpora
  • Review
    QIU Likun1,2, SHAO Yanqiu2
    2011, 25(1): 79-85.
    The “parallel and general principle” of Baoya Chen (1999) focuses on differentiating words from phrases. Under this principle, words and phrases are classified into three kinds: both parallel and general, parallel but not general, and neither parallel nor general. The first kind should not be taken as a word. Since most semantic lexicons have not conformed to the principle when classifying a unit as word or phrase, we can acquire many rules that follow the principle. Given two semantic lexicons, we can induce two sets of rules, each with its own positive examples and counterexamples. If a counterexample in one lexicon is also a positive example in the other, it usually means there is an improper categorization in the former lexicon. Based on this idea, an automatic detection method is proposed. Experimental results show the effectiveness of this method.
    Key words: parallel and general principle; semantic lexicon; categorization; automatic detection
  • Review
    LU Yuqing, HONG Yu, LU Jun, YAO Jianmin, ZHU Qiaoming
    2011, 25(1): 85-91.
    Context-dependent word errors in English texts are errors in which one word is incorrectly substituted by another correctly spelled word. This paper mainly discusses how to detect this type of error. We propose to first extract syntactic and semantic features from the word context, and then select features by Document Frequency and Information Gain. Further, we model the detection task as classification using the Winnow algorithm; a minimal sketch of the Winnow update follows this entry. With five-fold cross validation on 61 groups of confusion sets, the average precision and recall are 96% and 79.47%, respectively.
    Key words: context-dependent word error; feature selection; confusion sets; Winnow algorithm
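A minimal sketch of the classic Winnow update used for such confusion-set classification: mistake-driven multiplicative promotion and demotion over binary context features. The feature indices and toy data are illustrative assumptions.

```python
# Winnow: multiplicative weight updates over binary context features,
# deciding between a confusion pair (e.g. "their"/"there").
def winnow_train(examples, n_features, alpha=2.0):
    """examples: list of (active_feature_indices, label in {0, 1})."""
    w = [1.0] * n_features
    theta = float(n_features)  # standard threshold = number of features
    for features, label in examples:
        score = sum(w[i] for i in features)
        pred = 1 if score >= theta else 0
        if pred != label:  # mistake-driven update: promote or demote
            factor = alpha if label == 1 else 1.0 / alpha
            for i in features:
                w[i] *= factor
    return w, theta

# Each example: indices of context features active around the target word.
data = [([0, 1], 1), ([2, 3], 0), ([0, 4], 1), ([2, 5], 0)] * 10
w, theta = winnow_train(data, n_features=6)
print(sum(w[i] for i in [0, 1]) >= theta)  # True: predicts class 1
```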
  • Review
    ZHANG Yue, YU Haomin, ZHANG Qi, HUANG Xuanjing
    (Fudan University, School of Computer Science and Technology, Shanghai 201203, China)
    2011, 25(1): 91-98.
    How to effectively detect near-duplicate documents in large corpora has been a hot topic in recent years. Near-duplicate detection algorithms usually use an inverted index to improve their efficiency. However, as the corpus size increases, a single-machine implementation of the index structure becomes intractable, so a distributed index structure is required. To handle rapidly growing data sizes, distributed index structures must offer both high efficiency and scalability. In this paper, we compare two different distributed index structures, Term-Split Index and Doc-Split Index, and provide their Map-Reduce implementations; a minimal sketch of the term-split idea follows this entry. Based on these two index structures, we propose two approaches, the Term-Split Approach and the Doc-Split Approach, to detect near-duplicate documents under the Map-Reduce paradigm. Finally, we compare the performance of the two approaches on the WT10G corpus. Experimental results show that the Doc-Split Approach is more efficient and has better scalability.
    Key words: near duplicate detection; copy detection; Map-Reduce
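A minimal sketch contrasting the two partitioning ideas with plain Python map and reduce functions (no actual Hadoop job); the toy documents and the pair-generation step are illustrative assumptions.

```python
# Term-Split: partition the inverted index by term, so each reducer owns
# the full posting list of its terms and can emit candidate pairs.
from collections import defaultdict

docs = {"d1": "near duplicate detection", "d2": "duplicate document detection"}

def term_split_map(doc_id, text):
    """Mapper: emit (term, doc_id) for each distinct term."""
    for term in set(text.split()):
        yield term, doc_id

def term_split_reduce(term, doc_ids):
    """Reducer: documents sharing this term are candidate near-duplicates."""
    postings = sorted(doc_ids)
    pairs = [(a, b) for i, a in enumerate(postings) for b in postings[i + 1:]]
    return term, pairs

# Simulate the shuffle phase: group mapper output by key (the term).
shuffled = defaultdict(list)
for doc_id, text in docs.items():
    for term, value in term_split_map(doc_id, text):
        shuffled[term].append(value)

for term, doc_ids in shuffled.items():
    print(term_split_reduce(term, doc_ids))
# Doc-Split would instead partition by document: each reducer holds whole
# documents and compares the candidate pairs routed to it.
```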
  • Review
    YANG Xiaorui, LIN Lei, SUN Chengjie, LIU Bingquan
    2011, 25(1): 98-104.
    With the development of the Internet, a huge amount of high-quality knowledge is embedded in the databases of Web forums, so it is valuable to develop effective retrieval tools for forum information. This paper attempts to find a suitable retrieval model for forum data in order to make full use of forum knowledge and meet the needs of Web users. Given the differences between forum web pages and news web pages, a key post extraction algorithm and a structure reconstruction method for forum threads are proposed to obtain the informative posts in forums. These informative posts are then used to build a retrieval system. Experimental results show that the proposed methods can effectively improve retrieval on forum data.
    Key words: forum retrieval; Ranking Support Vector Machine; key post extraction
  • Review
    MIAO Jia, MA Jun, CHEN Zhumin
    2011, 25(1): 104-110.
    Since blog posts attract many comments containing massive noise, summarizing the content of a blog post together with its comments is a difficult task for many blog applications. Previous work on textual document summarization mostly addresses general multi-document summarization; without taking the particularities of blogs into account, it performs poorly on blog posts with comments. This paper proposes a novel summarization approach for blogs based on the characteristics of blog posts, in which the information in comments is well considered. We first calculate the weights of the comments based on multiple comment features. Then we calculate the weights of the sentences in the blog post based on the HITS model; a minimal sketch of this mutual reinforcement follows this entry. Finally, we select sentences from the blog post according to their weights. An experiment on a dataset from the Ifeng blog shows that our approach outperforms previous work in terms of ROUGE score.
    Key words: automatic document summarization; blog; comment; HITS
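A minimal sketch of the HITS-style mutual reinforcement the abstract describes: comment weights and sentence weights boost each other through a comment-sentence similarity matrix. The toy matrix and the fixed iteration count are illustrative assumptions.

```python
# HITS-style power iteration between comments (hub-like) and sentences
# (authority-like): a sentence scores high if endorsed by weighty comments.
import numpy as np

# sim[i][j]: similarity between comment i and sentence j (e.g. word overlap).
sim = np.array([[0.9, 0.1, 0.0],
                [0.6, 0.3, 0.1],
                [0.0, 0.2, 0.8]])

comment_w = np.ones(sim.shape[0])    # initial comment weights
for _ in range(20):                  # iterate until weights stabilize
    sentence_w = sim.T @ comment_w   # sentences endorsed by strong comments
    sentence_w /= np.linalg.norm(sentence_w)
    comment_w = sim @ sentence_w     # comments that agree with strong sentences
    comment_w /= np.linalg.norm(comment_w)

top = np.argsort(sentence_w)[::-1]
print(top)  # pick the highest-weight sentences for the summary
```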
  • Review
    LUO Yang, JI Duo, ZHANG Guiping, WANG Yingying
    2011, 25(1): 110-116.
    Bilingual resources are important in machine translation and cross-language information retrieval. However, regarding corpus issues such as the authenticity, timeliness, and flexibility of language in use, existing bilingual resources are far from meeting the demands of practical applications. This paper proposes a Web-based bilingual resource mining method based on frequent sequence patterns. The algorithm adopts SVM classification with frequent sequence patterns as features, realizing the selection and identification of single web pages containing bilingual resources. The experimental results indicate that this method can effectively improve the quality of mined bilingual resources.
    Key words: Web mining; Web page classification; bilingual resources; frequent sequence pattern; support vector machine
  • Review
    WANG Xin1, SUN Weiwei2, SUI Zhifang1
    2011, 25(1): 116-123.
    Semantic role labeling (SRL) is an important way to obtain semantic information. Many existing SRL systems make use of full syntactic parses, but due to the low performance of existing Chinese parsers, labeling based on full syntactic parses is still not satisfactory. This paper realizes SRL methods based on shallow parsing. In the shallow parsing stage, word formation is used to obtain fake head morpheme information, which alleviates the data sparseness problem and improves the parser's performance to an F-score of 0.93. In the semantic role labeling stage, word formation is applied to obtain morpheme information about the target verb, which describes word structure at a fine granularity and provides more information for labeling. In addition, this paper proposes a coarse frame feature as an approximation of the sub-categorization information available in full syntactic parsing. The F-score of this semantic role labeling system reaches 0.74, a significant improvement over the best SRL performance reported in the literature (0.71).
    Key words: semantic role labeling; shallow syntactic analysis; morpheme; word formation
  • Review
    YU Haomin, ZHANG Yue, ZHANG Qi, HUANG Xuanjing
    2011, 25(1): 123-129.
    With the explosion of the Internet, enormous amounts of duplicated data cause serious problems for search engines, opinion mining, and many other Web applications. Most existing near-duplicate detection approaches work at the document level and are incapable of finding duplicated parts that form only a small piece of each document. Near-duplicate detection at the sentence level is a key solution to this problem. An effective and efficient feature extraction algorithm named Low-IDF-Sig is proposed in this paper: to represent a given sentence, the algorithm extracts improved shingle features according to selected antecedents; a minimal sketch of one reading of this idea follows this entry. Experimental results on a real corpus show that the proposed method can improve both the precision and the efficiency of near-duplicate detection at the sentence level.
    Key words: near-duplicate detection; feature extraction; Low-IDF-Sig
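A minimal sketch of one reading of Low-IDF-Sig: keep only the shingles anchored at low-IDF (very common) terms, so each sentence yields a small, stable signature that can be compared cheaply. The IDF values, anchor threshold, and shingle length are toy assumptions, not the paper's exact algorithm.

```python
# Keep only shingles that start at a low-IDF anchor term.
def low_idf_signature(sentence, idf, n=3, max_idf=1.0):
    tokens = sentence.lower().split()
    shingles = []
    for i, tok in enumerate(tokens[: len(tokens) - n + 1]):
        if idf.get(tok, 99.0) <= max_idf:      # anchor on common words only
            shingles.append(" ".join(tokens[i : i + n]))
    return set(shingles)

idf = {"the": 0.1, "of": 0.2, "a": 0.3}        # hypothetical low-IDF anchors
s1 = "the quick brown fox jumps over the lazy dog"
s2 = "a quick brown fox leaps over the lazy dog"
sig1, sig2 = low_idf_signature(s1, idf), low_idf_signature(s2, idf)
jaccard = len(sig1 & sig2) / len(sig1 | sig2)
print(round(jaccard, 2))  # shared shingles signal near-duplicate content
```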