2010 Volume 24 Issue 3 Published: 15 June 2010
  

    Review
  • Review
    LI Yanan1,2, XU Sheng1,2, WANG Bin1
    2010, 24(3): 3-11.
    Query recommendation, an important technology in search engines, suggests relevant queries to help users reformulate more accurate queries. Existing query suggestion approaches compute query similarity by directly matching query properties, which makes it hard to find semantically relevant queries that are related only indirectly. In this paper, queries are modeled by a query relation graph in which query similarity is computed using WSimRank, a revised algorithm based on SimRank. WSimRank takes the edge information and global structure of the query relation graph into account so that it can find latent semantic relations between queries. To reduce the high complexity of basic WSimRank on large real-world query relation graphs, this paper recasts WSimRank as a state graph and optimizes it with dynamic programming and pruning. Experiments on large real search engine query logs show that WSimRank outperforms SimRank and other conventional approaches on query suggestion. The MAP of the query suggestions generated by WSimRank reaches nearly 0.9.
    Key words: computer application; Chinese information processing; search engine; query suggestion; SimRank; WSimRank
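    As background for the abstract above, the following is a minimal sketch of basic SimRank, the algorithm that WSimRank revises. It is not the authors' WSimRank: the abstract does not specify how edge information is weighted, so no weighting is attempted here. The function name `simrank`, the `in_neighbors` input format, and the default `decay` and `iterations` values are assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic (unweighted) SimRank by naive fixed-point iteration.

    in_neighbors maps every node to the list of nodes that point to it.
    Every node that appears in a list must also be a key (assumption).
    Returns a dict mapping (a, b) pairs to similarity scores.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar to nothing else.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Toy graph (hypothetical): two queries reached from the same hub node.
toy_graph = {"hub": [], "query_a": ["hub"], "query_b": ["hub"]}
scores = simrank(toy_graph)
print(scores[("query_a", "query_b")])
```

    The naive iteration visits every node pair in every round, which is the cost on large graphs that the abstract says the paper reduces with dynamic programming and pruning.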
  • Review
    WU Dayong, ZHANG Yu, LIU Ting
    2010, 24(3): 11-19.
    Interactive question answering (IQA) is a QA technology that can process a series of coherent questions and interact with users through dialogue. Although it is a hot research topic in QA, IQA has, to our knowledge, received little attention for Chinese. An important problem in a typical IQA system is recognizing the relevance among a series of questions. This paper analyzes how effective different features extracted from questions are for question relevance recognition in Chinese IQA. Based on the key features identified, we experiment with a binary classification model on the TREC QA task question set translated into Chinese, as well as on a real IQA question set. Experimental results show that the proposed method is effective.
    Key words: computer application; Chinese information processing; interactive question answering; question; relevance recognition; binary classification
  • Review
    ZHOU Bo, CEN Rongwei, LIU Yiqun, ZHANG Min, JIN Yijiang, MA Shaoping
    2010, 24(3): 19-24.
    Relevance feedback has been studied in information retrieval research for the past 30 years. It has been shown to be worthwhile in a wide variety of settings, whether actual user feedback is available or the feedback is implicit. Because the applications of relevance feedback and the type of user input to it have changed in the Web environment, relevance feedback is again receiving attention from researchers. This paper proposes a search result re-ranking approach based on document relevance, which makes use of both the relevant and the irrelevant documents in the feedback information. The approach is shown to consistently improve performance on the standard large-scale test dataset of the TREC 2008 Relevance Feedback Track.
    Key words: computer application; Chinese information processing; relevance feedback; document re-ranking; search engine
  • Review
    CAI Dongfeng, BAI Yu, YU Shui, YE Na, REN Xiaona
    2010, 24(3): 24-29.
    Word similarity computation is one of the key issues in natural language processing fields such as machine translation and information retrieval. Because traditional methods ignore the context of a word, they cannot effectively distinguish differences in word similarity when the context changes. This paper presents a context-based approach to word similarity computation, which employs fuzzy membership functions to compute the fuzzy significance of words and combines this with a word similarity calculation method that uses HowNet. The experimental results indicate that our approach effectively distinguishes semantically similar words by their context.
    Key words: computer application; Chinese information processing; context; fuzzy degree of significance; word similarity computation; membership function
  • Review
    CHEN You1,2, CHENG Xueqi1, YANG Sen1,2
    2010, 24(3): 29-37.
    Web forums have become an important resource on the Web because of the rich information contributed by millions of Internet users every day. Consequently, outburst topic detection has become a fundamental task in search engine and Web mining systems. Most existing topic detection and tracking (TDT) methods deal with news stories and have proved unsuitable for extracting topics from the casual, oral and informal language of noisy Web forums. This paper presents a noise-filtered model that extracts outburst topics from Web forums using terms and user participation. The proposed model employs not only content similarity but also user participation information. Experiments on the ShuiMu community demonstrate the efficiency of the proposed model, which not only extracts outburst topics that are better organized for search and visualization but also discovers the communities corresponding to these topics.
    Key words: computer application; Chinese information processing; outburst topic; web forum; time sequence
  • Review
    TAO Fumin, GAO Jun, WANG Tengjiao, ZHOU Kai
    2010, 24(3): 37-44.
    Feature extraction is essential to the quality of text-based sentiment analysis. This paper proposes a novel approach to feature extraction for the sentiment analysis of news comments. First, candidate sentiment features are extracted by comparing the contents of the news comments with the corresponding news. Then, general sentiment features for sentiment analysis on various news comments are selected through several extension and validation processes. The proposed method is capable of providing finer-grained sentiment analysis for a specific news topic: based on the topic information of the news comments, it can adapt the corresponding feature comparison and validation policies to extract topical sentiment features. The experiments show high performance for sentiment analysis on sparse data sets, such as comments on news.
    Key words: computer application; Chinese information processing; sentiment analysis; feature selection; feature extension
  • Review
    WANG Qian, LIU Yiqun, MA Shaoping, RU Liyun
    2010, 24(3): 44-49.
    Nowadays, user behavior analysis is widely used in Web research. Removing abnormal clicks from Web user access logs is therefore very important for extracting true information about user purpose and behavior. In this paper, using real-world Web user access logs provided by a commercial search engine company, we analyze several possible kinds of abnormal clicks, such as continuous clicks, one user with many IPs, and one IP with many users, from perspectives such as the degree of concentration of a user's accesses to Web sites and the average daily clicks of one user. We suggest that, for continuous clicks, user behavior researchers can eliminate either the superfluous and repetitive clicks or all the clicks of a user who exhibits continuous clicking, and that the cases of one IP with many users and one user with many IPs can be left untouched.
    Key words: computer application; Chinese information processing; user behavior analysis; web user access logs; abnormal click
  • Review
    CEN Rongwei, LIU Yiqun, ZHANG Min, RU Liyun, MA Shaoping
    2010, 24(3): 49-55.
    With the growing number of search users, behavior analysis has become one of the most important research issues for search engines in terms of architecture analysis, performance optimization and system maintenance. It is also a major area in both information retrieval and knowledge management. To better understand the search behavior of Web users, we analyzed user behavior based on 756 million click-through log entries. Several important aspects of user behavior are studied, such as query length, ratio of query refinement, access to query recommendations, first/last click distribution, and number of clicks per query. We also analyzed the differences in user behavior for different information needs based on separate query sets. These analyses may help improve both the effectiveness and the efficiency of search engines.
    Key words: computer application; Chinese information processing; user behavior analysis; search engine; web information retrieval
  • Review
    SONG Wei, ZHANG Yu, LIU Ting, LI Sheng
    2010, 24(3): 55-62.
    Learning user preferences implicitly is a hot research topic in personalized search, and query model reformulation based on user search history is a key issue. Existing work treats the search history as a whole without distinguishing whether it is relevant to the current query, which introduces much noise. In this paper, assuming that relevant terms tend to co-occur in context, we treat each past snippet as a context and reformulate the query by selecting, from the user's clicks, the terms most relevant to the whole query. The experimental results show that the algorithm can select relevant terms and reduce noise. On the evaluation metrics P@5 and NDCG, the system achieves relative improvements of 12.8% and 7.2% respectively over the best baseline system, and of 26.0% and 11.4% over the original ranking.
    Key words: computer application; Chinese information processing; personalized web search; implicit feedback; query reformulation
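    The two evaluation metrics named in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not say which DCG variant or cutoff was used for NDCG, so the sketch uses the linear-gain form with a log2 discount, and the function names are hypothetical.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k results judged relevant (grade > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of one ranked result list, top result first (made-up data).
ranked_grades = [1, 0, 1, 0, 0]
print(precision_at_k(ranked_grades, 5))
print(ndcg_at_k(ranked_grades, 5))
```

    A relative improvement such as the 12.8% quoted for P@5 is computed as (new - baseline) / baseline on scores like these, averaged over queries.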
  • Review
    DU Yanqi, MA Jun
    2010, 24(3): 62-69.
    This paper studies the problem of incremental crawling of forums. Because a topic in a forum is usually spread over more than one page, and the revisiting strategy of traditional incremental techniques is centered on the individual page, these techniques are not suitable for crawling forum sites incrementally. Based on a statistical analysis of the evolution of boards in many Web forums, a novel board-based incremental crawling strategy is proposed. The main idea is to treat the pages of the same board as the basic unit for re-crawling. In detail, the approach leverages board weights and local time discipline to allocate crawl resources and determine the crawl time. Experimental results show that the recall for newly published and updated discussion threads is close to 99.3% with our strategy, and that the overall system delay is decreased by up to 42% compared with an even scheduling method.
    Key words: computer application; Chinese information processing; incremental crawl; forum crawler; delay
  • Review
    WANG Suge1,2, YANG Anna1
    2010, 24(3): 69-75.
    Collocations with strong sentiment orientation are important for text sentiment analysis. In this paper, a method of collocation orientation identification based on hybrid language information is proposed. First, probabilistic latent semantic models are determined for six kinds of collocation patterns according to their characteristics. Then, the obtained semantic models are used to identify the sentiment orientations of collocations. Finally, for some collocations containing a sentiment word, the previously assigned tags are modified using a set of constructed rules. The experimental results on a corpus of car reviews indicate that the proposed method is superior to methods based only on the probabilistic latent semantic model or only on rules for collocation orientation identification.
    Key words: computer application; Chinese information processing; collocation; collocation pattern; sentiment orientation identification; probabilistic latent semantic model
  • Review
    LIAO Xiangwen1, XU Hongbo2, ZHONG Shangping1
    2010, 24(3): 75-81.
    The goal of blog opinion retrieval is to retrieve blog units that not only relate to a given query but also comment on it. Previous work ranked blog units by the opinion strength of a single blog unit. However, since a blog is a medium for expressing the blogger’s opinions and feelings, the personality of a blogger can affect the strength of his or her opinions. It is therefore a defect to use only a single blog unit to obtain the opinion score while neglecting the blogger factor. In this paper we build a blogger profile and then present a blogger-profile-based normalization strategy for blog opinion retrieval. We apply it to normalize a blog opinion retrieval algorithm based on the probabilistic inference model. Experimental results show that the proposed normalization strategy ranks blog units more reasonably and improves retrieval performance.
    Key words: computer application; Chinese information processing; blog opinion retrieval; blogger-profile; normalization strategy
  • Review
    ZENG Yiling1, XU Hongbo1, WU Gaowei1, CHENG Xueqi1, BAI Shuo1,2
    2010, 24(3): 81-89.
    Traditional clustering algorithms suffer from the model mismatch problem when the distribution of real data does not fit the model assumptions. To address this problem, a mapping and rescaling framework (referred to as the M-R framework) is proposed for document clustering. Specifically, documents are first mapped into a discriminative coordinate system so that the distribution statistics of each cluster can be analyzed on the corresponding dimension. With the statistics obtained, a rescaling operation is then applied to normalize the data distribution according to the model assumptions. These two steps are performed iteratively along with the clustering algorithm to improve clustering performance. In the experiments, the M-R framework is applied to traditional k-means and to the state-of-the-art spectral clustering algorithm Ncut. Results on well-known datasets show that the M-R framework brings performance improvements on all datasets.
    Key words: computer application; Chinese information processing; document clustering; space mapping; rescaling; model misfit
  • Review
    WANG Yu1,2, FANG Binxing1, WU Bo1,2, SONG Linhai1,2, GUO Yan1
    2010, 24(3): 89-97.
    This paper presents a new Web schema matching algorithm incorporating attribute distribution features, which comprise the mutually exclusive feature and the co-occurring feature. By discovering mutually exclusive attribute pairs and various statistics of each pair, the mutually exclusive feature is calculated as an indication of the semantic similarity of the attribute pair. To also exploit features based on name similarity and value similarity, the attribute distribution features are combined with these traditional similarity-based features through machine learning techniques. After potentially matching attribute pairs are discovered, the co-occurring feature is introduced as a constraint on the clustering algorithm, and the Web schema matching problem is solved by constrained attribute clustering. Experiments on a wide variety of domains demonstrate F-score improvements ranging from 0.13 to 0.55.
    Key words: computer application; Chinese information processing; mutually exclusive attributes; co-occurring attributes; web schema matching; constrained clustering
  • Review
    ZHANG Aihua1, JING Hongfang1, WANG Bin1, XU Yan2
    2010, 24(3): 97-105.
    In traditional vector space based text categorization models, term weighting and feature selection are completely isolated from each other. Although feature selection functions give a score to each term, the score is seldom taken into account when weighting terms. This paper adopts term frequency (TF), inverse document frequency (IDF) and feature selection functions as indicators of a feature's ability to represent a document, to distinguish different documents and to distinguish different categories, respectively. The experimental results show that TF can raise peak performance but is sensitive to noisy features; IDF is robust to noise but unstable; and the feature selection function is both noise-tolerant and stable. Finally, four criteria are proposed for combining these factors to establish optimal weighting schemes, and they are further verified by experiments.
    Key words: computer application; Chinese information processing; text categorization; term weighting; effects of weighting factors; VSM
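    To make the three weighting factors in the abstract above concrete, here is a minimal sketch that multiplies term frequency, inverse document frequency and a per-term feature selection score. The plain product is an assumption chosen for illustration: the abstract does not state the paper's four combination criteria, and the names `idf`, `weight_document` and `fs_score` are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """Inverse document frequency: log(N / df), or 0.0 for unseen terms."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def weight_document(doc, documents, fs_score):
    """Weight each term of doc as TF * IDF * feature-selection score.

    doc is a list of tokens, documents is the whole tokenized collection,
    and fs_score maps a term to a precomputed feature selection score
    (for example a chi-square value). Terms without a score get weight 0.
    """
    tf = Counter(doc)
    return {
        term: count * idf(term, documents) * fs_score.get(term, 0.0)
        for term, count in tf.items()
    }


# Tiny made-up collection and made-up feature selection scores.
collection = [["price", "stock"], ["price", "match"]]
scores = {"price": 1.0, "stock": 2.0}
print(weight_document(collection[0], collection, scores))
```

    In this toy collection "price" occurs in every document, so its IDF and therefore its weight is zero, while "stock" keeps a positive weight scaled by its feature selection score.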
  • Review
    NING Jian, LIN Hongfei
    2010, 24(3): 105-112.
    Focusing on cross-language information retrieval, this paper applies an improved Latent Semantic Indexing (LSI) method that combines SVD and NMF to construct a semantic space for the abstracts of biomedical literature. It maps Chinese and English documents into the same semantic space for bilingual information retrieval, without an external dictionary or knowledge base. The proposed method also utilizes the anchor information contained in the abstracts and builds a series of models corresponding to different dimensions K, all of which contribute to the similarity between query and documents with different credibility. As a result, term-to-term, document-to-document and term-to-document similarities are calculated for the bilingual retrieval of biomedical abstracts. The experiments show improved results.
    Key words: computer application; Chinese information processing; improved latent semantic indexing; semantic space; cross language IR; SVD; NMF
  • Review
    ZHANG Guiping, LIU Dongsheng, YIN Baosheng, XU Lijun, MIAO Xuelei
    2010, 24(3): 112-117.
    Based on the characteristics of patent documents, this paper presents a multi-strategy approach to word segmentation that combines statistics and rules. The method takes advantage of the latent segmentation marks in the document and employs the context information of the text in a maximum probability segmentation model. Term affix rules are then applied in post-processing. By making full use of global information from a large-scale corpus and of the specific context information, the method effectively addresses the difficulty of identifying out-of-vocabulary words in patent segmentation. The experimental results indicate that the method achieves good results in both closed and open tests and also improves unknown word recognition.
    Key words: computer application; Chinese information processing; Chinese word segmentation; patent document; context information
  • Review
    CHEN Zhenyu1, YUAN Yulin1, ZHANG Xiusong1, ZHOU Qiang2
    2010, 24(3): 117-124.
    We propose an automatic kinship reasoning model that takes the approach of “large knowledge base and small calculation”. First, we construct a cognitive model of the Chinese kinship system and its semantic situation on the basis of the lexical-grammatical expressions of the Chinese kinship system, and build a large knowledge base of Chinese kinship accordingly. The knowledge base consists of two libraries: a peripheral one, which stores various kinship constructions and their semantic translations, and a kernel one, which stores the special properties of kinship terms, reverse expressions of kinship descriptions and all potential transfer paths among the addresser, the addressee and the bridge person. Then we give a simple computational procedure that can easily infer different kinship relations from given kinship facts.
    Key words: computer application; Chinese information processing; kinship; automatic reasoning; cognitive model; knowledge base; reverse expression; transfer path
  • Review
    GUAN Bai
    2010, 24(3): 124-129.
    The segmentation unit is the basic unit of a segmentation system as well as the basis for word segmentation research. This paper discusses the segmentation unit of Tibetan words on the basis of current Tibetan grammar theory and the Chinese semantic framework. Specifically, with reference to “The Criterion of Word Segmentation for Chinese Information Processing (for Consultation)”, “The Criterion of Word Segmentation for Modern Chinese Information Processing” and other standards, this paper proposes nine basic principles and three secondary principles for segmenting Tibetan words on the basis of a Tibetan corpus. Tibetan word segmentation is further explained in detail using the proposed segmentation principles and the Tibetan word classes thus established.
    Key words: computer application; Chinese information processing; Tibetan word segmentation; segmentation unit; information processing; principle of word segmentation