Review
SU Chong, CHEN Qingcai, WANG Xiaolong, MENG Xianjun
2010, 24(2): 58-68.
Most existing web page clustering algorithms, such as STC and Lingo, operate on short and uneven web page snippets, which often leads to poor clustering performance. Classical clustering algorithms for full web pages, such as K-means, are in turn too complex for online clustering and do not produce good cluster labels. To address these problems, this paper presents an online web page clustering algorithm based on maximal frequent itemsets (MFIC). First, the maximal frequent itemsets are mined; the web pages are then clustered according to the frequent itemsets they share; finally, the clusters are labelled using the frequent items. Experimental results show that MFIC effectively reduces clustering time, improves clustering accuracy by 15%, and generates understandable labels.
Key words: computer application; Chinese information processing; search engine; Web page clustering; frequent itemset
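The three steps named in the abstract (mine maximal frequent itemsets, cluster pages by shared itemsets, label clusters with the frequent items) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each page is already reduced to a set of terms, uses a simple level-wise (Apriori-style) search in place of whatever mining procedure the paper uses, and allows a page to fall into more than one cluster. The names `frequent_itemsets`, `maximal_itemsets`, `mfic_sketch`, and the `min_support` parameter are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Map each frequent itemset to the ids of the documents containing it.

    Level-wise search: an itemset is frequent when at least `min_support`
    documents contain all of its terms. (Assumed mining strategy.)
    """
    cover = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            cover.setdefault(frozenset([term]), set()).add(doc_id)
    level = {s: ids for s, ids in cover.items() if len(ids) >= min_support}

    frequent = {}
    while level:
        frequent.update(level)
        next_level = {}
        for a, b in combinations(list(level), 2):
            candidate = a | b
            # Only join itemsets that differ in exactly one term.
            if len(candidate) == len(a) + 1 and candidate not in next_level:
                ids = level[a] & level[b]  # documents containing both parts
                if len(ids) >= min_support:
                    next_level[candidate] = ids
        level = next_level
    return frequent


def maximal_itemsets(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return {
        s: ids
        for s, ids in frequent.items()
        if not any(s < other for other in frequent)
    }


def mfic_sketch(docs, min_support=2):
    """Cluster documents by shared maximal frequent itemsets.

    Each maximal itemset defines one cluster (assumed assignment rule);
    the cluster label is the itemset's terms in sorted order.
    """
    maximal = maximal_itemsets(frequent_itemsets(docs, min_support))
    return {" ".join(sorted(s)): sorted(ids) for s, ids in maximal.items()}


# Toy input: each page is represented as a set of terms (made-up data).
docs = [
    {"python", "pandas", "dataframe"},
    {"python", "pandas", "tutorial"},
    {"jaguar", "car", "price"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "cat", "habitat"},
]
clusters = mfic_sketch(docs, min_support=2)
```

On this toy input the last page shares only the single term "jaguar" with others, which is not maximal, so it stays unclustered. A real system would need a rule for such pages and for overlapping clusters, neither of which the abstract specifies.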