Review
CAO Peng1,2, LI Jingyuan1, MAN Tong1,2, LIU Yue1, CHENG Xueqi1
2011, 25(1): 20-28.
Microblogging is a relatively new Web 2.0 concept. The most important microblog system in use is Twitter, with more than 160 million users worldwide. Twitter is currently one of the most influential voices in the world, and its users include celebrities, well-known politicians and leading companies. Twitter messages are short, and their content is often informal in syntax and grammar. Moreover, Twitter does not strictly define the syntax of a retweet, which results in a large number of near duplicate messages. These near duplicate messages waste storage resources and can greatly degrade the user experience of Twitter. In this paper, the syntax of retweet messages is analyzed, and a method is presented that uses the results of this analysis to remove retweet symbols from messages. In addition, two text distance measures, character statistics and shortest edit distance, are proposed to cluster Twitter messages into groups of near duplicate messages. We also analyze the log-in method and the characteristics of Twitter messages. Through a series of experiments, we show that our methods are efficient, extensible and easy to implement, and can be used to discover and filter near duplicate messages in microblogs.
Key words: microblog; Twitter; near duplicate message
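
The pipeline outlined in the abstract (strip retweet symbols, measure text distance, group near duplicates) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The retweet patterns (`RT @user:` and `via @user`), the normalization of each distance, the use of the character-count distance as a cheap pre-filter before the edit distance, the greedy grouping, and both threshold values are assumptions made only for illustration, as are all function names.

```python
import re
from collections import Counter

# Assumed retweet markers ("RT @user:" and "via @user"); real retweet
# syntax is more varied, so this pattern is illustrative only.
RETWEET_MARKER = re.compile(r"\b(?:RT|via)\s*@\w+:?\s*", re.IGNORECASE)


def strip_retweet_markers(message):
    """Remove the assumed retweet markers, lowercase, collapse whitespace."""
    cleaned = RETWEET_MARKER.sub(" ", message)
    return " ".join(cleaned.lower().split())


def char_count_distance(a, b):
    """Character-statistics distance: summed per-character count
    differences, divided by the combined length (assumed normalization)."""
    ca, cb = Counter(a), Counter(b)
    diff = sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))
    total = len(a) + len(b)
    return diff / total if total else 0.0


def edit_distance(a, b):
    """Shortest edit (Levenshtein) distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_near_duplicates(messages, char_threshold=0.2, edit_threshold=0.2):
    """Greedy grouping: a message joins the first cluster whose
    representative is close under both distances (thresholds are assumed)."""
    clusters = []  # list of (cleaned representative, original messages)
    for message in messages:
        text = strip_retweet_markers(message)
        for representative, members in clusters:
            # Cheap pre-filter: skip the costlier edit distance when the
            # character counts already differ too much.
            if char_count_distance(text, representative) > char_threshold:
                continue
            longest = max(len(text), len(representative)) or 1
            if edit_distance(text, representative) / longest <= edit_threshold:
                members.append(message)
                break
        else:
            clusters.append((text, [message]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = [
        "Big news today!",
        "RT @alice: Big news today!",
        "Big news today!! via @bob",
        "Lunch was great",
    ]
    for group in cluster_near_duplicates(sample):
        print(group)
```

In this sketch the three "Big news today" variants fall into one group and the unrelated message forms its own. Comparing each message only against cluster representatives keeps the number of edit-distance computations low, at the cost of making the result depend on message order.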