Review
HAO Xiulan, HU Yunfa, SHEN Qing
2012, 26(3): 129-137.
The Internet is flooded with user-generated content, such as posts in web forums. Monitoring these fragmentary, rambling messages is a concern for security agencies. Topic Detection and Tracking (TDT) is one effective way to monitor sensitive information. However, the salient features of replies to a post in a web forum (e.g., short length and rapid “topic drifting”) challenge TDT over web forums. Based on the characteristics of replies, three models are proposed in this paper. First, a baseline model that employs a single-pass clustering procedure is described. Second, to alleviate “topic drifting”, an improved model is proposed, in which terms in the title are used to adjust the weights of terms in the post, and a topic is represented by a seminal vector and a tracked vector. Third, a late reweighting technique for named entities (NEs) is applied. To deal with the free format of user-generated content and to meet the speed requirement, a new feature extraction procedure is proposed. Experimental results on a real data set show that the proposed models and the feature extraction procedure are feasible.
Key words: content monitor; Chinese web forum; feature extraction
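
The single-pass clustering procedure named in the abstract as the baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, raw term-frequency weights, cosine similarity, the threshold value of 0.3, and the names `cosine` and `single_pass_cluster` are all assumptions chosen for illustration. Real Chinese forum text would first need word segmentation, which is omitted here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document, in arrival order, to the most similar
    existing topic, or open a new topic if no similarity reaches
    the threshold. Each document is seen exactly once.

    Assumptions (not from the paper): whitespace tokenization,
    raw term counts as weights, summed counts as the topic centroid.
    """
    topics = []  # each topic: {"centroid": Counter, "members": [doc indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # fold the document into the topic
            best["members"].append(i)
        else:
            topics.append({"centroid": Counter(vec), "members": [i]})
    return topics


posts = [
    "earthquake relief fund donation",
    "earthquake relief donation drive",
    "football match score tonight",
    "football score update",
]
clusters = [t["members"] for t in single_pass_cluster(posts)]
print(clusters)  # → [[0, 1], [2, 3]]
```

Because every incoming post updates the centroid of the topic it joins, a long thread of short replies can pull the centroid away from the original subject. This is one way to picture the “topic drifting” that the improved model addresses by keeping a seminal vector separate from a tracked vector.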