Review
ZOU Zhihua, TIAN Shengwei, YU Long, FENG Guanjun
2013, 27(2): 118-127.
This paper proposes an improved suffix tree clustering algorithm for Uyghur Web text (STCU) that uses the Uyghur word as the basic unit when constructing the suffix tree. Based on the characteristics of Uyghur and of Web texts, we design a Uyghur word stemmer and construct an absolute stop word table and a relative stop word table for Uyghur. We use document frequency and part-of-speech information to extract key phrases, and then automatically adjust the clustering threshold according to the size of the Web corpus. Finally, we use the most general phrases to describe each cluster category, which effectively improves the quality of the clustering results. Compared with traditional suffix tree clustering, the error rate drops by 0.94%, while the overall rate and the precision improve by 44.51% and 11.74%, respectively.
Key words: Uyghur; suffix tree; phrase clustering; stop word list; document frequency
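
The idea of word-level phrases filtered by stop words and document frequency can be illustrated with a minimal sketch. It is not the STCU implementation: it stands in a plain dictionary of word n-grams for a real suffix tree, splits on whitespace instead of applying a Uyghur stemmer, and leaves out part-of-speech filtering and the adaptive threshold. The function name `base_clusters`, the parameters `max_len` and `min_df`, and the toy English documents are all assumptions made for illustration.

```python
from collections import defaultdict


def base_clusters(docs, stop_words=frozenset(), max_len=3, min_df=2):
    """Map each word-level phrase to the set of documents containing it.

    Hypothetical helper: a dictionary of word n-grams approximates the
    phrases that the nodes of a word-level suffix tree would represent.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()  # assumption: whitespace tokens, no stemming
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                phrase = tuple(words[start:end])
                # Drop phrases that begin or end with a stop word.
                if phrase[0] in stop_words or phrase[-1] in stop_words:
                    continue
                phrase_docs[phrase].add(doc_id)
    # Keep only phrases shared by at least min_df documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_df}


# Toy data, not taken from the paper's corpus.
docs = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
clusters = base_clusters(docs, stop_words={"too"})
```

Each surviving phrase and its document set form a candidate base cluster. In standard suffix tree clustering, base clusters whose document sets overlap heavily are then merged, and a phrase shared by the merged group can serve as its label.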