Abstract
This paper studies an improved n-gram incremental (frequent pattern-growth) algorithm for extracting the semantic strings that express key information in a text. A multi-feature fusion evaluation method then weights these strings and selects the most important ones for each text, and the selected strings serve as the features that represent it. K_means clustering experiments show that semantic-string features yield a more compact text model than a word feature set: they greatly reduce the dimensionality of the feature space and also improve the performance of the clustering algorithm.
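The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the general idea of growing n-grams incrementally, not the authors' algorithm. It assumes whitespace-tokenized text, a simple frequency threshold, and hypothetical names (`extract_frequent_strings`, `maximal_strings`, `min_freq`, `max_n`); the multi-feature weighting and the K_means step are left out.

```python
from collections import Counter


def extract_frequent_strings(docs, min_freq=2, max_n=4):
    """Collect word n-grams (n >= 2) seen at least min_freq times.

    An (n+1)-gram is counted only when its n-gram prefix was frequent,
    so candidates grow one word at a time (assumed pruning rule).
    """
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    frequent = {}
    prev = set()
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in tokenized:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if n == 1 or gram[:-1] in prev:
                    counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_freq}
        if not prev:
            break  # nothing frequent at this length, so nothing longer can be
        if n >= 2:
            frequent.update((g, counts[g]) for g in prev)
    return frequent


def contains(longer, shorter):
    """True if shorter occurs as a contiguous run inside longer."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def maximal_strings(frequent):
    """Drop strings covered by a longer string with the same frequency."""
    return {
        g: c
        for g, c in frequent.items()
        if not any(
            len(h) > len(g) and frequent[h] == c and contains(h, g)
            for h in frequent
        )
    }
```

Each text could then be represented by the counts of the retained strings it contains, which is where the reduction in feature-space dimensionality described above would come from.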
Key words
Uyghur language /
semantic string extraction /
feature evaluation and selection /
vector space model /
K_means
Funding
National Natural Science Foundation of China (61562083, 61262062, 61262063)