Uyghur Text Classification Based on Semantic String Extraction and Topic Similarity Measurement

Turdi Tohti, Winira Musajan, Askar Hamdulla

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 4: 100-107.
Section: Information Processing of Ethnic and Neighboring Languages

Semantic String-Based Topic Similarity Measuring Approach for Uyghur Text Classification

  • Turdi Tohti, Winira Musajan, Askar Hamdulla

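Abstract (translated from the Chinese)

This paper studies an improved n-gram incremental algorithm for extracting semantic strings that express key information in Uyghur texts, and uses sets of weighted semantic strings to characterize text topics. It proposes a Jaccard-like measure of topic similarity between a text and a class, and implements a corresponding Uyghur text classification algorithm. Experimental results show that the proposed text model is simple and effective, that the classification algorithm has a low computational cost, and that it matches or exceeds the overall classification performance of classical classifiers.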
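
The n-gram incremental idea named in the abstract can be illustrated with a minimal sketch. The paper's actual improvements, thresholds, and weighting scheme are not stated on this page, so everything below is an assumption made for illustration: the function names `extract_semantic_strings` and `keep_maximal`, the `min_count` and `max_len` parameters, and the placeholder tokens standing in for segmented Uyghur words. The sketch grows frequent word strings one word at a time and then keeps only strings that are not absorbed by a longer string with the same frequency.

```python
from collections import Counter


def extract_semantic_strings(tokens, min_count=2, max_len=4):
    """Grow frequent word n-grams one word at a time (illustrative only)."""
    # Level 1: single words that occur at least min_count times.
    unigram_counts = Counter((t,) for t in tokens)
    current = {g: c for g, c in unigram_counts.items() if c >= min_count}
    frequent = dict(current)
    for n in range(2, max_len + 1):
        if not current:
            break
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Extend only if both (n-1)-word sub-strings were frequent.
            if gram[:-1] in current and gram[1:] in current:
                counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_count}
        frequent.update(current)
    return frequent


def _contains(long_gram, short_gram):
    """True if short_gram occurs as a contiguous run inside long_gram."""
    k = len(short_gram)
    return any(long_gram[i:i + k] == short_gram
               for i in range(len(long_gram) - k + 1))


def keep_maximal(frequent):
    """Drop strings absorbed by a longer string with the same count."""
    return {
        g: c for g, c in frequent.items()
        if not any(len(h) > len(g) and frequent[h] == c and _contains(h, g)
                   for h in frequent)
    }


# Placeholder tokens; real input would be segmented Uyghur words.
tokens = "a b c x a b c y a b c".split()
strings = keep_maximal(extract_semantic_strings(tokens))
print(strings)  # {('a', 'b', 'c'): 3}
```

In this toy input the three-word string survives because every shorter frequent string occurs only inside it. The frequency of each surviving string could serve as a crude weight, although the paper's own weighting is not specified here.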

Abstract

This paper proposes an improved frequent pattern-growth approach to discover and extract semantic strings that express key information in Uyghur texts. Text topics are then described by sets of these weighted semantic strings. On top of this representation, Uyghur text classification is performed with a newly designed Jaccard-like similarity measure between a text and a class topic. Experimental results show that the proposed method achieves performance comparable to two traditional classifiers at a reasonable computational cost.
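
To make the notion of a Jaccard-like similarity over weighted semantic strings concrete, here is a small sketch. It uses the standard weighted (generalized) Jaccard coefficient, the sum of element-wise minima divided by the sum of element-wise maxima, which is swapped in as a stand-in and is not claimed to be the paper's formula. The names `weighted_jaccard` and `classify`, the example strings, and all weights are hypothetical.

```python
def weighted_jaccard(doc, topic):
    """Generalized Jaccard between two {semantic string: weight} dicts.

    Stand-in measure: sum of minima over sum of maxima. The paper's own
    Jaccard-like formula is not given on this page.
    """
    keys = set(doc) | set(topic)
    num = sum(min(doc.get(k, 0.0), topic.get(k, 0.0)) for k in keys)
    den = sum(max(doc.get(k, 0.0), topic.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def classify(doc, class_topics):
    """Assign the class whose topic is most similar to the document."""
    return max(class_topics,
               key=lambda label: weighted_jaccard(doc, class_topics[label]))


# Hypothetical weighted semantic string sets for one text and two classes.
doc = {"football match": 2.0, "team": 1.0}
class_topics = {
    "sports": {"football match": 3.0, "team": 1.0, "coach": 1.0},
    "economy": {"stock market": 2.0, "team": 0.5},
}
print(classify(doc, class_topics))  # sports
```

With these made-up weights the text scores 3/5 against the sports topic and 0.5/5 against the economy topic. Classification reduces to one similarity computation per class, which is consistent with the low computational cost the abstract reports.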

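Keywords (translated from the Chinese)

Uyghur / n-gram incremental algorithm / semantic string extraction / topic similarity / text classification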

Key words

Uyghur language / frequent pattern-growth algorithm / semantic string extraction / topic similarity / text classification

Cite this article

Turdi Tohti, Winira Musajan, Askar Hamdulla. Semantic String-Based Topic Similarity Measuring Approach for Uyghur Text Classification. Journal of Chinese Information Processing. 2017, 31(4): 100-107

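Funding

National Natural Science Foundation of China (61562083, 61262062, 61262063); Key Project of the Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2012I11)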