Uyghur Semantic String Extraction Based on Statistical Model and Shallow Linguistic Parsing

Turdi Tohti, Winira Musajan, Askar Hamdulla

Journal of Chinese Information Processing ›› 2017, Vol. 31 ›› Issue (4) : 70-79.
Information Processing for Ethnic Minority and Neighboring Languages

Abstract

This paper proposes a fast method for extracting Uyghur semantic strings based on statistics and shallow linguistic analysis. A multilayered dynamic index structure is used to build a word index over large-scale text. Guided by Uyghur inter-word association rules, an improved n-gram incremental algorithm extends word strings and discovers the credible frequent patterns in the text. Each frequent pattern string is then checked in turn for structural integrity, and those that pass are taken as the semantic strings. Experiments on corpora of different sizes indicate that the method is feasible and effective and can be applied to several areas of Uyghur text mining.
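The word-string extension step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the index layout, the association rules, or the structural integrity test, so everything here is an assumption: a single-level positional word index stands in for the multilayered dynamic index, a plain frequency threshold (`min_freq`) stands in for the credibility criteria, and a subsumption filter stands in for the final selection. All function names are hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Map each word to the list of positions where it occurs (assumed index layout)."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index


def frequent_strings(words, min_freq=2, max_len=5):
    """Grow frequent word strings one word at a time (n -> n+1)."""
    index = build_index(words)
    # Level 1: frequent single words, keyed as 1-tuples, with their start positions.
    current = {(w,): ps for w, ps in index.items() if len(ps) >= min_freq}
    result = {}
    n = 1
    while current and n < max_len:
        longer = defaultdict(list)
        for gram, starts in current.items():
            for s in starts:
                nxt = s + n  # position of the word that would extend this occurrence
                if nxt < len(words):
                    longer[gram + (words[nxt],)].append(s)
        # Only extensions that are still frequent survive to the next round.
        current = {g: ps for g, ps in longer.items() if len(ps) >= min_freq}
        result.update({g: len(ps) for g, ps in current.items()})
        n += 1
    return result


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def maximal_strings(freq):
    """Drop strings absorbed by a longer string with the same frequency."""
    kept = {}
    for g, f in freq.items():
        absorbed = any(
            len(h) > len(g) and fh == f and contains(h, g)
            for h, fh in freq.items()
        )
        if not absorbed:
            kept[g] = f
    return kept


if __name__ == "__main__":
    # Toy token sequence; real input would be segmented Uyghur words.
    tokens = "a b c x a b c y a b z".split()
    print(maximal_strings(frequent_strings(tokens, min_freq=2)))
```

Because only strings that are already frequent get extended, the candidate set shrinks at each length, which is the usual reason an incremental n-gram scheme is fast on large text. A real system would add the language-specific association rules and boundary checks that this sketch omits.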

Keywords

semantic string / multilayered dynamic indexing / word string extension / credible frequent pattern / context analysis (adjacency feature analysis)

Cite this article

Turdi Tohti, Winira Musajan, Askar Hamdulla. Uyghur Semantic String Extraction Based on Statistical Model and Shallow Linguistic Parsing. Journal of Chinese Information Processing. 2017, 31(4): 70-79

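Funding

National Natural Science Foundation of China (61562083, 61262062, 61262063); Key Project of the Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2012I11)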