Abstract
As the smallest semantic unit, a word has a complex relationship with text domains. For commonly used words in particular, it is often difficult to assign a single, well-defined domain. In some applications it is not necessary to establish an explicit relationship between a word and a domain: a quantified value of the word's domain property alone can yield good results. Based on word association information in a large-scale corpus, this paper proposes an unsupervised method for quantifying the domain property of words. The resulting value can serve as a word feature in natural language processing tasks such as text classification, topic detection, and information retrieval. We validate the proposed measure by comparing it with the commonly used TF-IDF feature in a topic detection application.
Key words
domain property of words /
topic detection /
TF-IDF
Funding
State Language Commission "Twelfth Five-Year Plan" research program project (YB125-43)