Quantize the Domain Property of Words

LIU Dongming, YANG Erhong

Journal of Chinese Information Processing, 2014, Vol. 28, Issue 5: 46-50.
Lexical, Syntactic, and Semantic Analysis and Applications


Abstract

Words, as the smallest semantic units, have a complex relationship with domains; for commonly used words in particular, it is often difficult to determine which domain a word belongs to. Some applications do not need an explicit mapping between words and domains, and a quantified value of a word's domain property is enough to obtain good results. This paper proposes an unsupervised method that quantifies the domain property of words from word association information in a large-scale corpus. The resulting value can serve as a word feature in natural language processing tasks such as text classification, topic detection, and information retrieval. Finally, a comparison with the commonly used TF-IDF feature in a topic detection application demonstrates the effectiveness of the proposed measure.
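
For readers unfamiliar with the baseline feature, the following is a minimal sketch of TF-IDF weighting. It assumes one common textbook variant (term frequency normalized by document length, inverse document frequency as log(N / df)); the abstract does not say which variant the authors used. The function name `tf_idf` and the toy corpus are hypothetical and do not come from the paper. The sketch covers only the baseline, not the proposed domain-property measure.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    This is an illustrative choice, not the paper's stated formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up corpus (already tokenized); not data from the paper.
docs = [
    ["stock", "market", "price", "report"],
    ["football", "match", "score", "report"],
    ["stock", "price", "rise", "report"],
]
weights = tf_idf(docs)
```

In this toy corpus, "report" occurs in every document and so receives a weight of zero, while a term confined to a single document, such as "football", receives a higher weight than a term shared by two documents, such as "stock".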

Key words

domain property of words / topic detection / TF-IDF

Cite this article

LIU Dongming, YANG Erhong. Quantize the Domain Property of Words. Journal of Chinese Information Processing. 2014, 28(5): 46-50


Funding

State Language Commission 12th Five-Year Plan Research Project (YB125-43)