Abstract
The explosion of information poses new challenges for text processing. Topic detection can effectively organize large volumes of information by topic, but end users may not need every text related to a topic; they may care only about specific content within it. Before topic content can be intelligently summarized from the relevant texts and pushed to users, automatically selecting the texts that match the user's needs is a worthwhile task. This paper focuses on comparing the content of texts on the same topic, with the aim of effectively selecting the texts that meet those needs. We redefine the notion of a topic, specify representations of topics and texts based on this definition, and give a method for computing the content comparison between a topic and a text under this representation. Finally, experiments demonstrate the effectiveness of this series of methods.
Key words
topic definition /
textual representation /
topic detection /
text content computing
Funding
State Language Commission "Twelfth Five-Year Plan" research planning project (YB125-43)