Abstract
To accurately trace how a topic develops across a large volume of Web news on the same topic, this paper proposes a topic clue mining method based on the conditional random field (CRF) model. First, features are derived from the identification rules for topic clue sentences and fed to the CRF model, which extracts the topic clue sentences. Next, the original clue chains are built in chronological order, taking lexical weight into account. Finally, semantically similar clue chains are merged to obtain the final development thread of the news topic. Experimental results show that the method performs well on topic clue sentence identification, and that the resulting clue chains show the development trend of the news fairly clearly.
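The last two steps of the pipeline (chronological chain construction and merging of similar chains) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the clue sentences have already been labelled by a CRF tagger and grouped into candidate chains, it ignores lexical weight, and it substitutes Jaccard overlap of keyword sets for the semantic similarity measure, which the abstract does not specify. The names `ClueSentence`, `ClueChain`, `build_chain`, `merge_chains`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ClueSentence:
    """A sentence assumed to be already labelled as a topic clue by a CRF tagger."""
    published: date
    keywords: frozenset


@dataclass
class ClueChain:
    """A list of clue sentences kept in chronological order."""
    sentences: list = field(default_factory=list)

    def keywords(self):
        # Union of the keyword sets of every sentence in the chain.
        words = set()
        for sentence in self.sentences:
            words |= sentence.keywords
        return words


def jaccard(a, b):
    # Stand-in similarity (an assumption): shared keywords / all keywords.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_chain(sentences):
    """Order one group of clue sentences by publication date."""
    return ClueChain(sorted(sentences, key=lambda s: s.published))


def merge_chains(chains, threshold=0.5):
    """Greedily merge chains whose keyword overlap reaches the threshold."""
    merged = []
    for chain in chains:
        for target in merged:
            if jaccard(target.keywords(), chain.keywords()) >= threshold:
                # Re-sort so the merged chain stays chronological.
                target.sentences = sorted(
                    target.sentences + chain.sentences,
                    key=lambda s: s.published,
                )
                break
        else:
            # No sufficiently similar chain found: start a new one.
            merged.append(ClueChain(list(chain.sentences)))
    return merged
```

Calling `merge_chains([build_chain(group) for group in groups])` on pre-grouped clue sentences yields the merged chains; the greedy single pass is chosen only for brevity and depends on the input order.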
Key words
topic clue /
conditional random fields /
clue chain
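Funding
National Natural Science Foundation of China (71271209); Beijing Natural Science Foundation (4132067); Ministry of Education Humanities and Social Sciences Youth Fund (11YJC630268)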