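For the analysis of event storylines in Chinese-Vietnamese bilingual news, a bilingual event storyline generation method based on the distribution of global/local co-occurring word pairs is proposed. The method first takes the word distribution of a news topic as global words that characterize the global event. It then takes the event elements specific to news segments at a given time granularity, such as time, person, and place, as local words, and analyzes the co-occurrence of global and local words within those segments. Using the co-occurrence regularities of global/local words as supervision, and combining the RCRP algorithm with an aligned Chinese-Vietnamese bilingual news corpus, it builds a supervised topic model that yields the sub-topic distribution representing the course of the event over the corresponding time span. The sub-topic distribution reflects how the event develops, which gives an online Chinese-Vietnamese bilingual event storyline generation model. Experiments were carried out on a mixed Chinese-Vietnamese news dataset, and the comparative storyline generation results demonstrate the effectiveness of the proposed method.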
Abstract
To analyze event storylines in Chinese-Vietnamese bilingual news, a storyline generation method based on the co-occurrence distribution of global/local word pairs is proposed. First, the word distribution of a detected news topic is used as global words to characterize the global event. Then the event elements specific to the news segments within a given time granularity, such as time, person, and place, are used as local words. The co-occurrence of global and local words in the news segments is analyzed and used as supervision information, which is integrated with the RCRP algorithm and Chinese-Vietnamese bilingual aligned words into a bilingual topic model to obtain the sub-topic distribution for each time slice. Finally, because the sub-topic distribution represents the development of an event, it is used to build a generative model of the storyline. Comparative storyline generation experiments on a mixed Chinese-Vietnamese news set crawled from the internet show that the proposed bilingual storyline model outperforms the other methods.
Key words Chinese-Vietnamese; news event storyline; global/local co-occurrence words; sub-topic distribution; bilingual topic model
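The global/local co-occurrence step described in the abstract can be illustrated with a minimal sketch. It only counts which global words (topic words) appear together with which local words (time, person, place) in the same news segment, grouped by time slice. It does not implement the RCRP algorithm or the bilingual topic model. The function names, the slice labels `t1` and `t2`, and the toy segments are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def cooccurrence_by_slice(segments, global_words, local_words):
    """Count global/local word pairs that co-occur in the same news segment.

    segments: iterable of (time_slice, tokens) pairs.
    Returns {time_slice: Counter({(global_word, local_word): count})}.
    """
    counts = {}
    for time_slice, tokens in segments:
        present = set(tokens)
        found_global = sorted(present & global_words)
        found_local = sorted(present & local_words)
        slice_counts = counts.setdefault(time_slice, Counter())
        # Each (global, local) pair is counted once per segment.
        slice_counts.update(product(found_global, found_local))
    return counts


def normalize(counter):
    """Turn pair counts into a distribution over pairs for one time slice."""
    total = sum(counter.values())
    return {pair: n / total for pair, n in counter.items()} if total else {}


# Toy, made-up data for illustration only.
global_words = {"earthquake", "rescue"}
local_words = {"Kunming", "Hanoi", "Monday"}
segments = [
    ("t1", ["earthquake", "rescue", "Kunming", "Monday"]),
    ("t1", ["earthquake", "Kunming", "casualties"]),
    ("t2", ["earthquake", "rescue", "Hanoi"]),
]

counts = cooccurrence_by_slice(segments, global_words, local_words)
distributions = {t: normalize(c) for t, c in counts.items()}
```

In this sketch the per-slice pair distribution plays the role of the supervision signal. How that signal actually enters the topic model is specific to the paper and is not reproduced here.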
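Funding
National Natural Science Foundation of China (61472168, 61175068, 61163004); Key Project of the Natural Science Foundation of Yunnan Province (2013FA130); Yunnan Province Science and Technology Innovation Talent Fund (2014HE001); Open Fund of the Key Laboratory of Software Engineering, Yunnan University (2011SE14)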