An Automatic-Annotation Method for Emergency Text Corpus

LIU Wei, WANG Xu, ZHANG Yujia, LIU Zongtian

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 2: 76-85.
Information Extraction and Text Mining


Abstract

An event corpus is the foundation for research on the extraction, representation, reasoning and mining of event knowledge in the Semantic Web. This paper takes the event, described by an event structure model, as the unit of textual knowledge and proposes an automatic annotation method for constructing a large-scale corpus of emergency-event texts. First, the text is analyzed with LTP, and the sequential pattern mining algorithm PrefixSpan is applied to an existing small-scale corpus to mine rules for event elements, such as part-of-speech rules. Second, the trigger-word (denoter) list is expanded with Tongyici Cilin (Extended). Third, these resources are combined with a customized dictionary of event elements, and the text is annotated in multiple filtering passes, each pass refining the result of the previous one. In the experiments, the method is compared with manual annotation and with Stanford CoreNLP NER; the results are satisfactory and show that the method effectively improves the efficiency of event-based text annotation.
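The rule-mining step can be illustrated with a minimal sketch of PrefixSpan over part-of-speech tag sequences. This is not the authors' implementation: the function name `prefixspan`, the toy sequences, and the support threshold are assumptions made for illustration, and the tags only loosely follow the LTP part-of-speech tag set (`nt` temporal noun, `ns` place name, `nh` person name, `v` verb, `n` noun, `p` preposition, `m` number, `q` quantifier). Each sequence element is a single tag, which simplifies the general algorithm.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Mine frequent subsequences (gaps allowed) from lists of single items.

    Returns a dict mapping each frequent pattern (a tuple) to its support,
    i.e. the number of input sequences that contain it.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start offset) pairs: the suffix
        # of each sequence that remains after matching `prefix`.
        support = defaultdict(set)
        for seq_id, start in projected:
            for item in sequences[seq_id][start:]:
                support[item].add(seq_id)

        for item, seq_ids in sorted(support.items()):
            if len(seq_ids) < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(seq_ids)

            # Project on the first occurrence of `item` in each suffix.
            new_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projected.append((seq_id, pos + 1))
                        break
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Invented toy data: POS tag sequences of four short clauses.
toy_sequences = [
    ["nt", "p", "ns", "v", "n"],
    ["nt", "ns", "v", "n"],
    ["p", "ns", "v", "m", "q", "n"],
    ["nh", "v", "n"],
]

patterns = prefixspan(toy_sequences, min_support=3)
for pattern, count in sorted(patterns.items()):
    print(" ".join(pattern), count)
```

A frequent pattern such as `ns v n` could then serve as a candidate rule (place name, then trigger verb, then object noun), to be filtered further by the dictionaries and passes described in the abstract.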

Key words

emergency events / corpus / automatic annotation

Cite this article

LIU Wei, WANG Xu, ZHANG Yujia, LIU Zongtian. An Automatic-Annotation Method for Emergency Text Corpus. Journal of Chinese Information Processing. 2017, 31(2): 76-85


Funding

National Natural Science Foundation of China (61305053); National Natural Science Foundation of China (61273328)