Abstract
Microblog messages carry a large amount of real-time traffic event information and can complement existing, sensor-based methods of traffic data collection. However, current event extraction methods do not assess the relations between geographic entities. They perform poorly on spatial elements whose descriptions involve several geographic entities and the relations among them, and so struggle to identify the location of a traffic event accurately. This paper proposes an automatic labeling method that introduces geographic entity relation identification into event extraction to address this problem. First, a conditional random field (CRF) model labels the traffic event roles in the message text. Second, support vector machine (SVM) models label the relations between roles and the relations between elements, which completes the identification of the spatial elements of a traffic event. Experiments on Sina Weibo microblogs show that the precision and recall of the proposed method both reach 90%, outperforming an existing extraction method based on pattern matching.
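As a reading aid for the reported figures, the following minimal sketch shows how precision and recall are conventionally computed for an extraction task. It is not the authors' evaluation code: the function name `precision_recall`, the `(message id, role, text)` tuple layout, and the sample items are all assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of annotated items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (message id, role, text) items, invented for this example only.
gold = {
    (1, "location", "Zhongguancun Street"),
    (1, "event", "congestion"),
    (2, "location", "North Fourth Ring Road"),
    (2, "event", "accident"),
}
predicted = {
    (1, "location", "Zhongguancun Street"),
    (1, "event", "congestion"),
    (2, "location", "North Fourth Ring Road"),
    (2, "event", "roadworks"),  # wrong event type: counts against both scores
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here an item counts as correct only on an exact match, which is one common convention; whether the paper scores partial matches differently is not stated in the abstract.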
Key words
microblog /
information extraction /
traffic event /
conditional random fields /
support vector machine
Funding
National Natural Science Foundation of China (41631177); National Natural Science Foundation of China (41401460)