Chinese Verb Occurrence State Dataset Construction

XU Jin, XIN Xin

Journal of Chinese Information Processing, 2025, Vol. 39, Issue 2: 27-40.
Language Resource Construction and Application


Abstract

Determining whether the event denoted by a verb actually occurs in reality is an important problem in natural language understanding: it supports applications such as event extraction and contributes to a deeper understanding of language. Although the identification of verb occurrence states has an established research basis for English, related work for Chinese remains scarce. On the one hand, there is no annotation specification for Chinese verb occurrence states; on the other hand, relevant Chinese corpora are lacking. To address the first problem, this paper builds on the English specifications, analyzes the People's Daily Chinese corpus, and combines information such as temporal cue words and sentence patterns to summarize an annotation specification for Chinese verb occurrence states. To address the second, it constructs a Chinese verb occurrence state dataset comprising 5,430 sentences and 21,226 Chinese verb instances. Experiments show that neural network models are still insufficiently accurate when classifying cases that describe objective laws or that lack temporal cue words.

Key words

Chinese verb occurrence state / dataset construction

Cite this article

XU Jin, XIN Xin. Chinese Verb Occurrence State Dataset Construction. Journal of Chinese Information Processing. 2025, 39(2): 27-40


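XU Jin (b. 1999), master's student. Main research area: natural language processing.
E-mail: 3120210994@bit.edu.cn
XIN Xin (b. 1984), corresponding author, Ph.D., associate professor. Main research area: natural language processing.
E-mail: xxin@bit.edu.cn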

Funding

National Natural Science Foundation of China (62172044)