SONG Heng, CAO Cungen, WANG Ya, WANG Shi. Construction of a Finely-Grained Training Dataset for Chinese Semantic-Role Labeling[J]. Journal of Chinese Information Processing, 2022, 36(12): 52-66,73.
Construction of a Finely-Grained Training Dataset for Chinese Semantic-Role Labeling
SONG Heng1,2, CAO Cungen1, WANG Ya1,2, WANG Shi1
1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Semantic roles play an important part in natural language understanding, but most existing semantic-role training datasets label roles coarsely, and some labels are even misleading. To support fine-grained semantic analysis, we propose an improved taxonomy of Chinese semantic roles based on a study of a real-world corpus. Focusing on a corpus of sentences that contain only one pivot semantic role, we propose a semi-automatic method for constructing a fine-grained Chinese semantic-role dataset. The resulting corpus of 9,550 sentences is labeled with 9,423 pivot semantic roles, 29,142 principal peripheral semantic roles, and 3,745 auxiliary peripheral semantic roles. Among these sentences, 172 are double-labeled with semantic roles and 104 are labeled with semantic roles of uncertain semantic events. Using a Bi-LSTM+CRF model, we compare the dataset against the Chinese Proposition Bank and find differences in the recognition of principal peripheral semantic roles, which provide clues for further improvement.
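The three role tiers named in the abstract (pivot, principal peripheral, auxiliary peripheral) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names, field names, the sample sentence, the fine-grained labels (`predicate`, `agent`, `patient`, `time`), and the assignment of each label to a tier are all hypothetical and are not taken from the paper's taxonomy or annotation format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

# Tier names follow the abstract; everything else in this sketch is hypothetical.
TIERS = ("pivot", "principal_peripheral", "auxiliary_peripheral")


@dataclass
class RoleSpan:
    start: int   # index of the first token in the span (inclusive)
    end: int     # index one past the last token in the span (exclusive)
    tier: str    # one of TIERS
    label: str   # hypothetical fine-grained role label

    def __post_init__(self) -> None:
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must be non-empty with a non-negative start")


@dataclass
class LabeledSentence:
    tokens: List[str]
    roles: List[RoleSpan] = field(default_factory=list)


def tier_counts(corpus: List[LabeledSentence]) -> Counter:
    """Count labeled role spans per tier across a corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        for role in sentence.roles:
            counts[role.tier] += 1
    return counts


# Hypothetical sentence: "她 昨天 买 了 书" ("She bought a book yesterday").
example = LabeledSentence(
    tokens=["她", "昨天", "买", "了", "书"],
    roles=[
        RoleSpan(2, 3, "pivot", "predicate"),
        RoleSpan(0, 1, "principal_peripheral", "agent"),
        RoleSpan(4, 5, "principal_peripheral", "patient"),
        RoleSpan(1, 2, "auxiliary_peripheral", "time"),
    ],
)

counts = tier_counts([example])
```

Applied to a whole corpus, a tally like `tier_counts` would produce per-tier totals of the kind the abstract reports for its 9,550 sentences.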