Abstract
Semantic roles are important for the semantic understanding and analysis of natural language, and automatic semantic role labeling depends on good training datasets. Most existing semantic role datasets label roles imprecisely or even coarsely, which hampers tasks such as semantic parsing and knowledge extraction. To support fine-grained semantic analysis, we propose an improved taxonomy of Chinese semantic roles based on an investigation of a real-world corpus. On this basis, focusing on sentences that contain only one pivot semantic role, we propose a semi-automatic method for constructing a fine-grained Chinese semantic role dataset and build a practical dataset with it. To date, 9,550 Chinese sentences have been labeled, containing 9,423 pivot semantic roles, 29,142 principal peripheral semantic roles, and 3,745 auxiliary peripheral semantic roles; 172 sentences are double-labeled with semantic roles and 104 sentences are labeled with semantic roles of uncertain semantic events. Using a Bi-LSTM+CRF baseline model, we run benchmark experiments on principal peripheral semantic roles on both the constructed dataset and the public Chinese Proposition Bank. The experiments show that the two datasets differ in the automatic recognition of principal peripheral semantic roles, and the results provide a basis for improving recognition accuracy for these roles.
Key words
semantic role /
fine-grained semantic labeling /
Chinese semantic role labeling /
Chinese semantic analysis
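Funding
National Key Research and Development Program of China (2017YFC1700302, 2017YFB1002300); National Natural Science Foundation of China (61702234); Beijing Nova Program Interdisciplinary Cooperation Project (Z191100001119014)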