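Elementary Discourse Unit (EDU) recognition is of great importance to discourse analysis and is the foundation for building discourse structure. From the perspective of discourse cohesion, every EDU consists of two parts: the theme, which is the starting point of the information to be expressed, and the rheme, which carries the new information to be conveyed. Drawing on existing research and the characteristics of Chinese, this paper presents a Chinese EDU recognition method based on theme-rheme theory. The method recasts EDU recognition as a theme-rheme recognition problem: the positions of themes and rhemes indirectly determine the EDU boundaries, which completes EDU recognition. Because themes and rhemes exhibit clear sequential characteristics, they can be recognized with a sequence labeling method. The method pays more attention to the internal composition of an EDU as an independent discourse unit, and experiments on the Chinese Discourse Topic Corpus (CDTC) further verify its effectiveness, with an EDU recognition F1 score of 89.46%.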
Abstract
Elementary Discourse Unit (EDU) recognition is a fundamental task in discourse analysis. This paper proposes a Chinese EDU recognition approach based on theme-rheme theory, in which EDU identification is cast as a theme-rheme recognition problem. Themes and rhemes can be detected with a sequence labeling approach, and once their boundaries are obtained, they are merged to yield the EDU boundaries. In contrast to related work on EDU recognition, the proposed approach pays more attention to the internal structure of an EDU. Experiments on the Chinese Discourse Topic Corpus (CDTC) show the effectiveness of the approach, which achieves an F1 score of 89.46%.
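The merging step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tag set (`B-T`/`I-T` for theme, `B-R`/`I-R` for rheme, `O` for punctuation), the function names `merge_theme_rheme` and `span_f1`, the rule for elided themes, and the exact-match F1 are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def merge_theme_rheme(tags: List[str]) -> List[Span]:
    """Merge per-token theme/rheme tags into EDU spans.

    Assumed (hypothetical) tag set: "B-T"/"I-T" mark a theme,
    "B-R"/"I-R" mark a rheme, "O" marks punctuation.
    A new EDU starts at every "B-T", or at a "B-R" when the current
    EDU already has a rheme (i.e. the theme of the new EDU is elided).
    """
    spans: List[Span] = []
    start = None
    has_rheme = False
    for i, tag in enumerate(tags):
        if tag == "O":
            continue  # punctuation stays attached to the current EDU
        starts_new = tag == "B-T" or (tag == "B-R" and has_rheme)
        if start is None:
            start = i
        elif starts_new:
            spans.append((start, i))
            start, has_rheme = i, False
        if tag.endswith("-R"):
            has_rheme = True
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def span_f1(pred: List[Span], gold: List[Span]) -> float:
    """Exact-match F1 over EDU spans (an assumed evaluation scheme)."""
    tp = len(set(pred) & set(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Tokens: 小明 / 喜欢 / 音乐 / , / 他 / 每天 / 练习 / 钢琴 / 。
tags = ["B-T", "B-R", "I-R", "O", "B-T", "B-R", "I-R", "I-R", "O"]
edus = merge_theme_rheme(tags)
print(edus)
print(span_f1(edus, [(0, 4), (4, 9)]))
```

In this sketch the tags would come from any sequence labeler; the decoder only needs the predicted label sequence, so the EDU boundaries follow indirectly from where themes and rhemes begin.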
Key words
elementary discourse unit /
theme /
rheme /
sequence labeling
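Funding
National Natural Science Foundation of China (61876118); Artificial Intelligence Emergency Project (61751206); National Key R&D Program sub-project (2017YFB1002101)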