Abstract

Implicit discourse relation classification is the task of automatically detecting the semantic relation between two arguments when an explicit connective is absent. Previous studies show that linguistic features effectively aid this classification. However, for lack of sufficient labeled implicit training samples, current mainstream methods cannot learn the classification features accurately, and the state-of-the-art accuracy is only about 40%. To address this problem, this paper proposes an implicit discourse relation classification method based on training-set expansion. The method first takes the original training set as seed instances and, with the help of argument vectors, mines a "parallel training set" (instances semantically and relationally consistent with the seeds) from external data resources; it then merges the parallel training set into the original training set to form an expanded training set; finally, it classifies implicit discourse relations on the expanded training set. Evaluated on the Penn Discourse Treebank (PDTB), the expanded training set improves overall system performance by 8.41% over the original training set, and average performance across the four discourse relation classes by 5.42%. Compared with the state-of-the-art method, classification accuracy improves by 6.36%.
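The mining step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: it presumes pre-trained word vectors, builds an "argument vector" by averaging word vectors, and transfers a seed's relation label to a candidate pair whose two arguments are each similar to the corresponding seed argument. The function names, the averaging scheme, the cosine measure, and the threshold are all assumptions for illustration, not the paper's actual implementation.

```python
# Sketch of training-set expansion via argument-vector similarity.
# All names and thresholds are illustrative, not the paper's method.
import numpy as np

def argument_vector(tokens, word_vecs, dim=2):
    """Average the word vectors of an argument's tokens (assumed scheme)."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity, with a zero-vector guard."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

def mine_parallel_samples(seeds, candidates, word_vecs, threshold=0.8):
    """Keep external candidate argument pairs whose two arguments are
    each similar to the corresponding argument of some labeled seed
    pair; the candidate inherits that seed's relation label."""
    expanded = []
    for cand_a1, cand_a2 in candidates:
        v1 = argument_vector(cand_a1, word_vecs)
        v2 = argument_vector(cand_a2, word_vecs)
        for (seed_a1, seed_a2), label in seeds:
            s1 = argument_vector(seed_a1, word_vecs)
            s2 = argument_vector(seed_a2, word_vecs)
            if cosine(v1, s1) >= threshold and cosine(v2, s2) >= threshold:
                expanded.append(((cand_a1, cand_a2), label))
                break  # one matching seed suffices
    return expanded
```

The expanded set returned here would then be appended to the original training set before classifier training, as the abstract describes.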
Key words: implicit discourse relation; semantic vector; training data expansion; discourse analysis
Received: 2014-12-25; Accepted: 2015-03-27
Funding: National Natural Science Foundation of China (61373097, 61272259, 61272260, 90920004); Specialized Research Fund for the Doctoral Program of Higher Education (2009321110006, 20103201110021); Natural Science Foundation of Jiangsu Province (BK2011282); Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJA520003); Natural Science Foundation of Suzhou (SH201212)
However, the performance of the proposed training-set-expansion-based implicit discourse relation classification method is still limited. The reason is that implicit argument pairs are inherently subjective and ambiguous: viewed from different perspectives, the same pair can hold different semantic relations. For example, the arguments "He worked all night yesterday" and "He slept all day today" can express either a contingency relation or a temporal relation. To address this, future work will deepen and refine the method in several directions. For training-set expansion, we will try to select less ambiguous instances as seeds with the help of an LDA topic model and discourse context; for argument-vector computation, we will adopt additional similarity measures, such as cosine similarity and Jaccard similarity; for feature representation, we will try feature selection and feature fusion. In addition, we will extend the current method to second-level discourse relation recognition to achieve finer-grained classification.

References

[1] R Prasad, N Dinesh, A Lee, et al. The Penn Discourse TreeBank 2.0[C]//Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), 2008: 2961-2968.
[2] E Miltsakaki, L Robaldo, A Lee, et al. Sense Annotation in the Penn Discourse Treebank[C]//Proceedings of the Computational Linguistics and Intelligent Text Processing. Springer Berlin Heidelberg, 2008: 275-286.
[3] E Pitler, M Raghupathy, H Mehta, et al. Easily Identifiable Discourse Relations[R]. Technical Reports (CIS), 2008: 87-90.
[4] E Pitler, A Louis, A Nenkova. Automatic Sense Prediction for Implicit Discourse Relations in Text[C]//Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-AFNLP), 2009, 2: 683-691.
[5] Z H Lin, M Y Kan, H T Ng. Recognizing Implicit Discourse Relations in the Penn Discourse Treebank[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009, 1: 343-351.
[6] R Soricut, D Marcu. Sentence Level Discourse Parsing Using Syntactic and Lexical Information[C]//Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT-NAACL), 2003: 149-156.
[7] W T Wang, J Su, C L Tan. Kernel-Based Discourse Relation Recognition with Temporal Ordering Information[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), 2010: 710-719.
[8] Z M Zhou, Y Xu, Z Y Niu, et al. Predicting Discourse Connectives for Implicit Discourse Relation Recognition[C]//Proceedings of the 23rd International Conference on Computational Linguistics (COLING): Posters, 2010: 1507-1514.
[9] J Park, C Cardie. Improving Implicit Discourse Relation Recognition Through Feature Set Optimization[C]//Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2012: 108-112.
[10] M Lan, Y Xu, Z Y Niu. Leveraging Synthetic Discourse Data via Multi-task Learning for Implicit Discourse Relation Recognition[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), 2013: 476-485.
[11] J J Li, M Carpuat, A Nenkova. Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system[C]//Proceedings of the 25th International Conference on Computational Linguistics (COLING), 2014: 577-587.
[12] X Wang, S J Li, J Li, et al. Implicit Discourse Relation Recognition by Selecting Typical Training Examples[C]//Proceedings of the 22nd International Conference on Computational Linguistics (COLING), 2012: 2757-2772.
[13] D Marcu, A Echihabi. An Unsupervised Approach to Recognizing Discourse Relations[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), 2002: 368-375.
[14] C Sporleder, A Lascarides. Using automatically labelled examples to classify rhetorical relations: An assessment[J]. Natural Language Engineering, 2008, 14(3): 369-416.
[15] G E Hinton. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society (COGSCI), 1986: 1-12.
[16] Y Bengio, R Ducharme, P Vincent, et al. A neural probabilistic language model[J]. The Journal of Machine Learning Research, 2003, 3: 1137-1155.
[17] R Socher, J Pennington, E H Huang, et al. Semi-supervised recursive autoencoders for predicting sentiment distributions[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011: 151-161.
[18] J Turian, L Ratinov, Y Bengio. Word representations: a simple and general method for semi-supervised learning[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), 2010: 384-394.
[19] Y Hong, X P Zhou, T T Che, et al. Cross-argument inference for implicit discourse relation recognition[C]//Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012: 295-304.
[20] C C Chang, C J Lin. LIBSVM: a library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 389-396.
[21] Xu Fan, Zhu Qiaoming, Zhou Guodong. Implicit discourse relation recognition based on tree kernels[J]. Journal of Software, 2013, 24(5): 1022-1035. (in Chinese)