Abstract: Implicit discourse relation recognition is an important subtask of discourse analysis. Most existing studies assume that positive and negative samples are balanced, and use random under-sampling to keep the training data balanced. In practice, however, the training data are imbalanced, which degrades the performance of implicit discourse relation recognition. To address this problem, we propose a novel implicit discourse relation recognition method based on frame semantic vectors. First, we represent each argument as a frame semantic vector using the FrameNet resource, and then mine effective discourse relation samples from external data resources based on this representation. Finally, we add the mined samples to the original training set and run experiments on the extended data set. Evaluation on the Penn Discourse Treebank (PDTB) shows that the proposed method outperforms current mainstream imbalanced classification methods.
Key words: implicit discourse relation recognition; imbalanced data; frame semantic vectors
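The pipeline summarized in the abstract (frame-based argument representation, sample mining, training set extension) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy word-to-frame lexicon stands in for a real FrameNet lookup, the vector is a simple bag of frame counts, cosine similarity is used as the matching criterion, and the names `frame_vector`, `pair_similarity`, `mine_samples` and the `threshold` value are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon (word -> frame label). A real system would query
# FrameNet; these entries are illustrative placeholders only.
TOY_FRAME_LEXICON = {
    "bought": "Commerce_buy",
    "purchased": "Commerce_buy",
    "sold": "Commerce_sell",
    "rained": "Precipitation",
    "snowed": "Precipitation",
}


def frame_vector(argument):
    """Represent an argument as a bag of the frames evoked by its words."""
    tokens = argument.lower().split()
    return Counter(TOY_FRAME_LEXICON[t] for t in tokens if t in TOY_FRAME_LEXICON)


def cosine(u, v):
    """Cosine similarity between two sparse frame-count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_similarity(pair_a, pair_b):
    """Average the similarity of the first arguments and of the second arguments."""
    (a1, a2), (b1, b2) = pair_a, pair_b
    sim_first = cosine(frame_vector(a1), frame_vector(b1))
    sim_second = cosine(frame_vector(a2), frame_vector(b2))
    return 0.5 * (sim_first + sim_second)


def mine_samples(seeds, candidates, threshold=0.8):
    """Label each unlabeled candidate pair with the relation of its most
    similar seed pair, keeping it only if the similarity reaches the threshold."""
    mined = []
    for cand in candidates:
        best_label, best_sim = None, 0.0
        for seed_pair, label in seeds:
            sim = pair_similarity(seed_pair, cand)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None and best_sim >= threshold:
            mined.append((cand, best_label))
    return mined


# Labeled seed pairs (argument 1, argument 2) with their relation sense.
seeds = [(("it rained all day", "she bought an umbrella"), "Contingency")]

# Unlabeled argument pairs standing in for the external data resource.
candidates = [
    ("it snowed overnight", "he purchased boots"),
    ("they sold the house", "it rained later"),
]

# Extend the original training data with the mined samples.
extended = seeds + mine_samples(seeds, candidates)
```

In this sketch the first candidate shares frames with the seed in both argument positions and is added with the seed's label, while the second shares none and is discarded. A classifier would then be trained on `extended` instead of on the under-sampled original data.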