Noise Sensitivity in Learning to Rank

NIU Shuzi, CHENG Xueqi, GUO Jiafeng

PDF(3685 KB)
Journal of Chinese Information Processing, 2012, Vol. 26, Issue (5): 53-59.
Review


Noise Sensitivity in Learning to Rank

  • NIU Shuzi, CHENG Xueqi, GUO Jiafeng

Abstract

Learning to rank is one of the current research hotspots in information retrieval. To mitigate the influence of noise in the training set, current learning-to-rank algorithms pay considerable attention to robustness. Prior work has found that the same learning-to-rank method can exhibit completely different noise sensitivities on different data sets. A change in the learned model is the direct cause of the performance degradation, and since the model is learned from the training set, the root cause lies in certain properties of the training data. By analyzing concrete learning-to-rank scenarios, this paper concludes that the fundamental factor affecting noise sensitivity is the distribution of document pairs in the training set, and experiments on LETOR 3.0 verify this conclusion.

Abstract

Learning to rank is one of the most attractive areas in information retrieval. Much attention has been paid to the robustness of ranking algorithms against noise, which is inevitable in the training set. Previous work observes that the ranking performance of the same algorithm shows totally different noise sensitivities on different data sets. The performance degradation of ranking models boils down to the training set; thus, the underlying reason for the different sensitivities lies in some attribute of the training data. Experimental results on LETOR 3.0 suggest that if the document pairs of the same training set scatter more dispersedly, the model learned from this training set is less influenced by erroneous document pairs, and the training set is thus less sensitive to noise.
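The mechanism described above can be illustrated with a minimal sketch. This is not the paper's code; it assumes binary relevance labels and the standard pairwise reduction used by algorithms such as Ranking SVM: each query's graded labels are expanded into preference pairs, and flipping a fraction of labels (simulating annotation noise) corrupts some of those pairs. The function names `make_pairs` and `pair_noise_rate` are hypothetical.

```python
# Sketch: how label noise propagates to preference pairs in pairwise
# learning to rank. Assumes binary relevance labels (0/1).
import itertools
import random

def make_pairs(labels):
    """All (i, j) index pairs where document i should rank above document j."""
    return [(i, j)
            for i, j in itertools.permutations(range(len(labels)), 2)
            if labels[i] > labels[j]]

def pair_noise_rate(labels, flip_rate, seed=0):
    """Fraction of clean preference pairs lost or reversed after flipping
    each binary label independently with probability flip_rate."""
    rng = random.Random(seed)
    noisy = [1 - y if rng.random() < flip_rate else y for y in labels]
    clean = set(make_pairs(labels))
    kept = clean & set(make_pairs(noisy))
    return 1.0 - len(kept) / max(len(clean), 1)

labels = [1, 1, 0, 0, 0]            # relevance judgments for one query
print(len(make_pairs(labels)))      # 6 preference pairs (2 relevant x 3 not)
print(pair_noise_rate(labels, flip_rate=0.3))
```

The point of the sketch is that the same label-flip rate can translate into very different pair-level corruption rates depending on how the document pairs are distributed, which is the quantity the paper argues governs noise sensitivity.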

Key words

learning to rank / data quality / noise sensitivity

Cite This Article

NIU Shuzi, CHENG Xueqi, GUO Jiafeng. Noise Sensitivity in Learning to Rank. Journal of Chinese Information Processing, 2012, 26(5): 53-59.

References

[1] Sheng V.S., et al. Get another label? Improving data quality and data mining using multiple, noisy labelers[C]//Proceedings of the 14th ACM SIGKDD. New York: ACM, 2008: 614-622.
[2] Xu Jingfang, Chen Chuanliang, Xu Gu, et al. Improving quality of training data for learning to rank using click-through data[C]//Proceedings of the third WSDM. New York: ACM, 2010: 171-180.
[3] Nettleton D. F., Orriols-Puig A., Fornells A., et al. A study of the effect of different types of noise on the precision of supervised learning techniques [J]. Artificial Intelligence Review, 2010, 33: 275-306.
[4] Chapelle O., Chang Yi, Liu Tie-Yan. Future directions in learning to rank [J]. Journal of Machine Learning Research, 2011, 14: 91-100.
[5] Tsivtsivadze E., Cseke B., Heskes T. Kernel Principal Component Ranking: Robust Ranking on Noisy Data[C]//Proceedings of the ECML/PKDD-Workshop on Preference Learning. Pascal Lecture Series, 2009: 101-113.
[6] Carvalho V. R., Elsas J. L., Cohen W. W., et al. Suppressing outliers in pairwise preference ranking[C]//Proceedings of the 17th CIKM. New York: ACM, 2008: 1487-1488.
[7] Aslam J. A., Kanoulas E., Pavlu V., et al. Document selection methodologies for efficient and effective learning-to-rank[C]//Proceedings of the 32nd international ACM SIGIR,New York: ACM, 2009: 468-475.
[8] Geng Xiubo, Qin Tao, Liu Tie-Yan, et al. Selecting optimal training data for learning to rank [J]. Information Processing & Management, 2011, 47(5): 730-741.
[9] Yang Hui, Mityagin A., Svore K. M., et al. Collecting high quality overlapping labels at low cost[C]//Proceedings of the 33rd international ACM SIGIR. New York: ACM, 2010: 459-466.
[10] Kumar A., Lease M. Learning to rank from a noisy crowd [C]//Proceedings of the 34th international ACM SIGIR. New York: ACM, 2011: 1221-1222.
[11] Kanoulas E., Savev S., Metrikov P., et al. A large-scale study of the effect of training set characteristics over learning-to-rank algorithms[C]//Proceedings of the 34th international ACM SIGIR. New York: ACM, 2011: 1243-1244.
[12] Qin Tao, Liu Tie-Yan, Xu Jun, et al. LETOR: A benchmark collection for research on learning to rank for information retrieval [J]. Information Retrieval, 2010, 13(4): 346-374.
[13] Joachims T. Optimizing search engines using click-through data [C]//Proceedings of the eighth ACM SIGKDD. New York: ACM, 2002: 133-142.
[14] Cao Zhe, Qin Tao, et al. Learning to rank: from pairwise approach to listwise approach[C]//Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007: 129-136.
[15] Verbaeten S., Van Assche A. Ensemble methods for noise elimination in classification problems[C]//Proceedings of the 4th international conference on multiple classifier systems. Berlin Heidelberg: Springer-Verlag, 2003: 317-325.
[16] Abellán J., et al. An Experimental Study about Simple Decision Trees for Bagging Ensemble on Datasets with Classification Noise[C]//Proceedings of the 10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Berlin Heidelberg: Springer-Verlag, 2009: 446-456.
[17] Tan P.N., Steinbach M., Kumar V. Introduction to Data Mining [M]. Addison-Wesley, 2005: 500.
[18] Kullback S., Leibler R.A. On information and sufficiency[J]. Annals of Mathematical Statistics, 1951, 22(1): 79-86.

Funding

Supported by the National Natural Science Foundation of China (60903139, 60873243, 60933005) and the National High-Tech R&D Program of China (863 Program) Key Projects (2010AA012502, 2010AA012503).
