Quality Evaluation of Public NLP Dataset
(自然语言处理评测数据集质量评估研究)

WANG Chengwen 1,2, DONG Qingxiu 1,2, SUI Zhifang 1,2, ZHAN Weidong 1,3, CHANG Baobao 1,2, WANG Haitao 4

Journal of Chinese Information Processing (中文信息学报), 2023, Vol. 37, Issue 2: 26-40
Column: Language Resource Construction and Application

Abstract

Public NLP datasets are the bedrock of NLP evaluation: the quality of an evaluation dataset has a fundamental impact on how evaluation tasks are conducted and how evaluation metrics are applied, which makes assessing dataset quality both necessary and urgent. Based on a survey of publicly available mainstream natural language processing (NLP) datasets, this paper analyzes and summarizes eight types of problems found in them. Drawing on the quality assessment of human examinations and test papers, it then proposes evaluation metrics for datasets built around reliability, validity, and difficulty, together with evaluation methods that combine computational and operational approaches, with the aim of providing a reference for the construction, selection, and use of NLP evaluation datasets.
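
The concrete metrics behind these three dimensions are defined in the body of the paper; the abstract only names them. As a minimal illustrative sketch of the kind of computational checks this implies, and not the paper's own formulations, the Python below computes two standard statistics: Cohen's kappa as an inter-annotator agreement proxy for reliability, and classical-test-theory item difficulty as the proportion of systems answering each item correctly. All data, labels, and function names here are hypothetical.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Chance-corrected agreement between two annotators on the same items:
        # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
        # rate and p_e the agreement expected from the annotators' own label
        # distributions. A common computational proxy for dataset reliability.
        n = len(labels_a)
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
        return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

    def item_difficulty(correct_by_system):
        # Classical-test-theory difficulty: for each item, the share of systems
        # (or examinees) answering it correctly; lower values mean harder items.
        n_systems = len(correct_by_system)
        return [sum(col) / n_systems for col in zip(*correct_by_system)]

    # Hypothetical toy data: two annotators labeling 8 items, and the
    # per-item correctness of three systems on a 5-item test set.
    ann_a = ["POS", "POS", "NEG", "NEG", "POS", "NEU", "NEG", "POS"]
    ann_b = ["POS", "NEG", "NEG", "NEG", "POS", "NEU", "POS", "POS"]
    results = [
        [True, True, False, True, False],   # system 1
        [True, False, False, True, False],  # system 2
        [True, True, False, True, True],    # system 3
    ]
    print(f"reliability proxy (Cohen's kappa): {cohens_kappa(ann_a, ann_b):.3f}")
    print("item difficulty:", item_difficulty(results))

On this toy data the kappa of roughly 0.58 would conventionally be read as moderate agreement, and items every system answers correctly (difficulty 1.0) contribute little to separating the systems being evaluated.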

Key words

natural language processing / benchmark / dataset / quality evaluation

Cite this article

WANG Chengwen, DONG Qingxiu, SUI Zhifang, ZHAN Weidong, CHANG Baobao, WANG Haitao. Quality Evaluation of Public NLP Dataset. Journal of Chinese Information Processing, 2023, 37(2): 26-40.


Funding

National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0106700); National Natural Science Foundation of China (U19A2065); China Postdoctoral Science Foundation (2022M710246)