A Survey of Automatic Evaluation of Chatbots Based on Generative Models

ZHANG Lu1, LI Zhuohuan2, YIN Xucheng1, JIN Zanxia1

Journal of Chinese Information Processing (中文信息学报), 2021, Vol. 35, Issue 3: 24-42.

Review

Abstract

In recent years, with the development of artificial intelligence and the exploitation of ever larger amounts of data, data-driven end-to-end chatbots have advanced rapidly and attracted widespread attention from both academia and industry. However, there is still no standard automatic evaluation method for chatbots, even though automatic evaluation is essential both for assessing the quality of chatbot conversations and for the rapid iteration of chatbot systems. This paper surveys automatic evaluation methods for chatbots based on generative models. It first introduces the research background and the current state of research on automatic evaluation. It then presents automatic evaluation methods for the basic ability of chatbots, namely generating reasonable responses, and points out the advantages, disadvantages, and further development directions of each category of methods. Next, it reviews automatic evaluation methods for the extended abilities of chatbots, including generating diverse responses, conversing with a specific personality, expressing emotion, and handling conversation topics with depth and breadth. It then describes methods for evaluating the comprehensive ability of chatbots, discusses directions for developing comprehensive automatic evaluation methods, and explains how the automatic evaluation methods themselves can be evaluated. Finally, it analyzes and summarizes the difficulties and challenges of research on automatic evaluation and offers an outlook on future developments.
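As a point of reference, the sketch below (illustrative only, not taken from this survey) shows two of the simplest statistics that fall under the categories mentioned above: a word-overlap F1 between a generated reply and a human reference, of the kind used to judge whether a response is reasonable, and the Distinct-n ratio commonly used to quantify response diversity. The helper names unigram_f1 and distinct_n are this example's own, not terms from the paper.

```python
# Minimal sketch of two surface-level chatbot evaluation statistics.
# Illustrative only: function names are hypothetical, not from the survey.
from collections import Counter
from typing import Iterable, List, Tuple


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Word-overlap F1 between a generated reply and a human reference reply."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # clipped common words
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def distinct_n(responses: Iterable[str], n: int = 2) -> float:
    """Ratio of unique n-grams to all n-grams over a set of generated replies."""
    ngrams: List[Tuple[str, ...]] = []
    for resp in responses:
        tokens = resp.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    # Overlap with a single human reference (reference-based evaluation).
    print(unigram_f1("i am fine thank you", "i am fine thanks"))
    # Diversity across a batch of generated replies (reference-free evaluation).
    print(distinct_n(["i do not know", "i do not know", "maybe later"], n=2))
```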

Key words

generative model / chatbot / automatic evaluation method

Cite this article

ZHANG Lu, LI Zhuohuan, YIN Xucheng, JIN Zanxia. A Survey of Automatic Evaluation of Chatbots Based on Generative Models. Journal of Chinese Information Processing, 2021, 35(3): 24-42.
