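China has the largest number of people with diabetes in the world, and that number is still growing rapidly; diabetes has become a major public health problem in China. The diabetes health management dialogue system studied in this paper serves people with diabetes by answering their everyday diabetes-related questions, yet diabetes-related data for training dialogue system models is currently lacking. To address this, the paper constructs "Diachat", the first Chinese dialogue dataset for diabetes health management with a complete annotation scheme, to support research on health management dialogue systems. Diachat collects 693 dialogues between diabetes patients and doctors from an online chat platform, comprising 4,686 sentences annotated with 6,594 dialogue acts. The dataset represents intent with a dialogue act-based representation and defines 15 act labels. Diachat also defines 6 domains that cover the topics in the corpus: Problem, Diet, Behavior, Sport, Treatment, and Profile (basic information). To support building a complete dialogue system, Diachat constructs separate dialogue states for the user side and the system side, and a dialogue goal for each dialogue. Based on Diachat, the study provides a basic implementation of the four modules of a pipeline dialogue system. The experimental results show that Diachat can support the construction of a diabetes health management dialogue system, and that every module still has considerable room for improvement.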
Abstract
China has the largest number of diabetes cases in the world, and diabetes has become a major public health problem there. A diabetes health management dialogue system answers patients' everyday questions about diabetes, but diabetes-related data for training dialogue models is currently lacking. This paper presents "Diachat", the first Chinese diabetic-doctor dialogue dataset with a complete annotation schema. Diachat consists of 693 dialogues between diabetes patients and doctors from an online chat platform, with a total of 4,686 sentences annotated with 6,594 dialogue acts. The dataset employs a dialogue act-based representation of intent and defines 15 act labels. Diachat also defines 6 domains that cover the topics in the corpus: Problem, Diet, Behavior, Sport, Treatment, and Profile. To support the construction of a complete dialogue system, Diachat provides dialogue states for the user side and the system side separately, together with a conversation goal for each dialogue. Based on Diachat, this paper gives a preliminary implementation of the four modules of a pipeline dialogue system. The experimental results show that Diachat can support the construction of a diabetes health management dialogue system, although each module still leaves considerable room for improvement.
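To make the annotation scheme concrete, the following is a minimal sketch of how one annotated sentence might be represented. The six domains and the corpus counts come from the abstract; the field names, the role names, the `Inform` act label, the `food` slot, and the example utterance are all assumptions made for illustration and are not taken from the Diachat release.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Domain(Enum):
    """The six domains named in the abstract."""
    PROBLEM = "Problem"
    DIET = "Diet"
    BEHAVIOR = "Behavior"
    SPORT = "Sport"
    TREATMENT = "Treatment"
    PROFILE = "Profile"


@dataclass(frozen=True)
class DialogueAct:
    act: str        # one of the 15 act labels; the abstract does not list them
    domain: Domain
    slot: str       # hypothetical slot name
    value: str      # hypothetical slot value


@dataclass
class Sentence:
    speaker: str    # "user" or "system" (assumed role names)
    text: str
    acts: List[DialogueAct] = field(default_factory=list)


# Corpus-level counts reported in the abstract.
N_DIALOGUES = 693
N_SENTENCES = 4686
N_ACTS = 6594


def corpus_ratios() -> Tuple[float, float]:
    """Return (sentences per dialogue, dialogue acts per sentence), rounded."""
    return (
        round(N_SENTENCES / N_DIALOGUES, 2),
        round(N_ACTS / N_SENTENCES, 2),
    )


# An invented utterance, not drawn from the dataset.
example = Sentence(
    speaker="user",
    text="I had watermelon after dinner.",
    acts=[DialogueAct(act="Inform", domain=Domain.DIET,
                      slot="food", value="watermelon")],
)
```

Under these assumptions, `corpus_ratios()` gives roughly 6.76 sentences per dialogue and 1.41 dialogue acts per sentence, which suggests that many sentences carry more than one act and that a per-sentence list of acts is a reasonable container.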
Key words
dialogue system /
dataset construction /
corpus annotation /
diabetes health management
Funding
National Natural Science Foundation of China (72125001, 72071054, 72293584, 72121001); China Postdoctoral Science Foundation (2016M601435)