Abstract
The shortage of real dialogue data has become a major factor limiting the performance of data-driven dialogue generation systems, especially for Chinese. To obtain a rich corpus of daily conversation, the English subtitles of English-language TV series are synchronized with the corresponding Chinese subtitles using subtitle timestamps, which yields a large number of synchronized Chinese-English bilingual subtitles. The English sentences of the bilingual subtitles are then automatically aligned with the actors' lines in the English scripts by an information retrieval method, so that the scene and speaker information in the scripts can be mapped onto each sentence pair in the bilingual subtitles. The result is a Chinese-English bilingual daily conversation corpus annotated with scenes and speakers. Using this method, this paper automatically constructs CEDAC, a multi-turn dialogue corpus close to human daily conversation that contains 978 109 pairs of bilingual utterances. A sampling analysis shows an annotation accuracy of 97.0% for scene boundaries and 91.57% for speakers. The corpus lays a good foundation for subsequent research on automatic speaker annotation of film and TV subtitles and on automatic multi-turn dialogue generation.
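The timestamp-based synchronization step can be illustrated with a minimal sketch. It is an assumption for illustration only, not the paper's implementation: the names `Cue`, `overlap`, and `synchronize`, the `min_ratio` threshold, and the sample cues are all hypothetical. Each English cue is paired with the Chinese cue that shares the most display time with it.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle entry (hypothetical structure): display interval in seconds plus text."""
    start: float
    end: float
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def synchronize(en_cues, zh_cues, min_ratio=0.5):
    """Pair each English cue with the Chinese cue it overlaps most in time.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    shorter cue; this threshold is an assumed value, not taken from the paper.
    """
    pairs = []
    for en in en_cues:
        best = max(zh_cues, key=lambda zh: overlap(en, zh), default=None)
        if best is None:
            continue
        shorter = min(en.end - en.start, best.end - best.start)
        if shorter > 0 and overlap(en, best) / shorter >= min_ratio:
            pairs.append((en.text, best.text))
    return pairs


# Made-up sample cues; the last English cue has no Chinese counterpart.
en = [Cue(1.0, 2.5, "Hello."), Cue(3.0, 4.2, "How are you?"), Cue(10.0, 11.0, "Bye.")]
zh = [Cue(1.1, 2.6, "你好。"), Cue(3.1, 4.0, "你好吗?")]
pairs = synchronize(en, zh)
```

In this sketch the unmatched English cue is simply dropped. A real pipeline would also have to handle clock offsets between subtitle files and one-to-many cue splits, which are not modeled here.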
Key words
daily dialogue corpus /
parsing of TV drama scripts /
synchronization of bilingual subtitles /
automatic alignment between scripts and subtitles
Funding
National Natural Science Foundation of China (61433018, 61373075)