LIANG Yuhai, ZHOU Qiang. Automatic Construction of Annotated Daily Conversation Corpus Based on the Subtitles and Scripts of TV Plays[J]. Journal of Chinese Information Processing, 2020, 34(1): 23-33.
Automatic Construction of Annotated Daily Conversation Corpus Based on the Subtitles and Scripts of TV Plays
LIANG Yuhai¹, ZHOU Qiang²
1. Institute of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Center for Speech and Language Technologies, Research Institute of Information Technology; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: The shortage of human dialogue corpora, especially Chinese ones, is a key factor limiting the performance of dialogue generation systems. This paper presents the automatic construction of CEDAC, a multi-turn corpus of everyday human conversation containing 978,109 pairs of Chinese-English bilingual utterances. To build the corpus, timestamps are first used to synchronize English subtitles with the corresponding Chinese subtitles, yielding a large set of Chinese-English bilingual subtitles. The bilingual subtitles are then aligned with the utterances in the corresponding English scripts, so that the speaker and scene tags in the scripts can be mapped onto each sentence pair in the bilingual subtitles. Experiments show an accuracy of 97.0% for scene boundary annotation and 91.57% for speaker annotation. The corpus lays a foundation for subsequent research on automatic speaker annotation of subtitles and on multi-turn dialogue generation systems.
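The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the matching criteria the authors used, so everything here is an assumption for illustration: cues are paired by maximum time overlap, and script utterances are matched by character-level similarity from the standard-library `difflib`. The names `Cue`, `pair_by_time`, `tag_from_script`, the thresholds, and the sample scene and speaker labels are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Cue:
    start: float  # cue start time in seconds
    end: float    # cue end time in seconds
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def pair_by_time(en, zh, min_ratio=0.5):
    """Pair each English cue with the Chinese cue it overlaps most in time.

    Assumed criterion: the overlap must cover at least `min_ratio` of the
    shorter cue; otherwise the English cue is left unpaired.
    """
    pairs = []
    for e in en:
        best = max(zh, key=lambda z: overlap(e, z), default=None)
        if best is None:
            continue
        shorter = min(e.end - e.start, best.end - best.start)
        if shorter > 0 and overlap(e, best) / shorter >= min_ratio:
            pairs.append((e.text, best.text))
    return pairs


def tag_from_script(pairs, script, min_sim=0.6):
    """Copy (scene, speaker) from the most similar English script utterance.

    `script` is a list of (scene, speaker, utterance) tuples. Pairs with no
    sufficiently similar utterance keep None for both tags.
    """
    tagged = []
    for en_text, zh_text in pairs:
        def sim(entry):
            return SequenceMatcher(None, en_text.lower(), entry[2].lower()).ratio()

        best = max(script, key=sim, default=None)
        if best is not None and sim(best) >= min_sim:
            scene, speaker = best[0], best[1]
        else:
            scene, speaker = None, None
        tagged.append((scene, speaker, en_text, zh_text))
    return tagged


# Made-up sample data, not taken from CEDAC.
en_cues = [Cue(1.0, 2.5, "How are you?"), Cue(3.0, 4.0, "I'm fine.")]
zh_cues = [Cue(1.1, 2.4, "你好吗?"), Cue(3.1, 4.2, "我很好。")]
script = [
    ("INT. CAFE", "ANNA", "How are you?"),
    ("INT. CAFE", "BEN", "I'm fine, thanks."),
]

pairs = pair_by_time(en_cues, zh_cues)
tagged = tag_from_script(pairs, script)
```

A greedy best-match search like this ignores utterance order; a real system would more likely use an order-preserving alignment so that repeated short lines such as "Yes" are not attached to the wrong speaker.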