CDCPP: Cross-Domain Chinese Punctuation Prediction

LIU Pengyuan, WANG Weikang, QIU Likun, DU Bingjie

Journal of Chinese Information Processing, 2021, Vol. 35, Issue 6: 131-140.
Section: Natural Language Understanding and Generation


Abstract

Chinese text, especially in social media and question answering, contains many erroneous or missing punctuation marks, which seriously degrades downstream natural language processing such as semantic analysis and machine translation. Existing research on punctuation prediction focuses mostly on transcripts of English conversational speech; little work addresses social media and question-answering text, and no public datasets exist for these domains. This paper is the first to propose the task of cross-domain Chinese punctuation prediction: a punctuation prediction model is first trained on large-scale news text, whose punctuation largely follows the standard conventions, and is then applied across domains to social media and question-answering text, where punctuation is irregular. Corresponding datasets are then constructed for the three domains of news, social media, and question answering. Finally, a BERT-based baseline model for punctuation prediction is implemented and evaluated on these datasets. The experimental results show that a model trained only on news text performs worse in both target domains: the drop is small in the question-answering domain but large in the Weibo domain, exceeding 20%, which indicates that cross-domain punctuation prediction is a challenging task.
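The abstract does not spell out how training examples are derived from punctuated news text. One common framing, offered here only as a minimal sketch and not as the authors' method, treats the task as per-character labeling: punctuation is stripped from the text, and each remaining character is labeled with the mark that followed it, or with a "no punctuation" label. The label inventory `PUNCT`, the `NO_PUNCT` tag, and both function names below are assumptions made for illustration.

```python
# Hypothetical label inventory; the paper's actual set of marks may differ.
PUNCT = {",", "。", "?", "!", "、", ";", ":"}
NO_PUNCT = "O"  # assumed tag for "no punctuation after this character"


def to_labeled_sequence(text):
    """Strip punctuation and label each character with the mark that followed it."""
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT:
            # Attach the mark to the preceding character. A mark with no
            # preceding character is dropped, and only the first of several
            # consecutive marks is kept.
            if labels and labels[-1] == NO_PUNCT:
                labels[-1] = ch
        elif not ch.isspace():
            chars.append(ch)
            labels.append(NO_PUNCT)
    return chars, labels


def restore(chars, labels):
    """Rebuild punctuated text from characters and predicted labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label != NO_PUNCT:
            out.append(label)
    return "".join(out)


if __name__ == "__main__":
    chars, labels = to_labeled_sequence("今天天气好,我们去公园吧。")
    print(list(zip(chars, labels)))
    print(restore(chars, labels))
```

Under this framing, a sequence-labeling model (for example, a BERT encoder with a per-character classifier) would be trained on `(chars, labels)` pairs from the news domain, and `restore` would be applied to its predictions on unpunctuated or mispunctuated social media and question-answering text.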

Key words

Chinese punctuation prediction / cross-domain / dataset

Cite this article

LIU Pengyuan, WANG Weikang, QIU Likun, DU Bingjie. CDCPP: Cross-Domain Chinese Punctuation Prediction. Journal of Chinese Information Processing. 2021, 35(6): 131-140


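Funding

Beijing Natural Science Foundation (4192057); Humanities and Social Sciences Research Planning Fund of the Ministry of Education (18YJA740030); Beijing Language and Culture University institutional project (Fundamental Research Funds for the Central Universities) (17PT05)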