Automatic Recognition of Sentence Boundary Based on Clause Complex

HE Xiaowen, LUO Zhiyong, HU Zijuan, WANG Ruiqi

Journal of Chinese Information Processing, 2021, Vol. 35, Issue (5): 1-8.
Language Analysis and Computation


Abstract

The grammatical structure of natural language text spans several levels: morphemes, words, phrases, clauses, clause complexes, and discourse. Processing techniques for morphemes, words, and phrases are relatively mature, but the concept of the sentence still lacks a generally accepted definition suitable for language information processing. This paper re-examines the definition of the sentence in linguistics and the problem of sentence segmentation in natural language processing, and proposes the task of Chinese sentence segmentation. Based on clause complex theory, a sentence is defined as the smallest topic self-sufficient sequence of punctuation clauses, that is, a self-sufficient topic structure. A BERT-based sentence boundary recognition model is designed and implemented. Experimental results show that the model reaches an accuracy of 88.37% and an F1 value of 83.73% on automatic sentence boundary recognition, outperforming mechanical segmentation by different punctuation marks.
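The mechanical punctuation baseline that the abstract compares against can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the punctuation inventories `CLAUSE_END` and `SENTENCE_END`, the helper names, the toy sentence, and the per-clause scoring protocol are all assumptions introduced here, and the paper's BERT model is not reproduced.

```python
# Assumed punctuation inventories (not taken from the paper).
CLAUSE_END = ",,;;::。!?!?"   # marks that close a punctuation clause
SENTENCE_END = "。!?!?"        # marks the mechanical rule treats as sentence-final


def split_punctuation_clauses(text):
    """Split text into punctuation clauses, each keeping its trailing mark."""
    clauses, current = [], ""
    for ch in text:
        current += ch
        if ch in CLAUSE_END:
            if current.strip():
                clauses.append(current.strip())
            current = ""
    if current.strip():
        clauses.append(current.strip())
    return clauses


def mechanical_boundaries(clauses, end_marks=SENTENCE_END):
    """Label a clause 1 if the punctuation rule says it ends a sentence, else 0."""
    return [1 if c and c[-1] in end_marks else 0 for c in clauses]


def boundary_scores(gold, pred):
    """Accuracy, precision, recall and F1 over per-clause boundary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example; the gold labels are hypothetical and for illustration only.
text = "小王买了一本书,很喜欢。他每天都读,读完就送给了朋友。"
clauses = split_punctuation_clauses(text)
gold = [0, 1, 0, 1]
by_period = mechanical_boundaries(clauses)                        # split at 。!? only
by_any_mark = mechanical_boundaries(clauses, end_marks=CLAUSE_END)  # split at every mark
print(boundary_scores(gold, by_period))
print(boundary_scores(gold, by_any_mark))
```

Changing `end_marks` shows how the choice of punctuation set shifts precision and recall for the mechanical rule, which is the kind of baseline a learned boundary classifier would be measured against.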

Key words

sentence / clause complex / sentence boundary recognition

Cite this article

HE Xiaowen, LUO Zhiyong, HU Zijuan, WANG Ruiqi. Automatic Recognition of Sentence Boundary Based on Clause Complex. Journal of Chinese Information Processing. 2021, 35(5): 1-8


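Funding

Graduate Innovation Fund of Beijing Language and Culture University (Fundamental Research Funds for the Central Universities) (19YCX124); National Natural Science Foundation of China (62076037)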