A Length Distribution Constrained Text Segmentation for Paper Abstracts

LUO Junfan, CHEN Li, YU Zhonghua, DING Gejian, LUO Qian

Journal of Chinese Information Processing, 2017, Vol. 31, Issue 4: 138-144.
Information Extraction and Text Mining



Abstract

As a condensed expression of an article's content, the abstract carries the article's key findings and conclusions, so automatically analyzing and mining abstracts is important for making full use of the rapidly growing scientific literature. This paper studies text segmentation for abstracts, using the abstracts of Medline biomedical articles as its object of study. Observing that the discussion aspects (content blocks) of an abstract tend to be uniform in length, the paper proposes an unsupervised segmentation algorithm for abstracts that incorporates a length distribution constraint. The algorithm uses information entropy to measure the uniformity of the length distribution, combines this entropy with intra-block and inter-block semantic similarity to form the objective function, and searches for the best segmentation points by dynamic programming. The algorithm is evaluated on 8,603 Medline abstracts and compared experimentally with the most recent unsupervised segmentation algorithms in the literature. The results show that the proposed algorithm, with the added length distribution constraint, is better suited to segmenting abstracts and improves segmentation accuracy by 3%.
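The abstract names three ingredients: an entropy measure of length uniformity, intra-block and inter-block similarity, and a dynamic programming search. The following is a minimal sketch of how they could fit together, not the authors' implementation. Every concrete choice here is an assumption made for illustration: the function names, bag-of-words cosine similarity as the "semantic similarity", a fixed number of blocks `num_blocks`, the `weight` parameter on the entropy term, and the simplification that cohesion and boundary similarity are computed only between adjacent sentences.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def length_entropy(lengths):
    """Shannon entropy of the block-length distribution.

    For a fixed number of blocks it is largest when all blocks are equally long.
    """
    total = sum(lengths)
    return -sum((l / total) * math.log(l / total) for l in lengths if l > 0)


def segment(sentences, num_blocks, weight=1.0):
    """Split sentences into num_blocks contiguous blocks; return block start indices.

    Assumed objective (illustrative only): similarity of adjacent sentences
    inside blocks, minus similarity across block boundaries, plus
    weight * entropy of the block lengths.
    """
    n = len(sentences)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and the number of sentences")
    vectors = [Counter(s.lower().split()) for s in sentences]
    # adjacent[i] is the similarity between sentence i and sentence i + 1.
    adjacent = [cosine(vectors[i], vectors[i + 1]) for i in range(n - 1)]
    # prefix[i] is the sum of adjacent[0:i], so range sums cost O(1).
    prefix = [0.0]
    for sim in adjacent:
        prefix.append(prefix[-1] + sim)

    def block_score(start, end):
        """Contribution of the block sentences[start:end] to the objective."""
        cohesion = prefix[end - 1] - prefix[start]
        boundary = adjacent[start - 1] if start > 0 else 0.0
        p = (end - start) / n
        # -p * log(p) is this block's share of the length entropy.
        return cohesion - boundary - weight * p * math.log(p)

    neg_inf = float("-inf")
    # best[m][end]: best score for splitting sentences[0:end] into m blocks.
    best = [[neg_inf] * (n + 1) for _ in range(num_blocks + 1)]
    back = [[0] * (n + 1) for _ in range(num_blocks + 1)]
    best[0][0] = 0.0
    for m in range(1, num_blocks + 1):
        for end in range(m, n + 1):
            for start in range(m - 1, end):
                if best[m - 1][start] == neg_inf:
                    continue
                score = best[m - 1][start] + block_score(start, end)
                if score > best[m][end]:
                    best[m][end] = score
                    back[m][end] = start

    # Walk the back-pointers from the full text to recover block starts.
    starts = []
    end = n
    for m in range(num_blocks, 0, -1):
        end = back[m][end]
        starts.append(end)
    return starts[::-1]
```

Because the total number of sentences is fixed, the entropy decomposes into one term per block, which is what lets it sit inside an additive dynamic programming recurrence. With `num_blocks` fixed, raising `weight` pushes the search toward equally long blocks, while lowering it lets the similarity terms dominate.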

Key words

text segmentation / unsupervised / dynamic programming / biomedical / abstract text

Cite this article

LUO Junfan, CHEN Li, YU Zhonghua, DING Gejian, LUO Qian. A Length Distribution Constrained Text Segmentation for Paper Abstracts. Journal of Chinese Information Processing. 2017, 31(4): 138-144


Funding

Sichuan Province Science and Technology Support Program (2014GZ0063)