Data Construction and Matching Method for the Task of Ancient Classics Reference Detection

LI Wei, SHAO Yanqiu, BI Mengxi, CUI Xiaoya

Journal of Chinese Information Processing, 2024, Vol. 38, Issue 11: 171-180.
Natural Language Understanding and Generation



Abstract

Manually annotating the references that appear in interpretations of early classics is costly in both time and labor, so an automatic reference detection method is highly desirable. Advances in natural language processing, represented by pre-trained language models, have improved the ability to process and understand text. Building on this, this paper proposes several unsupervised baseline methods that use expert knowledge or the semantic understanding ability of deep learning to automatically detect references to early classics in the works of ancient thinkers. To verify the effectiveness of the proposed methods and to promote the application of natural language processing in the Digital Humanities, the paper takes the references to early Confucian classics made by the Song dynasty Neo-Confucians known as the Two Chengs (Cheng Hao and Cheng Yi) as a case study, and constructs and releases a manually labeled reference detection dataset. Experiments show that the proposed ensemble method achieves a ROC-AUC of 87.83% for sentence-level (short-clause) reference detection and 91.02% for paragraph-level reference detection.
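
The abstract reports ROC-AUC scores for unsupervised reference detection without giving implementation details on this page. As a rough illustration only, the sketch below scores each commentary sentence by its character-bigram overlap with a small set of classic passages and then computes ROC-AUC from those scores. The bigram-containment measure, the function names, and the commentary sentences are all assumptions made for illustration; they are not the paper's methods or its dataset.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams in text."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_score(sentence: str, classics: List[str], n: int = 2) -> float:
    """Best fraction of the sentence's n-grams found in any single classic passage."""
    grams = char_ngrams(sentence, n)
    if not grams:
        return 0.0
    return max(
        (len(grams & char_ngrams(passage, n)) / len(grams) for passage in classics),
        default=0.0,
    )


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC-AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Two short passages from the Analects serve as the "classics" side.
classics = ["學而時習之不亦說乎", "克己復禮為仁"]

# Hypothetical commentary sentences, invented for this sketch (1 = contains a reference).
sentences = ["學而時習之者學之道也", "克己復禮則仁矣", "今日天氣甚好", "此書共有十卷"]
labels = [1, 1, 0, 0]

scores = [overlap_score(s, classics) for s in sentences]
auc = roc_auc(labels, scores)
print(f"ROC-AUC on toy data: {auc:.2f}")
```

Because the score is a ranking signal rather than a hard decision, ROC-AUC can be computed without choosing a threshold, which suits unsupervised baselines of the kind the abstract describes. A semantic variant could replace `overlap_score` with cosine similarity between sentence embeddings from a pre-trained model.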

Key words

reference detection / digital humanities / ancient classics

Cite this article

LI Wei, SHAO Yanqiu, BI Mengxi, CUI Xiaoya. Data Construction and Matching Method for the Task of Ancient Classics Reference Detection. Journal of Chinese Information Processing. 2024, 38(11): 171-180


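Funding

National Natural Science Foundation of China (62306045); Fundamental Research Funds for the Central Universities (funded output of the Jiangsu Provincial Think Tank on Moral Development, 2242024S30007)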