Cross-Domain Authorship Attribution via Noun-masking

GUO Xu, QI Ruihua

Journal of Chinese Information Processing, 2023, Vol. 37, Issue (1): 160-168.
Natural Language Processing Applications


Abstract

To improve the cross-domain robustness of authorship attribution and to address the problem of transferring an author's writing patterns across domains, this paper first shows, through analysis and experiments, that nouns are highly domain dependent. It then applies a text distortion algorithm to mask nouns, which reduces the weight of the related features and forces the machine learning algorithm to fit the samples with features that are less tied to the domain, thereby improving the generalization ability of the model. In cross-domain authorship attribution experiments on 21 953 samples, the paper uses three typical authorship attribution methods, based on character N-grams, BERT, and ensemble learning, and compares attribution without masking against attribution with nouns, adjectives, verbs, adverbs, or function words masked. Attribution after masking nouns achieved relatively high evaluation scores. The experimental results indicate that noun masking can improve the cross-domain robustness of authorship attribution.
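The masking step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the text has already been segmented and part-of-speech tagged by some external tool, that noun tags start with `n` (as in several common Chinese tagsets), and that each character of a noun is replaced by `*`. The abstract does not specify the placeholder or the tagset, so the names `mask_nouns`, `char_ngrams`, and `MASK_CHAR` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MASK_CHAR = "*"  # assumed placeholder; the abstract does not name one


def mask_nouns(tagged: Iterable[Tuple[str, str]], noun_prefix: str = "n") -> str:
    """Replace every character of each noun token with MASK_CHAR.

    `tagged` holds (token, pos_tag) pairs from any segmenter/tagger.
    Tags starting with `noun_prefix` are treated as nouns (an assumption).
    """
    pieces: List[str] = []
    for token, tag in tagged:
        if tag.startswith(noun_prefix):
            # Keep the token length but hide its domain-specific content.
            pieces.append(MASK_CHAR * len(token))
        else:
            pieces.append(token)
    return "".join(pieces)


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams of the (masked) text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical pre-tagged sentence; the tags are illustrative only.
sample = [("我", "r"), ("喜欢", "v"), ("足球", "n"), ("比赛", "n")]
masked = mask_nouns(sample)
features = char_ngrams(masked, n=2)
```

After masking, topic words such as 足球 no longer contribute distinct n-grams, so a classifier trained on `features` has to rely on the remaining material (pronouns, verbs, function words, punctuation) when fitting the samples.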

Key words

authorship attribution / cross-domain / transfer learning / noun masking

Cite this article

GUO Xu, QI Ruihua. Cross-Domain Authorship Attribution via Noun-masking. Journal of Chinese Information Processing. 2023, 37(1): 160-168


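Funding

National Social Science Fund of China (15BYY028); Natural Science Foundation of Liaoning Province (2019-ZD-0513); Dalian University of Foreign Languages Research and Innovation Team (2016CXTD06)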