XUE Yang, LIANG Xun, XIE Hualun, DU Wei. An Analysis of Authorship of A Dream of Red Mansions Based on Optimal Document Embedding[J]. Journal of Chinese Information Processing, 2020, 34(9): 97-110.
An Analysis of Authorship of A Dream of Red Mansions Based on Optimal Document Embedding
XUE Yang, LIANG Xun, XIE Hualun, DU Wei
1. School of Information, Renmin University of China, Beijing 100872, China; 2. State Key Laboratory of Digital Publishing Technology, Peking University Founder Group Corp, Beijing 100871, China
Abstract: A document embedding model is designed and trained on a corpus of 51 literary works, spanning contemporary and Ming and Qing dynasty literature and including A Dream of Red Mansions. To obtain an optimal high-dimensional document embedding vector that represents the semantic characteristics of words and document topics, the document embedding matrix and the loss function for different authors are defined based on the unitary invariance of document embedding vectors. An authorship identification method is then built by combining an unsupervised manifold-learning dimensionality reduction algorithm with a supervised classification algorithm. Classification accuracy for known authors reaches 99.6%, and even authors with similar styles, such as Lu Yao and Chen Zhongshi, can be effectively distinguished. A variable-scale sliding window classification model is further proposed for an in-depth analysis of A Dream of Red Mansions. The results suggest that the first 80 chapters and the last 40 chapters may come from different authors, and that there are also some stylistic differences between the first 100 chapters and the last 20 chapters.
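The variable-scale sliding window idea can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes per-chapter embedding vectors are already available, and it substitutes a nearest-centroid cosine classifier for the supervised classifier used in the paper. All names (`classify_windows`, `variable_scale_scan`, `style_A`, `style_B`) and the toy vectors are hypothetical.

```python
from math import sqrt


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def classify_windows(chapter_vecs, centroids, window, step=1):
    """Label each window of consecutive chapters with its nearest centroid.

    Returns (first_chapter, last_chapter, label) tuples, 1-indexed.
    Assumption: a window is represented by the mean of its chapter vectors.
    """
    results = []
    for start in range(0, len(chapter_vecs) - window + 1, step):
        avg = mean_vector(chapter_vecs[start:start + window])
        label = max(centroids, key=lambda name: cosine(avg, centroids[name]))
        results.append((start + 1, start + window, label))
    return results


def variable_scale_scan(chapter_vecs, centroids, windows=(2, 4)):
    """Repeat the sliding-window classification at several window sizes."""
    return {w: classify_windows(chapter_vecs, centroids, w) for w in windows}


# Toy data (hypothetical): four chapters in one style, then two in another.
toy_chapters = [[1.0, 0.2]] * 4 + [[0.1, 1.0]] * 2
toy_centroids = {"style_A": [1.0, 0.0], "style_B": [0.0, 1.0]}
scan = variable_scale_scan(toy_chapters, toy_centroids, windows=(2,))
for first, last, label in scan[2]:
    print(f"chapters {first}-{last}: {label}")
```

Scanning at several window sizes shows where the predicted label changes along the chapter sequence; small windows localize a change more precisely, while larger windows smooth out chapter-level noise.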