Abstract
Using novels by Jin Yong and Gu Long as a corpus, this paper examines the stylistic differences between the two authors from the perspective of computational stylistics. It compares the text conformity and sentence fragmentation of their novels. The twelve texts are clustered using n-grams of words, n-grams of parts of speech, n-grams of punctuation marks, and six other features. In addition, principal component analysis and text classification are applied to eight features taken together. The results show substantial stylistic differences between Jin Yong's and Gu Long's novels: Jin Yong's novels show greater text conformity and are more colloquial than Gu Long's, and Jin Yong uses more dialect and slang expressions, while the expressions in Gu Long's novels are more formal. The two authors' novels also differ in syntactic structure, phrase structure, rhythm, readability, and degree of language variation.
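As a rough illustration of the kind of feature mentioned in the abstract, the sketch below counts n-grams over the punctuation marks of a text and computes a simple fragmentation proxy. It is a minimal sketch under stated assumptions, not the paper's procedure: the punctuation set, the function names, and the "mean segment length" proxy are all hypothetical choices made here, since the abstract does not define how sentence fragmentation is measured.

```python
from collections import Counter

# Illustrative set of Chinese punctuation marks (an assumption, not the paper's list).
PUNCTUATION = set(",。!?;:、…")


def punctuation_ngrams(text, n=2):
    """Count n-grams over the sequence of punctuation marks found in `text`."""
    marks = [ch for ch in text if ch in PUNCTUATION]
    return Counter(tuple(marks[i:i + n]) for i in range(len(marks) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies usable as a feature vector."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def mean_segment_length(text):
    """Average number of characters between punctuation marks.

    Hypothetical proxy for fragmentation: shorter segments mean choppier text.
    """
    segments = []
    current = 0
    for ch in text:
        if ch in PUNCTUATION:
            if current > 0:
                segments.append(current)
            current = 0
        elif not ch.isspace():
            current += 1
    if current > 0:
        segments.append(current)
    return sum(segments) / len(segments) if segments else 0.0


if __name__ == "__main__":
    sample = "他来了,他走了。你呢?"  # made-up sentence, not taken from the corpus
    print(relative_frequencies(punctuation_ngrams(sample, n=2)))
    print(mean_segment_length(sample))
```

Relative frequencies of this kind could be computed per novel and passed to a clustering or classification routine. Word and part-of-speech n-grams would additionally require a segmenter and tagger, which are omitted here.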
Key words
computational stylistics /
n-gram /
clustering /
classification /
sentence fragmentation
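Funding
Tsinghua University Humanities and Social Sciences Revitalization Fund (20145081042); National Natural Science Foundation of China (61433015)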