A Natural Language Perspective to the Readability of Software Identifiers

WEN Dongzhen, ZHANG Fan, ZHANG Xiaokun, YANG Liang, LIN Yuan, XU Bo, LIN Hongfei

Journal of Chinese Information Processing ›› 2024, Vol. 38 ›› Issue (10): 144-154.
Natural Language Processing Applications

Abstract

Understanding source code is central to collaborative software development and maintenance, and identifiers, which account for more than half of source code, play an important role in that understanding. Traditional software engineering research has mainly sought to make identifiers easier to understand and communicate by constraining the naming process through naming conventions. Building on a survey and analysis of the naming conventions of common programming languages, this paper proposes a new criterion for evaluating the comprehensibility of software identifiers. Specifically, we first summarize the naming conventions of mainstream programming languages and, by analogy with the morpheme in natural language, propose a software-morpheme-based identifier construction process, in which constructing an identifier is viewed as generating, arranging, and concatenating software morphemes. On this basis, we propose a normativity evaluation method for software identifiers that draws on natural language corpora to measure whether an identifier is easy to understand. Finally, we validate the normativity metric on a source code comprehension dataset and on open-source projects from GitHub. The results indicate that the proposed normativity score measures the comprehensibility of software projects well.
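To make the idea of an identifier as an arrangement and concatenation of word-like parts concrete, the following is a minimal Python sketch. It is an illustration only and not the metric proposed in the paper: the splitting rule, the function names `split_identifier` and `vocabulary_coverage`, and the tiny `TOY_VOCABULARY` word list (standing in for a natural language corpus) are all assumptions made for this example.

```python
import re

# Hypothetical stand-in for a vocabulary drawn from a natural language corpus.
TOY_VOCABULARY = {
    "get", "user", "name", "max", "count", "parse", "http", "request", "index",
}


def split_identifier(identifier: str) -> list[str]:
    """Split an identifier into lowercase word-like parts.

    Handles snake_case, camelCase, acronym runs such as "HTTP", and digits.
    """
    parts = []
    # First break on underscores and any other non-alphanumeric characters.
    for chunk in re.split(r"[_\W]+", identifier):
        # Then break on case boundaries: acronym runs, capitalized or
        # lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [part.lower() for part in parts]


def vocabulary_coverage(identifier: str, vocabulary=TOY_VOCABULARY) -> float:
    """Return the fraction of non-numeric parts that appear in the vocabulary.

    This is an illustrative proxy for "looks like natural language",
    not the normativity score defined by the authors.
    """
    parts = [p for p in split_identifier(identifier) if not p.isdigit()]
    if not parts:
        return 0.0
    return sum(p in vocabulary for p in parts) / len(parts)


if __name__ == "__main__":
    for name in ("getUserName", "HTTPRequest", "max_count2", "usrNm"):
        print(name, split_identifier(name), vocabulary_coverage(name))
```

Under these assumptions, an identifier built from full words such as `getUserName` is fully covered by the vocabulary, while an abbreviated one such as `usrNm` is not covered at all. A real evaluation would replace the toy word list with statistics from a large corpus.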

Keywords

software identifiers / source code understanding / software maintenance / natural language models

Cite this article

WEN Dongzhen, ZHANG Fan, ZHANG Xiaokun, YANG Liang, LIN Yuan, XU Bo, LIN Hongfei. A Natural Language Perspective to the Readability of Software Identifiers. Journal of Chinese Information Processing. 2024, 38(10): 144-154 (original title in Chinese: 软件标识符的自然语言规范性研究)


Funding

National Natural Science Foundation of China (62076051, 62076046)