Annotation Scheme and Corpus Construction for Agricultural Knowledge Based on Active Learning and Crowdsourcing

JIANG Jingchi, GUAN Changhe, LIU Jie, GUAN Yi, KE Shanfeng

Journal of Chinese Information Processing, 2023, Vol. 37, Issue (1): 33-45.
Language Resource Construction and Application

Abstract

Agricultural books and online knowledge bases are written by domain experts and contain a large amount of agronomic common knowledge and farming experience; as data sources they are highly reliable, knowledge-rich, and well structured. To mine agricultural knowledge from such texts, this paper discusses issues related to agricultural named entities and entity relations and proposes, for the first time, an agricultural knowledge annotation scheme that combines active learning with crowdsourcing. Under the guidance and with the participation of agricultural experts, a multi-source annotated corpus of agricultural knowledge is constructed, covering 9 entity categories and semantic relations in 15 major categories and 37 subcategories. The agricultural book source contributes about 37,000 entities and 35,000 entity relations, and the Baidu Baike source about 11,000 entities and 15,000 entity relations, for a total of about 48,000 entities and 50,000 entity relations. In the experiments, we compare the annotation quality of the two data sources using inter-annotator agreement measures, and show on entity recognition and relation extraction that active learning reduces annotation cost and improves annotation efficiency and model training performance, laying a solid foundation for follow-up research.
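
The abstract does not say which query strategy drives the active learning loop. As a minimal sketch only, assuming a least-confidence criterion (a common choice, not confirmed as the authors' method), the following shows how a model's predictions could decide which unlabeled sentences are sent to annotators first. The function names, sample ids, and probabilities are all hypothetical.

```python
def least_confidence(probs):
    """Uncertainty score for one sample: 1 minus the top class probability."""
    return 1.0 - max(probs)


def select_for_annotation(pool, batch_size):
    """Return the ids of the `batch_size` most uncertain unlabeled samples.

    `pool` maps a sample id to the model's predicted class distribution.
    """
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs over three labels for four unlabeled sentences.
pool = {
    "s1": [0.98, 0.01, 0.01],  # model is confident, little to gain from labeling
    "s2": [0.40, 0.35, 0.25],
    "s3": [0.70, 0.20, 0.10],
    "s4": [0.34, 0.33, 0.33],  # model is nearly undecided
}
queue = select_for_annotation(pool, batch_size=2)
print(queue)
```

Sending only the most uncertain samples to annotators is the usual way active learning cuts labeling cost: sentences the model already handles confidently are skipped, and the model is retrained after each annotated batch.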

Key words

corpus construction / agricultural knowledge graph / annotation scheme

Cite this article

JIANG Jingchi, GUAN Changhe, LIU Jie, GUAN Yi, KE Shanfeng. Annotation Scheme and Corpus Construction for Agricultural Knowledge Based on Active Learning and Crowdsourcing. Journal of Chinese Information Processing. 2023, 37(1): 33-45

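Funding

2030 "New Generation Artificial Intelligence" Major Project (SQ2021AAA010643); National Natural Science Foundation for Young Scientists (NSFC62006063); Heilongjiang Provincial Postdoctoral Science Foundation (LBH-Z20015)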