Research on Consistency Check of Sememe Annotations in HowNet

LIU Yangguang, QI Fanchao, LIU Zhiyuan, SUN Maosong

Journal of Chinese Information Processing, 2021, Vol. 35, Issue (4): 23-34.
Knowledge Representation and Knowledge Acquisition

Abstract

Sememes are defined as the minimum semantic units of human languages that cannot be further subdivided. The meaning of a word can be represented by a combination of multiple sememes. Words have previously been manually annotated with sememes to build the linguistic knowledge base HowNet, through which sememes have been applied to many natural language processing (NLP) tasks. However, manual annotation is time-consuming and labor-intensive, and annotation by different experts inevitably introduces subjective bias, which makes annotation consistency and accuracy hard to guarantee. Ensuring the consistency of sememe annotations has therefore become a primary task for building a high-quality HowNet and improving the performance of sememe-based applications. In this paper, we propose, for the first time, a method for automatically checking the consistency of existing sememe annotations in HowNet. Experimental results demonstrate that the method is effective and can be applied to the annotation consistency check and extension of HowNet.

Key words

sememe annotation / HowNet / consistency check

Cite this article

LIU Yangguang, QI Fanchao, LIU Zhiyuan, SUN Maosong. Research on Consistency Check of Sememe Annotations in HowNet. Journal of Chinese Information Processing. 2021, 35(4): 23-34

Funding

National Key Research and Development Program of China (2020AAA0106501)