Abstract
Corpus linguistics is the discipline that discovers and mines linguistic phenomena with the help of large-scale corpora, and many online corpora already exist to support linguistic research. This paper presents a corpus managed in time slices and, building on it, proposes a community-maintained online lexicography system that dynamically incorporates corpus query results into the dictionary entries being edited. The paper also introduces an algorithm for word sense discovery and hierarchical clustering of polysemous words, which automatically generates a default entry framework. The paper gives an overview of the lexicography system, focusing on its design and usage.
Key words
lexicography /
diachronic corpus /
system design /
word sense discovery
Funding
National Natural Science Foundation of China (61472017)