The Research and Construction of Chinese Hedge Corpus

ZHOU Huiwei, YANG Huan, ZHANG Jing, KANG Shiyong, HUANG Degen

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 6: 83-89.
Survey

Abstract

Hedges are commonly used to express uncertainty and possibility, and the information they introduce is hedged (uncertain) information. When authors cannot fully back up a statement, they typically hedge it. To extract factual information from Chinese text, hedged information must therefore be distinguished from factual information. However, annotated Chinese hedge corpora are scarce, which has held back research on detecting Chinese hedges and hedged information. This paper studies the categorization of Chinese hedges and describes the design and construction of a 24,000-sentence Chinese hedge corpus covering two domains, biomedical literature and Wikipedia. We analyze annotation agreement on the corpus and examine how hedge types relate to domain and genre. The corpus is a significant resource for research on Chinese hedge detection and on the extraction of factual information from Chinese text. It also gives linguists a knowledge base for studying hedges from semantic and pragmatic perspectives.
Key words: Chinese hedge; categorization; corpus; agreement analysis
   
   
   

Cite this article

ZHOU Huiwei, YANG Huan, ZHANG Jing, KANG Shiyong, HUANG Degen. The Research and Construction of Chinese Hedge Corpus. Journal of Chinese Information Processing. 2015, 29(6): 83-89

Funding

National Natural Science Foundation of China (61272375, 61173100)