Identification of English Functional Noun Phrases by CRFs and the Semantic Information

MA Jianjun; PEI Jiahuan; HUANG Degen

Journal of Chinese Information Processing ›› 2016, Vol. 30 ›› Issue (6): 59-66.
Review


Identification of English Functional Noun Phrases by CRFs and the Semantic Information

  • MA Jianjun1; PEI Jiahuan2; HUANG Degen2

Abstract

Noun phrase (NP) identification plays an important role in syntactic parsing, and resolving NP ambiguity is one of the bottlenecks in English-Chinese machine translation. Automatically identifying English functional noun phrases recasts the structural disambiguation of noun phrases as an NP chunking task. Functional noun phrases are noun phrases whose boundaries are determined by their syntactic functions in the clause. Using a business-domain corpus, this study identifies both the boundaries of NP chunks and their syntactic function types by refining the part-of-speech (POS) tagset and applying a conditional random fields (CRFs) model combined with semantic information. The Penn Treebank tagset is refined during pre-processing, and semantic features are added to the CRFs model mainly to improve the recognition of adjunct-type noun phrases. Experimental results show that the system achieves an F-score of 89.04% in the open test with gold-standard POS tags, and that the refined POS tagset improves NP identification, raising the F-score by 2.21% over the model that uses the Penn Treebank POS tags. Incorporating the functional NP information into the NiuTrans statistical machine translation (SMT) system slightly improves English-Chinese translation quality.
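The abstract describes the approach only at a high level, so the following is a minimal Python sketch of what CRF-style features and chunk-level F-score evaluation might look like for this task, not the authors' implementation. The function labels (`SUBJ`, `OBJ`, `ADJUNCT`), the BIO encoding, the tiny `SEMANTIC_CLASS` lexicon, the feature names, and the example sentence are all assumptions made for illustration. The refined POS tagset and the actual semantic resource are not given in the abstract.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical semantic lexicon (assumption): maps a word to a coarse semantic
# class. Time nouns are a plausible cue for adjunct-type noun phrases.
SEMANTIC_CLASS: Dict[str, str] = {"year": "TIME", "week": "TIME", "monday": "TIME"}


def token_features(words: List[str], pos: List[str], i: int) -> Dict[str, str]:
    """Build a feature dict for token i from a +/-1 token context window."""
    feats = {
        "w0": words[i].lower(),
        "p0": pos[i],
        "sem0": SEMANTIC_CLASS.get(words[i].lower(), "NONE"),
    }
    if i > 0:
        feats["w-1"] = words[i - 1].lower()
        feats["p-1"] = pos[i - 1]
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(words) - 1:
        feats["w+1"] = words[i + 1].lower()
        feats["p+1"] = pos[i + 1]
        feats["sem+1"] = SEMANTIC_CLASS.get(words[i + 1].lower(), "NONE")
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


def chunks(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Convert BIO tags (e.g. B-SUBJ, I-SUBJ, O) into (start, end, type) spans.

    An I- tag that does not continue an open chunk of the same type is ignored.
    """
    spans: Set[Tuple[int, int, str]] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f_score(gold: List[str], pred: List[str]) -> float:
    """Chunk-level F-score: a chunk counts only if boundaries and type both match."""
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence with Penn Treebank-style tags (assumed example).
words = ["The", "company", "raised", "prices", "last", "year"]
pos = ["DT", "NN", "VBD", "NNS", "JJ", "NN"]
gold = ["B-SUBJ", "I-SUBJ", "O", "B-OBJ", "B-ADJUNCT", "I-ADJUNCT"]
pred = ["B-SUBJ", "I-SUBJ", "O", "B-OBJ", "I-OBJ", "I-OBJ"]

features_for_last = token_features(words, pos, 4)
score = f_score(gold, pred)
```

In this sketch the predicted tagging merges "prices last year" into one object chunk, so only the subject chunk matches the gold standard. The `sem+1` feature on "last" (the following word "year" belongs to the assumed `TIME` class) is the kind of semantic cue that could help a model start an adjunct chunk there instead.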

Key words

functional noun phrases / noun phrase identification / CRFs / semantic information

Cite this article

MA Jianjun; PEI Jiahuan; HUANG Degen. Identification of English Functional Noun Phrases by CRFs and the Semantic Information. Journal of Chinese Information Processing. 2016, 30(6): 59-66


Funding

Humanities and Social Sciences Research Planning Fund of the Ministry of Education (13YJAZH062)