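Vietnamese contains a large number of cross (overlapping) ambiguity segments. To reduce the impact of cross ambiguity on word segmentation, part-of-speech tagging, named entity recognition, and machine translation, this paper selects statistical features, contextual features, and features internal to the ambiguous segment, and builds a maximum entropy model, as an exploratory attempt, to resolve cross ambiguity in Vietnamese. A Vietnamese dictionary of 174 646 entries is compiled using three methods. Forward and backward maximum matching are then used to extract 5 377 ambiguous segments from 25 981 manually segmented Vietnamese sentences. The contribution of each of the three feature types to the disambiguation model is tested separately, and five-fold cross-validation on the ambiguous segments yields an accuracy of 87.86%. A comparison experiment with CRFs shows that the proposed method resolves Vietnamese cross ambiguity more effectively.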
Abstract
To deal with the abundant cross ambiguities in Vietnamese, this paper adopts a maximum entropy approach using selected statistical features, contextual features, and internal features of the ambiguous segments. It constructs a Vietnamese dictionary of 174 646 entries, which is used to extract 5 377 cross-ambiguity segments from 25 981 Vietnamese sentences with gold-standard segmentation. A 5-fold cross-validation experiment shows that the proposed method achieves an accuracy of 87.86%, outperforming CRFs.
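The extraction step mentioned in the abstract, comparing forward and backward maximum matching against a dictionary, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `max_len` window, and the three-syllable toy lexicon are assumptions made for illustration, and the real dictionary has 174 646 entries. A span is reported as a candidate ambiguous segment wherever the two segmentations disagree on word boundaries.

```python
from typing import List, Set, Tuple


def forward_max_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback word.
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def backward_max_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy right-to-left segmentation: take the longest lexicon word ending at each position."""
    words, j = [], len(syllables)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = " ".join(syllables[j - n:j])
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    words.reverse()
    return words


def _word_ends(words: List[str]) -> Set[int]:
    """Syllable offsets at which a word ends."""
    ends, pos = set(), 0
    for w in words:
        pos += len(w.split(" "))
        ends.add(pos)
    return ends


def ambiguous_spans(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) syllable spans where the two segmentations disagree."""
    fwd = _word_ends(forward_max_match(syllables, lexicon, max_len))
    bwd = _word_ends(backward_max_match(syllables, lexicon, max_len))
    shared = sorted((fwd & bwd) | {0})
    spans = []
    for start, end in zip(shared, shared[1:]):
        # Any boundary strictly inside the span belongs to only one segmentation.
        if any(start < b < end for b in fwd ^ bwd):
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    # Toy lexicon (hypothetical): "học sinh" = pupil, "sinh học" = biology.
    toy_lexicon = {"học sinh", "sinh học"}
    sentence = ["học", "sinh", "học"]
    print(forward_max_match(sentence, toy_lexicon))   # ['học sinh', 'học']
    print(backward_max_match(sentence, toy_lexicon))  # ['học', 'sinh học']
    print(ambiguous_spans(sentence, toy_lexicon))     # [(0, 3)]
```

In a pipeline like the one the abstract describes, each extracted span would then be passed to a classifier (a maximum entropy model in the paper) that chooses between the competing segmentations. The features used for that decision are not reproduced here.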
Key words
cross ambiguity /
disambiguation /
maximum entropy model /
Vietnamese dictionary /
CRFs
Funding
National Natural Science Foundation of China (61262041, 61472168); Natural Science Foundation of Yunnan Province (2013FA030)