Abstract
In full-length wuxia (martial arts) novels, the protagonists are mainly knights-errant and righteous men with distinctive personalities, and a nickname can summarize a character's most salient trait. Traditional named entity recognition has focused on person names, place names, organization names, and similar categories, and nickname recognition has not yet been studied. Because nicknames are an indispensable element of wuxia novels, recognizing them can also inform related research directions such as synonym identification. This paper therefore proposes a method for recognizing the nicknames that correspond to character names in wuxia novels, combining out-of-vocabulary (OOV) word extension recognition and screening with fixed sentence-pattern rules. The OOV extension recognition and screening method integrates the expansion and screening of left-neighbor strings and defines concepts such as competing nickname substrings and candidate nickname substrings. The fixed sentence-pattern rules use nickname indicator words to screen the candidate nickname substrings within an observation window. A high-frequency word list for wuxia novels and a low-frequency indicator dictionary, both derived through statistics and classification, are used to screen competing nickname substrings. Experiments show that the method is feasible and effective.
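The fixed sentence-pattern step described above can be illustrated with a minimal sketch. It assumes a small, hypothetical set of indicator words (`人称`, `外号`, `绰号`), a hypothetical window size of 12 characters to the left of the character name, and a made-up example sentence; the paper's actual indicator dictionary, window definition, and screening rules are not given in this abstract, so the code shows only the general idea, not the authors' implementation.

```python
import re

# Hypothetical indicator words; the paper's own indicator dictionary is not shown here.
INDICATORS = ("人称", "外号", "绰号")
# Characters trimmed from both ends of a candidate: quotes, punctuation, and the particle 的.
TRIM = "“”\"'‘’「」,,。:: 的"


def nickname_candidates(text, name, indicators=INDICATORS, window=12):
    """Return candidate nickname substrings found to the left of each mention of `name`."""
    candidates = []
    for match in re.finditer(re.escape(name), text):
        start = max(0, match.start() - window)
        observed = text[start:match.start()]  # observation window left of the name
        for word in indicators:
            pos = observed.rfind(word)  # nearest indicator occurrence to the name
            if pos == -1:
                continue
            # Everything between the indicator word and the name is the candidate.
            candidate = observed[pos + len(word):].strip(TRIM)
            if candidate and candidate not in candidates:
                candidates.append(candidate)
    return candidates


# Made-up sentence for illustration only.
sentence = "此人便是江湖上人称“玉面狐”的胡飞。"
print(nickname_candidates(sentence, "胡飞"))
```

In this sketch the indicator word only proposes candidates; a fuller pipeline along the lines of the abstract would then screen them, for example against a high-frequency word list.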
Key words
nickname recognition /
competing nickname substring /
high-frequency word list /
fixed sentence-pattern rules
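Funding
Open Project of the State Key Laboratory of Digital Publishing Technology, Peking University Founder Group Co., Ltd.; National Natural Science Foundation of China (71531012, 71271211); Beijing Natural Science Foundation (4172032); Research Funds of Renmin University of China (Fundamental Research Funds for the Central Universities) (19XNH120)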