Abstract
To support cross-language information retrieval, this paper models the bilingual abstracts of biomedical literature with an improved Latent Semantic Indexing (LSI) method that combines SVD and NMF matrix factorization. The model maps English and Chinese abstracts into the same semantic space without an external dictionary or knowledge base, establishing correspondences between the two languages and making retrieval in the bilingual space straightforward. The method exploits the anchor information contained in the bilingual abstract corpus and builds multiple retrieval models with different values of k. A credibility is computed for each model so that every model contributes to the similarity between a query and a document. Term-to-term, document-to-document, and term-to-document similarities are computed in the semantic space to realize cross-language retrieval of bilingual abstracts, and the experiments show good retrieval results.
Key words: computer application; Chinese information processing; improved latent semantic indexing; semantic space; cross-language IR; SVD; NMF
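The idea of mapping both languages into one space and letting several k-dimensional models vote on a similarity can be illustrated with a small sketch. The abstract does not say how SVD and NMF are combined or how each model's credibility is computed, so the code below makes loud assumptions: it uses plain truncated SVD only, folds vectors in by projecting onto the left singular vectors, and takes the per-model weights as a caller-supplied list (uniform here). The function names, the toy vocabulary, and the toy training matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Rows are terms; the first three are English, the last three Chinese.
TERMS = ["heart", "lung", "disease", "心脏", "肺", "疾病"]

# Columns are toy dual-language training abstracts (English and Chinese
# term counts stacked into a single column). Hypothetical data.
train = np.array([
    [1, 0, 1, 0],  # heart
    [0, 1, 0, 1],  # lung
    [1, 1, 0, 0],  # disease
    [1, 0, 1, 0],  # 心脏
    [0, 1, 0, 1],  # 肺
    [1, 1, 0, 0],  # 疾病
], dtype=float)


def build_models(term_doc, ks):
    """Return one truncated left-singular basis U_k for each k in ks."""
    u, _s, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return [u[:, :k] for k in ks]


def cosine(a, b, eps=1e-12):
    """Cosine similarity; eps guards against zero-length projections."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def combined_similarity(models, weights, query, doc):
    """Weighted average of per-model cosines in each k-dimensional space.

    The weights stand in for the per-model credibility; how the paper
    derives them is not stated in the abstract (assumption).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    sims = [cosine(u_k.T @ query, u_k.T @ doc) for u_k in models]
    return float(np.dot(w, sims))


def bag_of_terms(words):
    """Build a term-count vector over TERMS for a monolingual text."""
    vec = np.zeros(len(TERMS))
    for word in words:
        vec[TERMS.index(word)] += 1.0
    return vec


models = build_models(train, ks=[1, 2, 3])
uniform = [1.0, 1.0, 1.0]

query = bag_of_terms(["heart"])   # English query
doc_heart = bag_of_terms(["心脏"])  # Chinese-only document
doc_lung = bag_of_terms(["肺"])    # Chinese-only document

score_heart = combined_similarity(models, uniform, query, doc_heart)
score_lung = combined_similarity(models, uniform, query, doc_lung)
print(score_heart > score_lung)
```

In this toy setup the English query shares no surface terms with either Chinese document, so any difference between the two scores comes only from the shared space learned from the dual-language training columns. Replacing the uniform weights with a real credibility estimate, and adding the NMF step, would be needed to approximate the method the abstract describes.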
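Funding
Supported by the National Natural Science Foundation of China (60673039, 60973068); the National High Technology Research and Development Program of China (863 Program) (2006AA01Z151); and the Scientific Research Foundation for Returned Overseas Scholars and the Doctoral Program Foundation of the Ministry of Education (20090041110002).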