Research on a Compound Keywords Detection Method Based on Small World Model

MA Li1,2, JIAO Licheng1, BAI Lin2, ZHOU Yafu2, DONG Luobing3
(1. Institute of Intelligence Information Processing, Xidian University, Xi’an, Shaanxi 710071, China;
2. Information Center, Xi’an Institute of Post and Telecommunications, Xi’an, Shaanxi 710061, China;
3. Library, Xidian University, Xi’an, Shaanxi 710071, China)

Journal of Chinese Information Processing ›› 2009, Vol. 23 ›› Issue (3): 121-129.
Review

Abstract

This paper proposes a new algorithm, based on the properties of small-world networks, for extracting compound keywords from Chinese documents. First, a document is represented as a term network constructed in the manner of a k-nearest-neighbor coupled graph, in which each node represents a term and each edge represents the co-occurrence of two terms. Two quantities, the clustering coefficient increment and the average shortest path length increment, are then introduced to measure the importance of each term, and the most important terms form the candidate keyword set. Finally, related terms in the candidate set are combined into compound keywords according to their positional adjacency in the text and the part-of-speech collocation patterns of Chinese. Experimental results show that the method is feasible and effective in comparison with manual keyword extraction from the same documents. Compound keywords express the meaning of a document more clearly than a set of single keywords does, which makes the document easier to understand.
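The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the window-based graph construction, the node-removal definition of the two increments, the weighting parameter `alpha`, and the helper names (`build_term_graph`, `term_importance`, `compound_keywords`) are all hypothetical. Average path length is taken over reachable pairs only, which is a simplification.

```python
from collections import deque
from itertools import combinations


def build_term_graph(tokens, k=2):
    """Assumed construction: link each token to the k tokens that follow it."""
    graph = {t: set() for t in tokens}
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + k]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph


def avg_clustering(graph):
    """Mean local clustering coefficient over all nodes."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        d = len(nbrs)
        if d < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(graph)


def avg_path_length(graph):
    """Mean BFS distance over reachable ordered node pairs (simplification)."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def remove_node(graph, node):
    """Return a copy of the graph with one node and its edges deleted."""
    return {u: nbrs - {node} for u, nbrs in graph.items() if u != node}


def term_importance(graph, alpha=0.5):
    """Assumed score: weighted absolute change in both metrics on removal."""
    c0, l0 = avg_clustering(graph), avg_path_length(graph)
    scores = {}
    for node in graph:
        reduced = remove_node(graph, node)
        dc = abs(avg_clustering(reduced) - c0)
        dl = abs(avg_path_length(reduced) - l0)
        scores[node] = alpha * dc + (1 - alpha) * dl
    return scores


def compound_keywords(tagged_tokens, candidates, allowed_pairs, joiner=" "):
    """Merge adjacent candidate terms whose part-of-speech pair is allowed."""
    compounds = []
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        if w1 in candidates and w2 in candidates and (p1, p2) in allowed_pairs:
            phrase = w1 + joiner + w2
            if phrase not in compounds:
                compounds.append(phrase)
    return compounds
```

In use, the tokens of a segmented document would be passed to `build_term_graph`, the highest-scoring terms from `term_importance` would form the candidate set, and `compound_keywords` would then be run over the part-of-speech-tagged token sequence. For Chinese text, `joiner=""` is likely the more natural choice.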

Key words

computer application / Chinese information processing / small world network / term network graph / average shortest path length increment / average clustering coefficient increment / compound keywords

Cite this article

MA Li, JIAO Licheng, BAI Lin, ZHOU Yafu, DONG Luobing. Research on a Compound Keywords Detection Method Based on Small World Model. Journal of Chinese Information Processing. 2009, 23(3): 121-129


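Funding

National Natural Science Foundation of China (60803162); Natural Science Foundation of Shaanxi Province (SJ08-ZT15); Scientific Research Program of the Shaanxi Provincial Department of Education (08JK245)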