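This paper proposes a new keyword extraction algorithm based on the properties of small-world networks. First, a document is represented as a term network, built in the manner of a K-nearest-neighbor coupled graph. The change in clustering coefficient and the change in average shortest path length associated with each term are then introduced to measure the term's importance, and the most important terms form the candidate keyword set. Compound keywords are extracted using the positional relationships among terms in the candidate set and the part-of-speech collocation patterns of Chinese. Experimental results show that the method is feasible and effective, and that the meaning conveyed by compound keywords helps readers understand a text more readily than ordinary single keywords do.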
Abstract
In this paper, a new algorithm is proposed for extracting compound keywords from Chinese documents using small-world networks. Using a k-nearest-neighbor coupled graph, a Chinese document is first represented as a network in which nodes represent terms and edges represent co-occurrences of terms. Then two variables, the clustering coefficient increment and the average path length increment, are introduced to measure a term's importance and to generate the candidate keyword set. Based on factors such as the part-of-speech collocation between any two terms in a sentence and the adjacency of any two terms in the candidate set, related words in the candidate set are combined into compound keywords. The experimental results show that the algorithm is effective and accurate in comparison with manual keyword extraction from the same documents. The semantic representation of a document by compound keywords is far clearer than that by a set of single keywords, facilitating better comprehension of the document.
Key words: computer application; Chinese information processing; small world network; term network graph; average shortest path length increment; average clustering coefficient increment; compound keywords
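The two importance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the term network links each token to the next `span` tokens, the "increment" of a term is taken as the change in a network statistic when that term is removed, and the average path length is computed over reachable pairs only so that a disconnected network still yields a finite value. All function names and the `span` parameter are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_term_network(tokens, span=2):
    """Link each token to the next `span` tokens (assumed co-occurrence rule)."""
    graph = {t: set() for t in tokens}
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + span]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbors.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean BFS distance over reachable ordered pairs (0.0 if there are none)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def without_node(graph, node):
    """Return a copy of the graph with `node` and its edges removed."""
    return {u: nbrs - {node} for u, nbrs in graph.items() if u != node}


def term_increments(graph):
    """Map each term to (clustering change, path length change) on removal."""
    base_c = average_clustering(graph)
    base_l = average_path_length(graph)
    result = {}
    for node in graph:
        reduced = without_node(graph, node)
        result[node] = (average_clustering(reduced) - base_c,
                        average_path_length(reduced) - base_l)
    return result
```

Calling `term_increments(build_term_network(tokens))` on a segmented token list returns one pair of changes per term. How the two values are combined into a single importance score, and how the candidate set is thresholded, are left open here because the abstract does not specify them. Chinese text would first need word segmentation and part-of-speech tagging, which this sketch does not cover.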
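Funding
Supported by the National Natural Science Foundation of China (60803162); the Natural Science Foundation of Shaanxi Province (SJ08-ZT15); and the Scientific Research Program of the Shaanxi Provincial Department of Education (08JK245).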