Abstract
This paper combines document frequency (DF) with the CHI statistic to select features from web pages related to teaching Chinese as a second language (TCSL). On this basis, a two-step topic-relevance classifier is constructed that combines the page title and the main text. The classifier is then applied to crawling TCSL-related web pages, and experiments show substantial improvements in efficiency and recall over traditional classifiers. The classifier is now deployed as a data source for building a large-scale TCSL corpus.
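As a rough illustration of DF combined with CHI for feature selection, the following minimal Python sketch may help. It assumes one plausible combination, in which DF acts as a pre-filter threshold and the surviving terms are ranked by the standard two-class CHI statistic; the abstract does not state how the two measures are actually combined, and the names `chi_square`, `select_features`, `min_df`, and `top_k` are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """CHI statistic of one term for one class, from a 2x2 contingency table.

    a: on-topic docs containing the term
    b: off-topic docs containing the term
    c: on-topic docs without the term
    d: off-topic docs without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, min_df=2, top_k=3):
    """Assumed combination: drop terms with DF < min_df, rank the rest by CHI.

    docs: list of token lists; labels: list of bool (True = on-topic).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            (df_pos if is_pos else df_neg)[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]
        if a + b < min_df:  # DF filter: skip rare terms
            continue
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)

    # Highest CHI first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Note that CHI scores a term highly whether it is associated with on-topic or off-topic pages, so a real classifier would likely also need the direction of the association.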
Key words: DF; CHI; classifier; focused crawler