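Semantic metadata provides semantic information about data and plays a very important role in understanding, managing, discovering, and exchanging data. With the explosive growth of data on the Internet, the demand for automatic metadata generation techniques has become more urgent. Obtaining the target semantic metadata and obtaining enough training data are the two basic problems in using automatic generation techniques. Because obtaining the target semantic metadata requires expert knowledge, and obtaining enough training data requires a great deal of manual work, these two problems are critical to building a successful system. This paper addresses both problems using Wikipedia: it extracts the target semantic metadata by analyzing the table-of-contents of the entries in a category, and it builds the training corpus by analyzing document structure and assigning the correct semantic metadata to the target structures. Experimental results show that the method solves both problems effectively and lays a solid foundation for further large-scale semantic metadata applications.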
Abstract
Semantic metadata, which provides semantic information about data, plays an important role in document management, fusion, and information search. Automatic metadata generation, which has two fundamental problems (acquiring the target semantic metadata and collecting a training corpus), is in growing demand in an era of data explosion. The first problem requires expert knowledge and the second requires a great deal of manual work, so both are critical to a successful system. In this paper, we address the two problems using Wikipedia: we extract the target metadata by analyzing the table-of-contents of Wikipedia entries, and we build the training corpus by analyzing the structure of each Wikipedia entry and assigning its true semantic metadata. The experimental results demonstrate that this approach can resolve both issues in automatic metadata generation effectively.
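The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the table-of-contents is analyzed, so the frequency threshold used here is an assumption, as are the function names, the `min_support` parameter, and the toy entries; this is not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy data: entry title -> list of (section heading, section text).
# Real input would come from the entries of one Wikipedia category.
entries = {
    "City A": [("History", "Founded long ago."), ("Geography", "Lies on a river."),
               ("Economy", "Known for trade.")],
    "City B": [("History", "Grew around a port."), ("Geography", "Coastal plain."),
               ("Sports", "Hosts a football club.")],
    "City C": [("History", "Former capital."), ("Economy", "Relies on tourism."),
               ("Culture", "Has many museums.")],
}

def extract_target_metadata(entries, min_support=0.5):
    """Return headings found in at least min_support of the entries (assumed rule)."""
    counts = Counter()
    for sections in entries.values():
        # Count each heading once per entry, ignoring case and surrounding spaces.
        counts.update({heading.strip().lower() for heading, _ in sections})
    threshold = min_support * len(entries)
    return sorted(h for h, c in counts.items() if c >= threshold)

def build_training_corpus(entries, target_metadata):
    """Pair each section's text with its heading when the heading is a target label."""
    wanted = set(target_metadata)
    corpus = []
    for sections in entries.values():
        for heading, text in sections:
            label = heading.strip().lower()
            if label in wanted:
                corpus.append((text, label))
    return corpus

targets = extract_target_metadata(entries)
corpus = build_training_corpus(entries, targets)
print(targets)
print(len(corpus))
```

In this sketch the section headings serve both as candidate metadata and as labels, so no manual annotation is needed; headings that appear in only one entry are dropped as too specific to act as category-level metadata.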
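Keywords
computer application /
Chinese information processing /
metadata /
semantic metadata /
data processing /
corpus construction /
semantic annotation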
Key words
computer application /
Chinese information processing /
metadata /
semantic metadata /
data processing /
corpus construction /
semantic annotation
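Funding
Supported by the National Natural Science Foundation of China (60673042); the National 863 Program of China (2006AA01Z144); and the Beijing Natural Science Foundation (4073043)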