刘汇丹,诺明花,马龙龙,吴 健,贺也平. Web藏文文本资源挖掘与利用研究[J]. 中文信息学报, 2015, 29(1): 170-177.
LIU Huidan, NUO Minghua, MA Longlong, WU Jian, HE Yeping. Mining Tibetan Web Text Resources and Its Application[J]. Journal of Chinese Information Processing, 2015, 29(1): 170-177.
Web藏文文本资源挖掘与利用研究
刘汇丹1,2,诺明花1,2,马龙龙1,吴 健1,贺也平1
1. 中国科学院 软件研究所,北京 100190; 2. 中国科学院大学,北京 100049
Mining Tibetan Web Text Resources and Its Application
LIU Huidan1,2, NUO Minghua1,2, MA Longlong1, WU Jian1, HE Yeping1
1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; 2. Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
Abstract: Based on link analysis and Tibetan encoding detection, this paper mines Tibetan text resources on the internet with a crawler and analyzes the distribution of Tibetan text. The statistics show that more than 50% of the Tibetan web sites in mainland China are held by organizations in Qinghai Province, and that about 87% of the web pages belong to 31 large web sites. New web pages tend to be encoded in Unicode rather than in legacy encodings. It is practical to extract Tibetan text from the pages using natural tag information, such as HTML elements, column information, and punctuation. The extracted text can be used to build a raw corpus, a text classification corpus, an internet word/phrase corpus, and so on. Word frequency statistics and language models can also be derived from it. In addition, some bilingual corpora can be extracted.
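
As a rough illustration of the encoding detection and punctuation-based extraction mentioned in the abstract, the Python sketch below checks whether decoded text is predominantly Unicode Tibetan (code points U+0F00 to U+0FFF) and splits it at the shad mark (U+0F0D). It is not the authors' implementation: the function names, the 0.5 threshold, and the choice of the shad as the only delimiter are assumptions made for this example, and legacy (non-Unicode) Tibetan encodings are not handled.

```python
import re

# The Tibetan Unicode block spans U+0F00 to U+0FFF.
TIBETAN_CHAR = re.compile(r"[\u0F00-\u0FFF]")


def tibetan_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_CHAR.match(c))
    return tibetan / len(chars)


def is_unicode_tibetan(text: str, threshold: float = 0.5) -> bool:
    """Guess whether text is Unicode Tibetan (threshold is an assumed value)."""
    return tibetan_ratio(text) >= threshold


def split_on_shad(text: str) -> list[str]:
    """Split text into segments, each ending at one or more shad marks (U+0F0D)."""
    parts = re.findall(r"[^\u0F0D]+\u0F0D*", text)
    return [p.strip() for p in parts if p.strip()]
```

A crawler could apply `is_unicode_tibetan` to the text of each HTML element to decide which blocks to keep, then pass the kept blocks to `split_on_shad` to obtain sentence-like units for a raw corpus.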