Other Language in/around China
LIU Huidan, NUO Minghua, MA Longlong, WU Jian, HE Yeping
2015, 29(1): 170-177.
Based on link analysis and Tibetan encoding detection, this paper focuses on mining the Tibetan text resources over the internet with a crawler, and analyzes the distribution of Tibetan text. Statistical data shows that, more than 50% inland Tibetan web sites are hold by organizations in Qinghai province, and about 87% web pages belong to 31 large web sites. People prefer to use Unicode as the encoding of their new web pages rather than legacy encodings. It is practical to to extract Tibetan text from the pages with the natural tag information, such as HTML elements, column information and punctuations. The text can be used to build raw corpus, text classification corpus, and internet word/phrase corpus and so on. Word frequency statistics and language model can also be derived. In addition, some bilingual corpus can also be extracted.