Article
LIU Hao, HONG Yu, Yao Liang, LIU Le, YAO Jianmin, ZHOU Guodong
2017, 31(2): 25-35.
Identifying and locating domain-specific bilingual websites is a crucial step for the Web-based bilingual resource construction. However, the quality of sentence pairs varies among different bilingual websites. In contrast to the existing method focusing only on the sentence internal features, we explore the sentence pairs' origin information for identifying and filtering the low-quality sentences pairs. We hypothesize that, if a website is authoritative in the target domain, it tends to contain more high-quality sentence pairs. Thus, we propose a HITS based optimization method for mining domain-specific bilingual sentence pairs. In this method, we first construct a directed-graph model based on the link-info among the websites. Secondly, we propose a HITS based method for evaluating the authority of websites. Finally, we only extract the sentence pairs from the authoritative websites, and use them to enlarge the training-set of our machine translation system. Experimented on the education domain, our system achieves improvements of 0.44% BLEU score compared with existing method. A further proposed GHITS method achieve additional improvements of 0.40% BLEU score.