ZHENG Ze-zhi, ZHANG Pu, YANG Jian-guo
2005, 19(2): 79-86.
More and more lettered-words are being used in Chinese texts, most of them new terms or proper nouns, and this may become an increasingly evident trend. In automatic Chinese segmentation, lettered-words are usually unknown words or phrases. Our observation of lettered-words in a Chinese corpus suggests that identifying them correctly will improve the quality of Chinese segmentation, information retrieval, search technology, machine translation, etc. This paper analyzes the complex features of Chinese lettered-words and discusses the difficulties in extracting them. It presents an algorithm for the automatic identification of Chinese lettered-words, which uses a letter string as the anchor and searches its left and right contexts for the boundaries of the lettered-word. The algorithm is simple but effective. Our experiment on the People's Daily corpus (Year 2002) shows precision and recall rates of 80% and 100%, respectively.