Language Transcription Processing: Model and Application
ZHANG Xiaoheng
2015, 29(4): 144-150.
A duplicate-encoded character is a character that has been assigned two or more code points in a coding system such as Unicode. When output under its different codes, a duplicate-encoded character shows glyphs that look the same to human users, yet the computer treats them as different characters. This human-computer inconsistency causes confusion in language information processing, resulting in incomplete information retrieval, inaccurate statistical calculation, and poor-quality data sorting and categorization. This paper discusses duplicate encoding of Chinese characters in Unicode, MS Office and the WWW, including (a) duplicate encoding arising from the assignment of new codes in the Unihan public area to characters already encoded in the private use area, (b) duplicate encoding caused by compatibility encoding, (c) duplicate encoding introduced by building dedicated lists for CJK strokes and radicals, and (d) duplicate encoding of characters in half-width and full-width forms. Some effective solutions to these problems are also suggested.
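As a rough illustration of cases (b), (c) and (d), the sketch below uses Unicode normalization from Python's standard `unicodedata` module to fold some duplicate code points onto a single one. It is not the solution proposed in the paper, which the abstract does not detail; the helper names `fold_duplicates` and `same_character` are hypothetical, and the choice of NFKC is an assumption. Case (a) is not covered, because private use area code points carry no standard decomposition and would need a vendor-specific mapping table.

```python
import unicodedata


def fold_duplicates(text: str) -> str:
    """Map compatibility-encoded code points to their NFKC equivalents.

    Hypothetical helper: NFKC is one possible folding, chosen here
    for illustration only.
    """
    return unicodedata.normalize("NFKC", text)


def same_character(a: str, b: str) -> bool:
    """Compare two strings after folding duplicate encodings."""
    return fold_duplicates(a) == fold_duplicates(b)


# Pairs of (duplicate code point, ordinary code point) that look alike.
pairs = {
    "compatibility ideograph": ("\uF900", "\u8C48"),  # case (b)
    "Kangxi radical": ("\u2F00", "\u4E00"),           # case (c)
    "full-width Latin letter": ("\uFF21", "A"),       # case (d)
}

for label, (duplicate, ordinary) in pairs.items():
    print(
        f"{label}: raw equal = {duplicate == ordinary}, "
        f"folded equal = {same_character(duplicate, ordinary)}"
    )
```

Each pair compares unequal as raw strings but equal after folding, which is the inconsistency the abstract describes. NFKC also folds characters unrelated to CJK duplicates (ligatures and superscript digits, for example), and it leaves code points without a decomposition, such as those in the CJK Strokes block, unchanged, so a production system would likely need a narrower, purpose-built mapping.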