Duplicate Encoding of Chinese Characters

ZHANG Xiaoheng

Journal of Chinese Information Processing, 2015, Vol. 29, Issue (4): 144-150.
Section: Language Information Processing Technology and Its Applications


Abstract

A duplicate-encoded character is a character that has been assigned two or more code points in a coding system such as Unicode. The glyphs output from these distinct code points look the same to human readers, yet the computer treats them as different characters. This human-computer inconsistency causes confusion in language information processing: information retrieval becomes incomplete, statistics become inaccurate, and the sorting and categorization of characters and words become inconsistent. Drawing on Unicode examples, this paper discusses the duplicate encoding of Chinese characters in Unicode, MS Office, and the WWW, including (a) duplicate encoding that arises when characters already encoded in the private use area are assigned new code points in the public Unihan area, (b) duplicate encoding caused by compatibility encoding, (c) duplicate encoding introduced by building dedicated lists for CJK strokes and radicals, and (d) duplicate encoding of characters in separately encoded half-width and full-width forms. Some effective solutions to these problems are also suggested.
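
The inconsistency described in the abstract can be observed with a short Python sketch. It is an illustration only, not the method proposed in the paper: it assumes that Unicode NFKC normalization is an acceptable way to fold duplicates, and the helper names `fold` and `same_after_folding` are hypothetical. The three sample pairs were chosen to illustrate categories (b), (c), and (d). Private-use code points carry no standard decomposition, so category (a) would need an application-specific mapping table instead.

```python
import unicodedata

# Illustrative pairs: (duplicate code point, the character it visually matches).
PAIRS = [
    ("\uf900", "\u8c48"),  # (b) CJK COMPATIBILITY IDEOGRAPH-F900 vs. U+8C48
    ("\u2f00", "\u4e00"),  # (c) KANGXI RADICAL ONE vs. CJK ideograph U+4E00
    ("\uff21", "A"),       # (d) FULLWIDTH LATIN CAPITAL LETTER A vs. U+0041
]


def fold(text: str) -> str:
    """Map compatibility variants to their base characters (assumes NFKC is suitable)."""
    return unicodedata.normalize("NFKC", text)


def same_after_folding(a: str, b: str) -> bool:
    """Compare two strings after folding, so that look-alike code points match."""
    return fold(a) == fold(b)


if __name__ == "__main__":
    for dup, base in PAIRS:
        print(
            f"U+{ord(dup):04X} {unicodedata.name(dup)}: "
            f"raw match = {dup == base}, "
            f"folded match = {same_after_folding(dup, base)}"
        )
```

A raw comparison treats each pair as two different characters, which is how a search or a frequency count can miss occurrences; comparing the folded strings restores the match. Folding of this kind changes the text, so it is usually better applied to search keys or index terms than to the stored document.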

Keywords

Chinese characters / duplicate encoding / Unicode

Cite this article

ZHANG Xiaoheng. Duplicate Encoding of Chinese Characters. Journal of Chinese Information Processing. 2015, 29(4): 144-150


Funding

PolyU RGC Direct Allocation Fund. Project Account Code: A-PK14