一种支持ANSI编码的中文文本压缩算法

常为领1,方滨兴1,2,云晓春2,王树鹏2,余翔湛1

PDF(1651 KB)
PDF(1651 KB)
中文信息学报 ›› 2010, Vol. 24 ›› Issue (5) : 96-106.
综述

一种支持ANSI编码的中文文本压缩算法

  • 常为领1,方滨兴1,2,云晓春2,王树鹏2,余翔湛1
作者信息 +

An Efficient Compression Method for ANSI Coded Chinese Text

  • CHANG Weiling1, FANG Binxing1,2, YUN Xiaochun2, WANG Shupeng2, YU Xiangzhan1
Author information +
History +

摘要

该文提出了一种高效的中文文本压缩算法CRecode,算法根据中文文本中字词的概率分布特点,对中文字词根据其使用频率,采用8bit、16bit和24bit三种长度的编码重新编码,克服了Huffman编码在压缩中文数据时打乱数据中蕴含的语义信息,致使其压缩数据再压缩性差的缺点。测试中,CRecode在与现有主流压缩软件联合使用时,可提高压缩率4%到30%,最大平均压缩比可达2.86。CRecode作为独立压缩算法,压缩中文文本时可获得优于Huffman编码、接近于LZ系列算法的性能。

Abstract

After surveying the proposal for compressing Chinese text, we present in this paper a universal compression algorithm for Chinese text, CRecode, which demonstrates an accurate understanding of the properties of the ANSI coded Chinese text. CRecode highlights the importance of pre-processing work for Chineseit collect the Chinese Characters and sorts them by frequency order, then recode them into 8-bit, 16-bit or 24-bit code. CRecode can act as a pre-processing tool for ANSI coded Chinese text by all the popular compression utilities, which can improve their compression ratio from 4% to 30%.
Key wordsCRecode; data compression; Huffman; compression algorithm

关键词

CRecode / 数据压缩 / Huffman / 压缩算法

Key words

CRecode / data compression / Huffman / compression algorithm

引用本文

导出引用
常为领1,方滨兴1,2,云晓春2,王树鹏2,余翔湛1. 一种支持ANSI编码的中文文本压缩算法. 中文信息学报. 2010, 24(5): 96-106
CHANG Weiling1, FANG Binxing1,2, YUN Xiaochun2, WANG Shupeng2, YU Xiangzhan1. An Efficient Compression Method for ANSI Coded Chinese Text. Journal of Chinese Information Processing. 2010, 24(5): 96-106

参考文献

[1] Huffman, D. A. A Method for the Construction of Minimum-Redundancy Codes [C]//Proc. IRE 40, 9 (Sept.), 1952: 1098-1101.
[2] Ziviani, N., Moura, E., Navarro, G., & Baeza-Yates, R. Compression: a key for next-generation text retrieval systems[J].IEEE Computer, 2000, 33(11): 37-44.
[3] Witten, I., Moffat, A., & Bell, T. Managing gigabytes 2nd [M]. Morgan Kaufmann Publishers. 1999.
[4] Ziv, J., and Lempel, A. A Universal Algorithm for Sequential Data Compression [J]. IEEE Transactions on Information Theory, 1977, 23(3): 337-343.
[5] Ziv, J., and Lempel, A. Compression of Individual Sequences via Variable-Rate Coding [J]. IEEE Transactions on Information Theory,1978,24(5):530-536.
[6] J. A. Storer and T. G. Szymanski. Data Compression via Textual Substitution [J]. Journal of the ACM, 1982, 29: 928-951.
[7] Welch, T. A. A Technique for High-Performance Data Compression [J]. Computer, 1978, 17(6): 8-19.
[8] 王忠效.汉语文本压缩研究及其应用[J].中文信息学报,1997,11(3):57-64.
[9] 华强. 中文文本压缩的LZSSCH算法[J]. 中文信息学报, 1998, 12(1): 50-56
[10] T. Bell, J. Cleary, and I. Witten. Data compression using adaptive coding and partial string matching [J]. In: IEEE Transactions on Communications, 1984, 32 (4): 396-402.
[11] A. Moffat. Implementing the PPM data compression scheme [J]. IEEE Transactions on Communications, 1990, 38 (11): 1917-1921.
[12] Joaquín Adiego and Pablo de la Fuente. On the Use of Words as Source Alphabet Symbols in PPM [C]//Data Compression Conference (DCC’06), 2006: 435.
[13] Peiliang Wu, W. J. Teahan. Modelling Chinese For Text Compression [C]//Data Compression Conference (DCC’05), 2005: 488-488.
[14] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm [R]. Digital Systems Research Center Research Report 124, May 1994.
[15] Matt Mahoney. Rationale for a Large Text Compression Benchmark[EB/OL]. Online. Internet. Available from http://www.cs.fit.edu/~mmahoney/compression/rationale.html, 5, Jan. 2008.
[16] P. Deutsch. RFC1951: DEFLATE Compressed Data Format Specification version 1.3 [Z]. 1995.
[17] 常为领, 云晓春, 方滨兴, 王树鹏. HitIct: 中文无损压缩算法性能评估测试集[J]. 通信学报,2009,30(3):42-47.
[18] Nelson M., Gailly J. The Data Compression Book, 2nd edition[M].M&T Books,New York,NY,1995.
[19] Ghim Hwee Ong and Shell Ying Huang. Compression of Chinese Text Files Using a Multiple Four-Bit Coding Scheme [C]//IEEE Singapore ICCS/ISITA ’92, Nov. 1992: 1062-1066.
[20] Phil Vines, Justin Zobel, Compression Techniques for Chinese Text [J]. Software-Practice and Experience, 1998, 28(12):1299-1314.
[21] David Salomon. Data compression: The Complete Reference, Third Edition [M]. New York : Springer, 2004: 134-155.
[22] 北京语言学院语言教学研究所.汉语词汇的统计与分析[M].北京: 外语教学与研究出版社,1985.
[23] Wikipedia, the free encyclopedia. Huffman coding[BE/OL]. Online. Internet. Available from http://en.wikipedia.org/wiki/Huffman_coding. May 2010.
[24] Maximum Compression’s English Text compression test[BE/OL].Online.Internet.Available from http://www.maximumcompression.com,3,Sept.2009.

基金

国家重点基础研究发展计划“973” 基金资助项目(2007CB311101);国家863高技术研究发展计划基金资助项目(2009AA01A403, 2007AA01Z406, 2007AA010501,2009AA01Z437)
PDF(1651 KB)

660

Accesses

0

Citation

Detail

段落导航
相关文章

/