快速中文字符串模糊匹配算法

陈开渠,赵洁,彭志威

PDF(407 KB)
PDF(407 KB)
中文信息学报 ›› 2004, Vol. 18 ›› Issue (2) : 59-66.

快速中文字符串模糊匹配算法

  • 陈开渠,赵洁,彭志威
作者信息 +

Fast Approximate String Matching for Chinese Text

  • CHEN Kai-qu,ZHAO Jie,PENG Zhi-wei
Author information +
History +

摘要

本文解决了中文字符串模糊匹配的两个主要问题:空间问题和时间问题。目前字符串模糊匹配的两个主要方法是位向量方法和过滤方法。由于汉字众多,应用位向量方法时,需要大量空间。对于某些内存很少的小型计算机,比如嵌入式系统,这将会是一个问题。本文改进了位向量方法,使其在应用于中文字符串时,空间需求降低到约5%。本文还利用汉字非常多的特点,提出一种新的基于过滤方法的中文字符串模糊匹配算法,BPM-BM,其速度比世界上最快的算法至少提高14%;在大部分情况下,是其速度的1.5~2倍。

Abstract

For now there are two effective methods to improve approximate string matching : bit-vector method and filter method. Since Chinese alphabet has many characters , it needs much computer memory for bit-vector method. This would be a problem for some little computer which has a small memory , such as embedded system. We present a new bit-vector method which needs only about 5% computer memory of original bit-vector method. And , we also utilize the fact that Chinese alphabet is very large and develop a new filter method , BPM-BM , for approximate string matching of Chinese text . It runs at least 14% faster than the known fasted algorithms. In most cases , our algorithm is even 1.5~2 times faster.

关键词

计算机应用 / 中文信息处理 / 字符串匹配 / 模糊匹配 / 中文字符串匹配

Key words

computer application / Chinese information processing / string matching / approximate matching / Chinese string matching

引用本文

导出引用
陈开渠,赵洁,彭志威. 快速中文字符串模糊匹配算法. 中文信息学报. 2004, 18(2): 59-66
CHEN Kai-qu,ZHAO Jie,PENG Zhi-wei. Fast Approximate String Matching for Chinese Text. Journal of Chinese Information Processing. 2004, 18(2): 59-66

参考文献

[1] Sellers , P. . The theory and computation of evolutionary distance : pattern recognition. Journal of Algorithms[J] , 1980 ,1 :359 - 373.
[2] Baeza-Yates , R. A. ,Gonnet , G. H. : A new approach to text searching , Communications of the ACM [J] . 35 (10) :74 - 82.
[3] Wu , S. , Manber , U. . Fast text searching allowing errors , Communications of the ACM[J] . 35 (10) : 83 - 91.
[4] Baeza-Yates , R. A. , Navarro , G. . Faster approximate string matching. Algorithmica [J] , 23 (2) : 1999 ,127 - 158.
[5] Myers , G. : A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM[J] , 46 (3) :1999 ,395 - 415.
[6] Chang , W. , Marr , T. . Approximate string matching and local similarity[A] . In : Proc. 5th Combinational Pattern Matching (CPM'94) [C] , LNCS 807 , pages 1994 ,259 - 271.
[7] Navarro , G. , Baeza-Yates , R. A. . Very fast and simple approximate string matching. Information Processing Letters[J] , 1999 ,72 :65 - 70.
[8] Navarro , G. , Raffinot , M. . Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA) [J] , 2000 ,5 (4) .
[9] Sutinen , E. , Tarhio , J. . On using q-gram locations in approximate string matching[A] . In : Proc. European Symposium on Algorithms (ESA'95) [C] , LNCS 979 , 1995 ,327 - 340.
[10] Tarhio , J. , Ukkonen , E. . Approximate Boyer-Moore string matching. SIAM Journal on computing [J] , 1993 ,22 (2) :243 - 260.
[11] Hyyr?, H. , Navarro , G. . Faster bit-parallel approximate string matching[A] . In : Proc. 13th Combination Pattern Matching (CPM'02) [C] , LNCS 2002 ,2373.
[12] Boyer , R. S. , Moore , J. S. . A fast string searching algorithm. Communications of the ACM[J] . 1977 ,20 :762 - 772.
[13] Ukkonen , E. . Algorithms for approximate string matching. Information and Contro [J] , 1985 ,64 : 100 - 118.
PDF(407 KB)

1438

Accesses

0

Citation

Detail

段落导航
相关文章

/