[IR] BWT+MTF+AC 您所在的位置:网站首页 bwtburrows [IR] BWT+MTF+AC

[IR] BWT+MTF+AC

2023-04-05 19:05| 来源: 网络整理| 查看: 265

 

BWT (Burrows–Wheeler_transform)数据转换算法

MTF(Move-to-front transform)数据转换

基于统计的压缩算法:游程编码

良心PPT: bwt_based_compression_verbin.ppt

 

BWT Idea: 

压缩技术主要的工作方式就是找到重复的模式,进行紧密的编码。

BWT(Burrows–Wheeler_transform)将原来的文本转换为一个相似的文本,转换后使得相同的字符位置连续或者相邻;

之后可以使用其他技术如:Move-to-front transform 和 游程编码(RLE) 进行文本压缩。

 

一般压缩可以将文本先使用Burrows–Wheeler transform生成局部相关性很好的序列,再使用MTF减少信息熵,最后再进行压缩。

 

Burrows–Wheeler transform + Run-length coding

Ori:rabcabcababaabacabcabcabcababaa$BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$RLE:aab4ccac3rcba10b5a$

 

Encoding:

1. All rotations

#BANANAS S#BANANA AS#BANAN NAS#BANA ANAS#BAN NANAS#BA ANANAS#B BANANAS#

2. Sort the rows and pick up the last col.

#BANANAS ANANAS#B ANAS#BAN AS#BANAN BANANAS# NANAS#BA NAS#BANA S#BANANA

正好是个行列一致的Matrix。

ori=[#BANANAS]T

bwt=[SBNN#AAA]T

 

Decoding:

while( 循环len(bwt)次 ) {

  "前插一列,字典排序"

}

Result:

#BANANAS ANANAS#B ANAS#BAN AS#BANAN BANANAS# NANAS#BA NAS#BANA S#BANANA

 

 

MTF Idea: 

可见,BWT转换将文本转换为局部相关性很好的序列。

而刚好MTF转换就是利用了空间局部性原理减少信息熵。

即:最近访问的字符总是出现在“recently used symbols”的前面位置,如果字符的空间局部性较好,编码之后就会出现很多小的数字,如”0“或”1“。

所以,在如下List列中,使用的字符即可提前到head pos处,如此一来,index标号比较小,省空间。

这样,就方便了下一步压缩策略。

The main idea is that each symbol in the data is replaced by its index in the stack of “recently used symbols”.

For example, long sequences of identical symbols are replaced by as many zeroes, whereas when a symbol that has not been used in a long time appears, it is replaced with a large number.

Thus at the end the data is transformed into a sequence of integers; if the data exhibits a lot of local correlations, then these integers tend to be small.

 

Encoding:

先建立字符集大小的List,“recently used symbols”,这里只考虑26个小写字母a~z。

From: https://en.wikipedia.org/wiki/Move-to-front_transform

Input StreamSequenceListOperation bananaaa 1 (abcdefghijklmnopqrstuvwxyz) b放到head位置 bananaaa 1,1 (bacdefghijklmnopqrstuvwxyz) a放到head位置 bananaaa 1,1,13 (abcdefghijklmnopqrstuvwxyz) n放到head位置 bananaaa 1,1,13,1 (nabcdefghijklmopqrstuvwxyz) a放到head位置 bananaaa 1,1,13,1,1 (anbcdefghijklmopqrstuvwxyz) n放到head位置 bananaaa 1,1,13,1,1,1 (nabcdefghijklmopqrstuvwxyz) a放到head位置 bananaaa 1,1,13,1,1,1,0 (anbcdefghijklmopqrstuvwxyz) a放到head位置 bananaaa 1,1,13,1,1,1,0,0 (anbcdefghijklmopqrstuvwxyz) a放到head位置

Final

1,1,13,1,1,1,0,0

(anbcdefghijklmopqrstuvwxyz)

 

ori=[b a n  a n a a a]T

mtf=[1 1 13 1 1 1 0 0]T

 

Decoding:

Input StreamSequence    List 1,1,13,1,1,1,0,0 b (abcdefghijklmnopqrstuvwxyz) 1,1,13,1,1,1,0,0 ba (bacdefghijklmnopqrstuvwxyz) 1,1,13,1,1,1,0,0 ban (abcdefghijklmnopqrstuvwxyz) 1,1,13,1,1,1,0,0 bana (nabcdefghijklmopqrstuvwxyz) 1,1,13,1,1,1,0,0 banan (anbcdefghijklmopqrstuvwxyz) 1,1,13,1,1,1,0,0 banana (nabcdefghijklmopqrstuvwxyz) 1,1,13,1,1,1,0,0 bananaa (anbcdefghijklmopqrstuvwxyz) 1,1,13,1,1,1,0,0 bananaaa (anbcdefghijklmopqrstuvwxyz)

 

可见,这里的思想就在于,利用了变化过程中的顺序性!

这样一来,string变为了数字。如果之前的string是对应一堆大的数字的话,那么,这次变为了一堆小数字,且0,1较多,Entropy小了!

 

Burrows–Wheeler transform + Move-To-Front

Ori:rabcabcababaabacabcabcabcababaa$BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$MTF:0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,$ ARI:

 

问:MTF是否确实提高了压缩率呢?一堆小数字对压缩来说总会是好事。减小了信息熵,便自然想到了huffman或者arithmetic coding。

 

过程如下:

Input StreamSequenceList aabbbbccacccrcbaaaaaaaaaabbbbba$ 0 (abcdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0 (abcdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1 (abcdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0 (bacdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2 (bacdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0 (cbadefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2 (cbadefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1 (acbdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0 (cabdefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17 (cbadefghijklmnopqrstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1 (rcbadefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2 (rcbadefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3 (brcadefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0 (abrcdefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1 (abrcdefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0 (barcdefghijklmnopqstuvwxyz) aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1 (barcdefghijklmnopqstuvwxyz)

Final

0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1

 

 

继续采用Arithmetic coding处理:

32 in total(暂未考虑$)

0:  22/32

1:  5/32

2:  3/32

3:  1/32

17: 1/32

使用了8 Bytes。

 

计算Entropy:

计算式:(-22/32)*log2(22/32)+(-5/32)*log2(5/32)+(-3/32)*log2(3/32)+(-1/32)*log2(1/32)+(-1/32)*log2(1/32)

计算器:https://web2.0calc.com/

 

 

对比: 

大约每个字符平均采用1.43bits来进行编码,也就是:

1.43*32=45.76 约为46bits,即6Bytes

加上结尾符:6+1($) = 7Bytes

Finally, 15 Bytes 搞定!

 

RLE:aab4ccac3rcba10b5a$

共18 Bytes!

 

 

总结:

有很多种组合方式,以上只是探讨了BWT+MTF+Arithmetic Coding的效果。

 

 

BWT+Run-Length --> Run-length FM-Index

 

参考: [IR] Advanced XML Compression - XBW

都利用了同样的压缩思想(Run-length encoding)。这里,通过B、S列替代了Last列。B列更有利于压缩。

 

注意:这里是'1'为首,'0'为尾。

通过B'、S列替代了Front列。B'列更有利于压缩。但S需要在使用前排序,也就是S-->C列。

注意:S、C列中每个elem代表的是block。且相同elem内部具有保序性。

index  Last   B   S*   B'  Front   C*  1 e 1 e 1 $ $ 2 e 0 d 1 _ _ 3 d 1 _ 1 _ _ 4 _ 1 n 1 a a 5 n 1 r 1 d d 6 r 1 h 1 e e 7 r 0 t 0 e e 8 h 1 $ 1 e h 9 h 0 a 0 e n 10 t 1 e 1 h r 11 $ 1 _ 0 h t 12 a 1   1 n   13 e 1   1 r   14 e 0   0 r   15 _ 1   1 t  

 

如此这般,

B+B' = 15bits + 15bits = 2 Bytes.

S:11 Bytes

In total: 2+11+1(表len)=14 Bytes.

 

原来的方式:

Last列:15 Bytes

C Table (for Front列):row:9 * col:2 = 18 Bytes

 $     1    _ 2 a 4 d 5 e 6 h 10 n 12 r 13 t 15

In total: 15+18=33 Bytes.

 

时间换空间,还是有点效果。

 



【本文地址】

公司简介

联系我们

今日新闻

    推荐新闻

    专题文章
      CopyRight 2018-2019 实验室设备网 版权所有