• Welcome to Valhalla Legends Archive.
 

Compression

Started by Arta, November 20, 2004, 06:39 PM


Arta

Disclaimer: I know nothing about compression.

I need a compression algorithm that is fast, and will provide acceptable compression for (mostly) textual data. The data in question looks like it ought to be very compressible to me. The most important thing is that the algorithm is as fast as possible. 

Can anyone advise?

Spht

#1
Quote from: Arta[vL] on November 20, 2004, 06:39 PM
Disclaimer: I know nothing about compression.

I need a compression algorithm that is fast, and will provide acceptable compression for (mostly) textual data. The data in question looks like it ought to be very compressible to me. The most important thing is that the algorithm is as fast as possible. 

Can anyone advise?

I ran into a simple algorithm a long time ago which was fairly interesting for someone who doesn't know anything about compression.  I don't recall the name of it, but it just substituted bit sequences for characters, so for example 'A' could be 1; 'B' is 10; 'C' is 11; 'D' is 100; etc.

"HELLO" would come out as 1000 101 1100 1100 1111, which packs into 04 5c cf (3 bytes).
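Spht's substitution is easy to check in a few lines. A quick sketch (Python; the function name is mine) that maps each letter to the binary form of its alphabet position and left-pads the bit string to whole bytes reproduces the 04 5c cf above:

```python
def pack_alphabet_positions(word):
    # Map 'A' -> 1, 'B' -> 10 (binary 2), 'C' -> 11, ... and concatenate.
    bits = "".join(format(ord(c) - ord("A") + 1, "b") for c in word)
    # Left-pad with zeros to a whole number of bytes, as in the example.
    bits = bits.zfill((len(bits) + 7) // 8 * 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(pack_alphabet_positions("HELLO").hex())  # 045ccf
```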

Arta

Isn't that Huffman?

(ok, I might know a tiny bit :))

PS: But I don't know how fast huffman is

K

decoding huffman is just walking nodes in a binary tree, one branch per bit, so decoding a character costs O(depth of tree), roughly O(log(n)).  Keep in mind that the number of nodes you will have will be greater than the number of characters, since data can only be stored at leaves.

encoding huffman is O(1)/character => O(n) if you already have a tree generated.  And since you're dealing with text, you could just pre-generate a tree based on the letter probabilities of the English language.
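With a pre-generated tree, encoding really is a table lookup per character. A minimal sketch (Python; the five-symbol code table is made up for illustration, not derived from English letter probabilities):

```python
# Hypothetical prefix code for a tiny five-symbol alphabet; a real
# pre-generated table would come from English letter frequencies.
CODES = {"a": "00", "b": "010", "c": "011", "d": "10", "e": "11"}

def encode(text):
    # One dictionary lookup per character: O(1)/char, O(n) overall.
    return "".join(CODES[ch] for ch in text)

print(encode("bad"))  # 0100010
```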

TehUser

How are the compressed characters read out?  For instance, how would you know that 11 is 'C' instead of "AA"?

K

#5
Quote from: TehUser on November 20, 2004, 07:47 PM
How are the compressed characters read out?  For instance, how would you know that 11 is 'C' instead of "AA"?

When the bit-path reaches a leaf you will have found the character.  There is never a case where one character's code is a prefix of another character's code, since data is only stored at leaves.

Edit:

for example, the tree:


          (root)
         /      \
       /  \    /  \
      a   / \  d   e
         b   c


Let a 0 bit represent a left branch and a 1 bit represent a right branch, beginning from the root and moving downward.  a would be encoded as 00;  b = 010; c = 011; d = 10; e = 11

practice example! ;) decode: 0100010 1110
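The walk-down-the-tree decode K describes can be sketched with nested (left, right) tuples standing in for the tree above (the tuple layout is my own):

```python
# (left, right) pairs mirror the tree above: a 0 bit goes left, a 1 bit right.
TREE = (("a", ("b", "c")), ("d", "e"))

def decode(bits, tree=TREE):
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]       # follow one branch per bit
        if isinstance(node, str):   # reached a leaf: emit and restart
            out.append(node)
            node = tree
    return "".join(out)

print(decode("01000101110"))  # the practice bits, spaces removed
```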

iago

Huffman encoding is da bomb.  Read up on it, and I have the algorithm coded in Java if you want to have a look.
This'll make an interesting test for broken AV:
QuoteX5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*


K

Quote from: iago on November 20, 2004, 07:57 PM
Huffman encoding is da bomb. Read up on it, and I have the algorithm coded in Java if you want to have a look.

Completely agree ;) -- a good source to read up on it is "Introduction to Algorithms, Second Edition," which I'm sure you can acquire from somewhere.
* K coughs at TehUser

Arta

It's *mostly* textual data - there is some binary data too. My tree would have to contain every value from 0x00 to 0xFF: does that make a difference?

From what I remember, huffman relies on using the unused bits in the data (?). I'm not sure I have any!

Banana fanna fo fanna

Huffman uses binary trees to represent bytes, just as K explained. What you want to do for good compression is put the more frequently occurring bytes (for example, in a text document the " " character occurs often) near the top of the tree, so they are represented by fewer bits than the lower-level leaves.

After this, your next compression optimization is to scan for often-repeated sequences of characters (e.g. if you're compressing HTML: <td>, <tr>, </td>, etc.) and place those in the tree as well.

HTH

iago

Huffman (the way I've done it) will take the most common character and give it the shortest binary sequence, and take the least common character, and give it the longest sequence.  It can probably be optimized to use sequences of characters ("<td>") instead of single characters, but I've never done it like that.


Kp

One possible problem with Huffman coding is that you must know the frequency of each character before compression can begin, so for optimal compression you must scan the data twice: once to compute probabilities, and again to emit the compressed stream based on the tree built from those probabilities.  This is mainly a drawback if it's a problem to retain the data (i.e. it doesn't work well for streaming, since you'd have to buffer up the entire stream to allow doing that second pass).

Also, you must pack along your Huffman dictionary so that the decompressor can determine how to lay out the tree.  Finally, choose the format of your dictionary carefully.  I've seen some people try to send the dictionary as the probability table, which can cause some interesting results if the decompressor builds the tree differently (for instance, given two nodes of equal weight, which one becomes the left node?  If the compressor and decompressor disagree, you produce garbage output).
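Kp's two passes, and the equal-weight tie-break he warns about, can be made concrete. In this sketch (Python; the sequence-number tie-break is one common fix, not something Kp prescribes), pass one counts frequencies, and every heap entry carries a monotonically increasing counter so that equal weights always pop in the same order on both ends:

```python
import heapq
from collections import Counter

def build_tree(data):
    # Pass 1: scan the whole input once just to count frequencies.
    freq = Counter(data)
    # Each entry is (weight, tiebreak, node); the tiebreak makes pops
    # deterministic when weights are equal, so a decompressor rebuilding
    # the tree from the same frequency table gets an identical shape.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)  # smallest weight -> left child
        wb, _, b = heapq.heappop(heap)  # next smallest  -> right child
        heapq.heappush(heap, (wa + wb, tick, (a, b)))
        tick += 1
    return heap[0][2]
```

Pass 2 would then walk the input again, emitting the code for each symbol.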

If these constraints are a problem in your case, look into LZ78 compression.
[19:20:23] (BotNet) <[vL]Kp> Any idiot can make a bot with CSB, and many do!

Skywing

FYI, zlib might suit your needs (and it's got a liberal license).

iago

If your data is always similar (always text with a certain occurrence of other symbols), you might be able to use a static Huffman tree.  Although it might not be completely optimal, it'll save you from having to serialize/send the entire tree.


K

Quote from: iago on November 21, 2004, 10:10 AM
Huffman (the way I've done it) will take the most common character and give it the shortest binary sequence, and take the least common character, and give it the longest sequence. It can probably be optimized to use sequences of characters ("<td>") instead of single characters, but I've never done it like that.

The standard algorithm to create the tree is (V is a min-priority queue keyed on node weight):

For each leaf node N
    Insert N into V
While Count[V] > 1
    A = PopSmallest[V]
    B = PopSmallest[V]
    X = new Node
    Left[X] = A
    Right[X] = B
    Weight[X] = Weight[A] + Weight[B]
    Insert X into V

Root = Pop[V]
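Once the loop above has produced Root, the codes fall out of a single walk from the root. A sketch in Python (node layout assumed: a leaf is its symbol, an internal node a (Left, Right) pair):

```python
def assign_codes(node, prefix=""):
    # A leaf holds its symbol; an internal node is a (left, right) pair.
    if isinstance(node, str):
        return {node: prefix or "0"}  # degenerate single-symbol tree
    left, right = node
    codes = assign_codes(left, prefix + "0")         # 0 bit = left branch
    codes.update(assign_codes(right, prefix + "1"))  # 1 bit = right branch
    return codes

# The a..e tree from earlier in the thread:
print(assign_codes((("a", ("b", "c")), ("d", "e"))))
```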