

question on compression and storage: help me be less stupid


Nod5:
ethan, mouser: thank you both for the feedback.
--- Quote ---
I urge you to delve more into compression, it really is a beautiful subject.
--- End quote ---
I'm trying! ;D

--- Quote ---
If I understand you correct, the answer is, it can.
--- End quote ---

--- Quote ---
There is nothing to stop you from compressing at the bit level, with things like strings of bits, but the question is whether it would give you higher compression than working at the higher levels, and the answer is almost never, because the chunks of redundant information (the patterns) occur at a higher level (like repeated letter pairs in English, or byte patterns representing colors, etc.). It's also the case that working at the bit level is more time-consuming for the CPU than working with bytes or words.
--- End quote ---
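To make sure I follow the point about patterns, here is roughly how I picture testing it (a quick sketch with Python's standard zlib module, where the "patterns" are just repeated byte sequences; the sample text is arbitrary):

--- Code: Python ---
# A stock compressor saves space on repeated byte sequences, and gains almost
# nothing on data with no such repeats, even though both are "just bits".
import os
import zlib

patterned = b"the quick brown fox " * 500   # plenty of repeated byte patterns
patternless = os.urandom(len(patterned))    # same size, nothing to exploit

print(len(patterned), "->", len(zlib.compress(patterned)))      # shrinks a lot
print(len(patternless), "->", len(zlib.compress(patternless)))  # barely changes
--- ---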

Hm, I fail to see how either of the two articles ethan linked to, or what mouser says about chunks/patterns, exemplifies the kind of compression I was imagining. :( Either I missed something in the text / your replies, or I was confused in my description of the kind of compression my question concerned. On the latter, I previously explicitly distinguished between "higher level"/"lower level", but I now think I really meant something more. Maybe this is a clearer way to put it: the compression I was imagining would occur at a certain LEVEL and with a certain APPROACH: (i) at the lower level (i.e. ones and zeroes) AND (ii) working on the binary string as a whole.

Let me try to clarify what I mean by (ii): I (think I) see how Huffman coding differs from the run-length encoding example I quoted above. Huffman works at the lower level. But they still seem to me similar in approach: they find a "translation table" for chunks of the original string such that the table and the new string (after translation) together are shorter than the original string. That's different from the compression I was imagining, I think. Maybe I can put it like this: we want to compress a very large text file. We do Huffman compression. Whatever optimal output Huffman gives us, it will be constituted by a string of binary states on the harddrive: 011000 ... 0010101. The compression I'm imagining would at that point jump in, see the entire string as one "mathematical entity" and try to find a much shorter mathematical representation for it. But that compression would NOT look for parts of the string that are repeated (like "10000001" in "100000011000000110000001"). Instead, it should find a shorter mathematical way to represent the same, entire string. An example of what I'm imagining, but with numerals: the entire number "28421709430404007434844970703135" can have the shorter representation "(5 to the power 45) + 10". Why can't the string of binary states, seen as one mathematical entity, be shortened like that? Would it be impossible for some reason? Or perhaps just not practical due to larger CPU and time requirements? Or is there some other reason? (Or is the question, still, confused?)  :tellme:
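For what it's worth, that example checks out, and the formula really is much shorter than the number it names (a quick sketch using Python's big integers; actually finding such a formula automatically is of course the whole problem):

--- Code: Python ---
# The 32-digit number really is 5**45 + 10, and the formula written out as
# text takes far less room than the value it names.
value = 28421709430404007434844970703135
print(value == 5**45 + 10)                                # True

formula = "5**45 + 10"
print(len(formula), "bytes of formula text")              # 10
print(len(str(value)), "decimal digits in the value")     # 32
print((value.bit_length() + 7) // 8, "bytes in binary")   # 14
--- ---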

hwtan:
I guess the problem would be to find the function.
Below is a program that encodes a number using power tables.
For some data values, you will find that in order to represent them, you would need more storage than their original form.


--- Code: Python ---
class CompressionTable:
    # powers b**1 .. b**8 for each base b in 2..8
    values = {}

    def __init__(self):
        for b in xrange(2, 9):
            CompressionTable.values[b] = [b, b**2, b**3, b**4, b**5, b**6, b**7, b**8]

    def to_func(x):
        # repeatedly find the power b**c closest to the remaining value,
        # print it together with the difference, then continue with the
        # absolute difference until (almost) nothing is left
        v = x
        while v > 1:
            lc = 0
            lb = 0
            ld = v
            for b in xrange(2, 9):
                for c in xrange(8):
                    d = v - CompressionTable.values[b][c]
                    if abs(d) < abs(ld):
                        lb = b
                        lc = c
                        ld = d
            print "%d^%d -> %d" % (lb, lc + 1, ld)
            v = abs(ld)

        print "final value of v = %d" % v
    to_func = staticmethod(to_func)

t = CompressionTable()
while True:
    v = long(raw_input())
    if v == 0:
        break
    t.to_func(v)

print "-End-"
--- ---
257
2^8 -> 1
final value of v = 1
258
2^8 -> 2
2^1 -> 0
final value of v = 0
9999
6^5 -> 2223
3^7 -> 36
6^2 -> 0
final value of v = 0
16384
4^7 -> 0
final value of v = 0
987654
7^7 -> 164111
7^6 -> 46462
6^6 -> -194
6^3 -> -22
5^2 -> -3
3^1 -> 0
final value of v = 0
9876543
7^8 -> 4111742
7^8 -> -1653059
6^8 -> -26557
8^5 -> -6211
3^8 -> -350
7^3 -> 7
7^1 -> 0
final value of v = 0
16777216
8^8 -> 0
final value of v = 0
16777219
8^8 -> 3
3^1 -> 0
final value of v = 0
1677719
6^8 -> -1897
3^7 -> -290
2^8 -> 34
2^5 -> 2
2^1 -> 0
final value of v = 0

Jibz:
Your idea is very good of course, and I think it is one that anybody who starts to look at compression has at some point :Thmbsup:.

Let's look at why no algorithm you may devise will be able to compress any large amount of data into a short formula.

You want your decompression function to be a bijection, by which I mean it should be possible to decompress into any original data, and given compressed data there should be only one possible decompressed value.

The short formula your algorithm produces will be represented in the computer in a number of bytes. Since these bytes only give you a finite number of possible values, which is small compared to the vast amount of possible values of the original large data, the decompression function cannot possibly be bijective.

Try to imagine making an algorithm that will compress any integer in the range 1-1000 down to an integer in the range 1-10, and which can decompress that value back to the original value.
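To put rough numbers on that, here is a quick counting sketch (the 1 KB input size is chosen purely for illustration):

--- Code: Python ---
# Pigeonhole counting: there are strictly fewer possible "shorter" files than
# there are possible 1 KB files, so no reversible scheme can shrink them all.
SIZE = 1024                                      # bytes, chosen for illustration
n_inputs = 256 ** SIZE                           # distinct 1 KB files
n_shorter = sum(256 ** k for k in range(SIZE))   # all files of 0..1023 bytes
print(n_shorter < n_inputs)                      # True
print(n_inputs // n_shorter)                     # about 255 inputs per shorter file
--- ---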

So any general compression algorithm that is able to compress some samples of data will have to expand other samples.

There is a fairly good explanation of this on wikipedia also: Lossless data compression must always make some files longer.

mouser:
Jibz!!!!!!!
Now I know the trick to getting you to post.. must talk about compression :)

Nod5:
ethan, jibz, thanks very much for the replies and examples and links!

ethan:
--- Quote ---
I guess the problem would be to find the function.
Below is a program that encodes a number using power tables.
For some data values, you will find that in order to represent them, you would need more storage than their original form.
--- End quote ---
Ok, now I see the complications more clearly. Still, let me try resisting some more: if the representation only becomes longer in SOME cases, doesn't that mean it would be shorter, maybe even drastically shorter, in other cases? If there were some chance of getting, for example, a 1 GB file compressed down to 5 MB or smaller, then wouldn't it in many cases be worth spending the time and CPU effort required to test whether this type of compression works?  :tellme:

Also, couldn't a compression tool be made to loop over a number of such programs/algorithms, large enough to guarantee that a drastically compressed output would be found through trial and error for any input file? Or would the problem then be that a completely impractical amount of time/CPU power is needed?
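Something like this is what I'm picturing, only with far more (and far more exotic) methods than the three off-the-shelf compressors standing in here (just a sketch of the try-them-all-and-keep-the-best idea, including the tag needed to record which method won):

--- Code: Python ---
# Try several methods, keep whichever output is smallest, and prepend one byte
# recording which method won (the decompressor needs to know).  On input with
# no structure, even the best method typically comes out a little larger.
import bz2
import lzma
import os
import zlib

methods = [zlib.compress, bz2.compress, lzma.compress]   # stand-ins for "VERY many algorithms"

def best_of(data):
    outputs = [m(data) for m in methods]
    winner = min(range(len(outputs)), key=lambda i: len(outputs[i]))
    return bytes([winner]) + outputs[winner]             # tag + winning payload

sample = os.urandom(1024 * 1024)                          # 1 MB with no patterns, as a hard case
grown_by = len(best_of(sample)) - len(sample)
print("grew by", grown_by, "bytes on patternless input")
--- ---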

jibz:
--- Quote ---
Let's look at why no algorithm you may devise will be able to compress any large amount of data into a short formula.

You want your decompression function to be a bijection, by which I mean it should be possible to decompress into any original data, and given compressed data there should be only one possible decompressed value.

The short formula your algorithm produces will be represented in the computer in a number of bytes. Since these bytes only give you a finite number of possible values, which is small compared to the vast amount of possible values of the original large data, the decompression function cannot possibly be bijective.
--- End quote ---
Ok, I think I only understand the general idea in the linked documents, not the detailed math on the bijection Wikipedia page. And I don't understand the proof on the other Wikipedia page - "Lossless data compression must always make some files longer" - but I gather from the tone of it that they firmly see it as impossible. I'll try parsing through it again later tonight though.

But still trying to resist, one objection I can think of is this: our algorithm need not be able to compress ANY large amount of data. Since I imagined taking the raw binary string as input, this would always be a long string of only 1s and 0s, say 1 GB in size. By then treating that as a numeral, wouldn't we get a significantly more limited range of inputs that the algorithm must be able to find a shorter representation for? (All numbers of length N containing only 1s and 0s, instead of all numbers of length N.)
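Roughly the conversion I have in mind, as a quick sketch (the short bit string here just stands in for a whole file's worth of bits):

--- Code: Python ---
# Read the raw bits as one big number, then turn that number back into the
# exact same bit string; a pure change of notation, nothing is lost either way.
bits = "100000011000000110000001"                   # stand-in for the whole file
as_number = int(bits, 2)                            # the string seen as one numeral
back = format(as_number, "0{}b".format(len(bits)))  # recover the original bits
print(as_number)
print(back == bits)                                 # True
--- ---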

And in contrast, the representation can be a complex function containing the numerals 0-9 and all the rich components from the "math toolbox". (That representation must then in turn be stored on the disk as a string of 1s and 0s, but if the representation at the higher level only takes up some lines of text (including special math characters), then that's only some KB or maybe MB in size on disk - still an extreme compression.)

Also, the compression utility containing the algorithm could be optimized to only work on input of a set length, like 1 MB or 1 GB. For example, it could as a preliminary step always store the initial input file in 1 MB chunks, like a multi-part zip archive. That limits the number of possible inputs some more, right?

Also, the idea above in response to what ethan wrote could be added: looping over multiple algorithms in a trial-and-error kind of way until one that gives good compression for the specific input is found. Since the size of the compression utility wouldn't be an issue, it could be several MB in size and contain VERY many algorithms (which could together be seen as one big algorithm, I guess).

Wouldn't these things (that I'm intuitively trying to pull out of the hat here) make a big difference? Or have they already been taken into account somehow in the negative answers given in the links above?

Finally, I notice I'm posting many new questions here. There's a risk that I'll suck you guys into a long discussion with more and more skeptical questions like these, that perhaps no one else but me will get something out of ;D  So I just want to say: feel free to opt out at any time. That said, I appreciate all answers I get, of course.
