
Author Topic: question on compression and storage: help me be less stupid  (Read 10051 times)

Nod5

question on compression and storage: help me be less stupid
« on: September 06, 2007, 02:20 PM »
Ok, this is one of those posts where you ask a question that you have a firm hunch is silly or even stupid and will have a simple answer that you should probably already know but don't. I guess most people can feel a hesitation to ask such questions since they then risk looking stupid. Well, the alternative is often worse: to remain stupid. So I'll take that risk... :-[ The question concerns how data compression and data storage works.

I think I understand the basic idea of compression, as described here http://en.wikipedia....iki/Data_compression :
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp."

And the basic idea of data storage, as described here http://en.wikipedia....ard_drive#Technology :
The magnetic surface of each platter is divided into many small sub-micrometre-sized magnetic regions, each of which is used to encode a single binary unit of information. In today's HDDs each of these magnetic regions is composed of a few hundred magnetic grains.

(So the hard disk is basically "a long string of ones and zeroes")

Ok, now my question:
Why can't compression of a file work by seeing the long string of "ones and zeroes" on the hard drive that constitutes the file (presupposing the file is not fragmented) as a very long number, and then finding some much, much shorter way to mathematically represent that same number? To later decompress the file, the mathematical shorthand is expanded to a long binary string again, which in turn gets written (in unfragmented form) to the hard drive. And so the file is back. I imagine that if that were possible, then a file many MB in size could be compressed to a mathematical expression a few lines of text long.

I'm asking why this is NOT possible because I have a strong hunch that if it was possible then it would already be done all the time. But I'd just like to get a general grasp of why it is not possible. So help me out here. :tellme:

mouser

Re: question on compression and storage: help me be less stupid
« Reply #1 on: September 06, 2007, 02:48 PM »
Data Compression is a truly beautiful subject, and there are tons and tons of fascinating algorithms employed.

Since you're interested in it, I encourage you to visit some of the links at the bottom of that wiki page and look for a good tutorial on it.  It's going to be hard for someone to post a reply as comprehensive as you'd probably need to learn what you want.

ps.
does everyone already know that jibz and I made a little (and little used) open source compression toolkit called octane?

Nod5

Re: question on compression and storage: help me be less stupid
« Reply #2 on: September 07, 2007, 01:56 AM »
Hi Mouser,
Yes, I understand this kind of question is hard to answer in forum posts like these, especially for someone like me who doesn't even know the basics. I've previously tried browsing through some online guides to compression aimed at laymen like me, but the ones I've seen either don't seem to touch on my question or are too complex for me to master. But maybe you or someone else can help me out with some keywords relevant to the "type" of compression that I'm curious about? Because even if it is not possible, I suspect there might still be a name for it, X, and texts explaining why X is not possible.

More specifically, most popular descriptions of how compression works that I find seem to focus on strings of characters, like this for example:
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs: for example, relatively simple graphic images such as icons, line drawings, and animations.

For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:

12WB12W3B24WB14W
Interpret this as twelve W's, one B, twelve W's, three B's, etc.
http://en.wikipedia..../Run-length_encoding
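As a minimal illustration of the quoted scheme, that exact scan line round-trips through a few lines of Python (an illustrative sketch, following the "12WB..." convention of omitting counts of 1):

Code: Python

# A minimal sketch of the run-length encoding quoted above: store each run
# as <count><symbol>, omitting counts of 1 to match the "12WB..." style.
from itertools import groupby

def rle_encode(s):
    out = []
    for ch, run in groupby(s):
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)

def rle_decode(s):
    out, count = [], ""
    for ch in s:
        if ch.isdigit():
            count += ch            # accumulate multi-digit counts like "24"
        else:
            out.append(ch * (int(count) if count else 1))
            count = ""
    return "".join(out)

line = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
print(rle_encode(line))                    # 12WB12W3B24WB14W
assert rle_decode(rle_encode(line)) == line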

That I grasp. But examples like the one above only focus on compression at the "higher level" of strings of characters ( http://computer.hows...ile-compression2.htm has similar examples). They don't say anything about applying compression to the "lower level" binary structure on the hard drive that (somehow) constitutes the characters.

But my question was about applying compression to that underlying binary structure. Let me try to flesh out the imagined compression a bit more: imagine that the long string WWWW ... WWW above is constituted by the binary states 10111101100100011 ... 1001011 on the hard drive. After the "higher level" run-length encoding compression above, the short string 12W ... 14W is constituted by the shorter string of binary states 1001 ... 11001 . My question was why compression couldn't work directly on that string, seeing it as a long number 1001 ... 11001 and finding a much shorter mathematical representation, M, for it. M is represented on the "higher level" in a few rows of characters, which in turn is saved as the compressed file, constituted by a much, much shorter string of binary states 101 ... 011 on the hard disk.
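The first half of that idea is in fact trivial: a file's bytes can be read as one huge integer and converted back losslessly. A minimal sketch, with stand-in bytes for the file; it is the "find a much shorter formula for that integer" step that turns out to be the hard part:

Code: Python

# The "see the whole file as one long number" step is easy; the shorter
# formula is the problem. Stand-in bytes used for illustration.
data = b"WWWWWWWWWWWWB" * 5               # pretend this is a file's raw bits

n = int.from_bytes(data, "big")           # the entire file as one integer
restored = n.to_bytes((n.bit_length() + 7) // 8, "big")
# note: a file starting with zero bytes would need its length stored too

assert restored == data
print(len(data), "bytes became a", len(str(n)), "digit number")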

hwtan

Re: question on compression and storage: help me be less stupid
« Reply #3 on: September 07, 2007, 03:51 AM »
If I understand you correctly, the answer is: it can.
See http://en.wikipedia..../wiki/Huffman_coding . If I'm not mistaken, ZIP contains something similar to Huffman coding.

You may also want to read up on http://www.data-comp...sion.com/theory.html about data entropy and compression.
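For illustration, here is a compact sketch of the Huffman coding linked above (an illustrative variant that merges code tables directly rather than building an explicit tree): frequent symbols get short bit codes, rare symbols long ones.

Code: Python

# A minimal Huffman coding sketch: repeatedly merge the two least frequent
# subtrees, prefixing their codes with 0 and 1.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, {symbol: code_so_far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = dict((s, "0" + c) for s, c in c1.items())
        merged.update((s, "1" + c) for s, c in c2.items())
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
print(len(text) * 8, "bits as plain 8-bit characters")
print(len(bits), "bits after Huffman coding")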



mouser

Re: question on compression and storage: help me be less stupid
« Reply #4 on: September 07, 2007, 04:04 AM »
Compression works when you can find patterns in the data that can be represented in a more compact form.  If you are compressing English text, you know that certain words and letters occur more often than others, so you choose a compression scheme that can take advantage of that.  If you are compressing images where each byte represents a pixel, and you expect to have long sequences of the same color, then you would focus on a compression algorithm that takes advantage of that fact.

There is nothing to stop you from compressing at the bit level, with things like strings of bits, but the question is whether it would give you higher compression than working at the higher levels, and the answer is almost never, because the chunks of redundant information (the patterns) are occurring at a higher level (like repeated letter pairs in English, or byte-patterns representing colors, etc.).  It's also the fact that working at the bit level is more time-consuming for the CPU than working with bytes or words.
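This is easy to see with Python's standard zlib module (an illustrative demonstration, not from the post): the same number of bytes compresses very differently depending on how much pattern there is to exploit.

Code: Python

# Redundant data shrinks, patternless data does not: a compressor can only
# exploit structure that is actually there.
import os
import zlib

patterned = b"the quick brown fox jumps over the lazy dog " * 200
patternless = os.urandom(len(patterned))     # random bytes: no structure

for name, data in (("patterned", patterned), ("random", patternless)):
    packed = zlib.compress(data, 9)
    print(name, len(data), "->", len(packed), "bytes")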

I urge you to delve more into compression; it really is a beautiful subject.

Nod5

Re: question on compression and storage: help me be less stupid
« Reply #5 on: September 07, 2007, 07:02 PM »
ethan, mouser: thank you both for the feedback.
I urge you to delve more into compression; it really is a beautiful subject.
I'm trying! ;D

If I understand you correctly, the answer is: it can.

There is nothing to stop you from compressing at the bit level, with things like strings of bits, but the question is whether it would give you higher compression than working at the higher levels, and the answer is almost never, because the chunks of redundant information (the patterns) are occurring at a higher level (like repeated letter pairs in English, or byte-patterns representing colors, etc.).  It's also the fact that working at the bit level is more time-consuming for the CPU than working with bytes or words.

Hm, I fail to see how either of the two articles ethan linked to, or what mouser says about chunks/patterns, exemplifies the kind of compression I was imagining. :( Either I missed something in the texts / your replies, or I was confused in my description of the kind of compression my question concerned. On the latter: I previously explicitly distinguished between "higher level"/"lower level", but I now think I really meant more. Maybe this is a clearer way to put it: the compression I was imagining would occur on a certain LEVEL and with a certain APPROACH: (i) at the lower level (i.e. ones and zeroes) AND (ii) working on the binary string as a whole.

Let me try to clarify what I mean by (ii): I (think I) see how Huffman coding differs from the run-length encoding example I quoted above. Huffman works on the lower level. But they still seem to me similar in approach: they find a "translation table" for chunks of the original string such that the table and the new string (after translation) together are shorter than the original string. That's different from the compression I was imagining, I think. Maybe I can put it like this: We want to compress a very large text file. We do Huffman compression. Whatever optimal output Huffman gives us, it will be constituted by a string of binary states on the hard drive: 011000 ... 0010101. The compression I'm imagining would at that point jump in, see the entire string as one "mathematical entity" and try to find a much shorter mathematical representation for it. But that compression would NOT look for parts of the string that are repeated (like "10000001" in "100000011000000110000001"). Instead, it should find a shorter mathematical way to represent the same, entire string. An example of what I'm imagining, but with numerals: the entire number "28421709430404007434844970703135" can have the shorter representation "(5 to the power 45)+10". Why can't the string of binary states, seen as one mathematical entity, be shortened like that? Would it be impossible for some reason? Or perhaps just not practical due to larger CPU and time requirements? Or is there some other reason? (Or is the question, still, confused?)  :tellme:
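(The arithmetic in that example does check out; Python's arbitrary-precision integers confirm it:)

Code: Python

# Verifying the example above: a 32-digit number collapses into an
# expression only a few characters long.
n = 5 ** 45 + 10
print(n)              # 28421709430404007434844970703135
print(len(str(n)), "digits, written as the 8-character formula 5**45+10")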
« Last Edit: September 07, 2007, 07:05 PM by Nod5 »

hwtan

Re: question on compression and storage: help me be less stupid
« Reply #6 on: September 10, 2007, 12:14 AM »
I guess the problem would be to find the function.
Below is a program that encodes a number using power tables.
For some data value, you will find that in order to represent it, you would need more storage than its original form.

Code: Python

class CompressionTable:
    values = {}

    def __init__(self):
        # Precompute b^1 .. b^8 for each base b in 2..8.
        for b in xrange(2, 9):
            CompressionTable.values[b] = [b ** p for p in xrange(1, 9)]

    def to_func(x):
        # Greedily subtract the power of b closest to the remaining value,
        # printing one "b^p -> remainder" term per step.
        v = x
        while v > 1:
            lb = 0
            lc = 0
            ld = v
            for b in xrange(2, 9):
                for c in xrange(8):
                    d = v - CompressionTable.values[b][c]
                    if abs(d) < abs(ld):
                        lb = b
                        lc = c
                        ld = d

            print "%d^%d -> %d" % (lb, lc + 1, ld)
            v = abs(ld)

        print "final value of v = %d" % v

    to_func = staticmethod(to_func)

t = CompressionTable()

# Read numbers until 0 is entered (Python 2 syntax throughout).
while True:
    v = long(raw_input())
    if v == 0:
        break
    t.to_func(v)

print "-End-"

Sample run (each input number followed by the program's output):
257
2^8 -> 1
final value of v = 1
258
2^8 -> 2
2^1 -> 0
final value of v = 0
9999
6^5 -> 2223
3^7 -> 36
6^2 -> 0
final value of v = 0
16384
4^7 -> 0
final value of v = 0
987654
7^7 -> 164111
7^6 -> 46462
6^6 -> -194
6^3 -> -22
5^2 -> -3
3^1 -> 0
final value of v = 0
9876543
7^8 -> 4111742
7^8 -> -1653059
6^8 -> -26557
8^5 -> -6211
3^8 -> -350
7^3 -> 7
7^1 -> 0
final value of v = 0
16777216
8^8 -> 0
final value of v = 0
16777219
8^8 -> 3
3^1 -> 0
final value of v = 0
1677719
6^8 -> -1897
3^7 -> -290
2^8 -> 34
2^5 -> 2
2^1 -> 0
final value of v = 0

Jibz

Re: question on compression and storage: help me be less stupid
« Reply #7 on: September 10, 2007, 05:27 AM »
Your idea is very good of course, and I think it is one that anybody who starts to look at compression has at some point :Thmbsup:.

Let's look at why no algorithm you may devise will be able to compress any large amount of data into a short formula.

You want your decompression function to be a bijection, by which I mean it should be possible to decompress into any original data, and given compressed data there should be only one possible decompressed value.

The short formula your algorithm produces will be represented in the computer in a number of bytes. Since these bytes only give you a finite number of possible values, which is small compared to the vast amount of possible values of the original large data, the decompression function cannot possibly be bijective.

Try to imagine making an algorithm that will compress any integer in the range 1-1000 down to an integer in the range 1-10, and which can decompress that value back to the original value.

So any general compression algorithm that is able to compress some samples of data will have to expand other samples.

There is a fairly good explanation of this on wikipedia also: Lossless data compression must always make some files longer.
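Jibz's 1-to-1000 thought experiment can be played out directly; whatever rule is chosen, the pigeonhole principle forces two inputs onto the same output (illustrative sketch, with an arbitrary stand-in rule):

Code: Python

# Map 1..1000 into 1..10 by any deterministic rule and collisions are
# unavoidable, and a collision means decompression cannot recover the
# original.
def compress(x):
    return x % 10 + 1          # stand-in rule; any rule hits the same wall

seen = {}
for original in range(1, 1001):
    c = compress(original)
    if c in seen:
        print(seen[c], "and", original, "both compress to", c)
        print("decompression cannot tell them apart")
        break
    seen[c] = original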

mouser

Re: question on compression and storage: help me be less stupid
« Reply #8 on: September 10, 2007, 05:40 AM »
Jibz!!!!!!!
Now I know the trick to getting you to post.. must talk about compression :)


Nod5

Re: question on compression and storage: help me be less stupid
« Reply #9 on: September 12, 2007, 01:27 PM »
ethan, jibz, thanks very much for the replies and examples and links!

ethan:
I guess the problem would be to find the function.
Below is a program that encodes a number using power tables.
For some data value, you will find that in order to represent it, you would need more storage than its original form.
Ok, now I see the complications more clearly. Still, let me try resisting some more: if the representation only becomes longer in SOME cases, then that means it would be shorter, maybe even drastically shorter, in other cases? If there were some chance of getting, for example, a 1 GB file compressed down to 5 MB or smaller, then wouldn't it in many cases be worth spending the time and CPU effort required to test whether this type of compression works?  :tellme:
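The instinct is sound, and real archive formats do act on it: ZIP, for example, falls back to storing a file uncompressed whenever compressing it would not shrink it. A minimal sketch of that fallback, using zlib as a stand-in compressor:

Code: Python

# Try to compress; keep the original (plus a one-byte marker) if compression
# does not pay off. The marker byte is exactly the small unavoidable
# overhead the "some files must get longer" theorem talks about.
import os
import zlib

def pack(data):
    squeezed = zlib.compress(data, 9)
    if len(squeezed) < len(data):
        return b"\x01" + squeezed      # 0x01 = compressed
    return b"\x00" + data              # 0x00 = stored as-is

def unpack(blob):
    body = blob[1:]
    return zlib.decompress(body) if blob[:1] == b"\x01" else body

for data in (b"aaaa" * 1000, os.urandom(4000)):
    blob = pack(data)
    assert unpack(blob) == data
    print(len(data), "->", len(blob), "bytes")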

Also, couldn't a compression tool be made to loop over a number of such programs/algorithms, large enough to guarantee that a drastically compressed output would be found through trial and error for any input file? Or would the problem then be that a completely impractical amount of time/CPU power is needed?

jibz:
Let's look at why no algorithm you may devise will be able to compress any large amount of data into a short formula.

You want your decompression function to be a bijection, by which I mean it should be possible to decompress into any original data, and given compressed data there should be only one possible decompressed value.

The short formula your algorithm produces will be represented in the computer in a number of bytes. Since these bytes only give you a finite number of possible values, which is small compared to the vast amount of possible values of the original large data, the decompression function cannot possibly be bijective.
Ok, I think I only understand the general idea in the linked documents, not the detailed math on the bijection wikipedia page. And I don't understand the proof on the other wikipedia page - "Lossless data compression must always make some files longer" - but I understand from the tone of it that they firmly see it as impossible. I'll try parsing through it again later tonight though.

But still trying to resist, one objection I think of is this: our algorithm need not be able to compress ANY large amount of data. Since I imagined taking the raw binary string as input, this would always be a long string of only 1s and 0s, say 1 GB in size. By then treating that as a numeral, wouldn't we get a significantly more limited range of inputs that the algorithm must be able to find a shorter representation for? (all numbers of length N containing only 1s and 0s, instead of all length-N numbers)

And in contrast, the representation can be a complex function containing the numerals 0-9 and all the rich components from the "math toolbox". (That representation must then in turn be constituted on the disk as a string of 1s and 0s, but if the representation on the higher level only takes some lines of text (including special math characters) then that's only some KB or maybe MB in size on the disk - still an extreme compression)

Also, the compression utility containing the algorithm could be optimized to only work on input of a set length, like 1 MB or 1 GB. For example, it could as a preliminary step always store the initial input file in 1 MB chunks, like a multi-part zip archive. That limits the number of possible inputs some more, right?

Also, the idea above in response to what ethan wrote could be added: looping over multiple algorithms in a trial-and-error kind of way until one that gives good compression for the specific input is found (see the sketch below). Since the size of the compression utility wouldn't be an issue, it could be several MB in size and contain VERY many algorithms (which could together be seen as one big algorithm, I guess).
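Looping over several algorithms is also a real technique; the catch is that the output must record which algorithm won, and on patternless input every one of them comes back larger. A sketch using three standard-library codecs:

Code: Python

# Try several codecs, keep the smallest result, and spend one tag byte
# remembering which codec to use for decompression. More algorithms never
# beat the counting argument: random input defeats all of them at once.
import bz2
import lzma
import zlib

COMPRESSORS = (zlib.compress, bz2.compress, lzma.compress)
DECOMPRESSORS = (zlib.decompress, bz2.decompress, lzma.decompress)

def best_compress(data):
    results = [c(data) for c in COMPRESSORS]
    tag = min(range(len(results)), key=lambda i: len(results[i]))
    return bytes([tag]) + results[tag]

def best_decompress(blob):
    return DECOMPRESSORS[blob[0]](blob[1:])

data = b"the same phrase, over and over. " * 100
blob = best_compress(data)
assert best_decompress(blob) == data
print(len(data), "->", len(blob), "bytes, via codec", blob[0])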

Wouldn't these things (that I'm intuitively trying to pull out of the hat here) make a big difference? Or have they already been taken into account somehow in the negative answers given in the links above?

Finally, I notice I'm posting many new questions here. There's a risk that I'll suck you guys into a long discussion with more and more skeptical questions like these that perhaps no one but me will get anything out of ;D  So I just want to say: feel free to opt out at any time. That said, I appreciate all the answers I get, of course.
« Last Edit: September 12, 2007, 01:32 PM by Nod5 »