question on compression and storage: help me be less stupid


Nod5:
Ok, this is one of those posts where you ask a question that you have a firm hunch is silly or even stupid, one with a simple answer that you should probably already know but don't. I guess most people hesitate to ask such questions, since they risk looking stupid. Well, the alternative is often worse: to remain stupid. So I'll take that risk... :-[ The question concerns how data compression and data storage work.

I think I understand the basic idea of compression, as described here http://en.wikipedia.org/wiki/Data_compression :
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp."
--- End quote ---
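
To make that concrete for myself, here's a tiny sketch of the quoted idea in Python (the dictionary and the strings are just made up for illustration):

--- Code: ---
# Toy dictionary coder in the spirit of the quoted example: replace an
# agreed-upon frequent word with a shorter token. Purely illustrative;
# a real scheme must make sure tokens can't collide with ordinary text.
DICTIONARY = {"compression": "comp."}

def encode(text):
    for word, token in DICTIONARY.items():
        text = text.replace(word, token)
    return text

def decode(text):
    for word, token in DICTIONARY.items():
        text = text.replace(token, word)
    return text

original = "data compression encodes information in fewer units"
packed = encode(original)
assert decode(packed) == original
print(len(original), "->", len(packed))  # the packed form is shorter
--- End code ---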

And the basic idea of data storage, as described here http://en.wikipedia.org/wiki/Hard_drive#Technology :
The magnetic surface of each platter is divided into many small sub-micrometre-sized magnetic regions, each of which is used to encode a single binary unit of information. In today's HDDs each of these magnetic regions is composed of a few hundred magnetic grains.
--- End quote ---

(So the hard disk is basically "a long string of ones and zeroes")

Ok, now my question:
Why can't compression of a file work by treating the long string of "ones and zeroes" on the hard drive that constitutes the file (presupposing the file is not fragmented) as one very long number, and then finding some much, much shorter way to mathematically represent that same number? To decompress the file later, the mathematical shorthand would be expanded into the long binary string again, and that in turn would get written (in unfragmented form) to the hard drive. And so the file would be back. I imagine that if that were possible, then a file many MB in size could be compressed to a mathematical expression a few lines of text long.
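
Here's roughly what I picture, sketched in Python ("somefile.bin" is just a placeholder path):

--- Code: ---
# The whole file read back as one enormous number, as described above.
data = open("somefile.bin", "rb").read()
n = int.from_bytes(data, byteorder="big")
print(n.bit_length(), "bits in the number")

# And back again: passing the original length restores any leading
# zero bytes that the integer alone wouldn't remember.
restored = n.to_bytes(len(data), byteorder="big")
assert restored == data
--- End code ---

The missing piece, of course, is the "much shorter mathematical representation" step in the middle.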

I'm asking why this is NOT possible because I have a strong hunch that if it were possible, it would already be done all the time. But I'd just like to get a general grasp of why it is not possible. So help me out here. :tellme:

mouser:
Data Compression is a truly beautiful subject, and there are tons and tons of fascinating algorithms employed.

Since you're interested in it, I encourage you to visit some of the links at the bottom of that wiki page and surf for a good tutorial on it. It's going to be hard for anyone to post a reply comprehensive enough to teach you what you want to know.

ps.
does everyone already know that jibz and I made a little (and little used) open source compression toolkit called octane?

Nod5:
Hi Mouser,
Yes, I understand this kind of question is hard to answer in forum posts like these, especially for someone like me who doesn't even know the basics. I've previously tried browsing through some online guides to compression aimed at laymen like me, but the ones I've seen either don't touch on my question or are too complex for me to follow. But maybe you or someone else can help me out with some keywords for the "type" of compression I'm curious about? Even if it is not possible, I suspect there might still be a name for it, X, and texts explaining why X is not possible.

More specifically, most popular descriptions of how compression works that I find seem to focus on strings of characters, like this one for example:
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs: for example, relatively simple graphic images such as icons, line drawings, and animations.

For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:

12WB12W3B24WB14W
Interpret this as twelve W's, one B, twelve W's, three B's, etc.
--- End quote ---
http://en.wikipedia.org/wiki/Run-length_encoding
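
That scheme is simple enough to sketch in a few lines of Python (a toy version of the quoted convention; note it would break on input that itself contains digits):

--- Code: ---
from itertools import groupby

def rle_encode(s):
    # Emit "<count><char>" for each run, dropping the count when it
    # is 1, matching the quoted "12WB12W3B24WB14W" convention.
    out = []
    for ch, run in groupby(s):
        n = len(list(run))
        out.append(f"{n}{ch}" if n > 1 else ch)
    return "".join(out)

def rle_decode(s):
    out, count = [], ""
    for ch in s:
        if ch.isdigit():
            count += ch          # accumulate multi-digit counts
        else:
            out.append(ch * int(count or "1"))
            count = ""
    return "".join(out)

line = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
packed = rle_encode(line)
print(packed)                      # 12WB12W3B24WB14W
assert rle_decode(packed) == line
--- End code ---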

That I grasp. But examples like the one above only deal with compression at the "higher level" of strings of characters ( http://computer.howstuffworks.com/file-compression2.htm has similar examples). They don't say anything about applying compression to the "lower level" binary structure on the hard drive that (somehow) constitutes the characters.

But my question was about applying compression to that underlying binary structure. Let me try to flesh out the imagined compression a bit more: imagine that the long string WWWW ... WWW above is constituted by the binary states 10111101100100011 ... 1001011 on the hard drive. After the "higher level" run-length encoding above, the short string 12W ... 14W is constituted by the shorter string of binary states 1001 ... 11001 . My question was why compression couldn't work directly on that string: see it as one long number 1001 ... 11001 and find a much shorter mathematical representation, M, for it. M is represented on the "higher level" as a few rows of characters, which in turn are saved as the compressed file, constituted by a much, much shorter string of binary states 101 ... 011 on the hard disk.
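
To show that the two levels really are the same data, here's a quick sketch of the bits behind the RLE output string, assuming a plain ASCII encoding (the bit patterns I wrote above were just made up):

--- Code: ---
# The "higher level" string and the "lower level" bits are one and the
# same data, just viewed differently.
packed = "12WB12W3B24WB14W"
bits = "".join(f"{b:08b}" for b in packed.encode("ascii"))
print(bits[:32], "...")   # the binary states that actually hit the disk
print(len(bits), "bits for", len(packed), "characters")
--- End code ---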

hwtan:
If I understand you correctly, the answer is: it can.
See http://en.wikipedia.org/wiki/Huffman_coding . If I'm not mistaken, ZIP uses something similar to Huffman coding.
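
A minimal sketch of the idea in Python (a textbook-style construction, not how ZIP actually implements it):

--- Code: ---
import heapq
from collections import Counter

def huffman_codes(text):
    # Min-heap of [frequency, tiebreaker, {symbol: code}] entries;
    # merging the two rarest entries lengthens the codes of their
    # symbols, so frequent symbols end up with the shortest codes.
    heap = [[freq, i, {ch: ""}]
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in lo[2].items()}
        merged.update({ch: "1" + code for ch, code in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

codes = huffman_codes("this is an example of huffman coding")
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)   # common characters get the short codes
--- End code ---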

You may also want to read up on data entropy and compression at http://www.data-compression.com/theory.html .
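
Entropy is the key number there: it bounds, from below, how many bits per symbol any lossless code can average. A quick sketch of the standard Shannon formula (the sample inputs are made up):

--- Code: ---
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    # Shannon entropy: a lower bound on the average number of bits any
    # lossless code needs per symbol, given these symbol frequencies.
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(entropy_bits_per_symbol(b"AAAAAAAB"))        # skewed: well under 8
print(entropy_bits_per_symbol(bytes(range(256))))  # uniform bytes: exactly 8.0
--- End code ---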


mouser:
Compression works when you can find patterns in the data that can be represented in a more compact form. If you are compressing English text, you know that certain words and letters occur more often than others, so you choose a compression scheme that can take advantage of that. If you are compressing images where each byte represents a pixel, and you expect long sequences of the same color, then you would pick a compression algorithm that takes advantage of that fact.

There is nothing to stop you from compressing at the bit level, with things like strings of bits, but the question is whether it would give you higher compression than working at the higher levels, and the answer is almost never, because the chunks of redundant information (the patterns) occur at a higher level (like repeated letter pairs in English, or byte patterns representing colors, etc.). It's also the case that working at the bit level is more time-consuming for the CPU than working with bytes or words.
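
You can see the pattern-dependence directly with any off-the-shelf compressor. Here's a quick sketch using Python's standard zlib module (the sample data is made up): highly regular input shrinks enormously, while random input, which has no patterns to exploit, doesn't shrink at all. There's also a simple counting reason why no scheme can shrink every file down to a few lines: there are vastly more long bit strings than short descriptions, so most long strings have no shorter representation.

--- Code: ---
import os
import zlib

patterned = b"WB" * 5000          # highly regular data, 10000 bytes
random_ish = os.urandom(10000)    # effectively patternless data

for name, blob in [("patterned", patterned), ("random", random_ish)]:
    packed = zlib.compress(blob, 9)   # level 9 = maximum compression
    print(f"{name}: {len(blob)} -> {len(packed)} bytes")
--- End code ---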

I urge you to delve more into compression, it really is a beautiful subject.
