ethan, jibz, thanks very much for the replies and examples and links!

ethan:

I guess the problem would be to find the function.

Below is a program that encodes a number using power tables.

For some data value, you will find that in order to represent it, you would need more storage than its original form.

Ok, now I see the complications more clearly. Still, let me try resisting some more: if the representation only becomes longer in SOME cases, doesn't that mean it would be shorter, maybe even drastically shorter, in other cases? If there were some chance of getting, for example, a 1 GB file compressed down to 5 MB or less, then wouldn't it in many cases be worth the time and CPU effort required to test whether this type of compression works?

Also, couldn't a compression tool be made to loop over a number of such programs/algorithms, large enough to guarantee that a drastically compressed output would be found through trial and error for any input file? Or would the problem then be that a completely impractical amount of time/CPU power is needed?

jibz:

Let's look at why no algorithm you may devise will be able to compress any large amount of data into a short formula.

You want your decompression function to be a bijection, by which I mean it should be possible to decompress into any original data, and given compressed data there should be only one possible decompressed value.

The short formula your algorithm produces will be represented in the computer in a number of bytes. Since these bytes only give you a finite number of possible values, which is small compared to the vast amount of possible values of the original large data, the decompression function cannot possibly be bijective.
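jibz's counting point can be checked directly with small numbers. Here's a sketch in Python (the sizes are made up for illustration; the real gap for a 1 GB file vs. a few-KB formula is unimaginably larger):

```python
# Counting argument: a "formula" of at most k bytes can name at most
# 256^0 + 256^1 + ... + 256^k distinct values, while there are 256^n
# files of exactly n bytes. If n is much bigger than k, most files
# simply have no short formula that decompresses to them.

def formulas_of_at_most(k_bytes):
    # number of distinct byte strings of length 0 through k_bytes
    return sum(256 ** i for i in range(k_bytes + 1))

def files_of_exactly(n_bytes):
    return 256 ** n_bytes

k, n = 8, 16  # toy sizes: 8-byte "formulas" vs 16-byte files
print(formulas_of_at_most(k))  # roughly 1.8e19
print(files_of_exactly(n))     # roughly 3.4e38
# fraction of 16-byte files that could possibly have an 8-byte formula:
print(formulas_of_at_most(k) / files_of_exactly(n))
```

Even at these toy sizes, only about one file in 10^19 can have a formula half its length; for 1 GB files and KB-sized formulas the fraction is effectively zero.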

Ok, I think I only understand the general idea in the linked documents, not the detailed math on the bijection Wikipedia page. And I don't understand the proof on the other Wikipedia page ("Lossless data compression must always make some files longer"), but I can tell from the tone of it that they firmly see it as impossible. I'll try working through it again later tonight, though.

But still trying to resist, one objection I can think of is this: our algorithm need not be able to compress ANY large amount of data. Since I imagined taking the raw binary string as input, this would always be a long string of only 1s and 0s, say 1 GB in size. By then treating that as a numeral, wouldn't we get a significantly more limited range of inputs that the algorithm must find a shorter representation for? (All numbers of length N containing only 1s and 0s, instead of all numbers of length N.)
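One way to sanity-check this objection is to count the strings in each case. A quick sketch (the toy length is arbitrary):

```python
# Strings of length N where each position holds only 0/1, versus
# any digit 0-9. The binary count is indeed smaller for a fixed N,
# but a 1 GB file has N = 8 * 2**30 binary digits, so 2**N is still
# an astronomically large number of possible inputs.

N = 10  # toy length
binary_strings = 2 ** N    # strings over {0, 1}
decimal_strings = 10 ** N  # strings over {0-9}
print(binary_strings, decimal_strings)  # 1024 vs 10000000000
```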

And in contrast, the representation can be a complex function containing the numerals 0-9 and all the rich components from the "math toolbox". (That representation must in turn be stored on the disk as a string of 1s and 0s, but if the higher-level representation only takes a few lines of text (including special math characters), then that's only a few KB, or maybe MB, on disk - still an extreme compression.)

Also, the compression utility containing the algorithm could be optimized to work only on input of a set length, like 1 MB or 1 GB. For example, it could, as a preliminary step, always store the initial input file in 1 MB chunks, like a multi-part zip archive. That limits the number of possible inputs some more, right?

Also, the idea above in response to what ethan wrote could be added: looping over multiple algorithms in a trial-and-error fashion until one that gives good compression for the specific input is found. Since the size of the compression utility wouldn't be an issue, it could be several MB in size and contain VERY many algorithms (which together could be seen as one big algorithm, I guess).
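That trial-and-error loop can be sketched with the compressors in Python's standard library, where zlib, bz2 and lzma stand in for the "very many algorithms" (the one-byte tag format is made up for this sketch):

```python
import bz2
import lzma
import zlib

# Try several algorithms and keep whichever output is smallest,
# prefixing one tag byte so decompression knows which was used.
ALGORITHMS = [
    (b"z", zlib.compress, zlib.decompress),
    (b"b", bz2.compress, bz2.decompress),
    (b"x", lzma.compress, lzma.decompress),
]

def compress_best(data):
    candidates = [tag + comp(data) for tag, comp, _ in ALGORITHMS]
    return min(candidates, key=len)

def decompress(blob):
    for tag, _, decomp in ALGORITHMS:
        if blob[:1] == tag:
            return decomp(blob[1:])
    raise ValueError("unknown algorithm tag")

data = b"abc" * 1000
packed = compress_best(data)
assert decompress(packed) == data
print(len(data), len(packed))
```

Note the tag byte: recording which of the many algorithms was chosen always costs a little extra space itself.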

Wouldn't these things (that I'm intuitively trying to pull out of the hat here) make a big difference? Or have they already been taken into account somehow in the negative answers given in the links above?

Finally, I notice I'm posting many new questions here. There's a risk that I'll suck you guys into a long discussion with more and more skeptical questions like these, which perhaps no one but me will get anything out of.

So I just want to say: feel free to opt out at any time. That said, I appreciate all answers I get of course.