The research paper is titled "Language Modeling Is Compression," and if I understand the Ars article correctly it explores the idea that "the ability to compress data effectively is akin to a form of general intelligence... So theoretically, if a machine can compress this data extremely well, it might indicate a form of general intelligence—or at least a step in that direction."
A much more interesting question than "will we have smaller FLAC files next year."
I'm sorry, but I think that's just Ars Technica dreaming. Current LLMs are trained pattern generators. Very advanced pattern generators admittedly, but there's nothing "intelligent" in the fact that they can replicate patterns in audio or image data despite being trained on text. It's just somewhat different patterns. LLMs have no understanding of what they are doing; they lack the ability to reflect or make logical deductions, and they don't recognize their own mistakes.
Concerning the article: It's not surprising at all that an LLM over a hundred GB in size can compress data better than a PNG implementation, which is typically below 1 MB. I know I may sound disillusioned saying this, but in the end, it's a trivial "cheat": they simply took some of the files' entropy and packed it into the compression algorithm, a.k.a. the LLM. If you send a file turbo-compressed by the LLM to anyone else, they still need to acquire that LLM to decompress anything. Typical compression algorithms have implementations that are a couple of MB in size, which is usually negligible compared to the amount of data being processed. The opposite is true for the LLM approach presented here: it "cheats" by requiring you to effectively pre-load fractions of billions of files' entropy before even starting to work. That's why it compresses well.
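To make that point concrete, here's a back-of-envelope sketch (all the byte counts are made-up illustrative numbers, not figures from the paper): a fair comparison has to count the decompressor's size too, amortized over however many files it will ever decode.

```python
# Illustrative accounting only -- the sizes below are hypothetical,
# not measurements from the paper or the article.

def effective_bytes_per_file(compressed_bytes, decompressor_bytes, n_files):
    """Amortize the decompressor's own size over the files it decodes."""
    return compressed_bytes + decompressor_bytes / n_files

# A conventional codec: small implementation, amortization barely matters.
png_like = effective_bytes_per_file(
    compressed_bytes=500_000,        # hypothetical 500 kB output
    decompressor_bytes=1_000_000,    # ~1 MB codec implementation
    n_files=1,
)

# An LLM-based scheme: smaller output, but a ~100 GB "decompressor".
llm_based = effective_bytes_per_file(
    compressed_bytes=300_000,            # hypothetically smaller output
    decompressor_bytes=100_000_000_000,  # ~100 GB of model weights
    n_files=1,
)

print(png_like < llm_based)  # True: for one file, the LLM scheme loses badly
```

Only once you amortize the model over an enormous number of files does the smaller per-file output start to pay off, which is exactly the "pre-loaded entropy" trade described above.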
Compression is always a trade-off between algorithm complexity, runtime computational effort, and file size. According to this article, LLMs don't change that at all. They trade insane algorithm complexity and significant runtime computational effort for somewhat smaller files.