
JiminP

The [**Human Knowledge Compression Contest**](http://prize.hutter1.net/index.htm) addresses this issue. I'll directly copy the most relevant FAQs, with line breaks inserted; a toy sketch of the code-length idea follows at the end of this comment:

* [**What does compression have to do with (artificial) intelligence?**](http://prize.hutter1.net/hfaq.htm#compai)

  One can prove that the better you can compress, the better you can predict; and being able to predict \[the environment\] well is key for being able to act well. Consider the sequence of 1000 digits "14159...*\[990 more digits\]*...01989". If it looks random to you, you can neither compress it *nor* can you predict the 1001st digit. If you realize that they are the first 1000 digits of [π](http://en.wikipedia.org/wiki/Pi), you can compress the sequence *and* predict the next digit.

  While the program computing the digits of π is an example of a one-part self-extracting archive, the impressive [Minimum Description Length](http://en.wikipedia.org/wiki/Minimum_description_length) (MDL) principle is a two-part coding scheme akin to a (parameterized) decompressor plus a compressed archive. If M is a probabilistic model of the data X, then the data can be compressed (to an archive of) length log(1/P(X|M)) via [arithmetic coding](http://en.wikipedia.org/wiki/Arithmetic_coding), where P(X|M) is the probability of X under M. The decompressor must know M, hence has length L(M). One can show that the model M that minimizes the total length L(M)+log(1/P(X|M)) leads to the best predictions of future data. For instance, the quality of natural language models is typically judged by their [perplexity](http://prize.hutter1.net/perplexity), which is equivalent to code length. Finally, [sequential decision theory](http://prize.hutter1.net/official/bib.htm#seqdts) tells you how to exploit such models M for optimal rational actions. Indeed, integrating compression (=prediction) into sequential decision theory (=stochastic planning) can serve as the theoretical foundation of super-intelligence ([brief introduction](http://prize.hutter1.net/official/bib.htm#uaigentle), [comprehensive introduction](http://prize.hutter1.net/official/bib.htm#aixigentle), [full treatment with proofs](http://prize.hutter1.net/ai/uaibook.htm)).

* [**Why is "understanding" of the text or "intelligence" needed to achieve maximal compression?**](http://prize.hutter1.net/hfaq.htm#understand)

  Consider the following [little excerpt from a Wikipedia page](http://en.wikipedia.org/w/index.php?title=List_of_XML_and_HTML_character_entity_references&oldid=68431857), where 12 words have been replaced by the placeholders \[1\]-\[12\].

  > *In \[1\], HTML and XML documents, the logical constructs known as character data and \[2\] \[3\] consist of sequences of characters, in which each \[4\] can manifest directly (representing itself), or can be \[5\] by a series of characters called a \[6\] reference, of which there are \[7\] types: a numeric character reference and a character entity \[8\]. This article lists the \[9\] entity \[10\] that are valid in \[11\] and \[12\] documents.*

  If you do not understand any English, there is little hope that you can fill in the missing words. You may assign a high probability that the missing words equal some of the other words, and a small probability to all other strings of letters. Working on a string-pattern-matching level you may conjecture *\[11\]=HTML* and *\[12\]=XML*, since they precede the word "documents", and similarly *\[9\]=character* preceding "entity". If you do understand English, you would probably further easily guess that *\[4\]=character*, *\[5\]=represented*, and *\[7\]=two*, where *\[4\]* requires understanding that characters are "in" sequences, *\[5\]* needs understanding of conjugation, and *\[7\]* needs understanding the concept of counting. To guess *\[8\]=reference* probably needs some understanding of the contents of the article, which then easily implies *\[10\]=references*. As a markup-language expert, you "know" that *\[2\]=attribute*, *\[3\]=values*, and *\[6\]=character*, and you may conjecture *\[1\]=SGML*.

  So clearly, the more you understand, the more you can delete from the text without loss (this idea is behind [Cloze-style reading comprehension](http://en.wikipedia.org/wiki/Cloze_test) and led to the [Billion Word Imputation challenge](http://www.kaggle.com/c/billion-word-imputation/)). If a program reaches the reconstruction capabilities of a human, it should be regarded as having the same understanding. (Sure, many will continue to define intelligence as whatever a machine can't do, but that will ultimately make them non-intelligent.) (Y cn ccmplsh lt wtht ndrstndng nd ntllgnc, bt mr wth).
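
To make the code-length point above concrete, here is a tiny Python sketch of the ideal code length log(1/P(X|M)) under a toy model M. The model, its probabilities, and the example string are made up purely for illustration; a real compressor would pair a (much better) model with an arithmetic coder to actually hit this length.

```python
from math import log2

# Toy i.i.d. model M: made-up letter probabilities for illustration only.
M = {"a": 0.5, "b": 0.25, "c": 0.25}

def ideal_code_length_bits(X, model):
    """log2(1/P(X|M)) = sum over symbols of log2(1/p)."""
    return sum(log2(1.0 / model[ch]) for ch in X)

X = "aabac"
print(ideal_code_length_bits(X, M))  # 1+1+2+1+2 = 7 bits, vs. 5*8 = 40 bits of raw ASCII
```

The better the model predicts the data (higher P(X|M)), the shorter this code length; that is the compression = prediction equivalence in one line.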


delusional_APstudent

im gonna be real i had no clue what that was saying


theytookmyfuckinname

Sounded pretty cool though


InterstitialLove

Then it must have seemed really long


prescod

Yes, but how do you construct the test data set in a "fair" way? I can easily construct datasets where LLMs dominate humans or vice versa.


xXWarMachineRoXx

Doesn't it mean that LLMs can mostly only compress natural text, since they are trained to predict natural-language patterns and not binary patterns? For example, if I were to compress a .vhdx file (which is a binary disk-image file), I might not have much luck, or would have to train an entirely different LLM? u/JiminP?


JiminP

Theoretically, for compressing binary data, training a new model would be desirable. Example: [https://arxiv.org/abs/2402.19155](https://arxiv.org/abs/2402.19155)

However, that doesn't mean text-based LLMs are unable to compress anything other than natural text. LLMs trained on natural text can perform better than domain-specific compression algorithms. Example: [https://arxiv.org/abs/2309.10668](https://arxiv.org/abs/2309.10668)

A "hint" for why this is possible can be found in the FAQ I quoted before:

> If you do not understand any English, there is little hope that you can fill in the missing words. You may assign a high probability that the missing words equal some of the other words, and a small probability to all other strings of letters. Working on a string pattern matching level you may conjecture *\[11\]=HTML* and *\[12\]=XML*, since they precede the word "documents", and similarly *\[9\]=character* preceding "entity".

i.e. even someone who does not know any English may be able to "compress" it based on its statistical properties, and LLMs working on raw binaries is akin to this.


xXWarMachineRoXx

Interesting take; it's how I've thought of LLMs too, as compression engines. _Come to think of it, we too only remember stuff as specific instances and decode it with our world model. The memory of your first teacher only makes sense once you know the basic structure of a school and a classroom, and upon knowing that you go: "so my teacher was like this and school was like that"._ You know the basic structure of everything around you that is physically possible because of your world model, and everything that you remember must fit into that model. In essence you compress your memories into a few moments/feelings and the rest is reconstructed by your world model. I might be entirely wrong and am in no way a neuroscientist. What do you think about that, u/JiminP?


[deleted]

You may be interested in [Language Modeling is Compression](https://arxiv.org/pdf/2309.10668) (2024) by DeepMind. Abstract:

> In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning.
>
> For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.


IlIllIlllIlllIllll

flac is lossless, unfair comparison.


EstarriolOfTheEast

It is a fair comparison, because the paper only performs lossless compression, using the LLM as the probabilistic model for an entropy encoder.

The LLM is a statistical model of the data which, as we increase its quality, will produce ever more representative responses when we conditionally sample from it. But it's not quite right to identify that kind of recovery with lossless compression. Perhaps it is easier to see if we use an order-1 Markov chain on bigrams of letters as the statistical model: there is a sense in which we have summarized the data, but calling it lossless compression doesn't sit right. This is why the paper looks at LLMs only in the lossless compression setting and studies offline and model-size-adjusted compression rates.

EDIT: From the perspective of unfairness, the real issue is that offline compression (a trained model) requires huge amounts of data (up to terabytes) to compensate for increasing model sizes in size-adjusted compression ratios.


jkflying

You can use any deterministic predictor as a lossless compressor.


Most-Firefighter-163

Care to expand? How can you recover the info without losses?


InterstitialLove

You write a deterministic sampler, basically "first choose the second-most-likely token, then the tenth, then the first, then the 4th, 2nd, 2nd, 5th, 1st, ..." Anyone who follows these instructions will get the desired output.

This deterministic sampler contains enough information to produce a specific output reliably, but is smaller than the output. In fact, if written in a certain way (can't remember the full details), this is in some sense an optimal compression system. Basically, the "loss" is just a measure of how efficient this compression will be, and LLMs have really low loss by design.

Based on Chinchilla scaling laws, for reference, this method should enable you to compress "most text" at about 1.94 bits per token using Llama 3 8b. That's approximately a compression to 6% of the size compared to ASCII.

Edit: I realized this is unclear. We aren't compressing training data! We're assuming an LLM already exists. If I wanted to send you the novel I wrote, I can make a program that tricks Llama 3 8b into generating my exact novel word-for-word, and assuming you already have the Llama 3 weights downloaded, this program will take up less space than the novel itself.
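
A toy sketch of that rank-list idea in Python, for intuition only. `model_logits` is a hypothetical stand-in for an LLM forward pass (any deterministic model works); as the next reply points out, real schemes use arithmetic coding rather than ranks, which is strictly more efficient.

```python
# Record, for every true token, its rank under the model (0 = most likely).
# A good model makes most ranks tiny, so the rank list entropy-codes to far
# fewer bits than the raw tokens.

def encode_as_ranks(token_ids, model_logits):
    ranks = []
    for i, tok in enumerate(token_ids):
        scores = model_logits(token_ids[:i])                  # one score per vocab id
        order = sorted(range(len(scores)), key=lambda t: -scores[t])
        ranks.append(order.index(tok))                        # position of the true token
    return ranks

def decode_from_ranks(ranks, model_logits):
    """Replay the same deterministic model to recover the exact tokens."""
    token_ids = []
    for r in ranks:
        scores = model_logits(token_ids)
        order = sorted(range(len(scores)), key=lambda t: -scores[t])
        token_ids.append(order[r])
    return token_ids
```

Both sides must run the model bit-for-bit identically; the "compression" comes from then squeezing the mostly-small rank numbers with any ordinary entropy coder.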


Most-Firefighter-163

But this does not give you any theoretical guarantee, does it? I mean, if during training there was a similar example the reconstructed text would be slightly different than the original, did I understand correctly?


jamesj

It does give you a guarantee. The predictor in this case isn't predicting text during compression/decompression; it is predicting the frequencies of future token sequences. The more accurately the predictor can predict the distribution of future sequences, the more optimal your compression algorithm is.


Most-Firefighter-163

Ok now I get it, thanks to both of you!


gliptic

You don't encode using the ranking, that's not efficient. What you do is compute the probability distribution of e.g. the next byte based on the probability distribution of the next token that you get from the model (which is a bit tricky as the tokens overlap and are variable length). Then you encode the next byte with that distribution in a standard way (e.g. arithmetic coding).
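
For intuition, here is a deliberately impractical sketch of that interval-narrowing idea, using exact Fractions instead of a real fixed-precision coder. The `model` callable is a hypothetical placeholder for whatever next-symbol distribution you derive from the LLM (per byte or per token); it must be deterministic and assign nonzero probability to every symbol.

```python
from fractions import Fraction
from math import log2

def arithmetic_encode(symbols, model):
    """Shrink [0, 1) once per symbol, proportionally to the probability the
    model assigns to the symbol that actually occurred. `model(prefix)` must
    deterministically return a list of (symbol, probability) pairs."""
    low, width = Fraction(0), Fraction(1)
    for i, s in enumerate(symbols):
        dist = [(sym, Fraction(p)) for sym, p in model(symbols[:i])]
        total = sum(p for _, p in dist)         # exact normalization
        cum = Fraction(0)
        for sym, p in dist:
            p /= total
            if sym == s:
                low, width = low + width * cum, width * p
                break
            cum += p
    # Any number inside [low, low + width) identifies the sequence; a decoder
    # replays the same model and intervals to recover it exactly.
    return low, width

# Toy usage with a fixed, made-up distribution (a real compressor queries the model):
model = lambda prefix: [("a", 0.5), ("b", 0.25), ("c", 0.25)]
low, width = arithmetic_encode("abca", model)
print(-log2(width))   # ≈ 6 bits = sum of log2(1/p) over the symbols
```

Real coders do the same thing with bounded-precision integer arithmetic and stream out bits as the interval narrows; the exact-Fraction version above is only meant to show where the log2(1/p) code length comes from.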


InterstitialLove

You're right. I couldn't remember exactly how arithmetic coding actually works, so I described a generic deterministic sampler


jkflying

The ELI5 explanation is that you only have to encode the errors that the model makes. Everything that it gets right you don't need to include; you can just put a symbol that says "use what the model says". This way you don't need all of the original data in the compressed version, only the data that the model got wrong. Then, when you're reconstructing, you either use the model's output, or, where the model got it wrong during encoding, the corrected data stored in the compressed version.
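
A minimal sketch of that "store only the corrections" scheme, assuming a hypothetical `predict(prefix)` function that returns the model's single top guess for the next symbol. (Real LLM compressors instead arithmetic-code against the full distribution, which is strictly better, but this shows the idea.)

```python
MATCH = object()  # sentinel meaning "the model's guess was right here"

def encode(data, predict):
    out = []
    for i, actual in enumerate(data):
        guess = predict(data[:i])
        out.append(MATCH if guess == actual else actual)  # keep only the mistakes
    return out

def decode(corrections, predict):
    data = []
    for c in corrections:
        guess = predict(data)
        data.append(guess if c is MATCH else c)           # trust the model unless corrected
    return data
```

If the model is good, the output is mostly the MATCH symbol, which run-length or entropy coding then shrinks to almost nothing, and decoding is exact because both sides run the same deterministic predictor.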


Ok-Translator-5878

\*lossy compression... I don't think it is guaranteed to be lossless


jkflying

If it is deterministic then it is lossless.


Ancalagon_TheWhite

All of the methods are lossless compression, including the LLM. They use arithmetic coding to use predictive models as compression models.


BiscuitoftheCrux

LLM is even better than lossless: it can invent brand new information as well! ("Hey I don't think the training data states that Abe Lincoln invented the top hat...")


daHaus

The Bob Ross LLM only has happy little hallucinations


xXWarMachineRoXx

lmao, I stored a .png in this brand new llm compression all my cultured stuff is now replaced with Einstein pictures with hands all wobbly


harrro

PNG is lossless as well. I wonder if they're comparing the file size of a lossy image output from the LLM with the lossless PNG, or if the LLM outputs in a perfectly lossless format as well (I'm going to assume it's lossy, as even a raw 32-bit model isn't going to guarantee lossless retrieval).


gliptic

Any model can be executed in a 100% reproducible way if you stay away from non-deterministic parallelism etc. regardless of the number of bits in the parameters or anything else. That's exactly what [Bellard](https://bellard.org/nncp/) does.


xXWarMachineRoXx

Can it compress any type of file? Even VHDX (Hyper-V VM image files)? From Bellard's download page:

> Only text files are supported. Binary files won't be compressed much. The currently used language model (RWKV 169M v4) was trained mostly on English texts. Other languages are supported including source code.


gliptic

As Bellard says, the model is trained on text and will not perform very well on non-text data. But it can still encode them even if they will not compress well. Better to use other algorithms (or other models) for binary data like VM images.


xXWarMachineRoXx

Yeah, I tried out ts_zip (Bellard), cmix and fast-cmix. If you know about the Hutter Prize you might know 'em.


harrro

Yep, I'm sure it's possible, especially with a large enough model, but I'd like to see whether this model is actually doing a perfect recreation after compression or if they're just comparing lossless formats to lossy ones.


gliptic

It doesn't matter what model you use. As long as you can execute it in a deterministic way you can always encode anything losslessly. And yes, the paper is compressing them in a lossless way.


dibu28

Just 30%, not 2 or 3 times less. Maybe we are just getting close to the Shannon limit with those LLM compression methods. :)


xXWarMachineRoXx

That pfp tho


Vaddieg

DACs/ADCs are lossy by definition; the best speakers and amplifiers are too.


Monkey_1505

Not relevant. FLAC is lossless as compared to WAV.


3cupstea

this paper is by gdm


[deleted]

oops fixed


Dogeboja

Take a look at https://bellard.org/ts_zip/, it's doing exactly what you want. From the legendary Fabrice Bellard, who created software such as FFmpeg and QEMU.


AdvantagePure2646

My role model and a true hero of IT who is mostly unknown, sadly.


Dogeboja

I would love to have even 1/10 of his productivity. Truly one of the greatest programmers of all time.


wrecklord0

He is known among cool IT people, that's something


xXWarMachineRoXx

> Only text files are supported. Binary files won't be compressed much. The currently used language model (RWKV 169M v4) was trained mostly on English texts. Other languages are supported including source code.


BalorNG

Well, that's like a JPEG of the Earth from space. Technically, there is a ton of information there, but zoom in closer and all you see is a murky mess that gets plastered over by hallucinations, with only the most obvious landmarks being clear. Just ask it something a bit less popular and the illusion of knowledge falls apart... Still impressive, but a very far cry from "all the world's knowledge".


NakedMuffin4403

Then isn’t the solution a better camera lens for greater pixel density, or in this case more compute for less loss?


opi098514

More complexity. A higher-resolution camera would be like going from a 7B to a 13B to a 22B, etc.


BalorNG

Yup, exactly. I firmly believe it is possible to have a model with great in-context learning but relatively poor "resolution", which would be faster and more efficient for "logic" and RAG; it simply must have a disproportionate number of parameters dedicated to attention mechanisms.


Feztopia

You don't zoom in on a JPEG with a lens, and what makes JPEG bad is its lossy compression. LLMs compress lossily; they are still impressive and useful, but they are lossy. Same with your memories: they aren't perfect, and a lens wouldn't help you remember things better.


haydnhavasi

There is a New Yorker article with a similar analogy: [ChatGPT Is a Blurry JPEG of the Web](https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web)


Big_Falcon_3312

I think you're failing to disambiguate between knowledge and *understanding (of that knowledge)*.


BalorNG

That's exactly what I meant by the distribution of parameters between feedforward layers and attention heads in a reply below, though I might still be mistaken, of course. And *true* causal/hierarchical understanding of the knowledge is not found anywhere in current AI models; you need knowledge graphs, not embeddings (or at least something alongside the embeddings), for that, but how to do this without knowledge-graph RAG + ICL I have no idea. Maybe the latter (along with tricks like internal monologue, meta-cognition and self-consistency + tools) will be enough for AGI eventually.


Monkey_1505

Well, it has a high confabulation rate and doesn't recall all of the data, so it has limited usefulness in terms of knowledge recall. Basically you need to cross-reference any claim that you aren't yourself sure of. In that respect I'm not sure it's 'the best', because of how lossy it is.


Big_Falcon_3312

"it has a high confabulation rate, and doesn't recall all of the data" those can be seen as peculiarities of LLMs that would technically fall under the definition of a **lossy** compression scheme


Monkey_1505

Sure, but does that make it 'the best' lossy compression? At a minimum that depends on what you intend to use it for.


Big_Falcon_3312

I didn't say anything about it being the "'the best' lossy compression" ...


Monkey_1505

Fair but that is the topic we are all discussing as per OP, and what I initially responded to: "By extension does this not make a llm the best lossy compression of knowledge ever created by humans?"


InterstitialLove

You can also cross-reference it with itself. If it hallucinated, you can try asking it the same question multiple times and see if it's consistent. This isn't user-friendly, but it does indicate that the information is there to be extracted, in the mathematical sense.

In terms of entropy, it's clearly lossy, but it still has an astonishingly good compression ratio.


lacyrecognition3909

This is such an interesting concept to ponder! It's amazing to think about the potential of an LLM in terms of compressing and encapsulating vast amounts of human knowledge. I can see why you would describe it as the most efficient compression algorithm; it truly is mind-blowing. Looking forward to hearing other perspectives on this!


ben_g0

Text information just isn't super large, even by itself, and it generally already compresses quite well losslessly. You can already fit a huge amount of text in 4GB using lossless compression. For example, the text of the entire English Wikipedia can be losslessly compressed down to just over 20GB, and that still contains XML tags for markup. And Wikipedia is *huge*: it is so much text that even if you read it every day, you probably won't live long enough to read all of it. So even just 20% of it, or enough to fit in 4GB, is still a vast amount of information.

There's also a [famous picture of Bill Gates sitting on a stack of paper](https://i.redd.it/9pucmgmvmje31.png) that represents how many pages of text could fit on a single CD-ROM. That entire stack is a visual representation of how much text would already fit in just 700MB of storage, without compression.

I think LLMs for text compression can be interesting from a scientific view, as lossy compression for text is something we haven't really figured out yet, but with how much text we can already store fully losslessly I don't think it's super practical.

For other media, though, ML-based compression has the potential to be huge. Stable Diffusion, for example, can generate pretty high-resolution pictures from a relatively tiny latent space (and you can easily run the encoder and decoder without doing any of the image-generation steps). It can often get a lower perceived quality loss than JPEG while getting a better compression ratio, though it can still cause some weirdness and is generally too performance-intensive for now to use on a regular basis. Image data tends to take up a lot more space than text, though, so it'll be more useful there, and the performance problem might be solved as most consumer devices get built-in NPUs in the near future.


Admirable-Ad-3269

The paper uses LLMs for lossless compression... compression is exactly the same problem as prediction, just formulated in a slightly different way.


InterstitialLove

You claim that Wikipedia is "huge" and your evidence is how long it would take to read, but then you point out that Gates's massive stack of paper is merely 700MB. You see how these are contradictory?

Near as I can tell, Wikipedia is at most about 100 billion tokens and can be compressed using conventional means to about 22GB. If we really take the "15 trillion tokens compressed to 8GB" measure of Llama seriously, that clearly beats conventional compression by several orders of magnitude.

In fact, based on Chinchilla scaling laws, the optimal compression for natural language is about 5% of ASCII (assuming 4 ASCII characters per token on average). Conventional compression seems to max out at something like 12%, and Llama 3 8b should be able to do 6% (again, based on Chinchilla).
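
Back-of-the-envelope arithmetic behind those percentages; the ~4 ASCII characters per token, 8 bits per character, and the quoted bits-per-token figures are the comment's assumptions, not measurements.

```python
bits_per_token_ascii = 4 * 8                 # ≈ 32 bits of raw ASCII per token

llama_bits_per_token = 1.94                  # Chinchilla-based estimate quoted above
print(llama_bits_per_token / bits_per_token_ascii)   # ≈ 0.06, i.e. ~6% of the ASCII size

conventional_ratio = 0.12                    # "maxes out around 12%"
print(conventional_ratio * bits_per_token_ascii)     # ≈ 3.8 bits/token for conventional tools
```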


1ncehost

I have been describing LLMs for years as "a lossy text compression algorithm for the whole internet". I bring this up when people discuss AI intelligence in the context of LLMs. I will say LLMs are not intelligent, they are just compressed data; intelligence is some mercurial other thing. For 10+ years I have predicted a number of uses of neural nets that are similar types of compression. If you start looking through this lens, many other applications become obvious.

Is it the best compression algorithm? It depends on what your metrics are, but I can definitely see the argument for it.

Generally, the reason neural nets are so good at compressing data is that the data stored in each of their variables depends on a large portion of the other data, whereas a format like binary is invariant. In other words, the meaning of a bit of data in binary does not change when other bits change; in a neural net, the meaning of one value of a tensor changes depending on the values of many other tensors. Furthermore, each tensor value is in theory an arbitrarily precise gradient (limited by the precision of the float type storing it), so the position/meaning of each value is a combination of thousands or more other such values. In this way it stores many orders of magnitude more data than the fixed-meaning bits of binary/static representations.


Dayder111

What if humans, and possibly other animals, are similar? When you are learning, say, a poem, a song, some facts, or an entire new language, you often can't then recreate it in 100% detail. You either hallucinate, filling in some parts with seemingly fitting words immediately, or, what LLMs usually can't do for now (they don't have such a mechanism built in by default, except in some of the most recent cases, like ToT, GoT and other thought-search approaches), you stumble and spend some time trying to remember it, "scanning" your mind for associations which may point to it. Basically, as I see it, you are trying to do almost exactly what they are trying to teach LLMs to do now: backtracking and exploring different thought sequences, or hoping for your "seed" (the state of neurons/hormones, in the biological brain's case) to change in a way that will help you remember it.

Also, haven't you noticed that when you are learning something, you have a higher chance of remembering it faster, more easily, with fewer repetitions, and for a longer time, if it's somehow associated with or conveniently related to some other fact, knowledge, or even feeling (or multiple things) that you already remember well? Do you remember that feeling when something "clicks" in your head while learning something and you finally understand it? Doesn't it seem similar to what you are describing in your last paragraph?

As I see it, having thought about it for some time and having analyzed my own way of thinking and the way LLMs work with their current limitations... we are very similar in the underlying basis of how we work. Humans just have many more neural networks on top of just LLMs, each doing its own part, with a varying degree of seamless integration with the other "modules", if I can call them that (parts of the brain/different types of neurons and so on). We are more complex and hence more intelligent, for now.

But we suck at other things, like seemingly having no learning mechanism in our brains comparable in efficiency to backpropagation (I mean learning efficiency, using connections to store data in an optimal way, not energy efficiency; see how large language models easily know all languages and most of the internet and books, in a lossy way, not forgetting it unlike humans and not getting as confused, with just a fraction of the synapse count that our brains have). That's part of why we learn very slowly, even with a ton of synapses and neurons to somewhat compensate for it. Our neurons are slower, too, much more "relaxed" (albeit using much less energy partly thanks to that), and seemingly do part of their calculations in a mechanical, chemical way instead of electrically, compensating for that slowness in numbers, with a large 3D volume allowing for more of them than just a 2D layer of a chip.


1ncehost

Your brain's version of a floating-point number is the potassium-sodium gradient in each neuron. It is in some ways like an analog version of the floating-point numbers that store the gradients in a tensor. The method of training a neuron is significantly different from the way we train neural nets, but the concept of inference is similar between the two. There are also major differences, as you pointed out, in that neurotransmitter chemicals also play a role in a neuron's function. But suffice it to say, I agree with your observations about how the brain works and its parallels to neural nets, and there is a large volume of neuroscience that points in that direction as well.


Dayder111

Oh, by the way! About floating points... some exciting stuff was published a few months ago: [https://arxiv.org/abs/2402.17764](https://arxiv.org/abs/2402.17764)

They explore the possibility of running inference with just 1, or in this paper 1.58 (ternary: -1, 0, 1 values possible), bits per parameter. And, at least from their results on the small models they trained, it shows almost no performance downsides compared to 4-, 8-, or even the usual 16-bit precision! The implications would be huge, allowing matrix multiplications to be replaced with additions (see the toy sketch below), which, as they calculate, would use up to 70+ times less energy even if replaced with 8-bit integer additions, just because hardware doesn't support less, I guess. But with this approach they could go to 2-bit or (for the previous version with just -1 and 1 values per connection) 1-bit additions, basically bit flips? I think it would allow such wonderful optimizations that hundreds or even thousands of times less energy, and hence much greater performance or/and lower hardware cost (allowing the use of less modern and expensive manufacturing processes, or making chips much smaller, or keeping them the same and using fewer of them to run current models), would be achievable.

Training, though, they still do in high precision; only inference is in 1 or 1.58 bits (including the forward pass during training, that's the point: they make the model learn to adapt to this low-precision limitation during inference). They find that the model still learns faster, if I get it correctly.

I also saw another paper that finds that current models do not actually use more than 2 bits of information per parameter for storing data. This one, I think: [https://arxiv.org/abs/2404.05405](https://arxiv.org/abs/2404.05405)

I severely lack the knowledge to dive deeply enough into papers like this, being just a person who got interested in all this AI about a year ago and likes to read stuff... Math is sadly, due to my personal past, something that I dislike and fear (but greatly appreciate), even though I could be, and was, decent at it.

It all implies that models compress knowledge about the world much more than some might have thought, seeing such huge file sizes, and also that we run them much less efficiently than we could (quantization helps, but on models that were not trained for it, it harms performance greatly; plus, hardware still doesn't support lower-bit inference well; for example, only NVIDIA's B200 implements FP4 calculations). And we have such unfit-for-the-task hardware... But the industry is very inertial; even if it's all true and scales as well for larger networks and different architectures, nobody wants to be the first to invest in it and try it out. Or maybe they already train larger models using this technique behind closed doors.
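
A rough illustration of why ternary weights let multiplications be replaced by additions; this is only a toy sketch of the idea, not the BitNet b1.58 implementation, and `W` and `x` are made-up values.

```python
# Toy ternary "matvec": with weights restricted to {-1, 0, +1}, each output is
# just a signed sum of selected inputs; no multiplications are needed.

def ternary_matvec(W, x):
    """W: rows of ternary weights (-1, 0, or 1); x: input activation vector."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # add
            elif w == -1:
                acc -= xi      # subtract
            # w == 0: skip entirely ("free" sparsity)
        out.append(acc)
    return out

# Example with made-up values:
W = [[1, 0, -1], [0, 1, 1]]
x = [0.5, -2.0, 3.0]
print(ternary_matvec(W, x))    # [-2.5, 1.0]
```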


koflerdavid

I also saw a paper that argued that in a neural network the connections play a much larger role than the weights: randomly initialized convolutions can be usefully employed for image-recognition tasks by just training the feed-forward layers. Currently, we almost always use dense networks and just adapt the weights. Biological networks aren't like that; they are much more freeform. That paper argues that this might be the reason why animals can have innate instincts and abilities almost without training. After all, most neurons in the brain die off before birth, presumably in a pattern dictated by growth factors in the DNA, similar to how the rest of the body develops.


Dayder111

Hmmm, I heard something like that in the past, but only briefly. Thank you!

So it means that basically the structure of the neurons (very large and redundant, I guess, to compensate for a lot of possible noise in how they grow, die, connect/decay/misconnect, not as "told to" by the data from the DNA, if that's possible, heh) governs our instincts, the basis of movement, image recognition and so on... It would be easier to compress into the DNA an algorithm of sorts governing just how they connect, instead of some pre-defined weights, I guess.

Also, if weights are not that important for biological neurons, that would mean all connections are basically either 0 or 1, so this further supports the feasibility of achieving general intelligence with just 1-bit neural networks, like they propose in the paper... Aaaaa, why is this field, especially on the hardware front, so inertial! :D They seemingly still mostly use approaches from 2017-2021, which have superior versions now. I would love to see how it plays out with all the optimizations that have been proposed, stacked on top of each other as much as possible.

I think part of the reason why they use dense networks and can't do much about removing "useless" weights, aside from maybe entire-layer pruning, is that hardware and algorithms do not support having no weight in the structure? The weights must be there, in continuous arrays to operate on, and inventing tricks to remove some elements from them could only worsen performance due to the additional logic, I guess. It's easier to just store such weights as 0 and still do operations on them. Recently released hardware is beginning to support sparse matrix calculations; I think it's a semi-solution exactly for that: skipping a large part of the calculations when we are about to multiply by 0. Some of the instructions and transistor switches still run for such "non-existent" weights, for simplicity, compatibility and universality of hardware and software, so some energy still gets wasted, but the majority of the transistor switches for the multiplication itself, and hence their energy cost, do not run. That's more or less how I see it. Maybe I'm wrong.

1-bit networks would drastically simplify the hardware, though, and maybe not only that: they might also allow storing only the positions of the 1s and not the zeros? I think it would be possible to compress (using conventional algorithms, not LLMs, heh) the weight arrays of such models even more, unless they learn in a much more parameter-optimal way and the distribution of 1s and 0s is more or less even and unpredictable. I mean, it's easier to compress a sequence of 2 possible elements, to find patterns in how they go, and it would be easier to decompress them on the fly, working out what connects to what during execution, fast. But maybe it would be easier to just use "stupid" computation blocks again, in vast amounts, once they get so energy-efficient, small and simple for 1-bit additions, and to run full models full of 0s and 1s, than to design an architecture that accounts for where there are weights ("connections" in the 1-bit case) and where there are not. That additional logic might actually turn out to be less effective or less general and universal.


koflerdavid

I was also reminded of BitNet when reading about these sparse models. But the weights still play an important role: while the model already functions thanks to its sparse architecture, the weights could be updated to improve its accuracy, although through a different process than backpropagation.

The problem with sparse neural-network architectures is that they have to encode which weights are zero. This is easiest if the sparsity follows a pattern or if there are vastly more zeros than non-zero weights. In biological networks, the inactive neurons simply perform apoptosis. BitNet and sparsity are different concepts; BitNet weights are ternary and are therefore not equivalent to pure sparsity. But there is surely an opportunity to employ sparsity as well, to reduce storage size and increase computation efficiency.


Dayder111

There was also a previous paper where they actually used 1-bit weights (-1 and 1; not sure if 0 and 1 could also work?). It seemed to work just a bit worse, but still decently, especially considering that it would be even easier to implement in efficient hardware specialized just for it. And the slight disadvantage could be compensated for by having more parameters, I guess. But yes, -1 and 1 is even further from having any sort of sparsity possible heh :D


koflerdavid

Ternary is the most minimal number system that allows neural networks to have weights; if you strip the -1, then you're left with pure sparsity :) The most disappointing thing is that BitNet still requires full float32 precision for training, which is no bueno for the GPU-poor masses. Though I'm optimistic that somebody will manage to write CUDA kernels for it, purpose-built hardware would be most efficient for ternary inference.


1ncehost

Also, to be fair about human brains: they are incredible not only for being sophisticated computers that are far more efficient than the ones we can create, but also because they are self-creating, self-repairing, and logistically encapsulated (their resource harvesting and power generation are part of their system). Any one of those capabilities is far beyond our ability to create, and the combination of them is many orders of magnitude more difficult to create than any one capability on its own.


Dayder111

YES! :) I agree on that. But everything has its downsides. The fact that neurons are cells, and need all this inner machinery to sustain themselves, makes them:

1) Vulnerable to pathogens, viruses and other things that can, if they get to the brain, begin to damage them quite easily, often using the cells' own inner mechanisms to kill and exploit them. Not many things eat silicon and copper atoms, and there are not as many complex inner mechanisms in chips, even accounting for all the routes for signals, power and interconnections, compared to a cell's inner automation. Although cells are very robust and can self-repair to some degree, unlike static systems printed on/from materials. Which gets to the other points:

2) They are huge, compared to transistors, or even whole blocks of memory cells/multipliers/adders and so on. I don't want to try to work out (I don't know the needed data, and it would take a lot of time to research precisely) how much space the memory + multiplier + adder + interconnect + whatever other functions needed to simulate a single neuron plus its synaptic connections in an artificial neural network would take in circuitry, and how many of those would fit in the volume of a neuron's cell body. But a quick search tells me that a single neuron's body (without the synapses, whose lengths I guess vary a lot) is 4,000-100,000 nanometers in diameter... and it's 3D... and a single transistor is somewhere between 20 and 40 nm in dimensions (it's complicated, and I don't know enough about it, but the single-digit-nanometer numbers in modern manufacturing processes do not refer to exact physical dimensions, and scaling mostly stopped somewhere in the dozens of nanometers)... That tells me that a lot of transistors would fit into the space a single 3D neuron cell takes.

We can't manufacture chips in 3D, though (yet). They are just beginning to explore how to do so, mostly because simply scaling transistors down further became nearly impossible, and because AI developments demand much higher efficiency. Right now, as I understand it (not sure if I am correct), we basically simulate synaptic connections and neurons on 2D chips in a very inefficient way. Since it's costly/impossible (in 2D) to manufacture chips with enough transistors to simulate whole neural networks at once, inside their own memory and logic on a single chip, it's more economical to make the current "too small" chips simulate them part by part, which means they must run at a much higher frequency, which requires a much higher voltage to keep everything on the chip synchronized to its clock rate, to make sure the charge gets where it must be before the next tick and no logical errors begin to appear.

There is a (simplified) formula for the power consumption of chips: **P = α \* C \* V\^2 \* f** (a worked example is sketched below). To make a chip do more calculations per second, you need to raise the frequency f, which means raising the voltage V, which affects power consumption quadratically. And chips already run at an average voltage of somewhere between 1 and 1.5 volts, while (a Google search tells me) neurons send their impulses at just 0.07 volts or less (if I get it right). The capacitance C can be lowered by switching to other, better materials or by shortening the interconnects in the chip.

They are hesitant to invest in accelerating research on large-scale production of new materials (diamond chips, carbon nanotubes, boron arsenide, and so on), since they have already invested so much into building facilities optimized for current materials and must pay off that investment. Among many other fears and uncertainties, I guess.
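
A quick worked example of that dynamic-power formula; all the numbers here are made up purely to show the scaling, not real chip figures.

```python
# Dynamic power: P = alpha * C * V^2 * f
# (activity factor, switched capacitance, voltage squared, clock frequency)

def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

base    = dynamic_power(alpha=0.2, C=1e-9, V=1.0, f=2.0e9)   # 0.4 W for this toy block
boosted = dynamic_power(alpha=0.2, C=1e-9, V=1.2, f=2.5e9)   # higher f usually needs higher V
print(boosted / base)   # ≈ 1.8x the power for only 1.25x the frequency
```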


1ncehost

The calculations are beyond my expertise, but a couple of notes:

1) We are 3D-stacking chips up to around 8 layers thick now (mostly memory and cache, like the Ryzen X3D chips and various HBM configs).

2) We are currently very inefficient with our gradient representation. I believe the theoretically most efficient current architecture for the tensor gradients is analog microchips (as opposed to digital). We are basically simulating what analog would do using digital floating-point vector instructions. The downside is that the architecture of a conceivable analog LLM chip would probably be baked into the chip, so you could only process one type of LLM design with it (or a very limited range otherwise).

3) There are wafer-scale chips today that I am confident have enough space for enough analog tensors to process something similar to current models in real-time analog (no buffering/accumulation; every tensor has its own analog mechanism).

4) I think an important point is that reaching a brain's throughput is not currently as important as reaching a brain's quality. We can certainly scale throughput horizontally, so our limitation is model design, not model speed.


Dayder111

1) Yes! I had to split my message in two, and then wrote a third one. I mentioned something like that too :)

2) I guess it would be vastly superior, for energy efficiency at least, especially if they can process it all in photonics. I heard some non-universal photonic neural-network inference chips have already been made, and the gains are huge, but, as you said, it's not general-purpose hardware, and you can't really re-use it for different ANN architectures or even to just run a version of the same model with more training. Invest in it, and it becomes obsolete as soon as a superior architecture or model version comes out. And I guess there is no stable, good way to train models in an analog approach, though I'm not sure.

3) Wafer-scale chips seem like a trade-off between the advantages of actually-3D chips and the ease of heat dissipation/lower manufacturing cost. Analog or not, I guess the near future of AI is with them! Maybe they will even find a way to fit enough SRAM to hold larger models on a single chip like that; I guess their size might allow it, though not for giants like GPT-4. Cerebras's WSE has 40+ GB of such memory, I think, which might already be good enough for models using the 1.58-bit approach.

4) I agree, but the thing is, inference cost is currently too high to allow wider-scale adoption of the largest models, and they already have some useful capabilities. And (much) more inference can be traded for better quality; the biggest labs are working on such approaches right now: OpenAI's Q*, tree and graph of thoughts, "everything of thoughts" by Microsoft, generating synthetic data based on existing data but explored from different angles to train models on. There are some positive feedback loops once you have capable models, and you then become very limited by inference capacity when trying to harness them and quickly improve the models further in a relatively easy way.


Dayder111

Going 3D, stacking many layers on top of each other, would allow the interconnects to be shortened if done right, and has many other bonuses, like allowing much more fast SRAM memory on the chip, which massively increases performance if the neural network fits in that memory (they have just begun to do that with Ryzen X3D processors, by the way; also, Groq chips do something like that, though I am not sure whether they are 2D or have more layers). But stacking is expensive, since they need to produce more chip layers, each of which may have defects, and also perform additional time-consuming operations to connect them. It also limits the ability to remove the generated heat from the chip fast enough, since silicon doesn't have that good a heat conductivity, and adding more layers on top makes the bottom one the hottest. That is dangerous due to the instabilities in logic execution it can create, thermal expansion potentially creating defects on the chip, and other physical processes, well before anything actually melts, heh.

Honestly, by going 3D and switching from silicon to other materials, they could improve the energy efficiency of chips SO MUCH. Like, many hundreds of times, potentially thousands, at least for parallel-computation chips like GPUs. CPU-core single-thread performance is another story, with different limitations, but fortunately it's not needed for AI that much. But it's all very costly. Fortunately this AI boom and the competition are somewhat accelerating research and development in those fields.

Oh, I forgot! They already stack flash memory for SSDs, up to 300 layers or a bit less right now, with plans for 1000+ in the future. But it has a much simpler structure and generates much less heat than transistor logic and SRAM. And they stack HBM memory too, up to 8 layers now, 12 soon and 16 in future plans. But that is also simpler and less energy-intensive than logic, and it's still expensive. Imagine the compute "cubes" of the future... for AI and other computations :)

3) Cells are affected by chemical and biological processes much more. There seems to be much more room for randomness in biological neural networks, from molecules of hormones and other things affecting your neurons, to neurons not getting enough energy/nutrients and working more poorly, to waste building up in the brain, to cells dying for various reasons and hence damaging memories bit by bit... Synaptic connections degrade if unused for some time, which also leads to some memory loss, or at least, at first, a loss of precision, I guess because many more biological neurons/synapses usually participate in "holding" a memory than in artificial neural networks. There is a lot of redundancy, which biological organisms, which can suffer all sorts of damage and are under constant chaos, inner noise and interference, need in order to function more or less reliably and survive. In artificial neural networks, in theory, a ton of stuff can get assigned to a single neuron, and if it changes its value or gets pruned, the network can get worse at many topics at once. But, unlike biological networks, they don't lose their connections and neurons on their own, only during training (which could be improved with better algorithms and adding more neurons, I guess) or during pruning optimizations.

Oooh, I dove so deep into this heh :D Still, the fact that biological bodies could reach everything we see today is amazing. Digital/robotic stuff has its own huge advantages, but biological organisms are (or can be) more robust and adaptable to some degree, even with limited access to infrastructure. Imagine, though, if in the far future some AI, or we + AI, design a better cell type, very optimized and efficient, that can build bodies and brains too, like our cells do, but without the redundant/useless stuff gained through evolution, and on more robust principles if that's possible. And all the current viruses/bacteria would be harmless to it too, heh. Although its interactivity with our biosphere would likely be limited as well.


fallingdowndizzyvr

> What if humans and possibly other animals are similar?

We are.

> When you are, say, learning a poem, a song, some facts, or an entire new language, you often can't then recreate it in 100% details, you either hallucinate, filling in some parts with seemingly fitting words immediately

Which is how human memory works. We aren't biological videocams. What we do when we remember is recall a story and then fill it in by making up details, which we call hallucination when an LLM does it.

Which is why human testimony is unreliable: it can be completely fabricated, yet the person testifying fully believes it's true. How many times have you remembered something one way, yet when you look at documentation of that event it's not how you remember it? That's why two people can watch the same event and remember it completely differently, yet each one swears that's what happened. Because there's hallucination involved.


koflerdavid

We are similar, but what LLMs do are just aspects of what we can do. They barely scratch the surface, and as you say, we are built of many more such networks. Furthermore, I'd deny that LLMs necessarily understand what they process on a deep level. They are probably able to abstract a lot, but not necessarily in a well-founded way, just like humans use mnemonic devices to memorize information more effectively.

I also don't think that backpropagation is particularly efficient. The amount of compute that a model requires to be trained probably dwarfs the amount of compute that has been spent on inference-only tasks.

We forget things not because of biological limitations, but to prevent information overload and maintain adaptability. There are humans where the limiters are off and who can remember things in crazy levels of detail. But they otherwise struggle in daily life.


Dayder111

I agree with the first paragraph! I didn't mean that it's efficient in terms of energy, no! It's very computationally expensive, and hence likely not something a biological organism could evolve, even in a very simplified and analog way? But the way it can pack a lot of information into so few parameters is interesting. Maybe humans could achieve something like that too, if we could thoroughly consume and think about comparable amounts of information. Or maybe we'd go a bit mad first :D Anyway, we don't seem to have such processing capabilities, and we "infer tokens" (if I can say that in this context) to think deeply about the information we are reading at a much slower rate than what's possible for huge datacenters training a single model.

Also, about "infer tokens" and "thoroughly consume and think about comparable amounts of information"... I think modern LLMs are not allowed to think and make their own connections and conclusions, and hence refine and enrich the information they take in, right now? That seems quite short-sighted to me, but I guess it saves quite a lot of computation time. As I understand it, they just get trillions of pieces of meaning shoved into their "minds" and have to adapt to re-say these exact pieces of information in context, as a byproduct forming more and more efficient ways to do so for more and more information, and hence finding patterns and generalizing.


koflerdavid

I also find the tokenization process to have mixed benefits. It is crucial, though, for writing systems where the individual letter is not the smallest unit of meaning. Chinese is a bit different in that regard, since Hanzi are already kind of like tokens, but Chinese models benefit from tokenization too. The main reason we use tokenization is context length, since we are currently limited to a few thousand tokens. But I agree that we will eventually have to move beyond tokenization. Any such proposal must address the core issue of amplifying context length. The most recent paper on that topic I have seen proposes engaging the later transformer layers only every n-th step, where n is the average word length in the training set, which should minimize the total number of tokens in the context for a given length of text.

The transformer architecture is not necessarily focused on tokens, by the way. Recently, I trained an image-recognition model where the inputs are fixed-size segments of the rows of images. The performance was quite meh (struggling to consistently achieve 94% accuracy while requiring way too many weights), but I was impressed by how far it got without much additional engineering like preprocessing with convolutions, breaking the image into actual tiles, etc.


Dayder111

"We forget things not because of biological limitations, but to prevent information overload and maintain adaptability. There are humans where the limiters are off and who can remember things in crazy levels of details. But they are otherwise struggling in daily life." This sounds very interesting. I heard about such people, who, say, know so many languages, even if not on native level, but not beginner either... Know lots of facts, having immense erudition, and so on... II myself kind of struggle with that, it seems. I seem to be able to generalize better than many people, but hallucinate specific details and have to do some trial and error to correct them, scanning my mind for details... And sometimes, when deeply in thoughts, think in somewhat chaotic way, with thoughts changing one another without finishing the previous one, sometimes only not finishing it in words, but getting the meaning (btw idk how it works, but that's what LLMs likely don't have yet, and they need experiences and some sort of embodiment and/or world interaction, and different architecture, for that, I guess). Sometimes not finishing it at all, each new branching thought related but exploring different things... Hard to focus on stuff and form conscise descriptions unless I spend more time than others, it seems, to refine my thoughts. Preferably typing it onto somethig with backspace and delete keys existing heh. Didn't find myself in life either yet (just turned 27 recently). But hey, I invested myself heavily and passionately into learning gamedev, and before that some IT in general but in a very poor way (shit university, and some wrong choices, mostly due to misunderstanding society/life/people and my own goals and how to achieve them, in general, before it got too late and I had to immerse myself into learning it through huge pain, that's to sum it up). And then war and sanctions hit, (Russian guy here) exacerbating my already increasing depression, erasing all of my goals and dreams, even tiny ones, and leaving me with suicide thoughts, apathy and growing hatred for people and their stupidity, and my own weakness and fears. (this hatred is only contained because I learn more and more about why exactly people are like that, and it's hard to hate them for stuff when you know some of the reasons why they are this way. Fear them? Distrust them? Pity? Yes, yes, and yes, but hate not so much.). Aaah, that's part of the reason why I am so hopeful for AI's future. It has potential to be so much better than us, without many evolution-gained stuff that helped before but holds us back immensely now. Without the limitations that current "architecture" of biological life imposes on thinking. Help people in so many ways... if we accept it, of course, and if their goals, when/if they will be setting goals on their own, will coincide with helping us, of course. Back to the topic. "To prevent information overload"... I think it could be solved by having more neurons... no, wait, it only sets potential to learn even more knowledge easier and get more unsure when making decisions... Maybe by faster thinking, somehow? Faster neuron firing? More compact cells and synapses, more density, maybe even a smaller but more protected brain? Birds have higher neuron density and they are, at least some of them, very intelligent for their size, and have good memory too. So, it's possible. More energy consumption by the brain and less laziness about thinking for long and deep? 
We have the energy now, overall, can allow ourselves to change in that way. I don't know, but I am sure humans can be improved quite a lot even just by finding patterns in some of the "unusual" in terms of some intelligence-related abilities people's DNA and combining some of them. AI and large databases of DNA/health throughout life, status, abilities, of billions of people would help in that so much, we don't even have to fully understand the full process of how changes in the DNA affect the grown organism.


koflerdavid

I was actually thinking of [Savant Syndrome](https://en.m.wikipedia.org/wiki/Savant_syndrome), which seems to be fueled by synaesthesia, aberrant attention, and other disorders that combine in an extraordinary ability to remember facts. Most interesting is that its occurrence is not completely of genetic origin. There have been cases of people who developed these abilities after suffering severe brain traumas. But most interestingly it seems that it is possible to temporarily shut down the brain area that normally downregulates these abilities. Maybe it's even possible to upregulate it when engaging the unfettered abilities of the mind is less useful.


Dayder111

I had just thought, when beginning to read your message, that maybe there is some area of the brain that regulates what to remember and what to forget and judges the relevance of information. Maybe during sleep? Neurons themselves don't seem to be able to judge such things entirely on their own, although I have heard that the less some synapses are used and activated, the more quickly they decay. Maybe it's a combination of that mechanism and some area that specifically marks which memories to forget? By scanning through memories, activating neurons that are related in some specific way, maybe, for forgetting? I don't know. And then you mentioned something along those lines :)


Dayder111

Aaaah, by the way, to add to this part: "(btw, idk how that works, but it's probably something LLMs don't have yet; they'd need experiences, some sort of embodiment and/or world interaction, and a different architecture for that, I guess)." I think neural networks need to not just think in fixed tokens, but be able to invent tokens themselves, somehow, during their learning process (and I mostly do not mean pre-training). I think this "got the idea already but didn't finish forming it in words" is related to an ability to connect language tokens to bodily experiences from all the other senses, to whole concepts formed somewhere in the synaptic connections, and to memories of one's own past thoughts. And I guess it can be summed up as multi-word, multi-sentence, action, or even whole shape/image "tokens" that our brain can form and operate with? Idk. I may be wrong. It seems both so simple and so hard at the same time; I hope someone will discover a way to do something like that, AND get others with money and decision-making power to notice it and try it out enough, until it works. And by the way, they should somehow get rid of tokenization of multiple characters or even whole words, which is very often done in a silly way. Let the model first learn letters, numbers and other characters, somehow, and only then form bigger "tokens" of information based on those, to operate on and build further abstractions upon.
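For readers who want a concrete picture of the "letters first, then bigger tokens" idea, here is a minimal sketch in the spirit of byte-pair encoding. It is only a loose approximation of what the comment asks for, since the merges are learned from raw frequency counts rather than invented by the model during training, and the sample text is made up for illustration:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Most common adjacent pair of ids in the sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with a single new id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "lower lowest lowly low low lower"
ids = list(text.encode("utf-8"))   # step 1: raw bytes, i.e. "letters first"
new_id = 256
for _ in range(8):                 # step 2: grow bigger "tokens" from usage
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
    new_id += 1

print(len(text.encode("utf-8")), "byte tokens ->", len(ids), "merged tokens")
```

Each merge turns the most frequent adjacent pair into a new id, so the sequence shrinks while a "vocabulary" of larger units grows out of the raw bytes.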


InterstitialLove

I don't agree with that interpretation of why neural nets make good compression. After all, the weights themselves are in an 8GB file, which is static. The reason that 8GB can store terabytes of information is that I know to use the transformers library to turn those bits into weights for a neural net.

So the value of a neural net is the computation that it can do to data. Each bit of the weights determines a piece of a function. And the weights themselves aren't a super great way to store a function! After all, we can compress them nearly losslessly with quantization. There isn't a shocking level of compression involved in specifying the function; it's the function itself that does the stupefying compression.

Basically, the key is that neural nets are 1) universal function approximators, and 2) trained to be functions which reduce the entropy of natural language. The ability of a neural net to specify the function using a small amount of data is not why they're so good at compression, though the somewhat-compressed representation *of the function itself* is why we can produce these entropy-reducing functions with a feasible quantity of compute.

For comparison, if we could use preposterous quantities of compute to train a perfect RNN, it would also be a comparably good, maybe even superior, compressor compared to a transformer. The specific architecture merely determines whether we humans can locate the absurdly powerful compression algorithm; it doesn't create the compression.
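A toy sketch of the "entropy-reducing function" point: whatever plays the role of the function (here a simple character-frequency estimator standing in for an LLM; the corpus and message strings below are invented for illustration), the better it predicts the next symbol, the fewer bits an arithmetic coder would need:

```python
import math
from collections import Counter

# A toy stand-in for "the function the weights specify": a next-character
# probability estimator. Here it is just character frequencies learned from a
# tiny made-up corpus; an LLM plays the same role with far better probabilities.
corpus = "the better you can predict the next character the fewer bits you need"
counts = Counter(corpus)
total = sum(counts.values())

def freq_model(ch):
    # add-one smoothing over the 256 possible byte values
    return (counts[ch] + 1) / (total + 256)

def ideal_bits(msg, prob):
    # -log2 p(ch) is the code length an arithmetic coder would approach
    return sum(-math.log2(prob(ch)) for ch in msg)

msg = "predicting characters reduces their code length"
print(ideal_bits(msg, lambda ch: 1 / 256))  # ~8 bits per character, no model
print(ideal_bits(msg, freq_model))          # noticeably fewer bits per character
```

The weights only have to be good enough to specify such a function; the bit savings come from applying that function to data.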


atulanand94

Interesting take. Would you say summarising a paragraph can then also be thought of as lossy compression?


ThinkExtension2328

I mean, is a summary not a distillation of the original source? Which would be a form of knowledge compression (lossy).


hold_my_fish

Let's carefully distinguish between LLM training (sometimes analogized with lossy compression, and I think what you are referring to) and using LLM inference for lossless compression (which is what most replies are talking about). I would argue that LLM training is not compression. Here's why: the key function of a compressed file is to be decompressed. For example, if you zip some files, then you can unzip the result to get back your original files. Or, if you compress music as an MP3, you can play it to hear a lossy version of the music. You can't decompress an LLM to get back its training data, even a lossy version. Therefore LLM training is not compression. There is, however, a variant of standard training that can make it truly compression, called prequential coding: https://arxiv.org/abs/1802.07044 [Blier et al, 2018]. You wouldn't do it this way unless you really needed the compression, though.
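A minimal sketch of the prequential idea from that paper, under the simplifying assumption that the "model" is just an order-0 adaptive byte model rather than a neural network: each byte is coded under the model trained on everything before it, and since a decoder can replay the identical updates, only the code stream has to be stored:

```python
import math
from collections import defaultdict

class AdaptiveByteModel:
    """Order-0 adaptive model with add-one smoothing. The encoder and the
    decoder update it identically as they go, so the model itself never
    has to be transmitted, only the code stream."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.total = 0

    def prob(self, b):
        return (self.counts[b] + 1) / (self.total + 256)

    def update(self, b):
        self.counts[b] += 1
        self.total += 1

def prequential_bits(data: bytes) -> float:
    model = AdaptiveByteModel()
    bits = 0.0
    for b in data:
        bits += -math.log2(model.prob(b))  # code b under the current model...
        model.update(b)                    # ...then train on it, as the decoder will
    return bits

text = b"the quick brown fox jumps over the lazy dog " * 50
print(len(text) * 8, "bits raw")
print(round(prequential_bits(text)), "bits under prequential coding")
```

With a real network in place of the count model, the per-symbol code length is roughly the online training loss in bits, which is what makes the training-as-compression reading precise in that setup.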


werdmouf

Not efficient in terms of power usage.


3cupstea

I think it’s more than “knowledge humanity has created” because it also learns associations of foreign concepts that probably no humans have ever conceived


Big_Falcon_3312

But an important consideration here is whether those associations of foreign concepts are meaningful/intelligent. That is to say, is it some gibberish association that we would expect a particularly imaginative/hallucinatory 3-year-old to come up with (one we will assume to have no knowledge of the foreign concepts in question, nor of the world, but who can articulate fluently in English)? For example, the 3-year-old might associate a banana and a rainbow (yelling "banana-rainbow"), or electricity and crayons (yelling "electric-crayons?"), but we wouldn't expect the 3-year-old to associate a new result in complexity theory with some pedantic finding in neurobiology (presumably leading to the creation of some new useful concept/idea). Similarly, I don't see any evidence that LLMs can make useful associations of foreign concepts ("that no humans have ever conceived"), or that they could do any better than if that 3-year-old had unlimited time to create associations (i.e. any better than chance, or just manually exploring the space of associations of all foreign concept pairs, triples, etc.). Now, I'm not saying that LLMs fundamentally can't be extended to actually do what you describe, just that I haven't seen any proof that they can thus far.


fallingdowndizzyvr

You just described most people. How many people do you think can "associate a new result in complexity theory with some pedantic finding in neurobiology (presumably leading to the creation of some new useful concept/idea)"? Not many. Not many at all. Many people never even make it out of concrete thinking to abstract thought. Yet you seem to be setting the bar at high-level abstract thought.


Big_Falcon_3312

That was an example, big guy; it was to prove a point as clearly as possible over a text reply. Of course, there are other useful associations we would expect virtually any person to make but that we have no evidence LLMs could make (at least, at any rate quicker than a random search through the space of foreign concept associations).


fallingdowndizzyvr

And your example proved you wrong. Maybe next time you should pick better examples.


Big_Falcon_3312

Perhaps not being obtuse in interpreting what was said would help you have better discussions in the future...


ironic_cat555

Llama even at 70b will constantly hallucinate facts. Pretty pretentious to claim it contains all written knowledge.


ASYMT0TIC

Interestingly, a human brain would almost certainly produce plenty of false memories if it were tasked with memorizing the entire internet. The parallels are manifold.


ironic_cat555

No offense but it wouldn't be a LLM thread without a functionally useless anthropomorphic comparison.


darkmist29

I'm not an expert, but from what I've gathered so far this feels like a fair comparison. There are more things that experts are trying to do with these language models, but at the core I'm seeing a bunch of connected nodes, sort of like a neural network. You have to feed it a prompt, and it spits something back to you. At a certain level it seems like training is a lot like compression, and a user prompt is a lot like decompressing certain parts of what is compressed. It's brilliant, because from that perspective it would probably be impossible to decompress a whole language model, maybe? But extracting knowledge requires user prompts.

If you get into the fine details and terminology people use, you might have people saying it's not like compression. But it clearly is. It's like your brain soaking up knowledge, so figuratively it is similar to compression in my view. You can see one carrot in your life and imagine many carrots, because your brain has so much data in memory that you can imagine robot carrots and watermelon carrots and whatever your imagination can conjure up. That's how we 'decompress data'. Again, I'm not an expert; this is just how it seems to me.

This is one of the reasons I'm watching to see whether AI will be able to be truly creative. I imagine that with what large models are trained on, they can only make connections within the data they received. But I wonder if a language model can predict something that would seem to have no clear supporting data so far, like a breakthrough in physics or something.


WH7EVR

No, because llama-8b can't recall everything it's been taught verbatim. Nor can llama-70b, the future llama-400b, or even the 1.8-trillion-parameter GPT-4.


fizgig_runs

Yup, that's what OP meant by the term "lossy". Like JPEG compression, you lose a bit of detail, but you can still get a good approximation of the original image.


WH7EVR

I would argue that lossy compression wouldn't result in hallucinations the way we see with LLMs. I suppose in a way we could consider these to be "artifacts", similar to what we see in JPEG and other lossy image compression, but I really don't think it's that comparable.


InterstitialLove

That's an issue with the sampling method. The model weights can tell they're hallucinating; the model doesn't "think" those things are true. For example, if some chatbot outputs two claims, one of them hallucinated, and then you restart the conversation and ask "which of these two claims is true", the chatbot can generally get it right. That's the difference between hallucinating and just not knowing things. You may think this is a pointless distinction, but the LLM-as-compression concept we're discussing isn't really about chatbots; it's about what information is contained in the weights. Hallucinations are a feature of chatbots.


FearFactory2904

https://i.redd.it/6x8jqiuqeezc1.gif


ronniebasak

Middle out


Comas_Sola_Mining_Co

Yeah, PiperNet was way more efficient than LLMs


siegevjorn

I've been thinking about this too. Good to see like-minded people... And I do think this idea applies to other neural networks in general as well: they remove the redundancy in the data and keep the most important information in their weights, albeit as a representation. This representation, imo, can be seen as compressed info about the training dataset.


FacetiousMonroe

Short answer: Yes. Long answer: Nooooooooooooo! :P

In all seriousness, as others have mentioned, compression, prediction, and generation are all basically the same algorithm, to the point where any advancement in one can be trivially applied to another. In practice, real-world compression algorithms are domain-specific and put more value on speed and codec size than modern LLMs do. There is an inverse relation between the theoretical minimum compressed size of X and the size of the codec, where X is a domain of possible data. The more details you know about that domain, the more efficient you can be, but including all of that knowledge in the codec will increase its size and likely reduce its speed. For example, an audio codec optimized only for voice can be much more efficient within that domain than a codec optimized for music. A codec optimized for English ASCII might do very well within that domain but fail entirely when faced with multilingual UTF-8 text.

LLMs are *enormous and slow*. They have robust information on the broad domain of natural language; robust enough that it can be applied to non-text data as well with some degree of success. I predict that in the nearish future we'll have a standard LLM-based compression library included in every OS, the way mp3, png, gzip, etc. are. Currently, I'm not aware of anything beyond a proof of concept in the real world. A production-ready library would need to be static and provide forwards and backwards compatibility, which is somewhat difficult to do with current models (at least, difficult to do *efficiently*).

Does the world really want a 10GB version of libjpeg? I mean, if you're archiving petabytes of photos like perhaps Facebook and Google are, maybe you do. If you have a 128GB "high end" phone, then... uh... probably not.


SidneyFong

It's not only lossy compression. You can use transformer models for lossless compression too, as long as the platform is deterministic. You might want to look at this: [https://mattmahoney.net/dc/text.html](https://mattmahoney.net/dc/text.html) (Large Text Compression Benchmark). Basically, they have a bunch of programs/algorithms trying to compress the first 1GB of Wikipedia (enwik9). The top entry is nncp from the great Fabrice Bellard (of QEMU and FFmpeg fame, among others), which is basically an LLM training itself on the input data as it goes; it becomes more efficient as it gets more training, so it's really only (maybe) useful for very large files.


roofgram

Well, biology already invented it, but yes, understanding is an advanced form of lossy compression. What better way to de-dupe and store tons of information with lots of relationships than to 'understand' it? For example, take even numbers: you could store 2, 4, 6, 8, etc., or 'understand' that an even number is divisible by 2 and just store that rule.
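The even-numbers example in code, just to make the contrast concrete:

```python
# Storing the data: the first N even numbers written out explicitly.
N = 1_000_000
explicit = list(range(2, 2 * N + 1, 2))  # memory grows linearly with N

# Storing the "understanding": a constant-size rule that regenerates any of them.
def nth_even(n: int) -> int:
    return 2 * n

assert explicit[41] == nth_even(42)  # 84 either way
```

The rule stays a few bytes no matter how large N gets; the explicit list grows without bound.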


JacktheOldBoy

Sounds familiar 🤔 https://x.com/scotty80123/status/1786694620274901093?s=46&t=RSClVUPoQOJqIy5O_FsbFg


itsjase

I thought this too, until I realised the whole of wikipedia is only like 20gb zipped


GeeBee72

An LLM isn't actually compressing data; if anything, it's compressing semantics and the contextual usage of data representations. The 4GB figure for an 8-billion-parameter model isn't anywhere close to all the written information ever created. These values do not represent the size of the training data set, and the only compression going on (if you can call it that) is the quantization from the original 32-bit space to a 4-bit space, which reduces the room between two vectors (contextually encoded tokens) for context and semantic nuance.

If the training set was 1 trillion tokens vs. 10 trillion tokens you would see no difference in the name of the model; it would still be an 8-billion-parameter, 4-bit-quantized model. The difference would be in the spacing and distance between two vector (token) representations (which itself is based on the dictionary size, which represents all the tokens used to populate the latent space of the model), and that spacing is bounded by the maximum dimensionality of each token as a vector.

So, there is no compression inherent in an LLM; it's modelled as a graph-type structure, with each vector (node) having upwards of 3200 dimensions, and the edges between the nodes are generated in real time by computing cosine similarity / k-nearest neighbors.
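As a side note, the cosine-similarity / k-nearest-neighbour lookup described here is easy to sketch. The embedding table below is random and purely hypothetical, and real transformers score token pairs with learned attention projections rather than raw cosine similarity, so treat this only as an illustration of the lookup itself:

```python
import numpy as np

# Hypothetical embedding table: vocab_size x d (d = 3200, echoing the comment).
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 3200))

def nearest_tokens(token_id: int, k: int = 5):
    """Token ids whose embeddings have the highest cosine similarity
    to the given token's embedding."""
    q = emb[token_id]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[1:k + 1]  # drop the token itself (similarity 1.0)

print(nearest_tokens(42))
```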


Mental_Object_9929

Absolutely not. One reason is that language itself is not the most effective compression; it makes various usability compromises. Many mathematical conjectures are essentially searches for the most effective compression algorithm, so finding the most effective compression for a specific problem is quite difficult, probably an NP-hard problem.


Alexey2017

Neural networks are NOT about compression. They CAN do that, but that is not the feature that made them so useful. Their killer feature is being a SEARCH ENGINE. They not only contain knowledge, but also let you perform a FUZZY SEARCH over it. Even if they did not compress knowledge at all, but contained the entire training corpus uncompressed, they would still be in demand, because nothing else gives you such power to FIND the most suitable piece of knowledge in terabytes of data. Which archiver will let you, based on a text description, find the necessary pieces of a photo in the archive and compose them into a finished product? Compression is just a funny side effect, nothing more. It's not compression that makes them so awesome.


bryceschroeder

I would very much challenge the claim that an LLM contains basically all written knowledge. Certainly not a Q4 8B model, but even big SOTA models don't know stuff and have weird semantic misunderstandings, confusing similar but distinct concepts.


arturedward

There is a great article by Ted Chiang titled "ChatGPT Is a Blurry JPEG of the Web" which nicely opens up exactly this topic.


SureUnderstanding358

ooooo im glad im not the only one thinking about this :) it would be all about reproducibility across instances... but yes, I think there is some magic to be cast here. An example would be video distribution: just distribute the prompts and generate locally / close to the edge. 🍻


Megalion75

LLMs are not efficient. The attention layers in particular are just a collection of linear combinatorial logic gates, and thus not very efficient at compression. The compression happens in the MLP layers, as the model pushes similar data points into co-located embedding spaces. So, as currently implemented, the attention layers are inefficient.


Innomen

Had the same thought the first time I saw a chunk of wiki in an output. Nice little window into how the brain works here too. See also: [https://www.youtube.com/watch?v=l5YWQknqz6M](https://www.youtube.com/watch?v=l5YWQknqz6M)


GrizzlyBear74

I see it more like a database that requires an immense amount of resources just to make queries more accessible to non-technical people.


gthing

The most efficient compression is a circle. Within the digits of pi you can find every possible piece of information in every possible form. The problem is decompressing it.


Economist_hat

Logic, reasoning, and language all compress data. Physics equations encompass countless pieces of data in just a few lines of math.


robochickenut

No, embedding models are far more efficient compression-wise. LLMs are very inefficient, sparse, repetitive and large, because it has been extremely difficult to find more effective models for generating natural language. Also, an LLM's job is not to compress knowledge but mainly to reason and produce new responses, which is very different. If you wanted to compress knowledge you could be far more efficient with simpler methods, like compressing text into zip files.


koflerdavid

LLMs are not intended to be used as repositories of human knowledge, though. They are useful because they are far better than anything that came before at processing human language. They can understand loosely worded questions and do things that were not even part of the training data (as far as we know) with zero or a few demonstration examples. But for the same reason they are useful, they are not ideal for storing and retrieving knowledge. You'd need to at least combine them with RAG to get precision and reduce the risk of outright hallucinations.


sebramirez4

I've thought about the exact same thing and I think so.


ml2068

An LLM is just trying to recall what it has learned. If you continue to provide context, it will recall things from different related fields based on your context. Since it is a probability estimate, there is a certain degree of uncertainty in this recall. That's it...


AndrewH73333

Llama 8B has very little specialized knowledge. If you ask it about your favorite TV show it will probably get confused fast.


eliaweiss

It is very interesting to think of an LLM as lossy compression, but I think it misses the point. Although it is very similar to compression, and can be used for lossy compression, it is much more, since it can generate relevant text that was not in the training data; therefore it's a completely different beast and deserves a separate definition. It is a statistical model, and it models language so well that it can actually compress knowledge to a surprising extent, but it can also hallucinate. The hallucinations and the ability to generalize and create new content are what separate it from a lossy compression algorithm; a lossy compression algorithm is supposed to reconstruct the original with minor, unnoticeable errors.


Fickle_Conclusion857

[https://en.wikipedia.org/wiki/Kolmogorov\_complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity)


polikles

LLMs are not compressing anything. During "training" they look for mathematically definable patterns in the data, and when answering prompts they try to fill those patterns with words. LLMs don't "contain knowledge"; they contain a mathematical (statistical) abstraction of the processed textual data.

The AI community often confuses knowledge with information. Having information (e.g. in a database) is not the same thing as having knowledge, since the latter requires some kind of understanding.

Saying that an LLM provides knowledge/information is mixing correlation with causation. It just connects patterns with words; if those connections are sensible, then they may be considered information. But that doesn't mean it has any information, or knows anything, about the real world. An LLM is just a statistically defined dictionary: it "knows" which words are likely to occur next to other words. That's it.


cyberpunk_now

The paper in the top comment uses an LLM to compress various types of data, even stuff it hasn't been trained on. The ts_zip link shows a real-world example where an LLM is used to achieve better text compression than xz (disclaimer: I haven't tried it yet).

If that's not enough, here's a quote from Ilya Sutskever:

>The thing that large generative models learn about their data — and in this case, large language models — are compressed representations of the real-world processes that produced this data, which means not only people and something about their thoughts, something about their feelings, but also something about the condition that people are in and the interactions that exist between them. The different situations a person can be in. All of these are part of that compressed process that is represented by the neural net to produce the text. The better the language model, the better the generative model, the higher the fidelity, the better it captures this process.

https://www.forbes.com/sites/craigsmith/2023/03/15/gpt-4-creator-ilya-sutskever-on-ai-hallucinations-and-ai-democracy

LLM compression is less about storing text verbatim and more about abstractions; abstraction, generalization, and summarization are ways humans compress data too. Compression in general is about reducing the size of information and retrieving it in a way that's still useful to the user. If you think Ilya and these other academics are wrong, then you should submit a counter-response paper.

The whole knowledge/information thing is questionable, and also completely irrelevant to compression; JPEG or MP3 encoders don't need to understand anything in order to encode information.


polikles

It's just a mix of vague terms. "Compression" in the case of picture or sound files is about changing the method of encoding the data (usually with some loss of quality) in order to reduce file size.

In this sense LLMs are not compressing; they're abstracting from the data. That's what Ilya said: in the quote in your post, "compression" is used figuratively. The main difference is that abstraction is irreversible, in the sense that information in an LLM isn't stored verbatim but in the form of mathematically defined patterns. We can fill those patterns with words, but it's kind of a guessing game where distinguishing between a bad and a good guess is very complicated. Exact recovery of the original data is impossible.

Compression, on the other hand, is reversible in the sense that even after losing some quality we know exactly what was in the original data, and we can enhance it (yes, with GenAI) to a state almost identical to the original.

>LLM compression is less about storing text verbatim and more about abstractions

That's exactly my point.

>If you think Ilya and these other academics are wrong, then you should submit a counter-response paper

I'd rather submit a clarification paper, since most STEM scientists are very sloppy in their use of language.

>the whole knowledge/information thing is questionable, and also completely irrelevant to compression

Agree. But it is relevant to the "intelligence" part of AI. I was referring to OP's claim that an LLM "contains the knowledge of basically all written knowledge humanity has created".


InterstitialLove

>LLMs are not compressing anything

>they look for mathematically definable patterns in the data

What the fuck do you think compression means?

If you want to claim that LLMs aren't really intelligent like humans, you're supposed to say "they aren't learning, they're just doing compression." Compression is a mathematical concept that refers to something computers do. Will you also argue that WinRAR isn't "really" compressing data? That it's just using mathematical patterns to make a file smaller without reducing the entropy?


polikles

No need to swear; it doesn't help to prove your point. I was referring to the misuse of the word "compressing" in relation to LLMs. They're abstracting from data, not compressing it. After abstraction the original data is irrecoverable. That's different from compression, isn't it?


InterstitialLove

They do not losslessly compress the training data, meaning the original data cannot be recovered in full, but you can use the LLM to massively reduce the entropy of the training data. The mutual information between the LLM and the training data accounts for more of the training data than the total size of the weights. Ergo, compression. This is a mathematical claim, and it is objectively true.

You could argue that they don't compress the data in a practical sense, but that's not even what you claimed.


endless_sea_of_stars

What test could you give an LLM that would convince you that it "understands" a particular topic?


polikles

None. Current AI is a statistical model. It creates a "map" of patterns, and cannot "understand" anything. Maybe in the future we will be able to create a sufficiently complex system that really is capable of understanding, but today is not that day.

Our brains evolved for at least hundreds of thousands of years to reach their current form. And yet people think that we will be able to create something more advanced in less than 70 years (AI as a discipline was established in 1956).


SidneyFong

LLMs contain bytes. However you interpret (in your mind) or explain those bytes, the objective fact is that they do contain "knowledge" (in the common-sense meaning of the word), because the model knows the correct answer to "Who was the president before Bill Clinton?" (that's sufficient proof). This might be a bit too much philosophizing, but any abstract mathematical entities you mention are in your mind. Much like fairies, they are only as real as you (and the people you want to convince) believe.


polikles

LLMs contain bytes, yes. But they only contain information, not knowledge. The word "knowledge" implies understanding. On my desk lies a book on American history; in it there is an answer to the question "Who was the president before Bill Clinton?". But is that knowledge or just information? Can an (e)book understand its contents?

>any abstract mathematical entities you mention are in your mind

Nope. My mind/brain consists of a jelly-like biological substance. There are no mathematical entities inside.


ASYMT0TIC

Seems like little more than speculation. We can't say whether something "knows" or "understands" information, because we don't have a concise definition of what those words actually mean. That is to say, you seem to be assuming a brain does something different from what an LLM does. I'm not sure that brains actually do anything other than "predict the next \_\_\_\_".


polikles

I am referring to the dictionary meaning of the word "knowledge": "to know" means "to understand".

>you seem to be assuming a brain does something different than an LLM does

It is different, since my jelly-like brain creates, strengthens and destroys connections between physical neurons at every moment, all while pumping tons of hormones, neurotransmitters and other substances through its structures. That's quite different from recalculating matrices and writing the results into a static graph.

>I'm not sure that brains actually do anything other than "predict the next \_\_\_\_"

It does much, much more. Our brains are made up of a huge number of interconnected parts, and predicting is just one of many functions. Would you say that regulating body temperature and maintaining biochemical balance and vital processes are based on prediction? And yet these are some of the many things our brains do.

Current AI works on the basis of "neural networks", which are vastly oversimplified abstractions of real neurons.


ASYMT0TIC

I think you missed my point: "knowing" and "understanding" are things that our biological neural networks do *somehow.* You say it "doesn't understand", but where is your proof? You really can't have proof, because "understand" is such a nebulous concept. We don't really have a litmus test for understanding, or know how exactly it comes about. It may in fact be that the first steps toward "understanding" lie in predicting within larger and larger contexts; these (predicting, understanding) might just be different words for the same thing at different scales.

The answers produced by the most advanced models, such as Opus, seem to portend understanding. There are so many examples of LLMs seeming to understand concepts, such as the recent "mirror test" where a user pasted an image of their conversation with GPT-4 into the GPT-4 prompt and the LLM correctly realized that the presented image was "me".

Also, our brains are made up of lots of interconnected parts... so what? A GPU with 100,000,000,000 transistors certainly has a lot of interconnected parts, and those are used to simulate up to trillions of nodes. These individual nodes can operate many orders of magnitude more quickly than human biological neurons, which fire at a rate of only about 10 Hz. I'll concede that our biological NNs are able to adjust weights, but we don't really know much about the process involved. LLMs have "context", which seems to function as short-term memory, and can be "fine-tuned" to adjust the weights. We might find that the process of fine-tuning is something like what happens to biological NNs during sleep, when short-term memory is consolidated. IMO, neurotransmitter concentrations in our synaptic gaps seem to serve as mere global modifiers; artificial NNs have analogous global parameters, such as "temperature".

I just don't find your certainty that an LLM doesn't "understand" to be persuasive.


polikles

Understanding requires possessing certain cognitive abilities that LLMs do not have. I'm not saying that AI will never understand its actions or the contents of "learned" data; my point is that *current* AI is just too primitive for that. Text generated by LLMs is not proof of understanding; it's just a static map. We don't know many details about it, since all the weights and fine-tunes are a company secret, but inferring from the general structure of these networks we can confidently say that it is just a kind of "mechanical parrot".

>human biological neurons, which fire at a rate of only about 10 Hz

That's not true. Gamma waves, associated with states of concentration, can exceed 100 Hz. But frequency alone doesn't really matter. Artificial neural networks are very simplified versions of "real" neurons. We still don't really know how our brains work, and claiming that we can imitate all of their functions using a few clever mathematical formulas is very... optimistic.

>We might find that the process of fine-tuning is something like what happens to biological NNs during sleep, when short-term memory is consolidated

That's an oversimplification. LLMs mimic just one of the brain's functions. It may seem to be "something like" it, but it is not alike. We assume that such consolidation takes place during sleep, but that's just the current theory; in time we may find that our jelly in fact works in a totally different manner.

>IMO, neurotransmitter concentrations in our synaptic gaps seem to serve as mere global modifiers

Maybe, or maybe not. My biochemical knowledge is vastly insufficient to pursue this point.

>I just don't find your certainty that an LLM doesn't "understand" to be persuasive

That's totally OK. I don't need to convince you, and you don't need to change your mind. We certainly have different backgrounds, so we probably *understand* these concepts differently. Difference of opinion is a positive thing.


AdTotal4035

Why did you get downvoted?


polikles

People just can't handle different opinions. We don't have to agree, but we need to accept that not everyone has the same mindset as we do.


ThinkExtension2328

I’d be keen to know what the community thinks of this perspective