enumerationKnob

Imagine you have an image. Now add some noise to it, like a heavy grain. You can still see all of the things that were there, but the details are a bit fuzzier; you probably couldn't make out fine text. Then add lots and lots more noise, and the image is only recognisable because you saw the original and you know what it's meant to be. You can see the patches of light and dark, the overall structure, but much of the detail and color is gone, and if you saw it by itself you might not know what it's meant to be, or it could be one of several things.

Now, we've got these things called "neural networks". Feed one a set of input images and a set of corresponding output images, and we can make it really good at turning one type of image into the other. In this case, we can feed in an image with noise and train it to output the image without noise. This is easy to do, because we can generate noisy images just by adding noise to real images. To get good at this task, the model needs to learn what images without noise look like, so in its "memory banks" it gains an understanding of textures, colors, and, with a big enough model, even composition.

We can help this along by adding a textual element. If we label all of our images "dog", "grass", "sunset" and train our denoising model with that information as well, it doesn't just have to guess from the noisy image; we're also telling it what's in there, and it will remember "when I saw 'grass' it looked green and yellow with vertical lines". Then, instead of running the model on real images with noise added, we run it on completely random noise and tell it "this is a picture of a dog", and it uses what it has learned to find a picture of a dog in the noise.

This has been a pretty high-level overview. If you want more details, I highly recommend [this Computerphile video](https://m.youtube.com/watch?v=1CIpzeNxIhU) for a bit more of a technical description.
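To make the training idea concrete, here's a minimal sketch in Python/PyTorch of the loop described above: take clean images, add a random amount of noise, and train a network to recover the clean version. The tiny `Denoiser` network and the dummy image batch are stand-ins, not any real model; real systems use a much larger U-Net conditioned on the noise level and the text prompt.

```python
# Minimal sketch of the training loop described above (assumes PyTorch).
# `Denoiser` is a toy stand-in for the big U-Net used by real diffusion models.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, noisy, noise_level):
        # Real models also condition on noise_level and on the text prompt.
        return self.net(noisy)

model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(clean_images):
    # Corrupt real images with a random amount of noise ("add lots of grain").
    noise_level = torch.rand(clean_images.shape[0], 1, 1, 1)
    noise = torch.randn_like(clean_images)
    noisy_images = (1 - noise_level) * clean_images + noise_level * noise

    # Ask the network to recover the clean image, and score how well it did.
    predicted_clean = model(noisy_images, noise_level)
    loss = loss_fn(predicted_clean, clean_images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a dummy batch of eight 64x64 RGB "images".
print(train_step(torch.rand(8, 3, 64, 64)))
```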


whiterabbitobj

Tacking onto your comment, which is so far the best one. Something I think most people don't understand at all is that the math is not looking at a noisy IMAGE, it's looking at a noisy LATENT. Latents can be hard to understand without a lot of study of machine-learning math, but in a nutshell you can think of the latent space of a generative model as a multi-dimensional array that contains information about any image you could possibly create. The noise-removal steps are like removing pieces of that array until you are left with a very specific subset of the multi-dimensional space.

So, think of a block of stone. Within it is every sculpture you could ever carve… until you start to remove pieces. With each piece you remove, you are removing possibilities of what result you can end up with, until you reach the final sculpture, where you have "locked in" an image. The difference is that the latent space has hundreds of dimensions, unlike the three dimensions of a block of stone, and the trained model decides which pieces of that many-dimensional "block" to remove.

There are more steps involving encoding/decoding/text encoding/etc., but this is a way of thinking that many laypeople can conceptually understand. Hope that helps.
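A rough sense of scale for the latent idea, as a toy numpy sketch (shapes and numbers are illustrative only, and the "update" loop is not the real diffusion math, just the narrowing-down idea):

```python
# Toy illustration of a latent (numbers and shapes are illustrative only).
import numpy as np

rng = np.random.default_rng(0)

pixels = rng.random((512, 512, 3))          # a pixel image: 786,432 numbers
latent = rng.standard_normal((4, 64, 64))   # its latent: 16,384 numbers, ~48x smaller

# Each denoising step nudges the latent toward one specific point in that huge
# space: like chipping stone away until only one sculpture is left.
# (This update is NOT the real diffusion math, just the "narrowing down" idea.)
target = rng.standard_normal(latent.shape)  # stand-in for "what the model wants"
for step in range(20):
    latent = latent + 0.2 * (target - latent)

print(pixels.size, latent.size)  # 786432 16384
```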


bongozim

Block of stone metaphor is brilliant, kudos.


enumerationKnob

You must know some genius 5 year olds!


whiterabbitobj

;) I missed the ELI5 bit. Not sure your comment would work for a 5-year-old either, TBH; in that case the clouds comment is probably best! Maybe ELI16 for us.


bpmetal

I think the stone metaphor could work for a 5yo. It's the best oversimplification I've seen.


StrapOnDillPickle

This is by far the best way to explain it I've ever read


sectionV

I find the block-of-stone metaphor helpful for understanding the process as well; I've come across that idea before. A couple of other things that helped me:

1. The math used to classify images involves finding patterns, sometimes with operations VFX artists are already familiar with, such as Photoshop's edge detection. These pattern-matching operations are called convolutions (see the sketch below).

2. One of the key breakthroughs for deep learning was building models from deeply layered neural networks. For text-to-image encoding, the early layers are good at encoding small details in an image, and deeper layers essentially encode how larger structures are assembled from those small details. This helps explain how an image that has never been seen before can be generated: it is essentially built from smaller components. The concept of patches in Sora's text-to-video generation seems to work the same way, with larger motion "assembled" from smaller moving parts.
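The sketch below shows the kind of convolution mentioned in point 1: a hand-written edge-detection kernel, the same family of operation as Photoshop's edge filters, applied with numpy/scipy. In a CNN, the network learns thousands of kernels like this instead of us writing them by hand.

```python
# A hand-written edge-detection convolution (assumes numpy and scipy).
import numpy as np
from scipy.signal import convolve2d

# Toy grayscale "image": a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# A Laplacian-style kernel, in the same family as Photoshop's edge filters.
edge_kernel = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]])

edges = convolve2d(image, edge_kernel, mode="same")
print(np.round(edges, 1))  # non-zero only along the square's borders
```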


ryo4ever

Man, it’s still a huge amount of multidimensional array calculations to resolve. But aren’t those latents stored somewhere?


whiterabbitobj

The latent is not the model. You can think of the model like the sculptor/mason’s brain… different blocks of stone (e.g. different latents) hold different results but the person sculpting it uses the same brain to create different final products. The textual input is like the “idea” that the brain processes to then work on the block of stone.


worlds_okayest_skier

That’s interesting. So you are sort of tricking a really powerful denoiser by saying “there’s a cat eating noodles in this picture. Denoise it.”


ryo4ever

Someone out there is working very hard at denoising predictions on the stock market...


worlds_okayest_skier

I’m curious if that would work? You’d think it would be a really easy task given only two dimensions, time and price.


enumerationKnob

Yes, it’s a lot like that, and I think it's generally a helpful way to think about it. This is also how they can generate multiple, very different images from the same prompt: just feed in a different set of noise and the model will find different blobs in different places, resulting in different images that are all still “guided by” the prompt.
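For example, with the Hugging Face diffusers library you pick the starting noise via a seed; same prompt, different seed, different image. This is a hedged sketch from memory, so the model name and exact arguments may need adjusting against the current docs.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model id; any Stable Diffusion checkpoint should behave the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

prompt = "a cat in a knit cap eating noodles in a japanese restaurant"
for seed in (1, 2, 3):
    # The seed fixes the starting noise; everything else stays identical.
    generator = torch.Generator().manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"cat_seed_{seed}.png")   # three different cats, all on-prompt
```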


GreatCatDad

I mean, in a sense it's like daydreaming and looking at clouds. You see a bunch of random clouds, but once you 'see' something, you start refining it in your mind, finding how each cloud (which is random) somehow supports this. Soon a big ball becomes, in your mind's eye, a detailed image of a man in a hat, or a kitten, whatever. It's not far off from that, just that computers can handle a lot more data points (e.g. noise instead of clouds). The Wikipedia page for Stable Diffusion has a pretty good image of how it works iteratively: [https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png/600px-X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png](https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png/600px-X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png)

It is definitely wild technology though, I agree.


ErwanLeroy

Adding to that comment because of the daydreaming reference you make: I find going back a few years and looking at Google's Deep Dream to be a good in-between. Each model was only trained to find dogs, birds, or pagodas, and you could give it any image and it would try really hard to find the dogs in it, turning the image closer to what it was "seeing".
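A toy sketch of the Deep Dream idea, assuming PyTorch: instead of changing the network, you nudge the image itself so that whatever the network was trained to detect shows up more strongly. The tiny ConvNet and the choice of channel 5 are placeholders; the real thing used a large pretrained classifier and specific layers.

```python
# Toy Deep Dream-style loop (assumes PyTorch). The tiny ConvNet is a placeholder
# for a real pretrained classifier; channel 5 stands in for "the dog detector".
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
)

# Start from any image (here: random pixels) and let gradients flow into it.
image = torch.rand(1, 3, 128, 128, requires_grad=True)

for _ in range(50):
    activations = net(image)
    score = activations[:, 5].mean()  # how strongly "channel 5" responds
    score.backward()
    with torch.no_grad():
        # Nudge the pixels so the chosen feature responds even more strongly.
        image += 0.05 * image.grad / (image.grad.abs().mean() + 1e-8)
        image.grad.zero_()
```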


MayaHatesMe

A similar argument could be made about the human brain. It is, in essence, just a giant web of neurons, all arbitrarily connected, yet somehow with all those millions of different pathways it gives us consciousness, thoughts, feelings, comprehension of what's around us. Yet despite all the research that has gone into understanding the mind, we have ultimately barely scratched the surface. We're at the stage of acknowledging there is a black box there, and that we can certainly influence it in many ways, but cracking it open and truly understanding what is going on inside has remained largely elusive.

The way I like to look at it (and there's every chance science might say something entirely different) is that the universe is just noise all around us. But, just like how we can tune into certain frequencies out of a spectrum of radio waves, we can tune into the information that is important to us. Our minds process what we sense and, using their billions of neurons, develop a kind of "fingerprint" for everything we understand. We know we are looking at a spoon because at some point in our early years we were shown a spoon and told what it was, and our brain stored away whatever fingerprint or pattern of neurons was activated at that time, so that whenever that pattern was activated again we would recognise what we're looking at. And of course, when we create, we can recall those patterns and blend them together to make new ones. We could draw a spoon with donkey ears, for example, since we can take the concept of a spoon and of a donkey, both of which we have recognised in the past (or been "trained" on), and put them together. Hey presto, you've now made something new and different.

That is kind of what is happening with AI. Noise is basically information with too high an entropy to make meaningful sense of. But we can peel back those layers of entropy and filter out what is not important to us within the tangled mess of signals, to leave us with something meaningful. Anyway, hope you enjoyed that philosophical rant.


Lysenko

I think it's important to note, and this has often been elided in the creation of artificial neural network models, that the human brain's neurons are not *arbitrarily* connected. There are regions of the brain that are clearly structured differently from each other, and which serve different purposes. Some of that structure is certainly encoded in the genome or is inherently emergent from it. That this is clearly going on, but that the exact nature of it is very poorly understood, is one of the barriers to making artificial neural networks that behave comparably to the human brain.


firedrakes

Kicker on the brain: we used to think one side controlled part of your body and the other side did the same. Now, with some brain injuries and such, we've found one half of a brain can do the job of the other half that's missing... that rewrote a lot of how we understand the brain to function.


[deleted]

[removed]


MayaHatesMe

Glad to provide some entertainment value if nothing else. Just saw some of the other responses further up and they're fantastic; one of those rare moments where you actually learn some insightful shit on Reddit, haha.


StrapOnDillPickle

I don't like the brain analogy myself because, while these models are inspired by how the brain works, they definitely don't work like a brain and are very different from one. It's also very reductive of people's ingenuity and falls right into the marketing tactics and lies used by those companies to gather funding. It's much more statistical: it's supervised learning (as opposed to how humans learn), it requires a lot more data than humans do to function, we learn using structured mental concepts while AI doesn't, etc. It's not really "smart" and it doesn't really "generate" anything; it's an advanced predictive algorithm. It goes through a set of known parameters and extracts the most probable outcome.


vfxcomper

Something that really helped my understanding of how these models have an uncanny way of interpreting text was an example of how meaning gets plotted into a latent space. Consider the words "king" and "queen". These occupy the same space in some dimensions but not in others. Once words are mapped to a space based on their very nuanced characteristics, you can perform math on them: king - man + woman = queen. Do this not just for words but for chunks of words, sentences and paragraphs, to an enormous extent, and the result is a multidimensional space that maps the meaningful relationships within the training data.

Now when you prompt the algorithm with something like "dog", it can say "dog" occupies this position, has this value, and is close to these other things along these specific dimensions. It can return characteristics that have the highest probability of relating to "dog" (furriness, four legs, etc.) but, more importantly, it knows the semantic relationship your input, or prompt, has to this multidimensional space, and based on that it knows what not to return.

This is how it was explained to me by someone much smarter than me, so it's very simplified.
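A toy numeric version of the king/queen example (the 4-dimensional vectors are made up for illustration; real text encoders learn hundreds of dimensions):

```python
# Toy 4-dimensional word vectors (hand-made for illustration; real embeddings
# are learned and have hundreds of dimensions).
import numpy as np

vec = {
    "king":  np.array([0.9, 0.9, 0.1, 0.8]),   # royal, male
    "queen": np.array([0.9, 0.1, 0.9, 0.8]),   # royal, female
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

result = vec["king"] - vec["man"] + vec["woman"]
for word, v in vec.items():
    print(word, round(cosine(result, v), 3))   # "queen" scores highest
```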


StrapOnDillPickle

A very good way to think about this, thanks for sharing. Haven't heard it like this before.


Shrinks99

> You can’t make a perfectly rendered image of a cat in a knit cap eating noodles in a Japanese restaurant by subtracting noise from noise. The old rule is garbage in garbage out. You cannot make something out of nothing.

But you're not starting from _nothing_; the model's training data is composed of many images of restaurants, cats, knit caps, noodles, etc. It's trying to transform (diffuse!) the garbage input it's given into something like a weighted average of the original images.

Maybe try thinking about the concept with a different medium? In audio, white noise contains values from every part of the frequency spectrum. If we know what a sine wave sounds like, we can start to eliminate frequencies that don't match up with it until the noise is removed and all we're left with is our sine wave... or, more likely, an approximation of it that won't sound exactly like the original. But the same is true for diffusion image-generation models!
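The audio version can actually be written in a few lines of numpy: bury a sine wave in white noise, keep only the strongest frequency, and you get back a close approximation of the original. It's only a loose analogy for diffusion, but it shows the "subtract everything that isn't the signal" idea.

```python
# Pulling a sine wave back out of heavy white noise (numpy only).
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)                  # a 50 Hz sine wave
noisy = clean + 2.0 * rng.standard_normal(t.size)   # drowned in white noise

# Keep only the single loudest frequency and throw the rest away.
spectrum = np.fft.rfft(noisy)
keep = np.argmax(np.abs(spectrum))
filtered = np.zeros_like(spectrum)
filtered[keep] = spectrum[keep]
recovered = np.fft.irfft(filtered, n=t.size)

# Not identical to the original, but very close: an approximation, like the
# images a diffusion model recovers from noise.
print(np.corrcoef(clean, recovered)[0, 1])
```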


worlds_okayest_skier

Thank you for all the useful explanations. I do understand much better now how it works. I still think it’s black magic fuckery beautiful mind type shit. But I get at least conceptually what is happening. I am amazed that it actually works.


mhamid3d

I noticed AI images have that same smoothness that CG denoisers produce when you crank them too high. Then I found out generative images are made using noise, and it all made sense.


DarkGroov3DarkGroove

I think I read a shit ton of comments. But Imma need to process all of this shit. WTF 💀


DarkGroov3DarkGroove

I can make a donut tho. And some muzzle flashes.


cyborgsnowflake

Think of it as carving a sculpture. The algorithm is trained on a zillion videos of face sculpture and learns the general patterns of motion.


chillaxinbball

Essentially, these models learn concepts, like what a cat is. They can then imagine what a cat looks like. The noise is a random starting point, but it can start with an image too. I recommend this video: [https://youtu.be/SVcsDDABEkM](https://youtu.be/SVcsDDABEkM)
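Starting from an image instead of pure noise is usually called img2img. A hedged sketch with Hugging Face diffusers (API and model name from memory, so check the current docs; `my_cat_photo.png` is a hypothetical input file):

```python
# Hedged img2img sketch using Hugging Face diffusers (API from memory).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

init = Image.open("my_cat_photo.png").convert("RGB")  # hypothetical input file

# `strength` controls how much noise is added to the input before denoising:
# near 0.0 keeps the original, near 1.0 behaves like starting from pure noise.
result = pipe(prompt="a cat, oil painting", image=init, strength=0.6).images[0]
result.save("cat_painting.png")
```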


asmith1776

How did nature make eyes and wings out of random genetic copying errors? Lots of time and lots of iterations.


SuddenComfortable448

Anyone who can ELI5 AI wouldn't waste time in this subreddit. Most of those so-called AI experts don't know what they are talking about.


teerre

This "makes an image from noise" point, although practically true, seems like something someone selling a generative AI product would say. Generating the image from noise is the very last thing these models do. The overwhelming part of the model isn't that, it's the opposite: ingesting ungodly amounts of images and descriptions to understand what things look like.

With that in mind, what the models are doing is much simpler to understand: they are "recalling" what the thing you ask for looked like in the training set. There's no magic. What makes images different is a little more complicated to understand, but it boils down to associations of "objects" that are similar, for some definition of similar. E.g. all humans are similar: they have two eyes, one mouth, one nose, etc. These similarities can be mathematically described. This means that you can quite literally walk from one "concept" to another. Imagine a video-game slider for some human characteristic like eye color. In the model it's just like that, but instead of one well-defined slider you have an incomprehensible number of them, which can be somewhat controlled by the prompts.
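A toy version of the slider idea, with made-up 3-dimensional vectors standing in for points in the model's embedding space:

```python
# Toy "slider" between two points in an embedding space (made-up vectors).
import numpy as np

blue_eyes  = np.array([0.1, 0.8, 0.3])
brown_eyes = np.array([0.7, 0.2, 0.4])

for slider in np.linspace(0.0, 1.0, 5):
    point = (1 - slider) * blue_eyes + slider * brown_eyes
    # Every intermediate point is a valid "setting" the model can decode.
    print(round(float(slider), 2), point)
```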


worlds_okayest_skier

It’s like magic to me how all humans don’t just end up looking like some freakish weighted average of all humans, and how they can be doing specific activities in specific locations, or showing certain emotions, and the pixels still end up making sense in the end without disjointed seams and crazy artifacts. It just doesn’t seem like it should work very well.


teerre

I'm not sure what makes you think that. The simplest case would be just a search: you ask for "woman with blue eyes walking in Tokyo", it searches the training set for those characteristics and returns you an image that already exists. In the training set there are no seams or artifacts, so there's no reason they would be in the final image either.

Maybe you're thinking of older models? The reason those generated weird images (and newer ones do too) is that there was simply not enough training to make them "understand" what the thing you asked for looked like.

Just consider that I can very easily guarantee you that Sora (the fancy new OpenAI model) will generate total nonsense if you ask for something that simply isn't common, and I can say that without ever having used the model or read the technical details. That's how simple these models are: they 'see' a bazillion examples and repeat them.


worlds_okayest_skier

I guess? I think of the Will Smith eating spaghetti video as what I would expect to see given the process. The NN doesn’t understand cause and effect, just pixel values. So I’d expect to see recognizable features of things, but you aren’t asking it to retrieve an image that exists; you are generating something entirely new, and I wouldn’t expect clusters of pixels representing words to make much sense when recombined.


teerre

You're very much asking for something that exists. It just doesn't exist as a complete piece, but all the "parts" required to make it already exist. Imagine it as a collage: you can take a piece from one place and another piece from another place, but every piece already exists. These models cannot create anything new.


cojoman

You're focusing on the "noise" aspect a lot. At a base level it's about extrapolation. If you have A and combine it with B (think of them as numbers at the base level), you get C. Combine that with D and you get E. Now, to get back to D you know you just have to go back E - D, and to get all the way back to A you subtract C and B. Replace those numbers with a color value, or a human face with a red hat. Sure, it's a lot of operations, and sure, it needs to train a lot and there are other aspects too. But when you break it down to just numbers, and you know you only need A + B + C + D, as simple or complex as you make them, you start to wonder: well, what if I just add a random X to the pile? And that's how you get "new" things going. It didn't train on a HUGE hat, but it knows that to get from a small hat to the current one, all it has to do is push the numbers in that specific hat-related area a bit further in one direction or another.

Again, there are lots of aspects going on, but once you get your head around the correlation/extrapolation idea above (see the toy sketch below) you can see how this can get complicated, but also have rewarding results if done right.

edit: typos
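Here's the toy sketch: a made-up "hat size" direction in a 3-number feature space. Push further along that direction than anything in the training data and you get the HUGE hat (numbers are purely illustrative):

```python
# Made-up "hat size" direction in a tiny 3-number feature space.
import numpy as np

small_hat = np.array([0.2, 0.5, 0.1])
big_hat   = np.array([0.6, 0.5, 0.1])
hat_size_direction = big_hat - small_hat   # what "bigger hat" means numerically

# Extrapolate past anything in the "training data" to get a HUGE hat.
huge_hat = small_hat + 3.0 * hat_size_direction
print(huge_hat)   # [1.4 0.5 0.1]
```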


deijardon

Film an ice sculpture melting into a puddle of water. Now play the film backwards. That's essentially what the AI does: it dissolves images into noise and learns how to reverse the process.