
Anyone know why text-to-image models have so many fewer parameters than text models? Are there any large image models (>70b, 400b, etc)?


The way someone explained it to me is that text-to-image models are essentially just denoisers.

They train them by taking a labeled image, e.g., "cat", adding some noise to it, running a training step, adding more noise, running another step, and so on until the image is total (or near-total) noise while the model is still being told it's a cat.

Then, when you want to generate "cat", you start with noise, and it finds a cat in the noise and cancels some of the noise repeatedly. If you're able to watch an image get generated, sometimes you'll even see two cats on top of each other, but one ends up fading away.
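The "add more noise, run another step" process above is the forward-diffusion schedule. Here's a toy NumPy sketch of it, just to make the idea concrete; the names (`betas`, `alpha_bar`) follow common DDPM conventions, and the 8x8 array is a hypothetical stand-in for a real image:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts, and the cumulative fraction of signal kept."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)  # ~1.0 early (mostly image), ~0.0 late (mostly noise)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample a noised version of x0 at step t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is what the denoiser is trained to predict back out

rng = np.random.default_rng(0)
alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))  # stand-in for a "cat" image

x_early, _ = add_noise(x0, 10, alpha_bar, rng)   # barely noised, still recognizable
x_late, _ = add_noise(x0, 999, alpha_bar, rng)   # essentially pure noise
```

Generation then runs this in reverse: start from pure noise like `x_late` and repeatedly ask the trained denoiser (conditioned on "cat") to strip a bit of noise away.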

Turns out, these denoisers don't require that many parameters, and if your resulting image has a few pixels that are just a tiny bit off color, you won't even notice.


Diffusion is a very efficient way to encode/decode images.

The main reason diffusion isn't standard for text is that text requires discrete outputs, while diffusion's denoising steps operate on continuous values.



Thank you for the explanation!


If a wurd is misspelt then you notis right away.

If a pixel is just slightly the wrong shade of green, nobody really cares.



