
Anyone know why text-to-image models have so many fewer parameters than text models? Are there any large image models (>70b, 400b, etc)?


The way someone explained it to me is that text-to-image models are essentially just denoisers.

They train them by taking a labeled image, e.g., "cat", adding some noise to it, running a training step, adding more noise, running another step, and so on until the image is total (or near-total) noise while the model is still being told it's a cat.

Then, when you want to generate "cat", you start with noise, and it finds a cat in the noise and cancels some of the noise repeatedly. If you're able to watch an image get generated, sometimes you'll even see two cats on top of each other, but one ends up fading away.
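The "add more noise, run another step" process above is the forward-diffusion schedule. Here's a toy NumPy sketch of it, just to make the idea concrete; the names (`betas`, `alpha_bar`) follow common DDPM conventions, and the 8x8 array is a hypothetical stand-in for a real image:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts, and the cumulative fraction of signal kept."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)  # ~1.0 early (mostly image), ~0.0 late (mostly noise)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample a noised version of x0 at step t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is what the denoiser is trained to predict back out

rng = np.random.default_rng(0)
alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))  # stand-in for a "cat" image

x_early, _ = add_noise(x0, 10, alpha_bar, rng)   # barely noised, still recognizable
x_late, _ = add_noise(x0, 999, alpha_bar, rng)   # essentially pure noise
```

Generation then runs this in reverse: start from pure noise like `x_late` and repeatedly ask the trained denoiser (conditioned on "cat") to strip a bit of noise away.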

Turns out, these denoisers don't require that many parameters, and if your resulting image has a few pixels that are just a tiny bit off color, you won't even notice.


Diffusion is a very efficient way to encode/decode images.

The main reason diffusion isn't standard for text is that text requires discrete outputs, while diffusion's denoising steps operate on continuous values.



Thank you for the explanation!


If a wurd is misspelt then you notis right away.

If a pixel is just slightly the wrong shade of green, nobody really cares.



