
“We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.”

How is this training performed? How is accuracy rated?



Cosine similarity is a fixed way of comparing two vectors: it's the cosine of the angle between them, 1 when they point the same way and 0 when they're orthogonal. It's common to flip it into a distance, d = 1 - similarity:

If d is close to 0, we say that both embeddings are similar.

If d is close to 1, we say that both embeddings are different.
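A minimal sketch of that comparison with NumPy (the function name is mine, not from any particular library):

```python
import numpy as np

def cosine_distance(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # distance d = 1 - similarity: ~0 for aligned vectors, 1 for orthogonal ones
    return 1.0 - sim

a = np.array([1.0, 2.0, 3.0])
print(cosine_distance(a, a))       # aligned with itself: d ~ 0
print(cosine_distance(a, 2 * a))   # scaling doesn't change the angle: still d ~ 0
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal: d = 1
```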

Imagine we have the following data:

- Image A and its description A

- Image B and its description B

We would generate the following dataset:

- Image A & Description A. Expected label: 0

- Image B & Description B. Expected label: 0

- Image A & Description B. Expected label: 1

- Image B & Description A. Expected label: 1

The mixture of Image Y with Description Z, with Y != Z, is what we call "negative sampling".
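Building that dataset is just the cross product of images and descriptions, labelling the mismatches as negatives (a toy sketch; the string stand-ins are mine):

```python
from itertools import product

# Toy stand-ins for the actual image/text data (hypothetical names)
images = {"A": "image_A", "B": "image_B"}
descriptions = {"A": "desc_A", "B": "desc_B"}

# Matching pairs get label 0 (similar); mismatched pairs
# ("negative samples") get label 1 (different)
dataset = [
    (img, desc, 0 if i == j else 1)
    for (i, img), (j, desc) in product(images.items(), descriptions.items())
]
for row in dataset:
    print(row)
```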

If the model predicts 1 but the expected value was 0 (or the other way around), it's a miss, so the model is "penalized" and has to adjust its weights; if the prediction matches the expectation, it's a success and the model is left as-is.
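In practice the "penalty" is a loss value rather than a binary hit/miss: a toy way to score a prediction against the 0/1 label is binary cross-entropy, which is small when they agree and large when they don't (purely illustrative, the function name is mine):

```python
import numpy as np

def bce(prediction, label):
    # binary cross-entropy: near 0 when the prediction matches the 0/1 label,
    # large (the "penalty" driving the weight update) when it doesn't
    eps = 1e-7
    p = np.clip(prediction, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

print(bce(0.1, 0))  # matching pair predicted similar: small loss
print(bce(0.9, 0))  # matching pair predicted different: large loss
```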

I hope this clears it up


I'd be curious to see an example gallery of image generation of the same vector scaled to different magnitudes. That is, 100% cosine similarity, but still hitting different points of the embedding space.

The outcome vectors aren't normalized right? So there could be a hefty amount of difference in this space? Maybe not on concept, but perhaps on image quality?
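The premise is easy to check numerically: any positive rescaling of a vector keeps cosine similarity at exactly 1, so magnitude is a genuinely separate degree of freedom (a generic sketch, not tied to any particular model):

```python
import numpy as np

v = np.array([0.3, -1.2, 0.5])
for scale in (0.1, 1.0, 10.0):
    w = scale * v
    # cosine similarity only sees direction, so scaling leaves it at 1.0
    sim = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    print(scale, sim)
```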


> The outcome vectors aren't normalized right?

I'm not sure about that; maybe they are, it wouldn't be strange

> So there could be a hefty amount of difference in this space? Maybe not on concept, but perhaps on image quality?

Sure, each text could have more than one image matching the same representation (cosine-wise), but maybe the changes wouldn't look like "concepts" in the image so much as other features (sharpness, light, noise, actual pixel values, etc.)

It would be interesting to check, definitely


Thanks very much! That helped me understand the concept better.


From the paper on the CLIP embedder, it appears that they use a form of contrastive loss that maximizes the cosine similarity between related images and prompts while minimizing it between unrelated prompts & images.

See section 2.3 of the CLIP paper: https://arxiv.org/pdf/2103.00020.pdf

Also, the writeup on OpenAI's blog: https://openai.com/blog/clip/
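Section 2.3 describes this as a symmetric cross-entropy over a batch of cosine-similarity logits, where the matching image–text pairs sit on the diagonal. Roughly, following the spirit of the paper's pseudocode (a NumPy sketch; function names and the temperature value are my assumptions):

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize the embeddings so the dot products are cosine similarities
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = similarity between image i and text j, scaled by temperature
    logits = I @ T.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)  # matching pairs are on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the correct class for each row
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()

    # symmetric: image->text over rows, text->image over columns
    return (xent(logits) + xent(logits.T)) / 2
```

So related pairs get pulled together and every other pair in the batch acts as a negative, which matches the "maximize related / minimize unrelated" description above.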



