>Now try properly aligning with an "entity" that thinks nothing like you, has a vastly different set of experiences, doesn't know how to count, and has no real concept of physical mechanisms (and more).
The biggest issue with alignment is that humans don't really know what they mean or want in the first place.
Yet we train these networks to produce high quality images whose resolution necessarily exceeds that of the fundamentally ambiguous human input.
Training a model to produce a high quality image is different from training it to produce a high quality image of the specific thing you want. The former allows substantially more flexibility. It is ridiculous to compare the two.
As for humans, here's the constant reminder. There are 3 parts to language: 1) what you intend to convey (what's in your head), 2) what you actually say or write (encoding mental to physical), 3) what the other person understands (decoding physical to mental). These 3 things can carry 3 different meanings. Do your best on the first two, but the third requires the other person to be acting in good faith.