Yeah, I was looking for a book about this topic, but there doesn't seem to be anything out there except research articles; it's just too new.
When I look at the descriptions of the papers, it sounds incredibly complicated, and at the same time incredibly trivial. Clearly this whole domain is not well enough understood yet to be explained properly. Or maybe there is no interest yet in clear and concise explanations.
The description of the attention mechanism in GPT architectures, plus a couple of examples, can be very brief. Then you have to supply your imagination and realize that the model simply found a whole bunch of very effective attention measures by itself, all of which are computed for every query, and they could be anything from the straightforward examples to more clever, abstract kinds of attention. I think we'll need better descriptions of what the more important attention measures it comes up with really are before we understand it properly.
The pun is in "attention": GPT uses "attention" to weight each input token. It computes an attention score between whatever token is currently being generated and each of the input tokens, then uses those scores to determine how much each one contributes to the output. Something along those lines... I'm no GPT expert.
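To make that concrete, here's a rough toy sketch in Python/NumPy of the scaled dot-product attention step (one head, no masking, no learned projections, made-up dimensions), just to show what "score every input token against the current one, then take a weighted average" looks like; treat it as an illustration rather than how any real model is implemented:

    import numpy as np

    def attention(query, keys, values):
        # query:  (d,)   vector for the token currently being generated
        # keys:   (n, d) one key vector per input token
        # values: (n, d) one value vector per input token
        d = query.shape[-1]
        # Score the current token against every input token.
        scores = keys @ query / np.sqrt(d)        # (n,)
        # Softmax turns the scores into weights that sum to 1.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # (n,)
        # The output is just a weighted average of the input values.
        return weights @ values                   # (d,)

    # Toy usage: 4 input tokens, 8-dimensional embeddings.
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 8))
    print(attention(q, K, V))

In a real transformer the queries, keys, and values come from learned linear projections of the token embeddings, and there are many heads running in parallel, each of which can end up learning its own kind of "attention measure" like the ones described above.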
I worked in one of the big labs when the first large models came out, and I can pretty confidently say that nobody in the field predicted this. Sure, there were always people who said "let's make models bigger because why not, we have the infra and it'll be a good paper," but nobody expected them to become this good just by being bigger and using more data. The consensus was that they'd hit a ceiling on what they could do much sooner.
Only some model architectures continue to get better as you pump in more data. Transformers and their variants have this property more so than prior architectures.