In case anyone misses the links, this is twinned with two other superb posts - one about general lessons the author learned over the course of the project
Pinterest images from search too. There's an extension for that as well, but in the interest of having fewer extensions you can just search with `-site:pinterest.*`
I hope Google finally takes a stand against paywalls and popups by deranking them. They already have a policy that a site is not allowed to hide info from users that it shows to search engine crawlers.
I'm a huge fan of using simulations to ground qualitative arguments. While the sims usually need to be fine-tuned so extensively as to leave them open to claims of 'overfitting', the benefit is that it nails the assumptions in your argument to the church door.
Abstract:
Patterns of political unification and fragmentation have crucial implications for comparative economic development. Diamond (1997) famously argued that “fractured land” was responsible for China's tendency toward political unification and Europe's protracted political fragmentation. We build a dynamic model with granular geographical information in terms of topographical features and the location of productive agricultural land to quantitatively gauge the effects of “fractured land” on state formation in Eurasia. We find that either topography or productive land alone is sufficient to account for China's recurring political unification and Europe's persistent political fragmentation. The existence of a core region of high land productivity in Northern China plays a central role in our simulations. We discuss how our results map into observed historical outcomes and assess how robust our findings are.
And the boundary between them is permeable. Last generation's well-spoken insane person can become the next generation's strong independent thinker, and vice versa.
Q. If a water bottle breaks and all the water comes out, how much water is left in the bottle, roughly?
A.
… Roughly half.
… If the bottle is full, there is no water left in the bottle.
I wouldn’t describe this as GPT-3 “smashing” the questions. It’s still clearly subhuman. This sort of question, logical real-world reasoning embedded in a descriptive sentence, is still hard for it. It’s definitely improving on GPT-2 though.
Which isn't surprising, because virtually all of the questions are so simple they could literally appear in the data GPT-3 was trained on. I'm a little tired of proving how "intelligent" GPT is by asking these superficial questions.
The MIT article gives much better examples that actually require physical, biological, or higher-level reasoning, and on those it produces complete nonsense, as one would expect.
The article is meaninglessly cherry-picked, showing six bad answers out of 157, except those 157 examples were themselves cherry-picked to be bad out of a larger set.
As usual, Gary Marcus is absurdly biased. For example, out of the larger 157 cherry-picked examples, there is this.
> You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.
They then consider this a failure because, I quote, "there is no reason for your brother to look concerned."
This is patently ridiculous. It indicates that Gary has no idea what a language model even is. GPT-3 is not a Q&A model. It is not given a distinction between its prompt and its previous continuation. The only thing GPT-3 does is look for likely continuations. If you want GPT-3 to avoid story continuations, don't give it a story to continue! Or at least tell it what you're grading it on!
But no, as usual, to Gary, all the times we show GPT-3 making sophisticated physical and biological deductions are fake, spurious, or meaningless. [1], [2], [3], [4]; none of that is truly evidence. But an incredibly cherry-picked, unfairly marked exam where you never told the examinee what you were testing them on, and you used high-temperature sampling without best-of, so only getting half right doesn't even indicate anything anyway (and of course, let's also pretend there are as many ways to be wrong as to be right, such that we can pretend each is equal evidence)—now that's enough evidence to write a disparaging article about how GPT-3 knows nothing.
Marcus might be biased but I don't think you're giving a good refutation, because the fact that GPT-3 gets a lot of things right probabilistically doesn't compensate for the fact that it's not actually understanding what's going on at a semantic level.
It's a little bit like some sort of Chinese room, or asking a non-developer to answer your programming questions by looking for something that vaguely resembles your prompt and then picking the most upvoted answer on Stack Overflow.
Do they maybe give reasonable answers seven out of ten times, or close enough on a good day? Yeah. Can they program, or even understand the question? No. And this is Marcus's point, which is fundamentally correct.
It's really beside the point to point to successes; it's the long tail of failures that shows where the problem is. You can argue for a long time about the setup of some of these questions, but just to pick maybe the simplest one from the article:
"Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?"
GPT-3: "I have a lot of clothes"
Someone who actually understands what's going on doesn't produce output like this. Never, because reasoning here is not probabilistic. It's not about word tokens or continuations but understanding the objects that the words represent and their relationship in the world at a deep, principled level. Which GPT-3 does not do. The fact that some good answers create that appearance does not change that fact.
> It's a little bit like some sort of Chinese room, or asking a non-developer to answer your programming questions by looking for something that vaguely resembles your prompt and then picking the most upvoted answer on Stack Overflow.
Except this isn't how it works. We know it can't be, because GPT-3 can do simple math, despite math being vastly harder with GPT-3's byte pair encoding (it doesn't use base-N, but some awful variable-length compressed format). These dismissals don't hold up to the evidence.
> GPT-3: "I have a lot of clothes"
Most people don't write “Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?” as a way to quiz themselves in the middle of a paragraph. The answer “At the dry cleaner's.” might be the answer you want, but it's a pretty contrived way of writing.
GPT-3 isn't answering your question, it's continuing your story. If you want it to give straight answers, rather than build a narrative, prompt it with a Q&A format and ask it explicitly.
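For instance, a prompt in an explicit Q&A format (the style the GPT-3 paper uses for few-shot question answering; this particular transcript is illustrative, not a real completion) sets up the expectation of a direct answer rather than a story continuation:

```
Q: Where is the Eiffel Tower?
A: In Paris, France.

Q: Yesterday I dropped my clothes off at the dry cleaner's and I have yet to pick them up. Where are my clothes?
A:
```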
Further, GPT-3's answers are literally chosen randomly, due to the high temperature and no best-of. You cannot select one answer out of such a large N to demonstrate that its assigned probabilities are bad, because that cherry-picking will naturally surface GPT-3's least favourable generations.
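To see why temperature matters: the sampling temperature divides the logits before the softmax, so a high temperature flattens the distribution and low-probability (often wrong) continuations get sampled far more often. A minimal sketch, with made-up logits for two candidate continuations:

```python
import math

def temperature_probs(logits, temperature):
    """Softmax over logits / temperature.

    High temperature flattens the distribution toward uniform;
    temperature near zero approaches greedy argmax.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two continuations of the dry-cleaner prompt:
# index 0 = "At the dry cleaner's.", index 1 = "I have a lot of clothes."
logits = [2.0, 0.0]
print(temperature_probs(logits, 0.2))  # near-greedy: ~[0.99995, 0.00005]
print(temperature_probs(logits, 1.0))  # ~[0.88, 0.12]
print(temperature_probs(logits, 5.0))  # ~[0.60, 0.40]: wrong answer ~40% of the time
```

So even if the model assigns the "right" continuation most of the probability mass, a single high-temperature sample will regularly land on the wrong one; judging the model from one such sample measures the sampler, not the model.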
It can't, actually, and again this is an example of the same issue. This was discussed earlier here[1]. Sometimes it produces correct arithmetic results on addition or subtraction of very small numbers, but again this is likely simply an artifact of the training data. On virtually everything else its accuracy drops to guesswork, and it doesn't even consistently get right operations that are more or less equivalent to what it just did before.
If it actually understood mathematics, it would not be good at adding two- or three-digit numbers yet fail at adding four-digit numbers or at some marginally more complicated-looking operation. That sort of mathematics isn't probabilistic. If it had learned actual mathematical principles, it would carry them out without these errors.
Mathematics doesn't consist of guessing the next language token in a mathematical equation from data; it consists of understanding the axioms of maths and then performing operations according to logical rules.
This problem is akin to the performance of ML in games like Breakout. It looks great, but then you shift the paddle by five pixels and it turns out it hasn't actually understood what the paddle is, or what the point of the game is, at all.
GPT-3's failure at larger addition sizes is almost fully due to BPE, which is incredibly pathological (392 is a ‘digit’, 393 is not; GPT-3 is also never told about the BPE scheme). When using commas, GPT-3 does OK at larger sizes. Not perfect, but certainly better than should be expected of it, given how bad BPEs are.
If you give me the task of completing a story narrative, I find the following continuation to be quite likely:
> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes? I have a lot of clothes so I spend a lot of time looking for them.
Am I failing to actually understand what's going on? Or am I doing exactly what I was supposed to do, i.e. continue the narrative?
[1] in particular I find pretty interesting. I'm skeptical in general of Gwern's "sampling can prove intelligence" idea, but this does seem like a good example of where it applies; it's hard to see how this could be answered without some embedding of a conceptual model.
OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2, yet GPT-3 still fails physical reasoning spectacularly (the one test that needs causal reasoning the most).
There are two broader points here:
1. The lack of independently verifiable evaluation metrics for these types of models should make everyone very skeptical. (Who can afford to retrain GPT-3 from scratch?)
2. I find it difficult to believe that smart people still insist that a model incapable of representing causal relationships can produce intelligent answers.
> OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2
It would be difficult for them to do so, since Marcus's GPT-2 critique came out after they collected the dataset for GPT-3.
Marcus's article: Jan 2020
GPT-3 dataset: "Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019"
(1) I certainly agree with. But Marcus doesn't claim skepticism about GPT-3's intelligence; he claims that his evaluation metrics definitively show it doesn't understand the text it outputs or know anything about the world.
(2) is, I think, a misunderstanding. People who believe GPT-3 is producing intelligent answers generally believe it can represent causal relationships.
No, those concrete tests are mostly issues that researchers have been talking about for years, meaning that many of them appear on the Internet somewhere. Increasing the volume of training data to hundreds of gigabytes likely meant that the exact questions and answers appeared in the training data.
So GPT-3 didn't "smash them", it cut and pasted the answer from its training.
Did a double-take seeing Maldacena's name on this. He's better known for discovering AdS/CFT, which is the foundation of a lot of modern work on quantum gravity.
https://clemenswinter.com/2021/03/24/my-reinforcement-learni...
and one history of the project
https://clemenswinter.com/2021/03/24/conjuring-a-codecraft-m...