kostaj · 2026-05-28T15:24:04 1779981844

Awesome. We do plan to human-label the 1,000 claims and then compare Lenz' performance vs the 5 models. We've done some limited internal research with 150 claims, but more are needed for statistical significance.

kostaj · 2026-05-28T15:21:19 1779981679

Agree that some of the claims are forward-looking; that is the messiness of real-world, real-user fact checks. No ground-truth verdicts are provided or used in the study, though. It only measures the level of agreement between the selected models, not which one is right on which claim, i.e. none of the claims is actually labelled.

brokensegue · 2026-05-28T15:58:58 1779983938

were you involved in making the study? your bio says you work for them so you should probably indicate that in your comments.

lack of agreement when there is no singular correct answer (or any answer at all) isn't a useful metric

I ran into a lot of these kinds of issues when working on the Citation Needed WMF project (and related extensions). Truth is so often very nuanced.

simonw · 2026-05-28T16:16:38 1779984998

They introduced themselves as the study author here: https://news.ycombinator.com/item?id=48307887#48307899

brokensegue · 2026-05-28T16:25:58 1779985558

ah. I missed that.

kostaj · 2026-05-28T15:18:35 1779981515

Good idea about publishing intra-model variance data! Will include in the next version. Even if we put aside the two middle buckets (Mostly True and Misleading), which are somewhat subject to interpretation and hedging, at least two models still give polar-opposite verdicts on 21% of the claims (one model saying True and another saying False).

vlovich123 · 2026-05-28T15:25:04 1779981904

Of those 21%, how many are time-dependent questions that are past the model’s training cutoff and require research to verify? Like the “did Ukraine attack Russia in the past week” question?

kostaj · 2026-05-28T15:05:07 1779980707

This is in line with my observations and tests as well. It is also supported by the distribution of the verdicts across the four buckets: Gemini uses the middle buckets (Mostly True and Misleading) much less often, 6% combined for Gemini w/o search, while Opus uses them the most, 45% combined. Looks like Gemini is calibrated to be confident and Opus to be careful.
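For anyone who wants to reproduce this kind of per-model number on their own data, here is a minimal sketch of how the combined middle-bucket share could be computed. The model names and verdicts below are made up for illustration and are not the study's data or code.

```python
from typing import Dict, List

# The two middle buckets named in the thread.
MIDDLE_BUCKETS = {"Mostly True", "Misleading"}


def middle_bucket_share(verdicts_by_model: Dict[str, List[str]]) -> Dict[str, float]:
    """Return, per model, the fraction of verdicts that fall in a middle bucket."""
    shares = {}
    for model, verdicts in verdicts_by_model.items():
        if not verdicts:
            raise ValueError(f"no verdicts for {model}")
        middle = sum(1 for v in verdicts if v in MIDDLE_BUCKETS)
        shares[model] = middle / len(verdicts)
    return shares


# Hypothetical verdicts for two placeholder models on the same four claims.
sample = {
    "model_a": ["True", "False", "True", "False"],
    "model_b": ["Mostly True", "Misleading", "True", "False"],
}
shares = middle_bucket_share(sample)
print(shares)
```

A low share only says the model rarely picks a hedged label; it says nothing about whether its confident labels are correct.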

kostaj · 2026-05-28T15:00:57 1779980457

Indeed. For algorithms and coding, my personal routine nowadays is to review every detailed plan with Opus 4.7 and GPT-5.5. They tend to find very different types of gaps.

kostaj · 2026-05-28T14:57:30 1779980250

Agree that True and Mostly True might be very close and could be a calibration difference. Misleading and False, as well. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.

kostaj · 2026-05-28T14:54:22 1779980062

Agree. Human experts also struggle to agree on this type of claim. The inter-annotator agreement on the verdicts in the AVeriTeC corpus across 50 organizations is κ=0.619: substantial but well short of perfect.
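As a rough illustration of what a κ value measures, here is a sketch of Cohen's kappa for two annotators: observed agreement corrected for the agreement expected by chance. This is an assumption for illustration only; the κ=0.619 figure may well come from a different multi-rater statistic, and the labels below are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a, counts_b = Counter(a), Counter(b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels from two annotators on four claims.
annotator_1 = ["True", "True", "False", "False"]
annotator_2 = ["True", "False", "False", "False"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, so kappa comes out at 0.5.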

kostaj · 2026-05-28T14:50:34 1779979834

Agree with @pjdesno that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True, and at least one model False), which is still high for many real-world applications.

kostaj · 2026-05-28T14:47:52 1779979672

That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). Will show in a follow-up paper how the concise vs explicit prompt impacts the distribution of the verdicts and the level of disagreement. One issue to note with the longer prompts is that they open too much room for discussion around the exact prompt used. We should probably preregister the prompt before running any further tests.

MattRogish · 2026-05-28T15:05:19 1779980719

The other thing I suspect is that "Just give me True/False" cuts off a large amount of the search space a modern-day LLM uses to help it answer questions (you can see it in reasoning traces, but the act of writing the explanation helps guide it toward a better answer and gives it a better chance of backtracking on a bad decision).

If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).

kostaj · 2026-05-28T15:46:56 1779983216

Good point. The next version will also include the results with a prompt that allows the models to "think out loud" before providing the final verdict.

kostaj · 2026-05-28T14:30:25 1779978625

Quick note on the second effect - how LLMs reduce that to a four-category judgment: On 21% of the claims at least two models provide polar-opposite verdicts (at least one model False, and at least one model True). This might be a better measurement of the strict disagreement than the 67% disagreement on the four-bucket rubric.
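To make the difference between the two measures concrete, here is a minimal sketch that computes both from a table of verdicts. The function name and the sample verdicts are hypothetical and are not taken from the study; "any" counts claims where the models do not all give the same label, and "polar" counts claims where at least one model says True and at least one says False.

```python
from typing import Dict, List


def disagreement_rates(verdicts: List[List[str]]) -> Dict[str, float]:
    """Each inner list holds the verdicts of all models for one claim."""
    if not verdicts:
        raise ValueError("need at least one claim")
    n = len(verdicts)
    # Any disagreement: more than one distinct label among the models.
    any_disagreement = sum(1 for row in verdicts if len(set(row)) > 1)
    # Polar disagreement: both extreme labels appear for the same claim.
    polar = sum(1 for row in verdicts if "True" in row and "False" in row)
    return {"any": any_disagreement / n, "polar": polar / n}


# Hypothetical verdicts from five models on four claims.
sample = [
    ["True", "True", "True", "True", "True"],
    ["True", "Mostly True", "True", "True", "True"],
    ["True", "False", "Misleading", "True", "False"],
    ["False", "Misleading", "False", "False", "False"],
]
rates = disagreement_rates(sample)
print(rates)
```

Here three of the four claims show some disagreement, but only one is a polar split, which is why the stricter measure can be much lower than the four-bucket one.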