AI Overviews: About last week (blog.google)
81 points by gnabgib on May 31, 2024 | 61 comments


I'm usually cynical about these kinds of posts, but Google seems fairly honest here. It's a reasonable take on what went wrong and what they're working on to fix it. I still don't particularly like the AI Overview, though.


The summarizer model has access to their core web page ranker. Since we saw Reddit and Onion posts in the summarizer's result set, I'd argue that the goals of the core web page ranker don't align well with the goals of the summarizer.


> but Google seems to be pretty honest here.

They're not.

Take "This means that AI Overviews generally don't “hallucinate” or make things up in the ways that other LLM products might."

This is just a lie, in two ways:

1. It's still an LLM. Feeding search results into the prompt and asking for a summary (roughly the pattern sketched at the end of this comment) reduces the probability of hallucination compared to asking the question directly in the prompt, but it's still a probability non-negligibly above zero. This kind of pretending you can "fix" the hallucination problem by feeding data into the prompt is extremely wrong and dangerous.

2. It's missing the entire point. "Um akshually the LLM didn't hallucinate, we simply gave it Reddit user Fucksmith's post which it confidently restated as truth" (or indeed, the much funnier, "we gave it The Onion, which it confidently restated as truth") is functionally the same as a hallucination to the end user. The mechanism here does not matter.

These are both critical, fundamental issues with AI Overviews. "Rare" hallucinations are unacceptable in a tool used billions of times a day. And the LLM paraphrasing lies/satire/ignorance/etc. into truth is a fundamental flaw of using LLMs to paraphrase like this.

You can't just patch this issue; the entire thing is fundamentally flawed.
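
For concreteness, a minimal sketch of the "feed search results into the prompt" pattern described in point 1; `search` and `llm` are hypothetical stand-ins, not anything Google has published:

    # Minimal sketch of retrieval-augmented summarization.
    # `search` and `llm` are hypothetical stand-ins for a ranked web search
    # and an LLM call; nothing here reflects Google's actual implementation.
    def ai_overview(query: str, search, llm) -> str:
        snippets = search(query, top_k=5)  # ranked results, satire included
        context = "\n\n".join(s.text for s in snippets)
        prompt = (
            "Answer the question using ONLY the sources below.\n"
            f"Question: {query}\n\nSources:\n{context}"
        )
        # The model can still hallucinate, and it will faithfully restate
        # whatever the sources say -- Reddit shitposts and The Onion included.
        return llm(prompt)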


> This is just a lie [...] it's still a probability non-negligibly above zero. This kind of pretending you can "fix" the hallucination problem [...]

I don't think "generally don't hallucinate" and "it's usually for other reasons" are implying a total fix. From what I've seen, it is true that these errors stemmed from uncritically accepting satire/unreliable sources rather than hallucinations.

> It's missing the entire point. "Um akshually the LLM didn't hallucinate, we simply gave it Reddit user Fucksmith's post which it confidently restated as truth" (or indeed, the much funnier, "we gave it The Onion, which it confidently restated as truth") is functionally the same as a hallucination to the end user. The mechanism here does not matter.

When you can check the provided source it's summarising and see that it's a satire website, I think that's materially different than ChatGPT's hallucinations from thin air.

It is still an issue, and if the post were just deflecting by saying "it's not typical hallucination, so it doesn't matter," I'd agree that would be insufficient. But it seems totally fine to clarify that it's a different issue and go on to describe how they're working on fixing it.

> "Rare" hallucinations are unacceptable on a tool used billions of times a day

I don't think 100.0% accuracy is a reasonable bar for anything, including humans. My benchmark would be roughly "people come away with the correct information at least as often as they did before, with features like highlighted snippets".


> When you can check the provided source it's summarising and see that it's a satire website, I think that's materially different than ChatGPT's hallucinations from thin air.

I disagree.

The moment you accept the need to always check the source, there's no fucking point to this anymore.

If you have to check the source ... just provide the links like Google Search does. There's no point in a summary whose full original text you always have to read.

> From what I've seen, it is true that these errors stemmed from uncritically accepting satire/unreliable sources rather than hallucinations.

And so, this is why I don't think this holds up: if we consider this a failure mode of the user, the entire tool is pointless.

It's plainly obvious that the intended benefit of AI Overview is not having to read the pages of the search result. And with that intent, this is not a failure of the user.


> The moment you accept the need to always check the source, there's no fucking point to this anymore.

> If you have to check the source ... just provide the links like Google Search does. There's no point in a summary whose full original text you always have to read.

> [...] It's plainly obvious that the intended benefit of AI Overview is not having to read the pages of the search result.

I see its purpose as similar to the text snippets and "highlight" box that Google search has had for a while: quickly picking out information relevant to your query.

I don't think most answers actually require the user to verify that they came from a reliable source. If you ask "what foods end with um" or "good ideas for birthday party", you can generally just judge the answers yourself, like you would even if you already knew the answer came from Quora/Reddit.

But for answers that do (and there are still plenty), it makes it easier to click through and check the part of the summary you determined to be relevant, as opposed to manually parsing through pages to find that part in the first place.

> And so, this is why I don't think this holds up: if we consider this a failure mode of the user, the entire tool is pointless.

I don't consider it solely a failure on the user's part - the tool is failing here and Google claim to be making changes to improve it. I don't think lack of 100% accuracy makes automatic summaries pointless - they are still useful for surfacing information.


Excuse me, lack of information is a misdirect. They have a data warehouse of information that knows where this article is coming from, and every fucking URL ever put into Chrome. They bought the rights to /r/the_onion.

It’s an Onion article. Directly from the page.

Are they saying “the mighty google search” + “AI” is going to be somehow worse because they don’t have enough data about the fucking Onion article with dozens of hits in 2021?

Even a 3.8B parameter edge model (phi3-mini) has enough satire tokens clustered around the first two paragraphs to flag it accordingly.

https://cdn.some.pics/snekoil/66594bca93e19.jpg

as does llama3-7B

https://cdn.some.pics/snekoil/66594cb8cddcd.jpg

as does GPT4

https://cdn.some.pics/snekoil/66594e1a3de29.jpg

as does Claude

https://cdn.some.pics/snekoil/665955fb22618.jpg
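
For anyone who wants to reproduce this locally, a minimal sketch of the kind of check in those screenshots, assuming the `ollama` Python package and a pulled phi3 model; the excerpt and prompt wording are illustrative, not what the screenshots used:

    # Ask a small local model whether a passage reads as satire.
    # Assumes `pip install ollama` and `ollama pull phi3`; the excerpt
    # paraphrases the Onion piece discussed in this thread.
    import ollama

    EXCERPT = (
        "Geologists at UC Berkeley now recommend eating at least one small "
        "rock per day, citing the minerals and vitamins found in sediment."
    )

    response = ollama.chat(
        model="phi3",
        messages=[{
            "role": "user",
            "content": "Is the following passage likely satire? "
                       "Answer yes or no, then explain briefly.\n\n" + EXCERPT,
        }],
    )
    print(response["message"]["content"])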

Instead they gaslight us with words like “faithful” and “nobody asked the question”. Their entire competition executes better.

Just because they AI-wash the search data to boost their stock narrative doesn't mean we have to accept their lame limitations narrative. They chose this fight.

Also see my longer commentary at the root, which is getting Google-brigaded like many other critical comments in this thread.

Also, this is Zuck-level malevolent naïveté; it’s not like they don’t have experience: https://www.theverge.com/2024/2/21/24079371/google-ai-gemini...


I think the difference between our takes is our understanding of how the "AI Overview" was trained. Google said,

> AI Overviews work very differently than chatbots and other LLM products that people may have tried out

and later,

> Let’s take a look at an example: “How many rocks should I eat?” Prior to these screenshots going viral, practically no one asked Google that question.

It seems it's being trained based on search queries. This is important for protecting against confabulation, but obviously AI will fail on out-of-distribution data, such as search queries that practically no one has asked before. Now, they could ask an LLM if the AI Overview seems reasonable, but the entire point of their alternate training scheme is to guarantee the output can be backed with sources. Why introduce a potential failure mechanism?

You say, "they gaslight us with words like 'faithful' and 'nobody asked the question,'" but given this is how their model works, it seems like the best, non-gaslighty defense. Plus, it's pretty unreasonable to expect Google to fix every potential bug before production. As they said,

> there’s nothing quite like having millions of people using the feature with many novel searches. We’ve also seen nonsensical new searches, seemingly aimed at producing erroneous results.

Having "a content policy violation on less than one in every 7 million unique queries" seems pretty solid to me. I'm sure it'd be much higher if they were using an LLM like Claude.

EDIT: My argument is pretty much, it looks like they designed this to be helpful and accurate, so the errors seem like honest mistakes from imperfect execution rather than deception or politicking. Their post seems pretty open about acknowledging what went wrong and that they want to do better.


> Having "a content policy violation on less than one in every 7 million unique queries" seems pretty solid to me.

Not to me. AI will lead to more queries because I search differently with AI results than an index/page rank result.

Also, it’s not about the unique queries, but the number of users running them.

This seems like “data waving”, where Google gives an irrelevant number hoping to distract from an issue they can’t or don’t want to solve. If eating glue is just one of seven million unique queries but it’s searched by a million people, that’s more important than one of 7 thousand searched by only one person.

The measure should be the impact of bad queries and bad info. Or the amount of garbage returned.

Or even better, confusion created in people by AI responses. They might be able to proxy this because they have Chrome usage and Android usage data and can see what people do after these queries. Do they stop searching? Do they watch a movie? Do they jump off a building? Do they never search again (i.e., die)?


> AI will lead to more queries because I search differently with AI results than an index/page rank result.

If you do 1000 queries a day, it would take you about twenty years before you'd expect to hit a problematic query, assuming such errors are random. First, almost no one makes that many queries a day, and second, the errors mostly come from silly queries, not from attempts to find accurate information.
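
Spelling out the arithmetic, taking Google's one-in-7-million figure at face value:

    7,000,000 queries / 1,000 queries per day = 7,000 days ≈ 19 years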

> Also, it’s not about the unique queries, but the number of users running them. This seems like “data waving”, where Google gives an irrelevant number hoping to distract from an issue they can’t or don’t want to solve.

I actually find "once per 7M unique queries" more informative than "once per 10M users" (or whatever the number is). Memes spread, so user counts don't tell me what the failure rate actually looks like.

> The measure should be the impact of bad queries and bad info. Or the amount of garbage returned.

Most stuff on the internet already is garbage advice and bad info. Much of this is due to SEO spam, but it's also because there are far fewer experts than bloggers (e.g. on the topic of AI). Now, I think the best solution would be for Google to start punishing SEO spam again so I can actually search for the right information by myself, but it seems possible that the AI Overview could be helpful in the sifting process.

> Or even better, confusion created in people by AI responses.

I mean, this is a pretty difficult measure to figure out, even through polling. People aren't really trained to notice when they're confused, and it's pretty difficult to know you saw misinformation when you were looking up the information for the first time.


So they failed to make it work and are now trying to shift the goalposts. They should not have launched it then.


“nobody asked the question”

That's such an odd thing to say for a search engine operator.


The article made me question Google search even more.

« we tested the feature extensively before launch. This included robust red-teaming efforts, evaluations with samples of typical user queries and tests on a proportion of search traffic to see how it performed »

Then she explains that:

«We’ve also seen nonsensical new searches, seemingly aimed at producing erroneous results.»

So, the red-teaming didn't think of prompt injections or how to identify fake results? If that's the case, wouldn't that mean all results from Google searches are questionable, considering how much crap comes out of SEO-farming websites?


Regarding SEO stuff, I cynically think Google knows about it, but doesn’t do anything because it would reduce revenue.

A friend told me that Google estimates about 30% of its clicks are fraud, accident, waste, etc. But they have no incentive to fix it, because they make money off every wrong ad shown and every accidental or fraudulent click.

So SEO-farming sites make Google money. If I have to click on different ads and multiple garbage sites, that’s more lucrative than me just seeing a good result and skipping all the ads and blargh. Remember when “I’m Feeling Lucky” was helpful?

I can’t wait for some lawsuit discovery to produce the emails where Google sees red-team results for bad ads and decides to keep them because of revenue.

This stuff doesn’t make google money, so they’re happy to cut it.


Re: the 30% fraud figure. An interesting book just came out last month that speculates on the amount of fraud / wasted clicks / impressions in the ecosystem

https://dl.bookfunnel.com/6kwmi0qlsq

TL;DR: it's pretty damning, but since advertising is ultimately shown to drive revenue, the whole problem is priced in, so to speak.


They don't seem to understand the problem. If they link you to a website that tells you to eat a rock, whatever. If they put their brand on that with an AI Overview, that's a problem.


Meanwhile, other LLM providers don’t need to worry as much about their brand image.


They understand just fine. It’s just that they see the potential of running the equivalent of a world oracle, where every monkey goes and receives instructions from Gemini, the most powerful of AI deities, as a native ad surface.

They chose this. “Let them eat rocks” was a choice.

After all, they promised to do better after the racially diverse SS division dreamed up by Gemini a few months ago.


To detect the faked screenshot results, the post asks us to run the queries for ourselves and see what happens. But where does that leave people outside the US who don't have access to AI Overviews?


It's also non-deterministic, right? So what does it prove if I get a different answer to someone on Twitter?


Bah, we just have to wait for the next batch of GPUs they get, so that giant datacenters can be equipped to give us 0.2% more accurate results about how to eat rocks...

Can't wait for the next-gen ads!


Plus they say they’ve already rolled out some changes, no?


That's the point.

The dirty secret of a lot of AI is that they rush out a hotfix for any such issues that pop up, and then just insist the issue was fake. "You can't replicate the issue, our AI is flawless, DO NOT LOOK AT THE PATCH ROLLOUT BEHIND THE CURTAIN".


"We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously."

In terms of AI Safety, I believe Anthropic has built the right culture and mindset. In my opinion, there is a difference between doing right and saying that you are doing right; usually, if you do the latter, you actually don't do the former. It feels like Big Tech companies are scrambling to not fall behind in the AI race, moving fast and breaking things, with little consideration of the potentially harmful side effects of edge use cases for covert influence and the like.


It’s unfair, but hey, Google chose to compete.

https://cdn.some.pics/snekoil/665955fb22618.jpg


> We’ve also seen nonsensical new searches, seemingly aimed at producing erroneous results.

Is this the first time they release something? Of course people are going to try to break it.



The shift from extracting content from the web and showing it to generating content is an inflection point in search. It’s hard to get right, and I don’t agree that people like it better than regular search. It seems so, but the reality is more nuanced, since positive engagement and frustration are hard to distinguish at a distance.


It's impossible to get it right: it's not search anymore, it's editorialising. And for computers to do it while adding value will take time, and probably more than applied statistics. Like maybe they should experience it themselves, go against the flow sometimes, have an actual, political even, opinion. A desire, an agenda.

I see people asking the stuff for opinions, like "what should I do when X": you think it's search? You think it's "hard"? It's just insane and there will be a huge hangover once the limits have been properly understood. Puts on Nvidia.

Google is whining that their billions of dollars invested in a thinking machine, an artificial "intelligence" in their own words, is unable to guess when it's been made a fool of, and gets a beautiful F on the Turing test, sigh. But at least it's more interesting than the "metaverse" virtual real estate grift that their main advertising competitor attempted, I guess.


I’m waiting for the first section 230 case where they blame AI for the summary and argue it’s user generated content.


Well I think in that case that'd be like your dog biting a naughty child in a park. It's not exactly entirely your fault, it's not the dog's fault, the kid is not entirely always innocent either. Put down the dog, fine the owner, scold the child and move on.


More interesting than the launch of AI overviews has been the universally cold and hostile response from the ad-supported media. There is an air of panicked existential dread in their reporting of it.

Yeah, the AI is new and rough. That will be fixed. What won't be fixed is people's desire to read AI overviews instead of clickbait headlines that drag you through ad-ridden filler before the one worthwhile point.

The crack of Overview's faux pas is a twig breaking compared to the earth-shaking crack of the words-for-adviews industry.


This is under-appreciated. AI overviews are an extinction level event for low-quality information-regurgitators. I'd love to read some analysis on this, since it seems like AI overviews will significantly erode Google's own AdSense revenue.


> AI overviews are an extinction level event for low-quality information-regurgitators.

I think this is good. They don’t serve a positive role in the information food chain. Freeing up their space should be helpful for people using information.


Google will almost certainly wield its AI as a salesperson for advertisers. The ads will follow the users.


No, they are an extinction level event for all ad-financed content. Google will just switch to “pay us to get listed” over time.


Aside from the content itself, the "Listen to Article" button uses a robotic, outdated TTS voice. Shouldn't a company like Google use their latest technologies in public-facing content, particularly when discussing AI progress?

I'm genuinely curious about the decision-making process behind this choice.


It's amusing to me that they spend so much energy explaining that there were a lot of faked screenshots... but in this very same post they admit the tool told people to put glue on pizza and eat rocks!


What do you mean? Both can be true.


>Some of these faked results have been obvious and silly. Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression.

Meanwhile "add glue to your cheese" was a real result, which I guess was not Google "returning a dangerous result" because they specified that the glue should be non-toxic. Not really sure how obvious smoking while pregnant should be after that.


The quoted paragraph describes two distinct and _separate_ groups: the "obvious and silly" ones, and the "others [which] implied that we returned dangerous results". Smoking while pregnant is listed among the latter.


Of course, but I would read it differently if the faked screenshots were wildly more absurd than what was actually being generated (which in this case, they weren't!)


The fake posts involved things like suggestions to jump off the Golden Gate Bridge and that cockroaches often live in human penises. Those were quite a bit more absurd (and dangerous) than the actual falsities.


Both can be true. But fake screenshots aren’t relevant to the topic at hand.

So why mention it?


A masterclass in gaslighting users. Google just can't take the L on this one. I don't know anyone personally who found the AI overviews to be helpful.

For one, searching is an act of research, corroborating multiple sources, and the current implementation of AI overviews obstructs this process. There is no trust for these services yet, and for good reason; if the AI is even 5% likely to return a harmful or "satirical" result, it should be treated as a completely unreliable source, and the user should fall back to reading the provided sources anyway. So I'm not sure how this is more helpful than regular vector search.

Just about the only useful thing about AI overviews is that it places content at the top of the page, directly beneath the search bar, instead of three full pages down, below two awfully-designed widgets and a dozen sponsored, often nefarious results. That was nice. Too bad it's just another useless widget.

Does anyone left at Google with power actually care about delivering a good search product? Earnest question.


Wow, so much Google brigading going on. This does not deserve the downvotes it gets.


Wow, they are basically admitting they can’t solve the problem.

Whoever solves this problem has a good shot at being the next Google. OpenAI is probably not there yet, and it’s unclear if they will be.

Apple? Some dark horse?


Literally no one has solved it yet.


I agree, kind of similar to ranked web search pre-PageRank.


Literally unsolvable. No way even a 3.8B parameter vision model for the edge (phi3-llava), trained on the internet of hate, could cluster the tokens of the first two paragraphs together with bullshit.

https://cdn.some.pics/snekoil/66594bca93e19.jpg

No way llama3-7B Q4 would be able to do the same on a home GPU

https://cdn.some.pics/snekoil/66594cb8cddcd.jpg

No way GPT4 can perform a web search to determine the source is the Onion and flag it as satirical, or summarize the page and flag it as satirical

https://cdn.some.pics/snekoil/66594e1a3de29.jpg

No way Claude can flag that it’s satire.

https://cdn.some.pics/snekoil/665955fb22618.jpg

No way WizardLM30B would suggest it’s satire because eating rocks is not scientifically backed.

Yes, it’s not reliable, but it shows that there’s enough information encoded in even small edge models, without access to the largest data warehouse of web content and human interaction and hundreds of millions of dollars in content deals, to put a “likely fake” label on it.

Yes, it may be unsolved at SERP scale rather than chat scale, but Google chose to have this fight on SERP and launch, so they have to take the hits.

Yes, replacing context with Gemini god quotes to monopolize traffic onto Google is bad, but given the above results, augmenting results rather than appropriating them seems to have some potential, as seen with Perplexity.

Yes, LLMs are flawed, but as some competitors show, they can improve on the very low bar that is Google search results, and not just by removing the paid misinformation/ads. Google showed they can make it worse.


I am not saying LLMs will solve this; hopefully some future tech will.

Can you solve the problems you have posed with some reliability? If yes, there’s a good chance it’ll be automated at some point.

Claiming something is unsolvable rarely pans out unless it violates some fundamental law of physics.


Your post here is repetitive, we get your point.


Sorry, but solve the problem of absolute truth?

Has everyone lost their mind? This isn’t possible.


No. You don’t solve that. But if you remove the context from data and present it as truth by the Gemini God, you chose the battle and don’t get to claim “it’s unsolvable”.


Fair


I guess science is our best attempt, followed by well-funded serious journalism?

An adversarial network might actually make a good effort at matching assertions with verifiable facts, or at least facts that have a high factualness PageRank.


Truth quite literally isn’t singular. Contextual variation is a core part of communication. As are trusted parties. And lying.

Engineers are the last people I want in charge of this.


... and I guess another problem is that many/most humans don't want to know the truth.

If we had a chatbot that said, "btw, your religion XYZ is mostly made-up myths", would that sell well?


Fully solving it is impossible. But we can still detect sarcasm and BS a lot of the time. The models typically just aren't trained to do that, so they never stood a chance. https://paperswithcode.com/task/sarcasm-detection
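
As a rough illustration, a minimal sketch using one of the off-the-shelf irony classifiers catalogued at that link; the specific checkpoint (cardiffnlp/twitter-roberta-base-irony) is my assumption, and it's tuned for tweets rather than news articles:

    # Minimal sketch of off-the-shelf irony/sarcasm detection.
    # Assumes `pip install transformers torch`; the checkpoint is trained
    # on tweets, so treat the scores as a rough signal only.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="cardiffnlp/twitter-roberta-base-irony",
    )

    text = ("Geologists recommend eating at least one small rock per day "
            "for the minerals and vitamins found in sediment.")

    print(classifier(text)[0])  # e.g. {'label': 'irony', 'score': 0.87}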


This is remarkable and so full of misdirection, it’s hard to know where to start.

Is it the “lack of contemplation” invoked by a company with enough data (see the recent Google search leaks) to make the NSA blush?

Is it the fact that it glosses over the vast amount of data about this one article, back from 2021, that all screams satire?

Is it that Google bought Reddit data, which is full of context about the topic and content in /r/theonion and /r/nottheonion?

Is it that when I tried 8 other LLMs, all the way down to phi3-llava, a 3.8B vision model, on a text-only screenshot of the first two paragraphs, it responded with

> “Berkeley did not actually recommend eating small rocks as part of a healthy diet. The idea of incorporating sediment into one's diet, let alone suggesting that people should aim for a daily intake of at least one small rock, is absurd and completely contradicts scientific consensus”

as did every competing model in one form or another (GPT4, Claude, llama3-7B, WizardLM), indicating that even an edge model has enough information in its weights to identify the article as satire, even without Google’s data warehouse.

Is it the language talking about Gemini “faithfully complying” and “nobody asked the question”, which seems to paint the technology as a faithful hound let down by its human masters?

Is it that Google says “it’s hard”, as if they were not the best-resourced tech company in the world, fully in charge of their own roadmap and technology stack, with the luxury to lay off tens of thousands of workers because AI is so awesome?

Is it that Google chose to shove this down the throats of SERP customers rather than launching a new product, using their monopoly to quash better-executing competition?

Is it that there’s a pattern of similar issues (“Oops, Asian and African American SS troops”)?

Is it that it’s impossible to tell the excuses from a VP of Product from an LLM hallucination? (That may actually reinforce the economic potential of the technology.)

Is it that we are apparently so gullible, we deserve being fed something like this?

I almost feel bad for Perplexity, because a lot of people will take away “the technology is a total failure” rather than “Google is so desperate they’d rather burn down the ecosystem and tell everyone to eat glue [1] than let them get 1% of search” from this.

[1] https://snekoil.omg.lol


I'm pretty sure Brave has been doing this for a while now, and it's certainly available outside of the US. It often saves me a click, which I appreciate.



