OK, let's chew on that. "Reasonable mechanistic interpretability understanding" and "semantic" are carrying a lot of weight. I think nobody understands what's happening in these models, regardless of the narratives built from the pieces. On the macro level, everyone can see simple logical flaws.
> On the macro level, everyone can see simple logical flaws.
Your argument applies to humans as well. Or are you saying humans can't possibly understand bugs in code because they make simple logical flaws as well? Does that mean the existence of the Monty Hall Problem shows that humans cannot actually do math or logical reasoning?
Thanks for the link. Yeah, interesting and creative work. I can see how it could help reason about large models. "Interpret" seems more aspirational than real; it's still largely narrative-driven. I've been waiting for something deep in this area, and I'm not sure whether it will come from this community or not. As of today, the bold claim is that anyone understands.
> Your argument applies to humans as well
Yeah, I'm talking about obvious and trivial errors that reveal a lack of internal representation of the code. But your question did make me think, cheers.
> do you know what "Mechanistic Interpretability Researcher" means? Because that would be a fairly bold statement if you were aware of that.
The mere existence of a research field is not proof of anything except "some people are interested in this". It certainly doesn't imply that anyone truly understands how LLMs process information, "think", or "reason".
As with all research, people have questions, ideas, theories and some of them will be right but most of them are bound to be wrong.
That's a lame, typical anti-intellectual argument. You might as well say all of physics is worthless because nobody truly understands gravity.
Notice I didn't use vague terms like "think" or "reason" and instead used specific terms like "feature/circuit internal representation". You're trying to make a false equivalence of "the hard problem of gravity/reasoning/etc is not solved ... so therefore nobody understands anything" and that's obviously a false leap of logic if you've talked to any physicist or ML researcher or whatever.
That type of response is more typical of a GED holder who wants to feel intellectually superior, so they pull out a "well, you don't know anything either" to a scientist.
That's fake epistemic humility, akin to a religious nutcase proclaiming "evolution is just a theory". In fact, it's the exact same argument.
I'm not impressed. I've seen this before, from "biology is actually fake" to "the covid vaccine is fake; the FDA is using an 'emergency authorization', which means it's made up", and plenty of other examples. That's not a substantive objection; it's a thought-terminating cliché designed to dismiss any merit in the moment.
Imagine if someone in 1945 said "nuclear bombs cannot be real, even if the USA just dropped a nuke on Hiroshima, because it's just theory and it hasn't been peer reviewed yet. The Manhattan Project is burning a lot of money." That would be hilarious. And yet when someone identifies an actual neuron or feature in an ML model that activates upon recognition of a software bug (WHICH IS LITERALLY WHAT YOU WOULD EXPECT IF A MODEL HAS AN INTERNAL REPRESENTATION OF SUCH A THING), it gets dismissed. If such an obvious signal is dismissed, what is even the end goal?
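To make concrete what that kind of evidence looks like, here's a minimal sketch of the standard technique: training a linear probe on hidden activations to test whether a concept ("this snippet is buggy") is linearly decodable. Everything here is synthetic and hypothetical (the dimensions, the planted "bug direction", the labels); real interpretability work fits probes to activations recorded from an actual model.

```python
# Sketch of a linear probe, with synthetic "activations" standing in for a
# model's hidden states. A "bug direction" is planted in the data; the probe
# is a recoverable signal, i.e. the concept is linearly represented.
import numpy as np

rng = np.random.default_rng(0)
d = 64    # hidden dimension (assumed)
n = 2000  # number of code snippets (assumed)

# Pretend the model encodes "bugginess" along one hidden direction,
# plus isotropic noise in every dimension.
bug_direction = rng.normal(size=d)
bug_direction /= np.linalg.norm(bug_direction)
labels = rng.integers(0, 2, size=n)  # 1 = buggy snippet
acts = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * bug_direction

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(buggy)
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

preds = (acts @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe accuracy: {accuracy:.2f}")

# If the learned weight vector aligns with the planted direction,
# the concept is linearly decodable from the activations.
cosine = abs(w @ bug_direction) / np.linalg.norm(w)
print(f"cosine with planted direction: {cosine:.2f}")
```

High probe accuracy alone can be contested (probes can pick up correlated features), which is why papers pair them with causal interventions, but it is exactly the sort of signal being dismissed above.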