
To me, this reads like a very reasonable take.

He suggests limiting the scope of the AI problem, adding manual overrides in case there are unexpected situations, and he (rightly, in my opinion) predicts that the business case for exponentially scaling LLM models isn't there. With that context, I like his iPod example. Apple probably could have made a 3TB iPod to stick to Moore's law for another few years, but after they reached 160GB of music storage there was no use case where adding more would deliver more benefits than the added costs.



I’m still waiting for Salesforce to integrate an LLM into Slack so I can ask it about business logic and decisions long since lost. Still waiting for Microsoft to integrate an LLM into Outlook so I can get a summary of a 20-email-long chain I just got CCed into.

I don’t think the iPod comparison is a valid one. People only have so much time to listen to music. Past a certain point, no one has enough good music they like to fill a 3TB iPod. However, the more data you feed into an LLM, the smarter its responses should be. Therefore, the scaling of iPod storage and of LLM context are on completely different curves.


Haha, I'd be happy if outlook just integrated a search that actually works.

Most of Outlook's search results aren't even relevant, and it regularly misses things I know are there. Literally the most useless search I've ever had to use.


Irony: they did. They bought LookOut, which was a simple and extremely good search plugin for desktop Outlook. And then, somehow, it was melted down into the rather weak beer search that 365 has today.

There is an alternative, Lookeen, which positions itself as LookOut's successor, but I've yet to try it.

https://lookeen.com/solutions/outlook-search/lookout-alterna...


I can vouch for Lookeen (circa 2012-2015). I set it up for 200+ users on Citrix and it worked great. I had the index for each user saved to their network share and yet the search was instant. It even indexed shared mailboxes. It barely used any CPU when doing background indexing.

It worked well with thin OSTs too but due to how Outlook and Exchange work, it would have to rebuild the index more often.

Definitely a blast from the past reading the word “Lookeen” but mostly good memories about it. I believe the ADMX integration was pretty decent too.


Don’t get me started on Outlook’s search. I can try to search for an email that’s only a few weeks old and somehow it won’t find it. It will, however, find emails that are from over a decade ago.


If they can't make an effective basic text search, I doubt they're going to manage an LLM integration that doesn't return straight up garbage.


Right, that's why they give OpenAI gobs of money for their LLM research.


Most useless ever? You've used a Windows OS, right?


The fact that one chap's decade-old freeware[0] is a 100x better search than the current native tool of a trillion-dollar technology corp is my proof that there is a god, and their name is Loki.

[0]https://www.voidtools.com/support/everything/


That’s exactly my thoughts. If search is broken, expecting LLMs to fix that sounds naive or just disingenuous.


Thunderbird's is bad, too. GMail's is pretty good.


> However, the more data you feed into an LLM, the smarter it should be in the response.

Is it, though? For example, if it lacks a certain reasoning capability, then more data may not change that. So far LLMs lack a useful notion of truth; they will easily generate untrue statements. We see lots of hacks for controlling that, with unconvincing results.


That has not been my experience with GPT-4 and GPT-4o. Maybe you’re using worse models?

The point is that the more context an LLM or human has, the better decision it can make in theory. I don’t think you can debate this.

Hallucinations and LLM context scale are more engineering problems.


ChatGPT says, “Generally, yes, both large language models (LLMs) and humans can make better decisions with more context. … However, both LLMs and humans can also be overwhelmed by too much context if it’s not relevant or well-organized, so there is a balance to be struck.”


I'm being pedantic here but... I think the correct statement is "the more context an LLM or human has TO A POINT, the better decision it can make".

For example it's common to bury an adversary in paperwork in legal discovery to try to obscure what you don't want them to find.

Humans do not do better with excessive context, and it is becoming apparent that although you can go to 2M tokens etc., the models don't actually understand it. They ONLY do well at "find the needle in the haystack, here's a very specific description of the needle" tasks, but nothing that involves simultaneously considering multiple parts of that context.


I think the argument was that GPT-4 can't learn to do math from more data. I'd be surprised if that's not true.


ChatGPT makes mistakes doing basic arithmetic or sorting numbers.

Pretty sure we have enough data for these fundamental tasks.


It's more than enough data for a specialized tool, yes.

It's not even remotely enough data for a statistical language processor.


Why are young children able to quickly surpass state-of-the-art ML models at arithmetic tasks, from only a few hours of lecturing and a "training dataset" (worksheets) consisting of maybe a thousand total examples?

What is happening in the human learning process from those few thousand examples, to deduce so much more about "the rules of math" per marginal datapoint?


Are they? Even before OpenAI made it hard to force GPT to do chain of thought for basic maths it usually took over a dozen digits per number before it messed up arithmetic when I tested it.

How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

I'm sure there are aspects of how we generalise that current LLM training processes do not yet capture, but so much of the human learning process involves repeating very basic stuff over and over again, and we still regularly make trivial mistakes because we keep tripping over stuff we learned how to do right as children but keep failing to apply with sufficient precision.

Frankly, getting average humans to do these kinds of things consistently right by hand, even for small numbers, without wrapping a process of extensive checking and revision around it, is an unsolved problem. And convincing an average human to apply that kind of tedious process consistently is an unsolved problem too.
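
For what it's worth, here is a rough sketch of the kind of probe described above, i.e. random n-digit addition problems with the model asked to answer directly. The OpenAI client usage, the model name, and the prompt are my assumptions, and current models may still reason internally or reach for tools:

    import random
    from openai import OpenAI  # assumes the v1+ Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def addition_accuracy(digits, trials=20, model="gpt-4o"):
        """Fraction of bare-answer additions of two `digits`-digit numbers the model gets right."""
        correct = 0
        for _ in range(trials):
            a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": f"What is {a} + {b}? Reply with only the number."}],
            )
            answer = resp.choices[0].message.content.strip().replace(",", "")
            correct += answer == str(a + b)
        return correct / trials

    for d in (4, 8, 12, 16, 20):
        print(d, addition_accuracy(d))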


> How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

You're overestimating how many examples "drilled for quite some time" represents. In an entire 12 years of public school, you might only do a few thousand addition problems in total. And yet you'll be quite good at arithmetic by the end. In fact, you'll be surprisingly good at arithmetic after your first hundred!

> I'm sure there are aspects to how we generalise that current LLM training processes does not yet capture, but so much of human learning processes involve repeating very basic stuff over and over again and still regularly making trivial mistakes because we keep tripping over stuff we learned how to do right as children but keep failing to apply it with sufficient precision.

LLMs fail when asked to do "short" addition of long numbers "in their heads." And so do kids!

But most of what "teaching addition" to children means is getting them to translate addition into a long-addition matrix representation of the problem, so they can then work the "long-addition algorithm" one column at a time, marking off columns as they process them.

Presuming they can do that, the majority of the remaining "irreducible" error rate comes from the copying-numbers-into-the-matrix step! (And that can often be solved by teaching kids the "trick" of inserting commas into long numbers that don't already have them, so that they can visually group and cross-check numbers while copying.)
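
For concreteness, the column-at-a-time procedure being described is essentially this (a minimal sketch, working right to left over the columns with an explicit carry):

    def long_addition(a: str, b: str) -> str:
        """Add two non-negative integers given as digit strings, one column at a time."""
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        carry, out = 0, []
        for da, db in zip(reversed(a), reversed(b)):  # rightmost column first
            total = int(da) + int(db) + carry
            out.append(str(total % 10))               # digit written under this column
            carry = total // 10                       # carried into the next column
        if carry:
            out.append(str(carry))
        return "".join(reversed(out))

    assert long_addition("987", "345") == "1332"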

LLMs can be told to do a Chain-of-Thought of running through the whole long-addition algorithm the same way a human would (essentially, saying the same things that a human would think to themselves while doing the long-addition algorithm)... but for sufficiently-large numbers (50 digits, say) they still won't perform within an order-of-magnitude of a human, because "a bag of rotary-position-encoded input tokens with self-attention, where the digits appear first as a token sequence, and then as individual tokens in sentences describing the steps of the operation" is just plain messier — more polluted with unrelated stuff that makes it less possible to apply rigor to "finding your place" (i.e. learn hard rules as discrete 0-or-1 probabilities) — than an arbitrary-width grid of digits representation is.

People — kids or not — when asked to do long addition, would do it "on paper": using a constant back-and-forth between their Chain-of-Thought and their visual field, with the visual field acting as a spatially-indexed memory of the current processing step, where they expect to be able to "look at" a single column, and "load" two digits into their Chain-of-Thought that are indirected by their current visual attention cursor — with their visual field having enough persistence to get them back to where they were in the problem if they glance away; and yet with the ability to arbitrarily refocus the "cursor" in both relative and absolute senses depending on what the Chain-of-Thought says about the problem. Given an unbounded-length "paper" to work on, such a back-and-forth process can be extended to an unbounded-length processing sequence robustly. (Compare/contrast: a Turing machine's tape head.)

Pure LLMs (seq2seq models) cannot "work on paper."

If you consider what is even theoretically possible to "model" inside a feed-forward NN's weights — it can certainly have the successive embedding vectors act as "machine registers" to track 1. a set of finite-state machines, and 2. a set of internal memory cells (where each cell's values are likely represented by O(N) oppositional activations of vector elements representing each possible value the cell can take on.) These abstractions together are likely what allow LLMs to perform as well as they do on bounded-length arithmetic. (They're not memorizing; they're parsing!)

But given the way feed-forward seq2seq NNs work, they need a separate instance of these trained weights, and their commensurate embedding vector elements, for each digit they're going to be processing. Just like a parallel ALU has a separate bit of silicon dedicated to processing each bit of the input registers, an LLM must have a separate independent probability model for the outcome of applying a given operation to each digit-token "touched" on the same layer. Where any of these may be under-trained; and where, if (current, quadratic) self-attention is involved, the hidden-layer embedding-vector growth caused by training to sum really big numbers, would quickly become untenable. (And would likely be doubly wasted, representing the registers for each learned arithmetic operation separately, rather than collapsing down into any kind of shared "accumulator register" abstraction.)

---

That being said: what if LLMs could "work on paper?" How would that work?

For complete generality — to implement arbitrary algorithms requiring unbounded amounts of memory — they'd very likely need to be able to "look at the paper" an unbounded number of times per token output — which essentially means they'd need to be converted at least partially into RNNs (hopefully post-training.) So let's ignore that case; it's a whole architectural can of worms.

Let's look at a more limited case. Assuming you only want the LLM to be able to implement O(N log N) algorithms (which would be the limit for a feed-forward NN, as each NN layer can do O(N) things in parallel, and there are O(log N) layers) — what's the closest you could get to an LLM "working on paper"?

Maybe something like:

• adding an unbounded-size "secondary vector" (like the secondary vector of a LoRA), that isn't touched in each step by self-attention, and that starts out zeroed,

• with a bounded-size "virtual memory mapping" — a dynamic and windowed position-encoding of a subset of the vector into the Q/K vectors at each step, and a dynamic position-encoding of part of the resulting embedding (Q·Kᵀ·V) that maps a subset of the embedding vector back into the secondary vector

• where this position-encoding is "dynamic" in that, during training of each layer, that layer has one set of embedding vectors that it learns as being a "input-vocabulary memory descriptor table", describing the virtual-memory mappings of the secondary vector's state-at-layer-N into the pre-attention vector input at layer N [i.e. a matrix you multiply against the secondary vector, then add the result to the pre-attention vector]; and an equivalent "output-vocabulary memory descriptor table", mapping the post-attention embedding vector to writes of the secondary vector [i.e. a matrix you multiply against the post-attention embedding vector, then add to the secondary vector]

• and where the secondary vector is windowed, in that both memory-descriptor-table matrices are indicating positions in a window — a virtual secondary vector that actually exists as a 1D projection of a conceptually-N-dimensional slice of a physical secondary N-dimensional matrix; where each pre-attention embedding contains 2N elements interpreted as "window bounds" for the N dimensions of the matrix, to derive the secondary vector "virtual memory" from its physical storage matrix; and where each post-attention embedding contains 2N elements either interpreted again as "window bounds" for the next layer; or interpreted as "window commands" to be applied to the window (e.g. specifying arbitrary relative affine transformations of the input matrix, decomposed into separate scaling/translation/rotation elements for each dimension), with the "window bounds" of the next layer then being generated by the host framework by applying the affine transformation to the existing window bounds. (And again, with the output window bounds/windowing command parameters being learned.)

I believe this abstraction would give a feed-forward NN the ability to, once per layer,

1. "focus" on a position on an external "paper";

2. "read" N things from the paper, with each NN node "loading" a weight from a learned position that's effectively relative to the focus position;

3. compute using that info;

4. "write" N things back to new positions relative to the focus position on the paper;

5. "look" at a different focus position for the next layer, relative to the current focus position.

This extension could enable pretty complex internal algorithms. But I dunno, I'm not an ML engineer, I'm just spitballing :)
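
Just to make that concrete, here is a toy, purely illustrative rendering of the five-step loop above (not a real NN layer, just the "paper plus moving focus" control flow), reusing long addition as the workload:

    import numpy as np

    def add_on_paper(a: str, b: str) -> str:
        """Long addition as a focus/read/compute/write/move loop over a scratch grid."""
        n = max(len(a), len(b)) + 1
        paper = np.zeros((3, n), dtype=int)               # rows: addend, addend, result
        paper[0, n - len(a):] = [int(d) for d in a]
        paper[1, n - len(b):] = [int(d) for d in b]
        cursor, carry = n - 1, 0                           # 1. focus on the rightmost column
        while cursor >= 0:
            da, db = paper[0, cursor], paper[1, cursor]    # 2. read relative to the focus
            total = da + db + carry                        # 3. compute
            paper[2, cursor] = total % 10                  # 4. write back near the focus
            carry = total // 10
            cursor -= 1                                    # 5. move the focus one column left
        return "".join(map(str, paper[2])).lstrip("0") or "0"

    assert add_on_paper("987", "345") == "1332"

Given unbounded paper (columns) and unbounded steps, this loop handles arbitrarily long inputs; the whole question above is how much of that control flow a fixed-depth feed-forward model can emulate.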


You can't reliably teach an LLM maths the same way you can't take a locomotive offroading.


“Yes, it is debatable. Here are some arguments for and against the idea that more context leads to better decisions…”


Llama-1, 1T tokens, dumb as a box of rocks

Llama-2, 2T tokens, smarter than a box of rocks

Mistral-7B, 8T tokens, way smarter than llama-2

Llama-3, 15T tokens, smarter than anything a few times its size

Gemma-2, 13T synthetic tokens, slightly better than llama-3

(for the same approximate parameter size)

I think it roughly tracks that moar data = moar betterer.


> not smart

> slightly smarter

> way smarter

> Last one, "slightly smarter"

So, the usual s-curve that has an exponential phase, then tops out?


Pretty much, yep. There was definitely a more significant jump there in the middle where 7B models went from being a complete waste of time to actually useful. Then going from being able to craft a sensible response to 80% of questions to 90% is a much smaller apparent increase, but takes a lot more compute to achieve, as per the Pareto principle.


I see giant models like Intel chips over the last decade: big, powerful, expensive, energy hogs.

Small models are like ARM: you get much of the performance you actually need for common consumer tasks, very cheap to run, and energy efficient.

We need both, but I personally spend most of my ML time training small models and I’m very happy with the results.


but the OP was talking about the size of the context window, not the size of the training corpus


Hmm right, I read that wrong. Still, interesting data I think.


Most data around is junk, the internet produces junk data faster than useful data, and current GPT AIs basically regurgitate what someone already did somewhere on the internet. So I guess the more data we feed into GPTs, the worse the results will get.

My take to improve AI output is to heavily curate the data you feed your AI, much like the expert systems of old (which were also lauded as "AI"). Maybe we can break the vicious circle of "I trained my GPT on billions of Twitter posts and let it write Twitter posts to great success", "Hey, me too!"
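
As a minimal sketch of what that curation might look like in practice (the thresholds and heuristics here are invented purely for illustration): exact dedup plus a couple of crude quality filters before anything reaches the training set.

    import hashlib

    def curate(texts, min_words=20, max_symbol_ratio=0.3):
        """Drop exact duplicates and obviously low-quality snippets before training."""
        seen, kept = set(), []
        for t in texts:
            key = hashlib.sha256(t.strip().lower().encode()).hexdigest()
            if key in seen:
                continue                                   # exact duplicate
            seen.add(key)
            symbols = sum(not c.isalnum() and not c.isspace() for c in t)
            if len(t.split()) < min_words:
                continue                                   # too short to carry much signal
            if symbols / max(len(t), 1) > max_symbol_ratio:
                continue                                   # mostly markup, emoji or boilerplate
            kept.append(t)
        return kept

Real pipelines layer near-duplicate detection, quality classifiers and human review on top, but the shape is the same: throw most of the raw data away.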


There are multiple companies hiring people on contracts to curate and generate data for this. I do confidential contract work for two different ones at the moment, and while my NDAs limit how much I can say, it involves both identifying issues with captured prompt/response pairs that have been filtered, and writing synthetic ones from scratch aided by models (e.g. come up with a coding problem, and rewrite the response to be "perfect").

The first category has obviously been pre-filtered to put cheaper resources on simpler problems, as these projects sometimes pay reasonable tech contract rates for 1-2 hours of work to improve only 2-3 turns of a single conversation, and it's clear they usually involve more than one person reviewing the same data.

A lot of money is pouring into that space, and the moats, in the form of proprietary training data heavily curated by experts, are going to grow rapidly given how much cash the big players have.


Thanks for your insights! Would you say this is an approach suited for "general" GPTs (not in the sense of AGI), or more for expert systems like Copilot?


I can't really say I know whether the outcomes are good as I won't be told to what extent the output makes it into production models, and I don't even always know which company it's for. But I know at least some of it is being used for "general" models. I do more code-related work than general purpose as it's the work I find most interesting, but the highest paid contract I've had in this space so far is for a general-purpose model that to my knowledge isn't available yet, for a model from a company you'd know (but I'm under strict NDA not to mention the company name or more details about the work).


> My take to improve AI output is to heavily curate the data you feed your AI

This is what OpenAI is doing with their relationships with companies like Reddit, News Corp etc:

https://openai.com/index/news-corp-and-openai-sign-landmark-...

Problem is that we have a finite amount of this type of information.


Massive surveillance, extracting data and using it for training. I hope this will not come to fruition.


Thankfully, we have stalwart and well-known defenders of our security like Apple and Microsoft to protect us. There's nothing of the sort to worry about.


Outlook already does summarization with Copilot. I use it every day. I think summarization is one of the strengths of LLMs and it really shines here:

https://support.microsoft.com/en-us/office/summarize-an-emai...


> integrate an LLM into Slack

They are already training their models https://slack.com/intl/en-gb/trust/data-management/privacy-p...

> Microsoft to integrate an LLM into outlook

Unlikely to happen. Orgs that use MS products do not want the content of emails leaking, and LLMs leak. There is a real danger that an LLM will include information in the summary that does not come from the original email thread, but from other emails the model was trained on. You could learn from the summary that you are going to get fired, even though that was not a part of the original conversation. HR doesn't like that.

> However, the more data you feed into an LLM, the smarter it should be in the response

Not necessarily. At some point you are going to run out of current data and you might be tempted to feed it past data, except that data may be of poor quality or simply wrong. Since LLMs cannot tell good data from bad, they happily accept both, leading to useless outputs.


>>Microsoft to integrate an LLM into outlook

Didn't they already do this? A friend of mine showed me his outlook where he could search all emails, docs, and video calls and ask it questions. To be fair, he and I asked it questions about a video call and a doc - but not any emails, we only searched emails.

This was last week and it worked "mostly OK", but having a Q&A conversation with a long email feels inevitable.


Asking questions about a document is one thing; asking questions that synthesize information across many documents — the human-intelligent equivalent of doing a big OLAP query with graph-search and fulltext-search parts on your email database — is quite another.

Right now AFAICT the latter would require the full text of all the emails you've ever sent, to be stuffed into the context window together.
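
A rough sketch of the middle ground most products actually ship instead: embed the mailbox once, retrieve the handful of emails most similar to the question, and stuff only those into the prompt. embed() and llm() below are stand-ins, not any particular vendor's API:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in for a call to some sentence-embedding model."""
        raise NotImplementedError

    def answer_over_mailbox(question, emails, llm, k=5):
        """Retrieve the k emails most similar to the question, then ask over just those."""
        q = embed(question)
        q = q / np.linalg.norm(q)
        scores = []
        for e in emails:
            v = embed(e)
            scores.append(float(np.dot(q, v / np.linalg.norm(v))))   # cosine similarity
        top = [e for _, e in sorted(zip(scores, emails), reverse=True)[:k]]
        prompt = ("Answer using only these emails:\n\n"
                  + "\n\n---\n\n".join(top)
                  + f"\n\nQuestion: {question}")
        return llm(prompt)

Retrieval like this handles "find the needle" questions reasonably well; the genuinely cross-cutting, OLAP-style questions are exactly where it falls short, as the parent says.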


Yes, this is already a thing. Copilot for M365 fine-tunes on your entire org's data.


So that's a hard stop for many orgs. Not everyone is supposed to see all of the org's data even if they work for it.


There are still controls on that, it doesn't give you access to everything. It took forever for legal to sign off on it.


They definitely added it to the web version on LinkedIn; you can see it when you want to write or reply to a message, and it gives you an option to "Write with AI".


> There is a real danger that an LLM will include information in the summary that does not come from the original email thread, but from other emails the model was trained on. You could learn from the summary that you are going to get fired, even though that was not a part of the original conversation. HR doesn't like that.

There could be separate personal fine-tunes per user, trained (in the cloud) on the contents of that user's mail database, which therefore have knowledge of exactly the mail that particular user can access, and nothing else.

AFAICT this is essentially what Apple is claiming they're doing to power their own "on-device contextual querying."


> There could be separate personal fine-tunes per user

Yes, but that contradicts the earlier claim that giving AI more info makes it better. In fact, those who send few emails or have just joined may see worse results due to the lack of data. LLMs really force us to come up with ideas for solving problems that did not exist without LLMs.


A problem looking for a problem to solve is how I like to think of it


> Still waiting for Microsoft to integrate an LLM into outlook so I can get a summary of a 20 email long chain I just got CCed into.

Still waiting for Microsoft to add an email search to Outlook that isn't complete garbage. Ideally with a decent UI and presentation of results that isn't complete garbage.

…why are we hoping that AI will make these products better, when they’re not using conventional methods appropriately, and have been enshittified to shit.


> the more data you feed into an LLM, the smarter it should be in the response

This is not obvious though.


It is, in theory. The more information you have, the better the decision, in theory.


Not quite. There are bounds on capacity of learning machines.

https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_di...


Anyone got a TL;DR on this?


Think about an AI with a 1-bit model. If you feed that AI data that can't possibly be classified with less than 2 bits, it can't get it precisely right, no matter how much data you train it on, or what the 1 bit of the model represents.

For any given size of system, there will be a ceiling on what it can learn to classify or predict with precision.

I used "system" rather than "model" there for a reason:

Memory in any form, such as context, RAG, or API access to anything that can store and retrieve data, affects that maximum: a Turing machine can be implemented with a very small model plus a loop if there's access to an external memory to act like the tape. But if the "tape" is limited, there will be some limitation on what the total system can precisely classify.
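
A concrete toy version of that ceiling: a "model" that is nothing but a single threshold t on the real line (predict 1 iff x > t) cannot realize all four labelings of even two points, no matter how much data it sees, because one labeling is simply outside its capacity:

    from itertools import product

    points = [1.0, 2.0]
    thresholds = [0.5, 1.5, 2.5]        # enough candidates to cover every distinct behaviour

    def predict(t, x):
        return 1 if x > t else 0        # the entire "model" is the single threshold t

    realizable = {tuple(predict(t, x) for x in points) for t in thresholds}
    print(set(product([0, 1], repeat=2)) - realizable)   # {(1, 0)} is unreachable

No threshold can label the left point 1 and the right point 0; in VC terms this model family's dimension is too small for that task, and more training data doesn't change it.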


It's quality, not quantity.

You need to have accurate, properly reasoned information for better decisions.


It’s a good thing that SFDC and Slack are both well known for being repositories of high-quality data.

/sarc


Slack literally launched that feature today! In fact I posted about it on HN this morning.

https://news.ycombinator.com/item?id=40841057


Both of these already exist. Slack just introduced AI, and Copilot for M365 products has been available for quite a while now. It works great, I use it every day.


Aren’t you afraid you're gonna be given hallucinated information?

That’s my biggest worry. I’ve set up my own RAG and it’s kind of sad how even that is not super accurate.


> Still waiting for Microsoft to integrate an LLM into outlook

Given that Microsoft this year is all-in on LLM AI, this is surely coming.

But perhaps it will be a premium, paid feature?


This exists and is available for paid users. My company started experimenting with it recently and it can be fairly helpful for long threads: https://support.microsoft.com/en-us/office/summarize-an-emai....


They will sooner do that than fix Teams....


The teams discussion was 3 days ago https://news.ycombinator.com/item?id=40786640

I am happy with my comments there, including "MS could devote resources to making MS Teams more useful, but they don't have to, so they don't."


It’s an additional $20-30/user/mo license, and has already existed in production for several months.


Why should the response be better just because there is "more data"?

Should I be adding extra random tokens to my prompts to make the LLM "smarter"?


More context. Not random data.


> Apple probably could have made a 3TB iPod

It's a very weird comparison, as putting more music tracks on your iPod doesn't make them sound better, while giving an LLM more parameters/computing power makes it smarter.

Honestly it sounds like a typical "I've drawn my conclusion, and now I only need an analogy that remotely supports my conclusion" way of thinking.


No, it makes sense: he's coming at it from the perspective of knowing exactly what task you want to accomplish (something like "fixing the grammar in this document.") In such cases, a model only has to be sufficiently smart to work for 99.9999% of inputs — at which point you cross a threshold where adding more intelligence is just making the thing bulkier to download and process, to no end for your particular use-case.

In fact, you would then tend to go the other way — once you get the ML model to "solve the problem", you want to then find the smallest and most efficient such model that solves the problem. I.e. the model that is "as stupid as possible", while still being very good at this one thing.
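
In practice that search is often just a plain loop: evaluate candidates from smallest to largest on a task-specific test set and keep the first one that clears the bar. The model names and eval harness below are placeholders, purely a sketch of the idea:

    def smallest_sufficient_model(candidates, eval_accuracy, target=0.999):
        """candidates: model ids ordered smallest to largest; eval_accuracy: id -> accuracy on the task."""
        for model in candidates:
            accuracy = eval_accuracy(model)
            if accuracy >= target:
                return model, accuracy   # cheapest model that is "good enough" for this task
        return None, None                # nothing clears the bar; scale up or rethink the task

    # e.g. smallest_sufficient_model(["tiny", "small", "base", "large"], run_grammar_fix_eval)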


If you have no conception of mathematics, do you think you'd get better at solving mathematics problems based on looking at more examples of people who may or may not be solving them correctly?


It has worked for humanity...


Only after humanity became a general intelligence. Our earlier ancestors were also pretty smart, but not smart enough to develop technology on their own; you have to be extremely smart to do what humanity did.


Is there a reason why memory was used as an example and not compute power? I don't understand how cherry-picking random examples from the past explains the future of AI. If he thinks the business need does not exist, he should explain how he arrived at that conclusion instead of using a random iPod example.


It's an analogy. He's making the point that even though something can scale at an exponential rate, it doesn't mean there is a business need for such scaling


This. The scaling of compute has vastly different applications than the scaling of memory. Shows once again that people who are experts in a related field aren't necessarily the best to comment on trendy topics. If e.g. an aeroplane expert critiques SpaceX's Starship, you should be equally wary, even though they might have some overlap. The only reason this is in the media at all is because negative sentiment toward hype generates many clicks. That's why you see these topics every day instead of Rubik's cube players criticising the latest version of Mikado.


The business case is absolutely there; it's just that the industry has weirdly latched onto 'chatbot' as the use case, as opposed to where the real value lies.

The pretrained model is where the enterprise gold is at.

But the companies building models past the tipping-point scale for that value to be derived are walling up their pretrained models behind very heavy-handed fine-tuning that strips away most of the business value.

The engineers themselves seem to lack the imagination for the business cases, and the enterprise market doesn't have access to start discovering the applications outside of 'chatbot,' particularly with large context windows of proprietary data fed into SotA pretrained models.

There's maybe a handful of people who actually realize what value is being left on the table, and I think most of them are smart enough not to currently be in positions to make it happen.



