
To me, this reads like a very reasonable take.

He suggests limiting the scope of the AI problem, adding manual overrides in case there are unexpected situations, and he (rightly, in my opinion) predicts that the business case for exponentially scaling LLM models isn't there. With that context, I like his iPod example. Apple probably could have made a 3TB iPod to stick to Moore's law for another few years, but after they reached 160GB of music storage there was no use case where adding more would deliver more benefits than the added costs.



I’m still waiting for Salesforce to integrate an LLM into Slack so I can ask it about business logic and decisions long since lost. Still waiting for Microsoft to integrate an LLM into Outlook so I can get a summary of a 20-email-long chain I just got CCed into.

I don’t think the iPod comparison is a valid one. People only have so much time to listen to music. Past a certain point, no one has enough good music they like to fill a 3TB iPod. However, the more data you feed into an LLM, the smarter its responses should be. Therefore, the scaling of iPod storage and of LLM context are on completely different curves.


Haha, I'd be happy if outlook just integrated a search that actually works.

Most of Outlook's search results aren't even relevant, and it regularly misses things I know are there. Literally the most useless search I've ever had to use.


Irony: they did. They bought LookOut, which was a simple and extremely good search plugin for desktop Outlook. And then, somehow, it was melted down into the rather weak beer search that 365 has today.

There is an alternative, Lookeen, which positions itself as LookOut's successor, but I've yet to try it.

https://lookeen.com/solutions/outlook-search/lookout-alterna...


I can vouch for Lookeen (circa 2012-2015). I set it up for 200+ users on Citrix and it worked great. I had the index for each user saved to their network share and yet the search was instant. It even indexed shared mailboxes. It barely used any CPU when doing background indexing.

It worked well with thin OSTs too but due to how Outlook and Exchange work, it would have to rebuild the index more often.

Definitely a blast from the past reading the word “Lookeen” but mostly good memories about it. I believe the ADMX integration was pretty decent too.


Don’t get me started on Outlook’s search. I can try to search for an email that’s only a few weeks old and somehow it won’t find it. It will, however, find emails that are from over a decade ago.


If they can't make an effective basic text search, I doubt they're going to manage an LLM integration that doesn't return straight up garbage.


Right, that's why they give OpenAI gobs of money for their LLM research.


Most useless ever? You've used a Windows OS, right?


The fact that one chap's decade-old freeware[0] is a 100x better search than the current native tool of a trillion-dollar technology corp is my proof that there is a god, and their name is Loki.

[0]https://www.voidtools.com/support/everything/


That’s exactly my thoughts. If search is broken, expecting LLMs to fix that sounds naive or just disingenuous.


Thunderbird's is bad, too. GMail's is pretty good.


> However, the more data you feed into an LLM, the smarter it should be in the response.

Is it, though? For example, if it lacks a certain reasoning capability, then more data may not change that. So far LLMs lack a useful notion of truth; they will easily generate untrue statements. We see lots of hacks for controlling that, with unconvincing results.


That has not been my experience with GPT-4 and GPT-4o. Maybe you’re using worse models?

The point is that the more context an LLM or human has, the better decision it can make in theory. I don’t think you can debate this.

Hallucinations and LLM context scale are more engineering problems.


ChatGPT says, “Generally, yes, both large language models (LLMs) and humans can make better decisions with more context. … However, both LLMs and humans can also be overwhelmed by too much context if it’s not relevant or well-organized, so there is a balance to be struck.”


I'm being pedantic here but... I think the correct statement is "the more context an LLM or human has TO A POINT, the better decision it can make".

For example it's common to bury an adversary in paperwork in legal discovery to try to obscure what you don't want them to find.

Humans do not do better with excessive context, and it is becoming apparent that although you can go to 2M tokens etc., the models don't actually understand it. They ONLY do well at "find the needle in the haystack, here's a very specific description of the needle" tasks, but nothing that involves simultaneously considering multiple parts of that context.


I think the argument was that GPT-4 can't learn to do math from more data. I'd be surprised if that's not true.


ChatGPT makes mistakes doing basic arithmetic or sorting numbers.

Pretty sure we have enough data for these fundamental tasks.


It's more than enough data for a specialized tool, yes.

It's not even remotely enough data for a statistical language processor.


Why are young children able to quickly surpass state-of-the-art ML models at arithmetic tasks, from only a few hours of lecturing and a "training dataset" (worksheets) consisting of maybe a thousand total examples?

What is happening in the human learning process from those few thousand examples, to deduce so much more about "the rules of math" per marginal datapoint?


Are they? Even before OpenAI made it hard to force GPT to do chain of thought for basic maths it usually took over a dozen digits per number before it messed up arithmetic when I tested it.

How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

I'm sure there are aspects of how we generalise that current LLM training processes do not yet capture, but so much of the human learning process involves repeating very basic stuff over and over again, and we still regularly make trivial mistakes because we keep tripping over stuff we learned how to do right as children but keep failing to apply with sufficient precision.

Frankly, getting average humans to do these kinds of things consistently right by hand, even for small numbers, without wrapping a process of extensive checking and revision around it, is an unsolved problem. And convincing an average human to apply that kind of tedious process consistently is an unsolved problem too.
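
For what it's worth, here is a rough sketch of the kind of probe described above, i.e. random n-digit addition problems with the model asked to answer directly. The OpenAI client usage, the model name, and the prompt are my assumptions, and current models may still reason internally or reach for tools:

    import random
    from openai import OpenAI  # assumes the v1+ Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def addition_accuracy(digits, trials=20, model="gpt-4o"):
        """Fraction of bare-answer additions of two `digits`-digit numbers the model gets right."""
        correct = 0
        for _ in range(trials):
            a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": f"What is {a} + {b}? Reply with only the number."}],
            )
            answer = resp.choices[0].message.content.strip().replace(",", "")
            correct += answer == str(a + b)
        return correct / trials

    for d in (4, 8, 12, 16, 20):
        print(d, addition_accuracy(d))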


> How many young children do you genuinely think would do problems like that without messing up a step before having drilled for quite some time?

You're overestimating how many examples "drilled for quite some time" represents. In an entire 12 years of public school, you might only do a few thousand addition problems in total. And yet you'll be quite good at arithmetic by the end. In fact, you'll be surprisingly good at arithmetic after your first hundred!

> I'm sure there are aspects to how we generalise that current LLM training processes does not yet capture, but so much of human learning processes involve repeating very basic stuff over and over again and still regularly making trivial mistakes because we keep tripping over stuff we learned how to do right as children but keep failing to apply it with sufficient precision.

LLMs fail when asked to do "short" addition of long numbers "in their heads." And so do kids!

But most of what "teaching addition" to children means is getting them to translate addition into a long-addition matrix representation of the problem, so they can then work the "long-addition algorithm" one column at a time, marking off columns as they process them.

Presuming they can do that, the majority of the remaining "irreducible" error rate comes from the copying-numbers-into-the-matrix step! (And that can often be solved by teaching kids the "trick" of inserting commas into long numbers that don't already have them, so that they can visually group and cross-check numbers while copying.)
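
For concreteness, the column-at-a-time procedure being described is essentially this (a minimal sketch, working right to left over the columns with an explicit carry):

    def long_addition(a: str, b: str) -> str:
        """Add two non-negative integers given as digit strings, one column at a time."""
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        carry, out = 0, []
        for da, db in zip(reversed(a), reversed(b)):  # rightmost column first
            total = int(da) + int(db) + carry
            out.append(str(total % 10))               # digit written under this column
            carry = total // 10                       # carried into the next column
        if carry:
            out.append(str(carry))
        return "".join(reversed(out))

    assert long_addition("987", "345") == "1332"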

LLMs can be told to do a Chain-of-Thought of running through the whole long-addition algorithm the same way a human would (essentially, saying the same things that a human would think to themselves while doing the long-addition algorithm)... but for sufficiently-large numbers (50 digits, say) they still won't perform within an order-of-magnitude of a human, because "a bag of rotary-position-encoded input tokens with self-attention, where the digits appear first as a token sequence, and then as individual tokens in sentences describing the steps of the operation" is just plain messier — more polluted with unrelated stuff that makes it less possible to apply rigor to "finding your place" (i.e. learn hard rules as discrete 0-or-1 probabilities) — than an arbitrary-width grid of digits representation is.

People — kids or not — when asked to do long addition, would do it "on paper": using a constant back-and-forth between their Chain-of-Thought and their visual field, with the visual field acting as a spatially-indexed memory of the current processing step, where they expect to be able to "look at" a single column, and "load" two digits into their Chain-of-Thought that are indirected by their current visual attention cursor — with their visual field having enough persistence to get them back to where they were in the problem if they glance away; and yet with the ability to arbitrarily refocus the "cursor" in both relative and absolute senses depending on what the Chain-of-Thought says about the problem. Given an unbounded-length "paper" to work on, such a back-and-forth process can be extended to an unbounded-length processing sequence robustly. (Compare/contrast: a Turing machine's tape head.)

Pure LLMs (seq2seq models) cannot "work on paper."

If you consider what is even theoretically possible to "model" inside a feed-forward NN's weights — it can certainly have the successive embedding vectors act as "machine registers" to track 1. a set of finite-state machines, and 2. a set of internal memory cells (where each cell's values are likely represented by O(N) oppositional activations of vector elements representing each possible value the cell can take on.) These abstractions together are likely what allow LLMs to perform as well as they do on bounded-length arithmetic. (They're not memorizing; they're parsing!)

But given the way feed-forward seq2seq NNs work, they need a separate instance of these trained weights, and their commensurate embedding vector elements, for each digit they're going to be processing. Just like a parallel ALU has a separate bit of silicon dedicated to processing each bit of the input registers, an LLM must have a separate independent probability model for the outcome of applying a given operation to each digit-token "touched" on the same layer. Where any of these may be under-trained; and where, if (current, quadratic) self-attention is involved, the hidden-layer embedding-vector growth caused by training to sum really big numbers, would quickly become untenable. (And would likely be doubly wasted, representing the registers for each learned arithmetic operation separately, rather than collapsing down into any kind of shared "accumulator register" abstraction.)

---

That being said: what if LLMs could "work on paper?" How would that work?

For complete generality — to implement arbitrary algorithms requiring unbounded amounts of memory — they'd very likely need to be able to "look at the paper" an unbounded number of times per token output — which essentially means they'd need to be converted at least partially into RNNs (hopefully post-training.) So let's ignore that case; it's a whole architectural can of worms.

Let's look at a more limited case. Assuming you only want the LLM to be able to implement O(N log N) algorithms (which would be the limit for a feed-forward NN, as each NN layer can do O(N) things in parallel, and there are O(log N) layers) — what's the closest you could get to an LLM "working on paper"?

Maybe something like:

• adding an unbounded-size "secondary vector" (like the secondary vector of a LoRA), that isn't touched in each step by self-attention, and that starts out zeroed,

• with a bounded-size "virtual memory mapping" — a dynamic and windowed position-encoding of a subset of the vector into the Q/K vectors at each step, and a dynamic position-encoding of part of the resulting embedding (Q·Kᵀ·V) that maps a subset of the embedding vector back into the secondary vector

• where this position-encoding is "dynamic" in that, during training of each layer, that layer has one set of embedding vectors that it learns as being a "input-vocabulary memory descriptor table", describing the virtual-memory mappings of the secondary vector's state-at-layer-N into the pre-attention vector input at layer N [i.e. a matrix you multiply against the secondary vector, then add the result to the pre-attention vector]; and an equivalent "output-vocabulary memory descriptor table", mapping the post-attention embedding vector to writes of the secondary vector [i.e. a matrix you multiply against the post-attention embedding vector, then add to the secondary vector]

• and where the secondary vector is windowed, in that both memory-descriptor-table matrices are indicating positions in a window — a virtual secondary vector that actually exists as a 1D projection of a conceptually-N-dimensional slice of a physical secondary N-dimensional matrix; where each pre-attention embedding contains 2N elements interpreted as "window bounds" for the N dimensions of the matrix, to derive the secondary vector "virtual memory" from its physical storage matrix; and where each post-attention embedding contains 2N elements either interpreted again as "window bounds" for the next layer; or interpreted as "window commands" to be applied to the window (e.g. specifying arbitrary relative affine transformations of the input matrix, decomposed into separate scaling/translation/rotation elements for each dimension), with the "window bounds" of the next layer then being generated by the host framework by applying the affine transformation to the existing window bounds. (And again, with the output window bounds/windowing command parameters being learned.)

I believe this abstraction would give a feed-forward NN the ability to, once per layer,

1. "focus" on a position on an external "paper";

2. "read" N things from the paper, with each NN node "loading" a weight from a learned position that's effectively relative to the focus position;

3. compute using that info;

4. "write" N things back to new positions relative to the focus position on the paper;

5. "look" at a different focus position for the next layer, relative to the current focus position.

This extension could enable pretty complex internal algorithms. But I dunno, I'm not an ML engineer, I'm just spitballing :)
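
Just to make that concrete, here is a toy, purely illustrative rendering of the five-step loop above (not a real NN layer, just the "paper plus moving focus" control flow), reusing long addition as the workload:

    import numpy as np

    def add_on_paper(a: str, b: str) -> str:
        """Long addition as a focus/read/compute/write/move loop over a scratch grid."""
        n = max(len(a), len(b)) + 1
        paper = np.zeros((3, n), dtype=int)               # rows: addend, addend, result
        paper[0, n - len(a):] = [int(d) for d in a]
        paper[1, n - len(b):] = [int(d) for d in b]
        cursor, carry = n - 1, 0                           # 1. focus on the rightmost column
        while cursor >= 0:
            da, db = paper[0, cursor], paper[1, cursor]    # 2. read relative to the focus
            total = da + db + carry                        # 3. compute
            paper[2, cursor] = total % 10                  # 4. write back near the focus
            carry = total // 10
            cursor -= 1                                    # 5. move the focus one column left
        return "".join(map(str, paper[2])).lstrip("0") or "0"

    assert add_on_paper("987", "345") == "1332"

Given unbounded paper (columns) and unbounded steps, this loop handles arbitrarily long inputs; the whole question above is how much of that control flow a fixed-depth feed-forward model can emulate.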


You can't reliably teach an LLM maths the same way you can't take a locomotive offroading.


“Yes, it is debatable. Here are some arguments for and against the idea that more context leads to better decisions…”


Llama-1, 1T tokens, dumb as a box of rocks

Llama-2, 2T tokens, smarter than a box of rocks

Mistral-7B, 8T tokens, way smarter than llama-2

Llama-3, 15T tokens, smarter than anything a few times its size

Gemma-2, 13T synthetic tokens, slightly better than llama-3

(for the same approximate parameter size)

I think it roughly tracks that moar data = moar betterer.


> not smart

> slightly smarter

> way smarter

> Last one, "slightly smarter"

So, the usual s-curve that has an exponential phase, then tops out?


Pretty much, yep. There was definitely a more significant jump there in the middle where 7B models went from being a complete waste of time to actually useful. Then going from being able to craft a sensible response to 80% of questions to 90% is a much smaller apparent increase, but takes a lot more compute to achieve, as per the Pareto principle.


I see giant models like Intel chips over the last decade: big, powerful, expensive, energy hogs.

Small models are like ARM: you get much of the performance you actually need for common consumer tasks, very cheap to run, and energy efficient.

We need both, but I personally spend most of my ML time training small models and I’m very happy with the results.


but the OP was talking about the size of the context window, not the size of the training corpus


Hmm right, I read that wrong. Still, interesting data I think.


Most data around is junk, the internet produces junk data faster than useful data, and current GPT AIs basically regurgitate what someone already did somewhere on the internet. So I guess the more data we feed into GPTs, the worse the results will get.

My take to improve AI output is to heavily curate the data you feed your AI, much like the expert systems of old (which were also lauded as "AI"). Maybe we can break the vicious circle of "I trained my GPT on billions of Twitter posts and let it write Twitter posts to great success", "Hey, me too!"
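
As a minimal sketch of what that curation might look like in practice (the thresholds and heuristics here are invented purely for illustration): exact dedup plus a couple of crude quality filters before anything reaches the training set.

    import hashlib

    def curate(texts, min_words=20, max_symbol_ratio=0.3):
        """Drop exact duplicates and obviously low-quality snippets before training."""
        seen, kept = set(), []
        for t in texts:
            key = hashlib.sha256(t.strip().lower().encode()).hexdigest()
            if key in seen:
                continue                                   # exact duplicate
            seen.add(key)
            symbols = sum(not c.isalnum() and not c.isspace() for c in t)
            if len(t.split()) < min_words:
                continue                                   # too short to carry much signal
            if symbols / max(len(t), 1) > max_symbol_ratio:
                continue                                   # mostly markup, emoji or boilerplate
            kept.append(t)
        return kept

Real pipelines layer near-duplicate detection, quality classifiers and human review on top, but the shape is the same: throw most of the raw data away.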


There are multiple companies hiring people on contracts to curate and generate data for this. I do confidential contract work for two different ones at the moment, and while my NDAs limit how much I can say, it involves both identifying issues with captured prompt/response pairs that have been filtered, and writing synthetic ones from scratch aided by models (e.g. come up with a coding problem, and rewrite the response to be "perfect").

The first category has obviously been pre-filtered to put cheaper resources on simpler problems, as these projects sometimes pay reasonable tech contract rates for 1-2 hours of work to improve only 2-3 turns of a single conversation, and it's clear they usually involve more than one person reviewing the same data.

A lot of money is pouring into that space, and the moats, in the form of proprietary training data heavily curated by experts, are going to grow rapidly given how much cash the big players have.


Thanks for your insights! Would you say this is an approach suited for "general" GPTs (not in the sense of AGI), or more for expert systems like Copilot?


I can't really say I know whether the outcomes are good as I won't be told to what extent the output makes it into production models, and I don't even always know which company it's for. But I know at least some of it is being used for "general" models. I do more code-related work than general purpose as it's the work I find most interesting, but the highest paid contract I've had in this space so far is for a general-purpose model that to my knowledge isn't available yet, for a model from a company you'd know (but I'm under strict NDA not to mention the company name or more details about the work).


> My take to improve AI output is to heavily curate the data you feed your AI

This is what OpenAI is doing with their relationships with companies like Reddit, News Corp etc:

https://openai.com/index/news-corp-and-openai-sign-landmark-...

Problem is that we have a finite amount of this type of information.


Massive surveillance, extracting data and using it for training. I hope this will not come to fruition.


Thankfully, we have stalwart and well-known defenders of our security like Apple and Microsoft to protect us. There's nothing of the sort to worry about.


Outlook already does summarization with Copilot. I use it every day. I think summarization is one of the strengths of LLMs and it really shines here:

https://support.microsoft.com/en-us/office/summarize-an-emai...


> integrate an LLM into Slack

They are already training their models https://slack.com/intl/en-gb/trust/data-management/privacy-p...

> Microsoft to integrate an LLM into outlook

Unlikely to happen. Orgs that use MS products do not want the content of emails leaking, and LLMs leak. There is a real danger that an LLM will include information in the summary that does not come from the original email thread, but from other emails the model was trained on. You could learn from the summary that you are going to get fired, even though that was not a part of the original conversation. HR doesn't like that.

> However, the more data you feed into an LLM, the smarter it should be in the response

Not necessarily. At some point you are going to run out of current data and you might be tempted to feed it past data, except that data may be of poor quality or simply wrong. Since LLMs cannot tell good data from bad, they happily accept both, leading to useless outputs.


>>Microsoft to integrate an LLM into outlook

Didn't they already do this? A friend of mine showed me his outlook where he could search all emails, docs, and video calls and ask it questions. To be fair, he and I asked it questions about a video call and a doc - but not any emails, we only searched emails.

This was last week and it worked "mostly OK", but having a Q&A conversation with a long email feels inevitable.


Asking questions about a document is one thing; asking questions that synthesize information across many documents — the human-intelligent equivalent of doing a big OLAP query with graph-search and fulltext-search parts on your email database — is quite another.

Right now AFAICT the latter would require the full text of all the emails you've ever sent, to be stuffed into the context window together.
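
A rough sketch of the middle ground most products actually ship instead: embed the mailbox once, retrieve the handful of emails most similar to the question, and stuff only those into the prompt. embed() and llm() below are stand-ins, not any particular vendor's API:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in for a call to some sentence-embedding model."""
        raise NotImplementedError

    def answer_over_mailbox(question, emails, llm, k=5):
        """Retrieve the k emails most similar to the question, then ask over just those."""
        q = embed(question)
        q = q / np.linalg.norm(q)
        scores = []
        for e in emails:
            v = embed(e)
            scores.append(float(np.dot(q, v / np.linalg.norm(v))))   # cosine similarity
        top = [e for _, e in sorted(zip(scores, emails), reverse=True)[:k]]
        prompt = ("Answer using only these emails:\n\n"
                  + "\n\n---\n\n".join(top)
                  + f"\n\nQuestion: {question}")
        return llm(prompt)

Retrieval like this handles "find the needle" questions reasonably well; the genuinely cross-cutting, OLAP-style questions are exactly where it falls short, as the parent says.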


Yes, this is already a thing. Copilot for M365 fine-tunes on your entire org's data.


So that's a hard stop for many orgs. Not everyone is supposed to see all of the org's data even if they work for it.


There are still controls on that, it doesn't give you access to everything. It took forever for legal to sign off on it.


They definitely added it to the web version on LinkedIn; you can see it when you want to write or reply to a message, and it gives you an option to "Write with AI".


> There is a real danger that an LLM will include information in the summary that does not come from the original email thread, but from other emails the model was trained on. You could learn from the summary that you are going to get fired, even though that was not a part of the original conversation. HR doesn't like that.

There could be separate personal fine-tunes per user, trained (in the cloud) on the contents of that user's mail database, which therefore have knowledge of exactly the mail that particular user can access, and nothing else.

AFAICT this is essentially what Apple is claiming they're doing to power their own "on-device contextual querying."


> There could be separate personal fine-tunes per user

Yes, but that contradicts the earlier claim that giving AI more info makes it better. In fact, those who send few emails or have just joined may see worse results due to the lack of data. LLMs really force us to come up with ideas for solving problems that did not exist without LLMs.


A problem looking for a problem to solve is how I like to think of it


> Still waiting for Microsoft to integrate an LLM into outlook so I can get a summary of a 20 email long chain I just got CCed into.

Still waiting for Microsoft to add an email search to Outlook that isn't complete garbage. Ideally with a decent UI and presentation of results that isn't complete garbage.

…why are we hoping that AI will make these products better, when they’re not using conventional methods appropriately, and have been enshittified to shit.


> the more data you feed into an LLM, the smarter it should be in the response

This is not obvious though.


It is, in theory. The more information you have, the better the decision, in theory.


Not quite. There are bounds on capacity of learning machines.

https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_di...


Anyone got a TL;DR on this?


Think about an AI with a 1-bit model. If you feed that AI data that can't possibly be classified with less than 2 bits, it can't get it precisely right, no matter how much data you train it on, or what the 1 bit of the model represents.

For any given size of system, there will be a ceiling on what it can learn to classify or predict with precision.

I used "system" rather than "model" there for a reason:

Memory in any form, such as context, RAG, or API access to anything that can store and retrieve data, affects that maximum: a Turing machine can be implemented with a very small model plus a loop if there's access to an external memory to act like the tape. But if the "tape" is limited, there will be some limitation on what the total system can precisely classify.
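
A concrete toy version of that ceiling: a "model" that is nothing but a single threshold t on the real line (predict 1 iff x > t) cannot realize all four labelings of even two points, no matter how much data it sees, because one labeling is simply outside its capacity:

    from itertools import product

    points = [1.0, 2.0]
    thresholds = [0.5, 1.5, 2.5]        # enough candidates to cover every distinct behaviour

    def predict(t, x):
        return 1 if x > t else 0        # the entire "model" is the single threshold t

    realizable = {tuple(predict(t, x) for x in points) for t in thresholds}
    print(set(product([0, 1], repeat=2)) - realizable)   # {(1, 0)} is unreachable

No threshold can label the left point 1 and the right point 0; in VC terms this model family's dimension is too small for that task, and more training data doesn't change it.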


It's quality, not quantity.

You need to have accurate, properly reasoned information for better decisions.


It’s a good thing that SFDC and Slack are both well known for being repositories of high-quality data.

/sarc


Slack literally launched that feature today! In fact I posted about it on HN this morning.

https://news.ycombinator.com/item?id=40841057


Both of these already exist. Slack just introduced AI, and Copilot for M365 products has been available for quite a while now. It works great, I use it every day.


Aren’t you afraid you're gonna be given hallucinated information?

That’s my biggest worry. I’ve set up my own RAG and it’s kind of sad how even that is not super accurate.


> Still waiting for Microsoft to integrate an LLM into outlook

Given that Microsoft this year is all-in on LLM AI, this is surely coming.

But perhaps it will be a premium, paid feature?


This exists and is available for paid users. My company started experimenting with it recently and it can be fairly helpful for long threads: https://support.microsoft.com/en-us/office/summarize-an-emai....


They will sooner do that than fix Teams....


The teams discussion was 3 days ago https://news.ycombinator.com/item?id=40786640

I am happy with my comments there, including "MS could devote resources to making MS Teams more useful, but they don't have to, so they don't."


It’s an additional $20-30/user/mo license, and has already existed in production for several months.


Why should the response be better just because there is "more data"?

Should I be adding extra random tokens to my prompts to make the LLM "smarter"?


More context. Not random data.


> Apple probably could have made a 3TB iPod

It's a very weird comparison, as putting more music tracks on your iPod doesn't make them sound better, while giving an LLM more parameters/computing power makes it smarter.

Honestly it sounds like a typical "I've drawn my conclusion, and now I only need an analogy that remotely supports my conclusion" way of thinking.


No, it makes sense: he's coming at it from the perspective of knowing exactly what task you want to accomplish (something like "fixing the grammar in this document.") In such cases, a model only has to be sufficiently smart to work for 99.9999% of inputs — at which point you cross a threshold where adding more intelligence is just making the thing bulkier to download and process, to no end for your particular use-case.

In fact, you would then tend to go the other way — once you get the ML model to "solve the problem", you want to then find the smallest and most efficient such model that solves the problem. I.e. the model that is "as stupid as possible", while still being very good at this one thing.
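
In practice that search is often just a plain loop: evaluate candidates from smallest to largest on a task-specific test set and keep the first one that clears the bar. The model names and eval harness below are placeholders, purely a sketch of the idea:

    def smallest_sufficient_model(candidates, eval_accuracy, target=0.999):
        """candidates: model ids ordered smallest to largest; eval_accuracy: id -> accuracy on the task."""
        for model in candidates:
            accuracy = eval_accuracy(model)
            if accuracy >= target:
                return model, accuracy   # cheapest model that is "good enough" for this task
        return None, None                # nothing clears the bar; scale up or rethink the task

    # e.g. smallest_sufficient_model(["tiny", "small", "base", "large"], run_grammar_fix_eval)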


If you have no conception of mathematics, do you think you'd get better at solving mathematics problems based on looking at more examples of people who may or may not be solving them correctly?


It has worked for humanity...


Only after humanity became a general intelligence. Our earlier ancestors were also pretty smart, but not smart enough to develop technology on their own; you have to be extremely smart to do what humanity did.


Is there a reason why memory was used as an example and not compute power? I don't understand how cherry-picking random examples from the past explains the future of AI. If he thinks the business need does not exist, he should explain how he arrived at that conclusion instead of using a random iPod example.


It's an analogy. He's making the point that even though something can scale at an exponential rate, it doesn't mean there is a business need for such scaling


This. The scaling of compute has vastly different applications than the scaling of memory. Shows once again that people who are experts in a related field aren't necessarily the best to comment on trendy topics. If e.g. an aeroplane expert critiques SpaceX's Starship, you should be equally wary, even though they might have some overlap. The only reason this is in the media at all is because negative sentiment toward hype generates many clicks. That's why you see these topics every day instead of Rubik's cube players criticising the latest version of Mikado.


The business case is absolutely there; it's just that the industry has weirdly latched onto 'chatbot' as the use case, as opposed to where the real value lies.

The pretrained model is where the enterprise gold is at.

But the companies building models past the tipping-point scale for that value to be derived are walling up their pretrained models behind very heavy-handed fine-tuning that strips away most of the business value.

The engineers themselves seem to lack the imagination for the business cases, and the enterprise market doesn't have access to start discovering the applications outside of 'chatbot,' particularly with large context windows of proprietary data fed into SotA pretrained models.

There's maybe a handful of people who actually realize what value is being left on the table, and I think most of them are smart enough not to currently be in positions to make it happen.



