Wtf? Once it was AI. Then the models started passing the Turing test and calling themselves AI, so we started using AGI to say "truly intelligent machines". Now, as per the definition you quoted, apparently even GPT-3 is AGI, so we now have to use "ASI" to mean "intelligent, but artificial"?
I think I'll just keep using AI, and then explain to anyone who uses that term that there is no "I" in today's LLMs, that they shouldn't use the term for some years at least, and that when it finally does apply, we'll have a big problem.
LLMs are artificial intelligence illusion engines: they only "reason" insofar as there's a ready-made answer in their training data that they can retrieve and, at best, tweak. Take them somewhere with no training data, give them new axioms for your specific problem, and watch them fail, serving up incorrect gibberish as a confident answer. Humans at any level of intelligence wouldn't behave like that.
TensorFlow is largely dead; it’s been years since I’ve seen a new repo use it. Go with JAX if you want a PyTorch alternative that can have better performance in certain scenarios.
You can actually generate surprisingly coherent text with minimal finetuning of BERT, by reinterpreting it as a diffusion model: https://nathan.rs/posts/roberta-diffusion/
I don’t see a useful definition of LLM that doesn’t include BERT, especially given its historical importance. 340M parameters is only “small” in the sense that a baby whale is small.
We are already at AGI. I don’t know how you can argue that LLMs don’t meet the definition of general artificial intelligence, as opposed to narrow AI like chess engines, image classifiers, AlphaGo or self-driving cars, which are trained with one objective and cannot be applied to any other task at all.
People have just moved the goalposts, imagine explaining Opus 4.6’s capabilities to someone even 10 years ago, it would definitely have been called AGI.
I highly doubt there will be a point where everyone will agree that we’ve achieved ASI, there will always be a Gary Marcus type finding some edge case where it performs poorly.
Yes, I agree. Just not in the direction you’re claiming.
> imagine explaining Opus 4.6’s capabilities to someone even 10 years ago, it would definitely have been called AGI.
No, it would have been called AI. A decade ago most people were not familiar with AGI as a term, that just got popularised because AI was taken over to be basically what we used to call ML.
> No, it would have been called AI. A decade ago most people were not familiar with AGI as a term, that just got popularised because AI was taken over to be basically what we used to call ML.
Define "most people". I don't think the average user of ChatGPT is familiar with the term AGI even now, but it's been used in the AI/ML community for multiple decades. I remember reading about the distinction between general and narrow AI around 2010 as an enthusiast. "Strong" vs. "weak" AI were also used with essentially the same meaning, though they're less common terms nowadays.
ASI still runs at finite speed and is limited by its hardware and by the speed of its interactions with the real world. It won’t be able to recursively improve itself overnight if it only generates 10 tokens per second, and a second company could very well train one of its own before the first has time to do much.
You're not thinking of the second-order meta system here. ASI isn't just one instance of an LLM responding to you in a session; it's a datacenter full of millions of LLM instances interacting with millions of users in parallel.
Well in that case wouldn't that be millions of ASIs, each with contradictory goals?
I'm not saying that ASI isn't an existential threat, just that it probably won't present itself like the fanciful sci-fi scenario of a singular intelligence suddenly crossing a magic threshold and being able to take over the world. Most likely it will be some scenario we won't have predicted, the same way hardly anybody predicted LLMs.
The author is correct in that agents are becoming more and more capable and that you don't need the IDE to the same extent, but I don't see that as good. I find that IDE-based agentic programming actually encourages you to read and understand your codebase as opposed to CLI-based workflows. It's so much easier to flip through files, review the changes it made, or highlight a specific function and give it to the agent, as opposed to through the CLI where you usually just give it an entire file by typing the name, and often you just pray that it manages to find the context by itself. My prompts in Cursor are generally a lot more specific and I get more surgical results than with Claude Code in the terminal purely because of the convenience of the UX.
But secondly, there's an entire field of LLM-assisted coding that's being almost entirely neglected, and that's code autocomplete models. Fundamentally they're the same technology as agents and should be doing the same thing: indexing your code in the background, filtering the context, etc., but they get much less attention and it does feel like the models are stagnating.
I find that very unfortunate. Compare the two workflows:
With a normal coding agent, you write your prompt, then you have to wait at least a full minute for the result (generally more, depending on the task), breaking your flow and forcing you to task-switch. Then it gives you a giant mass of code, and of course 99% of the time you just approve and test it because it's a slog to read through what it did. If it doesn't work as intended, you get angry at the model and retry your prompt, spending more tokens the longer your chat history grows.
But with LLM-powered auto-complete, when you want, say, a function to do X, you write your comment describing it first, just like you should if you were writing it yourself. You instantly see a small section of code, and if it's not what you want, you can alter your comment. Even if it's not 100% correct, multi-line autocomplete is great because you approve it line by line and can stop when it gets to the incorrect parts; you're not forced to task-switch and you don't lose your concentration, that great sense of "flow".
Fundamentally it's not that different from agentic coding - except instead of prompting in a chatbox, you write comments in the files directly. But I much prefer the quick feedback loop, the ability to ignore outputs you don't want, and the fact that I don't feel like I'm losing track of what my code is doing.
I agree with you wholeheartedly. It seems like a lot of the work on making AI autocomplete better (better indexing, context management, codebase awareness, etc) has stagnated in favor of full-on agentic development, which simply isn't suited for many kinds of tasks.
The reason the nearest neighbour interpolation can sound better is that the aliasing fills the higher frequencies of the audio with a mirror image of the lower frequencies. While humans are less sensitive to higher frequencies, you still expect them to be there, so some people prefer the "fake" detail from aliasing to it just being outright missing in a more accurate sample interpolation.
It's actually the other way round: Aliasing fills the lower frequencies with a mirror image of the higher frequencies. So where do the higher frequencies come from? From the upsampling that happens before the aliasing. _That_ makes the higher frequencies contain (non-mirrored!) copies of the lower frequencies. :-)
Just so that my wrongness isn't there for posterity: This is wrong for a real-valued signal (which is what we're discussing here). I had forgotten about the negative frequencies. So there _is_ a mirror coming from the upsampling. Sorry. :-)
I think I've heard the word “images” being used for these copies, yes.
Interpolation is a bit of a confusing topic, because the most efficient implementation is not the one that lends itself most easily to frequency analysis. But pretty much any rate change (be it up or down) using interpolation can be expressed equivalently using the following set of operations and appropriately chosen M and N:
1. Increase the rate by inserting M zeros between each sample. This has the effect of creating the “images” as discussed.
2. Apply a filter to the resulting signal. For instance, for nearest neighbor this is [1 1 1 … 0 0 0 0 0 …], with (M+1) ones and then just zeroes; effectively, every output sample is the sum of the previous M+1 input samples. This removes some of the original signal and then much more of the images.
3. Decrease the rate by taking every Nth sample and discarding the rest. This creates aliasing (higher frequencies wrap down to lower, possibly multiple times) as discussed.
The big difference between interpolation methods is the filter in #2. E.g., linear interpolation is effectively the same as a triangular filter, and will filter somewhat more of the images but also more of the original signal (IIRC). More fancy interpolation methods have more complicated shapes (windowed sinc, etc.).
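Those three steps can be sketched directly in NumPy (a toy sketch; the `resample` name and kernel choices are mine, and the triangular kernel is padded with zero endpoints, which just adds a one-sample delay):

```python
import numpy as np

def resample(x, M, N, kernel):
    """Rate change by (M+1)/N: zero-stuff, filter, decimate."""
    # 1. Insert M zeros between samples; this creates the spectral "images".
    up = np.zeros(len(x) * (M + 1))
    up[::M + 1] = x
    # 2. Filter. The kernel is what distinguishes interpolation methods.
    filtered = np.convolve(up, kernel)[:len(up)]
    # 3. Keep every Nth sample; whatever the filter left above the new
    #    Nyquist wraps back down as aliasing.
    return filtered[::N]

x = np.arange(5, dtype=float)
# Nearest neighbour = box filter of (M+1) ones: each input is held 4 times.
nearest = resample(x, 3, 1, np.ones(4))
# Linear interpolation = triangular filter peaking at 1.
linear = resample(x, 3, 1, np.bartlett(9))
```

With `x = [0, 1, 2, 3, 4]`, `nearest` starts `[0, 0, 0, 0, 1, 1, 1, 1, ...]` (sample-and-hold), while `linear` ramps in quarter steps between the input values, delayed by the filter's group delay.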
This also shows why it's useful to have some headroom in your signal to begin with, e.g. a CD-quality signal could represent up to 22.05 kHz but only has (by spec) actual signal up to 20 kHz, so that it's easier to design a filter that keeps the signal but removes the images.
And also, to add to the actual GBA discussion: If you think the resulting sound is too muffled, as many here do, you can simply substitute a filter with a higher cutoff (or less steep slope). E.g., you could use a fixed 12 kHz lowpass filter (or something like cutoff=min(rate/2, 12000)), instead of always setting the cutoff exactly at the estimated input sample rate. (In a practical implementation, the coefficients would still depend on the input rate.)
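A minimal windowed-sinc sketch of that clamped-cutoff idea (the function name, tap count, and the 32768 Hz rate are illustrative assumptions, not taken from any GBA emulator):

```python
import numpy as np

def lowpass_kernel(cutoff_hz, rate_hz, taps=101):
    """FIR lowpass via windowed sinc; the cutoff is clamped to Nyquist."""
    fc = min(cutoff_hz, rate_hz / 2) / rate_hz   # normalised cutoff, cycles/sample
    n = np.arange(taps) - (taps - 1) / 2
    h = 2 * fc * np.sinc(2 * fc * n)             # ideal lowpass impulse response
    h *= np.hamming(taps)                        # window to control ripple
    return h / h.sum()                           # normalise to unity gain at DC

# Fixed 12 kHz cutoff (i.e. cutoff = min(rate/2, 12000)) instead of tracking
# the estimated input sample rate:
kernel = lowpass_kernel(12000, 32768)
```

The coefficients still depend on the output rate, as noted above, but the cutoff no longer follows the estimated input rate down, so the result is less muffled.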
Odd that the author didn’t try giving a latent embedding to the standard neural network (or modulating the activations with a FiLM layer) and instead had static embeddings as the baseline. There’s no real advantage to using a hypernetwork; they tend to be unstable and difficult to train, and they scale poorly unless you train a low-rank adaptation.
Hello. I am the author of the post. The goal of this was to provide a pedagogical example of applying Bayesian hierarchical modeling principles to real-world datasets. These datasets often contain inherent structure that is important to explicitly model (e.g. clinical trials across multiple hospitals). Oftentimes a single model cannot capture this over-dispersion, but there is not enough data to split out the results (nor should you).
The idea behind hypernetworks is that they enable Gelman-style partial pooling, explicitly modeling the data generation process while leveraging the flexibility of neural network tooling. I’m curious to read more about your recommendations: their connection to the described problems is not immediately obvious to me, but I would be curious to dig a bit deeper.
I agree that hypernetworks have some challenges associated with them due to the fragility of maximum likelihood estimates. In the follow-up post, I dug into how explicit Bayesian sampling addresses these issues.
Thank you for reading my post, and for your thoughtful critique. And I sincerely apologize for my slow response! You are right that there are other ways to inject latent structure, and FiLM is a great example.
I admit the "static embedding" baseline is a bit of a strawman, but I used it to illustrate the specific failure mode of models that can't adapt at inference time.
I then used the Hypernetwork specifically to demonstrate a "dataset-adaptive" architecture as a stepping stone toward the next post in the series. My goal was to show how even a flexible parameter-generating model eventually hits a wall with out-of-sample stability; this sets the stage for the Bayesian Hierarchical approach I cover later on.
I wasn't familiar with the FiLM literature before your comment, but looking at it now, the connection is spot on. Functionally, it seems similar to what I did here: conditioning the network on an external variable. In my case, I wanted to explicitly model the mapping E->θ to see if the network could learn the underlying physics (Planck's law) purely from data.
As for stability, you are right that Hypernetworks can be tricky in high dimensions, but for this low-dimensional scalar problem (4D embedding), I found it converged reliably.
I think a latent embedding is almost equivalent to the article's hypernetwork, which I take to be y = (Wh + c)v + b, where h is a dataset-specific trainable vector. (The article uses multiple layers ...)
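For the single-layer form this is just algebra: y = (Wh + c)·v + b is bilinear in (h, v), so the hypernetwork view (generate per-dataset weights from h, then apply them to v) and the latent-embedding view (a function that is linear in h for a fixed input v) compute the same thing. A toy check (all names and shapes made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 3, 4                  # feature dim, dataset-embedding dim

W = rng.normal(size=(d_in, d_emb))  # "hypernetwork": maps embedding -> weights
c = rng.normal(size=d_in)
b = 0.5
h = rng.normal(size=d_emb)          # dataset-specific trainable vector
v = rng.normal(size=d_in)           # input features

# Hypernetwork view: generate per-dataset weights (W @ h + c), then apply them.
y_hyper = (W @ h + c) @ v + b

# Latent-embedding view: identical output, now written as linear in h.
y_latent = h @ (W.T @ v) + c @ v + b
```

With multiple layers the two stop being exactly equivalent, but the single-layer case shows why a plain latent embedding is a natural baseline.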
It absolutely is noticeable the moment you have to run several of these electron “apps” at once.
I have a MacBook with 16GB of RAM and I routinely run out of memory from just having Slack, Discord, Cursor, Figma, Spotify and a couple of Firefox tabs open. I went back to listening to mp3s with a native app to have enough memory to run Docker containers for my dev server.
Come on, I could listen to music, program, chat on IRC or Skype, do graphic design, etc., with 512MB of DDR2 back in 2006, and now you couldn’t run a single one of those Electron apps with that amount of memory. How can a billion-dollar corporation doing music streaming not have the resources to make a native app, when the Songbird team could do it for free back in 2006?
I’ve shipped cross-platform native UIs by myself. It’s not that hard, and with skyrocketing RAM prices, users might be coming back to 8GB laptops. There’s no justification for a big corporation not to have a native app other than developer negligence.
On that note, I could also comfortably fit a couple of chat windows (skype) on a 17'' CRT (1024x768) back in those days. It's not just the "browser-based resource hog" bit that sucks - non-touch UIs have generally become way less space-efficient.
Not to go all “ackchually”, but modern GPUs can render in many ways other than rasterising triangles, and they can absolutely draw a cylinder without any tessellation involved. You can use the analytical ray tracing formula, or signed distance fields for a practical way to easily build complex scenes purely with maths: https://iquilezles.org/articles/distfunctions/
Now of course triangles are usually the most practical way to render objects but it just bugs me when someone says something like “Every smooth surface you've ever seen on a screen was actually tiny flat triangles” when it’s patently false, ray tracing a sphere is pretty much the Hello World of computer graphics and no triangles are involved.
Edit: for CAD, direct ray tracing of NURBS surfaces on the GPU exists and lets you render smooth objects with no triangles involved whatsoever, although I’m not sure if any mainstream software uses that method.
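For reference, the "Hello World" in question is just the quadratic formula (a minimal sketch; the function name and conventions are mine, and the ray direction is assumed normalised):

```python
import numpy as np

def ray_sphere(origin, direction, center, radius):
    """Analytic ray-sphere intersection: no triangles, just algebra."""
    oc = origin - center
    b = np.dot(oc, direction)          # half the linear quadratic coefficient
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None                    # ray misses the sphere
    t = -b - np.sqrt(disc)             # nearest hit distance along the ray
    return t if t > 0 else None

# A ray from the origin straight down +z hits a unit sphere centred at z=5:
t = ray_sphere(np.zeros(3), np.array([0.0, 0.0, 1.0]),
               np.array([0.0, 0.0, 5.0]), 1.0)
# t == 4.0
```

The hit point and normal fall out analytically too, which is exactly why the surface stays perfectly smooth at any zoom level.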
LLMs are artificial general intelligence, as per the Wikipedia definition:
> generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming
Even GPT-3 could meet that bar.