
There's still plenty of data out there, including in other languages and undigitised books - and that's before you get to data in other modalities, like speech and videos. Synthetic data can also be used quite effectively if you're trying to distill a model instead of trying to grow capabilities, as Phi-1.5 demonstrates.
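The distillation idea mentioned above can be sketched roughly like this (a hypothetical toy example, not Phi-1.5's actual training setup): a small student model is trained on synthetic inputs to match a larger teacher's softened output distribution via a KL-divergence loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

teacher = nn.Linear(8, 4)   # stand-in for a large pretrained model
student = nn.Linear(8, 4)   # smaller model we want to train
opt = torch.optim.SGD(student.parameters(), lr=0.1)
T = 2.0                     # temperature softens the teacher's distribution

x = torch.randn(32, 8)      # synthetic inputs standing in for generated data
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / T, dim=-1)

def distill_loss():
    # KL(teacher || student) on temperature-scaled distributions,
    # scaled by T^2 as in standard distillation recipes
    log_p_student = F.log_softmax(student(x) / T, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * T * T

initial_loss = distill_loss().item()
for _ in range(200):
    loss = distill_loss()
    opt.zero_grad()
    loss.backward()
    opt.step()
final_loss = distill_loss().item()
```

After a couple of hundred steps the student's distribution moves toward the teacher's, so `final_loss` ends up well below `initial_loss`.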

For capability growth, well, we don't know what we don't know. There are still many unknowns when it comes to architecture, training, data, modalities, incremental learning, alignment, self-critique, and more. There are plenty of companies and governments trying to find their angle here.

Even if we're at the very peak of what LLMs are capable of -- which seems unlikely -- there's still potentially decades of research in making what we have more effective.


