As a last stand, I'd reiterate that different multi-threaded algorithms could be practical if the minimum switch time were much, much smaller: thread wakeup times on the order of hundreds or thousands of instructions, with the cache not substantially changed in between. For example, ping-pong threads that pass buffers back and forth for staged processing (sketched below). They would even benefit from the caching done by the other thread in the algorithm, which had just processed the same data.
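A minimal sketch of the pattern I mean, assuming POSIX threads; the stage functions and all names are made up for illustration. The cond_wait()/signal() pair is exactly the switch whose cost is in dispute; the point is that buf[] would still be hot when the partner wakes.

    /* Two threads take turns over one buffer, each applying its own stage. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stddef.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  turn = PTHREAD_COND_INITIALIZER;
    static int whose = 0;              /* 0 = stage A's turn, 1 = stage B's */
    static char buf[4096];             /* buffer ping-ponged between the stages */

    static void stage_a(char *b, size_t n)        /* fill: made-up work */
    { for (size_t i = 0; i < n; i++) b[i]++; }

    static void stage_b(const char *b, size_t n)  /* drain: made-up work */
    { printf("buf[0] is now %d\n", b[0]); (void)n; }

    static void *run_stage(void *arg)
    {
        int me = (int)(size_t)arg;
        for (int i = 0; i < 4; i++) {             /* four round trips each */
            pthread_mutex_lock(&lock);
            while (whose != me)
                pthread_cond_wait(&turn, &lock);  /* sleep until partner hands off */
            if (me == 0)
                stage_a(buf, sizeof buf);         /* produce while data is cached */
            else
                stage_b(buf, sizeof buf);         /* consume while data is cached */
            whose = !me;
            pthread_cond_signal(&turn);           /* wake partner: the costly switch */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, run_stage, (void *)0);
        pthread_create(&b, NULL, run_stage, (void *)1);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }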
All impossible now, with the large thread-switch stone around our necks.
But this is sort of like saying things would be so much better if we didn't have this annoying speed-of-light constraint. Yeah, if memory were all L3 instead of main memory, things would be faster, but it isn't.
Don't get me wrong, I'm all for primitives being lightweight; that was the whole motivation behind writing lmbench. You and I agree on the goal.
Where we part ways is what is possible. When context switches are being done in the time it takes to do ~140 cache misses, I don't see how it is possible to get to "much, much smaller". Smaller, sure, but much smaller? How? Which cache misses are you going to not do?
What I'm looking for is actual insight into the problem. If you've looked at the code, you know the hardware, you know each cache miss on a first-name basis, and you are saying "right here, these 70 misses are not needed, we can get 2x better if we lose these", that's something I and everyone else would want to know. We already agree that less is more; the point is how you get to less. Is it even possible?
Lemme run some numbers for i686 25 years ago. Back then we were at about 34 cache misses per context switch; now we are at about 140. So it would seem like there is some room there. But back then we didn't have the lock traffic we have today, and who knows what else they have shoved into the kernel since. Let's ignore the lock traffic and the extra bloat and say we could get back to 34 misses. That would get us to about 2 usecs rather than 7 usecs. Which is better, but does it meet your definition of "much, much smaller"?
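Spelling that arithmetic out (assuming the misses dominate the switch cost, which is what ~140 misses in ~7 usecs implies):

    7 usecs / 140 misses  ~= 50 ns per miss
    34 misses * 50 ns     ~= 1.7 usecs, call it 2 usecs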
That's where I'm coming from: I'm sizing the problem. I have no idea if they can shave it down to 34 misses; I very much doubt it. But I think we can agree that, given that the kernel is fine-grained threaded, they aren't going to do better than the numbers they had when it wasn't threaded, right? That's a lower bound in practice, agreed? So if the cache miss fairy showed up and gave us our best-case scenario, we're about 3x faster. Which is nice, but not life changing. And I don't believe for a second that there is 3x worth of easy-to-remove cache misses/bloat in that code path; Linus would have to be asleep at the wheel to allow that.
Zero cache misses would be possible. If threads handed off data in a fine-grained fashion, with the cache essentially unchanged between switches, it would be just like procedure call and return. In fact, maybe instead of procedure call and return. Impossible with the current model; possible if thread switching were lightweight.
The idea is that threads are expensive (cache misses) now because switches happen infrequently, so the cache goes stale in between. But if switches were as common as call/return, they'd behave at least as well, except for today's lead weight of the kernel switch, register reload, and state reload.
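As an existence proof for that direction (my sketch, not anyone's proposal): user-level context switching already costs roughly a procedure call plus a register save, because there is no kernel entry and the cache is left alone. A minimal example with the old-but-real ucontext API:

    /* Ping-pong between main and a user-level context with swapcontext():
     * no kernel entry, no ~140 misses, just registers and a stack switch. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t ctx_main, ctx_worker;
    static char stack[64 * 1024];             /* stack for the worker context */

    static void worker(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("worker pass %d\n", i);
            swapcontext(&ctx_worker, &ctx_main);  /* "switch" back: about the
                                                     cost of a procedure return */
        }
    }

    int main(void)
    {
        getcontext(&ctx_worker);
        ctx_worker.uc_stack.ss_sp = stack;
        ctx_worker.uc_stack.ss_size = sizeof stack;
        ctx_worker.uc_link = &ctx_main;       /* resume main if worker returns */
        makecontext(&ctx_worker, worker, 0);

        for (int i = 0; i < 3; i++)
            swapcontext(&ctx_main, &ctx_worker);  /* ping-pong, all in user space,
                                                     cache left essentially intact */
        return 0;
    }

Each swapcontext() here is tens of instructions; the gap between that and ~140 misses is the whole argument.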