Hacker News

The x86 architecture is a modified Harvard architecture: close to the CPU (the L1 cache), memory is divided into 'instructions' and 'data', while further from the CPU it is joined. L2, L3, and RAM are generally 'unified', i.e. they can contain either 'instructions' or 'data'.

This thesis proposes that separating memory into 'instructions', 'stack', and 'heap' yields a performance increase for functional languages.

Additionally, this is targeted at software that makes a large number of function calls, such that on current architectures the time spent making function calls exceeds the actual computational cost of the code.

Personal opinion: maybe this is faster at the innermost levels of memory if the code makes lots of function calls. At the outermost levels of memory, many things rely on the ability of 'memory' to be treated as either 'data' or 'instructions'; JIT compilation, for instance. This implies that to run general code there would need to be a separation process similar to what occurs between the L2 and L1 caches in current processors. I'm not sure this would ultimately result in a performance increase for general-purpose processors.



This is an extension of another paper from a few years ago in which they described the split architecture; this paper describes some optimizations for the hardware and for the compiler they use with it. The new optimizations improve performance by a factor of about 6 over their earlier work.

Additionally, I don't think this processor executes code quite linearly. Its hardware can detect and break down functional code and run it in parallel; they make full use of their multiport split memory to do roughly eight times as much work per cycle as a (heavily pipelined!) Core 2 Duo. I admit it probably won't work on iterative code, but there's enough functional code floating around that this could see some use as a coprocessor.


Presumably if they were doing anything as complicated as a modern CPU they would have a unified last level of memory and some mechanism for guaranteeing synchronization.

Honestly, even C execution could probably be sped up by having separate stack and heap memory pipelines. To do that efficiently on an OoO machine you'd probably need three sets of memory-access instructions: two for when you know ahead of time which part of memory (stack or heap) you're dealing with, and one for when you don't. The first two would be there to help out the scheduler.

Thinking about the C ABI a bit more, and how a function call can take a pointer to who-knows-what place in memory, maybe this isn't actually such a great idea in practice for C, but it should be great for languages that can provide more guarantees.


>Honestly, even C execution could probably be sped up by having separate stack and heap memory pipelines.

Later versions of the Alpha AXP architecture did almost exactly that: they reordered memory accesses based on their addresses. Alpha could execute a load after a store without blocking, provided the load address did not clash with the store address.

It helped them a lot, given that they had 80 registers at their disposal.


Any modern x86 processor can do that too. The thing is, the complexity of the circuitry involved grows faster than linearly, so being able to break it out into two sets of load/store queues would let you increase bandwidth by a fair bit.



