Hacker News

The x86 architecture is a modified Harvard architecture: close to the CPU (the L1 cache), memory is divided into 'instructions' and 'data', while further from the CPU it is joined. L2, L3, and RAM are generally 'unified', i.e. they can contain either 'instructions' or 'data'.

This thesis proposes that separating memory into 'instructions', 'stack', and 'heap' yields a performance increase for functional languages.

Additionally, this is targeted at software that makes a large number of function calls, such that on current architectures the time spent making function calls exceeds the actual computational cost of the code.

Personal opinion: maybe this is faster at the innermost levels of memory if the code makes lots of function calls. At the outermost levels of memory, many things rely on the ability of 'memory' to be treated as either 'data' or 'instructions'; JIT compilation, for instance. This implies that to run general code there would need to be a separation process similar to what occurs between the L2 and L1 caches in current processors. I'm not sure this would ultimately result in a performance increase for general-purpose processors.



This is an extension of another paper from a few years ago in which they described the split architecture; this paper describes some optimizations for the hardware and for the compiler they use with it. The new optimizations improve performance by a factor of about 6 over their earlier work.

Additionally, I don't think this processor executes code quite linearly. Its hardware can detect and break down functional code and run it in parallel; they make full use of their multiport split memory to do roughly eight times as much work per cycle as a (heavily pipelined!) Core 2 Duo. I admit it probably won't work on iterative code, but there's enough functional code floating around that this could see some use as a coprocessor.


Presumably if they were doing anything as complicated as a modern CPU they would have a unified last level of memory and some mechanism for guaranteeing synchronization.

Honestly, even C execution could probably be sped up by having separate stack and heap memory pipelines. To do that efficiently on an OoO machine you'd probably need three sets of memory-access instructions: two for when you know ahead of time which part of memory (stack or heap) you're dealing with, and one for when you don't. The first two would be there to help out the scheduler.

Thinking about the C ABI a bit more, and how a function call can take a pointer to who-knows-what place in memory, maybe this isn't actually such a great idea in practice for C, but it should be great for languages that can provide more guarantees.


>Honestly, even C execution could probably be sped up by having separate stack and heap memory pipelines.

Later versions of the Alpha AXP architecture did almost exactly that: they reordered memory accesses based on their addresses. Alpha could execute a load after a store without blocking, provided the load address did not clash with the store address.

It helped them a lot, given that they had 80 registers at their disposal.


Any modern x86 processor can do that too. The thing is, the complexity of the circuitry involved grows faster than linearly, so being able to break it out into two sets of load/store queues would let you increase bandwidth by a fair bit.



