Eliminating the Call Stack to Save RAM [pdf]

userbinator · on July 30, 2015

Amongst those who use Asm regularly this is a fairly well-known technique... in BIOS code pre-memory-initialisation, for example; there is absolutely no RAM available at that point, so even a call stack can't be used, but you can still reuse code by putting the "return address" in a register and then jumping to the common block of code:

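    mov sp, ret1    ; no stack yet, so SP is free to hold the "return address"
    jmp block
  ret1:
    ...
    mov sp, ret2    ; second call site, different return address
    jmp block
  ret2:
    ...
  block:            ; shared code, entered by jmp rather than call
    ...
    jmp sp          ; "return" to whichever address the caller loaded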

You can even go somewhere else after that block instead of continuing like a function call would, just by modifying the destination. No recursion is possible with this, but I've seen BIOS code do more than one "level" of "calls" by using more "return address" registers (these tend to be BP and SP).

It's funny to see this technique being rediscovered/reinvented (AFAICS the authors seem to think this "flattening" is an entirely new idea), and somewhat poorly too - there's no need to use switch statements and their associated complexity to map integers back into addresses, when all that's really required is for the "caller" to supply a where-to-go-next address. This is possible on any CPU that has an indirect-jump type of instruction. I've seen this in Z80 and 6502 code, as well as x86; it's present in early PC applications that were handwritten Asm.

https://sites.google.com/site/pinczakko/pinczakko-s-guide-to...

jerf · on July 30, 2015

> there's no need to use switch statements and their associated complexity to map integers back into addresses, when all that's really required is for the "caller" to supply a where-to-go-next address.

There's nothing quite like implementing an abstraction, then turning right around and unimplementing the abstraction as a layer on top. (See also: using relational DBs as key/value stores, with bonus points if the result implements a hierarchy that looks like a file system; implementing unreliable data delivery on top of TCP, which can be done by reconnecting; taking a character/block device and implementing block/character access; implementing streaming on top of page-based abstractions like HTTP, itself implemented on top of streaming via TCP. There are "official" ways to do that last one, but for a long time it qualified.) If you're wondering where the CPU cycles are going....

Dylan16807 · on July 30, 2015

I don't understand what abstraction is being unimplemented with a layer on top here. Whether you use a stack or avoid it with [multiple] link registers, those are both reasonable methods of getting back and neither one builds on top of the other or undoes any work that's already been done. Whether to use tokens or addresses has tradeoffs, but I still don't see any work being unimplemented.

At worst they avoided a stack and made an implementation that was unoptimized in an entirely unrelated manner, not because it unimplemented anything on top.

caf · on July 30, 2015

On Itanium, this is part of the standard calling convention - the call opcode stores the return address in a register rather than in memory. It's up to the callee to save and restore it if it's going to call another function itself.

repiret · on July 30, 2015

I think most architectures are like that.

caf · on July 31, 2015

More modern architectures tend to be. Apart from Itanium, other examples of architectures that store the return address in a register are Alpha, PowerPC and ARM.

Architectures that store the return address directly on the stack include 6502, Z80, m68k and x86.

kabdib · on July 30, 2015

Yeah. I've written the boot code for some consumer machines, and the first few thousand cycles, when you have no RAM until you initialize the controllers and test memory, make for pretty neat code to write.

It's a time-honored coding technique, which I first saw on the 68000-based Macs in the 80s, and which probably dates back much earlier. Some Atari 2600 games didn't even use the stack pointer register, except as another slightly awkward temp register.

dezgeg · on July 30, 2015

> there's no need to use switch statements and their associated complexity to map integers back into addresses, when all that's really required is for the "caller" to supply a where-to-go-next address.

Do note that on 8-bit microcontroller ISAs it can be very cumbersome to load an absolute address and do an indirect jump into it, because code addresses could be 16 or 24 bits wide, with only 8-bit general purpose registers.

Taniwha · on July 30, 2015

Yes, it's also a common technique in embedded multi-threaded code where there may be room for a stack, but only for one small stack that has to be shared between all threads.

I've also seen it used in Verilog code: if you sprinkle @(posedge clk) event waits through an always block, the synthesis tool creates some buried state (flops) to automatically unroll the code fragments into the equivalent of a case statement. (Don't do this; it tends to make crappy gates.)

dferlemann · on July 30, 2015

Freeing up 20% of RAM at the cost of a 14% increase in ROM usage is fine and all... Code readability is kind of an issue here as well.

WallWextra · on July 30, 2015

Just to be clear, this is a compiler optimization. Do you mean the readability of the generated assembly code?

dferlemann · on July 30, 2015

I was pointing to the practice of flattening code.

srean · on July 30, 2015

Can't read the PDF on my phone, so a quick question: wouldn't CPS (continuation-passing style) help?