Do you have a reference for that? I might be looking in different places, but my reading suggests that loop rotation is the transformation where a loop's conditional test is moved to the end to avoid an unconditional branch, sometimes done by the compiler. "Sheep race" with its wasted computation is a little different - it's not a transformation it would be legal for a compiler to make, and it possibly doesn't have enough use cases to really have its own well-known name.
Oddly enough, I've noticed that compiler-generated code will sometimes not do loop rotation, and will have a conditional branch in the middle and an unconditional one at the end. I suspect this is because (a) unconditional branches are often effectively free on modern processors (if all the instructions are cached appropriately), and (b) a good compiler does all sorts of subtle instruction layout and ordering to map things efficiently to the underlying microarchitecture, which means that non-intuitive orderings sometimes have a better cost function.
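For anyone following along, here is a minimal source-level sketch of what loop rotation looks like. The compiler actually does this on its intermediate representation or at the assembly level, but the shape is the same as rewriting a `for` loop as a guarded `do-while` (the function names here are just for illustration):

```c
#include <assert.h>

/* Un-rotated loop: the condition is tested at the top of each
 * iteration, so the loop body ends with an unconditional branch
 * back up to the test. */
static int sum_unrotated(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)   /* test at top, jump back at bottom */
        s += a[i];
    return s;
}

/* Rotated loop: a guard test runs once up front, and each iteration
 * ends with the conditional branch itself, so the unconditional
 * back-branch disappears from the loop body. */
static int sum_rotated(const int *a, int n) {
    int s = 0;
    int i = 0;
    if (i < n) {                  /* guard: skip the loop if empty */
        do {
            s += a[i];
            i++;
        } while (i < n);          /* conditional branch at the bottom */
    }
    return s;
}
```

Both versions compute the same result; the rotated form just saves one taken branch per iteration (when that matters at all, per the point above about unconditional branches often being free).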
I did find out that what I called "interleaving" is more properly called "software pipelining".
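To make the interleaving idea concrete, here is a rough sketch of iterating two independent Mandelbrot points in lockstep (all names and the escape-check style are illustrative, not taken from the post). Because the two dependency chains are independent, the CPU can overlap the latency of one point's multiplies with work on the other. It also shows the "sheep race" cost: once one point escapes, it keeps iterating uselessly until the loop ends.

```c
#include <assert.h>

/* Iterate two Mandelbrot points together. n0/n1 record the iteration
 * at which each point escaped (|z|^2 > 4), or max_iter if it never did.
 * An escaped point keeps being iterated - wasted work traded for
 * keeping both dependency chains in flight. */
static void mandel_pair(double cx0, double cy0, double cx1, double cy1,
                        int max_iter, int *out0, int *out1) {
    double x0 = 0, y0 = 0, x1 = 0, y1 = 0;
    int n0 = max_iter, n1 = max_iter;
    for (int i = 0; i < max_iter; i++) {
        double xx0 = x0 * x0, yy0 = y0 * y0;   /* chain 0 */
        double xx1 = x1 * x1, yy1 = y1 * y1;   /* chain 1, independent */
        if (n0 == max_iter && xx0 + yy0 > 4.0) n0 = i;
        if (n1 == max_iter && xx1 + yy1 > 4.0) n1 = i;
        y0 = 2 * x0 * y0 + cy0;  x0 = xx0 - yy0 + cx0;
        y1 = 2 * x1 * y1 + cy1;  x1 = xx1 - yy1 + cx1;
    }
    *out0 = n0;
    *out1 = n1;
}
```

Strictly speaking, software pipelining overlaps stages of successive iterations of one loop, but interleaving independent work items like this achieves the same goal of hiding instruction latency.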
Thank you! Part 2 is going to largely be a rehash of some of the usual algorithmic tricks, although they do have to be adapted somewhat to using a pipeline.
There are several issues that I can see with trying this level of optimization on a GPU. Firstly, given the limited depth available, renders with doubles are going to be pretty fast already (as long as the GPU supports doubles, which some older GPUs and the ones in phones don't), so there is pretty limited bang for the buck there.
Secondly, it is really easy to get good information about CPU microarchitecture and timings (on x86 anyway, thank you Agner Fog), whereas I'm not aware of that level of detail for graphics cards, so they are a bit of a black box. Compounding this, GLSL does compile down to an intermediate code (SPIR-V), but then each graphics card converts that to its internal representation in its own opaque way. It's much harder to optimize when all you can do is try things and look at timings. If I get enough bandwidth, I may look into this.
Thank you! I had no idea that this had such communities either, and I've now got some links to check out. I was just picking a small problem because I'm job hunting and had some time to show some of what I can do - for this problem I could give decent coverage of some of the optimization possibilities without writing a whole book.
I would fit far more naturally into the second community, as I will normally avoid using fixed point wherever possible. I only went with doubles because it made it so much easier to highlight what can be done with the pipelining.
Part of the attraction of floating point is that a lot more CPU silicon goes into floating point than integer - so for instance AVX2 supports multiplication of 64 bit doubles but not 64 bit integers. Before this post, the last time I looked at the Mandelbrot Set was on a PowerPC750 - the best I could do with 32 bit integers was 22 clock cycles, the best I could do with floating point was (I think) 11 clock cycles for doubles, 9 clock cycles for floats, simply because floating point pipelined and integers didn't.
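For readers who haven't worked with fixed point, here is a minimal sketch of why it is more awkward than doubles. The Q4.28 format and helper names here are made up for illustration: every multiply needs a widening 64-bit product and a shift, which is exactly the kind of wide integer work that didn't pipeline on that era of hardware and that AVX2 still has no packed 64-bit multiply for.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Q4.28 fixed point: a value v represents v / 2^28,
 * giving a range of roughly [-8, 8), which covers the Mandelbrot
 * escape region. */
#define FRAC_BITS 28

static int32_t fx_from_double(double d) {
    return (int32_t)(d * (1 << FRAC_BITS));
}

/* Fixed-point multiply: widen to 64 bits so the full product fits,
 * then shift back down to rescale. A double multiply is one
 * instruction; this is a widening multiply plus a shift. */
static int32_t fx_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}
```

(This relies on arithmetic right shift of negative values, which is implementation-defined in C but what every mainstream compiler does.)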