All this should in theory make calling functions blindingly fast (not that function calls are especially slow to begin with), so long as you don't pass more than eight registers' worth of arguments (which is plenty for most people). However, as the above link points out, whenever another thread wants the core, the whole lot has to be flushed to memory, whether it was in use or not (I suppose you could track and flush only the used registers, but it probably wouldn't win you anything).
It occurred to me, while working on a piece of code some 40-odd stack frames below the main run loop, that function calls would get more expensive once you ran out of buffer, so I wrote a little benchmark to test it.
The benchmark is fairly simple: it calls the following function a few thousand times, and you can set the initial first argument to whatever you want.
int recursive(int nLevel, int val1, int val2)
{
    if (0 == nLevel)
        return val1 + val2;
    else
        return recursive(nLevel - 1, val2, (int)&val1);
}
(Casting the address of val1 to an int was the easiest way I found to stop the optimiser inlining the whole damn thing; bit clever, that optimiser!)
I ran it with a selection of recursion levels, optimised and unoptimised, on Solaris x86 and Sparc boxes.
As you can see, the cost of function calls leaps up suddenly after about seven frames on the Sparc box but stays constant on the x86 box. (I adjusted the number of iterations on the two boxes to normalise the figures, so they aren't really directly comparable; my x86 box was a lot faster than my Sparc box.) I had initially wondered if I would see a jump every seven or eight frames as the whole register file was dumped to memory, but in retrospect that would have been a stupid way to implement it, and obviously only one frame is flushed at a time.
Incidentally the graph stays the same no matter how many initial frames you are down when you start the benchmark, proving that frames aren't loaded back in until they are needed. This is fairly obvious when you think about it but I checked anyway.
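That depth-offset check can be sketched by burning some padding frames before starting the benchmark proper. Here descend and its pad count are my own illustrative names, and recursive is the function above with the pointer cast widened for 64-bit; compile without optimisation, or the tail call may be turned into a loop and no frames get burnt:

```c
#include <stdint.h>

/* The benchmark function from the post, 64-bit-safe cast. */
int recursive(int nLevel, int val1, int val2)
{
    if (0 == nLevel)
        return val1 + val2;
    else
        return recursive(nLevel - 1, val2, (int)(intptr_t)&val1);
}

/* Burn nPad stack frames, then run the benchmark from down there,
   to check that the starting depth doesn't change the per-call cost. */
int descend(int nPad, int depth)
{
    if (nPad == 0)
        return recursive(depth, 1, 2);
    return descend(nPad - 1, depth);
}
```

Timing recursive(depth, ...) called directly and via descend(40, depth) should trace the same curve.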
The other interesting thing to note is how much faster the first seven frames are with the optimised Sparc calls, after which they cost about the same as the unoptimised version. The only optimisation to the function call itself was to replace a tortuous series of dependent ORs with a single MOV:
Vanilla:
0x00010fbc: recursive+0x002c: add %l2, -1, %o0
0x00010fc0: recursive+0x0030: add %fp, 72, %o2
0x00010fc4: recursive+0x0034: call recursive ! 0x10f90
0x00010fc8: recursive+0x0038: or %i2, %g0, %o1
0x00010fcc: recursive+0x003c: or %o0, %g0, %l7
0x00010fd0: recursive+0x0040: or %l7, %g0, %i0
0x00010fd4: recursive+0x0044: ret
Optimised:
0x00010f58: recursive+0x001c: add %i0, -1, %o0
0x00010f5c: recursive+0x0020: add %fp, 72, %o2
0x00010f60: recursive+0x0024: call recursive ! 0x10f3c
0x00010f64: recursive+0x0028: mov %i2, %o1
0x00010f68: recursive+0x002c: ret
(There were other changes before the conditional jump that will account for some of the speedup too. Note that on Sparc, mov is itself a synthetic instruction assembled as an or against %g0, so the win comes from collapsing the chain of dependent register copies rather than from the instruction itself.)
The fact that this improvement is more or less invisible once we start flushing the buffer goes to show just how much the flush costs.
I have no idea why the x86 lines wobble so much.
The conclusion, then, is to try to avoid regularly bouncing more than seven stack frames deep if you are targeting Sparc machines. Unfortunately, with my 40-plus frames there was little I could do about that.
* Well I think it's nifty. I've read criticisms that the feature makes Sparc chips hard to lay out and is the main reason they have never reached the clock speeds of other architectures.**
** That said, in my experience hardware engineers are always moaning about some feature making layout difficult and if they ever moan enough to make the architect compromise they just seamlessly move on to moaning about something else, so I still think it's nifty.