Wednesday, 28 September 2011

Sparc's Overlapping Register Windows

One of the nifty* features of Sparc chips is that the register file is in fact just a window onto a larger circular buffer. In Sparc you have 32 registers: eight of these are globals and do not interest us; of the remaining 24, eight are shared with the calling stack frame and another eight will be shared with any called stack frame. There is a good explanation here: http://www.sics.se/~psm/sparcstack.html

All this should in theory make calling functions blindingly fast (not that function calls are especially slow) so long as you don't pass more than eight registers' worth of arguments (which is plenty for most people). However, as the above link points out, whenever another thread wants the core you have to flush the whole lot to memory, whether it was in use or not (I suppose you could track and only flush the registers actually in use, but it probably wouldn't win you anything).

It occurred to me, while working on a piece of code some 40-odd stack frames below the main run loop, that function calls would get more expensive once you ran out of buffer, so I wrote a little benchmark to test it.

The benchmark is fairly simple: it calls the following function a few thousand times, and you can set the initial first argument (the recursion depth) to whatever you want.

int recursive(int nLevel, int val1, int val2)
{
        if (0 == nLevel)
                return val1 + val2;
        else
                return recursive(nLevel-1, val2, (int)&val1);
}

(Casting the address of val1 was the easiest way I found to stop the optimiser inlining the whole damn thing; a bit clever, that optimiser!)
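For reference, a driver along these lines is all it takes (a sketch only: the iteration count, the gettimeofday timing and the way the result is kept alive are my choices here, not necessarily what the original harness did):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int recursive(int nLevel, int val1, int val2);   /* the function above */

int main(int argc, char **argv)
{
        int depth = (argc > 1) ? atoi(argv[1]) : 7;  /* recursion level under test */
        int iterations = 100000;
        int sink = 0;
        struct timeval start, end;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < iterations; i++)
                sink += recursive(depth, i, i + 1);  /* keep the result live so nothing is folded away */
        gettimeofday(&end, NULL);

        printf("depth %d: %ld us (result %d)\n", depth,
               (long)((end.tv_sec - start.tv_sec) * 1000000L +
                      (end.tv_usec - start.tv_usec)), sink);
        return 0;
}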

I ran it with a selection of recursion levels, optimised and unoptimised, on Solaris x86 and Sparc boxes.

[Graph: time per call against recursion depth for the Sparc and x86 boxes, optimised and unoptimised.]
As you can see, the cost of function calls suddenly leaps up after a depth of about seven on the Sparc box but stays constant on the x86 box. (I adjusted the number of iterations on the two boxes to normalise the figures, so they aren't really directly comparable; my x86 box was a lot faster than my Sparc box.) I had initially wondered if I would see a jump every seven or eight frames as the whole register file was dumped to memory, but in retrospect that would have been a stupid way to implement it, and obviously only one frame is flushed at a time.

Incidentally, the graph stays the same no matter how many frames deep you already are when you start the benchmark, which shows that frames aren't loaded back in until they are needed. This is fairly obvious when you think about it, but I checked anyway.
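Checking it was just a matter of burying the timing loop under some extra frames first; something like this (buried() and its arguments are my own illustrative names, and it wants compiling unoptimised or the tail call may get flattened into a loop):

int recursive(int nLevel, int val1, int val2);   /* the benchmark function above */

int buried(int extra, int depth, int iterations)
{
        int i, sink = 0;
        if (extra > 0)
                return buried(extra - 1, depth, iterations);  /* descend the extra frames first */
        for (i = 0; i < iterations; i++)
                sink += recursive(depth, i, i + 1);           /* then run the timing loop as before */
        return sink;
}

Whatever value of extra you start with, the frames below the loop get flushed once and then sit in memory untouched, so the shape of the graph doesn't change.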

The other interesting thing to note is how much faster the first seven frames are on the optimised Sparc calls, after which it's about the same cost as the unoptimised version. The only optimisation to the function call was to replace a tortuous series of dependent ORs with a single MOV:

Vanilla:
0x00010fbc: recursive+0x002c:   add      %l2, -1, %o0
0x00010fc0: recursive+0x0030:   add      %fp, 72, %o2
0x00010fc4: recursive+0x0034:   call     recursive      ! 0x10f90
0x00010fc8: recursive+0x0038:   or       %i2, %g0, %o1
0x00010fcc: recursive+0x003c:   or       %o0, %g0, %l7
0x00010fd0: recursive+0x0040:   or       %l7, %g0, %i0
0x00010fd4: recursive+0x0044:   ret      

Optimised:

0x00010f58: recursive+0x001c:   add      %i0, -1, %o0
0x00010f5c: recursive+0x0020:   add      %fp, 72, %o2
0x00010f60: recursive+0x0024:   call     recursive      ! 0x10f3c
0x00010f64: recursive+0x0028:   mov      %i2, %o1
0x00010f68: recursive+0x002c:   ret  


(there were other changes before the conditional jump that will account for some of the speedup too)

The fact that this improvement is more or less invisible once we start flushing the buffer goes to show just how much the flush costs.

I have no idea why the x86 lines wobble so much.

The conclusion, then, is to try to avoid regularly bouncing up above seven stack frames if you are targeting Sparc machines. Unfortunately, with my 40-plus frames there was little I could do about that.

* Well I think it's nifty. I've read criticisms that the feature makes Sparc chips hard to lay out and is the main reason they have never reached the clock speeds of other architectures.**


** That said, in my experience hardware engineers are always moaning about some feature making layout difficult, and if they ever moan enough to make the architect compromise they just seamlessly move on to moaning about something else, so I still think it's nifty.

Friday, 16 September 2011

'Soft Deadlock'

Came across a funny situation with a standard worker pool pattern. When the workers had no work to do they basically did this (there's a rough code sketch after the list):

1. Lock mutex.
2. Do a cond_timedwait waiting for work (this has its own mutex, not the one from above).
3. Unlock mutex.
4. Goto 1. Do not sleep, pass go, collect £200, anything ...
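
In pthreads terms the idle loop looked roughly like this (a sketch of the pattern only; pool_mutex, work_mutex, work_cond and the one-second timeout are stand-ins, not the real names or values):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;  /* guards the work sources */
static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;  /* paired with the cond var */
static pthread_cond_t  work_cond  = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
        for (;;) {
                struct timespec deadline;

                pthread_mutex_lock(&pool_mutex);                /* 1. lock mutex */

                clock_gettime(CLOCK_REALTIME, &deadline);
                deadline.tv_sec += 1;
                pthread_mutex_lock(&work_mutex);
                pthread_cond_timedwait(&work_cond, &work_mutex, /* 2. wait for work on its */
                                       &deadline);              /*    own mutex            */
                pthread_mutex_unlock(&work_mutex);

                pthread_mutex_unlock(&pool_mutex);              /* 3. unlock mutex */
        }                                                       /* 4. straight back around */
        return arg;
}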

All of which worked fine until I ran it on a single-core Linux box, at which point a second thread waiting for the lock would occasionally never get it: it would wait, and wait, and never receive the lock, while the worker thread spun happily around releasing and acquiring it. A sort of 'soft deadlock', if you like.

Once I figured out what was going on I stripped it down to a test case and discovered it was happening about half the time, a little more if optimised.

Some weird scheduler problem, then. I was under the impression that a thread should yield when it releases a contended mutex, but a kernel hacker friend disabused me of that notion: it can do what it likes. Whether or not my waiting thread ever gets the lock is entirely down to luck.

The mutex here is protecting the work sources (including the cond var that the worker is waiting on), which are dynamic and can come and go. The proper solution, then, is to create some other, static, secondary work source that is checked between steps 3 and 4 and onto which I can post changes to the primary work sources. But then I would also have to create some response mechanism so that the worker pool knew when it was done and it was safe to do the delete - all of which adds complexity and risk to an already complex and risk-prone area of the code, and all for a code path that will happen very rarely, in fact hardly ever.

Instead I did something truly abhorrent with a volatile flag and a call to sched_yield, which I'm not proud of, but it is unobtrusive and it works.
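
For what it's worth, the shape of the hack was roughly this (the flag name and the exact placement are my reconstruction of the pattern, not the actual code):

#include <pthread.h>
#include <sched.h>

static volatile int someone_waiting = 0;  /* advisory only: missing an update just costs one more spin */

/* Worker side: called between steps 3 and 4 of the loop above. */
static void be_polite(void)
{
        if (someone_waiting)
                sched_yield();            /* hand the core over so the waiter can actually run */
}

/* Waiting side: wrapped around the contended lock. */
static void polite_lock(pthread_mutex_t *m)
{
        someone_waiting = 1;
        pthread_mutex_lock(m);
        someone_waiting = 0;
}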

Anyway, an interesting gotcha. Or at least I thought it was interesting.