Monday 5 December 2011

If the test writer has to foresee the bug then it will not be caught.

I am stuck automating software tests. Not unit tests, you understand, but functional tests. It is a very dull necessity.

We have a nice little scriptable home-brewed test-bed where we can enact user stimulus, run scripts, change files, etc. It's really quite nifty but, I think, ultimately of very limited use.

Why? Because there are two sorts of automatic tests. The first feeds in stimulus X and checks for result Y, the second feeds in stimulus X and checks that result Y was the only thing that happened. The problem with the first type is that the test writer is required to foresee the bug, which nobody in their right mind could expect him to do. The problem with the second is it is hard to automate at a functional level.

Needless to say...

Tuesday 25 October 2011

Aladdin Sane's Cat And Other Tools

I call the above diagram Aladdin Sane's Cat and for a while this summer it was the bane of my life. What it depicts is per-thread CPU usage: the orange line represents the main process thread and the other lines represent my worker threads. Everything starts off okay: the main thread starts feeding the workers, and although they are rather heavily contended we expect them to be contended at startup and are prepared to live with it. Then something terrible happens: the main thread turns its attention to consuming the worker thread results and the workers run dry. It doesn't look so awful in the above case because after startup the application calms down, but in other test cases it started to beat: main thread, worker threads, main thread, worker threads, etc. (those cases didn't look like animals so I didn't take a screenshot).

The point of all this is not how to fix this one rather obscure problem (it took ages!) or even to go "Hey, I drew a cat!" (though don't think for a moment that I'm beyond writing a post just for that) but to demonstrate how incredibly useful something as simple as a chart of per-thread CPU usage can be. Without that chart I not only wouldn't have been able to fix the problem; I wouldn't even have known there was a problem.

In the above case the chart was drawn by our own software, although Excel would have been as good, and the numbers were generated from a script I wrote (and have sadly lost) that simply polled /proc/<pid>/task/<tid>/stat.
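
The polling itself is trivial; something along these lines would do the job (a sketch, not the lost script: it prints one 'tid,cumulative-ticks' line per thread every second, so you chart the deltas between samples):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char taskdir[64];
        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }
        snprintf(taskdir, sizeof taskdir, "/proc/%s/task", argv[1]);

        for (;;) {
                DIR *dir = opendir(taskdir);
                struct dirent *de;
                if (!dir) {
                        perror("opendir");
                        return 1;
                }
                while ((de = readdir(dir)) != NULL) {
                        char path[128], buf[1024];
                        FILE *f;
                        if (de->d_name[0] == '.')
                                continue;
                        snprintf(path, sizeof path, "%s/%s/stat", taskdir, de->d_name);
                        f = fopen(path, "r");
                        if (!f)
                                continue;               /* thread may have exited */
                        if (fgets(buf, sizeof buf, f)) {
                                /* utime and stime are fields 14 and 15 of stat, counted in
                                 * clock ticks; skip past the ')' that closes the comm field */
                                char *p = strrchr(buf, ')');
                                unsigned long utime, stime;
                                if (p && sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u "
                                                "%*u %*u %lu %lu", &utime, &stime) == 2)
                                        printf("%s,%lu\n", de->d_name, utime + stime);
                        }
                        fclose(f);
                }
                closedir(dir);
                fflush(stdout);
                sleep(1);                               /* one sample per second */
        }
}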

Looking back, this was so simple to do, and so incredibly useful*, that I'm amazed I went so long before I thought of doing it. With that in mind I thought I'd produce a quick rundown of other tools I found useful when developing multithreaded applications:

  1. Per-thread CPU usage. On Linux you can get it from the proc directory or from 'top -H' (although top is hard to script). On Solaris you can get it from 'prstat -L' (which is easier to script). On Windows I have no idea, but there must be ways. This is invaluable, as I hope I've demonstrated, especially once you chart it over time.
  2. Talking of prstat, on Solaris you have the 'prstat -m' option for microstate accounting, which is very funky indeed. This will give you a column, LCK, for the time spent waiting for userspace locks. You have to be aware that this includes time spent waiting in pthread_cond_wait, but it is still useful.
  3. Valgrind. Helgrind (or drd if you prefer) can be invaluable, particularly for picking up lock ordering violations. The problem with both of these is that they run like dogs, so you can't test the program under normal load and it takes forever.
  4. The Sun Studio collect and er_print tools are really quite nifty once you get the hang of them. They will give you lock contention time by function, optionally filtered by thread.
  5. VTune, if you can persuade your boss to buy it (like I did, woo!), is surely the bee's knees among profilers and is just chock full of bells and whistles. For instance it is a doddle to identify contention on a per-lock, per-thread, per-call-stack basis, which - when your whizz-bang new multi-threaded program is not scaling like it should, and your employer is breathing down your neck asking why - is a bloody godsend.

There are other tools. DTrace's plockstat sounds fantastic but I could never get it to work. AMD CodeAnalyst is supposed to be great and is free, but it is only really useful on AMD hardware, upon which I was not. No doubt there are plenty more.


* It is a testament to both the usefulness of the script and the laziness of its author that even after I accidentally deleted it I left a copy running for nearly two months and changed the name of every process I wanted to examine in order that it would pick them up. Sadly after two months someone rebooted the server. It was a good script, and I miss it still, though so far not enough to rewrite it.

Wednesday 28 September 2011

Sparc's Overlapping Register Windows

One of the nifty* features in Sparc chips is that the register file is in fact just a window onto a larger circular buffer. In Sparc you have 32 registers: eight of these are globals and do not interest us; of the remaining 24, eight are shared with the calling stack frame and another eight will be shared with any called stack frame. There is a good explanation here: http://www.sics.se/~psm/sparcstack.html

All this should in theory make calling functions blindingly fast (not that function calls are especially slow) so long as you don't pass more than eight registers' worth of arguments (which is plenty for most people). However, as the above link points out, whenever another thread wants the core you have to flush the whole lot to memory, whether it was in use or not (I suppose you could track and only flush the used registers, but it probably wouldn't win you anything).

It occurred to me, while working on a piece of code some 40-odd stack frames below the main run loop, that function calls would get more expensive once you ran out of buffer, so I wrote a little benchmark to test it.

The benchmark is fairly simple: it calls the following function a few thousand times, and you can set the initial first argument to whatever you want.

/* Recurse nLevel frames deep; the pointer cast in the third argument is explained below. */
int recursive(int nLevel, int val1, int val2)
{
        if (0 == nLevel)
                return val1 + val2;
        else
                return recursive(nLevel - 1, val2, (int)&val1);
}

(Casting the address of val1 was the easiest way I found to stop the optimiser inlining the whole damn thing; bit clever, that optimiser!)
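
For reference, a harness along these lines would do it; this is a minimal sketch rather than the original benchmark, and the iteration count, the gettimeofday timing and the volatile sink are all my own choices here:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int recursive(int nLevel, int val1, int val2);  /* as defined above */

int main(int argc, char **argv)
{
        int depth = (argc > 1) ? atoi(argv[1]) : 8;   /* recursion depth under test */
        int iterations = 100000;
        volatile int sink = 0;                        /* stop the calls being optimised away */
        struct timeval start, end;
        double usec;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < iterations; ++i)
                sink += recursive(depth, i, i + 1);
        gettimeofday(&end, NULL);

        usec = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
        printf("depth %d: %.4f usec per call\n", depth, usec / iterations);
        return 0;
}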

I ran it with a selection of recursion levels, optimised and unoptimised, on Solaris x86 and Sparc boxes.




As you can see, the cost of a function call suddenly leaps up after about seven frames on the Sparc box but stays constant on the x86 box (I adjusted the number of iterations on the two boxes to normalise the figures, so they aren't really directly comparable; my x86 box was a lot faster than my Sparc box). I had initially wondered if I would see a jump every seven or eight frames as the whole register file was dumped out to memory, but in retrospect that would have been a stupid way to implement it and obviously only one frame is flushed at a time.

Incidentally the graph stays the same no matter how many initial frames you are down when you start the benchmark, proving that frames aren't loaded back in until they are needed. This is fairly obvious when you think about it but I checked anyway.

The other interesting thing to note is how much faster the first seven frames are with the optimised Sparc calls, after which it's about the same cost as the unoptimised version. The only optimisation to the function call was to replace a tortuous series of dependent ORs with a single MOV:

Vanilla:
0x00010fbc: recursive+0x002c:   add      %l2, -1, %o0
0x00010fc0: recursive+0x0030:   add      %fp, 72, %o2
0x00010fc4: recursive+0x0034:   call     recursive      ! 0x10f90
0x00010fc8: recursive+0x0038:   or       %i2, %g0, %o1
0x00010fcc: recursive+0x003c:   or       %o0, %g0, %l7
0x00010fd0: recursive+0x0040:   or       %l7, %g0, %i0
0x00010fd4: recursive+0x0044:   ret      

Optimised:

0x00010f58: recursive+0x001c:   add      %i0, -1, %o0
0x00010f5c: recursive+0x0020:   add      %fp, 72, %o2
0x00010f60: recursive+0x0024:   call     recursive      ! 0x10f3c
0x00010f64: recursive+0x0028:   mov      %i2, %o1
0x00010f68: recursive+0x002c:   ret  


(there were other changes before the conditional jump that will account for some of the speedup too)

The fact that this improvement is more or less invisible once we start flushing the buffer goes to show just how much the flush costs.

I have no idea why the x86 lines wobble so much.

The conclusion, then, is to try to avoid regularly bouncing up and down across more than seven stack frames if you are targeting Sparc machines. Unfortunately, with my 40+ frames there was little I could do about that.

* Well I think it's nifty. I've read criticisms that the feature makes Sparc chips hard to lay out and is the main reason they have never reached the clock speeds of other architectures.**


** That said, in my experience hardware engineers are always moaning about some feature making layout difficult and if they ever moan enough to make the architect compromise they just seamlessly move on to moaning about something else, so I still think it's nifty.

Friday 16 September 2011

'Soft Deadlock'

Came across a funny situation with a standard worker pool pattern. When the workers had no work to do they basically did this (sketched in code below the list):

1. Lock mutex.
2. Do a cond_timedwait waiting for work. (This has its own mutex, not the one from above.)
3. Unlock mutex.
4. Goto 1. Do not sleep, pass go, collect £200, anything ...
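
In pthreads terms the idle loop looked roughly like this. A minimal sketch rather than the real code, and the names (pool_mutex, work_cond_mutex, work_cond, have_work) are only illustrative:

#include <pthread.h>
#include <time.h>

pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;      /* protects the work sources */
pthread_mutex_t work_cond_mutex = PTHREAD_MUTEX_INITIALIZER; /* the cond var's own mutex */
pthread_cond_t  work_cond = PTHREAD_COND_INITIALIZER;
int have_work = 0;

void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                /* 1. Lock mutex. */
                pthread_mutex_lock(&pool_mutex);

                /* 2. cond_timedwait for work, on the cond var's own mutex. */
                pthread_mutex_lock(&work_cond_mutex);
                if (!have_work) {
                        struct timespec deadline;
                        clock_gettime(CLOCK_REALTIME, &deadline);
                        deadline.tv_sec += 1;
                        pthread_cond_timedwait(&work_cond, &work_cond_mutex, &deadline);
                }
                /* ... pick up and run any work here ... */
                pthread_mutex_unlock(&work_cond_mutex);

                /* 3. Unlock mutex.  4. Goto 1, with nothing in between, so on a
                 * single core a thread queuing for pool_mutex may never win it. */
                pthread_mutex_unlock(&pool_mutex);
        }
        return NULL;
}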

All of which worked fine until I ran it on a single-core Linux box, at which point a second thread waiting for the lock would occasionally never get it. It would wait, and wait, and never receive the lock, while the worker thread spun happily around releasing and acquiring it: a sort of 'soft deadlock', if you like.

Once I figured out what was going on I stripped it down to a test case and discovered it was happening about half the time, a little more if optimised.

Some weird scheduler problem, then. I was under the impression that a thread should yield when it releases a contended mutex, but a kernel hacker friend disabused me of that notion: it can do what it likes. Whether or not my waiting thread ever gets the lock is entirely down to luck.

The mutex here is protecting the work sources (including the cond var that the worker is waiting on), which are dynamic and can come and go. The proper solution, then, is to create some other, static, secondary work source that is checked between steps 3 and 4 and onto which I can post changes to the primary work sources. But then I would also have to create some response mechanism so that the worker pool knew when it was done and it was safe to do the delete - all of which adds complexity and risk to an already complex and risk-prone area of the code, and all for a code path that will be hit very rarely, in fact hardly ever.

Instead I did something truly abhorrent with a volatile flag and a call to sched_yield, which I'm not proud of but which is unobtrusive and works.
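
Roughly this shape, for the curious; a guess at the general form rather than the actual code, reusing pool_mutex from the sketch above, with lock_wanted and the two helpers being illustrative names:

#include <pthread.h>
#include <sched.h>

extern pthread_mutex_t pool_mutex;     /* from the sketch above */
volatile int lock_wanted = 0;          /* set by a thread about to queue on pool_mutex */

/* In the thread that needs the pool mutex: */
void take_pool_mutex(void)
{
        lock_wanted = 1;
        pthread_mutex_lock(&pool_mutex);
        lock_wanted = 0;
}

/* In the worker loop, between steps 3 and 4: */
void maybe_yield(void)
{
        if (lock_wanted)
                sched_yield();         /* give the waiter a chance on a single core */
}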

Anyway, an interesting gotcha. Or at least I thought it was interesting.