Tuesday, 29 May 2012

The Unit Testing Catch-22

I am unit testing. Unit testing falls closer to the second of the two types of tests I enumerated here and is, I think everyone agrees, a 'good thing.'

The problem with unit testing is that, like everything short of full functional testing, it relies on domain knowledge. In the days when everything was specified in documents, about which architects had meetings and which they then bestowed like blessings upon the rest of us, this was not a big deal; the document described exactly what stimulus the unit would receive, and if the document was wrong it was not your problem. These days we are all agile and stuff, and there are no documents, so we are forced to rely on our own understanding of the wider system to know what stimulus our unit will receive; in a large system that understanding is almost certainly incomplete.

The Catch-22 here is that the majority of bugs are due to incomplete domain knowledge. I simply didn't realise that the system would, on some rare occasions, provide me with two notifications before actually doing anything. I assumed that when I received the second notification the first operation was complete. No amount of unit testing would have caught this bug, because I would no more have thought to test for it than I would have tested how the code coped with receiving references to deleted objects.

This is why I am suspicious of mocking frameworks. Mocking is useful, obviously, but it forces you to rely even more heavily on domain knowledge, which is precisely the point of weakness.
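Here is a toy illustration in C (a sketch with hypothetical names, not the real code) of how a hand-rolled mock happily encodes the very assumption that caused the bug:

#include <assert.h>

static int op_in_flight;
static int bug_hit;

/* Unit under test: assumes the second notification means the first
   operation has completed. */
static void on_notification(int op_complete /* reported by the mock */)
{
    if (!op_in_flight) {
        op_in_flight = 1;          /* first notification: start the op */
    } else {
        if (!op_complete)
            bug_hit = 1;           /* the case I never imagined */
        op_in_flight = 0;
    }
}

int main(void)
{
    /* The mocked stimulus encodes my (incomplete) domain knowledge:
       notification, operation completes, notification. */
    on_notification(0);
    on_notification(1);
    assert(!bug_hit);              /* green, but only because the mock never
                                      sends two notifications before the
                                      operation completes */
    return 0;
}

The test passes, and will keep passing, because the mock can only ever replay the sequences its author foresaw.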

Monday, 9 January 2012

Solaris Memory Allocators

Is your multi-threaded application running like a dog on Solaris?

Mine was, and after weeks of head scratching I was reduced to suggesting that the client run it on Linux instead, which was a mite embarrassing to say the least.

Fast forward a year or so and I discover the wonderful libumem or, because we have to support Sol8, the almost-as-wonderful mtmalloc (incidentally, Oracle need to up their SEO game; it took me ages to dig up that link): two multithread-optimised memory allocators that speed things up a lot! Brilliantly, all you need to do is set LD_PRELOAD and you're off; you don't even need to recompile anything.
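For the record, wiring one in really is that easy; something along these lines, assuming the libraries are on the default linker path (myapp is a stand-in for your own binary):

LD_PRELOAD=libumem.so.1 ./myapp        # libumem
LD_PRELOAD=libmtmalloc.so.1 ./myapp    # mtmalloc, if you're stuck on Sol8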

Except that I am not working on that application any more. Now I am trying to make something else go faster. And this something else is old, and sits on the legacy comms layer.

Way back in the dim and distant past, somebody who isn't here any more wrote a buffered interface for the comms layer. Now, if you or I were implementing a data buffer we'd probably put it in a unit-tested class, and be all agile about it, and probably use templates and a design pattern and stuff. We'd definitely keep at least two counters: one to track the allocated size of the buffer, and one to track the size of the data in it.

Things weren't like that back then. Classes were just considered uppity structures, unit testing was viewed as a dubious eccentricity, and templates were for Microsoft Word and Microsoft Word only. Our long-forgotten developer didn't need two counters. He just made it so the buffer was always the size of the data in it, and whenever that changed he realloc'd it.

No messin'!
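If you want to picture it, the scheme was something like this (a sketch in C under my own names, not the original code): a single counter, with the allocation always exactly the size of the data, so every append is a realloc.

#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;
    size_t size;    /* the only counter: allocation == data size */
} buffer_t;

static int buffer_append(buffer_t *b, const char *src, size_t n)
{
    char *p = realloc(b->data, b->size + n);   /* grow to the exact size */
    if (p == NULL)
        return -1;
    memcpy(p + b->size, src, n);
    b->data = p;
    b->size += n;
    return 0;
}

With a realloc that can often grow a block in place this is merely inelegant; keep it in mind for what follows.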

The writers of mtmalloc and libumem were not like that. They were like you and me, and they did not worry much about realloc, because everyone these days uses C++y things like new and delete and there is no C++y way to do realloc. So when they implemented realloc they simply did a malloc, a memcpy, and a free.

Easy!
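In other words, something like this (a sketch; the real thing obviously tracks the old block's size internally rather than taking it as a parameter):

#include <stdlib.h>
#include <string.h>

/* What an allocator with no native realloc path does: allocate, copy
   everything, free the old block. Growing an n-byte buffer now costs an
   n-byte copy, so exact-size growth makes appending O(n^2). */
static void *naive_realloc(void *old, size_t old_size, size_t new_size)
{
    void *p = malloc(new_size);
    if (p == NULL)
        return NULL;
    if (old != NULL) {
        memcpy(p, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return p;
}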

Except you know what happens next. I set my LD_PRELOAD and everything runs swimmingly, for a while, until it hits a certain data rate where, suddenly, the comms buffering is being exercised and the application slows back down to a crawl, even worse than it was before, and I have to open up the labyrinthine, antediluvian comms code and take out all those reallocs.

Which was no fun.

Monday, 5 December 2011

If the test writer has to foresee the bug then it will not be caught.

I am stuck automating software tests. Not unit tests, you understand, but functional tests. It is a very dull necessity.

We have a nice little scriptable home-brewed test-bed where we can enact user stimulus and run scripts, change files, etc. It's really quite nifty but, I think, ultimately of very limited use.

Why? Because there are two sorts of automatic test. The first feeds in stimulus X and checks for result Y; the second feeds in stimulus X and checks that result Y was the only thing that happened. The problem with the first type is that the test writer is required to foresee the bug, which nobody in their right mind could expect him to do. The problem with the second is that it is hard to automate at a functional level.
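To make the distinction concrete, here is a sketch in C (hypothetical names throughout): suppose stimulus X produced result Y and an unexpected side effect Z that nobody foresaw.

#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char *events[16];   /* event log captured from the system */
static int nevents;

static int saw(const char *name)
{
    for (int i = 0; i < nevents; i++)
        if (strcmp(events[i], name) == 0)
            return 1;
    return 0;
}

int main(void)
{
    events[nevents++] = "Y";     /* the result we expected */
    events[nevents++] = "Z";     /* the side effect we didn't */

    /* Type 1: did Y happen? Passes; Z goes unnoticed. */
    assert(saw("Y"));

    /* Type 2: was Y the ONLY thing that happened? Catches Z. */
    printf("type 2 %s\n", (nevents == 1 && saw("Y")) ? "passes" : "fails");
    return 0;
}

The type 1 test is only as good as its author's imagination; the type 2 test doesn't need to imagine Z, it just needs a complete record of everything that happened, which is exactly what is hard to capture at a functional level.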

Needless to say...

Tuesday, 25 October 2011

Aladdin Sane's Cat And Other Tools

I call the above diagram Aladdin Sane's Cat, and for a while this summer it was the bane of my life. What it depicts is per-thread CPU usage: the orange line represents the main process thread and the other lines represent my worker threads. Everything starts off okay; the main thread starts feeding the workers, which are rather heavily contended, but we expect them to be contended at startup and are prepared to live with it. Then something terrible happens: the main thread turns its attention to consuming the worker-thread results and the workers run dry. It doesn't look so awful in the above case, because after startup the application calms down, but in other test cases it started to beat: main thread, worker threads, main thread, worker threads, and so on (those cases didn't look like animals, so I didn't take a screenshot).

The point of all this is not how to fix this one rather obscure problem (it took ages!), or even to go 'Hey, I drew a cat!' (though don't think for a moment that I'm beyond writing a post just for that), but to demonstrate how incredibly useful something as simple as a chart of per-thread CPU usage can be. Without that chart I not only wouldn't have been able to fix the problem; I wouldn't even have known there was a problem.

In the above case the chart was drawn by our own software, although Excel would have done just as well, and the numbers were generated by a script I wrote (and have sadly lost) that simply polled /proc/<pid>/task/<tid>/stat.
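For what it's worth, a reconstruction of the idea (a sketch in C, not the lost script itself) might look like this: every second, print the cumulative utime + stime for each thread of the given pid, then chart the successive differences.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char taskdir[64];
    snprintf(taskdir, sizeof taskdir, "/proc/%s/task", argv[1]);

    for (;;) {
        DIR *dir = opendir(taskdir);
        if (dir == NULL)
            return 1;                        /* process has gone away */
        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;
            char path[128], line[1024];
            snprintf(path, sizeof path, "%s/%s/stat", taskdir, de->d_name);
            FILE *f = fopen(path, "r");
            if (f == NULL)
                continue;
            if (fgets(line, sizeof line, f) != NULL) {
                /* comm (field 2) may contain spaces, so skip past the
                   last ')'; utime and stime are fields 14 and 15 */
                char *p = strrchr(line, ')');
                unsigned long utime, stime;
                if (p && sscanf(p + 2,
                        "%*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
                        &utime, &stime) == 2)
                    printf("%s %lu\n", de->d_name, utime + stime);
            }
            fclose(f);
        }
        closedir(dir);
        fflush(stdout);
        sleep(1);
    }
}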

Looking back, this was so simple to do, and so incredibly useful*, that I'm amazed I went so long before I thought of doing it. With that in mind, I thought I'd produce a quick rundown of the other tools I found useful when developing multithreaded applications:

  1. Per-thread CPU usage. On Linux you can get it from the proc directory or using 'top -H' (although top is hard to script); on Solaris, from 'prstat -L' (which is easier to script). On Windows I have no idea, but there must be ways. This is invaluable, as I hope I've demonstrated, especially once you chart it over time.
  2. Talking of prstat, on Solaris you have the 'prstat -m' option for microstate accounting, which is very funky indeed. This will give you a column, LCK, for the time spent waiting on userspace locks. You have to be aware that this includes time spent waiting in pthread_cond_wait, but it is still useful.
  3. Valgrind. Helgrind (or DRD, if you prefer) can be invaluable, particularly for picking up lock-ordering violations. The problem with both of these is that they run like dogs, so you can't test the program under normal load and it takes forever.
  4. The Sun Studio collect and er_print tools are really quite nifty once you get the hang of them. They will give you lock contention time by function, optionally filtered by thread.
  5. VTune, if you can persuade your boss to buy it (like I did, woo!), is surely the bee's knees among profilers and is just chock full of bells and whistles. For instance, it is a doddle to identify contention on a per-lock, per-thread, per-call-stack basis, which - when your whizz-bang new multithreaded program is not scaling like it should, and your employer is breathing down your neck asking why - is a bloody godsend.
There are other tools. DTrace's plockstat sounds fantastic, but I could never get it to work. AMD CodeAnalyst is supposed to be great, and is free, but is only really useful on AMD hardware, upon which I was not. No doubt there are plenty more.


* It is a testament both to the usefulness of the script and to the laziness of its author that even after I accidentally deleted it I left a copy running for nearly two months, and changed the name of every process I wanted to examine so that it would pick them up. Sadly, after two months someone rebooted the server. It was a good script, and I miss it still, though so far not enough to rewrite it.

Wednesday, 28 September 2011

Sparc's Overlapping Register Windows

One of the nifty* features of Sparc chips is that the register file is in fact just a window onto a larger circular buffer. On Sparc you have 32 registers: eight of these are globals and do not interest us; of the remaining 24, eight are shared with the calling stack frame and another eight will be shared with any called stack frame. There is a good explanation here: http://www.sics.se/~psm/sparcstack.html

All this should in theory make calling functions blindingly fast (not that function calls are especially slow), so long as you don't pass more than eight registers' worth of arguments (which is plenty for most people). However, as the above link points out, whenever another thread wants the core you have to flush the whole lot to memory, whether it was in use or not (I suppose you could track and flush only the used registers, but it probably wouldn't win you anything).

It occurred to me, while working on a piece of code some 40-odd stack frames below the main run loop, that function calls would get more expensive once you ran out of buffer, so I wrote a little benchmark to test it.

The benchmark is fairly simple: it calls the following function a few thousand times, and you can set the initial first argument to whatever you want.

int recursive(int nLevel, int val1, int val2)
{
        /* the (int)&val1 cast stops the optimiser inlining everything (see below) */
        if (0 == nLevel)
                return val1 + val2;
        else
                return recursive(nLevel-1, val2, (int)&val1);
}

(Casting the address of val1 was the easiest way I found to stop the optimiser inlining the whole damn thing; a bit clever, that optimiser!)

I ran it with a selection of recursion levels, optimised and unoptimised, on Solaris x86 and Sparc boxes.

[Chart: cost per call against recursion depth, for Sparc and x86, optimised and unoptimised.]
As you can see, the cost of a function call leaps up after a depth of about seven on the Sparc box but stays constant on the x86 box (I adjusted the number of iterations on the two boxes to normalise the figures, so they aren't really directly comparable; my x86 box was a lot faster than my Sparc box). I had initially wondered if I would see a jump every seven or eight frames, as the whole register file was dumped to memory, but in retrospect that would have been a stupid way to implement it, and obviously only one frame is flushed at a time.

Incidentally, the graph stays the same no matter how many frames down you are when you start the benchmark, proving that frames aren't loaded back in until they are needed. This is fairly obvious when you think about it, but I checked anyway.

The other interesting thing to note is how much faster the first seven frames are in the optimised Sparc runs, after which a call costs about the same as in the unoptimised version. The only optimisation to the function call itself was to replace a tortuous series of dependent ORs with a single MOV:

Vanilla:
0x00010fbc: recursive+0x002c:   add      %l2, -1, %o0
0x00010fc0: recursive+0x0030:   add      %fp, 72, %o2
0x00010fc4: recursive+0x0034:   call     recursive      ! 0x10f90
0x00010fc8: recursive+0x0038:   or       %i2, %g0, %o1
0x00010fcc: recursive+0x003c:   or       %o0, %g0, %l7
0x00010fd0: recursive+0x0040:   or       %l7, %g0, %i0
0x00010fd4: recursive+0x0044:   ret      

Optimised:

0x00010f58: recursive+0x001c:   add      %i0, -1, %o0
0x00010f5c: recursive+0x0020:   add      %fp, 72, %o2
0x00010f60: recursive+0x0024:   call     recursive      ! 0x10f3c
0x00010f64: recursive+0x0028:   mov      %i2, %o1
0x00010f68: recursive+0x002c:   ret  


(there were other changes before the conditional jump that will account for some of the speedup too)

The fact that this improvement is more or less invisible once we start flushing the buffer goes to show just how much the flush costs.

I have no idea why the x86 lines wobble so much.

The conclusion, then, is to try to avoid regularly bouncing above seven stack frames if you are targeting Sparc machines. Unfortunately, with my 40+ frames there was little I could do about that.

* Well, I think it's nifty. I've read criticisms that the feature makes Sparc chips hard to lay out, and that it is the main reason they have never reached the clock speeds of other architectures.**


** That said, in my experience hardware engineers are always moaning about some feature making layout difficult, and if they ever moan enough to make the architect compromise they just seamlessly move on to moaning about something else, so I still think it's nifty.

Friday, 16 September 2011

'Soft Deadlock'

Came across a funny situation with a standard worker-pool pattern. When the workers had no work to do they basically did this (there's a C sketch after the list):

1. Lock mutex.
2. Do a cond_timedwait waiting for work (this has its own mutex, not the one from above).
3. Unlock mutex.
4. Goto 1. Do not sleep, pass go, collect £200, anything ...
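A minimal sketch of that idle loop, assuming POSIX threads (the names are mine, not the original code's):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t sources_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wait_mutex    = PTHREAD_MUTEX_INITIALIZER;  /* the cond var's own mutex */
static pthread_cond_t  work_cond     = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&sources_mutex);            /* 1. lock mutex */
        /* ... scan the dynamic work sources here ... */

        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;

        pthread_mutex_lock(&wait_mutex);               /* 2. wait for work */
        pthread_cond_timedwait(&work_cond, &wait_mutex, &deadline);
        pthread_mutex_unlock(&wait_mutex);

        pthread_mutex_unlock(&sources_mutex);          /* 3. unlock mutex */
                                                       /* 4. straight back
                                                          round: no sleep */
    }
    return NULL;
}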

All of which worked fine until I ran it on a single-core Linux box, at which point a second thread waiting for the lock would occasionally never get it. It would wait, and wait, and never receive the lock, while the worker thread spun happily around releasing and acquiring it: a sort of 'soft deadlock', if you like.

Once I figured out what was going on I stripped it down to a test case and discovered it was happening about half the time; a little more if optimised.

Some weird scheduler problem, then. I was under the impression that a thread should yield when it releases a contended mutex, but a kernel-hacker friend disabused me of that notion: it can do what it likes. Whether or not my waiting thread ever gets the lock is entirely down to luck.

The mutex here is protecting the work sources (including the cond var that the worker is waiting on), which are dynamic and can come and go. The proper solution, then, is to create some other, static, secondary work source that is checked between steps 3 and 4, onto which I can post changes to the primary work sources. But then I would also have to create some response mechanism so that the worker pool knew when it was done and it was safe to do the delete. All of which adds complexity and risk to an already complex and risk-prone area of the code, and all for a code path that will be exercised very rarely, in fact hardly ever.

Instead I did something truly abhorrent with a volatile flag and a call to sched_yield (sketched below), which I'm not proud of, but it is unobtrusive and it works.
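For the morbidly curious, the hack was along these lines (a sketch under my own names; the real code differs):

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t sources_mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile int waiter_pending;    /* set while someone wants the lock */

/* Called by a thread that wants to change the work sources. */
static void lock_sources(void)
{
    waiter_pending = 1;
    pthread_mutex_lock(&sources_mutex);
    waiter_pending = 0;
}

/* Called by the worker between steps 3 and 4 of its loop. */
static void maybe_yield(void)
{
    if (waiter_pending)
        sched_yield();    /* give the starved thread a shot at the mutex */
}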

Anyway, an interesting gotcha. Or at least I thought it was interesting.