The point of all this is not how to fix this one rather obscure problem (it took ages!) or even to go Hey, I drew a cat! (though don't think for a moment that I'm beyond writing a post just for that) but to demonstrate how incredibly useful something as simple as a chart of per-thread CPU usage can be. Without that chart I not only wouldn't have been able to fix the problem; I wouldn't even have known there was a problem.
In the above case the chart was drawn by our own software, although Excel would have been as good, and the numbers were generated from a script I wrote (and have sadly lost) that simply polled /proc/<pid>/task/<tid>/stat.
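Since the original script is gone, what follows is only a minimal sketch of how such a poller might look, not a reconstruction of it. It assumes Linux, reads utime and stime (fields 14 and 15 of the stat file, per proc(5)) for each thread, and prints one CSV row per thread per interval. The function names, the one second default interval and the CSV layout are all my own choices for illustration.

```python
import os
import sys
import time


def parse_stat(line):
    """Return (tid, comm, utime_ticks, stime_ticks) from one stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    tid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()
    # rest[0] is field 3 (state), so utime and stime (fields 14 and 15)
    # sit at indexes 11 and 12.
    return tid, comm, int(rest[11]), int(rest[12])


def sample(pid):
    """Map tid -> (comm, utime + stime in clock ticks) for each thread of pid."""
    usage = {}
    task_dir = "/proc/%d/task" % pid
    for name in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, name, "stat")) as f:
                tid, comm, utime, stime = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
        usage[tid] = (comm, utime + stime)
    return usage


def cpu_percent(before, after, interval, ticks_per_sec):
    """Per-thread CPU % over the interval, for threads seen in both samples."""
    return {
        tid: 100.0 * (after[tid][1] - before[tid][1]) / (ticks_per_sec * interval)
        for tid in after
        if tid in before
    }


def main(pid, interval=1.0):
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = sample(pid)
    while True:
        time.sleep(interval)
        after = sample(pid)
        now = time.time()
        for tid, pct in sorted(cpu_percent(before, after, interval, ticks).items()):
            # CSV row: timestamp, tid, thread name, percent of one CPU
            print("%.0f,%d,%s,%.1f" % (now, tid, after[tid][0], pct))
        before = after


if __name__ == "__main__" and len(sys.argv) > 1:
    main(int(sys.argv[1]))
```

Run it with a pid as the only argument, redirect the output to a file, and you have something any charting tool can plot as one line per thread. It stops with an error once the process exits, which was good enough for my purposes.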
Looking back, this was so simple to do, and so incredibly useful*, that I'm amazed I went so long before I thought of doing it. With that in mind I thought I'd produce a quick rundown of other tools I found useful when developing multithreaded applications:
- Per-thread CPU usage. On Linux you can get it from the proc directory or using 'top -H' (although top is hard to script). On Solaris you can get it from 'prstat -L' (which is easier to script). On Windows I have no idea but there must be ways. This is invaluable, as I hope I've demonstrated, especially once you chart it over time.
- Talking of prstat, on Solaris you have the 'prstat -m' option for microstate accounting, which is very funky indeed. This will give you a column, LCK, for the time spent waiting for userspace locks. You have to be aware that this includes time waiting in pthread_cond_wait, but it is still useful.
- Valgrind. Helgrind (or DRD if you prefer) can be invaluable, particularly for picking up lock ordering violations. The problem with both of these is that they run like dogs, so you can't test the program under normal load and everything takes forever.
- Sun Studio collect and er_print tools are really quite nifty once you get the hang of them. They will give you lock contention time by function and optionally filtered by thread.
- VTune, if you can persuade your boss to buy it (like I did, woo!), is surely the bee's knees among profilers and is just chock full of bells and whistles. For instance it is a doddle to identify contention on a per-lock, per-thread, per-call-stack basis, which - when your whizz-bang new multi-threaded program is not scaling like it should, and your employer is breathing down your neck asking why - is a bloody godsend.
There are other tools. dtrace plockstat sounds fantastic but I could never get it to work. AMD CodeAnalyst is supposed to be great and is free but is only really useful on AMD hardware, upon which I was not. No doubt there are plenty more.
* It is a testament to both the usefulness of the script and the laziness of its author that even after I accidentally deleted it I left a copy running for nearly two months and changed the name of every process I wanted to examine in order that it would pick them up. Sadly after two months someone rebooted the server. It was a good script, and I miss it still, though so far not enough to rewrite it.