Came across a funny situation with a standard worker pool pattern. When the workers had no work to do they basically did this:
1. Lock mutex.
2. Do a cond_timedwait waiting for work. (This has its own mutex, not the one from above.)
3. Unlock mutex.
4. Goto 1. Do not sleep, pass go, collect £200, anything ...
All of which worked fine until I ran it on a single-core Linux box. There, a second thread waiting for the lock would occasionally never get it: it would wait, and wait, and never receive the lock, while the worker thread spun happily around releasing and reacquiring it. A sort of 'soft deadlock', if you like.
Once I figured out what was going on I stripped it down to a test case and discovered it was happening in about half the runs, a little more often in an optimised build.
Some weird scheduler problem then. I was under the impression that a thread should yield when it releases a contended mutex, but a kernel hacker friend disabused me of that notion: it can do what it likes. Whether or not my waiting thread ever gets the lock is entirely down to luck.
The mutex here is protecting the work sources (including the cond var that the worker is waiting on), which are dynamic and can come and go. The proper solution then is to create some other static secondary work source that is checked between steps 3 and 4, and onto which I can post changes to the primary work sources. But then I would also have to create some response mechanism to signal when the worker pool was done and the delete was safe. All of which adds complexity and risk to an already complex and risk-prone area of the code, for a code path that will hardly ever run.
Instead I did something truly abhorrent with a volatile flag and a call to sched_yield, which I'm not proud of but which is unobtrusive and works.
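The details of the hack aren't spelled out above, so what follows is only a guess at its general shape, not the actual code. The waiting thread announces that it wants the lock, and the worker checks for that between steps 3 and 4 and gives up the CPU if so. All names (`sources_lock`, `lock_wanted`, `sources_lock_acquire`, `worker_yield_if_wanted`) are hypothetical, and a C11 `atomic_int` is swapped in for the volatile flag so the sketch has no data race.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical names: a sketch of the idea, not the real code. */
pthread_mutex_t sources_lock = PTHREAD_MUTEX_INITIALIZER;

/* Count of threads currently trying to take sources_lock.
 * The original used a volatile flag; atomic_int is an assumption here. */
atomic_int lock_wanted;

/* Called by the thread that needs to change the work sources. */
void sources_lock_acquire(void)
{
    atomic_fetch_add(&lock_wanted, 1);   /* announce before blocking */
    pthread_mutex_lock(&sources_lock);
    atomic_fetch_sub(&lock_wanted, 1);   /* got it, stop asking */
}

void sources_lock_release(void)
{
    pthread_mutex_unlock(&sources_lock);
}

/* Called by the worker after step 3 (unlock) and before looping back
 * to step 1. Returns 1 if it yielded, 0 if nobody was waiting. */
int worker_yield_if_wanted(void)
{
    if (atomic_load(&lock_wanted) > 0) {
        sched_yield();                   /* let the waiter run */
        return 1;
    }
    return 0;
}
```

Note that `sched_yield` only asks the scheduler to run something else; it does not promise that the waiting thread is the one that runs, or that it takes the lock before the worker comes back. So this makes the starvation far less likely rather than ruling it out by construction.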
Anyway, an interesting gotcha. Or at least I thought it was interesting.