Multiprocessing at a Glance (VI): Deadlocks and race conditions
Allow me to start this episode by clarifying another concept, namely that of mutual exclusion. Mutual exclusion is the idea that you have a certain resource to which access is exclusive, so only one entity at a time has access to it. In fact, the name mutex – which we discussed previously – comes from the term MUTual EXclusion.
My preferred way to implement mutual exclusion is to create a simple set of functions which provide access to that shared resource and make sure that those few functions are safely written. Once those are in place, the programmer doesn’t need to worry anymore about mutual exclusion and she can simply use those utility functions (or APIs) as she wishes, focusing instead on the problem that she needs to solve.
This approach allows moving the mutual exclusion from a resource, to a snippet of code which is much more intuitive for the programmer to understand and to use, and thus easier to properly code. Just making sure that a piece of data is always exclusively accessed is painful. But moving all the access to one place, and then ensuring that it is mutually exclusive, unshackles the hands and mind of the programmer. The areas of mutually exclusive code are usually called critical sections.
A critical section is defined to be the code snippet which sits between the call that acquires and that which releases the lock.
In the previous episode, we have seen a simple case of producer-consumer, where threads waited for each other in order to orchestrate a higher level job. That was achieved via one of the thread synchronization mechanisms.
When you need to implement thread synchronization, certain kinds of issues may arise. These issue can be grouped in two classes:
All the threads wait for each other, like a super-polite bunch of people inviting each other to perform the next action, never taking the initiative
Every thread rushing to obtain exclusive access to the shared resource, like a bunch of annoying teenagers calling dibs and grabbing the first thing they see
These two situations are called deadlock and race condition, respectively.
As the name suggests, when this happens, the threads involved are getting stuck. This usually leads to the whole program being stuck, but there are exceptions. Nonetheless, deadlocks are severe bugs, and developers should and usually are taking them seriously.
In order for deadlocks to occur, all of the following conditions have to occur:
there are at least two threads in the application
at least one mutually exclusive resource is held by a thread
a thread holding a mutually exclusive resource needs access to another mutually exclusive resource
resources can only be released voluntary by the holding thread, in other words there is no mechanism for forcibly taking exclusive access over a resource
there is some kind of circular dependency for accessing resources
If all these conditions are met, we may get a deadlock during execution. Now, this is not very easy to read from the code, since many of these conditions are happening during runtime. Of course, with experience, a good programmer is able to learn and see these (anti-)patterns and is able to write code that avoids deadlocks following some simple strategies.
One simple strategy to reduce the risk of deadlocks is to ensure that the requests for exclusive access to locking and unlocking multiple resources are arranged in a nested setup. In other words, unlocking (or releasing) resources should be done in the reverse order in which they were locked (or acquired).
Secondly – and more importantly – the nesting structure should be consistent. This can be ensured by creating and consistently following a hierarchy of resources. This hierarchy should be designed in such a way that the most desired resource (the resource that is in more demand and as such less likely to be available) is locked first, and the least desired resource (or the resource that is most likely to be available) is locked last. This will shorten up the amount of time that a thread unnecessarily holds one lock while waiting for another, thus increasing overall throughput.
Race conditions are harder to define and notoriously difficult to debug at runtime. The reason for this is that they may disguise themselves under normal circumstances and only pop-up in certain situations, creating side-effects that hide the real root-cause of the problem. However difficult they are to debug, race conditions are (usually) easier to fix once discovered. In most cases the “fix” simply means creating a mutual exclusion region for your raced data.
To exemplify a race condition, have a look at the first multi-threaded code example from part IV, where the program seemingly went crazy. As you saw there, at a first glance the side-effects were strange, and it wasn’t obvious that we were talking about race conditions. This class of issues can be smelled by the experienced programmer, but in most cases the side effects are so subtle that it takes lot of testing and effort until found. Many times the programmer knows that the issue must be caused by a race condition, but it may be difficult to find where it is.
The difficulty stands from the fact that deadlocks and race conditions are a dynamic property of the scheduler, rather than a static property of the code. That is why multi-threading adds an extra layer of complexity that the programmer needs to be aware of.
A common mistake to avoid them is to brute-force thread synchronization by creating unnecessarily large mutual exclusions that isolate large portions of data. This approach cancels a lot of the benefits of multi-threading as it causes threads to run sequentially, rather than in parallel.
Answer: to end this episode let’s examine the homework we have done in the previous episode. Let’s list again the program in question:
global number = 0
if (number < 1000)
print("Listing numbers from 0 to 999:")
number = number + 1
while(number < 1000)
The first question asked what is the proper locking mechanism to be used in the waitFor_() implementations. Well, we don’t need barriers because we don’t have a group of threads that need to wait for each other to start doing something at the same time. A semaphore could be used, since it’s about waiting for the other thread to finish access to the shared resource. But since we don’t have a pool of resources (like multiple instances of [number]), a mutex is probably the best choice for this case.
The second question asked what would happen if the line stamped with (+) would be missing. In that case, the main thread would enter the loop and wait for the signal from the printing thread that printing was done. On the other hand, the printing thread would start by entering its own loop and waiting for the signal from the main thread that it had incremented [number]. So both threads would end up waiting for each other indefinitely. This would mean that we have replaced a race condition with a deadlock.
In the third question, we asked whether the loop comparison in (M) is inside or outside the critical section. The answer is that the comparison is outside the critical section. The critical sections in our example are the snippets code located between the waitFor_() and notify_() instructions in each thread respectively. As such, in thread (M) the final comparison is outside the critical section. However, this is safe because thread (S) will never change the value of [number] and as such reading the value of [number] in (M) outside the critical section is deterministic.