Multiprocessing at a Glance (V): Thread synchronization
What is thread synchronization, when is it useful, and how do we achieve it?
Thread synchronization is a mechanism by which various threads of a program can wait for each other before they continue. As we have seen in the previous episode, unsynchronized threads seem to be running in a messy and non-deterministic way.
Synchronization is a mechanism for handling situations where one thread gets much more execution time than another, or where one thread’s load is much heavier than the other’s, yet the two still need to perform some tasks together: we can make the “faster” thread wait for the slower one to be ready.
Another common scenario for thread synchronization is when threads need to make sure that they have exclusive access to a shared resource: for example, two threads cannot both write to a printer at the same time, otherwise the printer output will be garbled and mixed.
Now, let’s imagine another scenario where we have two threads in a producer-consumer relationship, meaning that the output of one thread is used as input for the other. The easiest way to visualize this is an assembly line. A car being built slowly travels along the assembly line, and at every stage a new component is added to it. At first you have the chassis, then comes the body-work that gets bolted and welded to the chassis, then come the doors, the bonnet and the trunk, then the car goes through rust treatment, paint, lacquer, mounting the seats, the engine, and so on.
You cannot paint the car if you missed any of the previous steps. Doing that will either result in a car with a missing door, or in a painted car that wasn’t rust-proofed before the paint layer was applied.
So how is thread synchronization happening? There are several synchronization strategies in different programming languages, but we will not go so far as to explore each of them. Instead, let’s focus on the synchronization tools. They are named locking mechanisms, or simply, locks.
A lock is a special data structure with a dedicated interface that provides a simple and intuitive way of using it. In order to work, locks rely on special CPU instructions that can test and set a memory location atomically.
An atom was defined by the ancient Greeks as the most fundamental building block of all matter. Being the most fundamental means that it cannot be broken or divided into any constituent parts, because if it could, those constituent parts would be the most fundamental building blocks instead. Science has since progressed a lot, and we have found that real atoms are actually made of smaller sub-particles. But this is beside the point.
In programming, we have borrowed the ancient meaning of the term atom, and we similarly define something to be atomic if it cannot be broken apart into smaller pieces. In the world of processor instructions, being broken apart means being interrupted.
And now what we learned in episode I comes into play, namely that even apparently simple instructions are not atomic, because they can be broken down into even simpler operations that can be executed separately. This is where the special instructions we referred to above come in. The CPU guarantees that those instructions cannot be interrupted. You might encounter another term for this concept, namely “transaction” or “transactional”.
Such an instruction either executes entirely, or it does not execute at all.
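To make this concrete, here is a small sketch in Python (just one possible choice of language). Python’s `threading.Lock` stands in for the atomic test-and-set instruction the hardware provides: four threads perform a read-modify-write on a shared counter, and the lock guarantees that each increment completes as one uninterruptible unit.

```python
import threading

counter = 0
lock = threading.Lock()

def locked_increments(n):
    global counter
    for _ in range(n):
        # The lock makes the read-modify-write below behave as one
        # uninterruptible (atomic) unit with respect to other threads.
        with lock:
            counter = counter + 1

threads = [threading.Thread(target=locked_increments, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: every increment completed without interleaving
```

Without the lock, increments from different threads could interleave mid-operation and some updates would be lost.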
OK, now that we have clarified atomicity, let’s list the various types of locks we have.
I will not explain spinlocks in detail, because they require the programmer to implement them at the application level, and as such they are always prone to buggy implementations. However simple they are, they are notoriously difficult to implement correctly in any programming language. They have the potential to be the most efficient locking mechanism, but they can easily be misused and may lead to wasted CPU resources through busy-waiting. Let’s just say that spinlocks might be useful in very special corner cases, but you’re not going to need them unless you understand the other locks perfectly and you realize they are not good enough for your use case. And even then, spinlocks might not be what you need. A double-edged blade, not to be used by unsupervised children.
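For the curious, here is a toy spinlock sketch in Python. The non-blocking `Lock.acquire()` call stands in for the atomic test-and-set instruction, and the busy-wait loop is exactly the CPU-wasting behavior mentioned above:

```python
import threading

class ToySpinlock:
    """A sketch of a spinlock. Real spinlocks rely on an atomic
    test-and-set CPU instruction; here a non-blocking Lock.acquire()
    stands in for that hardware primitive."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Busy-wait ("spin") until the flag is grabbed. This burns CPU
        # time instead of putting the thread to sleep, which is why
        # spinlocks only pay off for very short critical sections.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

# Usage: two threads protect a shared counter with the spinlock.
spin = ToySpinlock()
total = 0

def work():
    global total
    for _ in range(10_000):
        spin.acquire()
        total += 1
        spin.release()

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # 20000
```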
Imagine you are at school or at a theater and you want to use the toilet. The toilet facilities have a certain number of stalls. For obvious reasons, only one person can use a stall at a time. When recess comes, some people rush to use the toilets. And as usual, Mother Nature has the bad habit of calling more people to urgent “meetings” than there are available stalls. Let’s also imagine that you are not allowed to enter the toilet-room at all if there are no available stalls inside. After all, there is no reason to enter if all the stalls are taken. The door simply locks when all the stalls are occupied, and it unlocks when a person leaves the toilet. Think of that door as being our semaphore.
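The stall scenario maps directly onto a counting semaphore. A minimal Python sketch, assuming three stalls and ten visitors (both numbers are invented for illustration):

```python
import threading

STALLS = 3                              # assumed stall count
door = threading.Semaphore(STALLS)      # the "door" to the toilet-room

in_use = 0
peak = 0
count_lock = threading.Lock()           # protects the occupancy counters

def visitor():
    global in_use, peak
    with door:                # blocks while all STALLS stalls are taken
        with count_lock:
            in_use += 1
            peak = max(peak, in_use)
        # ... use the stall ...
        with count_lock:
            in_use -= 1

people = [threading.Thread(target=visitor) for _ in range(10)]
for t in people:
    t.start()
for t in people:
    t.join()

print(peak)  # never more than 3, no matter how the threads interleave
```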
A mutex is basically a simpler and more efficient form of semaphore. It behaves similarly, but it allows only a single entity access to the shared resource. It’s the same example as above, with only one stall. Or imagine that you work in an office with many others, and you have only a single printer on your floor (which is usually the case). You have to print a large document with many pages. You start your printing request, the printer confirms the job, and you receive the confirmation. Until your printing job is completed, any subsequent printing requests from you or other people will wait in the printer’s queue.
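A minimal Python sketch of the printer scenario, using `threading.Lock` as the mutex (the job names and page counts are invented for illustration):

```python
import threading

printer = threading.Lock()    # the single shared printer (our mutex)
output = []                   # what comes out of the printer's tray

def print_document(name, pages):
    # Only one thread may hold the printer at a time, so the pages of
    # one document are never interleaved with another document's pages.
    with printer:
        for page in range(pages):
            output.append((name, page))

jobs = [threading.Thread(target=print_document, args=(who, 5))
        for who in ("alice", "bob")]
for t in jobs:
    t.start()
for t in jobs:
    t.join()

# Each document's 5 pages come out contiguously and in order.
print(output)
```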
Imagine you want to board an airplane with your family. You want to enter the bird together, as a unit, not separately. The family doesn’t want to be split, with some members taking off and others being stuck in the airport. Usually, while in an airport, everyone is wandering through shops or visiting the restroom. So the first family members to arrive at the gate will wait for the rest of the family to join them, before boarding together. The locking mechanism modeling this use-case for threads is called a barrier.
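A minimal Python sketch of the boarding scenario, assuming a family of three (the names are invented for illustration):

```python
import threading

FAMILY_SIZE = 3                          # assumed family size
gate = threading.Barrier(FAMILY_SIZE)    # nobody boards until all arrive
boarded = []
boarded_lock = threading.Lock()

def family_member(name):
    # ... wander through the shops, visit the restroom ...
    gate.wait()               # wait here until the whole family arrived
    with boarded_lock:
        boarded.append(name)  # board only after everyone is at the gate

family = [threading.Thread(target=family_member, args=(who,))
          for who in ("mom", "dad", "kid")]
for t in family:
    t.start()
for t in family:
    t.join()

print(sorted(boarded))  # ['dad', 'kid', 'mom']: all three boarded together
```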
Okay, enough with the theory. Let’s go back to our multi-threaded numbering example, and let’s try to make it work as we intend to. I repeat, this kind of program shouldn’t have been multi-threaded at all in the first place, because of its simplicity.
So here is the unsynchronized, multi-threaded code again:
global number = 0

main():                          # thread (M)
    print("Listing numbers from 0 to 999:")
    create_thread(secondary)
    while (number < 999)
        number = number + 1

secondary():                     # thread (S)
    while (number < 999)
        print(number)
(M): is the main thread, and in this implementation it is responsible for incrementing [number] and deciding when the program ends
(S): is the secondary thread, and it is responsible for printing each value of [number]
The problem with this implementation was that the two threads – (M) and (S) – weren’t waiting for each other, and the result seemed random. What we want to do now is to synchronize these two threads, in other words make (M) and (S) wait for each other. Therefore:
(M) needs to wait for (S) to print the current value of [number], before it increments it again. This will eliminate “missing” values from the output
(S) needs to wait for (M) to increment [number] before printing its value. This will eliminate duplicates from the output
This is a circular case of producer-consumer, where each thread waits for the other to perform one step. It’s probably overly complicated, but this fact also helps me drive home a point later on.
The important part here is to identify the critical data that requires exclusive access. In our example, that data is the variable [number]. Only one thread at a time may access that variable if we want to avoid problems. In practice, only “write” access needs to be synchronized (against other read or write operations): if two or more threads read the same data concurrently, they won’t experience side-effects. It’s only when critical data is being modified that we need to ensure no other thread accesses that data in any way.
You may of course also mutually exclude read access between threads, but this is unnecessary and usually comes with a performance penalty, as the various threads need to wait for each other just to read the same data. If these threads run on different cores, you’ve just unnecessarily prevented them from running in parallel. Not to mention the penalty paid for simply entering or exiting the critical section.
In our example the thread (M) both reads and writes [number], while thread (S) only reads the current value of [number].
So let’s rewrite the program with pseudo-synchronization in place:
global number = 0

main():                          # thread (M)
    print("Listing numbers from 0 to 999:")
    create_thread(secondary)
    notify_incrementDone()       # allow (S) to print the initial 0
    do
        wait_printingDone()
        number = number + 1
        notify_incrementDone()
    while (number < 1000)
    print("Done!")

secondary():                     # thread (S)
    do
        wait_incrementDone()
        if (number < 1000)
            print(number)
            notify_printingDone()
    while (number < 1000)
Looking at this new code, we see that it is a bit more complex than before. The program (the main thread – (M)) starts at main() and prints the first line. Then, it creates the secondary thread (S) – just as before. The secondary thread starts and enters the (infinite) loop, waiting for an increment notification.
The next thing we do in (M) is to notify that we have incremented [number] (even though we haven’t really done it yet). This is done to allow thread (S) to print the value 0 as well. Then (M) enters the loop and waits for the printing (of 0) to be done. The increment notification reaches (S), which wakes up, verifies that [number] is less than 1000, and since that is true, it prints the value 0. Then (S) notifies that printing (of 0) has been done, before looping back up and waiting for the next increment notification.
Meanwhile (M) receives the printing done (for 0), wakes up, increments [number] to 1, notifies that the increment has been done, and then evaluates the end condition. Since 1 < 1000 is true, it restarts the loop and waits for printing of 1 to be done. And again (S) receives the notification that [number] has been incremented, compares the value to 1000, prints the value 1, then notifies (M) that printing is done, and so on.
This process repeats until [number] has the value 999. At that time, thread (S) again prints the value and notifies thread (M) that it has done so. The thread (M) receives the printing done notification (for 999), it increments [number] to 1000 and notifies (S) that it has done so.
At this point, one of two scenarios may occur:
(M) continues execution and immediately realizes that 1000 < 1000 is false, which causes it to exit the loop, printing the final line “Done!” and the program ends (effectively killing thread (S) too). The program always ends when the main function ends and the main thread dies, or
(S) gets execution from the scheduler, it receives the increment notification, compares [number] with 1000 and since 1000 < 1000 is false, it exits the loop and ultimately the function, and thread (S) dies. At this point (M) will resume, performs the final 1000 < 1000 comparison itself, prints the final line “Done!” and the program ends.
Either scenario produces what we want: the value 1000 is not printed, and the program ends one way or another.
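As a concrete illustration, here is one possible runnable translation of the pseudocode into Python, using two `threading.Event` objects as the notification mechanism (a condition variable would work just as well). The bound is lowered from 1000 to 10 to keep things short, and values are collected in a list instead of printed; the names `increment_done` and `printing_done` are my own:

```python
import threading

N = 10                    # the article uses 1000; 10 keeps the output short
number = 0
increment_done = threading.Event()   # (M) -> (S): "I incremented [number]"
printing_done = threading.Event()    # (S) -> (M): "I printed [number]"
printed = []                         # stands in for print(number)

def secondary():                     # thread (S)
    while True:
        increment_done.wait()        # wait for an increment notification
        increment_done.clear()
        if number >= N:              # validated read of [number]
            break
        printed.append(number)
        printing_done.set()          # notify (M) that printing is done

def main_program():                  # thread (M)
    global number
    s = threading.Thread(target=secondary)
    s.start()
    increment_done.set()             # first notification: let (S) handle 0
    while True:
        printing_done.wait()         # wait for the current value to print
        printing_done.clear()
        number = number + 1
        increment_done.set()         # notify (S) of the new value
        if number >= N:
            break
    s.join()

main_program()
print(printed)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]; N itself is never printed
```

The two events enforce the strict ping-pong between the threads: each thread blocks until the other has completed its step, exactly as in the pseudocode walkthrough above.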
In this (complicated) way, we have given specific instructions so that, regardless of the scheduling algorithm of the operating system, these two threads work in this particular way. One thread “produces” the new number, the other “consumes” it. And the CPU obeys.
Please note that (S) reads the value of [number] twice inside the critical section: once in the comparison, and a second time during printing. Having exclusive access to [number] means that (M) cannot interrupt (S) between those two read operations to sneakily increment [number] and thus cause the program to malfunction (i.e. printing a non-validated value).
On the other hand (M) reads the value of [number] twice, and writes once to it. The increment operation number = number + 1 is in fact a read followed by a write. The processor needs to first read the current value of [number] from memory, before it calculates the new value by incrementing and storing it back to memory. The second time (M) reads the value of [number] is in the loop comparison.
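You can actually observe this decomposition in CPython’s bytecode (the exact opcode names vary between interpreter versions):

```python
import dis

number = 0

def increment():
    global number
    number = number + 1

# One source line expands into separate load, add, and store steps;
# the scheduler may switch threads between any two of them.
dis.dis(increment)
```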
In this synchronized version of the program, (S) will never print the same value twice, since before printing each value it waits for an increment notification from (M). Conversely, (M) will not increment repeatedly or interrupt thread (S) in a way that causes missing or invalid values to be printed, because it waits for the current value of [number] to be printed before incrementing it again.
People with good memory will remember that in the previous episode we changed the loop condition in (M) from while(number < 1000) in the single-threaded example to while(number < 999) in the unsynchronized multi-threaded example, only to switch it back to 1000 in the synchronized version. We could do this because the synchronization eliminated the race condition with all the undesired side effects we talked about – which was the whole purpose of the synchronization effort in the first place. As such, there is no risk that the value 1000 is printed by mistake.
Which locking mechanism is best suited in this example?
What would the program print if the first notify_incrementDone() in (M) – just before “do” – were missing?
Is the loop comparison in (M) inside or outside the critical section?