Multiprocessing at a Glance (VII): Practical example

Whether you’re celebrating or not, I wish Happy Holidays to all of you!

Let’s apply what we’ve learned so far, by writing an application that uses multi-threading, using all the best practices we’ve been talking about. This time we will not use an example so simple that it’s almost stupid to multithread. We will also replace the generic notify_…() and waitFor_…() functions with the concrete lock mechanism we want to use.

By this time, I am assuming that the readers who got through and keep returning to this mini-series do have at least some basic coding knowledge and are able to read and understand pseudo-code.

Unfortunately, I wasn’t able to come up with some original example, so – with your permission – let’s try to write an application that simulates the behavior of Cinebench (developed by Maxon). Cinebench is one of the popular benchmarks used by the CPU enthusiasts to compare processor performance, so what better example to use than a software we are all familiar with?

Screenshot of Cinebench R20

Let’s summarize the core functionality of Cinebench R20. The program starts in a stand-by mode, allowing the user to Run a benchmark. When the button is pressed, the software prepares the benchmark, then proceeds to execute it, and then at the end presents a digested numerical value that can be used to compare the performance of the processors.

Obviously, we don’t know exactly how Cinebench is implemented, and there are many ways in which it may be implemented regarding the multi-threading aspect. It is safe to assume though that this is a multi-threaded program, which instantiates as many threads as there are available in the processor. The total amount of threads is calculated by multiplying the amount of cores with the SMT factor. As of today, the SMT uses 2 threads per core, so we will use that as well.

So let’s start by writing the main function taking into account how the software works in general.

global benchmarkStart
global benchmarkDone
global threadCount
global runCommandPressed

function main()
initialize()
do
lockMutex(runCommandPressed)
prepareBenchmark()

postSemaphore(benchmarkStart, threadCount)
waitForBarrier(benchmarkDone)

maxonCalculateAndDisplayScore()
while(program_not_stopped)

The program starts by calling the initialize() function. Once the initialization has been done, the main function enters a loop and waits for the user to press the Run command by waiting on the [runCommandPressed] mutex. The code that is called when that button is pressed, is responsible with unlocking that mutex so that the main function can continue. Once released, the main function will prepare the benchmark and then will post the [benchmarkStart] semaphore indicating to all the worker threads to start working.

Once the benchmark has started and the workers were released, the main function will then wait at the [benchmarkDone] barrier, before proceeding to calculate the score. The reason for waiting at a barrier is to ensure that it doesn’t proceed to calculating the score before all the worker threads have finished their jobs.

We don’t know and don’t care about how maxonCalculateAndDisplayScore() is implemented. Let’s just say it’s some time-based magic. And here is the function responsible with releasing the main thread to initiate the benchmark. We assume that this function will be called whenever the user presses the Run buton.

function runButtonCallback()
    unlockMutex(runCommandPressed)

The function initialize() is responsible with loading the necessary 3D scene for rendering and reading the processor details from the system. The same function also instantiates the necessary synchronization locking mechanisms used above.

global smtFactor = 2
global 3dScene = nil

function initialize()
    3dScene = load3DScene()
    threadCount = read_processor_core_count() * smtFactor
    createMutex(runCommandPressed, LOCKED)
    createMutex(tileAccess, UNLOCKED)
    createSemaphore(benchmarkStart, threadCount)
    createBarrier(benchmarkDone, threadCount+1)

First, initialize() creates a mutex used to indicate that the user has pressed the Run button. Because the application needs to wait for the user to take action before starting the benchmark, the [runCommandPressed] mutex is created in a locked state.

Then the initialization logic creates another mutex called [tileAccess], which is used to ensure exclusive access to the tiles. This mutex is created unlocked, so as to allow the first calling worker to fetch a tile immediately.

Then it creates a semaphore which is used by the working threads to start rendering the scene in order to benchmark the processor. The semaphore is posted by the main function.

Finally, it creates a barrier with a count equal to the number of worker threads plus 1. Basically the main function will synchronize with all the worker threads to detect when all of them have finished the work, which indicates that the entire scene has been rendered.

During the benchmark preparation, the program will instantiate the necessary threads and perform final pre-processing on the 3D scene that needs rendering, including splitting the scene into a set of tiles to be rendered separately.

global tileSet = emptySet

function prepareBenchmark()
    tileSet = splitSceneIntoTiles()
    for (i in 1 .. threadCount)
        spawnThread(workerThread())

We will look later at the workerThread() function. For now, let’s notice that a complex data structure called [tileSet] is created for storing all the tiles to be rendered. This data structure will contain the unprocessed tiles. Since the worker threads need to fetch the next available tile for processing – and they do it concurrently, this data structure must be located in a critical section. The critical section is implemented in the next function:

function fetchNextAvailableTile()
    result = emptyTile
    lockMutex(tileAccess)
    if (notEmpty(tileSet))
        result = tileSet.removeFirstTile()
    unlockMutex(tileAccess)
    return result

This function will be called by the worker threads, as they need to get another tile to render. It first sets the result to [emptyTile], then it waits for exclusive access to the [tileSet] and when it enters the critical section it checks whether the set is already empty or not. If not empty, it will remove the first tile in the set and save it into the result variable before exiting the critical section.

We are almost done, the only thing left is to describe the worker threads. The worker threads are responsible to perform the 3D rendering, each working on one different tile at a time. As such, they first pend on the [benchmarkStart] semaphore. Then, as long as there are unprocessed tiles, they take one tile at a time and render it. We already know that there are more tiles to process as long as the fetchNextAvailableTile() returns a tile different than [emptyTile].

Finally, just before dying, the workers wait at the [benchmarkDone] barrier, in order to allow the main function to synchronize when the rendering is complete.

global emptyTile = nil

function workerThread()
    waitForSemaphore(benchmarkStart)
    do
        tile = fetchNextAvailableTile()
        if (tile != emptyTile)
            renderTile(tile)
    while(tile != emptyTile)
    waitForBarrier(benchmarkDone)

Now let’s put it all together into a single listing:

global smtFactor = 2
global threadCount = 0

global 3dScene = nil
global tileSet = emptySet
global emptyTile = nil

global runCommandPressed
global benchmarkStart
global benchmarkDone

function initialize()
    3dScene = load3DScene()
    threadCount = read_processor_core_count() * smtFactor
    createMutex(runCommandPressed, LOCKED)
    createMutex(tileAccess, UNLOCKED)
    createSemaphore(benchmarkStart, threadCount)
    createBarrier(benchmarkDone, threadCount+1)

function runButtonCallback()
    unlockMutex(runCommandPressed)

function prepareBenchmark()
    tileSet = splitSceneIntoTiles()
    for (i in 1 .. threadCount)
        spawnThread(workerThread())

function fetchNextAvailableTile()
    result = emptyTile
    lockMutex(tileAccess)
    if (notEmpty(tileSet))
        result = tileSet.removeFirstTile()
    unlockMutex(tileAccess)
    return result

function workerThread()
    waitForSemaphore(benchmarkStart)
    do
        tile = fetchNextAvailableTile()
        if (tile != emptyTile)
            renderTile(tile)
    while(tile != emptyTile)
    waitForBarrier(benchmarkDone)

function main()
    initialize()
    do
        lockMutex(runCommandPressed)
        prepareBenchmark()

        postSemaphore(benchmarkStart, threadCount)
        waitForBarrier(benchmarkDone)

        maxonCalculateAndDisplayScore()
    while (program_not_stopped)

And that’s about it. We have used two mutex-es (one to initiate the benchmark, the other for exclusive access to the tiles); a semaphore – to start all workers at the same time; and a barrier – to synchronize the main function and the worker threads before the benchmark score gets calculated.

In the next episode, we will go through a summary of what we’ve learned, and draw some conclusions.


Previous: Episode VI – Deadlocks and race conditions
Next: Episode VIII – Conclusion

More Stories
Multiprocessing at a Glance (II): Software
Do NOT follow this link or you will be banned from the site!