Multiprocessing at a Glance (I): Processors and Cores
Since the Zen architecture was launched, we have seen a steady increase in core counts across the whole range of processors from both the Red and Blue teams. For this reason, we would like to share a general understanding of modern computing, with a focus on multi-processing.
This mini-series is written mostly as an introduction to the subject. We will go through how processors work, what single-core and multi-core mean, what multi-tasking and multi-threading are, and what challenges and benefits they bring. We will also go through a couple of examples.
In this first episode, let’s try to understand together how processors work, at a very abstract level, and what a CPU core is. The focus is on the logical concept of the CPU, not on the transistor-level details and the electrical signaling magic that happens inside. We will not focus on any particular processor architecture.
Processors are devices capable of executing long series of instructions which together make up computer programs. Those instructions are typically split into two large categories: arithmetic/logic instructions on one hand, and memory access instructions on the other. In general, any instruction is performed in these stages:
Load is the action of fetching the next instruction into the CPU
Decode means breaking the instruction into its constituent parts (the instruction code, the input parameters and the output parameters). Moreover, decoding may result in a series of discrete and atomic operations to be executed serially as part of executing the instruction (we will explain what atomic means in a future episode)
Read means fetching the operands (data) that the instruction needs from memory into the registers of the CPU, once decoding has told us what they are
Execute is the actual operation performed on the input data
Write-back is storing the result of the instruction back into memory in order to free the CPU resources for the next instruction
The main idea here is that instructions are divided into sub-operations. This point will become relevant in future episodes.
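To make the stages above a bit more tangible, here is a minimal sketch of a toy processor that walks each instruction through load, decode, read, execute and write-back. Everything in it is an assumption made for illustration: the ToyCPU class, the two opcodes (ADD and MUL), the tuple-based instruction format and the named memory cells do not correspond to any real architecture.

```python
# Toy model of the five stages for a made-up instruction format.
# Nothing here corresponds to a real instruction set.
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    memory: dict                      # address -> value (data memory)
    program: list                     # list of (opcode, src_a, src_b, dst) tuples
    pc: int = 0                       # program counter: index of the next instruction
    trace: list = field(default_factory=list)

    def step(self):
        # Load: fetch the next instruction into the CPU.
        raw = self.program[self.pc]
        self.pc += 1
        # Decode: split the instruction into opcode, inputs and output.
        opcode, src_a, src_b, dst = raw
        # Read: bring the operands from memory into local "registers".
        reg_a, reg_b = self.memory[src_a], self.memory[src_b]
        # Execute: perform the actual operation on the inputs.
        if opcode == "ADD":
            result = reg_a + reg_b
        elif opcode == "MUL":
            result = reg_a * reg_b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        # Write-back: store the result back into memory.
        self.memory[dst] = result
        self.trace.append((opcode, reg_a, reg_b, result))

    def run(self):
        # Keep stepping until the program counter runs past the program.
        while self.pc < len(self.program):
            self.step()
        return self.memory


cpu = ToyCPU(
    memory={"x": 2, "y": 3, "z": 0, "w": 0},
    program=[("ADD", "x", "y", "z"), ("MUL", "z", "y", "w")],
)
final_memory = cpu.run()
print(final_memory)
```

Note that this toy finishes all five stages of one instruction before it touches the next one, which is the simplest possible arrangement.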
Processors are very complex devices, containing billions of transistors organized in various sub-components that need to work together in perfect harmony. Today, we count transistors on modern CPUs in the billions, but as processors increase in size, complexity and transistor density, we will sooner or later end up with processors that have trillions of transistors. Can any of you visualize such big numbers as a billion or a trillion? I certainly can’t. The technological level required to achieve this is truly awe-inspiring.
In order to allow all those countless transistors to work together, microprocessors use a high-frequency clock. The clock is actually driven by the motherboard, but the CPU has an internal component called a clock multiplier that allows it to multiply the base clock provided by the motherboard in order to obtain the frequency used internally by the processor. A typical base frequency is 100 MHz, and a processor with a multiplier of x40 will run at 4 GHz. Modern CPUs can vary their multiplier at runtime, resulting in variable CPU frequency.
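The multiplier arithmetic can be written down in a few lines. This is only a sketch of the relationship described above; the function names are made up for this example, and the multiplier of 8 used for the low-power case is an assumed value, not a figure from any specific processor.

```python
BASE_CLOCK_MHZ = 100  # typical base clock mentioned in the text


def core_frequency_ghz(multiplier, base_clock_mhz=BASE_CLOCK_MHZ):
    # Internal frequency = base clock x multiplier (converted from MHz to GHz).
    return base_clock_mhz * multiplier / 1000


def cycle_time_ns(frequency_ghz):
    # Duration of one clock tick in nanoseconds.
    return 1 / frequency_ghz


boost = core_frequency_ghz(40)   # the x40 example from the text
idle = core_frequency_ghz(8)     # assumed low-power multiplier, for illustration
print(boost, cycle_time_ns(boost))
print(idle, cycle_time_ns(idle))
```

At 4 GHz a single tick lasts a quarter of a nanosecond, which gives a sense of how little time each component has to do its part.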
Try to imagine the sound of a ticking wall clock: on each of those ticks, the various components inside the processor perform a single operation. Everything inside the processor is governed by that clock. This ensures signal stability and data coherence between different components of the CPU. For a visual equivalent, imagine a North-Korean mass game with billions (not just mere thousands) of participants, each of them performing their choreography together. The “clock” in this case is the group of people (usually not seen on camera) who hold up colored flags to indicate the next move and to signal the crowd when to perform it (I know this detail, because I once was one of those pixels).
In modern processors, several instructions are “in execution” at the same time, at various stages of completion inside the CPU. This mechanism is called instruction pipelining. Basically, it means that the CPU can perform the stages mentioned above (load, decode, execute, etc.) in parallel, each stage working on a different instruction. Pipelining allows processors to be more efficient by not having to wait for an instruction to complete all the stages before starting the next instruction.
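A small sketch can show why this overlap pays off. It assumes an idealized pipeline in which every stage takes exactly one clock and nothing ever stalls, which real processors do not guarantee; the five stage names are the ones used earlier in this article, and the function names are invented for this example.

```python
# Idealized pipeline model: every stage takes exactly one clock and nothing
# ever stalls. Real pipelines are messier; this only shows the overlap.
STAGES = ["load", "decode", "read", "execute", "write-back"]


def cycles_without_pipelining(instruction_count, stage_count=len(STAGES)):
    # Each instruction finishes all stages before the next one starts.
    return instruction_count * stage_count


def cycles_with_pipelining(instruction_count, stage_count=len(STAGES)):
    # The first instruction needs stage_count clocks to fill the pipeline;
    # after that, one instruction completes on every clock.
    if instruction_count == 0:
        return 0
    return stage_count + instruction_count - 1


def pipeline_schedule(instruction_count, stage_count=len(STAGES)):
    # Row c lists, for each stage, the index of the instruction occupying it
    # during clock c (None means the stage is empty on that clock).
    schedule = []
    for clock in range(cycles_with_pipelining(instruction_count, stage_count)):
        row = []
        for stage in range(stage_count):
            instruction = clock - stage
            row.append(instruction if 0 <= instruction < instruction_count else None)
        schedule.append(row)
    return schedule


for clock, row in enumerate(pipeline_schedule(3)):
    print(clock, row)
print(cycles_without_pipelining(10), cycles_with_pipelining(10))
```

Under these assumptions, ten instructions need 50 clocks when executed one after another, but only 14 clocks when pipelined.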
There are also times when there is nothing for the processor to do: no program is actively running during the current clock. What does the processor do in this case? Does the clock stop? It can’t, since the motherboard keeps generating the base clock, and as such the CPU clock keeps ticking too. As mentioned, the processor may decrease its clock multiplier as it drops to a lower-energy state, but it never really stops. During those periods the processor may execute special instructions called NOP (no-operation), which produce no effects. So a powered-on processor never really sleeps.
Recently, this state of affairs has been changing somewhat as power efficiency becomes more and more important. Today, processors are able to completely turn off individual cores to save energy through a process called power gating. This also reduces heat build-up inside the processor package, which ultimately allows the processor to boost higher and for longer when needed. However, as a whole, the processor still never sleeps.
Complex instruction-set computer systems (CISC) have many different instructions, some simpler than others. That means that not all instructions execute equally fast inside a CPU. Some instructions take more CPU clock cycles to execute, others fewer. The average number of executed (meaning finished) instructions per CPU clock cycle is referred to as IPC (Instructions Per Cycle).
IPC and the clock-frequency are two of the main parameters expressing the performance of a processor. There are other indicators used to compare processors, like power efficiency – which expresses how much electrical power is consumed by the processor to perform a certain task, but that is another discussion.
The main strategy to make processors faster is to either increase the frequency or to improve the IPC, preferably both.
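The way IPC and frequency combine can be sketched as a simple product. The numbers below are invented purely for illustration (they do not describe any real processor), and the function names are assumptions made for this example.

```python
def ipc(instructions_finished, clock_cycles):
    # Average number of finished instructions per clock cycle.
    return instructions_finished / clock_cycles


def instructions_per_second(ipc_value, frequency_hz):
    # Throughput is the product of the two parameters discussed above.
    return ipc_value * frequency_hz


# Hypothetical measurement: 6000 instructions finished in 4000 clock cycles.
measured_ipc = ipc(6000, 4000)

# Two made-up processors: A clocks higher, B has the better IPC.
cpu_a = instructions_per_second(1.0, 5e9)            # IPC 1.0 at 5 GHz
cpu_b = instructions_per_second(measured_ipc, 4e9)   # IPC 1.5 at 4 GHz
print(measured_ipc, cpu_a, cpu_b)
```

In this made-up comparison the processor with the lower clock still comes out ahead, which is why frequency alone is not enough to compare two CPUs.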
Frequency increase is straightforward to understand: just make the clock tick faster and faster. One might wonder why there is a frequency limit for each processor. Well, if we go back to our North-Korean mass game analogy, different mass game teams perform differently. The faster the flag people “tick”, the more tired the performers get, as they need to switch positions faster (imagine tiredness as heat accumulating in the CPU). Also, past a certain tick rate, people would have to perform their moves so fast that they might get out of sync with each other, ruining the harmony. In the processor world, any deviation from this harmony usually leads to a crash.
IPC improvements can usually be achieved by making memory access faster (check out this article on memory cache), by designing a better overall CPU architecture that needs fewer clocks per instruction, or by increasing the number of processing units.
In the late 1980s, system developers considered the idea of adding several processors to the same system. As such, multi-socket systems were created in which two or even four separate processors were installed. This solution provided a relatively easy (but expensive) way of increasing the total throughput of the system. It was expensive because designing motherboards that support multiple CPUs is not easy, and there are added complications regarding access to shared resources.
There are two main strategies when it comes to multi-processing: symmetric and asymmetric.
Asymmetric multi-processing means that the processors present in a multi-socket system are either not identical (and as such perform different roles), or if they are, they are not able to share resources between themselves. In such cases, the system behaves as two separate systems: separate operating systems, separate applications, etc. These are mostly used in very specific custom solutions.
Symmetric multi-processing is more akin to today’s multi-core systems, except that the “cores” were in fact fully fledged processors. These systems were able to exchange memory and signals between the processors, and as such they could act as a single unified system. As we will see in a future episode where we talk about multi-tasking operating systems, these systems were the first ones that could truly take advantage of multi-tasking.
However, as these systems became more and more popular, they quickly revealed that the data exchange latencies between tasks running on different processors were very large. The reason is that the data exchange was done via RAM; there simply was no faster way.
Later on, a new idea emerged: bringing these separate processing units closer together within the same package and calling them cores. Combined with a cache-sharing architecture, this allows multi-core CPUs to massively increase their IPC compared with previous methods.
The first instance of a multi-core CPU is IBM’s POWER4, a server CPU launched in 2001, but most people will probably remember the first consumer multi-core CPUs such as AMD’s Athlon 64 X2 and Intel’s Pentium D, both released in 2005.
Compared to multi-socket systems, multi-core systems have several advantages: lower latencies between cores, simpler motherboard architecture, reduced manufacturing costs, etc. But that still doesn’t stop system integrators from creating multi-socket, multi-core systems, particularly for server environments.
In the beginning, multi-core CPUs only had two cores, then two became four, then eight, and now we are in an age when the number of cores inside the CPUs is increasing incredibly fast from year to year to values that would have been considered ridiculous only a few years ago. All this has mainly been driven by the revolutionary Zen architecture and AMD’s pricing model which brought high core-count processors to the masses.
However, this trend of adding cores will have to stop at some point due to Amdahl’s Law, which is basically the law of diminishing returns for parallel processing: the part of a program that cannot be parallelized limits how much any number of extra cores can help. If I’m allowed a dark joke here, someone should tell that aging Terminator that Amdahl is a bigger threat to Skynet than John and Sarah are, and that he should instead travel back in time to before 1967.
But cores alone cannot make all the difference. In order to utilize them efficiently, the software needs to be (re)written in such a way that it takes advantage of them. This applies especially to operating systems and, increasingly, to user applications as well.
To be continued…
Allow me to end with this: I mentioned before that I am not able to visualize a trillion, and I feel that I need to explain myself.
Let’s try to visualize a million first, by imagining a kilometer broken up into a million millimeters. We can “see” and mentally estimate a kilometer on a long straight road, and we can “see” how small a millimeter is. As such, we can mentally compare the two lengths linearly and understand how huge one million is (compared to one). To linearly visualize a trillion though (and even a billion for that matter), the distances would exceed the human scale either at the high end, or at the low end, or at both ends at the same time! That is because we cannot “see” a thousand kilometers, nor a micrometer for that matter.
In order to visualize a ratio of a trillion to one, we could use areas instead (a square kilometer and a square millimeter), but I’d argue that area comparison is in fact very unintuitive for human beings. If we were asked to visualize two circles, one twice the size of the other, our immediate intuition would cause us to imagine two circles, one having its diameter twice as large as the other (which obviously is the wrong answer, since that circle has four times the area).
Those of us who are mathematically inclined would immediately realize the mistake and correct the image mentally, but nonetheless our first thought would be to compare them linearly. We think we know how big a billion or a trillion is, but the honest truth is that, for most of us, these numbers are simply too large to visualize intuitively (linearly).
And this makes the achievement of cramming so many transistors into only a few square centimeters that much more impressive.