Parallel computing has become commonplace in modern gaming rigs. We come across words like cores and threads all the time, and yet the rabbit hole of parallel computing goes far deeper than most of us dare to look. Here’s a quick overview of the various types of parallelism at work in a modern gaming rig.
A processor core is the part of a computer that runs programs, and a multi-core processor can run several programs in parallel on the same chip. These can be completely unrelated programs, or a single program written to take advantage of multiple cores. A typical Windows 10 system has hundreds if not thousands of little programs and parts of programs that all need to share the few processor cores in the hardware. The operating system’s scheduler switches all of them in and out of the cores seamlessly, without users ever noticing. Linux, macOS and gaming consoles work in effectively the same way, with some minor variations.
While cores have processing power of their own, hardware threads do not. Hardware threads are simply a way for two programs to run on a single core at the hardware level, a technique called simultaneous multithreading (SMT), which Intel markets as Hyper-Threading. SMT doesn’t add any processing power, but it can still make things a bit faster by keeping the core better utilized. In theory, the speedup could be anything between 0% and 100%, but for typical applications it’s more likely to be around 25-40%. Historically, turning SMT on could cause a slowdown in some cases, but on modern processors that is fairly rare in practice. In gaming rigs the de facto standard is 2-way SMT, while 4-way or more can be found in some high-end servers.
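To make the sharing of cores a little more concrete, here is a minimal Python sketch. The function name `run_tasks` and the task count are made up for illustration. It starts far more threads than a typical gaming rig has logical processors and lets the operating system's scheduler fit them in. Note that `os.cpu_count()` reports logical processors, so on a chip with 2-way SMT it is normally twice the number of physical cores. Also note that in the standard CPython interpreter these threads mostly take turns rather than running truly in parallel, so this shows the time-sharing side of the story, not a speedup.

```python
import os
import threading


def run_tasks(task_count):
    """Start task_count software threads and return the ids that finished."""
    finished = []
    lock = threading.Lock()

    def task(task_id):
        sum(range(10_000))  # a small amount of busy work
        with lock:
            finished.append(task_id)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(task_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until the scheduler has let every task run
    return sorted(finished)


if __name__ == "__main__":
    logical = os.cpu_count() or 1  # logical processors = cores x SMT ways
    done = run_tasks(200)
    print(f"{len(done)} tasks finished on {logical} logical processors")
```

Every task completes even though there are far more of them than processors, which is the scheduler doing its job behind the scenes.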
When a processor performs arithmetic operations on numbers, there is always an upper limit to how large or small the numbers can be. As a simple example, an 8-bit computer can add numbers in the range 0 to 255 with a single instruction, so 33+66=99 is no problem at all, but 333+77=410 is not possible, because 333 does not fit in 8 bits to begin with. Even something like 100+200=300 would typically be a problem, because the result also has to fit, which means it must be smaller than 256.
A 16-bit computer, however, can do all these examples with no problem at all. It can add numbers in the range 0 to 65535 using a single instruction. Now, as ancient as 8-bit computers are, numbers larger than 255 are even older, and people wanted to use them. The solution is simple: split each number into two 8-bit halves and use two additions with a carry, similar to how you might add large numbers by hand on paper. However, this takes two instructions, and so roughly twice as long as on the 16-bit computer. This is a trivial example, but computers are trivial beings, and while they can do other, more advanced operations as well, conceptually it’s all the same thing.
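The arithmetic above can be sketched in a few lines of Python. This is only a model of what an 8-bit adder does, and the function names `add8` and `add16_on_8bit` are invented for this illustration. Each 8-bit addition keeps the low 8 bits of the sum and reports the overflow as a carry. The 16-bit addition is then built from two of them.

```python
def add8(a, b, carry_in=0):
    """Model one 8-bit addition: return (8-bit result, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_on_8bit(a, b):
    """Add two 16-bit numbers using two 8-bit additions with a carry."""
    low, carry = add8(a & 0xFF, b & 0xFF)      # low bytes first
    high, carry = add8(a >> 8, b >> 8, carry)  # high bytes plus the carry
    return (high << 8) | low, carry


if __name__ == "__main__":
    print(add8(33, 66))            # (99, 0): fits in 8 bits
    print(add8(100, 200))          # (44, 1): 300 does not fit, carry is set
    print(add16_on_8bit(333, 77))  # (410, 0): two 8-bit steps give the right answer
```

The second call shows the problem case from the text: the 8-bit result wraps around to 44, and only the carry tells us that something was lost.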
This is an example of bit-level parallelism. Every time we double the “bits” of the processor, we theoretically double how much it can compute per instruction. But there are also diminishing returns, so while we got to 32-bit computers fairly quickly, the step to 64-bit computers took almost two decades. It’s possible to imagine a need for 128-bit computers, and the RISC-V processor architecture already includes a provisional 128-bit variant (RV128I) that may well end up in future gaming rigs or consoles, but it won’t be in the near future.
In addition to many cores and many bits, modern processors also have the ability to execute multiple instructions in parallel, even within a single core. As an example, a processor may read the next four instructions from memory and start executing them all in parallel. This is possible because each core has multiple execution units. Typically there are not four of each type, but different types of instructions can often be sent to different execution units in the core, and thereby do their work in parallel. A processor with this capability is called superscalar, and this is required if we want more than one instruction per clock. Of course, all modern processors are superscalar, so it’s not mentioned much anymore.
Processors differ in the way they do this instruction-level parallelism: how many instructions they can fetch at once, how many different execution units they have, how many instructions can be retired (finish execution) per clock cycle, and so on. In an ideal world, you would obviously want more of everything, but the limitations of chip design determine what is possible at any given time. The processor also splits each instruction up into parts in a pipeline, and then attempts to reorder these partial instructions (out-of-order execution) to speed up program execution as much as possible. To support this process, there is also register renaming to remove false data dependencies, branch prediction to avoid stalling when branching, and speculative execution to keep the pipeline filled. The details are far too complex for this article, but together, these technologies are what allow a core to perform more than one instruction per clock cycle.
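As a rough illustration of why data dependencies matter here, the following Python sketch is a toy model, not a description of any real processor. It assumes every instruction takes exactly one cycle, that up to `width` instructions can start per cycle, and that an instruction can only start once the registers it reads have been produced. The function name `cycles_needed` and the `(destination, sources)` format are invented for this example.

```python
def cycles_needed(program, width):
    """Count cycles for a list of (dest, sources) instructions.

    Toy assumptions: one cycle per instruction, at most `width`
    instructions started per cycle, idealized out-of-order issue.
    """
    ready_at = {}  # register name -> first cycle its value can be read
    issued = []    # issued[c] = number of instructions started in cycle c
    for dest, sources in program:
        # Earliest cycle in which all inputs are available.
        cycle = max((ready_at.get(src, 0) for src in sources), default=0)
        # Skip cycles whose issue slots are already full.
        while cycle < len(issued) and issued[cycle] >= width:
            cycle += 1
        while len(issued) <= cycle:
            issued.append(0)
        issued[cycle] += 1
        ready_at[dest] = cycle + 1
    return len(issued)


if __name__ == "__main__":
    independent = [("a", ()), ("b", ()), ("c", ()), ("d", ())]
    chain = [("a", ()), ("b", ("a",)), ("c", ("b",)), ("d", ("c",))]
    print(cycles_needed(independent, width=4))  # 1
    print(cycles_needed(independent, width=1))  # 4
    print(cycles_needed(chain, width=4))        # 4
```

Four independent instructions finish in one cycle on the 4-wide model, but a chain where each instruction needs the previous result gains nothing from the extra width. Much of the machinery described above exists to find and expose independent work.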
Intel has an article that explains the instruction pipeline in more detail.
When doing calculations on a computer, it’s common to come across cases where you want to apply the same operation to many numbers. For example, you might have 8 numbers, and you want to add the number 5 to each of them. This can be done with 8 addition instructions, one for each number. However, on modern processors there is a faster way. With vector instructions, all 8 numbers can be put in a single processor register, and all 8 additions can be performed at once, in parallel, by a single instruction. The modern variant of this for x86-64 systems is called AVX.
This is how SIMD (single instruction, multiple data) works. A program needs to be specifically written or compiled to take advantage of these instructions. They are typically associated with floating point calculations, but vector instructions can also operate on integers and other data types. The width of these registers is 256 bits for modern gaming rigs, which allows anywhere from 4 to 32 calculations to be done in parallel, depending on the type of data and operation.
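Plain Python has no direct access to vector instructions, so the sketch below only models the idea. It treats one large integer as a 256-bit register divided into equal lanes, with each lane wrapping around on its own. All names (`lanes`, `pack`, `unpack`, `vector_add`) are hypothetical helpers written for this example and are not part of AVX or any library.

```python
REGISTER_BITS = 256  # register width assumed in the text above


def lanes(element_bits):
    """How many elements of the given size fit in one register."""
    return REGISTER_BITS // element_bits


def pack(values, element_bits):
    """Place the values side by side in one 256-bit integer."""
    if len(values) != lanes(element_bits):
        raise ValueError("number of values must match the lane count")
    mask = (1 << element_bits) - 1
    register = 0
    for index, value in enumerate(values):
        register |= (value & mask) << (index * element_bits)
    return register


def unpack(register, element_bits):
    """Split a 256-bit integer back into its lanes."""
    mask = (1 << element_bits) - 1
    return [(register >> (i * element_bits)) & mask
            for i in range(lanes(element_bits))]


def vector_add(x, y, element_bits):
    """Model one vector add: every lane is added independently."""
    sums = [a + b for a, b in
            zip(unpack(x, element_bits), unpack(y, element_bits))]
    return pack(sums, element_bits)  # pack() wraps each lane on overflow


if __name__ == "__main__":
    numbers = pack([1, 2, 3, 4, 5, 6, 7, 8], 32)
    fives = pack([5] * 8, 32)
    print(unpack(vector_add(numbers, fives, 32), 32))
    # [6, 7, 8, 9, 10, 11, 12, 13]
    print(lanes(64), lanes(8))  # 4 32
```

The lane counts at the end are where the range of 4 to 32 parallel calculations comes from: a 256-bit register holds four 64-bit values or thirty-two 8-bit values. On real hardware the loop inside `vector_add` is what a single vector instruction replaces.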
For more on parallelism, Wikipedia has some good information.
WikiChip also has some data on actual implementations.