The basic principle behind SIMD is very simple: we can add and subtract numbers just like normal, but instead of doing just one at a time, with SIMD we do several at a time. Multiple numbers grouped together are often called vectors in linear algebra, and numbers arranged in two dimensions are called a matrix. We can leave matrix operations for another time, but let’s explore vector instructions a bit.
This is a vector with four numbers. Let’s try adding another one:
Taking this back to computers, when adding single numbers, it might conceptually look something like this:
mov eax, 4
add eax, 6
The first instruction puts the number 4 in the register named EAX, and the second
instruction adds 6 to whatever is in EAX (we know it’s 4). Let’s look at the
related SIMD instruction:
vaddpd ymm1, ymm2, ymm3
This instruction performs ymm1 = ymm2 + ymm3. The key difference we’re
interested in for now is that the ymm registers are 256 bits wide instead of
the 64 bits of a general-purpose register, and can hold four separate numbers
instead of just a single one as in the previous example. For our purposes here
we can imagine that each instruction takes a single clock cycle, so it’s easy
to see that if we’re interested in doing four additions, it’s faster to do
them all in a single operation than to do four addition instructions in
sequence.
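
To make the instruction-count argument concrete, here is a minimal Python sketch. It is only a model, not real SIMD: the function names `scalar_add` and `vector_add` and the input values are made up for illustration, and it keeps the text's simplifying assumption that every instruction costs one clock cycle.

```python
# Toy model of scalar vs. vector addition. Names and values are hypothetical;
# each "instruction" is assumed to cost one cycle, as in the text.

def scalar_add(a, b):
    """Model of a scalar add: one pair of numbers per instruction."""
    return a + b


def vector_add(v2, v3):
    """Model of vaddpd on 256-bit registers: four 64-bit lanes per instruction."""
    if len(v2) != 4 or len(v3) != 4:
        raise ValueError("a 256-bit register holds exactly four 64-bit numbers")
    return [x + y for x, y in zip(v2, v3)]


ymm2 = [1.0, 2.0, 3.0, 4.0]
ymm3 = [10.0, 20.0, 30.0, 40.0]

# Scalar path: one add instruction per pair of numbers.
scalar_result = []
scalar_instructions = 0
for x, y in zip(ymm2, ymm3):
    scalar_result.append(scalar_add(x, y))
    scalar_instructions += 1

# Vector path: a single instruction covers all four lanes.
ymm1 = vector_add(ymm2, ymm3)
vector_instructions = 1

print(scalar_result, "in", scalar_instructions, "instructions")
print(ymm1, "in", vector_instructions, "instruction")
```

Both paths produce the same four sums; the only difference in this model is that the scalar path issues four instructions where the vector path issues one.
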
Single Instruction Multiple Data
This only works
if we’re doing the same operation with all numbers though. If we want to add
the first numbers, subtract the second numbers, multiply the third, and so on,
then we can’t benefit from vector instructions. This is why it is named Single
Instruction, Multiple Data. Vector instructions can still do multiplication
though, if we want to multiply vectors. They can also do division, bit
manipulation, logical operations, shuffling data within a register, and many
other things that we won’t get into here. Functions such as sine and cosine
are typically provided by vector math libraries built on top of these
instructions rather than by a single instruction. It is also possible to
choose to work with, let’s say, 64-bit numbers as in the example above, or
32-bit numbers, even down to 8-bit numbers for some operations, and for others
all the way up to 256 bits or more. Seasoned assembly language artists will
know that the examples above are extremely simplified, mix integers and
floating point numbers, and we would of course never even do addition as in the
first example. It’s pseudo code, alright!
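
The two ideas in this section, one operation applied to every lane and a choice of element width, can be sketched in a few lines of Python. This is again a toy model under stated assumptions: `lane_count` and `simd_apply` are hypothetical helper names, the register width is fixed at the 256 bits of a ymm register, and the input values are arbitrary.

```python
import operator

REGISTER_BITS = 256  # width of a ymm register, as in the example above


def lane_count(element_bits, register_bits=REGISTER_BITS):
    """How many elements of the given width fit in one register."""
    return register_bits // element_bits


def simd_apply(op, a, b):
    """Apply the SAME operation to every pair of lanes (single instruction)."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

# One operation for all lanes: this is what vector instructions are good at.
sums = simd_apply(operator.add, a, b)
products = simd_apply(operator.mul, a, b)

# Narrower elements mean more lanes in the same 256-bit register.
lanes = {bits: lane_count(bits) for bits in (64, 32, 16, 8)}

print(sums)
print(products)
print(lanes)
```

Note that `simd_apply` takes a single `op` for the whole vector. Adding in lane one while multiplying in lane two has no equivalent here, which mirrors the restriction described above.
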
What’s interesting for benchmarking purposes though, is to understand why AVX
workloads are more demanding than scalar workloads. Simply put, AVX
instructions do more work per clock cycle, require more or larger execution
units (depending on which CPU it is), and require more memory transfers to
fetch the data. They exercise more transistors, which requires more power, and
more power generates more heat, so it’s fairly straightforward how they end up
being more demanding. Modern CPUs actually spend most of their time idle, and
even under full load it’s rare and fairly challenging to really light up all
the transistors and put them to effective use.
Current mainstream Core processors from Intel and Ryzen processors from AMD both support AVX2, which is 256-bit-wide SIMD and supports a wide range of operations. Historically, there was SSE, which was 128 bits wide, and in the early days there was even 64-bit-wide SIMD. Intel’s server and high-end desktop platforms also support something called AVX-512. This extends the width to 512 bits, meaning it can do twice as many operations in one go. It also extends the instruction set quite drastically, adding things like popcount, which counts the number of bits set in each element of a 512-bit vector, and many others. This instruction set is not supported by any AMD processors currently, but will presumably be supported sometime in the future.
As it turns out,
what really ends up limiting performance in most cases is not execution time
itself, but memory transfer speeds. This is certainly true for the mainstream
platforms with dual-channel memory, but also for systems with four or more
memory channels. For this reason, even though in the case above we might have expected a
4x speedup, we’re not quite going to see that level of speedup in most cases.
2x-3x would be more normal, though it’s certainly possible to achieve more in