The basic principle behind SIMD is very simple: we can add and subtract numbers just like normal, but instead of doing one at a time, with SIMD we do several at a time. Multiple numbers grouped together are often called vectors in linear algebra, and if we arrange them in two dimensions it’s called a matrix. We can leave matrix operations for another time, but let’s explore vector instructions a bit.
This is a vector with four numbers. Let’s try adding another one:
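As a small worked example, with values chosen here purely for illustration, adding two vectors of four numbers means adding the numbers in matching positions:

```latex
\[
\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}
+
\begin{pmatrix} 10 \\ 20 \\ 30 \\ 40 \end{pmatrix}
=
\begin{pmatrix} 1 + 10 \\ 2 + 20 \\ 3 + 30 \\ 4 + 40 \end{pmatrix}
=
\begin{pmatrix} 11 \\ 22 \\ 33 \\ 44 \end{pmatrix}
\]
```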
Taking this back to computers, when adding single numbers, it might conceptually look something like this:
mov eax, 4
add eax, 6
The first instruction puts the number 4 in the register named EAX, and the second instruction adds 6 to whatever is in EAX (we know it’s 4). Let’s look at the related SIMD instruction:
vaddpd ymm1, ymm2, ymm3
This will perform ymm1 = ymm2 + ymm3. The key difference we’re interested in for now is that the ymm registers are 256 bits wide instead of the 32 bits of EAX, and can hold four separate 64-bit numbers instead of the single number in the previous example. For our purposes here we can imagine that each instruction takes a single clock cycle, so it’s easy to see that if we want to do four additions, it’s faster to do them all in a single operation than to do four addition instructions in sequence.
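To make the comparison concrete, here is a minimal C sketch of the same idea. It assumes a GCC- or Clang-compatible compiler, because the vector_size attribute is a compiler extension rather than standard C, and the names v4d, add4_scalar and add4_vector are made up for this illustration. When built for an AVX-capable x86 target, the compiler can typically turn the vector addition into a single vaddpd; on other targets it falls back to whatever the hardware offers.

```c
#include <assert.h>
#include <string.h>

/* Four 64-bit doubles packed into one 256-bit value.
   vector_size is a GCC/Clang extension, not standard C. */
typedef double v4d __attribute__((vector_size(32)));

/* Scalar version: four separate additions, one element at a time. */
void add4_scalar(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* Vector version: one addition expression covering all four lanes. */
void add4_vector(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    v4d va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load four doubles */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;                /* lane-wise add of all four pairs */
    memcpy(out, &vc, sizeof vc); /* store four results */
}
```

Calling both functions with {1, 2, 3, 4} and {10, 20, 30, 40} gives {11, 22, 33, 44} either way; only the number of add operations differs.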
Single Instruction Multiple Data
This only works if we’re doing the same operation with all numbers though. If we want to add the first numbers, subtract the second numbers, multiply the third, and so on, then we can’t benefit from vector instructions. This is why it is named Single Instruction, Multiple Data. Vector instructions can still do multiplication though, if we want to multiply vectors. They can also do division, bit manipulation, logical operations, shuffling of data within a register, and many other things that we won’t get into here, and vector math libraries build functions such as sine and cosine on top of them. It is also possible to choose to work with, let’s say, 64-bit numbers as in the example above, or 32-bit numbers, even down to 8-bit numbers for some operations, and for others all the way up to 256 bits or more. Seasoned assembly language artists will know that the examples above are extremely simplified, mix integers and floating-point numbers, and that we would of course never even do addition as in the first example. It’s pseudo code, alright!
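The trade-off between element size and element count is plain division: the register width divided by the width of each number. A tiny C sketch of that arithmetic follows; the function name lanes_per_register is made up for this illustration.

```c
#include <assert.h>

/* Number of elements ("lanes") that fit in one SIMD register.
   Both widths are in bits; element_bits must divide register_bits. */
int lanes_per_register(int register_bits, int element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

A 256-bit register therefore holds four 64-bit numbers, eight 32-bit numbers, or thirty-two 8-bit numbers, and every one of them goes through the same operation.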
What’s interesting for benchmarking purposes, though, is to understand why AVX workloads are more demanding than scalar workloads. Simply put, AVX instructions do more work per clock cycle, require more or larger execution units (depending on the CPU), and require more memory transfers to fetch the data. They exercise more transistors, which takes more power, and more power generates more heat, so it’s fairly straightforward to see how they end up being more demanding. Modern CPUs actually spend most of their time idle, and even under full load it’s rare and fairly challenging to really light up all the transistors and put them to effective use.
Current mainstream Core processors from Intel and Ryzen processors from AMD both support AVX2, which is 256-bit-wide SIMD with a wide range of operations. Historically, there was SSE, which was 128 bits wide, and in the early days there was even 64-bit-wide SIMD. Intel’s server and high-end desktop platforms also support something called AVX-512. This extends the width to 512 bits, meaning it can do twice as many operations in one go. It also extends the instruction set quite drastically, adding things like a vector popcount, which counts the number of set bits in each element of a 512-bit register, and many others. At the time of writing, AVX-512 is not supported by any AMD processors, but it would presumably be supported sometime in the future.
As it turns out, what really ends up limiting performance in most cases is not execution time itself, but memory transfer speeds. This is certainly true for the mainstream platforms with dual-channel memory, but also for systems with four or more memory channels. For this reason, even though in the case above we might have expected a 4x speedup, we’re not quite going to see that level of speedup in most cases. A speedup of 2x to 3x would be more normal, though it’s certainly possible to achieve more in special cases.
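One way to picture this is a toy model in which a loop takes as long as its slower half, either computing or moving data. The C sketch below rests on that assumption (computation and memory transfers overlap perfectly), uses arbitrary time units, and the names loop_time and simd_speedup are made up for this illustration. It is not a measurement of any real CPU.

```c
#include <assert.h>

/* Toy model: a loop takes as long as its slower part, either
   computing or moving data. Units are arbitrary. */
static double loop_time(double compute_time, double memory_time)
{
    return compute_time > memory_time ? compute_time : memory_time;
}

/* Estimated speedup when SIMD divides the compute time by `lanes`
   but leaves the memory time unchanged. */
double simd_speedup(double compute_time, double memory_time, int lanes)
{
    assert(lanes > 0 && compute_time > 0.0 && memory_time >= 0.0);
    double scalar = loop_time(compute_time, memory_time);
    double vector = loop_time(compute_time / lanes, memory_time);
    return scalar / vector;
}
```

With four lanes, a loop that needs 4 units of compute and 1 unit of memory time gets the full 4x in this model, but if the memory time is 2 units the speedup drops to 2x, and if memory already dominates the scalar loop there is no speedup at all.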