Note1: the following is a modified and shortened version of the transcript Jim used for his video on CPU caches. This transcript has been revised for an article format and is not a 1:1 copy. The original video can be seen at the bottom.
Note2: Analysis of the Zen2 early 3900X engineering sample userbenchmark results with the cache latency anomaly (at the end of the video) was not transcribed, as it is slightly outside the scope of this article.
Alright guys how’s it going?
In the mid 1990’s, I took a Bachelor of Science in Computing.
Now, the mid-90’s weren’t yesterday but I still remember much of what that course comprised. It covered both the hardware and the software side of “computing” and I learned to program “wonderful” languages such as COBOL and Pascal. I also learned about computer architecture and networking – before the internet really took off.
I think it’s fair to say that most of you know me well enough by now and you know I’m a hardware guy. I’m always talking about the ins and outs of the hardware itself, particularly the silicon.
So you might think that after finishing at University in the late 90’s, I went on to work in electrical engineering or in semiconductor manufacturing, right? Wrong. Those kind of opportunities didn’t really exist in the UK back then or were exceedingly rare. What did exist in very high demand were jobs in programming. And as it happens, back then – programming was my first love anyway.
So I actually became a programmer. I learned a variety of obscure and archaic programming languages and I worked for an American freight-forwarding company, who employed me as a UNIX and network programmer. At one point I even taught programming. The cool thing about back then in the 90’s was that everything was a bit simpler and yet it was still far easier to stand out because unlike today, not everybody was a programmer.
Back in the mid 90’s the hardware side of things was also a lot simpler. The 3D graphics cards we know today were in the research and development phase. Multi-core CPUs still didn’t exist either, but something else had started to change on CPUs around that time, and that change was the inclusion of cache memory.
To understand why cache memory ended up on the CPU means going back even further, to the early 80’s and the “home computer” boom. The first home computer processors ran at – by today’s standards – extremely low clocks speeds of single-digit MHz. Yes, megahertz.
My first computer, the Sinclair Spectrum 48K+, had a Z80 microprocessor – a “single core” CPU running at a dizzying 3.5MHz. Note though that back then nobody actually called these single core chips, they were simply processors, microprocessors or CPUs – which of course stands for Central Processing Units.
The CPU was known as the heart of the computer so the memory was the “brain” of the computer, and my Spectrum had a total of 48KB – Kilobytes – of RAM. Below you can see the ZX-Spectrum motherboard with the Z80 processor (framed in orange) and the memory chips (framed in yellow). We can see 8 chips with 2K RAM each for a total of 16KB RAM and these newer chips to the right were 4K RAM chips each, again 8, for a total 32KB. So in total, the computer had 48KB of RAM.
I first learned programming on my Spectrum when I was 8 or 9 years old. And forgive me this little tangent – but my father, who at the time was in his mid 60’s, also learned how to program on the Spectrum. He had been retired for years from work due to ill health. Like many other UK men of that generation he had been a coal miner all of his life, and in the end it killed him through lung disease. But I will never forget coming home from school, hoping to play some game on my computer yet not being able to get my dad to let me play because he was too busy coding. I wasn’t even 10 years old but the utterly captivating nature of computers was already crystal clear to me.
But as I was saying, I first learned programming when I was 8 or 9 years old. Back then I’d have been programming stuff like fighting games, where guys called “Bob” and “Tom” fought each other to the death using varied weaponry.
For this section of the video you’re gonna have to afford me a little bit of artistic license.
So let’s say you have a character in a fighting game called “Bob” and he has 10 hit points. Bob had an enemy called “Tom” who has a crossbow which causes 5 damage. I would code that in Sinclair Basic with a few of lines of code, something like this. Sinclair Basic required line numbering, which most languages ditched a long time ago since it made no sense.
10 bobhitpoints = 10
20 bobweapon = "knife"
30 bobweapondamage = 2
40 tomhitpoints = 6
50 tomweapon = "crossbow"
60 tomweapondamage = 5
70 print "Tom attacks Bob with a " + tomweapon + " for " + tomweapondamage + " damage!"
80 print "Bob has " + (bobhitpoints - tomweapondamage) + " hit points left!"
So looking through this amazing program we see some data. We have variables like [bobhitpoints], [tomweapon], [tomweapondamage] and their associated values of 10, [“crossbow”] and 5.
In the first 6 lines of code (lines 10 – 60), the Z80 CPU creates that data, which is then stored in the Spectrum’s memory. Just imagine it as you see it, for example on line 40 the CPU creates a variable called [tomhitpoints] and gives it the value of 6. That data is stored in the memory all the way over (as pointed by the arrow).
Line 70 is where the actual fight starts. We have a print command which we can break down from left to right. First of all, it prints out
"Tom attacks Bob with a "
Then we get to this [+ tomweapon] part.
What happens now in the program is the CPU needs to find out what this variable [tomweapon] actually means. So it takes a long trip over to the Spectrum’s memory and finds that [tomweapon] has a value of “crossbow”.
So the [“Tom attacks Bob with a ” + tomweapon] will produce:
"Tom attacks Bob with a crossbow"
Then the word [” for “] – with spaces at each side – is added (concatenated, is the proper terminology), so we now have this:
"Tom attacks Bob with a crossbow for "
As before with [tomweapon], at this point the CPU now needs to know what this variable [tomweapondamage] stands for, so it takes the long trip over to the Spectrums memory and finds [tomweapondamage] and the value of .
We’re almost finished now with:
"Tom attacks Bob with a crossbow for 5"
and simply concatenating the word [” damage!”] at the end for the final result of:
"Tom attacks Bob with a crossbow for 5 damage!"
Line 80 (the next line of code), has further trips out to memory in order to find what values are stored in the [bobhitpoints] and [tomweapondamage] variables so it can perform a math operation (a subtraction): [“bobhitpoints – tomweapondamage”], which means [10 – 5] which equals 5.
And that last one, [tomweapondamage], is where it gets interesting, because in the previous line of code we already made the trip to the Spectrum’s memory to find out what value was stored in [tomweapondamage] (the value of 5), however in the very next line of code we need to go all the way back over to the memory to find it again, even though we just used it.
This is clearly an extremely simple example of a “game”, but believe it or not, this isn’t that far off how games were programmed back in the day, in Basic at least. And in most cases this was fine. CPUs were dog-slow, and with memory which was also dog-slow, it was a good match. But CPU speeds started to improve quickly while memory speeds improved much more slowly.
Pretty soon it didn’t matter how quickly the CPU was able to perform arithmetic or logic operations, because it was bottlenecked by the speed at which it could get hold of the data it needed to perform those operations.
The time taken for those trips out to the computer’s memory – the latency – was a real problem. What’s the point in having faster CPUs when they are totally bottlenecked by the speed of accessing the memory?
Of course there was a simple solution. If those trips out to memory are taking too long, then we need to shorten the trip. We need to bring some memory closer to the CPU, or even better – on to the CPU.
It was 1989 when Intel first put memory on the CPU with the 80486, which was also the first 1 million+ transistor x86 chip for that reason. Here we can see a die shot with the memory cache, which looks like 4 blocks of 2KB of cache, for a total of 8KB of cache memory.
But it’s still only 8KB. Is that really enough memory for the 90’s when we’d got used to 64MB RAM and even 128KB RAM in the 80’s? Well the thing about this new memory cache on the CPU was that even though it was a lot smaller, it was much, much faster than system RAM. So the trick was to ensure only the most frequently used data was stored on the cache.
If we go back to my gaming example at the last two lines:
70 print "Tom attacks Bob with a " + tomweapon + " for " + tomweapondamage + " damage!"
80 print "Bob has " + (bobhitpoints - tomweapondamage) + " hit points left!"
Remember that on line 70 we had to take a trip out to memory to find the value stored in [tomweapondamage], only to have to do it again on line 80.
With a CPU cache, the first time any data is accessed from the main memory a copy of it is stored in the cache. So all these variables
will first be read from system memory and at the same time, copied to the CPU cache.
Now the next time the CPU needs the information in [tomweapondamage] (which in our case is the very next line of code), it looks in the CPU cache first, finds it and completes the line of code in a fraction of the time it would have taken had it needed to go all the way out to system memory again.
And every subsequent time the computer needed that information, it was usually right nearby in the cache. Computer software is actually quite predictable – even games. If your program is doing something or using certain pieces of data, like [tomweapondamage], the chances are it’ll use that data over and over.
So the benefits of having fast memory on the CPU was clear, and cache sizes began to increase as more and more of the CPU area was given over to it. But increasing the size of the L1 cache also increased the average latency – as it had to go through more and more memory to find the desired data – so it soon became a tradeoff between size and speed.
The answer to this growing problem was to add a second level of cache – L2 cache. The first level, the L1 Cache, would remain relatively small, which is why it’s been around the same 32KB size for years, although it’s now split into two different caches of around that size – one cache for instructions and one cache for data. The instruction cache holds the most frequently used programming instructions while the data cache holds the most frequently used data.
The L2 cache was allowed to grow a bit larger, which again meant it was slower than L1 – however still much, much faster than going out to the main memory. L2 caches hold both instructions and data and are often inclusive of L1 – that is all the blocks held in L1 are also held in L2.
Before this one gets overly complicated too quickly, perhaps the best thing to do right now is explain how multi-level caches work and for that, we’ll continue to use my programming example.
CPU caches are actually divided into blocks where each instruction or piece of data is stored. When the CPU first attempts to use the [tomweapondamage] data a “read request” is sent to the CPUs memory controller, which first of all looks in the L1 cache. If [tomweapondamage] wasn’t found in a block of L1, the memory controller then searches L2 instead and if [tomweapondamage] still wasn’t found in L2 then it is fetched from the main memory and copied in both the L1 and L2 caches.
The next time during the program execution, the CPU looks for that [tomweapondamage] data, it’ll find it very quickly in the L1.
Now imagine over the course of the game, the L1 cache starts to fill up with the most frequently used data. Our main character Bob maybe found some armour which reduces all the damage he takes and perhaps he also learned how to cast spells so he now has [bobmanapoints] as well as [bobhitpoints]. This is all very important data which will be frequently used by the program.
But lets say he now finds a potion which can replace lost health [bobhealthpotion = 4].
As before, when the program first tries to use this data, a read request is sent to the memory controller which first of all looks in the L1, then the L2 caches before finding it in main memory. Now it’s been found in main memory, the [bobhealthpotion = 4] data is copied to both the L1 and L2 caches. However, let’s say that’s enough, there are no more free blocks left in the L1 cache, so something else has to be evicted from it L1 cache.
In this case, let’s assume [bobmanapoints = 10] was not used in a long while, as the player hadn’t cast any spells for a few seconds and so that data wasn’t read recently. So [bobmanapoints = 10] is evicted out of the L1 to make room for [bobhealthpotion = 4], however [bobmanapoints = 10] remains in the L2 rather than being evicted from that too. The reasoning behind this is pretty simple – if it was in L1 previously then chances are it will be needed in L1 again soon, even if it’s not needed right now.
And that reasoning makes perfect sense because the next time Bob casts a spell, which presumably won’t be that far off, the program checks to make sure he has enough [bobmanapoints] first of all:
if bobmanapoints >= 5 then "cast spell"
and instead of having to go all the way over to main memory to find that data, it is of course much closer, in the L2. And whenever the data is read from the L2, a copy is placed in L1, which will mean that something else will have to be evicted from the L1 cache by the exact same mechanism described above.
If you’re wondering – yes data will also finally be evicted from L2, at which point it’ll also be removed from the L1 as well. The reasoning there is, if data has been in the L2 long enough without being used, chances are it’s just wasting space in the L1 cache too, right?
That simple example there is basically what “inclusion” means regarding caches and is a passable explanation of how caches work in general. It’s all about the most frequently used data being as short a trip away as possible.
Many CPUs today of course have not only L1 and L2, but also L3 cache. Just as L2 is larger and slower than L1, the L3 is larger and slower than L2, but still quite a bit faster than system memory. The first “consumer” CPU to include L3 was the Pentium 4 Extreme Edition which had a mammoth 2MB of L3 cache back in 2003. It was actually a re-purposed server CPU as Intel had to go to unprecedented lengths in order to remain somewhat competitive.
It was only a couple of years later when we got the first dual-core CPUs, and L3 cache became even more important for the reason that while each CPU core kept it’s own L1 and L2 while the large L3 was shared between the cores. Soon we had quad cores and L3 caches became larger and larger over time, as chip architects tried to push the trip to system memory further and further away.
Next, I’ll take a look at the most recent big CPU architecture change, which was obviously Zen.
Zen arrived in 2017 with 8 cores divided over two core complexes (CCX’s) of 4 cores each. Here’s one complex on the left and another on the right.
Looking at an individual CCX, we see 4 cores, with a large shared 8MB L3 cache in the middle with two cores either side. Remember this is one core complex and Zen has two of these per chip so in total the chip would have 16MB of L3 cache.
Each core has it’s own L1 Data and Instruction caches – remember these are very small at only 32KB and 64KB respectively, right inside the core. Just to the right of the core in this example is 512KB of L2 cache, again as part of the core.
The L2 cache in Zen is inclusive of L1 and we know that means any data in the L1 is also held in L2 – instructions and data. Whenever data is read from memory a copy of it is placed in both the L1 and L2.
The L3 cache however is a victim cache. I used the word evicted earlier when discussing what happens when data is removed from the L1 and L2 caches.
A victim cache like Zen’s is exclusive. When data is read from memory, a copy is placed in L1 and L2 as normal, but not placed in the L3. In fact the only way that data can enter Zen’s L3 cache is if it is “evicted” from L2. So obviously L1 needs to be filled with data before L2 is filled with data at which point the older data – data that hasn’t been accessed “recently” (and when I say “recently” here we could be talking in terms of under a second or a few seconds) is evicted from the L2 into the L3. The logic follows a similar pattern to before, that is, if data was used previously then chances are it’ll be used again at some point soon and it’s better to be held in the L3 cache instead of just being held in the main system memory.
What’s interesting is Zen’s core complex setup means that even though the CPU has 16MB of L3 cache, in most cases we see it’s really only got 8MB available at typical L3 cache speeds.
I’ll try to explain this as best as I can using a very interesting post at reddit by BFBooger, which actually gave me the idea which morphed into this video.
Near the beginning of the post BFBooger says:
A single Zen chiplet has 2 CCX’s, each with its own L3 cache. As a victim cache, the only way that data gets into CCX0’s L3 cache is if it was first in one of the L1/L2 caches on one of the cores there. Likewise for CCX1.
So that’s what I just explained. Zen’s L3 can only be filled by evicted data from its L2 on one of it’s 4 cores.
Let’s take a single-threaded program which needs a piece of information; hell, let’s use my program as an example again as that is definitely single-threaded. The first time it needs that information it grabs it from main memory and copies it into the L1 and L2. Over time the caches fill up and data spills into the L3. But what happens after the L3 is filled? Does evicted data then go over to the L3 in the second CCX? No it does not, because we know that the only way the L3 cache in the second core complex can be filled is by evicted data from the L2 of one of the cores there.
So the only way to get data into the second core complex’s L3, is by using the cores on the second core complex. Let’s say you do that – you add a second thread to your program which uses the 5th core (labeled as core 4 as the first CCX has cores 0 to 3). To start with, all the cache on the second CCX is empty, so when that 5th core needs to read data first of all, it won’t find anything in it’s own L1, L2 or L3. So then it takes a trip over to the other core complex to have a look and luckily, it then finds the data it needs in the first CCX’s L3. As we know, that data is then copied into the L1 and L2 of that 5th core.
But what happens if it doesn’t find anything in the other core complex? Well then it basically has to go to main memory. This is exactly the same thing that happens in low-threaded programs. If your program only ever has a single thread it’ll fill up the L3 in one core complex then go to main memory after that. The L3 on the other complex will remain unused, along with the whole rest of the CCX in fact.
With that in mind is it possible that what many people are analysing as Zen’s “cross CCX penalty” is in fact, the result of Zen going to main memory? And is there a way to test this? Well, I sure tried.
My first task was to find a cache benchmarking program, and for that I opted to use the program “Userbenchmark”, the reason being that this was where we saw the first leak of Zen 2’s cache performance, in a leaked 12C sample.
My first idea was to test different memory speeds. My system has 16MB of DDR-3400 so I ran 6 tests at 3400 Megatransfers/second then another 6 tests at 2133 Megatransfers per second.
I’ll take a closer look at 3400 mega-transfers result in order to explain what is going on and to reinforce what we’ve already learned.
Scrolling down to near the bottom of the benchmark, which uploads automatically after running, and we get to the System Memory Latency Ladder.L1/L2/L3 CPU cache and main memory access latencies in nano seconds. Nanoseconds on the y axis, and memory sizes on the x axis.
So I believe how the benchmark works is simply by filling the memory system with data up to certain amounts, then randomly accessing some of that data multiple times. So for example in the first benchmark of 8KB, all that data will easily fit into the L1 cache, so all subsequent random accesses will find the data in the L1. Looking at the other end of the scale and 128MB of memory to be accessed, it should be clear that on a CPU which only has 16MB of L3, the vast majority of random accesses will occur in the main system memory.
Getting back to the left of the scale again and we know Zen has 32K of L1 data cache, and if we look over the first 3 values of 8KB, 16KB and 32KB we see some very small latencies of just over 1 nanosecond. One nanosecond is one billionth of a second.
When we get to 64KB though, it’s doubled to just over 2 nanoseconds. With this information we can be pretty sure that this latency program measured 32KB of L1, right? The reason why it doubled at 64KB is that at 64KB, it’s accessing L1 half the time and L2 the other half. So we might deduce that the L1 cache access time of the 2700X is around 1 nanosecond and the L2 cache access time must be around 3 nanoseconds? If accessing half of each gives an average result near 2 nanoseconds.
And this appears to be correct as by looking at 128KB and then 256KB, we see a rise up to 2.75ns. At 256KB, around 1/8 of the reads will be L1 with the other 7/8 being slower L2 reads. We know the limit of L2 is 512KB per core and we see that between 512KB and 1MB, the latencies nearly double again. Once again, this is because we’ve now filled the L2 at the 512KB mark and so at 1MB data size around half the reads are now taking place in the L3 cache.
Onward from 1MB the latencies continue to rise fairly slowly all the way up to 8MB, which is the limit of the L3 victim cache on one core complex, before a large increase at 16MB. What caused that? Well after the 8MB L3 victim cache was filled, the latency measuring program then went to main memory.
I know this because after testing the whole 2700X CPU, which is 4 cores on each CCX, I then disabled the cores on the second CCX to test 4 + 0 configuration (cores per CCX).
When looking at the 4+0 test, which remember means only one CCX active, the results were very similar to 4+4. Flipping between each and we barely notice any difference. The reason 16MB is “halfway” is simply that half of the reads are in the cache where the other half are in the main memory.
At 32MB we’re reading an awful lot from main memory and that continues on, which is why we get a near flat line afterwards. By this point at the end (128 MB), the benchmark is basically a benchmark of system memory – not the cache – because the vast majority of reads are accessing the main memory.
Looking back at the results in the table above, we see that indeed, there wasn’t much of a difference at all between having one or two CCX-es active.
I went on to test multiple configurations including 2+0, 1+0 and 2+2. They all gave similar results except – curiously – 2+2 which consistently performed better than anything else, including 4+4.
For some reason on Zen, having 2 cores on each complex appears to be better than 4 cores on each, or 4 cores on one complex and none on the other – at least in terms of latency. Trying to figure that one out spawned another day of testing – I asked some guys on my Patreon discord to help out in confirming my results and they did. 2+2 is simply better than 4+4, at least in this userbenchmark program.
Let me just say thanks to everybody who helped out with this including with some batch file writing in windows, testing different cache latency programs and just general explanations of what could be going on.
Right I’m done with this one and it’s been heavy on the analysis again, which is just starting to wear on me a little bit, so there will be a change of focus for the next one and with a bit of luck it won’t be “Seven” days before you see it.