Threadripper 3960X Virtualization and Efficiency Testing

A couple of weeks ago, we covered the initial launch of AMD’s HEDT Threadripper 3960X, revealing some massive desktop performance for their premium line of desktop CPUs. However, we only had a few days prior to launch to actually test the platform. Now, we follow up on our coverage with additional testing exploring scalability, virtualization, over- and under-clocking, and power efficiency tuning.

Test System

AMD Threadripper 3960X Test Bench
CPU: AMD Threadripper 3960X
CPU Cooler: Noctua NH-U14S
GPU: Gigabyte AORUS Radeon RX 570 4GB
Motherboard: GIGABYTE TRX40 AORUS Xtreme
Memory: G.Skill Sniper 3600MHz CL19 4x16GB (64GB Total)
Storage: Corsair Force MP600 1TB (PCIe 4.0) NVMe
Case: Lian Li PC-O11DW O11 Dynamic
PSU: FSP Group Aurum PT 1200W

We’re using the same setup as our initial review, with two changes. We swapped the AORUS GTX 1080 Ti for a Gigabyte AORUS Radeon RX 570 4GB, and, since we’re testing overclocking, we replaced the NZXT Kraken X62 water cooler early in our testing with a Noctua NH-U14S that Noctua sent us for a future sTRX4 air-cooling review we’re working on. We’ll talk more about why when we hit the overclocking section.

We’re using a Ubiquiti mFi mPower power strip to monitor wall-outlet power. During each test, we take manual readings over a 30-second period and average them together for our result.

VMware ESXi 6.7u3

VMware is a very common virtualization platform, used by many businesses. It’s the first platform we tested, to see how well it behaves on the new TRX40 platform. The first thing we ran into was that the AORUS TRX40 disables “SVM” (AMD’s hardware virtualization support) by default, so we needed to enable it under Advanced CPU Options in the BIOS. Curiously, it’s the only advanced CPU option that was disabled by default.

VMware’s Pink Screen of Death

Our single-guest VM testing went great, but as soon as we cloned the VM and started stress-testing with two VMs concurrently, we started encountering periodic software crashes in the second VM (never the first, oddly), which finally culminated in a Pink Screen of Death. This was easily remedied by dropping our RAM clock from 3600MHz to 3466MHz, which completely resolved this weird behavior. Since we’re using mainstream, semi-budget, non-ECC RAM, it was no real surprise to us that the RAM gave us a little difficulty. Had we been using more premium RAM that could more easily reach higher clock speeds, or just ECC RAM in general, we’d likely not have had this issue. Stranger still, we ended up reinstalling ESXi at the end of our testing to validate some numbers that stood out in final checking, and even with our RAM clocked back at 3600MHz, we didn’t encounter this issue again during the handful of retests.

First, we’ll look at the encoding speed for VMs at different vCPU counts. The “bare metal” tier gives Handbrake access to all 48 threads of our Threadripper 3960X with Windows 10 installed natively on the system. Next, we create a Windows 10 VM under VMware ESXi, allocate 24 vCPUs to it, and run Handbrake again to give us a comparison point for the next tier. Here you can see just how little overhead VMware has versus running bare metal. Of course, Handbrake has an inflection point where more cores just don’t help our workload all that much, hence the similar finish times (barely a 5% difference) even though the VM has half the available threads. This particular result is why we reinstalled ESXi on our system; we didn’t believe the number ourselves when put into context with the rest of the tests we ran (including Hyper-V). More efficient boosting is the only explanation we could deduce for the near-negligible performance loss.

Next, the “2 VMs” tier shows how fast (on average) a VM processes our transcode while a second VM performs the same transcode. With the CPU fully utilized, we lose some of the boosting opportunities that ESXi’s efficient consolidation had made possible, and system bandwidth is more heavily saturated, all of which amounts to a near 35% performance loss. Next, we halve the vCPU count again to 12 per VM. This increases our transcode time by a little more than 50%, which is not bad considering we halved the vCPU count. For the last tier, we run four of those 12 vCPU VMs concurrently. This, again, adds 50% to our transcode time versus the same VM running solo.

The main benefit of this chart, however, is to provide base single-VM performance metrics for each vCPU tier we’re using for our concurrent tests. Of course, these numbers are more useful when you look at it in a different way:

This chart shows the transcode time each VM averaged during our tests while running alongside the specified number of other VMs. It also normalizes those runs into a “4 Encodes” metric, which is the time it would take for the system to complete four of our H.265 transcode tests. On bare metal, we’d have a quick transcode time of 517 seconds, but we’d have to run four of those back-to-back to achieve four transcodes, resulting in 2068 seconds. If we run that same transcode concurrently in four VMs, we complete the task in 1282 seconds, taking only 62% of the time. Running two VMs concurrently completes the same work in 71% of the time.
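The “4 Encodes” normalization can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and its parameters are our own invention, and the only measured inputs are the 517-second and 1282-second figures quoted above.

```python
def time_for_n_encodes(seconds_per_batch, concurrent_vms, n_encodes=4):
    """Wall-clock seconds to finish n_encodes transcodes when each batch
    runs `concurrent_vms` transcodes side by side (hypothetical helper)."""
    batches = -(-n_encodes // concurrent_vms)  # ceiling division
    return batches * seconds_per_batch

# Bare metal: one transcode at a time, so four back-to-back runs.
bare_metal = time_for_n_encodes(517, concurrent_vms=1)

# Four VMs: all four transcodes finish in a single concurrent batch.
four_vms = time_for_n_encodes(1282, concurrent_vms=4)

print(bare_metal)                        # 2068
print(round(four_vms / bare_metal, 2))   # 0.62
```

The same helper works for the two-VM tier: pass the per-VM time from that run with concurrent_vms=2, and it counts two batches.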

This barrage of testing demonstrates quite well that Handbrake has diminishing returns with more and more cores, but also shows that Handbrake responds quite positively to clockrate as well, making it a good benchmark tool to find systems that have great lightly-threaded performance, but also good multi-thread capability. The improved opportunistic all-core boosting in Threadripper, and Zen2-based CPUs in general, is demonstrated well in our 24 vCPU 2-VM-level testing.

Microsoft Hyper-V on Windows Server 2019 (v1809)

Next up, we installed Microsoft Windows Server 2019 (version 1809) on our system. We were able to install many of the drivers; however, we did end up with a number of unknown devices in Device Manager. All of the PCI Devices that didn’t have drivers installed were PCI\VEN_1022&DEV_148A and PCI\VEN_1022&DEV_1485, and the remaining three unknown devices were:

  • BTH\MS_BTHPAN\9&&0&2 – A Bluetooth device, even though Bluetooth appears to be installed correctly.
  • PCI\VEN_8086&DEV_2723 – This is Intel’s Wi-Fi 6 AX200. No Wi-Fi adapter appeared in our network adapters list, so we evidently never found a working driver for it (we tried a couple).
  • ACPI\VEN_AMDI&DEV_0030 – This is AMD’s GPIO Controller. The driver for this should have been installed during the chipset install, but either wasn’t or wasn’t compatible with Server 2019 (which isn’t surprising since there isn’t a chipset installer for Server 2019 from AMD’s website for Threadripper).
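If you end up with a list of unknown devices like this, the hardware IDs themselves tell you who made the part. The Python sketch below splits the PCI-style IDs into vendor and device fields. The helper name and the two-entry vendor table are our own, for illustration only; 1022 and 8086 are the PCI vendor IDs for AMD and Intel respectively. Non-PCI IDs (the ACPI one above, for example) are simply returned as None.

```python
import re

# PCI vendor IDs (hex) for the two vendors that show up in our list.
PCI_VENDORS = {"1022": "AMD", "8086": "Intel"}

# Matches the PCI\VEN_xxxx&DEV_xxxx prefix of a Windows hardware ID.
HWID_RE = re.compile(r"^PCI\\VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})")

def parse_pci_hwid(hwid):
    """Return (vendor_name, vendor_id, device_id), or None for non-PCI IDs."""
    match = HWID_RE.match(hwid)
    if match is None:
        return None
    vendor_id = match.group(1).upper()
    device_id = match.group(2).upper()
    return PCI_VENDORS.get(vendor_id, "unknown"), vendor_id, device_id

for hwid in (r"PCI\VEN_1022&DEV_148A",
             r"PCI\VEN_8086&DEV_2723",
             r"ACPI\VEN_AMDI&DEV_0030"):
    print(hwid, "->", parse_pci_hwid(hwid))
```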

Otherwise, we had no issues under Server 2019. We were even able to run without issue at 3600MHz RAM speeds. With the piecemeal support under Windows Server 2019, though, I doubt we were getting as good an experience as we did under Windows 10.

We stripped out the H.264 tests, as they don’t offer much in addition to the H.265 encoding. The first notable difference is the single 24 vCPU VM test, which performs about the same as each VM does in the two-simultaneous-VM run. That result, set against VMware’s single 24 vCPU VM landing very close to bare-metal 48-thread performance, is why we reinstalled ESXi just to test this metric again. Sure enough, VMware consistently got the better score, so this is not a fluke. The cause might be the VMware scheduler, or perhaps the lack of a driver to do power and performance monitoring in Windows Server 2019; we’re not entirely sure.

The concurrent runs and encoding times have a nearly-identical curve as compared to VMware, so everything we said there applies here. But how does VMware stack up against Hyper-V directly?

Here you can easily see the difference for the single 24 vCPU VM in VMware vs Hyper-V. You can also see VMware is slightly ahead of Hyper-V across the board, even though we had to reduce the RAM speed to 3466MHz for VMware.

If you look at it from the four-transcodes point of view, the VMware setup has a slight ~3.4% advantage over Hyper-V. That performance bump, however, comes with a small power-usage penalty.

And speaking of power usage…

Power Consumption and Over-/Under-clocking

Here’s our Total System Power chart from our initial Threadripper 3960X review, updated to include power draw during our four concurrent VM Handbrake test for each VM platform. ESXi draws an extra 11 watts (~2.7%) to gain that extra ~3.4% performance, so it’s not a bad trade-off. We’re looking at a 16% power increase versus bare-metal Handbrake, but we’re doing 61% more work. Granted, this is just a single use-case of real-world software that isn’t the greatest at fully utilizing all the processing power available to it, but video transcoding is a very common workload at this performance tier.
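To put those trade-offs on one scale, you can divide the relative work by the relative power to get work-per-watt against a baseline. The sketch below is just that division, using only the rounded percentages quoted in this article; the function name is our own, and because the inputs are rounded, treat the results as approximate.

```python
def relative_efficiency(work_ratio, power_ratio):
    """Work done per watt relative to a baseline of 1.0 (hypothetical helper)."""
    return work_ratio / power_ratio

# Four concurrent VMs vs. bare-metal Handbrake: 61% more work, 16% more power.
vms_vs_bare_metal = relative_efficiency(1.61, 1.16)

# ESXi vs. Hyper-V: ~3.4% more performance for ~2.7% more power.
esxi_vs_hyperv = relative_efficiency(1.034, 1.027)

print(f"{vms_vs_bare_metal:.2f}")  # 1.39
print(f"{esxi_vs_hyperv:.3f}")     # 1.007
```

In other words, splitting the work across four VMs gets roughly 39% more transcoding out of each watt, while ESXi and Hyper-V are effectively tied on efficiency.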

Are we able to tune our Threadripper to get better efficiency out of the platform, or perhaps just more power in general? How much would that cost us in watts? We ran a barrage of overclocks, underclocks, undervolts, and even core-disabling configurations through Handbrake and Cinebench R20 for quantification and stability testing. Let’s dig into the results, starting with Handbrake.

Handbrake has a wildly variable workload, causing the wattage to vary significantly moment to moment, so we took observations over a 30-second period and then averaged all the data points to get the power usage in watts. Since this was measured at the wall, it’s Total System Power. We record how many seconds our 4K to H.265 preset transcode takes and use the average of three runs for our chart.

We can see right off, stock surprisingly gives us the fastest render time. The Auto-Overclock feature of Ryzen Master comes in second when we tweaked the Boost to +200 instead of the default +100, and only consumes about 8 more watts. The Auto-OC with the default +100 Boost came in 4th, using significantly higher power, about 30 watts more than stock. Precision Boost Overdrive came in 5th, taking 14 seconds longer to transcode, but kept the power down to just over stock.

We also performed a series of manual overclocks to find the highest stable overclock at a reasonable voltage target and settled on 4.1GHz all-core at 1.2V. We’re not trying for the highest overclock, but rather a slight undervolt from stock combined with a slight overclock. We also went for a more extreme power efficiency setting by targeting a 3.0GHz clock speed and seeing how low we could drop the voltage at that clock. We hit a stable 0.8V at 3.0GHz and experienced application crashes at 0.75V, so our true minimum lies somewhere in the fairly large range between those two points. We wanted to err on the side of stability with plenty of margin, so we backed off to 0.8V. We also tested disabling cores and CCXs, and while we got some power savings, the performance cost was rather steep.

Our 4.1GHz @1.2V overclock landed third in our chart and slashed our stock power by nearly 63 watts, which is phenomenal considering we only lost four seconds on our transcode time. We also managed to shave another 17 watts by manually setting a 4.0GHz all-core at a low 1.15V, but at a 15-second transcode loss compared to stock.

Now, it’s hard to see which setting is a great Handbrake configuration from those numbers alone, so I calculated the watt-hours each transcode costs, giving us a time-weighted efficiency metric to judge with. This instantly highlights our low-voltage 3.0GHz underclock as the most efficient at only 37.2 watt-hours per transcode, but if time is important, we’d only complete 5.5 transcodes an hour at that clock speed. This leads us to our true underdog, the 4.0GHz all-core at 1.15V, which costs only 39.8 watt-hours per transcode while delivering 7.3 transcodes per hour. We could settle on the 4.1GHz @ 1.2V for 41.2 watt-hours and 7.45 transcodes per hour, but which setting is the better all-around? Let’s pull in our Cinebench tests, since a heavy all-core constant-compute workload may have a different effect on our lineup.
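For anyone wanting to apply the same metric to their own system, here is a minimal Python sketch of the arithmetic. The function names are our own. The last helper works backwards from the watt-hour and transcodes-per-hour figures quoted above to the average wall power they imply; since those figures are rounded, the implied wattages are approximations rather than our measured values.

```python
def watt_hours_per_transcode(avg_watts, seconds):
    """Energy used by one transcode: average watts times hours elapsed."""
    return avg_watts * seconds / 3600

def transcodes_per_hour(seconds):
    """How many back-to-back transcodes fit in one hour."""
    return 3600 / seconds

def implied_avg_watts(wh_per_transcode, per_hour):
    """Wh per transcode times transcodes per hour = Wh per hour = average watts."""
    return wh_per_transcode * per_hour

# (watt-hours per transcode, transcodes per hour) as quoted in the text.
reported = {
    "3.0GHz @ 0.8V": (37.2, 5.5),
    "4.0GHz @ 1.15V": (39.8, 7.3),
    "4.1GHz @ 1.2V": (41.2, 7.45),
}

for name, (wh, per_hour) in reported.items():
    seconds = 3600 / per_hour
    watts = implied_avg_watts(wh, per_hour)
    print(f"{name}: ~{seconds:.0f} s per transcode, ~{watts:.0f} W average")
```

Feeding your own wall-power average and transcode time into watt_hours_per_transcode gives a number directly comparable to the ones in our chart.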

Our Cinebench R20 results confirm the 4.0GHz @1.15V manual overclock is more efficient, getting 38.1 points per watt and drawing 28 watts less overall than the 4.1GHz overclock, while only losing about 300 points (~2.1%). Our 3.0GHz @0.8V is still the best efficiency, but is too slow in my opinion to really be a viable consideration for most needs. The rest of the Auto-Overclock, PBO, and stock settings pull at least 50 watts more than our 4.0GHz overclock, while scoring worse. It’s certainly worth considering a slight overclock and aggressive undervolt if you want to easily save some power (and apparently thermals) on the Threadripper platform.

Our 3.0GHz underclock at 0.8V remained a rather frosty 41°C under our Noctua NH-U14S in Cinebench R20 after repeated back-to-back runs. When we first started pushing the overclocks using PBO and Auto-OC, we were using the NZXT Kraken X62. However, we were seeing 83°C CPU temps with PBO and 93°C with Auto-OC, and knew we needed something potentially better.

Coverage is even worse than it looks, as only the I/O die is directly under water.

Looking at our heatspreader coverage, and considering the small footprint of the actual liquid in the NZXT, we broke out the NH-U14S that Noctua sent us, to see if we could get better cooling and help with our overclocking or at least stability.

The NH-U14S has a massive cold plate, the same size as our Threadripper’s heatspreader, and is so tall, it stands out of our Lian Li test bench case, so we have to leave the side panel off. Needless to say, we surpassed what the NZXT did by being able to Auto-OC with a +200 Boost and only hit 91°C at the fastest the fan cared to spin (1464rpm). Until we have a water loop that’s got a cold plate designed with Threadripper in mind, I don’t think we’ll manage much better cooling than this. Meanwhile, we’ll be doing a review of three of Noctua’s sTRX4-compatible heatsinks soon, so stay tuned!

Conclusion

The Threadripper platform makes a great professional workstation, easily surpassing anything the competition can throw at it. It also has some great tuning performance which can make even Intel’s i9-10980XE look like a power hog, especially when actual performance is taken into account.

The Threadripper did remarkably well in our virtualization testing, being supported out of the box on VMware, and doing well even in Windows Server 2019. At this level of platform, I really wish Windows Server 2019 was fully supported though, as I can see several use-cases where that would be a preference. However, it worked well enough to get a passing grade nonetheless. We’ve certainly shown the Threadripper to be a great transcoding platform that one would want to seriously consider splitting into at least two transcoding queues to get the most out of it.

Our power testing demonstrates that undervolting Threadripper can lead to significant power savings, and even some slight performance improvements at the same time. If you have a Threadripper, I would certainly recommend exploring the limits of your particular CPU to see what improvements it’s capable of obtaining and tuning it to meet your particular needs.

The whole Zen2-based lineup from AMD has been a very exciting set of products, causing major disruption up and down the spectrum, and even redefining storage tiers altogether. We know AMD isn’t finished yet, with the Threadripper 3990X still on the horizon, and the blatantly obvious, unannounced 48-core 3980X. Then we’ll get to see APUs with Zen2 cores next, before we go into a lull until Zen3.
