Typical computer use has been steadily moving away from single-task workloads toward concurrent computing. For a storage server with this level of compute, typical use cases would be tier 1 backup storage, media transcoding and archiving, or serving as a video repository for camera systems or media libraries. Each of these tends to take multiple simultaneous data streams and encode or process them highly concurrently. Tests that show the concurrency effectiveness of a platform can therefore be as important as the raw performance of the processors themselves.
Benchmarks, in order to provide consistent, repeatable results, are run in very sanitized environments that eliminate as many concurrent processes as possible and, consequently, are far from a typical setup. These isolated benchmarks end up only testing the core-scaling ability or clockspeed dependency of the specific software or benchmark being run. This is especially evident in single-thread benchmarks, which hit boost clocks that would rarely, if ever, be seen on a typical computer, let alone a server with background tasks loading additional cores. To test the CPUs’ ability to handle multiple workloads simultaneously, I selected Handbrake to explore some scenarios with. We’ve seen from our Handbrake benchmark already that the Xeons’ 40 threads at 2.2GHz (with a 3.0GHz “max frequency”) come in around 40% behind the Ryzen’s 24 threads at 4.025GHz (the all-core speed my 3900X was running at). This could have been for any number of reasons, including the latency of two sockets vs. two AMD CCDs, on-die core connectivity vs. AMD’s Infinity Fabric, the real-world benefit of Hyper-Threading vs. AMD’s SMT implementation, and whether Handbrake prefers clockspeed over core count. For the 4K H.265 transcode, it works out to 29.3 seconds of work per thread on the 3900X vs. only 26.4 seconds of work per thread on the Xeons. That’s a 10.1% deficit of work for the Xeons when normalized by thread count, in spite of running at only 55% of the Ryzen’s clockspeed, suggesting there’s more room to explore here.
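The per-thread normalization above can be sketched in a few lines of Python. The rounded per-thread figures reproduce the comparison to within rounding (the quoted 10.1% presumably comes from unrounded measurements):

```python
# Per-thread normalization, using the rounded figures quoted in the text.
ryzen_threads, ryzen_clock_ghz = 24, 4.025   # 3900X, all-core clock
xeon_threads, xeon_clock_ghz = 40, 2.2       # dual Xeon, base clock

ryzen_per_thread_s = 29.3   # seconds of work per thread (4K H.265 transcode)
xeon_per_thread_s = 26.4

# How much less work each Xeon thread got through, relative to a Ryzen thread
deficit = 1 - xeon_per_thread_s / ryzen_per_thread_s

# The Xeons' base clock as a fraction of the Ryzen's all-core clock
clock_ratio = xeon_clock_ghz / ryzen_clock_ghz

print(f"per-thread deficit: {deficit:.1%}, clock ratio: {clock_ratio:.0%}")
```

With the rounded inputs, this prints a deficit near 10% at a clock ratio of about 55%, which is the asymmetry the paragraph is pointing at: the slower-clocked Xeon threads give up far less work per thread than the clock gap alone would suggest.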
With two instances of Handbrake running, one set to transcode the 4K source to HQ 1080p30 Surround (H.264) and the other set for 4K to H.265 MKV 1080p30, I started both as close to simultaneously as fast-clicking Start on the H.265 instance and then the H.264 instance allowed. The logs showed a delay of one second or less between starts, but the order may matter for thread-scheduling reasons, and some offset would have existed even if the transcodes were triggered via command line, so at least this way it was controlled which one initialized first. Of course, we ran this multiple times, with three runs proving enough to demonstrate consistent results and produce a good average.
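For anyone wanting to reproduce the back-to-back launch without GUI clicking, here is a minimal sketch using HandBrakeCLI via Python's `subprocess`. The preset names match the ones used above; the input/output filenames are hypothetical, and the HandBrake steps only run if `HandBrakeCLI` is actually on the PATH:

```python
import shutil
import subprocess

def start_jobs(jobs):
    """Launch every command without waiting, then wait for all of them.

    Returns the exit codes in launch order.
    """
    procs = [subprocess.Popen(cmd) for cmd in jobs]  # launched back-to-back
    return [p.wait() for p in procs]                 # block until all finish

# Hypothetical filenames; presets are the ones named in the text.
if shutil.which("HandBrakeCLI"):
    start_jobs([
        ["HandBrakeCLI", "-i", "source4k.mkv", "-o", "out_h265.mkv",
         "--preset", "H.265 MKV 1080p30"],           # started first
        ["HandBrakeCLI", "-i", "source4k.mkv", "-o", "out_h264.mp4",
         "--preset", "HQ 1080p30 Surround"],
    ])
```

Launching with `Popen` rather than `run` is what makes the two transcodes concurrent: both processes are created before either is waited on, so the start offset is just the process-spawn time.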
What these two charts reveal is that running the two transcode tasks concurrently, even though contention for resources makes each task take longer individually, takes less time overall. The Xeons whittle a 1857-second sequential total down to 1410 seconds (a 24% reduction), while the Ryzen reduces a total runtime of 1190 seconds down to 1058 (11%). The Xeons are likely benefiting from greater memory bandwidth, which becomes increasingly important as more cores are utilized. This constraint isn’t apparent in “real world” solitary benchmarks and is usually demonstrated with a synthetic memory-bandwidth benchmark. Now we have some real numbers from real workloads.
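The savings quoted above are just the relative difference between the sequential total and the concurrent wall time; a quick check of the arithmetic:

```python
# Concurrency savings: sequential total runtime vs concurrent wall time.
def reduction(sequential_s, concurrent_s):
    """Fraction of the sequential runtime saved by running concurrently."""
    return (sequential_s - concurrent_s) / sequential_s

xeon_saving = reduction(1857, 1410)    # Xeons: sequential vs concurrent
ryzen_saving = reduction(1190, 1058)   # 3900X: sequential vs concurrent

print(f"Xeons: {xeon_saving:.0%} reduction, Ryzen: {ryzen_saving:.0%} reduction")
```

This reproduces the 24% and 11% figures, and makes the comparison explicit: the Xeons recover more than twice as large a share of their runtime from concurrency as the Ryzen does.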