What’s in a TFLOP? – Understanding NVIDIA’s Ampere Architecture

In this report, we’ll take a quick look at NVIDIA’s Ampere architecture, which is powering the company’s upcoming GeForce RTX 30 series graphics cards. NVIDIA has made significant changes to Ampere compared to Turing.

During their presentation, NVIDIA touted some rather impressive numbers, but there were a few glaring issues that many may have overlooked, which we’re going to try to explain today.

Densely Packed

NVIDIA’s new RTX 30 series GPUs are manufactured on Samsung’s 8nm process, which is an enhanced version of Samsung’s 10nm process. This is a full generation jump from Turing, which used TSMC’s 12nm process, itself an enhanced version of TSMC’s 16nm process. That means we can expect the flagship GA102 GPU powering the RTX 3080 and RTX 3090 to be significantly smaller than the massive 754 mm² TU102 found in the previous-generation flagship, the RTX 2080 Ti, despite an increase of nearly 10 billion transistors.

1.9x Performance Per Watt? Hold on.

So, Ampere-based GPUs will be quite dense, but as we’ve learned from the presentation, they’ll also be very power hungry, with TDPs reaching up to 350W. Despite this, NVIDIA is touting some rather bold efficiency gains for Ampere. As we see in the chart above, NVIDIA claims a 1.9x performance per watt improvement for Ampere over Turing. However, these claims don’t quite add up: traditionally, power efficiency is measured at fixed power levels rather than fixed performance levels. Measuring at a fixed performance level allows the much more shader-packed Ampere to clock lower, where it draws far less power, and still hit the same performance target as Turing. It’s sort of like comparing gas mileage between two cars, but one of them has a hole in the tank.

While I’m sure Ampere is likely more efficient than Turing, I’m not so sure about the claims here.
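To see why the measurement point matters, here is a minimal sketch with made-up numbers. It assumes a toy model in which performance scales with shader count times clock, and power scales with shader count times the cube of clock (a common rule of thumb when voltage has to rise with frequency). The shader counts, the cubic exponent, and the function names are all illustrative assumptions, not figures from NVIDIA.

```python
# Toy model only: every constant here is an illustrative assumption,
# not measured data for Turing or Ampere.

OLD_UNITS = 1.0  # relative shader count of the older GPU (hypothetical)
NEW_UNITS = 2.0  # relative shader count of the newer GPU (hypothetical)


def performance(units: float, clock: float) -> float:
    """Relative performance: more shaders or more clock means more work done."""
    return units * clock


def power(units: float, clock: float) -> float:
    """Relative power, assuming it grows with the cube of the clock speed."""
    return units * clock ** 3


def gain_at_fixed_performance() -> float:
    """Perf-per-watt gain when the new GPU is slowed to match the old one's speed."""
    target = performance(OLD_UNITS, 1.0)
    new_clock = target / NEW_UNITS  # clock needed to just match the old GPU
    old_eff = target / power(OLD_UNITS, 1.0)
    new_eff = target / power(NEW_UNITS, new_clock)
    return new_eff / old_eff


def gain_at_fixed_power() -> float:
    """Perf-per-watt gain when both GPUs are given the same power budget."""
    budget = power(OLD_UNITS, 1.0)
    new_clock = (budget / NEW_UNITS) ** (1.0 / 3.0)  # clock that fits the budget
    old_eff = performance(OLD_UNITS, 1.0) / budget
    new_eff = performance(NEW_UNITS, new_clock) / budget
    return new_eff / old_eff


if __name__ == "__main__":
    print(f"gain at fixed performance: {gain_at_fixed_performance():.2f}x")
    print(f"gain at fixed power:       {gain_at_fixed_power():.2f}x")
```

In this toy model, the same pair of chips shows a much larger efficiency gain when compared at matched performance than at matched power, which is the reason to be wary of a headline figure that doesn’t say how it was measured.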

Redesigned Shader, RT and Tensor Cores

While NVIDIA has not introduced any new functional cores, they have improved on the existing ones. A single Ampere Tensor core can provide double the throughput compared to Turing, and Ampere’s Tensor cores support sparsity, for even more performance. However, unlike the larger GA100 GPU, gaming Ampere GPUs such as GA102 will feature just 4 Tensor cores per SM instead of 8. With half as many Tensor cores per SM as Turing, each doing twice the work, Ampere is only a slight improvement over Turing in Tensor-related workloads that can’t take advantage of sparsity.
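The per-SM arithmetic behind that conclusion can be sketched in a few lines. Throughput is expressed relative to a single Turing Tensor core, and the 2x sparsity factor is an assumption based on NVIDIA’s headline claim rather than anything measured here; the function name is hypothetical.

```python
def tensor_throughput_per_sm(cores_per_sm: int,
                             per_core_rate: float,
                             sparsity_speedup: float = 1.0) -> float:
    """Relative Tensor throughput of one SM (1.0 = one Turing Tensor core)."""
    return cores_per_sm * per_core_rate * sparsity_speedup


# Turing SM: 8 Tensor cores at the baseline rate.
turing = tensor_throughput_per_sm(8, 1.0)

# GA102 SM: 4 Tensor cores, each with double the throughput.
ampere_dense = tensor_throughput_per_sm(4, 2.0)

# Same SM when the workload can exploit sparsity (assumed 2x speedup).
ampere_sparse = tensor_throughput_per_sm(4, 2.0, sparsity_speedup=2.0)

if __name__ == "__main__":
    print(f"Turing SM:          {turing}")
    print(f"Ampere SM (dense):  {ampere_dense}")
    print(f"Ampere SM (sparse): {ampere_sparse}")
```

Per SM, the dense figure matches Turing, so any dense Tensor gains would have to come from the higher SM count and clocks rather than from the Tensor cores themselves.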

As for RT, NVIDIA is claiming up to 2x faster performance compared to Turing, but has not provided any details as to how this was achieved, aside from the increase in SMs.

Which brings us to the shader cores. This is arguably where NVIDIA has made the most changes to Ampere when compared to Turing, but the company did not go into much detail during their presentation. Thankfully, the company has addressed this during a Q&A on the NVIDIA Subreddit.

“One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.” – Tony Tamasi, Senior Director Of Desktop Product Management, NVIDIA
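The datapath description above can be turned into a rough throughput model. The sketch below is an idealized simplification, not NVIDIA’s actual scheduler: it assumes perfect scheduling, that a Turing SM issues up to 64 FP32 and 64 INT32 operations per clock on separate datapaths, and that an Ampere SM has 64 FP32-only lanes plus 64 lanes that handle either FP32 or INT32. The function names and the example instruction mixes are hypothetical.

```python
def turing_clocks(fp_ops: float, int_ops: float) -> float:
    """Clocks for one Turing SM: 64 FP32 lanes and 64 separate INT32 lanes."""
    return max(fp_ops / 64, int_ops / 64)


def ampere_clocks(fp_ops: float, int_ops: float) -> float:
    """Clocks for one Ampere SM: 64 FP32-only lanes plus 64 shared FP32/INT32 lanes.

    Integer work can only use the 64 shared lanes, while the total work
    is spread over all 128 lanes, so the slower of the two limits wins.
    """
    return max(int_ops / 64, (fp_ops + int_ops) / 128)


def ampere_speedup(fp_ops: float, int_ops: float) -> float:
    """Idealized per-SM, per-clock speedup of Ampere over Turing for a given mix."""
    return turing_clocks(fp_ops, int_ops) / ampere_clocks(fp_ops, int_ops)


if __name__ == "__main__":
    # (FP32 ops, INT32 ops) mixes chosen purely for illustration.
    for fp, integer in [(100, 0), (100, 36), (100, 100)]:
        print(f"{fp} FP32 + {integer} INT32: {ampere_speedup(fp, integer):.2f}x")
```

Under these assumptions, the full 2x only shows up for pure FP32 work, and the advantage shrinks toward 1x as the share of integer instructions grows.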

Not All TFLOPs Are the Same.

So, you see, this is how NVIDIA is getting to those massive 30+ TFLOP numbers for its new GPUs, but it also means that the number of shader TFLOPs an Ampere GPU is capable of does not quite reflect its gaming performance when compared to Turing, or even Pascal, using this measurement. In essence, to utilize all of the floating point performance these GPUs are capable of, a game would have to avoid integer workloads entirely. That doesn’t mean you won’t see a huge improvement in performance compared to Turing; Eurogamer’s early look at the RTX 3080 already showed an average of 80% more performance. However, it does explain why a graphics card with nearly 3x the theoretical compute potential is only just capable of nearly doubling the gaming performance of its predecessor.
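For reference, a peak TFLOP figure is simply shader count times two (a fused multiply-add counts as two floating point operations) times clock speed. The sketch below uses core counts and boost clocks from the publicly listed reference specifications for the RTX 3080 and RTX 2080; those numbers are not given in this article and should be treated as assumptions, as should the function name.

```python
def peak_fp32_tflops(cuda_cores: int, boost_ghz: float) -> float:
    """Theoretical peak FP32 TFLOPs: cores x 2 ops per FMA x clock in GHz."""
    return cuda_cores * 2 * boost_ghz / 1000.0


# Assumed reference specs (publicly listed, not taken from this article).
rtx_3080 = peak_fp32_tflops(cuda_cores=8704, boost_ghz=1.71)
rtx_2080 = peak_fp32_tflops(cuda_cores=2944, boost_ghz=1.71)

paper_ratio = rtx_3080 / rtx_2080
measured_ratio = 1.8  # the roughly 80% average uplift cited above

if __name__ == "__main__":
    print(f"RTX 3080: {rtx_3080:.1f} TFLOPs")
    print(f"RTX 2080: {rtx_2080:.1f} TFLOPs")
    print(f"On paper: {paper_ratio:.2f}x, in games: about {measured_ratio:.1f}x")
```

The gap between the two ratios is the point: the formula counts every FP32-capable lane as if it were always doing floating point work, which real games do not do.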

So, what does this mean for you, the customer? Well, honestly, not much. These new RTX 30 series GPUs are still much faster than their previous-generation counterparts. If you are a 10-series user who wasn’t ready to jump on board the ray tracing train with the RTX 20 series, then the RTX 30 series may be for you, as it seems to finally provide a significant increase in performance over the iconic GTX 1080 Ti, which the RTX 2080 series mostly failed to do, at least below $1,000. However, you should at least come away from this knowing that not all TFLOPs are the same. When comparing an RTX 30 series GPU to an AMD competitor, or even to an RTX 20 or GTX 10 series GPU, you should, of course, always look at actual gaming performance as shown in independent testing rather than the marketing on the box from either vendor.

We’ll likely have more on NVIDIA’s Ampere architecture soon, as the company plans to release more detailed information on it in the coming days. So, stay tuned!

Liked it? Take a second to support Donny Stanley on Patreon!