MFU is the only GPU efficiency metric that matters during training

It is tempting to open nvidia-smi during a training run, see 99% GPU utilization, and conclude that the hardware is fully engaged. It is a satisfying number, and frequently a misleading one. Some of the most well-known LLM training runs in the industry reported 100% GPU utilization while delivering only about 20% of the theoretical peak throughput the silicon could produce. The metric most teams reach for measures the wrong thing.

What nvidia-smi actually shows you

The "GPU utilization" number reported by nvidia-smi reflects the percentage of time, over the sampling period, during which at least one kernel was executing on the GPU. It does not reflect how many streaming multiprocessors (SMs) were active, whether they were doing tensor math or memory shuffling, or whether the work being done contributed to your model's forward and backward passes.

A workload that keeps the GPU busy with memory-bound kernels, moving tensors around without doing meaningful compute, can register 100% utilization while delivering near-zero training throughput. The metric tells you the GPU is not idle, but it does not tell you the GPU is useful.
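The gap between "not idle" and "useful" can be made concrete with a toy model. The sketch below is illustrative only: the sample data is made up, the function names are hypothetical, and the SM count of 132 is an assumption based on the H100 SXM. It compares a utilization-style metric (was anything running?) with the average share of SMs that were actually busy.

```python
# Toy model: each sample records how many SMs were busy in that interval.
# All numbers are invented for illustration.
TOTAL_SMS = 132  # assumption: SM count of an H100 SXM; use your GPU's value


def busy_fraction(active_sms_per_sample):
    """Share of samples in which any work at all was running.

    This is the kind of quantity a utilization-style metric captures.
    """
    busy = sum(1 for n in active_sms_per_sample if n > 0)
    return busy / len(active_sms_per_sample)


def mean_sm_activity(active_sms_per_sample, total_sms=TOTAL_SMS):
    """Average share of SMs that were busy across all samples."""
    return sum(active_sms_per_sample) / (len(active_sms_per_sample) * total_sms)


# A workload that always has something running, but only on 4 SMs.
samples = [4] * 100
print(f"busy fraction:    {busy_fraction(samples):.0%}")
print(f"mean SM activity: {mean_sm_activity(samples):.1%}")
```

The first number is 100% and the second is about 3%, which is the shape of the problem described above: the GPU is never idle, yet almost all of it is unused.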

Model FLOPs Utilization, plainly

MFU is the ratio of the floating-point operations your model completes per second to the floating-point operations the hardware is theoretically capable of at peak. Because it is normalized by the hardware's own peak, it is a hardware-agnostic measure of how well your training stack uses the silicon for the work your model needs done.
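As a rough sketch of the arithmetic, the snippet below estimates MFU for a dense transformer. It relies on two assumptions that are not from this article: the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (which ignores attention-score FLOPs), and a dense BF16 peak of roughly 989 TFLOPS per H100 SXM. The model size, GPU count, and throughput in the example are hypothetical; check the peak figure against the datasheet for your exact part and precision.

```python
def model_flops_per_token(n_params):
    """Approximate training FLOPs per token for a dense transformer.

    Assumption: ~2 FLOPs per parameter forward and ~4 backward, so 6 total.
    Attention-score FLOPs are ignored, which understates long-context runs.
    """
    return 6 * n_params


def mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Model FLOPs per second achieved, divided by aggregate hardware peak."""
    achieved = model_flops_per_token(n_params) * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)


# Assumption: approximate dense BF16 peak for one H100 SXM, in FLOPs/s.
H100_BF16_DENSE_PEAK = 989e12

# Hypothetical run: 7B parameters, 8 GPUs, 75,000 tokens/s across the node.
print(f"MFU: {mfu(7e9, 75_000, 8, H100_BF16_DENSE_PEAK):.1%}")  # MFU: 39.8%
```

Note that the numerator counts only the FLOPs the model requires. Extra work such as activation recomputation does not raise MFU, which is what makes the number comparable across stacks.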

A well-tuned LLM training run on H100s typically lands in the 35 to 45% MFU range. The leaders in the space are achieving 50% or higher. CoreWeave has publicly reported sustained MFU above 50% on Hopper clusters, which translates to roughly 20% more useful compute than a typical run in the 35 to 45% range.

Why MFU is hard to move

Pushing MFU upward requires attention to every layer of the stack. Data loading matters because if the GPU is waiting on the next batch, it does not matter how fast it can compute. Kernel selection matters because cuBLAS and cuDNN paths vary in efficiency for different matrix shapes, and picking the wrong one can leave 20% on the table. Communication overlap matters because in distributed training, the time spent in all-reduce operations is time the GPU is not doing forward or backward passes unless you have overlapped them properly. Mixed precision matters because BF16 and FP8 paths can multiply throughput, but only for the operators in the model that support them.
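A back-of-the-envelope model shows how these stalls compound. The sketch below is a simplification under stated assumptions, not a profiler: `compute_mfu` is the hypothetical MFU the kernels would reach if the GPU never waited, and every second of data wait or non-overlapped communication stretches the step without adding model FLOPs. All names and numbers are illustrative.

```python
def effective_mfu(compute_mfu, compute_s, data_wait_s=0.0,
                  comm_s=0.0, overlap_fraction=0.0):
    """Scale kernel-level MFU by the share of the step spent computing.

    overlap_fraction is the share of communication time hidden behind
    compute (0.0 = fully exposed, 1.0 = fully hidden).
    """
    exposed_comm_s = comm_s * (1.0 - overlap_fraction)
    step_s = compute_s + data_wait_s + exposed_comm_s
    return compute_mfu * compute_s / step_s


# Hypothetical step: 1.0 s of compute, 0.05 s data wait, 0.30 s all-reduce.
no_overlap = effective_mfu(0.55, 1.0, data_wait_s=0.05, comm_s=0.30)
overlapped = effective_mfu(0.55, 1.0, data_wait_s=0.05, comm_s=0.30,
                           overlap_fraction=0.9)
print(f"no overlap:  {no_overlap:.1%}")
print(f"90% overlap: {overlapped:.1%}")
```

With these made-up inputs, hiding 90% of the all-reduce moves the estimate from roughly 41% to roughly 51%, without touching a single kernel.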

The metrics that lead to MFU improvements

A few secondary metrics correlate strongly with MFU and are more actionable in the moment. SM efficiency measures what percentage of SMs are active during the busy intervals. Tensor core utilization measures what percentage of the work is hitting the dedicated matrix units. Memory bandwidth utilization shows whether you are actually using the HBM3 you are paying for. DataLoader wait time shows how often the GPU is idle waiting for the next batch.

A training run with 99% GPU utilization, 30% SM efficiency, and 12% tensor core utilization is doing a lot of memory shuffling and very little real model math. Because the GPU is rarely idle in this profile, the fix is more likely in the kernel selection, the precision configuration, or both than in the data pipeline.
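Of the secondary metrics above, DataLoader wait time is the easiest to approximate from inside the training loop. The following is a minimal sketch using only the standard library: `TimedLoader` is a hypothetical wrapper, not part of any framework, and `slow_batches` stands in for a real data pipeline. It measures how long the host blocks waiting for each batch. On a real GPU job the training step is launched asynchronously, so treat the result as an upper-level hint rather than an exact measure of GPU idle time.

```python
import time


class TimedLoader:
    """Wrap any iterable of batches and accumulate time spent waiting on it."""

    def __init__(self, loader):
        self.loader = loader
        self.wait_s = 0.0

    def __iter__(self):
        it = iter(self.loader)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            # Only time spent blocked inside next() counts as wait.
            self.wait_s += time.perf_counter() - start
            yield batch


def slow_batches(n, delay_s):
    """Stand-in data pipeline that takes delay_s to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


loader = TimedLoader(slow_batches(5, 0.01))
t0 = time.perf_counter()
for batch in loader:
    time.sleep(0.01)  # stand-in for the training step
total = time.perf_counter() - t0
print(f"data wait: {loader.wait_s / total:.0%} of wall-clock time")
```

In this contrived loop the pipeline and the step take about the same time, so close to half of the wall clock is spent waiting. In a healthy run the fraction should be near zero, because prefetching keeps the next batch ready before the step finishes.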

What good infrastructure makes possible

Some MFU constraints are software, but many are hardware, including bandwidth between GPUs, latency across the fabric, NUMA topology, and contention from other tenants. Bare metal removes the multi-tenant variance, InfiniBand or well-tuned RoCEv2 keeps all-reduce times low, and direct PCIe topology lets the framework optimize tensor placement. None of these is sufficient for high MFU on its own, but together they form the foundation without which software optimizations cannot pay off.

What to track during a long training run

Sample MFU every few minutes across the entire training run, not just at the start, and watch for drift. A job that begins at 42% MFU and degrades to 28% over a week is usually losing throughput to growing checkpoint sizes, accumulating memory fragmentation, or thermal throttling. The earlier you catch the drift, the cheaper the fix.
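One simple way to watch for drift is to compare a rolling window of recent samples against a baseline taken early in the run. The sketch below is one possible approach, not a prescribed method: the class name, window sizes, and the 10% relative-drop threshold are all illustrative assumptions to tune for your own job.

```python
from collections import deque
from statistics import mean


class MfuDriftMonitor:
    """Flag when recent MFU falls too far below an early-run baseline."""

    def __init__(self, baseline_samples=12, window=12, max_relative_drop=0.10):
        self.baseline_samples = baseline_samples
        self.baseline = []
        self.recent = deque(maxlen=window)
        self.max_relative_drop = max_relative_drop

    def add(self, mfu):
        """Record one MFU sample; return True if drift exceeds the threshold."""
        # The first samples establish the baseline and never alert.
        if len(self.baseline) < self.baseline_samples:
            self.baseline.append(mfu)
            return False
        self.recent.append(mfu)
        # Wait for a full window so a single noisy sample cannot alert.
        if len(self.recent) < self.recent.maxlen:
            return False
        drop = 1.0 - mean(self.recent) / mean(self.baseline)
        return drop > self.max_relative_drop


# Hypothetical samples: steady at 42%, then degrading to 28%.
monitor = MfuDriftMonitor(baseline_samples=3, window=3)
for sample in [0.42, 0.42, 0.42, 0.42, 0.42, 0.42, 0.28, 0.28, 0.28]:
    if monitor.add(sample):
        print(f"MFU drift detected at sample {sample:.0%}")
```

Averaging over a window trades alert latency for fewer false alarms. With samples every few minutes, a window of a dozen samples reacts within an hour or so, which is early compared with a week-long decline.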

Where Aolani Cloud fits

Aolani Cloud's GPU Cloud and Bare Metal infrastructure is built to raise the MFU you can actually achieve, with dedicated hardware, low-latency fabrics, direct topology visibility, and no hypervisor variance. We cannot optimize your training code for you, but we can make sure your hardware is not the reason you are leaving performance on the table.

Scale AI Infrastructure from Chip to Cluster

Access GPU cloud and bare metal compute designed for teams building the next generation of AI in the region.