MLPerf and the rise of latency-aware LLM benchmarking

Any discussion of modern AI system performance must include MLCommons and its MLPerf benchmark suite, which has become the industry’s de facto standard for measuring machine learning performance. Since its debut in 2018, MLPerf has provided a neutral, peer-reviewed framework for comparing hardware and software platforms across a broad range of AI workloads.

The original MLPerf benchmarks reflected the dominant AI workloads of the late 2010s. Early inference tests focused on tasks such as image classification with ResNet-50, natural language processing with Bidirectional Encoder Representations from Transformers (BERT), object detection with RetinaNet, and recommendation with the Deep Learning Recommendation Model (DLRM).

These workloads were important and representative at the time, but they shared one characteristic: they were highly parallel and relatively easy to map onto GPU architectures.

For several years, benchmark results reinforced a simple narrative. Each new generation of accelerators delivered higher throughput, lower latency, and better energy efficiency. Because the workloads aligned well with GPU strengths, the benchmark curves rose steadily and predictably.

The generative AI shockwave: Rewriting the rules of MLPerf

Autoregressive LLMs introduced a fundamentally different inference pattern. Prompt processing remained highly parallel, but token generation became sequential and memory bound. Suddenly, raw TeraFLOPS no longer told the whole story.
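A back-of-the-envelope calculation suggests why. The sketch below assumes roughly two floating-point operations per parameter per token and that every weight is read from memory once per forward pass; KV-cache and activation traffic are ignored. The accelerator figures are hypothetical round numbers chosen for illustration, not the specifications of any real device.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weight traffic in one forward pass.

    Simplifying assumptions: ~2 FLOPs per parameter per token, and each
    weight is read from memory exactly once per pass (FP16 by default).
    The parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_memory_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """True when intensity falls below the accelerator's roofline ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs per byte
    return intensity < ridge_point


# Hypothetical accelerator: 1,000 TFLOPS peak compute, 4 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 4.0e12

prefill_intensity = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt at once
decode_intensity = arithmetic_intensity(tokens_per_pass=1)      # one token, one user

print("prefill memory bound:", is_memory_bound(prefill_intensity, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode memory bound: ", is_memory_bound(decode_intensity, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, a 2,048-token prompt performs thousands of operations per byte fetched, while single-stream decoding performs about one, which leaves the compute units waiting on memory.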

MLPerf began incorporating this new reality in stages. Inference v4.0 introduced the first LLM benchmark based on Meta’s Llama 2 70B. This benchmark measured token throughput and provided the industry with its first standardized method for comparing LLM inference systems.

MLPerf Inference v5.0, released in 2025, significantly expanded the generative AI focus. It added Llama 3.1 405B Instruct, a 405-billion-parameter model with a 128,000-token context window. The benchmark also introduced an interactive variant of Llama 2 70B that imposed strict limits on Time to First Token (TTFT) and Time Per Output Token (TPOT), two metrics that directly capture user experience in conversational applications.
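To make the two metrics concrete, the following minimal sketch derives them from token arrival timestamps. It uses one common definition of TPOT (time after the first token divided by the remaining token count); the function names and the limit values are illustrative placeholders, not the official MLPerf harness or its thresholds.

```python
from typing import List, Tuple


def ttft_and_tpot(request_ms: float, token_ms: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in milliseconds for one request.

    request_ms: time the request was issued.
    token_ms:   arrival time of each output token, in order.
    """
    if not token_ms:
        raise ValueError("at least one output token is required")
    ttft = token_ms[0] - request_ms
    if len(token_ms) == 1:
        return ttft, 0.0  # TPOT is undefined for a single token; report 0
    # Average gap between consecutive tokens after the first one.
    tpot = (token_ms[-1] - token_ms[0]) / (len(token_ms) - 1)
    return ttft, tpot


def meets_limits(ttft: float, tpot: float,
                 ttft_limit_ms: float = 500.0, tpot_limit_ms: float = 50.0) -> bool:
    """Check a request against placeholder (hypothetical) latency limits."""
    return ttft <= ttft_limit_ms and tpot <= tpot_limit_ms


# Example: request issued at t=0 ms, four tokens arriving 30 ms apart.
ttft, tpot = ttft_and_tpot(0.0, [400.0, 430.0, 460.0, 490.0])
print(ttft, tpot, meets_limits(ttft, tpot))
```

TTFT is dominated by prefill and queueing delay, while TPOT reflects the pace of the decode loop, which is why the two limits constrain different parts of the serving stack.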

These additions were pivotal because they exposed the core weakness of GPU-based inference systems. When unconstrained by latency, GPUs could buffer requests, create large batches, and deliver excellent throughput. Under interactive latency limits, batching opportunities shrank, hardware utilization dropped, and throughput fell sharply.
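The effect can be illustrated with a toy model. The sketch below assumes each decode step costs a fixed overhead plus a small increment per batched sequence, so that step time is also the per-token latency every user sees. The linear model and all of its constants are assumptions made for illustration, not measurements of any system.

```python
def step_time_ms(batch: int, fixed_ms: float = 20.0, per_seq_ms: float = 0.5) -> float:
    """Hypothetical decode step time: fixed weight-streaming cost plus per-sequence cost."""
    return fixed_ms + per_seq_ms * batch


def throughput_tok_s(batch: int) -> float:
    """Aggregate tokens per second when every sequence emits one token per step."""
    return batch * 1000.0 / step_time_ms(batch)


def max_batch_under_tpot(tpot_limit_ms: float, hw_max_batch: int) -> int:
    """Largest batch whose step time still satisfies the per-token latency limit."""
    best = 0
    for batch in range(1, hw_max_batch + 1):
        if step_time_ms(batch) <= tpot_limit_ms:
            best = batch
    return best


HW_MAX_BATCH = 256       # hypothetical memory-capacity ceiling
TPOT_LIMIT_MS = 40.0     # hypothetical interactive limit

unconstrained = throughput_tok_s(HW_MAX_BATCH)
interactive_batch = max_batch_under_tpot(TPOT_LIMIT_MS, HW_MAX_BATCH)
constrained = throughput_tok_s(interactive_batch)

print(f"offline: batch={HW_MAX_BATCH}, {unconstrained:.0f} tok/s")
print(f"interactive: batch={interactive_batch}, {constrained:.0f} tok/s")
```

In this toy model, the latency limit caps the batch well below what memory would allow, and aggregate throughput falls accordingly even though the hardware is unchanged.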

In other words, MLPerf began measuring not just how fast a system could run under ideal conditions, but also how responsive it remained under realistic conditions.

Inference disaggregation: Optimization of resources

This evolution reached another milestone in MLPerf Inference v5.1 and the emerging v6.x era. The benchmark suite broadened its focus to include increasingly sophisticated workloads, including reasoning models such as DeepSeek-R1 and more demanding long-context applications. At the same time, submissions began showcasing system-level optimizations such as inference disaggregation, where prompt processing and decoding are assigned to different accelerator pools.

Disaggregation has become one of the most consequential developments in modern inference benchmarking.

Historically, MLPerf treated each benchmark run as a single system under test, leaving vendors free to optimize their hardware and software stacks as they saw fit. As long as submissions complied with accuracy and latency requirements, any architectural technique was fair game.

This openness allowed participants to introduce increasingly sophisticated serving strategies. One of the most effective has been the separation of prefill and generation across distinct groups of accelerators. The prefill cluster handles the compute-intensive prompt processing stage, while the generation cluster focuses exclusively on token decoding.

In controlled benchmark scenarios, where prompt lengths and output lengths are known in advance, disaggregation can produce dramatic gains. By eliminating interference between the two phases, systems reduce preemption and improve latency-sensitive throughput.
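When prompt and output lengths are known, sizing the two pools becomes a simple balancing exercise. The sketch below searches for the split that maximizes sustained requests per second; the per-accelerator rates and the workload shape are hypothetical, and the model ignores KV-cache transfer between pools.

```python
from typing import Tuple


def best_split(total: int, prompt_len: int, output_len: int,
               prefill_tok_s: float, decode_tok_s: float) -> Tuple[int, float]:
    """Return (accelerators assigned to prefill, sustained requests/s).

    Each pool's request rate is its token rate divided by the tokens it
    must process per request; the slower pool limits the whole system.
    """
    best_n, best_rps = 0, 0.0
    for n_prefill in range(1, total):
        n_decode = total - n_prefill
        prefill_rps = n_prefill * prefill_tok_s / prompt_len
        decode_rps = n_decode * decode_tok_s / output_len
        rps = min(prefill_rps, decode_rps)
        if rps > best_rps:  # strict '>' keeps the smallest prefill pool on ties
            best_n, best_rps = n_prefill, rps
    return best_n, best_rps


# Hypothetical workload: 4,000-token prompts, 500-token outputs, 8 accelerators.
n_prefill, rps = best_split(total=8, prompt_len=4000, output_len=500,
                            prefill_tok_s=20_000.0, decode_tok_s=1_000.0)
print(f"prefill pool: {n_prefill}, decode pool: {8 - n_prefill}, {rps} req/s")
```

The search is trivial precisely because the workload is fixed; the interesting question is what happens when it is not.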

Yet this raises an important question. Does the benchmark still measure accelerator capability, or is it increasingly measuring system orchestration? The answer is both.

Modern AI performance depends on the interaction between processor, memory hierarchy, interconnect fabric, runtime software, and serving algorithms. MLPerf has evolved accordingly. It now rewards system-level innovation rather than isolated chip performance.

That shift is entirely appropriate, but it also means benchmark results must be interpreted carefully.

A disaggregated configuration optimized for long document summarization may perform brilliantly in MLPerf while delivering more modest benefits in production environments where workloads vary continuously. Real-world deployments must cope with unpredictable prompt lengths, bursty traffic, and rapidly changing ratios of prefill to generation demand.
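A small numerical sketch illustrates the risk. It takes a fixed split of two prefill and six decode accelerators, tuned for long prompts and short outputs, and evaluates it against a chat-like workload with short prompts and longer outputs. All rates and lengths are hypothetical.

```python
from typing import Tuple


def sustained_rps(n_prefill: int, n_decode: int, prompt_len: int, output_len: int,
                  prefill_tok_s: float = 20_000.0,
                  decode_tok_s: float = 1_000.0) -> Tuple[float, float, float]:
    """Return (system req/s, prefill pool req/s, decode pool req/s)."""
    prefill_rps = n_prefill * prefill_tok_s / prompt_len
    decode_rps = n_decode * decode_tok_s / output_len
    return min(prefill_rps, decode_rps), prefill_rps, decode_rps


# Workload the split was tuned for: 4,000-token prompts, 500-token outputs.
tuned = sustained_rps(2, 6, prompt_len=4000, output_len=500)

# Same split under a chat-like workload: 500-token prompts, 1,000-token outputs.
shifted = sustained_rps(2, 6, prompt_len=500, output_len=1000)

# Fraction of the prefill pool's capacity actually used after the shift.
prefill_utilization = shifted[0] / shifted[1]

print("tuned workload:  ", tuned[0], "req/s")
print("shifted workload:", shifted[0], "req/s")
print(f"prefill pool utilization after shift: {prefill_utilization:.1%}")
```

Under these assumptions, the decode pool becomes the bottleneck and most of the prefill capacity sits idle, which is the kind of mismatch a static benchmark configuration never has to face.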

Consequently, MLPerf increasingly measures a system’s ability to align resources with a known workload profile. This is a valuable metric, but it’s not synonymous with universal real-world performance.

Illustrative comparison: MLPerf 5.x versus MLPerf 6.x

The table below illustrates how benchmark methodology evolved as MLPerf shifted from throughput-oriented LLM tests to more latency-sensitive and system-aware workloads. The numbers are representative rather than exact, but they reflect the broad trends seen in published results and vendor disclosures.

Publicly discussed MLPerf inference results for the Llama 3.1 405B LLM, run on a leading-edge GPU-based processor in three scenarios (offline, server, and interactive), highlight MLPerf’s evolution. Source: Author

From chip benchmark to system benchmark

The history of MLPerf mirrors the evolution of AI itself.

The early benchmark suites focused on relatively static workloads that aligned naturally with the strengths of GPU architectures. Tasks such as image recognition, recommendation systems, and conventional deep learning inference relied heavily on dense matrix operations and large-scale parallelism, allowing GPUs to demonstrate exceptional throughput and scalability. In that era, benchmark leadership was closely associated with raw compute capability, memory bandwidth, and increasingly larger accelerator configurations.

The rise of generative AI fundamentally changed that equation.

As autoregressive LLMs became the dominant workload, MLPerf evolved accordingly, introducing larger models, longer context windows, interactive server scenarios, and increasingly strict latency constraints. These additions exposed a critical reality: while GPUs remain extraordinarily efficient during the highly parallel prefill phase, they are far less efficient during token generation, where inference becomes sequential, memory-bound, and heavily dependent on latency-sensitive execution.

This shift transformed the meaning of benchmark performance.

Modern MLPerf results no longer measure the capabilities of an isolated accelerator alone. Instead, they measure the effectiveness of an entire inference architecture.

Disaggregation, scheduling policies, key-value (KV) cache management, streaming pipelines, runtime orchestration, and workload balancing have become just as important as the underlying silicon itself. In many cases, the benchmark winner is no longer the system with the most compute power, but the one that most effectively adapts a fundamentally sequential workload to hardware originally designed for massively parallel graphics and HPC workloads.

As a result, benchmark interpretation has become significantly more nuanced. The headline numbers increasingly reflect how intelligently the system orchestrates resources across racks of accelerators, separates prefill from generation, minimizes preemption, and maintains throughput under realistic latency constraints. MLPerf has evolved from a pure hardware benchmark into a broader measure of system architecture and software orchestration.

At the same time, this evolution reveals something even more profound. The latest MLPerf 6.x requirements implicitly highlight the growing limitations of conventional GPU architectures for real-time LLM inference. The industry has reached a point where increasingly sophisticated scheduling mechanisms and disaggregated serving infrastructures are being used to compensate for a deeper architectural mismatch between autoregressive inference and massively parallel processors.

In many respects, the benchmark itself is beginning to suggest the next major transition in AI infrastructure design.

Rather than continuing to optimize architectures originally developed for graphics rendering and parallel numerical computing, the future may require entirely new inference-centric architectures built specifically for the unique characteristics of LLM generation. Such architectures would need to deliver high utilization and low latency even with very small batch sizes, potentially down to a single user request, while minimizing data movement, reducing memory bottlenecks, and supporting continuous token generation without relying on increasingly complex orchestration layers to hide inefficiencies.

In that sense, MLPerf has become more than a benchmark suite. It is now a window into the architectural tensions shaping the future of AI computing, revealing both the extraordinary adaptability of modern accelerator systems and the growing need for a fundamentally new class of inference hardware designed from the ground up for the realities of autoregressive AI.

Lauro Rizzatti is a business development executive with Vsora, a technology company offering semiconductor solutions that redefine design performance. He is a noted chip design verification consultant and industry expert on hardware emulation.

Editor’s Note

This is Part 2 of the mini-series that examines how LLM inference forced changes to MLPerf benchmarking. In Part 1, contributor Lauro Rizzatti analyzes LLM inference across its two processing phases, prefill and generation, and highlights how this workflow exposes structural inefficiencies in GPU-based accelerators.

The post MLPerf and the rise of latency-aware LLM benchmarking appeared first on EDN.
