NVIDIA today published a set of MLPerf benchmark results suggesting the forthcoming Blackwell series of graphics processing units (GPUs) will outperform the previous Hopper class of GPUs by a factor of four when running large language models (LLMs).
The report specifically noted that on Llama 2 70B, the largest LLM workload included in the benchmark, that higher level of performance is driven primarily by the second-generation Transformer Engine and FP4 Tensor Cores NVIDIA is promising to deliver in 2025.
Additionally, NVIDIA is reporting that the latest set of MLPerf benchmarks also showed gains across its Hopper GPUs, Jetson platform and Triton Inference Server portfolio. For example, the NVIDIA H200 delivered up to 27% more generative AI inference performance than in previous benchmark tests. The Triton Inference Server, meanwhile, delivered near-equal performance to NVIDIA’s bare-metal servers, while the NVIDIA Jetson AGX Orin system-on-modules achieved more than a 6.2x throughput improvement and a 2.5x latency improvement on the GPT-J LLM workload compared with the previous benchmark round.
The performance of AI platforms is becoming a more critical issue as more AI inference engines are deployed in production environments, says Dave Salvator, director of accelerated computing products for NVIDIA. “As we’ve seen models grow in size over time and the fact that most generative AI applications are expected to run in real time, the requirement for inference has gone up dramatically over the last several years,” notes Salvator.
In the not-too-distant future, organizations will also be able to deploy inference engines across multiple server nodes integrated using NVIDIA NVSwitch, he added. IT teams will be able to employ two NVSwitches to achieve 14.4 TB per second of total bandwidth, noted Salvator.
Each IT organization will need to rightsize AI inference engines for different classes of LLMs. The Blackwell series, for example, is more likely to appeal to providers of AI services being consumed at scale, while enterprise IT organizations will typically run smaller models on previous generations of GPUs or platforms based on alternative processors specifically optimized to run AI workloads.
For example, application-specific integrated circuits (ASICs), also known as xPUs, will be able to run some AI workloads more efficiently than a GPU, notes Daniel Newman, CEO of The Futurum Group. “Not every model runs best on a GPU,” says Newman.
Those xPUs accounted for a 3% share of the AI workload market in 2023 but are forecast to grow at a 31% compound annual growth rate (CAGR) over the next five years, reaching $3.7 billion in 2028, according to an AI chipset market report published by The Futurum Group.
In contrast, GPUs today account for 74% of chipsets used in AI applications within data centers and are forecast to grow at a 30% CAGR over the next five years, reaching $102 billion by 2028.
Traditional CPUs, meanwhile, held a 20% market share in 2023 and are expected to grow at a 28% CAGR to reach $26 billion in 2028.
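For readers who want to sanity-check those projections, the sketch below shows the standard compound-growth arithmetic behind them. The 2023 dollar bases it prints are implied values derived by discounting the reported 2028 targets by the stated CAGRs; they are illustrative and do not come from the Futurum report itself.

def implied_2023_base(value_2028: float, cagr: float, years: int = 5) -> float:
    """Discount a 2028 market value back to 2023 using compound annual growth."""
    return value_2028 / (1 + cagr) ** years

# segment: (reported 2028 value in $B, reported CAGR)
segments = {
    "xPU / ASIC": (3.7, 0.31),
    "GPU": (102.0, 0.30),
    "CPU": (26.0, 0.28),
}

for name, (value_2028, cagr) in segments.items():
    base = implied_2023_base(value_2028, cagr)
    print(f"{name}: ~${base:.1f}B implied in 2023 -> ${value_2028:.1f}B in 2028 at {cagr:.0%} CAGR")

Run as written, this yields roughly $1.0 billion for xPUs, $27.5 billion for GPUs and $7.6 billion for CPUs as the implied 2023 starting points.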
The one thing that is certain is that IT teams will be spending a lot of time in the months ahead analyzing which AI models run best on specific classes of processors. Benchmarks provide some indication of how certain AI models will perform but, as every IT team knows, mileage will vary depending on the unique attributes of the AI model deployed.