Broadcom at the VMware Explore 2024 conference today announced that the edition of VMware Cloud Foundation optimized for artificial intelligence applications now supports Intel Gaudi 2 AI accelerators for running inference engines.
In addition, Broadcom has extended the ecosystem for VMware Private AI, originally launched last year, to include alliances with Codeium and Tabnine, two providers of AI coding assistants, as well as IT services providers WWT and HCLTech.
Chris Wolf, global head of AI and Advanced Services for Broadcom, said it’s still early, but Broadcom is already seeing interest in running AI inference engines on platforms other than those built around comparatively expensive graphics processing units (GPUs), which remain challenging to procure.
The tradeoff is that the output generated by a large language model (LLM) may not be as accurate as it is when GPUs are used to run inference engines, he noted.
There is, however, plenty of opportunity to reduce costs by relying more on other classes of processors to enable retrieval-augmented generation (RAG) workflows that expose an organization’s data to an LLM that was most likely trained using GPUs, added Wolf.
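At its core, a RAG workflow retrieves the documents most relevant to a question and prepends them to the prompt as context, and that retrieval step is a similarity search that can run comfortably on general-purpose processors; only the final generation step calls the LLM. The sketch below illustrates the pattern in plain Python. The toy bag-of-words embedding and sample documents are stand-ins for a real embedding model and vector store, not anything specific to VMware Private AI.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a bag-of-words term-frequency vector.
    # A real RAG pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical internal documents the LLM was never trained on.
documents = [
    "Q3 revenue grew 12 percent driven by subscription renewals.",
    "The VPN outage on May 2 was caused by an expired certificate.",
    "Employee travel must be booked through the internal portal.",
]

def build_prompt(question: str, top_k: int = 2) -> str:
    # Retrieve the most similar documents and prepend them as context,
    # grounding the model's answer in the organization's own data.
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What caused the VPN outage?"))
# The assembled prompt would then be sent to the LLM's inference endpoint.
```

Everything up to the final call to the model runs on commodity CPUs, which is where the cost savings Wolf describes come from.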
Given the cost and ongoing shortage of GPUs, maximizing utilization rates for this class of processors has become crucial. One of the major reasons VMware Private AI is gaining traction is that it provides a distributed resource scheduler that makes it simpler for IT teams to maximize GPU utilization rates, said Wolf.
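Broadcom has not detailed the scheduler’s internals here, but the underlying placement problem can be sketched simply: pack workloads onto the fewest devices that can hold them so no GPU sits half idle. The best-fit heuristic below is a generic illustration of that idea, not VMware’s actual distributed resource scheduler; the GPU names, memory sizes and jobs are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    memory_gb: int          # total device memory
    used_gb: int = 0        # memory already allocated
    jobs: list = field(default_factory=list)

    @property
    def free_gb(self) -> int:
        return self.memory_gb - self.used_gb

def place(job: str, needed_gb: int, fleet: list[GPU]) -> GPU | None:
    # Best-fit placement: pick the GPU whose remaining memory is the
    # smallest that still fits the job, packing devices tightly so
    # fewer GPUs sit partially idle.
    candidates = [g for g in fleet if g.free_gb >= needed_gb]
    if not candidates:
        return None  # a real scheduler would queue or preempt here
    target = min(candidates, key=lambda g: g.free_gb)
    target.used_gb += needed_gb
    target.jobs.append(job)
    return target

fleet = [GPU("gpu-0", 80), GPU("gpu-1", 80), GPU("gpu-2", 40)]
for job, gb in [("llama-70b", 70), ("rag-embed", 12), ("bert-serve", 8)]:
    gpu = place(job, gb, fleet)
    print(job, "->", gpu.name if gpu else "queued")
```

A production scheduler layers live migration, preemption and fairness policies on top, but the goal is the same: keep scarce GPUs as close to fully utilized as possible.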
Once VMware Private AI becomes available on the latest version of VMware Cloud Foundation (VCF), Broadcom will also be able to achieve full language model parity with Amazon Web Services (AWS) and Google on a platform that gives IT teams more control over where to run inference engines, he added.
In general, responsibility for managing inference engines in production environments is shifting toward IT operations teams that have the expertise required to manage memory allocation, networking and storage resources at scale, said Wolf. Broadcom is also working toward further simplifying the management of those resources with an intelligent assistant, due out in 2025, that uses generative AI to optimize the allocation of resources, he said.
In the meantime, organizations of all sizes are looking to move beyond proofs of concept. The challenge is that managing the underlying IT infrastructure needed to operationalize AI requires an ability to dynamically rightsize resources across a set of use cases that each have unique characteristics, noted Wolf. Determining which class of GPUs is optimal for each use case requires significant testing, he added.
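That testing typically boils down to measuring throughput per GPU class and converting it into a cost per unit of work, since the cheapest device per hour is not always the cheapest per token. The figures below are invented purely for illustration:

```python
# Hypothetical benchmark results: tokens-per-second throughput and
# hourly cost for the same model served on different GPU classes.
benchmarks = {
    "gpu-class-a": {"tokens_per_sec": 5200, "cost_per_hour": 4.10},
    "gpu-class-b": {"tokens_per_sec": 2100, "cost_per_hour": 1.20},
    "gpu-class-c": {"tokens_per_sec": 600,  "cost_per_hour": 0.45},
}

def cost_per_million_tokens(b: dict) -> float:
    # Convert measured throughput and hourly price into cost per token.
    tokens_per_hour = b["tokens_per_sec"] * 3600
    return b["cost_per_hour"] / tokens_per_hour * 1_000_000

for name, b in benchmarks.items():
    print(f"{name}: ${cost_per_million_tokens(b):.2f} per million tokens")
```

In this hypothetical, the mid-range class wins on cost per token even though it is not the cheapest per hour, which is exactly the kind of result that only surfaces through testing.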
For now, the cost of deploying AI models at scale is likely to limit the number of use cases any organization deploys. There may come a day when competition drives down the cost of the GPUs widely relied on today to train and deploy AI models, but in the short term the focus is going to be on squeezing every ounce of capacity out of every GPU an IT team can lay its hands on.