HelixML this week made available a 1.0 release of its scheduler for graphics processing units (GPUs) that makes it simpler to switch between open source large language models (LLMs) while maximizing utilization of infrastructure resources.
Based on an application programming interface (API) defined by OpenAI, Helix also uses APIs to streamline retrieval-augmented generation (RAG) workflows that expose additional data to an LLM or fine-tune a foundation model.
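In practice, an OpenAI-compatible API means existing client libraries can be pointed at a self-hosted endpoint instead of OpenAI's cloud. The sketch below illustrates that pattern using the standard openai Python client; the base URL, API key and model name are illustrative placeholders rather than documented Helix values.

```python
# Minimal sketch of calling a self-hosted, OpenAI-compatible endpoint.
# The base URL, API key and model name are hypothetical placeholders,
# not documented Helix configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://helix.internal.example/v1",  # hypothetical on-prem endpoint
    api_key="YOUR_API_KEY",                       # placeholder credential
)

response = client.chat.completions.create(
    model="llama3-8b-instruct",  # whichever open source LLM is being served
    messages=[
        {"role": "user", "content": "Summarize this quarter's incident reports."}
    ],
)
print(response.choices[0].message.content)
```

Because the client-side contract stays the same, swapping one open source LLM for another is largely a matter of changing the model name the scheduler routes to.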
Helix itself can be deployed on-premises or consumed as a software-as-a-service (SaaS) application. In either scenario, Helix makes it feasible for organizations to invoke LLMs optimized for different use cases in a way that complies with regulatory mandates, because organizations retain control over their data, says HelixML CEO Luke Marsden. “The LLMs run in a private data center,” he says.
There is, of course, no shortage of options for invoking LLMs in the cloud, but the cost of the tokens used to invoke those services quickly adds up to the point where it makes more economic sense to deploy an LLM in an on-premises IT environment, noted Marsden.
That approach also makes it easier for organizations to employ multiple versions of an AI model, each accessed via a scheduler, he adds.
Responsibility for managing the infrastructure needed to train and deploy AI models is increasingly being assumed by DevOps and IT operations teams. The data scientists who typically train AI models usually lack the expertise needed to manage them at scale in production environments. The challenge those teams currently face is that they lack some of the required tooling, such as a GPU scheduler, notes Marsden.
That tooling is especially critical in an era when GPUs are an expensive, scarce commodity, he adds. Many organizations are limiting the number of AI models they train and deploy simply because access to GPU resources is limited. Helix enables IT teams to rightsize LLMs to whatever GPU resources they can provision, in a way that also maximizes utilization across multiple AI models, says Marsden.
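To make the rightsizing idea concrete, the sketch below shows a generic first-fit-decreasing placement of model replicas onto GPUs by free memory. This is only an illustration of the bin-packing problem a GPU scheduler solves, not Helix's actual algorithm; all names and numbers are hypothetical.

```python
# Illustrative only: pack model replicas onto GPUs by free memory,
# largest models first, to reduce fragmentation. Not Helix's algorithm.
from dataclasses import dataclass, field


@dataclass
class GPU:
    name: str
    total_mem_gb: float
    used_mem_gb: float = 0.0
    models: list = field(default_factory=list)

    @property
    def free_mem_gb(self) -> float:
        return self.total_mem_gb - self.used_mem_gb


def place(models: dict[str, float], gpus: list[GPU]) -> dict[str, str]:
    """Assign each model (name -> required GB) to the first GPU with room."""
    placement = {}
    for model, need_gb in sorted(models.items(), key=lambda kv: -kv[1]):
        for gpu in gpus:
            if gpu.free_mem_gb >= need_gb:
                gpu.used_mem_gb += need_gb
                gpu.models.append(model)
                placement[model] = gpu.name
                break
        else:
            placement[model] = "unschedulable"  # would require another GPU
    return placement


if __name__ == "__main__":
    fleet = [GPU("a100-0", 80.0), GPU("l40s-0", 48.0)]
    demand = {"mixtral-8x7b-q4": 45.0, "llama3-70b-q4": 42.0, "phi-3-mini": 8.0}
    print(place(demand, fleet))
```

In this toy example the two larger quantized models land on separate cards while the small model fills leftover capacity, which is the kind of packing a scheduler automates so expensive GPUs do not sit partially idle.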
Today, of course, most of those AI models are deployed on GPUs, but there will come a day when more of them are also deployed on other classes of processors. Each IT team will need to decide for itself which tradeoffs in accuracy can be made in the name of lowering infrastructure costs.
In the meantime, the number of generative AI models of varying sizes that will need to be trained, customized, fine-tuned and deployed is only going to increase. Unfortunately, determining which type of LLM lends itself best to a given use case is still a matter of trial and error. The pace at which LLMs are advancing is also likely to create scenarios where one LLM is superseded by another multiple times over the course of an application's lifecycle.
In effect, the management of LLMs and the underlying IT infrastructure required to run them is about to profoundly change the way applications are managed and updated.