
Red Hat today added Red Hat AI Inference Server to its portfolio, an offering that makes it simpler to deploy artificial intelligence (AI) workloads on its Linux and Kubernetes platforms.
Announced at the Red Hat Summit 2025 conference, Red Hat AI Inference Server is based on the open-source vLLM inference server and optimization technologies that Red Hat gained with its acquisition of Neural Magic. The vLLM inference server, originally developed at the Sky Computing Lab at the University of California, Berkeley, is being made available in a container platform, dubbed llm-d, that makes it possible to distribute AI workloads running large language models (LLMs) across multiple Kubernetes clusters.
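For readers unfamiliar with vLLM, the project exposes a compact Python API for batched LLM inference. The sketch below is a minimal illustration, assuming vLLM is installed and a GPU is available; the model identifier is a placeholder rather than one of Red Hat's curated models.

```python
# Minimal vLLM offline inference sketch (assumes `pip install vllm` and GPU access).
# The model identifier is a placeholder, not a Red Hat-validated model.
from vllm import LLM, SamplingParams

prompts = ["Explain what an inference server does in one sentence."]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM handles request batching, paged KV-cache management and scheduling internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```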
Designed to run on Red Hat Enterprise Linux and Red Hat OpenShift, an application development and deployment platform based on Kubernetes, Red Hat AI Inference Server can be deployed on multiple classes of graphics processing units (GPUs) running either in a public cloud or in on-premises IT environments.
The vLLM project provides IT organizations with an alternative to CUDA, the framework developed by NVIDIA for deploying AI applications on its processors, says Brian Stevens, senior vice president and AI chief technology officer for Red Hat. The llm-d project takes that a step further by making it possible to distribute those workloads across multiple Kubernetes clusters, he adds. “Kubernetes is increasingly becoming the orchestration framework for LLM serving,” says Stevens.
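To illustrate what LLM serving looks like from the application side, vLLM exposes an OpenAI-compatible HTTP endpoint when run as a server, so existing client code can simply be pointed at it. The sketch below assumes a vLLM server is already listening locally on port 8000; the endpoint and model name are placeholders.

```python
# Querying a vLLM server through its OpenAI-compatible API (assumes a server is
# already running locally, e.g. one started with `vllm serve <model>` on port 8000).
from openai import OpenAI

# The base_url, api_key and model name are placeholders for whatever the server exposes.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what an inference server does."}],
)
print(response.choices[0].message.content)
```

Because the endpoint is protocol-compatible with the OpenAI API, the same client code should work in principle whether the model is served from a single node or distributed by llm-d across a cluster.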
In addition, Red Hat is adding support for the Model Context Protocol (MCP), developed by Anthropic, which standardizes how LLMs and AI workloads connect to external tools and data sources.
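For context, MCP defines a standard way for an LLM-based application to discover and call external tools. The toy server below uses the MCP Python SDK; it is a generic illustration of the protocol, not Red Hat's integration, and the word-count tool is purely hypothetical.

```python
# Toy MCP tool server using the Model Context Protocol Python SDK
# (assumes `pip install mcp`; the tool itself is a hypothetical example).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("inference-utilities")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    # Serve the tool over stdio so an MCP-capable LLM client can call it.
    mcp.run()
```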
Red Hat AI Inference Server also provides access to LLM compression tools that reduce the size of both foundation and fine-tuned AI models. Additionally, Red Hat is curating a set of validated and optimized models in a repository hosted on the Hugging Face cloud service.
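As a sketch of how a compressed model would be consumed, vLLM can load pre-quantized checkpoints directly from Hugging Face, picking up the quantization scheme from the model's configuration. The repository name below is a placeholder, not one of Red Hat's validated models.

```python
# Loading a pre-quantized checkpoint with vLLM (sketch only; the repository
# name is a placeholder, not a Red Hat-validated model).
from vllm import LLM, SamplingParams

# vLLM reads the quantization scheme (e.g. FP8 or INT4 weights) from the model's
# configuration, so the smaller checkpoint loads through the same API as a full-size one.
llm = LLM(model="example-org/llama-3.1-8b-instruct-quantized.w8a8")

params = SamplingParams(max_tokens=64)
outputs = llm.generate(["What does model compression trade off?"], params)
print(outputs[0].outputs[0].text)
```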
Finally, Red Hat is working with its parent company, IBM, to make InstructLab, a set of tools for fine-tuning AI models, available as a cloud service.
While most AI applications continue to run on GPUs from NVIDIA, a growing number are starting to be deployed on GPUs and CPUs provided by AMD and Intel. Organizations that adopt the vLLM platform will be able to build and deploy AI models more flexibly as additional advances are made by NVIDIA, Intel and now AMD, notes Stevens.
It’s not clear which teams within organizations are deploying inference servers. In some cases, data science teams have their own engineers who manage AI infrastructure. In other organizations, however, AI platforms are managed by a centralized IT team alongside traditional servers and storage systems.
Regardless of who manages them, Red Hat is making the case for an approach that lets organizations mix and match processors, platforms and models as they see fit. The challenge, of course, is that many existing AI workloads already make extensive use of CUDA. Refactoring those applications to run on an inference server based on vLLM will require skills and expertise that may be difficult to find. However, as AI workloads continue to evolve, the need to deploy them anywhere will likely require organizations to revisit the software stacks used to build and deploy their AI applications.