
While much attention is focused on artificial intelligence (AI) training of large, complex models, the reality for most organizations is different. Instead of building models from scratch, many companies are leaning towards using pre-trained models and fine-tuning them for specific tasks.
These companies focus the majority of their resources and operational effort on serving the models at scale for real-world applications. This is where the challenge of inferencing at scale becomes paramount.
At the AI Infrastructure Field Day in April, Google offered a deeper look at how its new GKE Inference Gateway helps organizations optimize their inference workloads at scale.
What Does Inferencing at Scale Mean?
Inferencing at scale is taking a trained or fine-tuned model and using it to make predictions or generate outputs based on new data, handling massive volumes of requests efficiently, reliably and with low latency.
Unlike the contained and constrained environment of training, inference happens “in the wild,” directly interacting with users, applications and other systems.
According to Vaibhav Katkade, senior product manager at Google Cloud Networking, inferencing with modern large language models (LLMs) is “six orders of magnitude as computationally intensive as traditional web serving or even traditional inference requests.”
This computational intensity, combined with potentially “highly variable processing times in the order of seconds to minutes versus traditional requests that complete in the order of milliseconds,” makes traditional scaling and load balancing approaches insufficient.
Challenges of Inferencing at Scale with Kubernetes
Vaibhav laid out the unique challenges inferencing at scale presents, particularly when leveraging Kubernetes:
- Constrained Accelerator Capacity: GPU and TPU capacity, essential for accelerating inference, is often constrained across various regions. Customers frequently struggle to find and secure enough capacity.
- Inefficient Traffic Distribution: Distributing traffic effectively across nodes and pods is critical, but traditional techniques aren’t tuned to the workload profiles of LLM inferencing. Traditional round-robin load balancing can lead to uneven traffic distribution that impacts overall performance.
- Dynamic Compute Allocation Difficulty: Determining and allocating the right amount of compute resources is a dynamic problem that depends on multiple factors, including the request volume, model, size, GPU type and latency objectives. Accurately allocating compute resources remains an “uphill task for platform operators,” Vaibhav noted.
Google’s Approach: The GKE Inference Gateway
Recognizing these challenges, Google Cloud has invested in enhancing its infrastructure, particularly within Google Kubernetes Engine (GKE), to better support AI workloads throughout their lifecycle.
The enhancements include high-speed, secure connectivity options, currently at 100 Gbps and scaling up to 400 Gbps, with application awareness for prioritizing data movement. Additionally, GKE clusters now support up to 65,000 nodes and use a purpose-built RDMA VPC, enabling up to 3.2 terabytes of non-blocking GPU to GPU connectivity.
The focus on inference at scale has led to the introduction of the GKE Inference Gateway. A component of the Kubernetes infrastructure layer, the GKE Inference Gateway operates as a cloud load balancer that routes traffic directly to pods within the GKE cluster. The gateway is tuned specifically for AI inferencing, incorporating AI safety and security guardrails, AI-aware load balancing, autoscaling and routing to increase serving density.
Designed for developers and platform operators creating and deploying inference applications, including organizations running internal workloads as well as model service providers who wish to expose their models publicly, the GKE Inference Gateway is optimized for LLM serving.
Features and Benefits
The GKE Inference Gateway incorporates several key features designed to tackle the challenges of LLM inference at scale:
- Optimized Load Balancing Based on Inference Metrics: Extensive benchmarking revealed that the most effective metric for optimal inferencing performance is Key/Value (KV) cache utilization. Vaibhav said that by routing requests to the model server with the least utilized KV cache, the gateway helps avoid queuing and improves performance, leading to “60% lower latency and 40% higher throughput of inference serving” in Google’s benchmarking.
- Autoscaling Based on Model Server Metrics: The gateway goes beyond just load balancing by also enabling autoscaling based on these model server metrics. This addresses the dynamic compute allocation problem, and different autoscaling thresholds can be set for different workload types, like production versus dev/test.
- Increased Model Density With LoRa Adapters: To combat the constraint of accelerator capacity, the gateway supports the use of Low-Rank Adaptation (LoRA). This allows customers to use a common base model and multiplex and load multiple model use cases on that one base model on a single GPU or TPU accelerator. Particularly useful for scenarios like serving models fine-tuned for different languages from a single base model instance on a single GPU, LoRa adapters help increase model density and accelerator efficiency.
- Multi-Region Capacity Chasing: The gateway can route requests to regions where accelerator capacity is available. This helps manage load surges or capacity constraints in a primary region by allowing requests to be served from capacity across multiple Google Cloud regions as part of a single gateway. This provides a more efficient use of pooled capacity compared to managing dedicated regional clusters.
- Model-Aware Routing and Prioritization: Compliant with the OpenAI API specification, the gateway can inspect the model name in the body of the requests for prioritization and autoscaling. This provides granular control over how different model deployments are served.
- Workload Prioritization: Beyond model names, you can assign different serving priority levels to jobs. This allows prioritizing latency-sensitive critical jobs like chatbots over standard or schedulable jobs like reasoning agents or batch workloads.
Integrating AI Security and Safety Guardrails
Vaibhav emphasized the critical importance of integrating AI security and safety guardrails into the inference process, noting that LLMs present an additional attack surface, with specific concerns around prompt injection and safety of output. Traditionally, embedding safety and security checks within application code is a fragmented approach. By integrating these AI security tools directly at the gateway level, platform and infrastructure teams gain centralized control. This ensures a consistent level of baseline coverage across all models running in the cluster and the environment.
Google has integrated Google’s own Model Armor, which screens requests and responses using configurable filters and sensitivity levels, and is also partnering with leading providers like Palo Alto Networks and NVIDIA for additional safety and security capabilities.