Modern AI infrastructure is unprecedentedly complex: building and running large-scale AI clouds requires holistic control and visibility across the entire hardware and software stack. While the underlying hardware determines raw performance, software plays the key role in achieving high efficiency and cost-effectiveness.
How Hardware Shapes Performance
AI workloads are compute-intensive, so hardware capabilities directly determine peak performance. GPU memory bandwidth, tensor processing unit throughput, and interconnect latencies create hard performance ceilings that software cannot exceed. When debugging performance issues, teams must trace execution paths from application-level tensor operations down to silicon-level compute units. Without full-stack visibility, it becomes nearly impossible to tell whether a 40% performance degradation originates from memory bottlenecks, interconnect saturation, or kernel scheduling.
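One common way to reason about such ceilings is the roofline model, which bounds attainable throughput by the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The following is a minimal sketch under stated assumptions: the accelerator numbers and the two workload intensities are hypothetical placeholders, not measurements of any real device.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline ceiling in FLOP/s: min(compute peak, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def bound_by(peak_flops: float, mem_bandwidth: float,
             arithmetic_intensity: float) -> str:
    """Classify a kernel as memory-bound or compute-bound."""
    # The ridge point is the intensity at which both ceilings meet.
    ridge = peak_flops / mem_bandwidth
    return "memory" if arithmetic_intensity < ridge else "compute"


# Hypothetical accelerator (assumed values, for illustration only).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 2e12     # 2 TB/s

# Hypothetical kernels with assumed arithmetic intensities (FLOPs per byte).
kernels = [("low-intensity kernel", 1.0), ("high-intensity kernel", 200.0)]

for name, intensity in kernels:
    ceiling = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity)
    kind = bound_by(PEAK_FLOPS, MEM_BANDWIDTH, intensity)
    print(f"{name}: ceiling {ceiling / 1e12:.1f} TFLOP/s, {kind}-bound")
```

If a kernel's measured throughput sits well below its ceiling, the gap probably points to software (scheduling, launch overhead, data layout) rather than to the hardware limit itself.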
Unblocking Performance with Software
While hardware sets performance limits, software determines how efficiently those resources are utilized.
Hardware/software co-design and co-optimization are becoming increasingly necessary. For instance, model quantization was first conceived as a software-level optimization technique, but recent GPUs are rapidly expanding their native support for it. Applied properly, model quantization can reduce memory footprint by up to 4x (for example, by moving from 32-bit floating-point to 8-bit integer weights), while optimal batch sizing can improve GPU utilization to 97%. Compiler-level optimizations, such as those in TensorRT or XLA, can deliver 2-3x efficiency gains through kernel fusion and memory layout optimization. Debugging suboptimal performance across complex model layouts, mixed quantization, and complicated DNN software stacks with many configuration options poses an immediate challenge for AI practitioners and enterprise developers.
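The memory savings from quantization follow directly from the number of bytes stored per weight. Here is a rough sketch of that arithmetic; the 7-billion-parameter model size is a hypothetical example, and the estimate covers weights only, ignoring activations, KV cache, and the small overhead of quantization scales and zero points.

```python
# Bytes needed to store one weight in each numeric format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate weight-only memory footprint in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def reduction_factor(src_dtype: str, dst_dtype: str) -> float:
    """How many times smaller the weights get when going from src to dst."""
    return BYTES_PER_PARAM[src_dtype] / BYTES_PER_PARAM[dst_dtype]


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")

print("fp32 -> int8 reduction:", reduction_factor("fp32", "int8"))
```

Whether the smaller footprint turns into faster inference depends on whether the target hardware has native support for the lower-precision format.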
Unfortunately, this is only part of the problem. As workloads grow, higher-level, inter-node hardware/software co-design and co-optimization techniques are emerging. Disaggregated serving requires careful coordination among the job orchestrator, high-speed storage, and the interconnect fabric. With disaggregated serving, a single LLM inference request passes through multiple accelerators and potentially multiple nodes scattered across the datacenter. There are even attempts to use different GPUs and NPUs to separate the prefill and decode stages of transformer models. To avoid communication bottlenecks in the request processing pipeline, RoCE (RDMA over Converged Ethernet) and InfiniBand connections are becoming the norm, and the complexity of network configuration and virtualization is rapidly increasing. Open-source projects such as llm-d aim to abstract this complexity, but most are still at a very early stage.
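To see why the interconnect matters when prefill and decode run on different nodes, consider the key/value cache that would have to move between them. The sketch below is a back-of-the-envelope estimate only: the model shape, prompt length, and link speeds are hypothetical, and it assumes the full KV cache is transferred at ideal line rate with no protocol overhead or overlap with compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (illustrative, not a real model)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # e.g. 2 for 16-bit values


def kv_cache_bytes(shape: ModelShape, num_tokens: int) -> int:
    """Size of the KV cache for num_tokens: one key and one value per layer/head."""
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_elem)
    return per_token * num_tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link rated in gigabits per second."""
    bytes_per_second = link_gbps * 1e9 / 8
    return num_bytes / bytes_per_second


shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
prompt_tokens = 4096  # assumed prompt length

size = kv_cache_bytes(shape, prompt_tokens)
print(f"KV cache: {size / 2**20:.0f} MiB")
for gbps in (100, 400):  # assumed link speeds
    print(f"{gbps} Gb/s link: {transfer_seconds(size, gbps) * 1e3:.1f} ms")
```

Even under these idealized assumptions, the transfer adds milliseconds to every request, which is one reason low-latency fabrics and careful placement by the orchestrator matter.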
Holistic Observability Through Stack Integration
The role of Kubernetes and other container-based workload orchestrators is expanding beyond conventional expectations. They were originally expected to manage and track container lifecycles, but now they need to provide high-fidelity access to, and coordination of, the underlying hardware for full control and visibility. Monitoring individual layers is not enough; correlating events and metrics across the entire stack is the key to resolving performance and efficiency problems.
To enhance the scalability and resiliency of AI systems, observability integrations should be able to correlate hardware metrics (temperature, power consumption, memory utilization) with software telemetry (model accuracy drift, inference latency, batch processing times) in real time. With such carefully designed integrations, an orchestration platform could avoid hotspots, prevent faulty nodes from slowing down the entire job, and automatically recover from transient failures.
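As a rough illustration of what such correlation could look like, the sketch below joins per-node temperature samples with inference latency samples on their timestamps, flags nodes that are much slower than the rest of the fleet, and checks how strongly latency tracks temperature on each node. All node names, sample values, and the 1.5x slowdown threshold are hypothetical; a real integration would pull these series from its own metrics and tracing backends.

```python
from statistics import mean, median


def join_on_timestamp(hw_samples, sw_samples):
    """Pair hardware and software samples that share a timestamp."""
    sw_by_ts = dict(sw_samples)
    return [(value, sw_by_ts[ts]) for ts, value in hw_samples if ts in sw_by_ts]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def find_stragglers(latency_by_node, slowdown=1.5):
    """Nodes whose median latency exceeds `slowdown` times the fleet median."""
    node_medians = {n: median(v) for n, v in latency_by_node.items()}
    fleet_median = median(node_medians.values())
    return sorted(n for n, m in node_medians.items() if m > slowdown * fleet_median)


# Hypothetical (timestamp, value) samples: GPU temperature in degrees Celsius.
temperature = {
    "node-a": [(0, 62.0), (1, 63.0), (2, 64.0)],
    "node-b": [(0, 61.0), (1, 61.5), (2, 62.0)],
    "node-c": [(0, 70.0), (1, 80.0), (2, 90.0)],
}
# Hypothetical (timestamp, value) samples: inference latency in milliseconds.
latency = {
    "node-a": [(0, 20.0), (1, 21.0), (2, 20.5)],
    "node-b": [(0, 21.0), (1, 20.0), (2, 22.0)],
    "node-c": [(0, 25.0), (1, 40.0), (2, 60.0)],
}

stragglers = find_stragglers({n: [v for _, v in s] for n, s in latency.items()})
for node in stragglers:
    pairs = join_on_timestamp(temperature[node], latency[node])
    temps, lats = zip(*pairs)
    print(f"{node}: slow, temperature/latency correlation {pearson(temps, lats):.2f}")
```

A strong correlation on a flagged node suggests a hardware cause such as thermal throttling, which an orchestrator could act on by draining or rescheduling work away from that node.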
Benefitting from Connecting Hardware and Software Vertically
Vertical integration delivers measurable advantages through end-to-end optimization. Co-designed hardware-software stacks achieve higher performance per watt than generic implementations. This integration could enable advanced features such as dynamic voltage scaling synchronized with inference workload patterns, hardware-accelerated model compression, and predictive thermal management.
The economic imperative is clear: AI infrastructure costs compound across hardware CapEx, cooling OpEx, and software inefficiencies. Vertically integrated management reduces total cost of ownership by eliminating redundant monitoring tools, minimizing performance gaps, and enabling precise resource allocation and orchestration based on unified telemetry from the entire AI infrastructure stack.
What’s Next?
New challenges and opportunities in hardware-software co-design are emerging as agentic AI and agent VMs see widespread adoption. For example, privacy and security concerns around running LLMs and MCP (Model Context Protocol) servers will intensify, leading to new requirements such as confidential computing combined with optimized disaggregated serving architectures. Keeping pace with these demands requires continuous innovation. Building and improving cloud-native AI infrastructure that combines scalability and resilience will be critical to supporting next-generation agentic AI.
KubeCon + CloudNativeCon North America 2025 is taking place in Atlanta, Georgia, from November 10 to 13.

