Synopsis: The default reflex for running AI workloads has been to point them at whichever hyperscaler the rest of the business already lives on. That worked while AI was an experimental side project. It stops working once inference becomes a real line item on the infrastructure budget. The constraints that determine whether an inference workload performs well (GPU availability, memory bandwidth, and network latency to both the model and the data) are not the constraints the big clouds were optimized for.

Kyle Sosnowski, VP Engineering for Cloud at Crusoe, sits down with Mike Vizard to make the case that AI-native organizations are quietly building a different kind of cloud strategy. Multi-cloud and neo-cloud deployments are no longer being driven by procurement politics; they are being driven by hard physics. In Sosnowski's view, the right GPU class in the right region, with the right interconnect and sitting close enough to the data, beats a default hyperscaler footprint by a wide margin on both latency and unit economics.

Token economics is doing most of the forcing. Once a workload scales from prototype traffic to production traffic, cost-per-token math starts dictating where models can run profitably, which GPUs they should run on, and how much of the inference layer needs to live at the edge rather than in a centralized region. Purpose-built infrastructure becomes the difference between an AI feature that pays for itself and one that quietly eats margin.

Sosnowski's broader point is that the next layer of AI infrastructure work is agentic. Enterprises serious about scaling AI workloads are spending engineering time on automated capacity placement, on observability that understands GPU utilization and model behavior, and on self-managing inference fleets. They are not spending it on hand-tuning deployments inside whichever public cloud they happened to standardize on five years ago.