Synopsis: In this Techstrong AI video, Galileo CTO Atin Sanyal explains why the capabilities of artificial intelligence (AI) agents will need to be continuously tracked and ranked.
In this Techstrong AI interview, Mike Vizard talks with Atin Sanyal, CTO and co-founder of Galileo, about the company’s new AI agent leaderboard, which is designed to benchmark agent performance for real-world, industrial use cases. Sanyal explains that Galileo’s mission grew out of the challenges he observed at Uber and Apple, where a lack of robust AI evaluation often led to production failures. Instead of relying solely on academic benchmarks, Galileo’s leaderboard evaluates agents across 25 industry-specific tasks, using proprietary metrics to reveal surprising differences between models’ practical performance and their academic reputations.
One major takeaway from the leaderboard, Sanyal shares, is that performance differences among top agents are often minimal, around 4%, while cost differences can be as high as 20x. This gives businesses an opportunity to swap out expensive models for cheaper ones without major trade-offs in quality. As the AI landscape rapidly evolves, companies are increasingly building internal tools or using platforms like Galileo’s to dynamically A/B test, monitor, and swap models in production environments. This flexibility is becoming essential because new models, tools, and frameworks are constantly entering the market, making static, one-time decisions about AI infrastructure impractical.
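As a rough illustration of that trade-off, the following Python sketch picks the cheapest model whose score falls within a chosen tolerance of the best score. It is not Galileo's method or API. The `Candidate` type, the `cheapest_within_tolerance` function, the model names, and every number are hypothetical, and the 4% figure is treated here as an absolute gap in score points, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical model label
    score: float        # evaluation score in [0, 1] from your own benchmark
    cost_per_1k: float  # illustrative cost in dollars per 1,000 requests


def cheapest_within_tolerance(candidates, tolerance=0.04):
    """Return the cheapest candidate scoring within `tolerance` of the best."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best_score = max(c.score for c in candidates)
    # Keep only models whose score gap to the leader is acceptable.
    eligible = [c for c in candidates if best_score - c.score <= tolerance]
    return min(eligible, key=lambda c: c.cost_per_1k)


# Made-up numbers, chosen only to mirror a small quality gap and a 20x cost gap.
candidates = [
    Candidate("model-a", score=0.90, cost_per_1k=20.0),
    Candidate("model-b", score=0.87, cost_per_1k=1.0),
    Candidate("model-c", score=0.80, cost_per_1k=0.5),
]

choice = cheapest_within_tolerance(candidates)
print(choice.name)
```

With these inputs, the rule passes over the top-scoring but expensive model and the cheapest but much weaker one. In practice the scores would come from evaluation on an organization's own tasks and would need to be re-run as new models appear.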
Looking ahead, Sanyal predicts the AI space will divide into two main categories: expensive, high-reasoning models for complex tasks, and cheaper, lightweight models for common applications like chatbots and summarization. He also foresees more automation, with AI models eventually helping orchestrate the multiplexing of tools and models across ecosystems. However, he cautions that many organizations today are still overwhelmed by “tool fatigue” and urges them to invest in systems that allow fast experimentation and real-time performance monitoring to stay adaptable in this fast-moving environment.