Lemony today launched an open source tool that makes it simpler for application developers to route queries to different large language models (LLMs) based on parameters such as cost and latency.
The cascadeflow tool makes it simpler for developers to, for example, dynamically determine if a prompt might be better served by a smaller LLM that costs less to invoke than automatically defaulting to a more expensive option, says Lemony CEO Sascha Buehrle.
Based on a quality engine that Lemony developed for its own efforts to embed AI into applications running at the network edge, the cascadeflow tool, rather than relying on static rules, is designed to speculatively invoke small, fast models first. It then validates the quality of responses using configurable thresholds and, based on that assessment, will escalate a query to a larger model should the quality validation fail.
Via an application programming interface (API) provided, developers can invoke AI models hosted on OpenAI, Anthropic, Groq, Ollama, vLLM, Together or Hugging Face platforms.
Research suggests 40-70% of text prompts and 20-60% of agent calls don’t need expensive larger models, which, if routed to a smaller AI model, will substantially reduce total costs, says Buehrle. Organizations can also rescue application programming interface (API) calls by 40-85% through intelligent model cascading and speculative execution, he adds.
Additionally, built-in telemetry for query, model, and provider-level cost tracking makes it possible to programmatically enforce spending caps, notes Buehrle.
In effect, Lemony has created a tool that understands the prompt well enough to determine how best to respond using a specific LLM. That capability will ultimately drive more developers to invoke smaller, domain-specific models that often are a better option, says Buehrle.
It’s not clear to what degree costs are limiting the number of AI initiatives that organizations might be willing to launch, but as the number of these projects increases, it quickly becomes apparent how expensive it is to deploy an AI application that depends on AI models that are invoked using tokens. Each input and output requires a token, so every time an end user, for example, types in “hello” and “thank you,” the number of tokens used increases.
With the rise of AI agents that are invoking LLMs programmatically, those token costs are likely to rise even higher.
Naturally, the providers of AI models don’t have much of an incentive, so it’s up to the organizations using these AI models to find ways to control costs. The challenge, of course, is that every developer seems convinced they need to access the latest AI models that happen to run on the most expensive graphical processor units (GPUs) available.
Hopefully, there will come a day soon when more rational decisions about which type of AI model to use at what instance are made based on factors such as cost and latency. In the meantime, however, there is at least one open source tool available that makes it clear to developers that AI models are anything but free.


