Tensormesh Makes Serverless Caching Service for AI Inference Generally Available

Tensormesh today made generally available an inference engine for artificial intelligence (AI) applications, delivered as a software-as-a-service (SaaS) platform built on caching software that is deployed using a serverless computing framework.

Fresh off raising an additional $20 million in financing, Tensormesh CEO Junchen Jiang says the Tensormesh Inference platform makes use of key-value (KV) caching to reduce the cost of running AI applications. Instead of reprocessing the same data within the context window of every prompt, Tensormesh Inference stores frequently used data in a cache that is readily accessible via an OpenAI-compatible application programming interface (API), which also provides access to a curated catalog of frontier models.

Based on open source LMCache software, the Tensormesh Inference service eliminates the need to recreate a full context window by storing frequently used data in cache, thereby cutting token costs, speeding time to first token, and reducing the number of API calls that need to be made, says Jiang. “Using KV caching improves efficiency,” he adds.

In the absence of any type of caching software, each call to a model reprocesses the full context window, including system prompts, conversation history, and tool definitions, from scratch. Savings compound as cache hit rates rise above 70%, notes Jiang. Because cached input tokens cost nothing when a request is served from the KV cache, spending on graphics processing unit (GPU) resources can be reduced by a factor of ten, he says.
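To make the reprocessing point concrete, here is a minimal sketch of prefix reuse. It is a toy model, not the LMCache or Tensormesh implementation: the `PrefixCache` class and its `process` method are hypothetical names, tokens are plain strings, and a real KV cache stores attention key/value tensors rather than token lists.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy stand-in for a KV cache (hypothetical, for illustration only)."""

    def __init__(self):
        self.seen = []  # token sequences processed so far

    def process(self, tokens):
        """Return (cached, computed) token counts for one request."""
        # The longest prefix shared with any earlier request can be reused.
        cached = max((common_prefix_len(tokens, s) for s in self.seen), default=0)
        self.seen.append(list(tokens))
        return cached, len(tokens) - cached


# Six system-prompt tokens and three tool-definition tokens, repeated per call.
system_and_tools = ["sys"] * 6 + ["tool"] * 3
cache = PrefixCache()
first = cache.process(system_and_tools + ["q1"])
second = cache.process(system_and_tools + ["q1", "a1", "q2"])
print(first, second)  # (0, 10) (10, 2)
```

In this sketch the second call only has to compute the two new tokens, because the system prompt, tool definitions, and earlier conversation turn are already in the cache.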

The platform also gives IT teams direct control over how much cache backend storage is allocated to their deployments and surfaces the metrics they need to understand exactly how that storage is performing. Cache hit rate, KV cache usage ratio, and token-level cost breakdowns are all visible in real time to enable IT teams to continuously tune their cache configuration and maximize the portion of requests served from storage.

There is also a Cost Savings Dashboard that makes the financial impact of caching visible in real time by tracking cache hit rate and the ratio of cached to total prompt tokens, then converting that into a dollar amount that is continuously updated.
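The arithmetic behind such a dashboard might look something like the sketch below. It is not Tensormesh's code: the `SavingsTracker` class, its field names, and the $2.00 per million tokens price are all hypothetical, and it assumes, per Jiang's description, that cached input tokens cost nothing.

```python
from dataclasses import dataclass


@dataclass
class SavingsTracker:
    """Hypothetical running tally of cached versus total prompt tokens."""

    price_per_million_input_tokens: float  # assumed list price, in dollars
    cached_tokens: int = 0
    total_prompt_tokens: int = 0

    def record(self, prompt_tokens, cached_tokens):
        """Add one request's token counts to the running totals."""
        if not 0 <= cached_tokens <= prompt_tokens:
            raise ValueError("cached_tokens must be between 0 and prompt_tokens")
        self.total_prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens

    @property
    def cached_ratio(self):
        """Ratio of cached to total prompt tokens (0.0 before any request)."""
        if self.total_prompt_tokens == 0:
            return 0.0
        return self.cached_tokens / self.total_prompt_tokens

    @property
    def dollars_saved(self):
        """Dollars not spent, assuming cached input tokens are free."""
        return self.cached_tokens * self.price_per_million_input_tokens / 1_000_000


tracker = SavingsTracker(price_per_million_input_tokens=2.0)
tracker.record(prompt_tokens=1000, cached_tokens=0)    # cold cache
tracker.record(prompt_tokens=1000, cached_tokens=800)  # mostly served from cache
print(tracker.cached_ratio, tracker.dollars_saved)
```

With these made-up numbers, 800 of 2,000 prompt tokens are served from cache, which is a ratio of 0.4, and the running savings figure grows with every cached token recorded.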

The overall goal is to reduce the total cost of deploying, for example, AI agents by cutting the processing that would otherwise be required every time the same data is reprocessed, and to do so in a way that eliminates the need for an internal IT team to provision or manage IT infrastructure.

Backed by AMD Ventures, CoreWeave, and NVIDIA’s NVentures, Tensormesh is one of several vendors that are now looking to offload processing from large language models by using cache, creating an index, or providing access to a graph or database.

Regardless of approach, the one thing that is certain is that as the cost of accessing AI infrastructure becomes more apparent, organizations are becoming much more sensitive to it. The challenge and the opportunity now is determining how to run as many AI workloads as possible in production environments within the confines of a highly limited set of infrastructure resources.
