A model on its own is typically not enough. It needs data, and that data arrives in a very specific format, which has to be the same format used at the time of inference or prediction. In reality, though, data changes all the time, and sometimes even the data formats change. So you typically need to put something in front of the model that makes sure the incoming data still fits the template, header, or schema the model saw when it was trained. That validation layer is a very particular kind of artifact, and it belongs to MLOps.
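To make that concrete, here is a minimal sketch of such a validation layer in Python. The column names, dtypes, and the model.predict call are illustrative assumptions rather than a reference to any particular stack; the point is only that the check runs before every inference call.

```python
# Minimal sketch of a schema check placed in front of a model at inference time.
# The expected columns and dtypes are hypothetical; in practice they would be
# captured from the training pipeline.
import pandas as pd

EXPECTED_SCHEMA = {      # schema the model saw during training (assumed)
    "age": "int64",
    "income": "float64",
    "country": "object",
}

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Reject or coerce incoming data that no longer matches the training schema."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    extra = set(df.columns) - set(EXPECTED_SCHEMA)
    if extra:
        df = df.drop(columns=sorted(extra))    # drop fields the model never saw
    for col, dtype in EXPECTED_SCHEMA.items():
        if str(df[col].dtype) != dtype:
            df[col] = df[col].astype(dtype)    # coerce drifting dtypes
    return df[list(EXPECTED_SCHEMA)]           # enforce column order

# validated = validate_input(incoming_batch)
# prediction = model.predict(validated)        # hypothetical model object
```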

MLOps Tools and Principles

The only way to effectively productionize any machine learning project at scale is with MLOps tools and principles. But these projects bring unique problems that are, by their demanding nature, universally challenging for engineering teams. Those challenges are driving the rise of foundation model operations (FMOps) and large language model operations (LLMOps), a subset of FMOps.

Models keep getting larger and larger, especially generative AI models and large language models. “They are so large that you need specialized infrastructure, and you sometimes need to come up with new creative ways so you can ensure that responses are delivered in a timely fashion,” says Dr. Ingo Mierswa, an industry-veteran computer scientist and founder of Altair RapidMiner. “And you need to start understanding if you can maybe sacrifice some precision of your models to actually reduce the memory footprint.”
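One common way to trade a little precision for a smaller memory footprint is to lower the numeric precision of the weights. The sketch below, on a toy PyTorch network, casts weights to float16 and applies dynamic int8 quantization; the layer sizes are placeholders, and production LLM deployments typically use more specialized 8-bit or 4-bit weight schemes.

```python
# Minimal sketch of trading precision for memory. The toy network stands in
# for a much larger model; all sizes are illustrative assumptions.
import copy
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Halve weight memory by casting float32 parameters to float16.
model_fp16 = copy.deepcopy(model_fp32).half()

# Or store Linear weights as int8 via dynamic quantization.
model_int8 = torch.ao.quantization.quantize_dynamic(
    copy.deepcopy(model_fp32), {nn.Linear}, dtype=torch.qint8
)

def param_bytes(m: nn.Module) -> int:
    """Bytes used by trainable parameters (int8-packed weights are stored separately)."""
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(param_bytes(model_fp32), param_bytes(model_fp16))  # fp16 uses ~half the bytes
```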

All of that is very new. “This problem didn’t exist in this form about 20 years ago when I started in this field,” Mierswa reflected during our call. “We had been running into all kinds of memory issues back then, but for a long time that was not a problem anymore. Now, thanks to generative AI and LLMs, the problem is back: because of how much data is being generated, we are developing resource-intensive models that the consumer-grade hardware in front of us is not sufficient to work with.”

And if specialized hardware is needed, he implied, it also means the engineer needs specialized skills to work with that hardware, make it scale, and reduce the memory footprint of the models.

Consider the GPT-based family of applications, most of which are chat-based (text-to-text) applications: you type something, and the response streams back as a series of tokens or text. “One of the reasons behind that is the inference time of GPT is very slow, on the order of several seconds. And for deployment, the challenge that engineering teams face when deploying large language models for search, recommendation or ad applications is that the latency requirements are on the order of several milliseconds, not seconds,” says Raghavan Muthuregunathan, Senior Engineering Manager at LinkedIn, who leads Typeahead and whole-page optimization for LinkedIn Search.
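A rough back-of-the-envelope calculation shows why those two numbers are so far apart. The token count and per-token decode time below are assumptions for illustration, not measurements of GPT or any specific system.

```python
# Illustrative latency arithmetic (assumed numbers, not benchmarks).
tokens_in_response = 400     # a few paragraphs of generated text
ms_per_token = 30            # one autoregressive decode step on a typical GPU

full_response_ms = tokens_in_response * ms_per_token
time_to_first_token_ms = ms_per_token

print(f"full response: ~{full_response_ms / 1000:.1f} s")   # ~12 s end to end
print(f"first token:   ~{time_to_first_token_ms} ms")       # why streaming feels fast
```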

And how are big tech engineering teams trying to solve that? “Knowledge distillation, and fine-tuning, where engineers deploy a very small, fine-tuned model for that specific task within a GPU. This helps decrease the inference time from several seconds to just a few hundred milliseconds,” Muthuregunathan explained. “In fact, there is a technique called lookahead decoding. It is still a very active area of research where engineers are trying to see whether LLMs’ inference time can be reduced.”
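As a rough illustration of the distillation idea, the sketch below trains a small student network to match a larger teacher’s output distribution, so that only the student has to be served at inference time. The architectures, temperature and loss weighting are generic placeholder choices, not a description of how LinkedIn or any other team does it.

```python
# Minimal knowledge-distillation sketch: a small student learns to imitate a
# large teacher so the student alone can be deployed for low-latency serving.
# All sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 1000))
student = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1000))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature, alpha = 2.0, 0.5   # softening and loss-mixing knobs (assumed)

def distill_step(x, labels):
    with torch.no_grad():
        teacher_logits = teacher(x)          # teacher is frozen
    student_logits = student(x)
    # Soft targets: match the teacher's softened output distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: still fit the task labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# loss = distill_step(torch.randn(32, 512), torch.randint(0, 1000, (32,)))
```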

Google has a very limited preview of its AI-powered generative search experience, where you can ask a question such as ‘tell me about Pier 57 in NYC’ and watch the AI answer load alongside the regular search results. It takes a few seconds, and that delay is likely due to the slow inference time of a large language model. That is also why, for a query like ‘Donald Trump,’ Google won’t provide AI-generated results immediately: they know this specific query would take more time, and users probably don’t need the summary instantly.

“The user’s intent is more likely to navigate to some specific web page than to consume the LLM-summarized content,” Muthuregunathan pointed out. So they’ve introduced a ‘generate’ button: if you choose to, you can wait those several seconds, and then the results will be generated.

“The way people are circumventing this latency issue is through the product experience rather than through advanced AI or infrastructure techniques,” he explained. They’ve created a button so that when people click it, they are okay with waiting a few seconds, as opposed to sitting through that delay after every typed query, which would not make for a good user experience.

“And everyone is trying to stream from LLMs to circumvent this latency limitation, instead of making a single call to the LLM and getting the entire response at once,” he mentioned. “Why? Because every application is becoming more of a streaming application; that’s part of why most of these applications are chat applications instead of search engine applications.”
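The streaming pattern itself is simple: surface tokens as they are produced instead of waiting for the whole completion. Here is a minimal, vendor-neutral sketch in which generate_tokens is a hypothetical stand-in for a real model’s decode loop, not any particular provider’s API.

```python
# Minimal streaming sketch: yield tokens as they are generated instead of
# returning one blob at the end. generate_tokens() is a hypothetical stand-in
# for a real model's decode loop.
import time
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    for token in ["MLOps ", "keeps ", "models ", "honest ", "in ", "production."]:
        time.sleep(0.03)                       # pretend each decode step takes ~30 ms
        yield token

def stream_response(prompt: str) -> str:
    pieces = []
    for token in generate_tokens(prompt):
        print(token, end="", flush=True)       # user sees output immediately
        pieces.append(token)
    print()
    return "".join(pieces)

# stream_response("what does an MLOps platform do?")
```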

Reducing the memory footprint is inordinately hard. We don’t have good hardware for vision-related tasks, and the availability of GPUs for production use cases remains a challenge. But there are sparks of progress: at the recent AWS re:Invent, Nvidia founder and CEO Jensen Huang announced that the company’s new AI chip, the H200, will be available to AWS customers.

Google is bringing Cloud TPU v5p and its AI Hypercomputer architecture for accelerated deep learning workloads. OpenAI has joined the arms race, developing its own chipset. And Tesla? They’re definitely forging ahead.

The extent to which these efforts will deliver on their promise of overcoming computational limitations is still up in the air. Because of those limitations, it remains common for engineering teams to simply add more boxes, given that AI workloads rely heavily on extremely high-performance computing nodes.
