Onehouse today revealed it is adding a vector embedding generator to its data lake to make it simpler for organizations to customize large language models (LLMs) at scale.
As part of its managed extract, load and transform (ELT) service, the vector embedding generator will make it possible to create pipelines that streamline retrieval-augmented generation (RAG) workflows, says Onehouse CEO Vinoth Chandar.
In addition to reducing the total cost of storage, that approach makes it simpler to build and deploy custom generative artificial intelligence (AI) applications at scale because vector embeddings no longer need to be stored in a local database, he adds.
Instead, text, audio and video data are passed to embedding models via the ELT pipeline managed by Onehouse, and the models return the embeddings to Onehouse, which stores them in highly optimized tables within an existing centralized data management workflow. Data and embeddings are then passed to a database only when needed, says Chandar. “The data lake becomes the source of truth,” he says.
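In practice, that pattern looks something like the following minimal Python sketch. It is illustrative only and does not use Onehouse’s actual API; the embedding model, file path and table schema are all assumptions. It shows the general idea of generating embeddings inside a pipeline step and landing them in a columnar table rather than a standalone vector database.

```python
# Minimal sketch of the general pattern: generate embeddings inside an
# ELT-style pipeline step and store them alongside the source records in
# a columnar table. Hypothetical example; not Onehouse's actual API.
import pandas as pd
from sentence_transformers import SentenceTransformer


def embed_and_store(records: list[dict], out_path: str = "embeddings.parquet") -> None:
    """Extract text, generate embeddings, and load them into a table."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    texts = [r["text"] for r in records]
    vectors = model.encode(texts)  # one vector per record
    table = pd.DataFrame(
        {
            "id": [r["id"] for r in records],
            "text": texts,
            "embedding": [v.tolist() for v in vectors],
        }
    )
    # The table, not a separate vector database, holds the embeddings.
    table.to_parquet(out_path, index=False)


embed_and_store([{"id": 1, "text": "Quarterly revenue rose 12%."}])
```

The design choice mirrors Chandar’s point: because the embeddings land in ordinary tables, the data lake itself becomes the source of truth.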
Longer term, Onehouse is also exploring the possibility of exposing vector embeddings directly from its data lake to an LLM, notes Chandar.
One of the major challenges organizations are encountering as they operationalize LLMs is managing the data pipelines needed to convert data into vectors that an LLM can search. The overall goal is to extend the capability of an LLM beyond the initial data set it was trained on, which enables organizations to apply generative AI to their data without having to train their own LLM. It also reduces the odds that the LLM will return inaccurate outputs, known as hallucinations, by enabling it to search more relevant data.
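At query time, those vectors are what make RAG work. The sketch below is again a hypothetical illustration rather than any vendor’s API: it embeds a user’s question, ranks the stored embeddings by cosine similarity and prepends the best matches to the prompt, so the model answers from data it was never trained on.

```python
# Hedged sketch of the retrieval step in RAG: embed the question, find the
# most similar stored vectors, and ground the prompt in the matching text.
# Names and the similarity metric are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def build_prompt(query: str, texts: list[str], vectors: np.ndarray, k: int = 3) -> str:
    q = model.encode([query])[0]
    # Cosine similarity between the query and every stored embedding.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]  # indices of the k closest documents
    context = "\n".join(texts[i] for i in top)
    # Grounding the prompt in retrieved data reduces hallucinations.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```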
Each organization will need to decide to what degree it will rely on data engineers to manage that process. Onehouse is streamlining data workflows in a way that should make customizing LLMs accessible to a broader range of organizations.
It’s not clear how many organizations are opting to build LLMs versus customizing foundation models made available by providers such as OpenAI, Microsoft, Anthropic, Hugging Face and Amazon Web Services (AWS). It may be simpler for many organizations to customize an existing LLM than to build their own, but some organizations do have the data science expertise needed to build one. Even then, however, those organizations are likely to employ RAG techniques to extend the pool of data an LLM can access.
Regardless of approach, however, many companies are still trying to determine what the killer application for generative AI is for their organization, notes Chandar.
Of course, an LLM is only as useful as the quality of the data exposed to it. Organizations that have not historically managed data well will need to make sure any data exposed to an LLM is of the highest quality to ensure the best possible results. Otherwise, end users over time won’t trust the LLM’s outputs enough to warrant the investment required to build these applications in the first place.