In an AI world where big tends to be the norm, from large language models (LLMs) to the powerful systems built to run AI workloads, NVIDIA is developing a text-to-image personalization method whose model is smaller than those in use today.
NVIDIA briefly described the technology, called “Perfusion,” in a May 2023 blog post as one of about 20 papers on generative AI and neural graphics that the GPU maker will present at the upcoming SIGGRAPH 2023 show, which kicks off August 6, 2023, in Los Angeles, California.
In the post, Aaron Lefohn, vice president of graphics research at NVIDIA, described Perfusion as a “highly compact model … which takes a handful of concept images to allow users to combine multiple personalized elements—such as a specific teddy bear and teapot—into a single AI-generated visual.”
In this case, “highly compact” means that adding a new visual concept to an existing model takes only 100KB of parameters, orders of magnitude less than the hundreds of megabytes or gigabytes needed by other techniques used by technologies like DALL-E, Midjourney or Stable Diffusion.
Perfusion essentially lets users make small updates to a text-to-image model rather than having to retrain it entirely.
Perfusion fits with NVIDIA’s laser focus on AI and machine learning, which founder and CEO Jensen Huang said almost a decade ago would be the key drivers of the company’s growth.
“The computer industry is going through two simultaneous transitions—accelerated computing and generative AI,” Huang said in a statement announcing fiscal first-quarter numbers in May. “A trillion dollars of installed global data center infrastructure will transition from general-purpose to accelerated computing as companies race to apply generative AI into every product, service and business process.”
Text-to-Image Has its Challenges
A text-to-image model is a machine learning algorithm that lets users write prompts in natural language to create an AI-generated image. As shown in a research paper about Perfusion written with Tel Aviv University in Israel, that could mean a prompt like “a cat acting in a play wearing a costume” or “a sculpture wearing a sombrero.”
The models learn the relationship between image data and text, then use it to generate images from a prompt. Text-to-image personalization comes in when a user wants to fine-tune the output by adding new concepts, like a specific cat or sombrero image, to the model.
“Text-to-image models … offer a new level of flexibility by allowing users to guide the creative process through natural language,” researchers wrote in the paper. “However, personalizing these models to align with user-provided visual concepts remains a challenging problem.”
Those problems include keeping the high quality of the image while enabling the user to keep creative control, combining multiple personalized concepts in a single image and keeping the model small. Perfusion addresses these, they wrote.
Central to this is what NVIDIA calls key-locking, which makes small updates to the representations already in the model, such as adding the concept of a red sombrero and locking it to the more general concept of a hat. Retraining happens only on the newly introduced concepts rather than on the entire model, which means less demand for expensive compute power and storage.
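To make the idea of a small, locked update more concrete, here is a minimal sketch in Python. It is not NVIDIA's implementation. It assumes, purely for illustration, that the model maps word embeddings to "keys" with a single weight matrix, and it uses a plain rank-one edit so that a new concept produces the same key as its general category. The names `W_k`, `e_hat`, `e_sombrero` and `lock_key`, the toy sizes and the random values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4  # toy sizes, chosen only for illustration

W_k = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained key projection
e_hat = rng.normal(size=d_in)         # hypothetical embedding of the general word "hat"
e_sombrero = rng.normal(size=d_in)    # hypothetical embedding of the new "red sombrero" token


def lock_key(W, e_concept, e_super):
    """Return W plus a rank-one term so that e_concept maps to the key of e_super."""
    target_key = W @ e_super
    residual = target_key - W @ e_concept
    return W + np.outer(residual, e_concept) / (e_concept @ e_concept)


W_locked = lock_key(W_k, e_sombrero, e_hat)

# The new concept now produces the same key as the general category.
print(np.allclose(W_locked @ e_sombrero, W_k @ e_hat))

# The difference from the original weights is a single rank-one term.
print(np.linalg.matrix_rank(W_locked - W_k))
```

In this toy version, the only new information is one pair of vectors rather than a full copy of the weights, which is the intuition behind keeping a concept small. The method described in the paper is more involved than this sketch.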
The Key is Key-Locking
Key-locking mitigates overfitting, in which a model is so tightly tied to the images it was initially trained on that it struggles to create new versions of the idea.
At the same time, Perfusion enables more than one personalized concept, like a red sombrero and the sculpture of a human, to be combined in a single image via a text prompt.
“The concepts are individually learned and merged only during the runtime process to produce the final image,” the researchers wrote.
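A rough way to picture "learned individually, merged at runtime" is sketched below in Python. This is an illustrative assumption rather than the researchers' code: each concept is stored on its own as a small pair of vectors, and the shared weights are left untouched until a prompt is processed. The names `concepts` and `project` are hypothetical, and the toy embeddings are made exactly orthogonal so the two concepts do not interfere, which real embeddings would not guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 8, 4  # toy sizes, chosen only for illustration

W = rng.normal(size=(d_out, d_in))  # stand-in for shared pretrained weights

# Orthonormal toy embeddings (an assumption that keeps the example exact).
Q, _ = np.linalg.qr(rng.normal(size=(d_in, 2)))
e_sombrero, e_sculpture = Q[:, 0], Q[:, 1]

# Each concept is stored separately as (input direction, desired output).
concepts = {
    "sombrero": (e_sombrero, rng.normal(size=d_out)),
    "sculpture": (e_sculpture, rng.normal(size=d_out)),
}


def project(x, active):
    """Apply the shared weights, then add each active concept's correction."""
    out = W @ x
    for name in active:
        e, target = concepts[name]
        gate = (x @ e) / (e @ e)  # how strongly x points along this concept
        out = out + gate * (target - W @ e)
    return out


both = ["sombrero", "sculpture"]
print(np.allclose(project(e_sombrero, both), concepts["sombrero"][1]))
print(np.allclose(project(e_sculpture, both), concepts["sculpture"][1]))
```

Because the corrections are applied only inside `project`, either concept can be switched on or off per prompt without retraining or modifying `W`.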
These innovations mean that personalizing a text-to-image model with Perfusion requires only 100KB per concept, which lets users build more creative AI images using more detailed prompts and concepts than other methods allow. It also makes personalizing the models less expensive, opening up opportunities for more people at a time when art platforms that use AI are proliferating.
That said, the researchers noted limitations that still need to be addressed. Word choices can over-generalize a concept, and for now it takes a lot of work to get the text prompt right and ensure the desired results.
More work will be done on the technology, and NVIDIA seems intent on releasing the code.