OpenAI this week is taking the next step in the generative AI field that it jumpstarted with the release of the ChatGPT chatbot, introducing Sora, a text-to-video generator that can create a video as long as 60 seconds based on a typed prompt.
The company debuted the generator with a series of high-fidelity videos created from text prompts. In addition, OpenAI CEO Sam Altman on his X (formerly Twitter) feed solicited prompts from followers and created another series of stunning videos featuring everything from a street-level tour of a futuristic city to two Golden Retrievers podcasting from the top of a mountain to an instructional cooking session for making gnocchi.
“Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” OpenAI wrote in the Sora announcement. “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”
Despite what the Sora demonstrations show, it will be a while before the video generator is released to the public. In announcing Sora – Japanese for “sky” – OpenAI said it is taking steps to ensure the security and safety of the tool by making it available to red teamers, who will assess it for risks and harms.
The company also is seeking feedback from visual artists, filmmakers, and designers to help improve the presentation and make it more useful to “creative professionals.”
That said, OpenAI executives wrote that they want to “give the public a sense of what AI capabilities are on the horizon.”
Google, Meta, Others on the Same Path
OpenAI isn’t the only high-profile AI company pursuing text-to-video capabilities. Google late last month released a research paper on its own tool, called Lumiere, which uses a new architecture called Space-Time U-Net. As the company wrote, “by deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales.”
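To make that idea concrete, here is a minimal sketch of joint space-time down- and up-sampling in PyTorch. The channel counts, kernel sizes, and tensor shapes are invented for illustration, and the block below is a toy stand-in rather than Lumiere’s actual architecture.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Toy space-time down/up-sampling, loosely in the spirit of
    Lumiere's Space-Time U-Net. All sizes here are invented for
    illustration; this is not Google's implementation."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Stride 2 along time, height, and width compresses the clip
        # in both space and (importantly) time.
        self.down = nn.Conv3d(channels, channels * 2,
                              kernel_size=3, stride=2, padding=1)
        self.up = nn.ConvTranspose3d(channels * 2, channels,
                                     kernel_size=4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width)
        h = self.down(x)   # halves frames, height, and width
        return self.up(h)  # restores the original resolution

clip = torch.randn(1, 64, 16, 32, 32)  # a 16-frame, 32x32 feature map
print(SpaceTimeBlock()(clip).shape)    # torch.Size([1, 64, 16, 32, 32])
```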
Meta is building out its image- and video-generation capabilities with Expressive Media Universe – or Emu – the company’s first foundation model for image generation.
Meanwhile, Apple earlier this month released its own research paper about the technology behind its Keyframer tool that enables users to create an animated video from a still image and a natural-language text prompt.
For its part, OpenAI wrote that for Sora, the company trains text-conditional diffusion models jointly on videos and images of varying durations, resolutions and aspect ratios.
“We leverage a transformer architecture that operates on spacetime patches of video and image latent codes,” the executives wrote in a detailed research note. “Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.”
The inspiration for the process comes from large language models (LLMs) that train on “internet-scale data.” LLMs owe part of their success to tokens, which unify diverse modalities of text, including code, math and various natural languages. Sora’s architecture takes its inspiration from LLMs, but with some differences.
“Whereas LLMs have text tokens, Sora has visual patches,” they wrote. “Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.”
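As a rough illustration of what a visual patch is – not OpenAI’s code – a latent video tensor can be cut into small space-time blocks and flattened into a sequence of patch tokens, the video analogue of text tokens. The tensor shapes and patch sizes below are hypothetical.

```python
import torch

def spacetime_patches(latent: torch.Tensor, pt: int = 2,
                      ph: int = 4, pw: int = 4) -> torch.Tensor:
    """Cut a latent video tensor into flattened space-time patches.

    latent: shape (frames, height, width, channels), e.g. the output
    of a video autoencoder. All sizes are hypothetical, not Sora's.
    Returns (num_patches, pt*ph*pw*channels): one row per "visual
    patch," the video analogue of a text token.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three patch-grid axes together, then flatten each patch.
    x = x.permute(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

# A 16-frame, 32x32, 8-channel latent yields (16/2)*(32/4)*(32/4) = 512 tokens.
tokens = spacetime_patches(torch.zeros(16, 32, 32, 8))
print(tokens.shape)  # torch.Size([512, 256])
```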
Sora, like other GPT models, uses a transformer architecture that the company said fuels its scaling performance.
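Continuing the sketch under the same assumptions, the resulting patch tokens can be fed to an off-the-shelf transformer encoder. The layer sizes below are arbitrary and show only the token-sequence interface, not OpenAI’s model.

```python
import torch
import torch.nn as nn

# Arbitrary sizes for illustration; not OpenAI's configuration.
dim, num_tokens = 256, 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)
patch_tokens = torch.randn(1, num_tokens, dim)  # e.g. the 512 tokens above
out = encoder(patch_tokens)                     # shape preserved: (1, 512, 256)
```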
Improvements Need to Be Made
The videos are impressive and the model has myriad capabilities, including a deep understanding of language that enables it to accurately interpret prompts and generate characters. It also can create multiple shots within a single generated video while keeping the characters and visual styles consistent.
That said, there are parts of the model that need to be improved, they wrote.
“It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect,” the executives wrote. “For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.”
The model also may not completely understand the spatial details of a prompt, so it can mix up left and right, and it may struggle with precise descriptions of events that unfold over time, such as following a specific camera trajectory.
A Focus on Safety
OpenAI also is putting an emphasis on safety, an ongoing concern with generative AI. Lawmakers and government agencies in the United States, the European Union, and elsewhere have been trying to keep up with the technology’s rapid innovation and adoption, aiming to ensure that the data and privacy of organizations, consumers, and governments are secure and that people are protected against deepfakes, voice cloning, and other tools that can generate disinformation and similar threats.
The Biden Administration in October 2023 issued an executive order addressing necessary security and privacy standards around AI, which agencies throughout the government have since worked to implement. Most recently, the Federal Trade Commission this week proposed rules banning AI-generated impersonation of individuals, governments and businesses.
Meanwhile, the European Parliament this week approved a preliminary agreement for the EU’s AI Act, which would include a risk-based framework for AI applications.
Looking Toward AGI
OpenAI executives wrote that even as Sora begins making waves now, it also has a role in the future development of artificial general intelligence (AGI), the theoretical stage at which AI can learn, comprehend and perform as well as humans.
“Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI,” they wrote.
Altman and others believe that, given the breakneck pace of innovation in generative AI, AGI is only years away, while others, such as Yann LeCun, vice president and chief AI scientist at Meta, disagree. In an interview with Time, LeCun said that what LLMs can do when trained at scale is “astonishing,” but limited. Such systems don’t understand the real world and require “enormous amounts of data to reach a level of intelligence that is not that great in the end.”
“And they can’t really reason,” LeCun said. “They can’t plan anything other than things they’ve been trained on. So they’re not a road towards what people call ‘AGI.’ I hate the term. They’re useful, there’s no question. But they are not a path towards human-level intelligence.”