What to Make of OpenAI's New Text-to-Video Technology
Is it an expensive toy? A new disinformation-producing tool? A scam? A bit of everything?
Hello, all. Parker here. Happy Monday.
Let’s jump right into this week’s First Five segment.
The Present Age is a reader-supported newsletter. Please consider becoming a free or paid subscriber. Thank you!
First Five: Stories on a Single Topic to Start Your Week
This week’s theme is OpenAI’s Sora text-to-video technology. We don’t yet know a whole lot about it, and it’s not yet available for the public to try. Still, the example videos posted by OpenAI and its testers look like a pretty large leap forward for this tech. Below, I’ve curated five stories from the internet that run the gamut from people worried this brings us closer to a Matrix-like dystopia to one writer who flat-out calls it a scam.
But first, here’s a look at OpenAI’s video examples:
Watch OpenAI’s Sora Make Lifelike Videos Just From Text and Descriptions (Designboom, Matthew Burgos)
OpenAI debuts its new video generation model Sora, which can create realistic AI videos just from text prompts and instructions. In a recent interview with Bill Gates, returned OpenAI CEO Sam Altman mentioned the future of ChatGPT, which he hoped could also generate videos from text. That dream has finally come true in the form of Sora, and the text-to-video AI model can generate videos up to a minute long while, as the OpenAI team claims, ‘maintaining visual quality and adherence to the user’s prompt.’
OpenAI has released a series of samples from its new text-to-video model Sora. The text prompts need to be detailed so that the generated video can capture the visuals the user wants. So far, the text-to-video Sora can understand long instructions such as ‘The camera rotates around a large stack of vintage televisions all showing different programs — 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery.’
OpenAI Sora: One Step Away From The Matrix (The Algorithmic Bridge, Alberto Romero)
Sora is a (primitive) world simulator
This is the news that has excited (worried?) me the most.
First, here’s a recap. Sora is a text-to-video model. Fine, it’s better than the rest but this technology already existed. Sora is a diffusion transformer. Likewise, OpenAI hasn’t invented the mix albeit they added interesting custom ingredients. Sora is a general and scalable visual model. Things begin to get interesting here. Possibilities open up for future research and surprise is warranted.
But, above all else, Sora is an AI model that can create physically sound scenes with believable real-world interactions. Sora is a world simulator. A primitive one, for sure (it fails, sometimes so badly that it’s better to call it “dream physics”) but the first of a kind.
OpenAI says Sora not only understands style, scenery, character, objects and concepts present in the prompt, etc., but also “how those things exist in the physical world.” I want to qualify this claim by saying that Sora’s eerie failures reveal that, although it might have learned an implicit set of physical rules that inform the video generation process, this isn’t a robust ability (OpenAI admits so much). But surely it’s a first step in that direction.
OpenAI’s Sora Is a Total Mystery (The Atlantic, Matteo Wong)
By OpenAI’s own admission, [Sora] struggles with depicting physics, cause and effect (the company says that you might ask for a video of a person biting into a cookie, only to notice that no bite mark is left behind), and other simple details (a man is shown running the wrong way on a treadmill). Internet sleuths have uncovered still other failures, such as disappearing objects and misshapen hands. Nonetheless, the product appears astonishing—which, for all the excitement, raises exceedingly familiar yet serious concerns over deepfakes, copyright infringement, artists’ livelihoods, hidden biases, and more.
Meanwhile, the internet swirls with paparazzi-esque theories and observations: guesses about how Sora works; insinuations that Sora is not generating new things but copying existing videos; comparisons showing similarities between its videos and the outputs of a leading text-to-image model. These concerns, for now, cannot be found right or wrong. The public still barely understands the inner workings of DALL-E and ChatGPT, but at least we can test those products’ capabilities for ourselves; with Sora’s announcement, OpenAI has entered the realm of mythmaking.
OpenAI's new text-to-video tool, Sora, has one artificial intelligence expert "terrified" (CBS News, Megan Cerullo)
Advances in technology have seemingly outpaced checks and balances on these kinds of tools, according to [Oren Etzioni, founder of TruMedia.org], who believes in using AI for good and with guardrails in place.
"We're trying to build this airplane as we're flying it, and it's going to land in November if not before — and we don't have the Federal Aviation Administration, we don't have the history and we don't have the tools in place to do this," he said.
All that's stopping the tool from becoming widely available is the company itself, Etzioni said, adding that he's confident Sora, or a similar technology from an OpenAI competitor, will be released to the public in the coming months.
Of course, any ordinary citizen can be affected by a deepfake scam, in addition to celebrity targets.
"And [Sora] will make it even easier for malicious actors to generate high-quality video deepfakes, and give them greater flexibility to create videos that could be used for offensive purposes," Dr. Andrew Newell, chief scientific officer for identity verification firm iProov, told CBS MoneyWatch.
AI Video Is A Scam (Aftermath, Gita Jackson)
At first, the idea of generating videos with motion just from a text prompt sounds like magic. This is indeed a cutting edge technology, and one that sounds like it’s from a cyberpunk fever dream. But looking at the generated images for even a couple minutes reveals how uncanny and unpolished they are. Look at the way that the people in this video clap—it makes the hairs on the back of my neck stand on end.
More than that, the video does not match the details in the prompt. I don’t see an expression of pure joy and happiness on this grandmother’s face. Her friends aren’t seated around the table, but at a different separate table behind her. At no point does she blow out the candles either. You can see the figure make a few gestures as if she’s going to blow out the candles before wriggling around like a snake wearing human skin. Even a second’s glance at this moving image causes it to fall apart.
While people breathlessly circulated these videos as both a significant achievement in artificial intelligence and also a threat to the very fabric of reality, all I can see are the obvious flaws. I look at moving images all day, and have done so for over thirty years. I know what a human being looks like, how human bodies move, the conscious and unconscious gestures we make as we all move through life. There is more information stored in my brain about the moving image than could be crammed into an AI model like Sora; the human brain and eye are always going to outclass a machine designed to mimic them.
To me, the sample videos posted by OpenAI all have some pretty big “video game” vibes to them. Still, this technology is only going to keep improving. Just check out this video from 10 months ago to see how far things have come in under a year (this video is true nightmare fuel):
Will this be used for good? Eh. Will this be used for evil? Probably. Is this an ethical minefield that will probably put people out of work? Almost certainly. I’ve long been equal parts fascinated and horrified by generative machine learning/”AI” type stuff. Personally, I was more comfortable with and amused by the nightmare fuel stage of this. Now that it’s getting a bit more realistic, I’m not sure what to think.
What are your thoughts?
That’s it for me today. Thanks for reading. I’ll be back tomorrow with another newsletter.
Parker