How OpenAI Sora works: Multimodal AI
Most people have seen Sora's video examples by now, and they were startling enough to stop development plans in their tracks. You might not even realize that this tech can now integrate audio, too (albeit not completely). How did we get here so quickly?
Multimodal AI
There's a lot of hype around the idea of multimodal AI, which is just a fancy way to say "AI that understands different media". Another way to put it: "generative AI serving as a complete creative stack". If AI can churn out images, icons, videos, and even sound effects...well, that's a fairly complete creative pipeline, isn't it? That doesn't mean you can make a whole movie with it (yet), but maybe bits and pieces...and it depends on the type of movie you're making.
Skeptics will point out the manifold flaws in this tech, while creatives in the industry get that sinking feeling in the pit of their gut that the robots are coming for them much faster than they thought. It doesn't have to be perfect to be immensely attractive to producers who salivate at the prospect of AI-driven pipelines. But hey, it's just hype, so there's no need to worry. Hollywood producers have a reputation for being ethical, right?
One reason these demos are so eyebrow-raising is that there are few other "modes" left to expand to. If video is good enough to pass as a proof of concept (it is affecting the industry already, so let's say it is), then "multimodal" stops being a buzzword after a while; the focus becomes refinement and scale, and AI's multimodal abilities become the norm.
But...how does it work?
First: throw out anything you read on TwitterX, because that's not really a great forum to learn something so deep. OpenAI released a paper describing how Sora works, so...yeah, let's use that as a source. They should know! Anything else is just speculation. To quote from their article:
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space and subsequently decomposing the representation into spacetime patches.
Well that makes it clear, right? No further questions, right? This isn't so difficult to understand once you unravel the Star Trek lexicon.
First, tokens are a really important aspect of how AI works. In a text model, sentences are broken down into "tokens", slicing something complex into smaller pieces for analysis. Each token might be an individual word, a punctuation mark, or even a partial word.
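To make that concrete, here's a tiny sketch using OpenAI's tiktoken library (their open-source text tokenizer). This is just text tokenization, not whatever Sora does internally, and the sample sentence is made up, but it shows what "slicing into tokens" actually looks like:

```python
# A small demo of text tokenization with OpenAI's tiktoken library
# (pip install tiktoken). This illustrates tokens in general; it is not
# how Sora processes video.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT models

text = "Sora turns random noise into video."
token_ids = enc.encode(text)
print(token_ids)                             # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])  # the text chunk behind each ID
```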
If you've ever studied compilation, you might already be familiar with the idea of tokens. For example, in PHP, part of the compilation process is "tokenizing" a file -- which serves a similar purpose. It breaks the code into smaller units, which makes it easier for the interpreter to "put back together" later. Tokenization matters any time we want a computer to understand something, even beyond the field of AI.
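Python's standard library can do the same trick on Python source, which makes for a quick way to watch a compiler-style tokenizer at work (the line being tokenized is just an arbitrary example):

```python
# Watching a compiler-style tokenizer slice source code into units,
# analogous to PHP's token_get_all(), using Python's built-in tokenize module.
import io
import tokenize

source = "total = price * quantity  # a made-up line of code\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```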
If we understand the idea behind tokenization with text prompts, we can understand how it might work for video, too. The model must "tokenize" a source video. With video, this is more complex, but the same idea applies. Just as a text-based tokenizer might split out individual words, a video "tokenizer" must break the video apart to understand its individual facets. The goal is to take videos as input and produce "representational" data as output; the AI then trains on that representational data and learns to convert it back into pixels.
To use what might become a tired analogy, each video is like a puzzle. Compressing the video into a lower-dimensional latent space is the process of breaking that puzzle into many distinct pieces, while the AI uses those pieces to attempt to output a video, as if gluing the pieces back together. By doing this, it can compress the video in multiple dimensions: both spatially (throwing away pixels that aren't needed) and temporally (throwing away some frames).
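If you want to see what "spacetime patches" might look like in code, here's a toy NumPy sketch. The shapes and patch sizes are invented for illustration, and the real model carves up a compressed latent representation rather than raw pixels, but the chopping itself is the same idea:

```python
# Toy sketch: slicing a video tensor into "spacetime patches".
# Sizes are invented; the real model patches a compressed latent, not raw pixels.
import numpy as np

T, H, W, C = 16, 64, 64, 3              # frames, height, width, color channels
video = np.random.rand(T, H, W, C)      # stand-in for an actual video

pt, ph, pw = 4, 16, 16                  # patch extent in time, height, width
patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
patches = patches.reshape(-1, pt * ph * pw * C)    # one flat vector per patch

print(patches.shape)  # (64, 3072): 64 spacetime "puzzle pieces", each 3072 numbers
```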
What is this "diffusion model" people keep talking about?
Now that we understand what's happening at a high level, how does the thing actually generate a frame of video given only a text prompt? If you know a bit about neural networks, you can understand the basic premise. With a neural network, you initialize every weight randomly. The output will be nonsense, but you have to start somewhere when you're training. The same is true in a diffusion model. The "starting point" is noise: a bunch of random pixels. Over multiple iterations, the idea is to essentially "reverse" that noise into the desired output. There are obviously a lot of granular details (and maths) that go into this, but even with a basic understanding of a neural net, this isn't so surprising. AI is very much stochastic in how it works.
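As a deliberately over-simplified sketch of that loop: start from pure noise and repeatedly subtract what the model thinks is noise. Everything here is a stand-in -- the step count and especially the "model", which in reality is a huge trained network with a proper noise schedule, not a one-liner:

```python
# Over-simplified sketch of the diffusion idea: start from noise, then
# repeatedly remove a little of the estimated noise. The real algorithm has
# proper noise schedules and a trained network; this only shows the loop's shape.
import numpy as np

def predict_noise(x, step):
    # Stand-in for a trained network's noise estimate at this step.
    return 0.1 * x

x = np.random.randn(64, 64, 3)   # pure random noise, a toy "frame"
num_steps = 50

for step in reversed(range(num_steps)):
    x = x - predict_noise(x, step)   # peel away a bit of estimated noise

# In the real model, x would now be a coherent frame instead of faded noise.
print(x.std())
```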
In other words, given a bunch of random noise, the AI knows how to slowly replace the noise to create new images, as if piecing our puzzle back together. This process works better with more compute, as the linked paper so humorously illustrates. It's worth noting here that Sora is not really a "physics engine" in this regard; it's using diffusion much like other AI art-generating tools.
It's possible that the tool is "learning" physics through this process as an emergent property (as they claim), but that's really hard to prove! For now, all we can say is that this works via diffusion, similar to other AI tools that generate pixels.
Another thing that's really important to note is this quote from their article:
Training text-to-video generation systems requires a large amount of videos with corresponding text captions.
AI depends on huge quantities of labelled data. For text-to-video prompting to work, the model needs text descriptions of videos, and that means having detailed captions for every video it trains against.
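In practice, that means the training data looks like pairs: a video plus a detailed caption describing it. A minimal sketch of that structure (the file names and captions here are entirely made up):

```python
# Minimal sketch of (video, caption) training pairs for text-to-video.
# Paths and captions are invented for illustration.
from dataclasses import dataclass

@dataclass
class CaptionedVideo:
    path: str      # where the video file lives
    caption: str   # the detailed text description the model learns from

dataset = [
    CaptionedVideo("clips/beach_dog_001.mp4",
                   "A golden retriever sprints along a sunny beach, kicking up sand."),
    CaptionedVideo("clips/city_rain_014.mp4",
                   "A rainy city street at night, neon signs reflecting in the puddles."),
]

for example in dataset:
    print(example.path, "->", example.caption)
```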
What does this mean...?
First, we live in a world where perception is much stronger than reality (okay, that's always been the nature of our world). It doesn't matter whether AI tools are production ready; companies will deploy them regardless. We see that time and again with the many hilariously horrible examples of AI-generated nonsense leaking into the world. However, this is a form of survivorship bias, too -- for every person caught with an obviously bad example of AI, there are people skating by with no attention. Hype alone will compel some companies to deploy AI tools no matter what, so it doesn't really matter how "ready" the tech even is.
Beyond affecting potentially hundreds of industries, the obvious concern is with the proliferation of misinformation.
One solution might be to require all AI-created videos to embed a watermark (invisible or not) to show that they're AI-generated...but there's no law requiring this. Sora's videos are watermarked, for the record, but a watermark is easy to miss (and, information being like toothpaste out of the tube, the damage can be irreversible). You can also crop it out or obscure it. I'm sure there will even be AI tools that automatically purge any watermarks, because of course there will be. Did you even notice a watermark in Sora's videos? Honestly, I didn't until it was pointed out, and it wasn't clear that it meant "this video was created by AI".
What happens if a scandalous (but fake) video is released on the eve of an election? How do you know whether it swayed the vote...and what happens if that means the wrong person takes office? The damage has been done and there's no real remedy, and all because someone was able to manipulate a prompt. What if similar videos sneak their way into trials? Or get plastered onto the Internet to blame someone for a crime? The possibilities are endless.
Further, it undermines the (already questionable) confidence we have in "real" media. If a video can be fake, then a video you disagree with can be fake, too. Empiricism takes a nasty turn when media is so easy to manipulate that all we can do is trust our own eyes. In that world, truth is very much in the eye of the beholder and it is up to each person to decide what they believe or don't. Any evidence that runs counter to your belief could simply be fake, and technology makes that plausible if not reasonable.
This trend is already very much reinforced by social media algorithms, because social media companies know that having your beliefs affirmed feels good. Having them challenged doesn't...but it usually makes us better human beings. A large portion of society has already been conditioned by the so-called "echo chamber" effect, and AI will only further entrench generations in their silos.
It isn't just about how AI will proliferate misinformation, but how it will shift our perspective on traditional media and undermine our belief in video as a 'factual' medium (understanding it has never been perfect).
Let's go into fantasy land and imagine an even odder extreme in how AI can be used to create media. Imagine a world where you can create your own shows, shuffling a pre-trained mix of actors and voices into an AI-generated story. As fantastical as that is (and as awful as the result might be), it isn't so absurd a fantasy given a century of development.
This idea is also antithetical to art: churning out things we want to see, keeping us comfortable, and constraining the work to fit our own views. The point of art isn't to reinforce the ideas you already have! For that, find a mirror and a quiet room. A world where people can create exactly the experiences they want is a world where media doesn't convey any perspective other than their own, which seems very bleak and boring.
Powered by What...?
Everyone who blogs will have their content ingested by AI, like it or not. That's already true for everyone who's ever touched Reddit...what makes you think it won't be true for YouTube, too? Every streamer who thinks they're posting content they own might be sad to realize that the machine has tokenized their videos, extracting patches of pixels and reconstituting them for Google's benefit. It's unclear whether Sora trained on copyrighted material (it very likely did), and the litigation around this is still ongoing.
It's entirely possible that OpenAI will be successful in arguing that everything it can touch is "fair use" because of how tokenization works. They argue that they aren't "using" copyrighted content; the machine is merely "learning" from copyrighted content...which seems like a silly distinction to me, but I'm no lawyer. Courts don't have a great track record of parsing the intricacies of tech and arriving at a fair judgment, and it's hard to be optimistic about which side they'll take.
To me, this is the icing on the crap cake: AI will make it harder for you to find work (if it doesn't replace you outright), and it will train itself on everything you've ever posted to the Internet...possibly even your face and your voice. It depends on Internet-scale data to work! You can bet AI firms will do everything in their power to argue that they should be able to crawl anything they want.
There has never been a technology that so literally aims to commodify every bit and byte ever posted to the Internet, sucking our creativity into massive algorithmic buckets like some sort of global vampire. This tech might accomplish plenty of good things for humanity, but skepticism about how it impacts our society isn't some Luddite-fueled paranoia.
Conclusion
Sora and other video models are showcasing how powerful generative AI has become. It's worth understanding how Sora and similar models work, transforming noise into video through a remarkably clever process called diffusion. Does it violate copyright? Is it showing emergent understanding of the physical world? These questions will be answered soon enough, and the answers will affect society for years to come.