oh yeah, this will be crazy. DALL-E basically does text-to-image; there's a whole area of text-to-video that's working on exactly what you're talking about.
There are some good examples from a recent paper here: https://video-diffusion.github.io/ (they generate timelapses of fireworks, rivers, pouring liquids, etc.)
So it's a very good idea you had! ;-)