Text to video: a clip from a single description

Animating a photo starts from a ready frame; text-to-video creates a clip from scratch: you have nothing but a sentence, and out comes a moving scene. It's the most "magical" and at the same time the most temperamental generation mode, because here the prompt decides everything.

How it works

You describe the scene in words, and the network draws not one frame but a whole sequence, keeping a single consistent world across the frames: the same character, light and camera movement. In essence, it is image generation stretched over time, plus an understanding of the physics of motion.

Because of this, text-to-video is more expensive and less predictable than image-to-video: the model has to invent both the content and its movement at once. That is why short scenes (5–10 seconds) come out well, while a long coherent plot is still assembled from several clips.

Describe a scene in one sentence and get a clip. A hint for your first try: "a neon street at night in the rain, a slow forward dolly, reflections in the puddles, cinematic atmosphere". Video costs more than pictures, so a clip becomes available after you sign up and complete onboarding, which grants starter tokens.

Here's the clip the network assembled from that very prompt about the neon street — without a single source frame, from text alone. Try your own below.

What a good video prompt contains

An image prompt describes the frame. A video prompt also describes the movement and time. Keep five layers in mind:

Scene — what and where. "An old lighthouse on a rocky shore".
Movement in the frame — what's happening. "…waves crash on the rocks, gulls circle".
Camera — this is new and important. "…a slow dolly in", "an orbit", "a drone shot", "a static shot".
Light and time — "the setting sun, long shadows".
Style — "cinematic, like a film still", "3D animation", "documentary".

The main difference from a picture is the camera. It's the words about camera movement that turn a "living postcard" into a "film shot". If you don't specify the camera, the model decides for itself, often badly.
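The five layers above can be assembled mechanically, which helps when you reuse the same scene with different camera moves. The following is a minimal Python sketch, not part of Twelver: the `VideoPrompt` class and its field names are hypothetical, and the comma-joined format is only one reasonable way to phrase a prompt.

```python
from dataclasses import dataclass


@dataclass
class VideoPrompt:
    """The five layers of a video prompt (hypothetical helper, illustrative names)."""
    scene: str      # what and where
    movement: str   # what happens in the frame
    camera: str     # how the camera moves
    light: str      # light and time of day
    style: str      # overall look

    def render(self) -> str:
        # Join the non-empty layers, in order, into one comma-separated prompt.
        layers = [self.scene, self.movement, self.camera, self.light, self.style]
        return ", ".join(part.strip() for part in layers if part.strip())

    def missing_layers(self) -> list[str]:
        # Names of the layers left blank, for example a forgotten camera move.
        return [name for name, value in vars(self).items() if not value.strip()]


prompt = VideoPrompt(
    scene="an old lighthouse on a rocky shore",
    movement="waves crash on the rocks, gulls circle",
    camera="a slow dolly in",
    light="the setting sun, long shadows",
    style="cinematic, like a film still",
)
print(prompt.render())
```

Calling `missing_layers()` before sending the prompt is a cheap way to notice that the camera layer was left empty.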

Common beginner mistakes

Too much action. "A person runs, jumps, turns and waves" in 5 seconds falls apart. One clear movement per clip.
Text and captions in the frame. Still a weak spot of almost all models — letters "drift". Overlay text on the finished clip separately.
Complex hands and crowds. A classic pain point; the fewer of them in the frame, the more stable the result.
Expecting a long plot. Think in "shots", not "scenes": assemble the clip from several short generations.
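Thinking in shots can be made concrete with a small storyboard check. This is a rough Python sketch under stated assumptions: the `Shot` type, the `seconds` field and the helper functions are hypothetical, and the 5–10 second bounds are simply the "short scene" range mentioned earlier in this guide, not a documented limit.

```python
from dataclasses import dataclass

MIN_SECONDS = 5   # lower bound of the short-scene range from this guide
MAX_SECONDS = 10  # upper bound of that range


@dataclass
class Shot:
    prompt: str   # one clear movement, described in one sentence
    seconds: int  # requested clip length (hypothetical parameter)


def check_shots(shots: list[Shot]) -> list[str]:
    """Return warnings for shots outside the short-scene range; empty means none."""
    warnings = []
    for number, shot in enumerate(shots, start=1):
        if not MIN_SECONDS <= shot.seconds <= MAX_SECONDS:
            warnings.append(
                f"shot {number}: {shot.seconds}s is outside {MIN_SECONDS}-{MAX_SECONDS}s"
            )
    return warnings


def total_seconds(shots: list[Shot]) -> int:
    # Length of the assembled clip if the shots are cut together back to back.
    return sum(shot.seconds for shot in shots)


storyboard = [
    Shot("an old lighthouse on a rocky shore, a drone shot", 8),
    Shot("waves crash on the rocks, a static shot", 5),
    Shot("gulls circle the lighthouse, an orbit", 20),
]
for warning in check_shots(storyboard):
    print(warning)
print(total_seconds(storyboard), "seconds requested in total")
```

Here the third shot is flagged: splitting it into two shorter generations keeps each clip to one clear movement.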

“10 video prompts that work”

Ready-made templates for ads, social media, atmospheric backgrounds and product shots, with a breakdown of which camera and light words give “cinema”.

Included in the subscription

What's next

You can now create a scene from scratch. A specific but very in-demand case is when there has to be a person in the frame who speaks. That's a separate genre with its own rules.

In the Twelver chat a video prompt is written like an ordinary message — the clip comes back in reply. Starter tokens for video are granted after signing up and onboarding.

Try it yourself

Everything in this guide runs inside Twelver

One chat for text, images, video, music and voice — no separate services or subscriptions.

Open Twelver chat

Related page: Video Generation
