What a neural network that draws really is
When people say "a neural network drew a picture", it's easy to imagine a tiny artist sitting inside. In reality it's both simpler and more amazing. Once you understand exactly how words turn into an image, you'll stop seeing the result as magic or a lottery — and start controlling it.
The machine doesn't draw — it guesses
A neural network doesn't run a brush across a canvas left to right. It works the other way around: it starts from random noise — fine static, like an old TV — and step by step removes what doesn't belong until a picture emerges from the chaos, one that matches your description.
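To make "start from noise and clean it up step by step" concrete, here is a deliberately tiny sketch. It is a toy, not a real generator: the names `toy_denoise` and `distance` are made up for this illustration, and the known `target` stands in for what a trained network would have to estimate at every step.

```python
import random

def toy_denoise(target, steps=20, seed=0):
    """Start from pure noise and move a little closer to `target` each step.

    Assumption for illustration only: a real model has no `target` to look at.
    A trained network estimates, at every step, which part of the current
    image is noise. Here the known target plays the role of that estimate.
    """
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # the "TV static"
    history = [list(image)]
    for step in range(steps):
        remaining = steps - step
        # Remove a fraction of whatever still separates the image from the target.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
        history.append(list(image))
    return history

def distance(a, b):
    """Euclidean distance between two equally long lists of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

target = [0.0, 0.5, 1.0, 0.5, 0.0]  # a five-"pixel" picture
history = toy_denoise(target)

# Print how far the image is from the target at a few points along the way.
for step in (0, 5, 10, 15, 20):
    print(step, round(distance(history[step], target), 3))
```

The printed distances shrink with every step and end at (practically) zero: the picture is not painted stroke by stroke, it gradually comes into focus everywhere at once.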


To be able to do this, the model was trained in advance on hundreds of millions of "image + caption" pairs. It didn't memorize the pictures themselves. Instead it captured connections: what a "sunset" usually looks like, how "watercolour" differs from "a photograph", that a cat has four legs, not five. When you write a prompt, the model assembles the image that's most consistent with everything it absorbed about those words.
From this come two important consequences that explain almost all the "oddities" of generation:
- No two results are the same. By default every run starts from fresh random noise, so the same description gives a slightly different picture each time. That's not a glitch but the very nature of the method.
- The model is strong where it saw many examples. There were plenty of cats, landscapes and portraits in training, so they come out great. Legible text on a sign, or a hand with exactly five fingers, appeared far less consistently, so the model errs there more often.
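The first point can be sketched in a few lines. This is only an illustration of where the randomness comes from: `starting_noise` is a hypothetical helper, and the idea of a "seed" setting is an assumption here (many image tools expose one, but this guide does not say whether yours does).

```python
import random

def starting_noise(seed=None, size=4):
    """Return the random "static" a generation would start from.

    seed=None asks Python for fresh randomness, so each call differs.
    An explicit seed makes the noise, and therefore the run, repeatable.
    """
    rng = random.Random(seed)
    return [round(rng.gauss(0.0, 1.0), 3) for _ in range(size)]

fresh_a = starting_noise()         # new noise on every call
fresh_b = starting_noise()
fixed_a = starting_noise(seed=42)  # same seed, same noise
fixed_b = starting_noise(seed=42)

print("fresh runs match:", fresh_a == fresh_b)
print("seeded runs match:", fixed_a == fixed_b)
```

Two unseeded calls will almost certainly disagree, while the two seeded calls always agree. That is the whole reason the same prompt can give a different picture each time.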
Why fingers and letters are a weak spot
The famous "six fingers" problem is a direct consequence of the model working not with objects but with statistical patterns in pixels. It knows a hand has "roughly this many fingers", but it doesn't count them the way a person does. Newer models handle this better and better, in large part because they're trained on higher-quality, better-labeled data. Still, understanding the cause is useful: it tells you what not to demand of the tool yet, and where the result needs checking.
The text that turns into a picture
Between your sentence and the noise there's a translator: the model first turns the words into numbers — an internal representation of meaning — and uses those to "steer" the picture as it emerges. That's why phrasing matters so much: for the machine, "a red car" and "a scarlet auto at sunset" are two different sets of numbers and two different paths. This is where the craft begins, and exactly why the separate chapter on prompts is the most important in the guide.
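A rough way to picture "words become numbers" is the toy below. It is emphatically not how a real text encoder works (those are learned neural networks); `toy_embed`, `cosine` and the vector size `DIM` are invented for this sketch, which only shows that different wording produces different numbers.

```python
import hashlib
import math

DIM = 16  # toy vector size, chosen arbitrarily; real encoders use far larger vectors

def toy_embed(prompt):
    """Map a prompt to a fixed-length vector by hashing each word into a slot.

    Illustration only: a real encoder captures meaning, this one merely
    counts which hashed slots the words of the prompt fall into.
    """
    vec = [0.0] * DIM
    for word in prompt.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

red = toy_embed("a red car")
scarlet = toy_embed("a scarlet auto at sunset")

print("same vector:", red == scarlet)
print("similarity:", round(cosine(red, scarlet), 3))
```

The two prompts end up as different vectors, so anything steered by those vectors will be steered differently. A real encoder would also place "red" and "scarlet" close together because it has learned their meaning, which this toy cannot do.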
This doesn't replace the artist — it's a new tool
It helps to stop arguing about whether "this is real art" and see generation for what it is: a tool that removes the technical barrier between the idea and the image. Between "I imagined it" and "I see it in front of me" there used to be years of drawing skill or a budget for a designer. Now — one precise sentence. The skill hasn't gone away; it has just shifted: from wielding a brush to phrasing clearly.
See for yourself
The best way to feel how a neural network develops a picture out of noise is to ask it to. Describe anything in one sentence and see what comes out.
Want to test all of this in practice? In the Twelver chat you can generate your first picture right in the conversation — free after signing up.
Everything in this guide runs inside Twelver
One chat for text, images, video, music and voice — no separate services or subscriptions.