AI Image Generation: The Complete Guide

Image models went from "haha, seven-fingered hand" to "this is a usable production asset" in about three years. The model names churn every few months (a new leader every quarter, a new open-weight champion every other quarter), but the concepts underneath barely move. This guide covers the concepts. Learn these and you can pick up any new model in an afternoon, because you'll know what questions to ask of it.

We'll go from how the models actually work, through how to prompt and edit them well, to how to choose one and ship it. The one section that dates — the current model rankings — is clearly marked as a snapshot you refresh; everything else is built to last.

Key takeaways
The four things you can ask an image model to do
How image models actually work
The two hard problems: "what" vs "where"
Layout and structural control
How to write image prompts
Editing images with AI
Why text rendering is hard — and why it got better
Resolution, aspect ratio, and upscaling
The model landscape (a dated snapshot)
How to choose a model for your job
How image models are evaluated
Cost, latency, and throughput
Licensing, provenance, and safety
Common failure modes and fixes
Where this is heading
FAQ

Key takeaways

Two model families dominate. Diffusion models start from noise and denoise toward an image; autoregressive models predict an image as a sequence of tokens, like an LLM. Most top models are one of these or a hybrid of the two. You rarely need to care which, except that it explains why some models are better at text and layout.
A prompt is a lossy spec. The model only knows what you wrote. The biggest quality lever is being specific about subject, style, composition, and lighting — and, when it matters, where things go.
"What" is easy; "where" is hard. Caption-only models are weak at spatial layout, counting, and binding the right attribute to the right object ("a red cube next to a blue sphere"). The fix is structural control — layout boxes, reference images, or ControlNet — not a cleverer sentence.
Editing is now first-class. Inpainting, outpainting, instruction edits ("make the jacket red"), and local region edits mean you iterate on an image instead of re-rolling the whole prompt. This is often more valuable than raw quality.
Text rendering finally works on the best models, because they learned to treat text as a region with a known string, not a texture to hallucinate. This is what made image models usable for posters, ads, and UI mockups.
Open weights are competitive, not winning. The top closed models lead on one-shot quality; the best open-weight models trail by a modest margin but win on control, cost, and fine-tunability. Pick on the axis you actually care about.
Rankings date in months; concepts don't. Treat any leaderboard as a snapshot. Decide what you're optimizing (beauty, control, text, cost, license) and choose accordingly.

The four things you can ask an image model to do

Almost every feature is a variant of four operations:

Text-to-image (t2i). A prompt in, a new image out. The headline use case.
Image-to-image (i2i). An input image plus a prompt; the model produces a new image guided by both. Style transfer, "make this photo a watercolor," variations on a layout.
Inpainting / outpainting. Regenerate part of an image (inpaint a masked region) or extend it beyond its borders (outpaint). The rest stays fixed.
Instruction editing. "Remove the person on the left," "change the sky to sunset," "make the text say SALE." The model edits an existing image from a natural-language instruction, ideally touching only what you asked.

Understanding which operation you need clarifies everything downstream — which model, which API parameters, which prompt style. "Generate a logo" is t2i; "fix the typo in this logo" is editing, and a model great at the first can be mediocre at the second.

How image models actually work

You don't need the math to use these well, but the mental model pays off constantly.

Diffusion: sculpting an image out of noise

A diffusion model is trained to remove noise. During training you take a real image, add a known amount of random noise, and teach the model to predict that noise so it can be subtracted. Do this across all noise levels and the model learns to walk from pure static back to a clean image.
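
The training idea above can be sketched in plain Python on a toy four-value "image". The mixing rule (a signal weight `alpha_bar` between 0 and 1, with the noise weighted by its complement) follows the common DDPM-style convention and is an assumption here, not something this guide specifies. `add_noise` and `remove_noise` are illustrative names.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Mix a clean signal with Gaussian noise.
    alpha_bar close to 1 keeps the signal; close to 0 is almost pure noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * x + b * n for x, n in zip(x0, noise)]
    return noisy, noise

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise given a noise estimate.
    In a real system the trained model supplies that estimate."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(y - b * n) / a for y, n in zip(noisy, predicted_noise)]

rng = random.Random(0)
clean = [0.2, -0.5, 0.9, 0.0]
noisy, true_noise = add_noise(clean, alpha_bar=0.5, rng=rng)

# With a perfect noise prediction the clean signal comes back (up to float error).
recovered = remove_noise(noisy, true_noise, alpha_bar=0.5)
```

During training, `true_noise` is the target the model learns to predict from `noisy`. The closer its prediction gets, the closer `recovered` is to the clean image.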

At generation time you start from pure noise and run the model for a number of steps (typically 20–50), each step removing a little estimated noise, nudged at every step toward your prompt. The image "develops" like a Polaroid. Key knobs:

Steps — more steps, more refinement, more time. Diminishing returns past ~30 for most models.
Guidance scale (CFG) — how hard to push toward the prompt vs. letting the model be free. Too low: ignores your prompt. Too high: oversaturated, fried-looking images. There's a sweet spot per model.
Seed — the initial noise. Same seed + same prompt + same settings = same image. This is your reproducibility handle.
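
To make the three knobs concrete, here is a toy sampling loop in plain Python. It is a sketch under heavy assumptions: `toy_noise_estimate` is a hypothetical stand-in for a trained denoiser, the "image" is four numbers, and the update rule is simplified. The guidance line, however, is the standard classifier-free guidance combination of a prompt-conditioned and an unconditional estimate.

```python
import random

def toy_noise_estimate(x, target):
    """Stand-in for a trained denoiser: treats everything that is not `target` as noise."""
    return [xi - ti for xi, ti in zip(x, target)]

def sample(seed, steps=30, guidance_scale=5.0, size=4):
    rng = random.Random(seed)                 # the seed fixes the initial noise
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    prompt_target = [1.0] * size              # what the "prompt" asks for (toy)
    generic_target = [0.0] * size             # what the model draws with no prompt (toy)
    for _ in range(steps):
        eps_cond = toy_noise_estimate(x, prompt_target)
        eps_uncond = toy_noise_estimate(x, generic_target)
        # Classifier-free guidance: start from the unconditional estimate and
        # extrapolate toward the prompt-conditioned one.
        eps = [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
        x = [xi - e / steps for xi, e in zip(x, eps)]   # remove a little noise per step
    return x

same_a = sample(seed=42)
same_b = sample(seed=42)                       # identical to same_a
gentle = sample(seed=42, guidance_scale=1.0)   # drifts toward the prompt target
pushed = sample(seed=42, guidance_scale=7.0)   # overshoots well past it
```

In this toy, a scale of 0 ignores the prompt entirely, 1 heads for the prompt target, and larger values overshoot it, which is the numeric analogue of the "fried" look. Real models differ in where the sweet spot sits.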

Most modern systems are latent diffusion: they don't denoise full-resolution pixels (expensive), they denoise a compressed latent representation, then a VAE decoder expands it to pixels. That's why these models can do megapixel images affordably. Newer variants use rectified flow / flow matching, a cleaner formulation of the same denoise-toward-data idea that needs fewer steps — but the user-facing mental model is identical.
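
The savings from working in latent space are easy to put numbers on. The sketch below assumes a VAE that downsamples 8x per side into 4 latent channels, a configuration used by Stable Diffusion-style models; other models use different factors, so treat the exact ratio as illustrative.

```python
def value_count(width, height, channels):
    """Number of scalar values the denoiser has to process."""
    return width * height * channels

pixel_values = value_count(1024, 1024, 3)             # full-resolution RGB
# Assumed VAE: 8x downsampling per side, 4 latent channels.
latent_values = value_count(1024 // 8, 1024 // 8, 4)
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 3145728 65536 48.0
```

Under these assumptions every denoising step touches 48 times fewer values than it would in pixel space, and the VAE decoder runs only once at the end.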

Autoregressive: an image as a sequence of tokens

The other family treats an image like text. An image is tokenized into a grid of discrete tokens (via a learned tokenizer), and a transformer predicts those tokens one after another, conditioned on your prompt — exactly how an LLM predicts words. Because it's the same next-token machinery LLMs are built on, this family tends to be strong at structure: spelling text correctly, honoring counts, placing things deliberately. Many of the best 2025–2026 models are autoregressive or hybrids that get the structural strengths of token prediction and the texture quality of diffusion.
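
The shape of that generation loop can be sketched as follows. This is a structural illustration only: `toy_next_token` is a hypothetical stand-in that picks tokens at random where a real transformer would score the whole vocabulary, and the raster (row by row) order and 16-pixel patch size are assumptions that vary between models.

```python
import random

def tokens_for_image(side_pixels, patch_pixels):
    """Sequence length for a square image split into square patches."""
    per_side = side_pixels // patch_pixels
    return per_side * per_side

def toy_next_token(prompt, previous_tokens, vocab_size, rng):
    """Stand-in for a transformer. A real model would condition on the prompt
    and on every token generated so far; this one only draws at random."""
    return rng.randrange(vocab_size)

def generate_token_grid(prompt, grid=4, vocab_size=1024, seed=0):
    rng = random.Random(seed)
    tokens = []
    for _ in range(grid * grid):              # one token per grid cell, in raster order
        tokens.append(toy_next_token(prompt, tokens, vocab_size, rng))
    # Reshape the flat sequence into rows. A learned decoder would then turn
    # each token into a patch of pixels.
    return [tokens[r * grid:(r + 1) * grid] for r in range(grid)]

token_grid = generate_token_grid("a red cube next to a blue sphere")
sequence_length = tokens_for_image(1024, 16)  # 64 x 64 = 4096 tokens
```

The point of the sketch is the control flow: every token is chosen with the full history available, which is the property that helps with spelling and counts.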

Text conditioning: how the model "reads" your prompt

Either family needs to turn your words into something it can steer with. A text encoder (a CLIP-style or T5-style model) converts your prompt into embeddings the image model attends to. This matters in practice:

Models with stronger/larger text encoders follow complex prompts and long instructions better.
The encoder's training is why attribute binding fails: a single pooled embedding of "a red cube and a blue sphere" carries the concepts but only weak signal about which color attaches to which shape.
It's also why prompt rewriting helps — some products quietly expand your short prompt into a richer one before generation, because the encoder responds well to detail.
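
The binding problem shows up clearly in the extreme case of an order-free pooled embedding. The sketch below uses a hypothetical hash-based `word_vector` in place of a real text encoder and mean-pools the word vectors. Real encoders keep positional information and per-token outputs, so they retain some binding signal; this toy only shows what is lost when everything is squeezed into one pooled vector.

```python
import hashlib

def word_vector(word, dim=8):
    """Deterministic toy embedding for a word (hypothetical, not a real encoder)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def pooled_embedding(prompt, dim=8):
    """Mean-pool the word vectors: every word contributes, but order is discarded."""
    vectors = [word_vector(w, dim) for w in prompt.lower().split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

a = pooled_embedding("a red cube and a blue sphere")
b = pooled_embedding("a blue cube and a red sphere")

# Both prompts contain the same words, so the pooled vectors coincide.
indistinguishable = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```

Here `indistinguishable` is true: the two prompts describe different images but produce the same steering signal, so nothing tells the image model which color belongs to which shape.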

The two hard problems: "what" vs "where"

Image models are excellent at what (a corgi, a cyberpunk street, watercolor style) and historically bad at where and how many. The chronic failures all live in the second bucket:

Attribute binding. "A red cube to the left of a blue sphere" → you get a blue cube and a red sphere, or both purple. The model has the concepts but binds them loosely.
Counting. "Exactly five coffee cups" → you get four, or six. Counts have to emerge from a caption rather than being specified as structure.
Spatial relations. "Left of," "behind," "in the top corner" are honored as statistical tendencies, not constraints — right maybe two-thirds of the time on caption-only models.
Text rendering. "The word SALE" comes out "SAEL" when the model paints letter-shaped textures instead of typesetting a known string.

The important insight: none of these are quality problems you fix with more steps or more parameters. They're specification problems. A caption is a low-bandwidth, order-free description, and you're asking the model to reconstruct a precise 2D arrangement from it. The fix is to give the model more structure — which is the next section.

Layout and structural control

"Structural control" is the umbrella for every technique that constrains where things go, not just what appears. From least to most explicit:

Regional / layout conditioning. Instead of one caption for the whole image, you provide regions — a bounding box plus a description of what goes in each. "This box: a red cube. This box: the word SALE in yellow." The best 2026 models were trained this way (bounding boxes tied to region descriptions), so honoring layout is native behavior, not a hack. This is what fixes binding, counting, and text in one move: each attribute is co-located with its region, each count is a number of boxes, each text string lives in its own box. It also makes images editable — every element has an address you can move or rewrite.
Reference images / IP-adapter. You supply an image as a style or identity reference. "Generate new scenes with this character" or "match this brand palette." The model conditions on the reference embedding alongside the prompt.
ControlNet and structural maps. You supply a control signal — an edge map, depth map, human pose skeleton, or segmentation mask — and the model generates an image that conforms to that geometry while you describe the content. This is the workhorse for "I need this exact composition but a different look."
Inpainting masks. The most direct spatial control: you literally paint the region to regenerate.
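
A regional layout is ultimately just a small data structure. The sketch below shows one possible shape for it; the `Region` class, its field names, and the normalized box convention are assumptions for illustration, not the input format of any particular model.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Region:
    description: str   # what goes in the box
    box: tuple         # (x0, y0, x1, y1) as fractions of image width and height
    text: str = ""     # literal string to render, if the region is text

def validate(regions):
    """Reject boxes that are empty or fall outside the image."""
    for r in regions:
        x0, y0, x1, y1 = r.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box outside the unit square: {r.box}")
    return regions

def move(region, dx, dy):
    """Edit by address: shift one element and leave the others untouched."""
    x0, y0, x1, y1 = region.box
    return replace(region, box=(x0 + dx, y0 + dy, x1 + dx, y1 + dy))

layout = validate([
    Region("a red cube", (0.10, 0.40, 0.40, 0.80)),
    Region("a blue sphere", (0.60, 0.40, 0.90, 0.80)),
    Region("headline in yellow", (0.10, 0.05, 0.90, 0.25), text="SALE"),
])

moved_cube = move(layout[0], dx=0.05, dy=0.0)
```

Each attribute sits next to its object, the count of cubes is the number of cube boxes, and the text is a literal string rather than something inferred from a caption.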

The throughline: when one-shot prompting won't give you the arrangement you need, don't fight it with adjectives — add structure. Which technique depends on what you have: a layout in your head (regions), a reference look (IP-adapter), an exact geometry (ControlNet), or a specific area to fix (inpaint).

How to write image prompts

Good image prompts are specific and structured. The habits below survive model upgrades because they're about giving the model information it can't infer, not about magic words.

Cover the dimensions that matter. A strong prompt usually specifies several of:

Subject — what is in the image, concretely. "A border collie" beats "a dog."
Style / medium — photo, oil painting, 3D render, line art, specific aesthetic.
Composition / framing — close-up, wide shot, rule-of-thirds, centered, flat lay.
Lighting — golden hour, soft studio light, dramatic rim lighting, overcast.
Camera / lens (for photoreal) — "85mm portrait, shallow depth of field."
Color / mood — palette, warm/cool, high-key vs moody.

Then the habits that move the dial:

Separate "what" from "where." Don't write one run-on sentence. Name the regions: foreground subject, midground, background, and where each sits. Even on a model that only takes text, this gives the layout machinery cleaner structure to work with.
Spell out text literally and place it. Headline "SUMMER SALE" across the top; subtext "up to 50% off" centered below. Quote the exact string, give it a position. The single biggest win for any design work.
State counts as structure. "Three product shots in a row, evenly spaced" beats "some products." A count is a layout instruction — phrase it like one.
Use negative prompts when supported. Many models accept a "do not include" field — "no text, no watermark, no extra fingers." Cheap insurance against known failure modes.
Don't over-incant. "Masterpiece, 8k, ultra-detailed, trending on artstation" was marginal even in 2023 and mostly noise now. Spend your words on subject, composition, and light instead.
Iterate and edit, don't re-roll. When something's 90% right, fix the one wrong region (see editing, below) rather than regenerating from scratch and losing the parts that worked.
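
One way to make these habits routine is to assemble prompts from named fields rather than typing a free-form sentence. The helper below is a minimal sketch; `build_request` and the `prompt` / `negative_prompt` keys are hypothetical names, so map them onto whatever fields your model's API actually accepts.

```python
def build_request(subject, style, composition, lighting, texts=(), negative=()):
    """Assemble a prompt from the dimensions that matter, with literal text placed explicitly."""
    parts = [subject, style, composition, lighting]
    for string, placement in texts:
        # Quote the exact string and say where it goes.
        parts.append(f'the text "{string}" {placement}')
    prompt = ". ".join(p.strip().rstrip(".") for p in parts if p.strip()) + "."
    return {"prompt": prompt, "negative_prompt": ", ".join(negative)}

request = build_request(
    subject="Three product shots of a ceramic mug in a row, evenly spaced",
    style="clean studio product photo",
    composition="flat lay, centered, generous margin at the top",
    lighting="soft studio light",
    texts=[
        ("SUMMER SALE", "as a headline across the top"),
        ("up to 50% off", "centered below the headline"),
    ],
    negative=["watermark", "extra fingers"],
)
```

An empty field is easy to spot in a structure like this, which is most of the value: the model cannot infer what was never written down.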

If you've read our general prompting guide, this is the same principle — show structure, don't describe vibes — applied to pixels.

Editing images with AI

Generation gets the headlines; editing is where real work happens, because production assets are never right on the first try. The modern toolkit:

Inpainting. Mask a region, describe what should be there, regenerate only that region. Remove an object, fix a hand, swap a product.
Outpainting. Extend an image beyond its frame — turn a square into a banner, reveal more scene. The model invents consistent surroundings.
Instruction editing. "Make the jacket leather," "change the season to winter." No mask — the model parses the instruction and applies a local change while preserving the rest. The best models keep character and scene consistency so the unedited parts come back unchanged.
Region/layout editing. On layout-native models, every element has an address: drag its box to move it, rewrite its text string, swap its description, and only that element regenerates.
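
The guarantee that makes local edits repeatable is a simple compositing rule: outside the mask, the original pixels are kept. The sketch below shows it on a 3x3 grayscale toy. Real pipelines also condition the generation on the surrounding pixels and usually soften the mask edge; neither is modeled here.

```python
def composite(original, generated, mask):
    """Keep original pixels where the mask is 0; take generated pixels where it is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]

original = [[1, 1, 1],
            [1, 1, 1],
            [1, 1, 1]]
generated = [[9, 9, 9],
             [9, 9, 9],
             [9, 9, 9]]
mask = [[0, 0, 0],
        [0, 1, 0],   # only the center pixel is regenerated
        [0, 0, 0]]

edited = composite(original, generated, mask)
# edited == [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
```

However different the regenerated content is, everything outside the mask is unchanged, which is what a full re-roll cannot promise.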

Why this matters more than a few quality points: caption-only editing means re-rolling the whole prompt and praying the unchanged parts return (they rarely do). Addressable, local, repeatable edits change what you can build: iterative design tools, "change the headline daily" ad pipelines, consistent character series. When you evaluate a model, test its editing, not just its first-shot beauty.

Why text rendering is hard — and why it got better

For years, legible text was the tell of AI images. The reason is structural: a diffusion model trained on captions renders "the vibe of letters," so it produces plausible letter-shapes that spell nonsense. Getting a five-letter word right is asking a texture generator to also be a typesetter.

Two things fixed it:

Autoregressive / token-based generation, which is naturally good at sequences — and a word is a sequence.
Layout-aware training, where text is a region whose description is a literal string. The model isn't guessing letter shapes from a mood; it's placing a known string into a known box.

This is why the best current models are suddenly good enough for posters, packaging, ads, slides, and app mockups — the commercial work where one misspelled word ruins the asset. If text matters to your use case, it should be your primary evaluation criterion, and you should test it across many generations (one good sample can hide a bad hit rate).
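
Measuring text reliability across many generations can be as simple as an exact-match rate. In the sketch below, `ocr_results` stands for strings read back from generated images by whatever OCR step you use; the sample values are made up for illustration, and ignoring case and extra whitespace is a choice you may not want for brand text.

```python
def text_hit_rate(target, ocr_results):
    """Fraction of generations whose rendered text matches the target exactly."""
    def normalize(s):
        return " ".join(s.split()).upper()
    hits = sum(1 for r in ocr_results if normalize(r) == normalize(target))
    return hits / len(ocr_results)

# Hypothetical read-backs from eight generations of the same prompt.
ocr_results = ["SALE", "SAEL", "SALE", "SALE", "SAIE", "SALE", "SALE", "5ALE"]
rate = text_hit_rate("SALE", ocr_results)  # 5 of 8 -> 0.625
```

Any single one of the five good samples would look like a model that "does text"; the rate is what tells you how many generations you will throw away in production.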

Resolution, aspect ratio, and upscaling

Native resolution. Each model has resolutions it generates best at (commonly around 1–2 megapixels; the strongest now do native 4K). Pushing far beyond native causes repetition artifacts ("two heads," tiled patterns).
Aspect ratio. Specify it explicitly (1:1, 16:9, 9:16, 4:5 for social). Models behave differently per ratio; portrait vs landscape can change composition quality.
Upscaling. To go bigger than native, generate at native then upscale with a dedicated model (it adds plausible detail, not just pixels). This is usually better than asking the base model for a huge image directly.
Tiling / high-res fix. Some pipelines generate a base image, then regenerate it in overlapping tiles at higher resolution to add detail. Great for print; slower.

Rule of thumb: generate at the model's comfort zone, then upscale. Don't ask for 8K up front.
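
Picking a width and height for a given aspect ratio at a fixed pixel budget is a small calculation. The helper below is a sketch: the 1-megapixel default reflects the "around 1–2 megapixels" comfort zone mentioned above, and snapping to multiples of 64 is an assumption that suits many latent models but is not universal.

```python
import math

def dimensions(aspect_w, aspect_h, megapixels=1.0, multiple=64):
    """Width and height near a pixel budget, snapped to a size multiple."""
    target_pixels = megapixels * 1_000_000
    height = math.sqrt(target_pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def snap(value):
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)

square = dimensions(1, 1)      # (1024, 1024)
landscape = dimensions(16, 9)  # (1344, 768)
portrait = dimensions(9, 16)   # (768, 1344)
```

Snapping shifts the ratio slightly (1344 by 768 is 1.75 rather than 1.78), which is normally invisible. Generate at these sizes and hand the result to an upscaler for anything larger.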

The model landscape (a dated snapshot)

This is the part that dates. Treat it as a snapshot as of June 2026 and refresh it against a live leaderboard before relying on specifics. The categories and trade-offs below outlast any single ranking.

The field splits into closed/proprietary (best one-shot quality, API-only) and open-weight (self-hostable, fine-tunable, competitive but trailing slightly on raw quality).

A note on the consumer market — quality and adoption are diverging. Per a16z's Top 100 Gen AI Consumer Apps (March 2026), standalone image apps are losing ground to bundling: Midjourney slipped from a top-10 consumer product to #46 as ChatGPT and Gemini folded strong image generation directly into their general chat apps. The takeaway for builders: for most people, "good-enough image generation inside the app they already use" beats a separate best-in-class tool. If you ship images, assume you're competing with a free in-chat option, not just with other image models — which raises the bar for why a dedicated tool should exist (control, editing, fine-tuning, licensing — the axes below).

Representative text-to-image arena standing, mid-2026:

Tier	Examples	Notes
Frontier closed	GPT Image 2 (~1385 Elo), Gemini 3.x Image, Grok Imagine	Best one-shot quality and instruction following
Strong closed	Reve 2.0 (~1273, 4K + layout), MAI-Image-2.5, Seedream, Recraft	Specialists — 4K, layout, design/text
Best open-weight	Ideogram 4.0 (~1204, #1 open), FLUX.2 family, Qwen-Image, Hunyuan Image	Self-host + fine-tune; great for text and design
Legacy / lightweight	SDXL, SD 3.5, DALL·E 3	Older, cheaper, huge ecosystem of tools

The durable reads, independent of exact numbers:

Closed leaders win one-shot beauty. If "make one stunning image" is the job, that's where to look.
Open-weight is the call for control, cost, and customization. When you need to fine-tune on your style, run at volume, or keep data in-house, an open model is the obvious base — and the best open models are excellent at text and layout.
Specialists beat generalists for specific jobs. Design-and-text work, 4K, or precise layout often favors a specialist over the top generalist.

How to choose a model for your job

Decide what you're optimizing before you look at a leaderboard:

"Make it pretty" (hero art, illustration, mood): a frontier closed model. One-shot quality is the metric; pay for the API.
Design with text (posters, ads, packaging, UI): a layout/text specialist or top open model. Reliable, correctly-placed, legible text beats a few quality points.
Editing-heavy workflow (users iterate on assets): prioritize editing/inpainting quality and consistency, not first-shot scores. This is a different capability, not a marginal upgrade.
Volume / cost-sensitive / private data: open weights you host. Budget the GPUs, gain control and unit economics.
Need exact composition or your own style: open weights + ControlNet / fine-tuning. The control stack only fully exists on open models.

Don't pick on a single leaderboard column. Arena Elo answers "which one image looks better," and that's only one of these jobs.

How image models are evaluated

Human preference arenas (Elo). Show two images for the same prompt, ask people which is better, and compute an Elo rating from the votes. This is the most trusted signal, but it measures one-shot aesthetic preference, not editability, text reliability across many tries, or adherence under tight constraints. A model can rank mid-pack and still be the best choice for your job.
Automated metrics. FID (how close generated images are to a real distribution — lower is better), CLIPScore (how well the image matches the prompt). Useful for tracking your own pipeline; weakly correlated with human taste, so don't over-trust them.
Task-specific evals. Text-rendering accuracy, counting accuracy, prompt-adherence rubrics. If you have a specific need, build a small eval for it — 20 prompts you care about beat any public leaderboard.
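
It helps to know what an arena rating gap actually means. The sketch below uses the classic Elo formulas with a 400-point scale; the K-factor of 32 is an assumption, and real arenas may fit ratings with a related statistical model instead of updating vote by vote. The two ratings fed in are the ones quoted in this guide's snapshot table.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, k=32.0):
    """Shift both ratings after one head-to-head vote."""
    expected = expected_score(rating_a, rating_b)
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected)
    return rating_a + delta, rating_b - delta

# Ratings taken from the snapshot table in this guide.
win_probability = expected_score(1385, 1273)

# An upset moves ratings more than an expected result does.
after_upset = update(1385, 1273, a_won=False)
```

A gap of about 112 points corresponds to the higher-rated model winning roughly two out of three head-to-head votes. That leaves plenty of prompts where the lower-rated model is preferred, which is why a bake-off on your own prompts can overturn the leaderboard order.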

The meta-lesson: the public number measures a narrower question than "which should I use." Run your own 20-prompt bake-off on your actual use case.

Cost, latency, and throughput

Closed APIs charge per image, typically a few cents up to ~$0.20+ for high-res/high-quality tiers. Simple, no infra, and instant scaling, but costs add up at volume.
Self-hosting open weights trades per-image cost for GPU cost. Economical at volume, and the only path if data must stay in-house — but you own the ops.
Latency scales with steps × resolution. Fewer-step (distilled / flow-matching) models and lower resolutions are faster; reserve max steps and 4K for finals.
Throughput on your own hardware comes from batching and the same serving tricks as LLMs. For interactive UX, a fast "draft" model plus an on-demand "quality" pass is a common pattern.
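
The API-versus-self-hosting decision comes down to a break-even calculation. The sketch below shows the arithmetic; every number in it (GPU price, seconds per image, utilization, API price) is a hypothetical placeholder to replace with your own quotes and measurements.

```python
def self_host_cost_per_image(gpu_cost_per_hour, seconds_per_image, utilization=1.0):
    """Effective cost per image on your own GPU, allowing for idle time."""
    images_per_hour = 3600.0 / seconds_per_image * utilization
    return gpu_cost_per_hour / images_per_hour

def break_even_images_per_hour(gpu_cost_per_hour, api_price_per_image):
    """Sustained volume at which one GPU costs the same as paying per image."""
    return gpu_cost_per_hour / api_price_per_image

# Hypothetical inputs: $2.00 per GPU-hour, 6 s per image, GPU busy half the time,
# and an API that charges $0.04 per image.
hosted = self_host_cost_per_image(2.00, seconds_per_image=6.0, utilization=0.5)
threshold = break_even_images_per_hour(2.00, api_price_per_image=0.04)
```

With these placeholder numbers, self-hosting works out to well under a cent per image, but only above roughly 50 images per hour of sustained demand. Below that, the idle GPU costs more than the API would, before counting the ops work.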

Licensing, provenance, and safety

Output licensing. Check each model's terms — commercial use, ownership, and training-data provenance vary. "Open weights" governs the model, not necessarily unrestricted commercial use of outputs. Read the license.
Provenance and watermarking. Expect generated images to carry C2PA content credentials and/or invisible watermarks (e.g. SynthID-style). Increasingly required for platforms and some jurisdictions. If you publish at scale, plan for it.
Safety filters. Hosted models refuse certain content (real people, explicit material, violence, IP). Open models you run yourself shift that responsibility — and liability — to you.
Deepfakes and likeness. Generating real people's likenesses raises legal and ethical issues that differ by jurisdiction and are tightening. Don't build on shaky ground.

Common failure modes and fixes

Symptom	Likely cause	Fix
Wrong color on wrong object	Weak attribute binding	Use regional/layout prompting; co-locate attribute with object
Wrong number of objects	Counting from a caption	State count as structure; use layout boxes
Garbled text	Caption-only text rendering	Use a text/layout-strong model; quote the exact string and place it
Oversaturated / "fried" look	Guidance scale too high	Lower CFG
Ignores the prompt	Guidance too low, or weak text encoder	Raise CFG; add detail; try a stronger model
Duplicated subjects / "two heads"	Resolution beyond native	Generate at native, then upscale
Edit changes the whole image	Re-rolling instead of local edit	Inpaint the region, or use an instruction-edit model
Inconsistent character across images	No identity conditioning	Use a reference image / IP-adapter or character-consistency feature
Extra fingers / mangled hands	Classic anatomy weakness	Newer model; inpaint the hands; negative prompt

Where this is heading

Three durable trajectories, independent of which lab leads this quarter:

Image generation is following text's path from "one blob in, one blob out" to a structured intermediate representation (layouts) you can inspect and edit. Control and editability, not just fidelity, are the frontier.
The closed/open gap on quality is narrowing, while open weights keep their structural advantages (fine-tuning, ControlNet, on-prem). Expect the "best for my job" answer to land on open models more often.
Generation, editing, and understanding are merging into unified multimodal models that see and draw in the same system — so the same model that reads an image can edit it from conversation. The four operations in this guide collapse into one chat.

Learn the concepts here and the next model release is just new numbers in the snapshot — not a new thing to learn.

FAQ

Q: What's the difference between diffusion and autoregressive image models? Diffusion models start from random noise and denoise toward an image over many steps; autoregressive models predict the image as a sequence of tokens, like an LLM predicts words. Diffusion has historically been strongest on texture and photorealism; autoregressive (and hybrid) models tend to be better at structure — spelling text, honoring counts, placing things deliberately. Most top models are one of these or a hybrid, and as a user you mostly notice the difference in text and layout quality.

Q: Why does AI get text in images wrong, and which models fix it? Caption-trained diffusion models render letter-shapes without typesetting a known string, so words come out misspelled. The fix is models that treat text as a region containing a literal string — typically autoregressive or layout-trained models. As of mid-2026 the strongest text renderers include the top closed models and the best open-weight design models; test text across many generations because one good sample can hide a poor hit rate.

Q: Are open-weight image models good enough for real work? Yes. The best open models trail the top closed models by a modest margin on one-shot quality but match or beat them on control, text, fine-tunability, and cost. If you need to self-host, customize on your own style, run at volume, or keep data private, open weights are the right base. If you only need the single most beautiful one-shot image, a frontier closed model still has the edge.

Q: How do I get a specific layout instead of whatever the model decides? Add structure rather than more adjectives. Use regional/layout prompting (a description per bounding box), a ControlNet structural map (edges, depth, pose) for an exact composition, a reference image for style or identity, or inpainting to fix a specific area. Caption-only models are weak at spatial control by nature; structural control is the fix.

Q: What's the best AI image model right now? It depends on the job, and the ranking changes every few months — so treat any specific name as a snapshot. For one-shot beauty, a frontier closed model leads. For design with text, a layout/text specialist or top open model. For heavy editing, prioritize inpainting and consistency over leaderboard scores. The durable advice: define what you're optimizing (beauty, control, text, cost, license), then run a 20-prompt bake-off on your own use case.

Q: Why do my images look oversaturated or "fried"? Almost always the guidance scale (CFG) is too high — you're pushing the model too hard toward the prompt. Lower it. The opposite problem, an image that ignores your prompt, usually means CFG is too low or the prompt lacks detail.

Q: Can I edit one part of an image without regenerating the whole thing? Yes — that's inpainting (mask a region and regenerate just it) or instruction editing ("make the jacket red") on models that preserve the rest. On layout-native models you can move or rewrite individual elements directly. This is far better than re-rolling the whole prompt, which rarely brings back the parts you liked.

Q: Do I own the images an AI model generates? It varies by model and jurisdiction — read the specific license. "Open weights" refers to the model, not a blanket grant to use outputs commercially. Also expect generated images to carry provenance metadata (C2PA) or invisible watermarks, and note that generating real people's likenesses carries legal and ethical risk that's tightening over time.
