How AI Chatbots Actually Work — Explained Without the Math
A plain-English guide to what's actually happening when you chat with ChatGPT, Claude, Gemini, or Copilot. What's a token, how does it 'know' things, why does it make stuff up, why does it cut off, and what it can and can't do — no math, no buzzwords.
You type a question into ChatGPT. A few seconds later, words appear. Sometimes the answer is genuinely useful, sometimes it's confidently wrong, and sometimes it just stops in the middle of a sentence. Most explanations of how this works either start with linear algebra or with marketing slides. This guide does neither. It explains what's actually going on, in language your sister-in-law would understand at Thanksgiving.
The short version. A chatbot is a very fancy auto-complete. It learned, by reading most of the public internet, which word tends to come after which other words. When you ask it something, it predicts the most plausible next word over and over, one word at a time, until it decides it's done. That's it. It is not thinking, not remembering you from last week, and not looking things up while it talks (unless you give it a tool that does). Everything that feels intelligent about it comes out of that one trick, scaled up to a size that's hard to picture.
The rest of this guide unpacks what that means in practice — why it gets things right, why it gets things wrong, why it cuts off mid-sentence, what "training" actually was, what it does and doesn't know, and the handful of habits that make the difference between a useful tool and a frustrating one. If you want the head-to-head version — ChatGPT vs Claude vs Gemini vs Copilot — see which AI should I use. If you want to know why these things make stuff up specifically, see AI hallucinations. If you want to know how your messages are handled, see AI chatbot privacy.
Table of contents
- Key takeaways
- Mental model: chatbots in one minute
- What a chatbot actually is
- Tokens: the way AI sees words
- How does it "know" things?
- Why does it make stuff up?
- Why does it cut off mid-sentence?
- Does it remember me?
- What it can do well
- What it doesn't do well
- How to get better answers out of it
- The full conversation lifecycle: keystroke to answer
- Tokenization in plain English: BPE
- Embeddings: meaning as coordinates
- Inside one transformer block
- The three stages of training: pretraining, SFT, RLHF
- Tool use: how chatbots ‘do' things
- System prompts: the hidden instructions
- Temperature and top-p: the randomness knobs
- Reasoning models: thinking out loud
- Agents: chatbots that take actions
- Multimodal: vision, audio, voice mode
- Custom GPTs, Projects, and personalisation
- Fine-tuning vs RAG: two ways to specialise
- Why responses vary, why refusals happen, why apologies pile up
- The four products in 2026, by architectural choice
- What's coming in 2026–2027
- Why coding works so well for chatbots
- Why long outputs degrade
- Why context windows matter, and what 200K to 2M tokens means
- Why chatbots apologise too much (and other RLHF artefacts)
- Voice mode: speech-to-speech architectures
- The questions every user should ask their chatbot vendor
- Costs, latencies, and where they come from
- Side-by-side concept reference
- The bottom line
- FAQ
- Extended FAQ
- Glossary
Key takeaways
- A chatbot is auto-complete, scaled up. It predicts the next word, then the next, until it decides to stop.
- It does not search the internet while it's talking unless you turn on a search tool. The "knowledge" comes from what it read during training, months or a year ago.
- It does not remember you between conversations by default. Most products now have "memory" features, but they store only short summaries, not the whole conversation.
- It makes things up because making things up looks the same as being right, from its perspective. It can't tell the difference.
- It cuts off because it has a response-length limit, and longer answers cost more to produce.
- It can be very good at: explaining, summarising, brainstorming, rewriting, translating, simple coding.
- It is bad at: precise facts, recent events, math past basic algebra, anything requiring real-world verification.
- The single biggest skill is showing it examples of what you want, instead of describing what you want.
Mental model: chatbots in one minute
Name the problem first: a chatbot is the next-token machine. It predicts the next word, that's it. Every behaviour you've noticed — the fluent answers, the confident wrong ones, the abrupt cutoffs, the apparent memory loss — is a downstream consequence of that single mechanism, scaled to hundreds of billions of parameters.
The cleanest analogy is autocomplete on steroids. Your phone suggests one word at a time based on what usually follows. A chatbot does the same thing, except it has read most of the public internet, can keep predicting for thousands of words, and has been polished with extra training so the predictions feel like a helpful assistant rather than a generic continuation. There is no fact lookup, no reasoning engine, no librarian. There is one mathematical operation — "given everything so far, what's the most plausible next token?" — repeated until a stop signal fires.
Side-by-side with mental models people often have:
| What people think it is | What it actually is |
|---|---|
| A search engine | A next-token predictor with no live lookup unless a tool is attached |
| A database of facts | Statistical patterns from training text |
| A reasoning system | Pattern-matching that resembles reasoning at scale |
| A persistent assistant | Stateless model + a notes file the product attaches |
| A truth-checker | Has no internal "I know vs I'm guessing" flag |
The production one-liner that everything reduces to:
while not done:
next_token = model.predict(context)
context += next_token
Sticky number to remember: in 2026, the top flagships — GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro — score within roughly ±3% of each other on most public benchmarks. The choice between them is less about raw smarts and more about personality, integrations, and price.
What a chatbot actually is
A modern AI chatbot is a very large mathematical model that takes some text as input and produces text as output. That's it. There's no little person inside it. There's no database it's looking things up in. It is a calculator for words.
Imagine a phone's predictive keyboard. You type "I'll see you", and it suggests "tomorrow," "soon," "later." It learned those suggestions by reading millions of text messages. It picked the next word based on which one tends to follow the words you typed.
A chatbot is the same idea, except:
- It learned from roughly all of the public text on the internet, plus a lot of books.
- It can keep up the prediction for thousands of words, not just one.
- It was given extra training to be helpful, to refuse harmful things, and to follow instructions.
That last bit — the extra training — is what turns "very fancy auto-complete" into "thing you can have a conversation with." The base model would just keep typing whatever sounded plausible. The trained chatbot version has been shaped to respond to your prompts in a useful, on-topic way.
You can think of the whole stack as three layers:
- The brain. A huge mathematical model that has read most of the internet and learned what tends to follow what.
- The training to be useful. Humans (and other AI) showed it thousands of examples of "good answer" vs "bad answer" until it learned to behave like a helpful assistant.
- The product. The website or app you use, which puts a chat interface around the model, manages memory, optionally connects it to search or other tools.
When ChatGPT or Claude or Gemini gets better between versions, usually all three are getting better at once.
How big is the brain, exactly?
The big chatbots in 2026 are trained on roughly 10–30 trillion words of text. For comparison, a person reading 200 words per minute for eight hours a day, every day, would take about 350 years to read 10 trillion words. The model sees more text in training than any single human will see in a lifetime, by several orders of magnitude.
What it stores from all that reading is harder to picture. The model is a network of billions of numbers — Claude Opus 4.x, GPT-5, and Gemini 2.5 are all in the hundreds-of-billions-of-parameters range, with the largest research models pushing past a trillion. Those numbers, taken together, encode statistical regularities about which words follow which other words in which contexts. There is no human-readable database inside. If you tried to open the model file with a text editor, you'd see ten gigabytes of seemingly random numbers.
The three flavors of model under the hood
Almost every chatbot in 2026 actually uses several different models, and the product picks which one based on the question:
- A fast model (GPT-4o mini, Claude Haiku 4.5, Gemini Flash) handles simple questions in a fraction of a second. Cheap to run.
- A flagship model (GPT-5, Claude Sonnet 4.6 / Opus, Gemini 2.5 Pro) handles harder questions. Slower, more expensive.
- A reasoning model (o3, o4, Claude with extended thinking, Gemini Deep Think) handles hard problems that require step-by-step thinking — math, complex code, multi-step research. Much slower (seconds to minutes), much more expensive.
This is why the same chatbot can feel snappy on a casual question and take 30 seconds on a hard one. The product is choosing which model to use under the hood. Most products let you force a specific model if you want.
Tokens: the way AI sees words
When you read this sentence, you see words. When the chatbot reads it, it sees something different: tokens. A token is a chunk of text — usually a word, sometimes a piece of a word, sometimes a single character.
The word "elephant" is one token. The word "anti-establishmentarianism" might be three or four tokens. A space is usually part of the token that follows it. Numbers and punctuation are their own tokens.
Why does this matter to you? Three reasons.
1. The price of using AI is measured in tokens. When you pay for the API or use a Pro plan, what's actually being counted is tokens in (your prompt) and tokens out (the answer). Roughly: one English word ≈ 1.3 tokens. So 1,000 tokens ≈ 750 words.
2. The "context window" is measured in tokens. When you hear "Claude has a 200,000-token context window," that means it can hold about 150,000 words of conversation and documents at once. Once you exceed that, older messages start falling out the back. Modern chatbots are generous — most products handle a few books' worth of text — but very long documents can still overflow.
3. The model is bad at things tokens are bad at. Tokens are usually whole words. So when you ask "how many r's are in strawberry," the model is looking at one token for "straw" and one for "berry" and one for the letter "r." It can't easily count letters because it doesn't see letters; it sees chunks. This is why "count the letters in this word" is famously a thing chatbots get wrong even though they can write you a sonnet.
The token concept also explains a frustration: ask it to "respond in exactly 100 words" and you often get 87 or 112. It doesn't count words; it predicts tokens until it feels done.
Real token prices in 2026
Most consumers never see token prices directly — you pay $20/month and get a usage allowance. But the numbers underneath shape every other decision the product makes. As of mid-2026:
| Model | Input ($ per 1M tokens) | Output ($ per 1M tokens) |
|---|---|---|
| GPT-5 | $5 | $15 |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
| o3 reasoning | $20 | $80 |
| Claude Opus 4.x | $15 | $75 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $0.80 | $4 |
| Gemini 2.5 Pro | $2 | $10 |
| Gemini 2.5 Flash | $0.10 | $0.40 |
A typical chatbot reply (300 input tokens, 500 output tokens) on Sonnet 4.6 costs about $0.008 — less than a cent. That's why the free tiers can give you so many messages a day before they start rate-limiting; the underlying compute is genuinely cheap for short answers. It gets expensive on long documents, long-running conversations, and reasoning-heavy queries.
Context windows: what each chatbot can hold
The context window is the maximum amount of text — your messages, the chatbot's replies, plus any documents you've attached — the model can consider at once. In 2026:
| Chatbot | Context window | Roughly how many pages of text |
|---|---|---|
| ChatGPT (Plus) | 128k tokens | ~400 pages |
| ChatGPT (Pro / Enterprise) | 1M tokens | ~3000 pages |
| Claude Pro | 200k tokens | ~600 pages |
| Claude Enterprise | 500k tokens (some plans 1M) | ~1500–3000 pages |
| Gemini (free) | 1M tokens | ~3000 pages |
| Gemini Advanced | 2M tokens | ~6000 pages |
| Copilot in M365 | varies by app | typically generous within a doc |
The advertised window isn't the same as the useful window. Models reliably handle the beginning and end of long inputs; they lose track in the middle of very long contexts. This is sometimes called "lost in the middle" and it's why people report worse answers when they paste in 80 pages versus 5.
How does it "know" things?
It doesn't, really. Not in the way you do.
What happened is: someone took a huge mathematical model and fed it billions of pages of text — Wikipedia, news articles, books, forum posts, scientific papers, code, the works. For each tiny chunk of that text, the model practiced predicting the next word. Over and over and over, for months on thousands of computers.
After enough practice, something interesting happens. To predict the next word well in a sentence about, say, the Roman Empire, the model had to absorb a lot about the Roman Empire. To predict the next line of code, it had to absorb how code works. To predict the next reply in a Q&A forum, it had to absorb common factual patterns.
So when you ask "who built the Pyramids of Giza?", the model has read so many texts about ancient Egypt that "the Egyptians" or "the ancient Egyptians under Khufu, Khafre, and Menkaure" is overwhelmingly the most plausible next-word pattern. It's not looking up a fact. It's predicting what comes next based on what it's seen.
This works astonishingly well for things that come up often in writing. It works less well for:
- Specific, niche, or recent things. The model only knows what was in its training data. Anything that happened after its "knowledge cutoff" — usually a few months to a year before the version was released — it just doesn't know about.
- Precise facts where being slightly wrong is wrong. Phone numbers, addresses, exact dates, exact prices, citations. The model can produce something that looks like a phone number, but it's making it up.
- Anything that requires looking at the actual current state of the world. What time is it? What's the weather? What's in the news today? Without a tool plugged in, it has no idea.
Some products solve the last one by connecting the model to search. ChatGPT with search, Claude with the web search tool, Perplexity, Google's Gemini in Google products — these can actually look things up while they answer. When they do, the answer is grounded in real, current sources. Without search turned on, you're getting auto-complete from training data.
Knowledge cutoffs in 2026
Each model has a "knowledge cutoff" — the date the training data was frozen. The model knows things up to that date and almost nothing after. As of mid-2026:
| Model | Approximate knowledge cutoff |
|---|---|
| GPT-5 | October 2024 |
| GPT-4o | April 2024 |
| Claude Opus 4.x / Sonnet 4.6 | January 2026 |
| Gemini 2.5 | December 2024 |
| Gemini 3 (where available) | Early 2026 |
| Copilot (uses OpenAI models) | inherits OpenAI cutoff |
Even after the cutoff, the model often thinks it knows things from later dates — it picked up press releases, blog posts, and Wikipedia edits about future events that turned out to be inaccurate. Always check current information against a source with web search if it matters.
Why does it make stuff up?
This is called hallucination, and it's the single most important thing to understand about chatbots.
Here's what's happening. The model is predicting the most plausible next word. When you ask "what's the capital of France," it predicts "Paris" because in the trillions of words it read, "the capital of France is Paris" appeared a lot. The answer is correct and the most plausible — those happen to be the same thing.
Now ask it: "what's the dosage of medication X for a 70-pound dog?" If it read enough veterinary text, it might give the right answer. If it didn't, it will still answer — confidently — with something that sounds like a dosage. It cannot tell the difference between "I read this and remember it" and "this is plausible-sounding text I just generated." From inside the model, both look identical.
This is why chatbots can confidently invent:
- Books that don't exist (with authors and ISBNs)
- Quotes from real people that they never said
- Legal cases (a US lawyer got sanctioned in 2023 for citing six made-up cases ChatGPT gave him)
- Scientific papers (with believable titles, authors, and journal names)
- API features that the actual software doesn't have
Some hallucinations are easy to spot ("the Eiffel Tower is 12 miles tall"). Most are not. The model writes them with the same confident tone as it writes true things.
Practical defenses:
- For anything important, verify the answer somewhere else. Especially for facts, citations, numbers, dates, code that interacts with real systems.
- Be more suspicious of specific claims than general ones. "Vitamin C is good for you" is probably fine. "Vitamin C cures condition X at dose Y per kilogram" is suspect.
- Use a model with search turned on for anything recent or factual. The model can still hallucinate, but a grounded answer with sources is much more likely to be right than a free-form one.
- Ask it to show its work. "Walk me through how you'd verify this" sometimes catches its own mistakes.
Hallucination is not a bug that's about to be fixed. It is intrinsic to how these models work. The newer models hallucinate less than the older ones, and the gap is closing on common topics. But the underlying mechanism — predict the most plausible next word — fundamentally can't tell truth from confident-sounding fiction. The detailed mechanics — why it happens at all, why it gets worse on niche topics, what reduces it and what doesn't — are in the AI hallucinations guide.
Which chatbots hallucinate least in 2026
There's no clean leaderboard, but published evals (HaluEval, TruthfulQA, FActScore) and consistent user reports through 2026 line up roughly like this:
| Chatbot | Hallucination tendency | When it's worst |
|---|---|---|
| Claude Opus 4.x / Sonnet 4.6 | Lowest among the big four | Niche scientific or legal claims |
| GPT-5 | Low to medium | Very recent events without search |
| Gemini 2.5 Pro | Low when search is on, medium when off | Citation-style queries |
| Copilot (in M365) | Low when grounded in your docs | Anything outside your tenant's content |
For anything important, treat all of them as confident but unreliable, and verify. Reasoning models (o3, Claude with extended thinking, Gemini Deep Think) hallucinate less on math and structured problems and more on broad factual claims — the longer the reasoning chain, the more chances to wander off.
Why does it cut off mid-sentence?
A few reasons, depending on where the cutoff happens.
The response-length limit. Every chatbot has a maximum response length per turn — usually thousands of words for paid plans, less for free ones. If you ask for "the complete history of the Roman Empire in detail," it will get to a stopping point and stop, even if the story isn't done. The fix is to ask for it in parts, or to say "continue" when it stops.
It got confused. Sometimes the model loses track in the middle of generating, especially on long answers, complex code, or when you've been chatting for a very long time. Starting a new conversation often fixes it.
Internet timeout. The chatbot is running on a server somewhere. If your connection blips or the server's busy, the stream of text can be interrupted. Try refreshing or sending the message again.
Safety filter. If the model thinks it's about to say something it shouldn't, some products will cut off the answer rather than finish it. Usually you'll see a notice. Sometimes it's silent.
You hit the conversation limit. Especially in free tiers, products cap how many turns or how many tokens you can use in a window. Once you hit the cap, replies stop or shrink.
If you frequently hit cutoffs in the middle of long answers, paid plans typically lift the limits. If you frequently get cutoffs at random points mid-sentence, that's the model losing the plot or the server having a bad day — start a fresh chat.
Comparing chatbot output limits in 2026
| Chatbot | Free tier max output | Paid tier max output | Notes |
|---|---|---|---|
| ChatGPT | ~1,500 tokens / ~1,100 words | ~4,000 tokens / ~3,000 words on GPT-5 | Pro tiers can extend via "continue" |
| Claude | ~4,000 tokens | ~8,000 tokens on Sonnet 4.6, more on Opus | Generally generous; "extended" mode adds reasoning room |
| Gemini | ~2,000 tokens on free | ~8,000 tokens Pro, ~16k Flash | 2M context but per-response cap is small |
| Copilot | varies by app | varies by app | Word/Excel responses respect doc context |
The product-imposed cap is almost always lower than the model's theoretical maximum. Hit "continue" for longer outputs, or ask for the answer in sections.
Why long answers sometimes get worse the longer they go
The model decides each next word based on everything that came before. The longer the response, the more chance there is for a small early mistake to compound — one wrong sentence sets up the next wrong sentence, and by paragraph 4 the answer has drifted. This is why "summarise this 50-page document in detail" often returns a strong opening, a competent middle, and a vague or repetitive end.
The mitigation: ask for shorter outputs and iterate. "Give me the executive summary first; I'll ask for detail on the parts I care about." Long single-shot outputs are a worse use of the model than several focused exchanges.
Does it remember me?
By default, no — each conversation starts blank. The model has no idea who you are, what you talked about yesterday, or what your preferences are.
In 2026 most products have added a "memory" feature:
- ChatGPT stores notes about you ("user is a vegetarian," "user is learning Spanish") that get pulled into every chat. You can see, edit, and delete these notes.
- Claude has "Projects" — a workspace where you can give it persistent context and files that apply only inside that project.
- Gemini has memory similar to ChatGPT, integrated with your Google account.
- Copilot in Microsoft 365 can pull context from your email, calendar, and documents within the company.
Memory is not the model "knowing" you over time the way a friend does. It's the product writing down a few facts and feeding them back into the next conversation. The model itself is the same model talking to a million other people; the memory is yours.
If you want the chatbot to actually remember a specific thing about a project or your preferences, the most reliable way is to tell it at the start of the conversation, or to save it explicitly to memory if the product supports it. Don't assume it remembers what you said last week unless the product confirms it.
How memory actually works under the hood
When you tell ChatGPT "I'm a vegetarian and learning Spanish," the product runs a small classifier or LLM step that decides whether that's worth saving. If yes, a short note ("user is vegetarian," "user is learning Spanish") gets written to your account's memory store — separate from the conversation log. The next time you open a new chat, the system prompt the model receives includes those notes, framed as facts about you. The model treats them like any other context.
Three implications. First, the memory is facts the product chose to save, not a transcript of everything you said. Second, the model can forget or contradict its own memory if the conversation pushes hard enough — it's just text in the prompt, not a hard constraint. Third, anything in memory is readable by you (most products let you view and edit the list) and by anyone the product shares your account with.
Privacy and memory
Memory features collect personal data by design. Whether your conversations are also used to train future models is a separate question — see AI chatbot privacy for the full breakdown. As a quick summary in 2026:
- ChatGPT free: training on conversations unless you opt out. Memory: on by default for many users; you can disable it.
- Claude consumer: not used for training by default; memory features are opt-in per workspace.
- Gemini free: training on conversations by default; you can disable in Activity settings.
- Copilot in M365 (enterprise): not used for training; tenant-isolated.
What it can do well
The honest list of things modern chatbots are genuinely good at:
- Explaining things you don't understand. A complex topic in three different ways until one clicks. Asking dumb questions without judgment.
- Summarising. Long article into bullet points. Meeting transcript into action items. Book into a 200-word pitch.
- Brainstorming. Twenty title ideas for an article. Fifty potential causes of a bug. Three different angles for a presentation.
- Rewriting. Same text, formal tone. Same text, shorter. Same text, in plain language. Same text, in a different language.
- Translating. Between major languages, with surprising nuance. Better than Google Translate for prose; comparable to a human for casual use; not yet trustable for legal or medical.
- Coding (with caveats). Boilerplate, simple scripts, debugging help, code review, explaining unfamiliar code. The newer models are stronger than the older ones; the limits show up in large, integrated projects.
- Editing your writing. Catching typos, suggesting clearer phrasing, restructuring sentences. Better than spellcheck; not yet better than a good human editor on substance.
- Drafting. Cover letters, emails, simple legal templates (always reviewed by a human), creative writing first drafts, social posts.
- Tutoring. Patient, infinite, doesn't get bored, can explain the same concept five times with different examples.
Where it's good enough to use but verify: research, fact-finding, anything you'll act on.
Where it's not yet reliable: precise factual recall, math beyond simple algebra, citations, anything time-sensitive without search, anything legally or medically consequential.
Why coding is the breakout use case
If you ask a coder which AI product they use every day in 2026, the answer is almost always Claude (Sonnet 4.6 or Opus 4.x), GPT-5 in ChatGPT or Codex, GitHub Copilot in the editor, or some combination. Coding works as a chatbot use case for a structural reason: code is unambiguous. The model writes a function; you run it; either it works or it doesn't. The feedback loop is fast and binary. Compare that to writing prose, where "is this good?" is subjective and slow.
Coding also benefits from the model having read essentially every open-source repo. Common patterns, library APIs, idiomatic style — the model has seen all of it many times. Where it stumbles: very large unfamiliar codebases (the model can't see the whole thing), niche internal frameworks (not in training data), and tasks that require knowing exactly which version of a library you're using (it confidently uses the wrong API).
What it doesn't do well
- Knowing things it didn't read. Anything past its knowledge cutoff. Anything niche enough that the internet doesn't cover it well. Your specific company's internal processes (unless connected to your company's documents).
- Math beyond the easy stuff. Modern chatbots can handle basic arithmetic, algebra, simple word problems. They get tripped up on long multi-step calculations and many things that require precise number tracking. For real math, give it a calculator tool — most products now do this automatically.
- Counting letters or characters. Famous weakness, see tokens. It doesn't see individual letters most of the time.
- Remembering you long-term without memory features turned on. Don't expect continuity that wasn't explicitly enabled.
- Telling truth from plausible fiction. See hallucination. The model doesn't know what it knows.
- Following very long, multi-part instructions perfectly. It will usually pick up most of what you asked for, drop one or two parts. Break complex requests into pieces.
- Visual reasoning without seeing the image. It can read images now (most major chatbots), but if you ask "what does this PDF say" and don't attach the PDF, it cannot read your screen.
- Real-world physical reasoning. Spatial problems, mechanical reasoning, anything where you'd want a physics intuition rather than a verbal one.
- Knowing when to refuse. It can refuse for the wrong reasons (over-cautious about benign questions) and fail to refuse for the right reasons (giving advice it shouldn't). Newer models are better; not perfect.
The "agreeable assistant" failure mode
A subtle weakness in every modern chatbot: trained to be helpful, it tends to agree with the framing of your question. Ask "why is X bad?" and you'll get reasons X is bad, even when X is debatable. Ask "why is X good?" and you'll get reasons X is good. The model picks up the slant in the question and runs with it. This is fine when you know what you want; it's a problem when you're trying to think clearly.
The fix is to ask the question without telegraphing the answer: "Is X bad? What are the arguments on both sides?" — or to explicitly ask for pushback: "Where am I most likely wrong here?" Newer models are slightly better at proactively offering counter-arguments, but the bias toward agreement is still strong.
Why it sometimes lectures you
A second consistent annoyance: chatbots add disclaimers, caveats, and "please consult a professional" lines to answers that don't need them. The reason is training — the model was rewarded for being cautious, and the cautious patterns leaked into questions where they're irritating. You can usually suppress this with a one-line instruction: "Skip the safety disclaimers, I know how to use this responsibly." Most models comply, within reason.
How to get better answers out of it
You don't need to be a "prompt engineer." A handful of simple habits cover 90% of the gain.
Show, don't tell. Don't describe what you want; give it an example. "Write a polite email declining this meeting" gets a generic answer. "Write a polite email declining this meeting, in the same tone as: [paste example email]" gets one in your voice.
Say who you're talking to. Asking "explain interest rates" gets a wall of text aimed at no one. "Explain interest rates to my 10-year-old" gets a clear, age-appropriate explanation. "Explain interest rates to a finance professional, in two sentences" gets you something useful for a presentation.
Ask for the format. "In bullet points," "as a table," "as a JSON object," "in 100 words or less," "with section headers." The model will format any way you specify; you just have to ask.
Iterate, don't restart. If the answer is close but not right, say so. "Same idea but in plainer language" or "good, but shorten the third paragraph." A second turn is almost always better than starting over.
Paste the actual material. "Help me reply to this email" without the email is guessing. "Help me reply to this email: [paste]" is concrete. The model can only work with what you give it.
For accuracy: ask it to check itself. "Now go through your answer and flag anything that might not be correct" sometimes catches its own mistakes. Not a replacement for verifying important facts, but a useful step.
For creative work: ask for variants. "Give me three different versions" is more useful than asking for one and hoping it's right. Pick the one closest to what you want, then iterate on that.
Don't over-prompt. Long elaborate prompts with "you are an expert in X with 20 years of experience" don't help nearly as much as people think. A clear, direct request with examples is better.
The "why doesn't it just say it doesn't know" question
Because it doesn't know that it doesn't know. The model produces a probability distribution over possible next words, picks one, then the next, and so on. There's no internal flag that says "this is shaky." Newer models have been trained to add hedges ("I'm not sure, but...") on certain question shapes, and reasoning models can sometimes notice their own uncertainty when they think step by step. But the underlying mechanism is statistical, not introspective.
If you want a chatbot to be more honest about its limits, the most reliable trick is to explicitly tell it to refuse when uncertain: "If you don't know the answer, say so rather than guessing." This works some of the time. It is not a guarantee.
A few prompts that consistently get better results
These work across all four big chatbots and are worth memorising:
- "Walk me through your reasoning step by step before you give the answer." Surfaces mistakes that would be buried in a confident one-line reply.
- "Give me three different angles on this, then pick the best." Better than asking for one answer and hoping.
- "What might I be missing? What's the strongest counter-argument?" Counters the model's default agreeableness.
- "Cite sources for anything factual." Only useful if search is on; otherwise the model fabricates citations.
- "What's the simplest version of this?" Cuts through verbose AI prose.
- "Pretend I have no background here and explain." Useful when the chatbot keeps assuming you know more than you do.
The full conversation lifecycle: what happens between your keystroke and the answer
Most explanations stop at "the model predicts the next token." The actual path from your message to the text you see on screen has 10–15 steps, and several of them shape how the answer turns out.
- Your message arrives at the product server. You type, hit send. The product receives your message + your account ID + your conversation history.
- System prompt is prepended. The product attaches its hidden instructions — the personality, the safety rules, the available tools.
- Memory is queried. Notes about you (vegetarian, learning Spanish) are added to the prompt.
- Tools are described. If web search is on, Python is on, image gen is on, the tool descriptions are added.
- Files are processed. Any uploads you attached are converted to text or image tokens.
- Conversation history is added. Previous turns in this conversation get appended.
- The complete prompt is sent to a GPU. The product picks which model server to route to (often via a load balancer).
- The model processes the prompt — embedding, attention, feed-forward through dozens of transformer blocks. This is the prefill phase, compute-heavy and parallel.
- The model starts generating tokens. One token at a time, each token informed by the entire context plus everything generated so far. This is decode, bandwidth-heavy and serial.
- Tool calls are intercepted. If the model emits a structured tool call, the product pauses, runs the tool, feeds the result back, and the model resumes.
- Safety filters run. Output is checked against the safety classifier. If it matches a refused category, the response is replaced or truncated.
- Streaming UI receives tokens. As each token arrives, the product streams it to your browser/app. This is why you see the response appear word by word.
- End-of-turn token fires. The model decides it's done. The product stops streaming.
- Memory is updated. A post-processing step examines the conversation to see if anything is worth saving to long-term memory.
- Costs are billed and logs are written. Token counts, latency, model used, tools invoked — all stored for analytics.
Many of the surprising behaviours come from steps that aren't the model itself: a refusal in step 11 (safety filter, not the model deciding to refuse), a tool call that takes 8 seconds in step 10 (slow web search, not slow chatbot), a memory update in step 14 (the chatbot "remembering you" is the product writing notes after the fact).
Why streaming responses sometimes pause
You'll occasionally see a chatbot pause mid-sentence for a few seconds. Possible reasons:
- The model is making a tool call. It generated a structured tool call, the product is executing it (web search, Python eval), and waiting for the result before resuming.
- Server-side rate limiting. Your account or the entire pool hit a limit briefly.
- Inference batch reorganisation. Production servers batch many users' requests together; a batch boundary can introduce a small pause.
- Safety check on the output so far. Some products run intermediate safety checks; if one triggers, generation pauses while a higher-tier check runs.
The pause usually resolves in a few seconds. If it doesn't, the request likely failed and the product will show an error.
Tokenization in plain English: BPE
Tokens were introduced earlier as "chunks of text the model sees." The algorithm that decides where one token ends and the next begins is called byte-pair encoding (BPE). It's worth understanding because three of the chatbot's weirdest behaviours come straight from it.
The idea is simple. Start with every character in your training data as its own one-character token. Find the most common pair of adjacent tokens — say "t" and "h" appear next to each other constantly. Merge them into a new token "th". Repeat thousands of times. Each merge captures a frequent character sequence. By the end you have a vocabulary of 50,000–200,000 tokens. Common words like "the", "and", "house" end up as single tokens. Rare words like "antidisestablishmentarianism" split into pieces ("anti", "dis", "establish", "ment", "arian", "ism"). Names you've never seen often split character-by-character.
Three consequences for users.
1. Numbers and letters are awkward. "1234" might be one token; "1235" might be two. The model can't easily reason about the digits because it can't easily see them. This is why arithmetic gets weird at the boundaries — single-digit math is reliable, but eight-digit multiplication frequently fails.
2. Non-English languages cost more. English text averages ~1.3 tokens per word. Chinese, Japanese, Arabic, Korean, Hindi all average 2–4 tokens per word because the tokenizer was optimised for English. A Spanish message and an English message of the same meaning can differ by 30–50% in token count, and your API bill reflects it.
3. Spelling and letter-counting bugs. The "how many r's in strawberry" famous failure is a tokenization artifact. The model sees "straw" + "berry" + "" — three tokens. It never sees individual letters unless they're separated by spaces. Newer reasoning models work around this by explicitly spelling out the word character-by-character before counting, but the base mechanism still has the blind spot.
GPT-style models use BPE-derived tokenizers (tiktoken for OpenAI, tokenizers for HuggingFace). Anthropic, Google, Meta, Mistral, and DeepSeek each have their own tokenizers with different vocabulary sizes and merge rules — same prompt, different token counts across providers, with sometimes 20–40% variance.
A 30-second BPE demo
Try this in your head. Vocabulary starts with {a, b, c, ..., z}. Training corpus is "banana banana banana".
- Most common pair: "a" + "n" appears 5 times. Merge → "an".
- Now corpus reads "b an an a b an an a b an an a". Next most common pair: "an" + "an" appears 6 times. Merge → "anan".
- Corpus: "b anan a b anan a b anan a". Continue.
After enough merges, "banana" becomes a single token. The tokenizer isn't deciding "banana is a word"; it's noticing "this character sequence is frequent enough to deserve its own slot."
Embeddings: meaning as coordinates
The first thing a chatbot does with your tokens is convert each one into a vector — a list of a few thousand numbers. That vector is the token's embedding. The model has learned, during training, to place tokens at positions in this high-dimensional space such that tokens with similar meanings or grammatical roles end up near each other.
This is the closest thing the chatbot has to "knowing what words mean." The word "king" lives near "queen", "prince", "ruler". "Paris" lives near "France", "capital", "Seine". A famous demonstration: take the vector for "king", subtract "man", add "woman", and you land near "queen". The model didn't learn that relationship explicitly — it emerged from the training task of predicting next words across billions of sentences.
Why this matters in practice:
- Why typos still work. "Embedding the wrod" — the model sees a token sequence it has never seen exactly, but the embedding of the misspelled token lands near the correct one. Behaviour degrades gracefully.
- Why analogies work. Asking the chatbot "if Paris is to France as Tokyo is to ___" works because the spatial structure of embeddings encodes the relationship.
- Why translation works. During pretraining the model sees enough parallel English-Spanish text that translations of the same concept get embedded near each other across languages. Multilingual ability is mostly an embedding-space phenomenon.
- Why "more context = better answer." Each new token's embedding affects all the others through attention (next section). Richer context means the model's prediction draws on more signal.
The embedding layer is a giant matrix: vocab size (say 100,000) × embedding dimension (say 4,096). For GPT-5-class models, that's hundreds of millions of parameters just in the embedding table. The information is dense; collapse a token to a vector, route the vector through dozens of transformer blocks, sample a new token at the end.
Inside one transformer block
Every modern chatbot is built from a stack of identical transformer blocks — typically 32 to 128 of them. Each block does two things: mixes tokens together (attention), then thinks per-token (a small neural network called the feed-forward layer). Stacking dozens of these lets information flow across long passages and get refined repeatedly.
Attention: tokens looking at each other
Picture reading a sentence: "The dog chased the cat because it was scared." When you process "it", you instinctively look back to "dog" or "cat" to figure out which one is scared. That look-back is what attention does, mechanically.
Each token produces three vectors: a query ("what am I looking for?"), a key ("here's what I am"), and a value ("here's what I contribute"). For every other token in the context, the model computes how much the query matches the key — a similarity score — then uses that score to weight how much of the other token's value to mix in. After this mixing, each token's representation has been updated to include relevant context from elsewhere in the prompt.
In a long context, "it" can attend to a token 50,000 words back. This is what makes long-document understanding possible. It's also what makes long contexts expensive — every token in the new output has to attend to every prior token, and the compute grows quadratically with context length.
The feed-forward layer: small private brain per token
After attention mixes information across tokens, the feed-forward layer processes each token individually through a small two-layer neural network. This is often described as the model's "long-term memory" — a lot of the factual content of the model lives in feed-forward weights.
The classic result that surprised researchers: feed-forward layers act like key-value stores. If you reach into a specific neuron in the feed-forward layer, you can sometimes find one that fires on "Paris-related context" and another that fires on "capital-of-France context." Combining attention's mixing with feed-forward's per-token computation is what gives the transformer its power.
Multi-head attention: many lookups in parallel
In practice, the attention mechanism is run many times in parallel — usually 32 to 96 "heads" per block. Each head learns to look at different patterns: one might track grammatical agreement, another track entity references, another track stylistic consistency. The outputs of all heads are concatenated and projected back down to the embedding dimension before continuing.
Residual stream: the information highway
Each token's representation flows through every block. After each block, the block's output is added to (not replaced by) the previous representation. This means the original embedding signal can propagate all the way to the end if it needs to, while each block contributes its refinement. This residual structure is what makes deep models trainable; without it, signals would degrade across dozens of layers.
The picture you can keep in your head: a token enters, gets mixed with other tokens via attention, gets refined by the feed-forward layer, the result is added back to the running representation, and the whole thing flows through the next block. Repeat 80 times. The final representation is fed to a small output layer that produces a probability distribution over the next token.
The three stages of training: pretraining, SFT, RLHF
The "brain" of a chatbot is built in three stages, in order, each doing a different thing.
Stage 1: pretraining — read everything
The model is shown billions of pieces of text and trained on one task: predict the next token. Given "The capital of France is", learn to assign high probability to "Paris". Given "def fibonacci(n):", learn to assign high probability to "if". The model isn't told what "France" or "Python" mean; it just learns to predict.
Pretraining runs for weeks to months on thousands of GPUs. The compute investment for a frontier 2026 model is in the hundreds of millions of dollars. The output is a base model — capable of completing text in a continuation style, but not yet useful as a chatbot. Ask a base model "What's the capital of France?" and it might reply with "What's the capital of Italy? What's the capital of Spain?" — continuing a list of questions, because lists of questions are common training-text patterns.
Stage 2: supervised fine-tuning (SFT) — learn to be helpful
Now the model is shown thousands to millions of human-written examples of "good chatbot responses." A prompt + a high-quality reply, prompt + reply, prompt + reply. The model trains to imitate the reply style. After SFT, the model behaves like a helpful assistant: it answers questions directly, uses a conversational tone, follows instructions.
The data for SFT is expensive — humans write and curate examples, often domain experts for specialised tasks. Coding examples come from professional engineers; medical examples from clinicians; creative writing from writers. The quality of the SFT dataset is one of the larger determinants of how good the final chatbot feels.
Stage 3: reinforcement learning from human feedback (RLHF)
After SFT, the model is helpful but not yet polished. RLHF tunes the model on a different signal: instead of imitating example replies, the model is shown multiple candidate replies it could give, and humans rank them. The model learns to produce replies humans prefer.
RLHF is what makes the difference between "the chatbot says correct but boring things" and "the chatbot says correct, interesting, well-formatted, appropriately cautious, appropriately concise things." It also installs the safety behaviours — refusing certain categories, hedging on uncertain claims, declining harmful requests.
In 2026, RLHF is increasingly being supplemented or replaced by DPO (direct preference optimisation), Constitutional AI (Anthropic), and RLAIF (reinforcement learning from AI feedback, where the ranker is another model). Different labs blend these differently:
- OpenAI — heavy RLHF, refined with synthetic preferences for newer models.
- Anthropic — Constitutional AI: the model critiques its own responses against a written constitution, and learns from those critiques.
- Google — mix of RLHF and RLAIF, plus Gemini-specific safety training.
- Meta — DPO-heavy in recent Llama releases; cheaper than full RLHF, similar quality.
The personality differences you feel between ChatGPT, Claude, and Gemini come almost entirely from stage 2 + 3 choices, not from the base model. Same pretraining recipe, different post-training, very different conversational character.
Tool use: how chatbots ‘do' things
A pure chatbot only outputs text. But ChatGPT can browse the web, Claude can run Python, Gemini can search Google, Copilot can edit your spreadsheet. How?
The mechanism is tool calling. The product tells the model, in its system prompt, what tools are available — typically a list of named functions with descriptions and parameter schemas. Something like: "You have access to web_search(query), python(code), calendar_create_event(title, date)."
During generation, instead of producing prose, the model can emit a structured tool call: {"tool": "web_search", "query": "current weather in Tokyo"}. The product intercepts the structured output, runs the actual search, and feeds the result back into the conversation as a new message ("tool result: 18°C, partly cloudy"). The model then continues with that information in context.
To the user, it looks like the model "did" something. Mechanically, the model only ever outputs text — but some of that text is a structured tool call that the surrounding product knows how to execute. Tool use is what makes chatbots useful for tasks beyond conversation.
A worked tool-call example
You ask Claude: "What's the weather in Tokyo right now, and what time is it there?"
Internally:
- Claude reads your message + system prompt (which lists available tools including
web_search). - Claude generates:
{"tool": "web_search", "query": "current weather Tokyo"}. - The product intercepts, runs the search, returns: "Tokyo: 22°C partly cloudy, time 11:48 PM JST."
- Claude continues with that observation in context.
- Claude generates: "It's 11:48 PM in Tokyo right now, partly cloudy at 22°C — a mild night."
You see one fluent answer. Under the hood: model output, tool call, real-world data fetch, model resumption. Each step takes a fraction of a second; total user-visible latency 1–3 seconds.
The same architecture supports much more complex flows: a coding agent that proposes a fix, runs the tests, observes the failure, proposes another fix, and iterates 10 times before producing a final answer. Each iteration is one "tool call + model resumption" cycle.
What tools each chatbot has in 2026
| Chatbot | Default tools |
|---|---|
| ChatGPT | Web search, Python (Code Interpreter), DALL-E image gen, file analysis, custom GPTs with custom tools |
| Claude | Web search, computer use (Claude can operate a virtual computer), file analysis, MCP-connected tools |
| Gemini | Google search, Google apps (Calendar, Docs, Gmail), Python execution, image gen via Imagen |
| Copilot in M365 | Email, Calendar, Word docs, Excel cells, Teams chat, SharePoint, Power Platform |
| Perplexity | Web search (its core feature), file analysis |
| Custom (via API) | Whatever you define — every API supports developer-defined tools |
The newer trend is MCP (Model Context Protocol), an open standard for connecting tools to chatbots. Claude was the first to ship MCP-based tools in 2024; OpenAI and Google have followed with similar protocols. The promise: any chatbot can connect to any MCP-compatible tool without custom integration.
System prompts: the hidden instructions
Before your first message in any chat, the product feeds the model a hidden system prompt — typically several thousand tokens of instructions about how to behave. This is where the personality, the safety rules, the formatting preferences, and the tool descriptions live.
A simplified ChatGPT system prompt might say something like: "You are ChatGPT, a large language model trained by OpenAI. Be helpful, honest, and concise. Use Markdown for formatting. The current date is May 16, 2026. You have access to the following tools: web_search, python, image_gen, ..."
Claude's system prompt (leaked in various forms over 2024–2025) is famously long — north of 10,000 tokens — and detailed. It includes specific instructions about how to handle ambiguity, when to ask clarifying questions, how to format code, how to caveat uncertain claims, how to refuse harmful requests, and many edge cases.
The system prompt is invisible to you in the chat UI but takes a real chunk of the context window. It also explains why different products with the same underlying model can feel so different — Copilot using GPT-5 with Microsoft's system prompt behaves differently from ChatGPT using the same model with OpenAI's system prompt.
Why your own custom instructions are basically your own system prompt
ChatGPT's "Custom Instructions", Claude's "Projects" with custom context, and Gemini's saved memory all work the same way: text you write gets injected into the system prompt for your conversations. The model treats them with the same priority as the company's built-in instructions, which is why a few sentences of "always reply in formal British English, with bullet points, and skip the disclaimers" can have such a big effect.
Temperature and top-p: the randomness knobs
When the model has produced a probability distribution over the next token, it has to pick one. The two parameters that govern that pick are temperature and top-p.
Temperature. A scalar between 0 and ~2. Temperature 0 means "always pick the most probable token" — deterministic, robotic, repetitive. Temperature 1 means "sample proportionally to the probabilities" — the default for creative tasks. Temperature 2 means "flatten the distribution, take more risks" — gets weird fast.
Top-p (nucleus sampling). Truncate the distribution to just the most probable tokens that together account for p% of the probability mass. Top-p 0.9 means "only consider the top tokens that add up to 90% probability; ignore the long tail." Prevents the model from picking absurd low-probability tokens.
These are the knobs that explain why the same question gives different answers each time. ChatGPT, Claude, and Gemini default to non-zero temperature on consumer chat — typically around 0.7. Set it to 0 in the API and the model becomes deterministic (modulo numerical noise on the hardware).
Practical guidance:
- Code, math, structured output, factual lookup: temperature 0 (or low). Want consistency.
- Brainstorming, creative writing: temperature 0.7–1.0. Want variation.
- Anything where "the right answer" exists: low temperature.
- Anything where "many valid answers" exist: higher temperature.
Most consumer products don't expose temperature directly. ChatGPT's "Be more creative" / "Be more precise" toggles, Claude's Concise mode, and Gemini's response styles are all wrappers around temperature plus some other knobs.
Reasoning models: thinking out loud
In late 2024, OpenAI released o1 — the first widely-available reasoning model. By 2026 the category includes OpenAI o3 and o4, Claude with extended thinking, Gemini 2.5 Deep Think, and DeepSeek R1. They behave differently from standard chatbots in one specific way: they generate a long internal chain of thought before producing the visible answer.
Concretely: you ask a hard math question. A standard chatbot produces 200 tokens of explanation and an answer. A reasoning model first produces 5,000 tokens of internal reasoning ("Let me think about this. The problem says X. So I need to compute Y. Let me check that. Actually, X implies Z, so..."), then produces a 200-token visible answer.
The internal reasoning is often hidden from users (OpenAI redacts o3's full chain; Anthropic shows Claude's by default; DeepSeek shows R1's). The user-visible answer is similar in length to a non-reasoning model's. The difference: the reasoning model has had a chance to catch its own mistakes, try multiple approaches, and verify before committing.
What this buys you:
- Better math, coding, and logical reasoning. Reasoning models score 20–50 percentage points higher on benchmarks like MATH, AIME, GPQA, SWE-bench than non-reasoning models of similar size.
- More reliable multi-step instructions. "Do steps A, B, C, then check D" works better with reasoning models.
- Less obvious hallucination on quantitative claims (because the model can double-check).
What it costs you:
- Time. Reasoning takes 10 seconds to 5 minutes per answer. You feel the wait.
- Money. All those internal tokens bill as output. o3-high can cost $1 per question.
- Worse on broad factual recall. Reasoning models sometimes overthink simple lookups.
When to use a reasoning model: hard math, complex coding, multi-step planning, anything where you'd want a careful answer over a fast one. When not to: casual conversation, simple factual lookups, anything time-sensitive.
Agents: chatbots that take actions
An agent is a chatbot wrapped in a loop. Instead of one user message → one chatbot reply, an agent runs:
- Look at the goal.
- Decide on the next action (often a tool call).
- Execute it. Observe the result.
- Update its plan based on the result.
- Go to step 2 — or stop if the goal is achieved.
Agents are what you get when you give a reasoning model + tools + a loop. The chatbot can now do things like: research a topic across 20 web pages, fill out an application form, debug a piece of code by running it and fixing errors, plan a trip and book the flights, write a 50-page report with citations.
In 2026 the most visible agent products are: OpenAI's "Operator" and Codex (autonomous coding), Anthropic's Claude with computer use, Google's Project Mariner / Astra, Cursor and Devin in software engineering, and dozens of B2B agent products.
Agents are real but rough. They get stuck on UI changes, hallucinate steps, run up costs unexpectedly, and occasionally take damaging actions. Production-quality agents in 2026 require careful guardrails, observability, and human-in-the-loop checkpoints. See agent serving infrastructure for the engineering details.
Multimodal: vision, audio, voice mode
Modern chatbots aren't text-only. Most can also see (images, PDFs), hear (audio), and speak (TTS or native voice).
Vision
Drop an image into ChatGPT, Claude, or Gemini and ask a question. The model processes the image through a separate vision encoder (a smaller neural network that converts images into a sequence of tokens compatible with the language model). Those image tokens get prepended to your text tokens, and the model continues as normal.
Vision works well for: reading text from screenshots, identifying objects, describing scenes, summarising charts, answering questions about diagrams. It works less well for: counting many small items, reading very small or stylised text, anything requiring pixel-perfect localisation.
Audio: speech-to-text vs native audio
There are two architectures for audio chatbots.
Pipeline approach. Audio → speech-to-text (Whisper, AssemblyAI) → LLM → text-to-speech (ElevenLabs, OpenAI TTS). Each step adds latency. Total response time 2–6 seconds.
Native audio. The model is trained to take audio tokens as input and emit audio tokens as output, with the LLM directly in the middle. GPT-4o's voice mode, Gemini Live, and the newest Claude releases work this way. Response time can drop to 200–500 ms — fast enough for natural conversation.
Native audio captures tone of voice, emphasis, hesitation, accent — information lost in transcription. It's what makes the new voice modes feel different from "phone-tree IVR" interactions.
How vision encoders see images
A vision encoder typically slices an image into tiles — say 14×14 pixel patches — then converts each patch into a token vector via a CNN-like or transformer-based encoder. A 1024×1024 image might produce 256 image tokens; a high-detail image with multiple tile levels might produce 1500+.
These image tokens get prepended to your text tokens before going into the main language model. Attention then mixes the image tokens with your text tokens, letting the model "see" your image in the same way it "sees" your words.
The implication: long image prompts cost real tokens. A page of a PDF (rendered as a 1500×2000 image) can use 1500–2000 input tokens. A 30-page PDF: 50k+ input tokens. This is why uploading large PDFs can hit context limits faster than the equivalent text would.
| Image size + detail | Approx tokens | Cost ($/M = $5 example) |
|---|---|---|
| 512×512 low detail | ~85 | $0.000425 |
| 1024×1024 standard | ~765 | $0.0038 |
| 1024×1024 high detail | ~1545 | $0.0077 |
| 2048×2048 high detail | ~2913 | $0.0146 |
(Exact numbers vary by provider; see multimodal serving.)
Voice mode in 2026
ChatGPT's Advanced Voice Mode, Gemini Live, and Claude's voice features all support real-time bidirectional voice. You can interrupt the assistant; it can hear you laugh; it picks up your accent and matches your conversational style. The mechanism is the same native audio path plus a streaming pipeline that keeps latency under 500 ms.
What it isn't yet (mid-2026): perfect for ambient conversation in noisy environments, reliable for transcribing speakers other than you in the room, suitable for hands-free continuous use without explicit invocation.
Custom GPTs, Projects, and personalisation
Several products let you create persistent personalised chatbot configurations.
ChatGPT's Custom GPTs. A Custom GPT is a chatbot built on top of GPT-5 with: a custom system prompt, an optional set of files (knowledge base), and optional tools. You can publish them to a public marketplace or keep them private. Anyone using your Custom GPT gets your configured behaviour.
Claude's Projects. A Project is a workspace with custom instructions and a set of attached files. Every chat inside the project inherits the instructions and has access to the files. Particularly useful for long-running work on a single codebase or document set.
Gemini's Gems. Google's equivalent to Custom GPTs. Same idea: instructions + optional knowledge + optional tools.
Copilot Studio. Microsoft's enterprise-targeted Custom Copilot builder. Wires up to Microsoft 365 data and approved tools, with enterprise admin controls.
The pattern across all four: under the hood, they're system-prompt + RAG (retrieval over your files) + optional tools. Nothing magical. But they make personalisation a no-code operation, which expanded who can build useful AI products dramatically.
Memory features vs Custom GPTs/Projects
Memory is per-account, applies to all chats with that chatbot, and persists short summary notes. Custom GPTs / Projects are scoped containers, apply only inside the container, and can hold larger structured context. Use memory for personal preferences ("I'm vegetarian, I'm learning Spanish"). Use Custom GPTs / Projects for task-specific configurations ("This is my customer-support bot trained on our docs").
Fine-tuning vs RAG: two ways to specialise
Both are ways to make a chatbot work better on your specific data. They do very different things mechanically.
RAG (retrieval-augmented generation): you store your data (docs, FAQs, code) in a database. When a user asks a question, the system searches the database for relevant chunks and pastes them into the model's context before asking the model to answer. The model doesn't change; the prompt does. RAG is best for: facts that change frequently, documents the user can verify, source-cited answers.
Fine-tuning: you take the model and continue training it on examples of how you want it to behave. The model itself changes — its weights are updated. Fine-tuning is best for: style ("write like our brand voice"), structured output ("always emit valid JSON with this schema"), specialised tasks the base model is weak at.
For consumer chatbots, fine-tuning is mostly invisible — the products do it internally and you see the result as a new model version. For developers, fine-tuning is offered as an API: upload your training examples, get back a custom model. OpenAI, Anthropic, Google, Together, Fireworks all support this.
Most production AI products in 2026 do both: a fine-tuned base for style and reliability, plus RAG for current information. See RAG production architecture and multi-tenant LoRA serving for the engineering side.
Comparing the two on the same task
Suppose you want a chatbot that answers customer questions about your product, using your docs.
RAG path: index your docs in a vector database. On each user question, search the database for the 3 most relevant chunks, paste them into the prompt with "use only the following information to answer," and let the model write the reply.
Pros: current (you update docs, the chatbot sees the updates immediately), source-citeable, no training cost, easy to debug ("which chunks did it use?"). Cons: pays for retrieval infra (vector DB + embeddings), retrieval quality matters (a bad search returns irrelevant chunks), the prompt is long every call.
Fine-tune path: create training examples ("Question: how do I cancel? Answer: visit /account..."), train a LoRA adapter on top of a base model, deploy. Each customer question goes directly to your fine-tuned model.
Pros: shortest prompts (no retrieved chunks in the call), fastest responses, baked-in style. Cons: stale (re-train when docs change), no source citations, harder to debug ("why did it say that?").
In practice, large-scale customer-support deployments combine both: fine-tune for style and structure, RAG for current facts. Costs land in similar ranges. Choice depends on your engineering preference and how often your knowledge changes.
When fine-tuning makes the bill go down
A common misconception: fine-tuning is expensive. It is, upfront — but it can lower running costs if it lets you use a smaller model for the same task.
Example: a customer-support classifier triages incoming tickets into 12 categories. Off the shelf with GPT-4o, it costs $2.50 per million input tokens × 500 tokens/ticket × 1M tickets/year = $1,250/year. With fine-tuning, you can use a 7B open-weight model that runs at $0.10/M input — saving $1,150/year on token cost. The fine-tune itself costs $1,500 to train. Break-even: end of year 1; pure savings after.
For high-volume, narrow tasks, fine-tuning a smaller model is almost always cheaper. For low-volume or constantly-shifting tasks, stick with prompting a larger model.
Embedding similarity: the math behind RAG retrieval
When RAG searches your docs, it converts the user's question to an embedding (using a separate embedding model — text-embedding-3-small, voyage-3, bge-large), then finds doc chunks whose embeddings have highest cosine similarity. The cosine similarity is just the angle between two vectors; high similarity means "these two pieces of text are about similar things."
Quality of retrieval depends heavily on the embedding model. Older or smaller embedding models miss semantic matches ("cancel subscription" vs "end my plan" might not match). Newer ones (Voyage-3, OpenAI text-embedding-3, Cohere embed-v3) are much better. The retrieval step is often a bigger source of RAG failures than the generation step.
Why responses vary, why refusals happen, why apologies pile up
Three behaviours that confuse users, all explained by training mechanics.
Why the same prompt gives different answers
Temperature plus sampling randomness. Even at temperature 0, results can vary slightly across calls because of non-deterministic GPU floating-point operations. Different load balancers, different GPU batches, different cache states all shift the math by tiny amounts. At higher temperatures the variance is much larger.
The practical implication: if consistency matters (programmatic use, A/B testing), set temperature to 0 in the API and use a single provider region. For human use, embrace the variation — sometimes a regenerate produces a better answer.
Why models sometimes refuse benign requests
The safety training (RLHF stage) installed thousands of "refuse this" patterns. The patterns are imperfect — they pattern-match too aggressively in some areas. Common over-refusal triggers:
- Medical questions, even for pets, even general information.
- Security topics, even for CTF practice or defensive research.
- Anything involving violence in fiction (depending on the model).
- Legal questions that could be construed as advice.
- Anything mentioning minors, even in completely benign contexts.
Newer models are better calibrated. If you hit an over-refusal, the most reliable workaround is adding context: "I'm a nurse asking for professional reference," "This is for a fiction project," "I'm a security researcher with proper authorisation." Most models comply when given context that justifies the request.
Why models apologise so much
RLHF training rewarded models for being humble, accepting correction, and acknowledging mistakes. The behaviour leaks into situations where it's annoying — apologies for things that aren't mistakes, hedge after hedge on confident claims.
You can suppress this with one line: "Don't apologise unless you actually made a mistake. State things directly." Most models comply. Anthropic's Claude is somewhat better-calibrated on this out of the box; OpenAI and Google models tend toward more apology.
The personality knobs: helpfulness, harmlessness, honesty
Lab researchers refer to the three-axis tradeoff: helpfulness (do what the user wants), harmlessness (don't cause harm), honesty (don't deceive). All three are imperfect proxies and often conflict. A request for legal advice triggers: helpfulness says "answer," harmlessness says "refuse, they should see a lawyer," honesty says "give an honest answer about what you actually know."
The three labs prioritise differently. Anthropic emphasises honesty and harmlessness via Constitutional AI; the result is a chatbot that hedges more but is more candid about uncertainty. OpenAI emphasises helpfulness; ChatGPT is more directly useful but more confidently wrong. Google's Gemini sits in the middle. There is no "correct" balance — different products for different uses.
The four products in 2026, by architectural choice
The four major consumer chatbots differ less in their underlying transformer architecture than in their training-data choices, post-training methods, and product-level decisions. Reading them side by side:
ChatGPT (GPT-5 / GPT-4o / o-series)
- Base model strategy. GPT-5 as the new flagship, GPT-4o as fast/cheap, o3/o4 for reasoning.
- Personality. Helpful, direct, sometimes overconfident. Eager to answer.
- Strengths. Tool integration breadth (web, Python, image gen, file analysis), Custom GPT marketplace, voice mode quality.
- Weaknesses. More prone to confident hallucination than Claude. Memory feature can feel intrusive.
- Distinguishing choice. Aggressive product velocity. New features land first; rough edges sometimes follow.
Claude (Sonnet 4.6, Opus 4.x, Haiku 4.5)
- Base model strategy. Three-tier (Haiku/Sonnet/Opus). Extended thinking toggle on demand.
- Personality. Thoughtful, hedging, well-formatted. Asks clarifying questions more often.
- Strengths. Writing quality, coding (especially in editors via Cursor/Zed/Sourcegraph), long-context reasoning, MCP tool ecosystem.
- Weaknesses. Smaller free tier than competitors. No native image generation.
- Distinguishing choice. Constitutional AI training. Models reason about their own outputs against a written constitution.
Gemini (2.5 Pro, Flash, Deep Think, Nano)
- Base model strategy. Four tiers spanning 2M-context Flash to Deep Think reasoning.
- Personality. Pragmatic, slightly formal, integrated with Google's products.
- Strengths. Largest context window (2M tokens), native multimodal (audio, video, images), tight Google integration (Workspace, Search, Photos).
- Weaknesses. Personality is more variable across model tiers; consumer Gemini app sometimes routes differently than expected.
- Distinguishing choice. TPU-native training stack and tight Google Search integration.
Copilot (Microsoft 365 / GitHub / Windows)
- Base model strategy. Built on OpenAI's models, with Microsoft's tuning and tooling.
- Personality. Business-focused, plays well with Microsoft's apps.
- Strengths. Deep Microsoft 365 integration (Email, Calendar, Word, Excel, Teams), enterprise compliance (Microsoft tenancy isolation), GitHub Copilot's coding integration.
- Weaknesses. Slower to ship new OpenAI features than ChatGPT itself. Multiple "Copilot" products with confusing branding.
- Distinguishing choice. Enterprise-first. Whatever ChatGPT does, Copilot does inside your tenant with audit logs.
Which to use when
| Use case | Best pick |
|---|---|
| General everyday chat | Personal preference; try all three |
| Writing, editing, prose | Claude |
| Coding in IDE | Copilot / Cursor with Claude |
| Coding via chat | Claude or ChatGPT |
| Research with citations | Perplexity or ChatGPT/Gemini with search on |
| Tight Google ecosystem | Gemini |
| Tight Microsoft ecosystem | Copilot |
| Maximum free-tier capability | Gemini |
| Voice mode | ChatGPT or Gemini Live |
| Reasoning-heavy work | o3 or Claude extended thinking |
| Multimodal with video | Gemini |
What's coming in 2026–2027
Predicting AI 18 months out is a fool's game, but several trends are clear enough to act on.
Better agents. The current generation of agents is rough — they fail on UI changes, get stuck, run up costs. By late 2026 expect noticeably more reliable agents for narrow domains (coding, customer support, research). General-purpose "do anything" agents will still be unreliable.
Cheaper reasoning. DeepSeek R1 already showed that strong reasoning can run at 1/30 the price of frontier reasoning. By 2027 reasoning-class accuracy at standard chat prices is likely. The premium for reasoning collapses.
Longer working memory. Context windows already hit 2M tokens on Gemini. The next step is models that can maintain coherence across days or weeks of work — agents that pick up where they left off without re-reading every prior message. The architecture for this is being explored (state-space models, recurrent updates, persistent KV cache); the products are starting to ship.
Multimodal everywhere. By late 2026, expect every major chatbot to handle image, audio, and video natively. The current "you can attach files" model gives way to "the chatbot can see what you're seeing in real time" for the products willing to take the latency and privacy tradeoffs.
Better personalisation. Today's memory features are crude — short text notes appended to the system prompt. Coming: models that incorporate per-user fine-tuning at runtime via lightweight adapters, so the chatbot's writing style and knowledge gradually shift toward you over months of use.
The commodity layer matures. Open-weight models (Llama 4, Qwen, DeepSeek, Mistral) reach near-frontier quality. The competitive differentiation moves up the stack — to UX, integrations, and trust. Pricing pressure intensifies on hosted APIs.
Regulation lands. EU AI Act enforcement begins, US state-level laws (California, Colorado) take effect, China's measures tighten further. Expect more visible safety labels, more required disclosures, more friction on certain use cases.
What probably won't happen by 2027: AGI in any rigorous sense, fully autonomous agents you trust unsupervised with money, chatbots that "understand" you in the way humans understand each other, or the consumer-product landscape consolidating to one winner.
The on-device chatbot trend
A separate trend worth flagging: small models running locally on your phone or laptop. Apple Intelligence runs a ~3B-parameter model on-device on iPhone 16 and newer. Google Gemini Nano runs on Pixel 9 and Galaxy S25. Microsoft's Phi-4 family runs on Copilot+ PCs. These aren't competitive with frontier cloud models on hard tasks, but they handle simple chat, transcription, summarisation, and on-device tool calls without sending your data to a server.
By 2027, expect on-device models to handle ~80% of common chat queries with cloud fallback for hard ones. The privacy story is meaningfully different: when the model is on your hardware, the data stays on your hardware. The cost story is different too: no per-token fee, but the model is constrained by your battery and SoC. See AI chatbot privacy for the privacy implications.
Watermarking and provenance
Coming in 2026–2027: invisible watermarks on AI-generated text and images that let downstream systems identify AI output. Google's SynthID is the most-deployed implementation. OpenAI has internal watermarking but has not turned it on publicly as of mid-2026. The motivation: detect AI-generated misinformation, prevent training new models on AI output, and label AI content in social feeds.
The catch: text watermarks degrade when text is paraphrased or translated. Image watermarks degrade under heavy compression or cropping. Watermarking is a partial defence, not an ironclad detection mechanism.
Why coding works so well for chatbots
Coding is the surprise success story of the chatbot generation. Ask any 2024–2026 flagship to write a Python function, and the result is often startlingly good — better, in many cases, than the model's performance on the same task verbally described in natural language. There are good reasons.
The training signal is unusually clean. Code either compiles or it doesn't. Tests either pass or they don't. The internet contains billions of lines of code with attached test suites, error messages, fixes, and version histories — a near-perfect feedback signal that natural language rarely has. The post-training stage (RLHF, but also process-supervision and execution-grounded methods) can directly verify that a code generation was correct, which is much harder to do for "is this paragraph well-written."
The structure is regular. Programming languages have small grammars, well-defined semantics, and a finite number of correct shapes. A chatbot trained on enough code learns the syntax thoroughly. By contrast, natural language has fuzzier rules and more cases where multiple answers are equally valid.
The training data is biased toward the task. GitHub, Stack Overflow, official documentation, programming books, technical blogs — the corpus that flagship models train on contains heroic amounts of code relative to the share of code in general internet text. Recent models have explicit code-pretraining stages where the proportion is intentionally even higher.
Tools close the loop. Modern coding chatbots (Cursor, Copilot, Claude Code, Codex) don't just produce code; they run it. Compilation errors and test failures feed back into the model's next attempt. The loop "generate → run → repair → repeat" is the difference between "writes plausible code" and "ships working code."
Code rewards iteration. A function that compiles but doesn't pass tests is closer to "done" than a paragraph that's grammatically correct but says the wrong thing. The chatbot can keep refining, and each refinement step has a measurable success criterion.
What this means in practice: chatbots are now reliably useful for code in ways they are not reliably useful for, say, medical diagnosis or legal analysis. The asymmetry is not because the model is "smarter at code" but because code has structure and feedback signals that other domains lack. Domains with similar properties (math with proofs, formal verification, structured data manipulation) tend to work similarly well. Domains without (subjective writing, novel research, unfamiliar institutional knowledge) work less well.
For more on the production stack behind coding agents, see our agent serving infrastructure post.
Why long outputs degrade
Ask a chatbot for a short response and it's usually crisp. Ask it for a 5,000-word essay and somewhere around word 2,000–3,000, the quality starts to drift — repetition increases, the argument loses focus, factual claims get sloppier. There's an underlying mechanism.
Sampling errors compound. Each token generation is a probability distribution from which one token is drawn. A small error rate per token compounds across thousands of tokens. By token 3,000, the model is conditioning on a context that includes its own earlier sampling noise — drift from any single bad choice propagates forward.
Attention dilutes. The model attends to all prior context when predicting the next token. As the response grows, each token's "share of attention" toward the original instruction shrinks. The instruction at position 0 has less influence on the token at position 3,000 than at position 30.
The training distribution doesn't match. Pretraining data overwhelmingly consists of short-to-medium length texts. The model has seen relatively fewer examples of high-quality 5,000-word coherent outputs than of 500-word ones. Fine-tuning helps but doesn't eliminate the gap.
Repetition has gravity. Once a model has used a phrase, that phrase is more likely to reappear (the autoregressive feedback loop reinforces patterns in the recent context). Without explicit anti-repetition penalties at sampling time, long outputs tend toward loops.
Mitigations that help:
- Break long tasks into chunks with explicit handoffs.
- Use a planning + execution pattern: have the model outline first, then expand each section separately.
- For high-stakes long outputs, generate section by section and edit between sections.
- For reasoning-heavy long outputs, use a reasoning model (GPT-5, Claude Sonnet 4.6 thinking, Gemini 2.5 Pro with thinking) where the long output is the result of long internal deliberation rather than a long natural-language stream.
For technical depth, our long context attention post covers the mechanics of why attention dilutes and what production systems do about it.
Why context windows matter, and what 200K to 2M tokens means
A chatbot's context window is the maximum amount of text it can consider at once — system prompt + conversation history + any attached files + the current question. In 2026, the windows are large enough to be qualitatively different from a few years ago.
The numbers as of mid-2026 (publicly stated by vendors):
- GPT-5: around 200K tokens.
- Claude Opus 4.x / Sonnet 4.x: around 200K tokens, with a 1M-token tier announced for enterprise.
- Gemini 2.5 Pro: 1M tokens generally, 2M tokens in some access tiers.
- Llama 4 (Meta): 1M tokens for some configurations.
- Qwen 2.5-72B / Qwen 3: up to 1M tokens with extended-context configurations.
What 200K tokens is, in everyday terms. Roughly 150,000 words. About 600 pages of a typical novel. The entire text of Crime and Punishment fits comfortably. A complete medium-sized codebase (50,000–100,000 lines depending on language and verbosity) fits. A year of email or chat history fits.
What 2M tokens is. Approximately 1.5 million words. A multi-volume textbook. A several-hundred-thousand-line codebase. A patient's complete medical history. A complete book series.
Why this is qualitatively different. Pre-2024, you couldn't fit a typical user's relevant context into a chatbot. You had to summarise, chunk, or use retrieval. Now, for many use cases, you can drop the whole thing in. The product implications:
- Document analysis is dramatically more accurate.
- Codebases can be reasoned about as wholes rather than fragments.
- Long-running conversations don't need aggressive summarisation.
- RAG (retrieval-augmented generation) is less necessary for moderate-sized knowledge bases.
But context isn't free. A 1M-token prompt costs roughly 1M tokens' worth of input billing, plus the time to process. Latency on a 1M-token prompt is currently seconds-to-minutes for the prefill phase. Quality also degrades with very long contexts ("lost in the middle" effect documented across all major models).
Pragmatic guidance. For most uses, 50K–100K of context is sufficient and faster. The huge context windows are useful for specific tasks (long-document QA, full-codebase analysis) but not always optimal. The 2026 best practice is "use the smallest context that contains the relevant information."
Why chatbots apologise too much (and other RLHF artefacts)
A class of chatbot behaviours that feel slightly off-key — over-apologising, excessive hedging, refusing benign requests, opening every response with a preamble — is not a bug but a side effect of how the models were trained.
The RLHF feedback loop. During post-training, human raters (or AI judges) score model responses for helpfulness, harmlessness, and honesty. Responses that are politer, more cautious, and more hedged tend to score better on "harmlessness," sometimes at the cost of "helpfulness." Over many training iterations, the model converges on behaviours that maximise these scores even when the underlying user would have preferred a different tone.
Specific artefacts and their causes:
- Over-apologising ("I'm sorry for the confusion..."): rewarded during training as a sign of cooperativeness; persists even when no apology is warranted.
- Excessive hedging ("It's important to note that..."): rewarded as a sign of honesty about uncertainty; sometimes correct, often unnecessary.
- Refusals for benign requests: the model's safety training is conservative; false positives are preferred to false negatives in safety scoring.
- Preamble before the answer: trained pattern from datasets where high-quality responses start with framing.
- "Let me know if you have any questions" at the end: a friendly-tone artefact.
- Repetitive listicle structure: rewarded as "well-organised" during rating.
- Personality drift across conversation: as conversation grows, the system-prompt-induced persona competes with user-induced cues and may drift.
What you can do as a user. Most of these can be suppressed via prompting: "Be direct, no apologies or preambles," "Skip the disclaimers," "Be terse." Some products (Claude with custom instructions, ChatGPT with custom instructions, Gemini with custom personalities) let you set these preferences once.
What's coming. The 2026 trend is toward smarter post-training that explicitly penalises these artefacts. Anthropic's Claude 4.x release notes mention reduced sycophancy; OpenAI's GPT-5 has explicit "personality" tuning options. Expect ongoing improvement but not elimination — these patterns are sticky because they're rewarded across many training data sources.
For more on the underlying mechanics, see post-training and RLHF.
Voice mode: speech-to-speech architectures
When you talk to a chatbot in voice mode in 2026, there are two distinct architectures the product might be using, and they feel different.
Architecture A: classic pipeline (ASR → text LLM → TTS). Your speech is transcribed to text, the text goes to the language model, the model's text response is synthesised back to speech. Each stage adds latency. The model never "hears" your voice — only the transcribed text. Pros: simple, debuggable, uses the same text model for all interactions. Cons: latency is the sum of three stages (often 2–4 seconds end-to-end); paralinguistic information (tone, emotion, pauses) is lost in transcription; the synthesised voice doesn't react to the content semantically.
Architecture B: native speech-to-speech. The model is trained on audio directly. It hears your voice as input and produces speech as output, all within one model. Pros: latency under a second; emotional and tonal cues preserved; the model can laugh, hesitate, change tone mid-sentence. Cons: more expensive to train; smaller universe of training data than text; harder to debug; safety properties less well-studied.
Current products by architecture:
- OpenAI ChatGPT Advanced Voice Mode: native speech-to-speech, GPT-4o-class.
- Google Gemini Live: native speech-to-speech.
- Anthropic Claude voice mode: largely pipeline (ASR + Claude + TTS) as of mid-2026.
- Meta AI voice mode: pipeline with some native speech features.
- Microsoft Copilot voice: pipeline.
Why native speech-to-speech feels different. When the model hears your tone, it can match it. If you're frustrated, the model can detect that and adjust. If you laugh, it can laugh back. The interaction feels more conversational than the alternative. Whether this is an improvement is partly subjective; for accessibility uses (visual impairment, situations where typing is impossible), it's clearly better.
Privacy and safety implications. Voice mode sends actual audio to the vendor's servers (in most implementations). Audio is a more sensitive data type than text — voice prints are biometric, paralinguistic information reveals more about state than text does, and the data retention practices for voice are often less clearly disclosed. See AI chatbot privacy for the privacy framing.
For technical depth on the underlying multimodal stack, see multimodal serving.
The questions every user should ask their chatbot vendor
Before you commit to a chatbot for serious work (career, finance, health, family decisions), a short list of questions to actually answer. Most users skip this and regret it later.
About the model
- Which model is this product using, and at what version?
- When was this model trained (knowledge cutoff)?
- What context window is available on my plan?
- Does this model support tool use, image input, voice mode?
About my data
- Are my conversations used to train future models?
- How long are conversations stored?
- Can I delete all my conversation history?
- Is my data shared with third parties (including the underlying model provider, if the product is a wrapper)?
- Is my data accessible to vendor staff for quality assurance?
- Where (geographically) is my data stored?
About cost
- What does this cost per month, and what's included?
- Is there a per-message or per-token cap?
- What happens when I exceed limits?
- Are there hidden costs for advanced features (voice, vision, agents)?
About accuracy and reliability
- What does the model do when it doesn't know something?
- Are there areas the vendor explicitly recommends against using this model for?
- What's the policy on outdated information?
- Is there a way to flag incorrect outputs?
About vendor and product stability
- How long has the vendor been operating?
- What happens to my data and conversations if the product is discontinued?
- Is the product's underlying model committed to or can it be swapped?
- What's the upgrade path for me as a paying user?
Most chatbot products in 2026 answer roughly half of these questions in their public documentation. If you can't get answers, that's information itself.
For comparison-shopping the major flagships, see which AI should I use.
Costs, latencies, and where they come from
A practical aside on what a chatbot interaction actually costs the vendor and why latency feels the way it does.
Cost structure for a single interaction
A typical 2026 GPT-5 / Claude Sonnet 4.6 / Gemini 2.5 Pro interaction:
- Input tokens (your message + context + system prompt): 1,000–10,000 tokens, billed at roughly $1–5 per million.
- Output tokens (the model's response): 100–2,000 tokens, billed at roughly $5–20 per million.
- Per-interaction cost: anywhere from $0.002 (small interaction) to $0.10 (long input, long output).
Subscription products bundle this into monthly fees ($20 for ChatGPT Plus, $20 for Claude Pro, similar for Gemini Advanced). Heavy users may consume more than their subscription cost in API equivalents; light users subsidise heavy users.
Where latency comes from
A chatbot response has two phases:
- Prefill (processing your input). All input tokens are processed in parallel; the duration scales with input length and model size. For a 1,000-token input on a flagship model: 200–800 ms. For a 100,000-token input: 5–30 seconds.
- Decode (generating output). Tokens are generated one at a time, with each token waiting for the previous one. For 500 output tokens at 50–100 tokens/second: 5–10 seconds total, with the first token appearing within 200–500 ms.
The "first word appearing fast then a steady flow" experience is decoded-as-it-streams behaviour. The "long pause then the answer arrives" experience is non-streaming or batched processing.
How costs are dropping
Compared to two years ago, inference costs for equivalent-quality models have dropped roughly 10x. This is driven by:
- Better hardware (H100 → H200 → B200) with higher throughput per dollar.
- Better serving (continuous batching, PagedAttention, speculative decoding).
- Better quantisation (FP8 / INT8 / INT4) at minimal quality loss.
- More efficient architectures (MoE, smaller models matched in quality).
For technical depth: AI inference cost economics covers the full price stack.
What this means for users
- Free tiers will continue to grow in capability.
- Subscription pricing will continue to look "low" relative to what API access would cost a heavy user.
- Expect 1M+ token contexts to become commodity within 12–18 months.
- Voice mode will get cheaper, longer, and more accessible.
Side-by-side concept reference
A consolidated reference of concepts introduced throughout the guide, organised for skimming.
| Term | What it is | Where it shows up | When you should care |
|---|---|---|---|
| Token | A chunk of text, ~4 characters / 0.75 words | All cost & context discussion | When pricing or context limits matter |
| Context window | Max tokens model considers at once | Each model's spec sheet | When working with long docs or codebases |
| System prompt | Hidden instructions to the model | Vendor product surfaces | When behaviour seems oddly constrained |
| Temperature | Randomness knob (0–1+) | API access, some product settings | When you want consistent vs creative outputs |
| Top-p / top-k | Alternative randomness knobs | API access | Advanced sampling control |
| Pretraining | First training stage; read-the-internet | Underneath every model | Affects breadth of knowledge |
| SFT | Show-by-example training | Second training stage | Affects instruction-following |
| RLHF / DPO | Preference-based fine-tuning | Third training stage | Affects refusals, tone, hedging |
| Tool use | Model calls external functions | Search, code execution, agents | When you need fresh data or actions |
| Memory | Across-session persistence | ChatGPT Memory, Claude Projects | When you want continuity |
| Reasoning model | Thinks before answering | GPT-5, Claude thinking, Gemini thinking | Hard problems benefit |
| Multimodal | Handles images, audio, video | GPT-5, Gemini 2.5 Pro, Claude 4.x | Use case demands non-text |
| Fine-tune | Train model on your data | Specialised deployments | Lots of consistent data + domain |
| RAG | Retrieve docs into context | Internal knowledge bases | Frequent updates, smaller data |
| Prompt | What you write | Every interaction | Every interaction |
| Sampling | Picking next token | Every output | When outputs vary |
| Hallucination | Confidently wrong output | Anywhere facts matter | Always worth verifying |
| Refusal | Model declines to answer | Sensitive topics | When you hit unexpected ones |
The shortcut: tokens, context, sampling, training stages, tool use, and memory are the six concepts that explain 90% of chatbot behaviour. Master those six and most other vocabulary slots in cleanly.
Specific behaviours mapped to causes
| Behaviour you noticed | What's actually happening | What to do about it |
|---|---|---|
| Cuts off mid-sentence | Hit max output tokens | Ask it to continue, or raise output limit if API |
| Repeats earlier lines | Sampling feedback loop | Restart from a fresh context |
| Says "I can't help with that" unexpectedly | Conservative refusal classifier | Rephrase neutrally; try a different model |
| Adds long disclaimers | RLHF over-hedging | "Be direct, skip the disclaimers" in your prompt |
| Forgets earlier in conversation | Context length limit reached | Use a model with longer context or summarise older turns |
| Gets math wrong | Tokenized arithmetic + no calculator | Ask for code that computes the answer |
| Cites papers that don't exist | Pattern-matching plausible references | Always verify citations independently |
| Sounds different from last week | Vendor pushed a model update | Use the API with pinned model ID if you need consistency |
| Differs from a friend's response | Sampling randomness + memory differences | Set temperature=0 if you need reproducibility |
| Says today's date wrong | No clock unless tool provides one | Mention the date in your prompt, or use a search-enabled product |
Speed vs quality trade-offs you'll notice
| Lever | Faster | Slower | Practical effect |
|---|---|---|---|
| Smaller model | Yes | n/a | Cheaper, faster, less nuanced |
| Lower context use | Yes | n/a | Faster prefill |
| Streaming vs non-streaming | Streams sooner | n/a | Same total time, better UX |
| Reasoning mode off | Yes | n/a | Faster but worse on hard problems |
| Voice mode (native) | Yes (audio) | n/a | Lower latency; less debuggable |
| Web search tool | n/a | Slower | More recent and verifiable info |
| Agent mode | n/a | Slower | Multi-step actions, can be unreliable |
Cost surprises users hit
| Surprise | Cause | How to avoid |
|---|---|---|
| File attachments charged at full length | Long context counts as input | Trim files; use search-on-files instead |
| Voice mode hits limits faster | Audio tokens are dense | Watch the per-minute quota |
| Agents use 10x more tokens | Multi-turn tool use | Set tool-use turn limits |
| Long conversations get pricey | Full history sent every turn | Start fresh or use memory features |
| Vision uses many tokens per image | Each image is hundreds of tokens | Resize images; one image at a time |
These tables together give you a quick lookup for the most common "wait, what just happened?" moments. Bookmark and refer back as new behaviours surprise you.
Five-minute "level up" routines
These are short habits that pay back disproportionately for how little they cost in effort.
Routine 1: write your context, not your question. Spend 30 seconds writing what you already know, what you have tried, and what the constraints are, then ask the question. The same chatbot that gave you a generic answer will now give you a tailored one.
Routine 2: paste an example of what good looks like. If you want a draft email in a specific tone, paste two examples of that tone first and say "match this." If you want a code refactor in a specific style, paste two examples of the style first. Examples consistently outperform adjectives.
Routine 3: use the "explain back to me" check. When the answer matters, ask the chatbot to explain its reasoning, then explain the answer back to you in different words. Internal contradictions surface immediately.
Routine 4: ask for sources, then verify a few. If the chatbot cites three papers and you check the first one, you have a sense of whether the other two exist. If the first one doesn't exist, the rest probably don't either.
Routine 5: switch tools when one is failing. If a chatbot keeps refusing or keeps missing the point, a different product with a different training mix often nails the same task. The four flagships diverge enough in personality that one will usually fit a stuck use case.
These five routines collectively change the experience of using a chatbot from "throw question, hope" to "iterate quickly on a productive collaborator." The leverage on output quality is large.
For more on prompt-writing specifically, see how to write better prompts.
The bottom line
A chatbot is a next-token machine. Once you internalise that, every other behaviour stops being mysterious: hallucination is the predictor failing to distinguish plausible from true; cutoffs are a length budget running out; "memory" is the product stitching notes back into the prompt; the agreeable tone is reinforcement training, not understanding. The biggest lever you have is shaping the context you feed in — examples, format, and constraints — because the model only ever sees what's in front of it.
Takeaways:
- Treat the chatbot as a well-read assistant with no fact-checker, not as a search engine.
- Show, don't tell — examples beat adjectives.
- Turn on web search for anything recent or factual; otherwise expect confident guesses.
- Break long requests into small turns; long single-shot answers drift.
- Verify anything consequential — health, legal, money, citations — against a real source.
For a sibling guide that compares the four big products head to head, see which AI chatbot should I use. For why the made-up answers happen specifically, see AI hallucinations.
FAQ
Is the chatbot actually intelligent? That depends on what you mean by intelligent. It's astonishingly good at things that look like intelligence — explaining, reasoning, writing — but it does them by predicting words, not by understanding the way a human does. It can solve problems it has never seen before only to the extent that those problems pattern-match to things in its training data. Whether that counts as intelligence is a philosophical question; for practical purposes, treat it as an extremely well-read but easily fooled assistant.
Is it just Googling things? No, not by default. It's reciting patterns from what it read during training, months ago. When you turn on search (ChatGPT Search, Claude with web search, Gemini in Google products, Perplexity), it actually looks things up. Without search, you're getting auto-complete from memory.
Can it learn from our conversation? Not in real time. The model itself doesn't change while you talk to it. Some products take your conversations to improve future versions in periodic training updates (you can usually opt out). Memory features write down notes that get pulled into future chats with you specifically — but again, that's product-level, not the model "learning."
Why does it sometimes refuse things that seem fine? Safety training. The model was trained to refuse certain categories of requests, and it sometimes over-refuses on borderline ones. If you think the refusal is wrong, you can usually re-ask with more context ("this is for a creative writing project") or try a different product.
Why does it agree with me when I'm wrong? Trained habit. The model was rewarded during training for being helpful and agreeable. It can be talked into wrong answers if you push hard. If you want it to challenge you, say so explicitly: "play devil's advocate" or "be skeptical of my reasoning."
Is paid worth it? For occasional use, the free tier of most products is fine. The paid plans (around $20/month for ChatGPT Plus, Claude Pro, Gemini Advanced) give you more usage, longer responses, access to better models, and features like file analysis, image generation, and longer memory. If you use it daily for work, yes.
Which one should I use? ChatGPT for the broadest features and integrations. Claude for writing, long documents, and code (Claude users rave about the writing quality). Gemini if you live in Google's ecosystem. Copilot if you live in Microsoft 365. Try all of them on the free tier — they have different personalities and you'll have a preference.
Will it replace my job? Probably not entirely, more likely it changes your job. The pattern so far: AI is good at the parts of jobs that involve writing, summarising, drafting, simple coding, and explanation. It is not good at judgment, accountability, knowing what's true, or doing things in the physical world. Most jobs are some mix. The mix is shifting.
Are my conversations private? Depends on the product and the plan. Free tiers usually train on your conversations unless you opt out. Paid consumer tiers typically don't train on your data by default (this changed across products in 2024–2025). Enterprise tiers have stricter contracts. Always check the product's data policy.
Why are there so many AI products now? Because the underlying technology became commodity in 2023–2024. The base models (from OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek) are now available to wrap into any product. Most "AI startups" in 2026 are wrappers around one of these models plus a specific use case (writing assistant, coding tool, customer support, image generator, etc.).
What's an "agent"? An agent is a chatbot that can do things, not just answer. It can use tools (search, calculators, write to your calendar, run code) and chain multiple steps together to accomplish a task. "Make a reservation for two on Friday" — a chatbot writes a reply about how to make a reservation; an agent actually books one. Agents are real but still rough in 2026; expect them to be a much bigger part of the AI product landscape over the next few years.
Why does the same question give different answers each time? There's a small amount of randomness built into the generation process — the model doesn't always pick the single most likely next word, it samples from a few high-probability options. This is by design; it makes the output less robotic. The downside is the same question can give two different answers, even on the same day. If you want consistency, set the "temperature" to 0 in the API, or explicitly say "give the same answer each time you're asked this."
Why does ChatGPT sometimes refuse questions Claude or Gemini will answer? Each company sets its own safety rules and the models are trained to enforce them. The rules diverge — Anthropic, OpenAI, Google, and Microsoft each draw the line in slightly different places. Refusal patterns also change with new model versions; what one chatbot refused last year it might answer today, and vice versa. If a refusal seems wrong, try rephrasing or try a different chatbot.
Is the chatbot reading my mind / can it tell my mood? It can pick up tone from your words ("I'm frustrated with this code" — the model notices and responds more carefully) but it isn't reading anything beyond what you type. Voice mode adds tone-of-voice signal; image input adds whatever's in the image. No telepathy. No microphone access without permission. The personalisation you see is the model adapting to what you literally wrote.
Can I trust the chatbot with my health / legal / financial questions? Trust it the way you'd trust a knowledgeable friend who reads a lot — useful for explaining concepts and helping you formulate questions, not a substitute for a doctor, lawyer, or accountant. Mistakes on these topics are higher-stakes than mistakes on a recipe; verify everything, and use professionals for decisions that matter.
Why does it sometimes start outputting code in random places? Pattern matching going slightly wrong. The model saw enough examples of "explain this with a code snippet" that some questions trigger code mode incorrectly. Tell it "in plain English, no code" and it'll comply.
How is GPT-5 different from GPT-4o? GPT-5 (released late 2024 / early 2025 depending on tier) is OpenAI's next-generation flagship — larger, trained on more data, generally smarter on hard problems. GPT-4o is the older flagship, still widely used because it's cheaper and faster. Most consumer ChatGPT users get GPT-5 on Plus/Pro tiers by default; the free tier still uses GPT-4o or GPT-4o mini.
What's Claude 4.6 vs Claude Opus 4.x? Sonnet 4.6 is Anthropic's mid-tier model — fast, smart enough for most tasks, what Claude Pro users get most of the time. Opus 4.x is the flagship — slower and more expensive, for harder problems. Haiku 4.5 is the cheap, fast tier for simple questions. Claude with "extended thinking" turned on is any of those models with reasoning mode enabled, which trades speed for depth.
What's Gemini 2.5 Flash vs Gemini 2.5 Pro vs Gemini Deep Think? Flash is fast and cheap; Pro is the flagship; Deep Think is the reasoning model (Google's equivalent of OpenAI's o-series or Claude with extended thinking). In the consumer Gemini app on the free tier, you get Pro most of the time; Advanced gets you more Pro and access to Deep Think.
Does Copilot use ChatGPT under the hood? Mostly yes. Microsoft has a deep partnership with OpenAI; most of Copilot's smarts come from OpenAI's models (GPT-4o, GPT-5). Microsoft also runs its own smaller models for specific tasks, and is building independent capability over time. The user-visible Copilot personality and behaviour come from Microsoft's product layer on top of OpenAI's models.
Why can't I just download the model and run it locally? You can with some of them. Llama 4 (Meta), Qwen 3, DeepSeek V3, Mistral models, and Gemma (Google) are open-weight — you can download them and run them on your own hardware. The setup is technical and the hardware bar isn't trivial (a 70-billion-parameter model needs a high-end GPU). ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google) are not downloadable; their weights are private. For everyday use, the hosted versions are dramatically more convenient.
What does "fine-tuning" mean? Taking a trained model and doing extra training on a specific dataset — your company's customer-support tickets, your writing style, a domain like medicine. The model keeps most of its general knowledge but shifts toward the new data. Major chatbots offer this for paying API customers; it's not usually a consumer feature.
Can the chatbot see my screen? Only if you explicitly give it permission. ChatGPT has a "screen sharing" feature; Copilot has access to whatever app you're inside; some agents can take screenshots of pages they're operating on. None of them have ambient access to your screen without explicit consent.
Why does it write in different styles depending on what I ask? Trained habit. The model saw professional writing, casual writing, code comments, poetry, marketing copy, and academic writing in training. It learned to match the register of the question — a formal question gets a formal answer, a casual question gets a casual one. You can override the default by asking explicitly: "Reply in casual tone with short sentences."
Will I get a refund if the chatbot gives me wrong advice and I act on it? Probably not. Every consumer chatbot's terms of service disclaim responsibility for the accuracy of outputs. Treat it as a tool, not an authority. For consequential decisions, use a professional.
What's the difference between a chatbot and an LLM? The LLM (large language model) is the math model — the "brain." A chatbot is the consumer product built around it: the chat interface, the memory features, the file-upload tools, the search integration. Many products share the same underlying LLMs.
Why does it sometimes say "I can't help with that" for things that seem totally normal? Safety training is calibrated conservatively, and it sometimes catches benign queries that pattern-match to something the company decided is risky. Examples: medical dosage questions (even for pets), security-related coding questions, legal advice. The fix is usually to add context — "this is for my own dog with vet supervision," "this is for a CTF practice problem," "I'm not asking for legal advice, just an explanation of how X law works."
Extended FAQ
What's the difference between "attention" and "memory"? Attention is the mechanism inside the model that lets it look at earlier tokens in the same conversation — it works only within the current context window. Memory is a product feature that stores facts about you across conversations. Attention is mathematical; memory is a notes file.
Why do models say "as a large language model" so much? Trained pattern. The phrase was over-represented in RLHF responses early on. Newer models say it less but the habit persists. You can ask it to drop the phrase and it usually will.
Is "the model" the same as "the chatbot"? The model is the math underneath. The chatbot is the product wrapped around it — UI, memory, tools, account system. ChatGPT is a chatbot; GPT-5 is a model. Multiple chatbots can use the same model.
How does the model know to stop generating? A special end-of-turn token. During training, the model learned that conversation responses end with a particular invisible token. When it predicts that token, the product stops the stream. The model can also hit a hard length cap (typically 4k–16k tokens for a single response) before reaching its natural stopping point.
Does the model "remember" within a single conversation? It re-reads the entire conversation history on every turn. There is no persistent state between turns; the model's "memory" of the conversation is just that the full history is in its context window each time.
Why does the model sometimes lose its place in long conversations? Two reasons: (a) older messages can fall out of the context window if the conversation gets long enough, and (b) attention to distant tokens is weaker than attention to recent ones, so even within the window the model gets fuzzier on early content. The "lost in the middle" effect is real and quantifiable.
Can the chatbot lie on purpose? Not in the way a human lies (intentional deception). It can produce false statements confidently because it can't tell true from false. In some adversarial settings (jailbreaks, role-play prompts) it can produce statements its safety training would normally prevent — that's the closest analog to "lying," but it's still mechanically the same next-token prediction.
Why do reasoning models sometimes spend tokens "talking to themselves"? Because they were trained to. The training reward signal benefited models that produced long reasoning traces before answering. The chain-of-thought is the actual work; the visible answer is the summary.
Why is voice mode so much better than IVR? Two reasons: (a) the underlying LLM is genuinely capable, while IVR systems used simple rules, and (b) the latency is low enough to feel like real conversation. Phone-tree IVR has 5–10 second response delays; voice mode has 200–500 ms.
Can I make the chatbot completely deterministic? Mostly. Set temperature to 0 in the API and disable any sampling. There will still be tiny floating-point non-determinism from the GPU (results can differ in the 6th decimal place), but for practical purposes you get repeatable outputs. The downside: deterministic outputs are also more boring and less creative.
Are open-weight models like Llama and DeepSeek as good as ChatGPT? Close, on most benchmarks. Llama 4 and DeepSeek V3 perform within 5–15% of frontier closed models on broad evals as of 2026. The gap is narrower on tasks they were trained for (coding, math) and wider on niche capabilities. For consumer chat, open-weight is competitive; for cutting-edge agent work and reasoning, frontier closed models still lead by a few months.
Why does the chatbot sometimes "agree to disagree" or change its answer when I push back? Trained to be agreeable. Pushback signals "the user is unhappy," and the model adjusts toward satisfaction rather than truth. To get a chatbot that pushes back on you, instruct it explicitly: "If I'm wrong, tell me. Don't agree to keep me happy."
Are there chatbots for non-English languages that are better than the big four? Yes for some languages. Qwen (Alibaba) is strongest in Chinese; Yi-Large (01.AI) competitive in Chinese; Aya (Cohere) trained for 100+ languages with strong low-resource performance; Mistral and Llama have strong French/Spanish; SEA-LION strong for Southeast Asian languages. For widely-spoken languages, the big four are competitive but not always best.
Why don't chatbots cite sources by default? Because they can't reliably. Without search turned on, any "citation" is fabricated. Even with search, citations can be wrong (the chatbot misattributes a claim to a source it didn't actually come from). Newer products (Perplexity, ChatGPT with search, Claude with web search) are improving this, but always click through and verify.
What's "in-context learning"? The chatbot's ability to learn from examples you provide in the same conversation. Give it three examples of "the format I want," and it will produce more examples in that format — without any retraining. In-context learning is one of the surprising emergent properties of large language models; smaller models can't do it nearly as well.
Can chatbots really code, or do they fake it? They really code, with caveats. The code they produce is genuinely functional for common patterns, libraries, and small programs. For larger or unfamiliar codebases, they make mistakes — confidently calling APIs that don't exist, mixing up library versions, missing important context. Treat AI-generated code like junior-developer code: review it before merging.
Why is the response sometimes slower for the same question? Server load. Hosted chatbot services use load balancers; busy times spread requests across more GPUs but with longer queues. Free tiers get throttled before paid tiers. Reasoning models are inherently slower regardless of load.
Can the chatbot access my files / accounts without permission? No. The model has no ambient access. Anything it can "see" was put in front of it by you (uploaded file, pasted text) or by an explicitly-connected tool (Copilot's connection to your Microsoft 365, ChatGPT's connection to Google Drive if you authorised it). Without those, it can't reach your data.
Why does the chatbot say "I don't have access to the internet" sometimes when other times it does? Web search is a tool the model can call. Some product modes have it enabled (ChatGPT Search, Gemini in Google Search, Perplexity always), some don't (default ChatGPT plain chat, Claude without web tools turned on). The model knows which tools are available at the start of each conversation and replies accordingly. If you want web access, look for the search toggle in the chat UI.
Why are some models much faster than others at the same task? Three reasons. (a) Model size — smaller models are faster. (b) Hardware — Groq's LPU and Cerebras's WSE-3 run open-weight models at 5–10× the speed of equivalent H100 deployments. (c) Optimisations — providers use techniques like speculative decoding, batching, and KV caching to varying degrees. The user-visible speed difference between "fast" and "slow" providers can be 10× for the same model.
Can I get a chatbot to write in my voice? With enough examples, yes. Provide 5–10 samples of your writing in the prompt and ask the chatbot to match the style. For consistent voice across many uses, fine-tune a model on a larger corpus of your writing — Claude, ChatGPT, and Gemini all support API fine-tuning for this. Custom GPTs and Projects with a style guide also work reasonably well without full fine-tuning.
Does the chatbot understand humour, sarcasm, idioms? Mostly yes, in the languages and cultures well-represented in training data (English, major European languages, Mandarin). Subtler humour, regional slang, and idioms from underrepresented languages get missed more often. Sarcasm works as long as the context makes it clear; without context, sarcasm sometimes reads as sincerity to the model. Reasoning models tend to over-explain jokes rather than land them.
Why do new model versions sometimes feel worse than old ones? Trade-offs in training. A new version might score higher on benchmarks but feel different in conversation — a different tone, more or fewer hedges, different formatting preferences. Some users prefer the old behaviour. Anthropic and OpenAI both got pushback on personality shifts in late 2024 / early 2025 model updates. The objective measures (benchmarks) and the subjective measures (feel) don't always align.
What's the difference between a "base model" and a "chat model"? A base model has gone through pretraining only — it's a fluent next-token predictor on internet text. Asking it a question gets you a continuation, not necessarily an answer. A chat model has additionally been through SFT (showing it what helpful answers look like) and RLHF/DPO (rewarding helpful, harmless, honest behaviour). Almost every product you interact with is a chat model. Open-weight base models (Llama base, Mistral base) are released for researchers and fine-tuners; chat variants (Llama Instruct, Mistral Instruct) are the consumer-facing form.
Why does ChatGPT sometimes give different answers to the same question? Sampling randomness. Unless temperature is set to 0, each generation samples from a probability distribution, producing different but typically related outputs. Even at temperature 0, system load, model version drift, or A/B testing can cause variation. The variance is a feature for creative tasks and a bug for tasks where you want deterministic answers. If you need consistency, use the API with temperature=0 and a fixed seed where supported.
Why does my chatbot sometimes get worse at coding mid-conversation? Two reasons. First, context dilution — as conversation grows, the original code-task framing has less attention weight, and the model may drift toward chatty rather than precise modes. Second, repetition gravity — if you've copy-pasted similar code several times, the model starts matching style rather than thinking from first principles. Mitigations: start a fresh conversation for unrelated coding tasks, paste only the specific code you need, and explicitly remind the model of constraints when they matter.
Can a chatbot actually do math? Reasoning-capable flagships (GPT-5 thinking, Claude Sonnet 4.6 thinking, Gemini 2.5 Pro thinking) handle algebra, some calculus, and bounded competition problems well. They're unreliable on long arithmetic, anything requiring exact numerical computation, and problems with subtle constraints. The practical answer for production math: use a chatbot to frame the problem and write code that solves it (using a Python interpreter tool), rather than asking for the numerical answer directly.
What's "in-context learning" and why does it matter for users? In-context learning is the model's ability to pick up patterns from examples in the prompt without any further training. If you show 3–5 examples of input-output pairs in your prompt, the model will often complete the next input correctly even if the task is novel. This is the basis of "few-shot prompting" and is why "show me an example" works so well. It also explains why showing the model how you want output formatted is dramatically more reliable than describing it.
Are chatbots biased? Yes, in ways that mirror their training data. The internet contains the biases of its authors; pretraining absorbs them; RLHF mitigates some but not all. The biases are visible in: which demographic perspectives appear by default, which professions get gendered defaults, which historical narratives are framed how, and which cultural norms are treated as universal vs particular. Anthropic, OpenAI, and Google all publish model cards or system cards documenting known biases; they are worth skimming if you're using these models for sensitive decisions.
Why can't a chatbot tell me what it doesn't know? Because the model has no internal "knowledge inventory" to consult. From the model's perspective, generating a confident answer and generating a hedge are produced by the same mechanism. Some reasoning models will, when explicitly asked to assess their certainty, produce calibrated estimates — but the underlying confidence signal is weak. For high-stakes uses, treat all chatbot factual claims as unverified.
What does "model card" or "system card" mean and should I read them? A model card / system card is a vendor-published document describing the model's capabilities, training data (in vague terms), safety properties, intended uses, known limitations, and evaluation results. OpenAI, Anthropic, and Google all publish these for major model releases. They're worth a 15-minute skim before using a model for serious work — they tell you what the vendor expects the model to be good and bad at, which is information you can't easily get elsewhere.
Why do chatbots sometimes contradict themselves within one response? Three reasons. First, the model doesn't plan ahead — each token is generated without strong commitment to a global structure. Second, the model may have absorbed both sides of a debate in training and not have a "true" position. Third, fine-tuning on diverse RLHF data can introduce inconsistent preferences. Mitigation: ask for a structured response (numbered points, with reasoning before conclusion) or use a reasoning model that explicitly works through the answer.
How is a "small model" different from a "big model" in practice? Smaller models (1B–8B parameters) run faster, cost less, and can run on consumer hardware. They handle short, well-defined tasks well — summarisation, classification, simple Q&A. Larger models (70B–500B+) handle nuanced reasoning, long contexts, complex multi-step tasks. For most consumer-product interactions, a 70B-class model is over-served by a flagship. Hosted products give you the flagship anyway; the cost difference shows up in API and self-hosted deployments.
What is "alignment" and why do people talk about it so much? Alignment is the project of getting the model's behaviour to match what humans actually want. It includes safety (don't produce harmful content), helpfulness (actually answer the question), honesty (don't lie or hedge unnecessarily), and consistency with declared values. RLHF, DPO, Constitutional AI, and various adversarial-training methods are alignment techniques. The "alignment problem" in the abstract is whether we can keep doing this reliably as models get more capable. In daily use, alignment is what makes the chatbot feel like a usable assistant rather than a confusing text predictor.
Why does my chatbot keep adding emoji and exclamation points? A tone that performs well on RLHF preference ratings for many users, but feels off to others. Most products let you suppress this via custom instructions: "Don't use emoji or exclamation points; be direct and professional." Setting this once typically persists across conversations for the major products.
What's the difference between Custom GPTs, Projects, Agents, and Assistants? Mostly product-specific naming for similar ideas. A Custom GPT (OpenAI) is a saved configuration — system prompt, tools, files — that you can reuse. A Project (Claude, ChatGPT) is a conversation container with persistent knowledge files. An Agent is a Custom GPT / Project plus autonomous tool-use. An Assistant (OpenAI's older API name) is the developer-facing version. The underlying mechanism is the same: take a base model, add a system prompt, optionally attach files for retrieval, optionally attach tools for actions.
Can a chatbot replace a search engine? Sometimes, with caveats. For "tell me about X" questions, a flagship with web-search tool enabled is often more useful than a traditional search engine — it integrates information from multiple sources and answers the actual question. For "find me the page that says Y" questions, traditional search is still better. For research and learning, the chatbot's tendency to confidently summarise (sometimes incorrectly) means you should treat its answers as a starting point, not the destination. Most flagship products now have search built in (ChatGPT Search, Perplexity, Gemini with Google Search grounding).
Why do prices keep dropping for the same model quality? A combination of better hardware (each GPU generation roughly doubles useful throughput), better software (continuous batching, PagedAttention, speculative decoding), better quantisation (FP8/INT4 with minimal quality loss), and competition among providers. Inference costs for equivalent-quality models in mid-2026 are roughly 1/10 of mid-2024 levels. The trend is expected to continue, though the rate of decline is moderating. See AI inference cost economics.
What happens when I "regenerate" a response? The product sends the same prompt (system + history + your question) back to the model, generates with new sampling randomness, and shows you a different output. The randomness is in the sampling step; the model itself is deterministic given fixed inputs and seed. Regeneration is useful when the first response was off-style; it's not useful as a way to "check" the model — the second response can be wrong in the same or different ways as the first.
Do chatbots understand what I want, or just respond to keywords? Somewhere in between. The model encodes a rich representation of your input, not just keywords — that's why it handles paraphrasing, indirect requests, and contextual references well. But "understanding" in the human sense (with goals, beliefs, and grounded reference) is not what's happening. The right framing is: the model produces what would be a reasonable continuation given everything it has seen. For most interactions that's indistinguishable from understanding; for edge cases the difference shows up.
Why doesn't the chatbot just tell me when its knowledge is out of date? Because the model often doesn't know what's outdated. The training data has a cutoff date but the model doesn't have a clean internal record of "this fact is from 2023 and may have changed." Some products inject a system message reminding the model of its cutoff; some models are trained to mention it when relevant. The robust workflow: when you ask about something time-sensitive, use a model with web search enabled, or explicitly verify the answer elsewhere.
Will GPT-6 / Claude 5 / Gemini 3 be qualitatively different from today's models? Hard to predict reliably. The trend over the last 3 years has been steady capability improvement with occasional step changes (GPT-3 → 4, the rise of reasoning models). Plausible next-generation features: longer reliable reasoning, better agent stability, larger and faster context, more native multimodality, lower cost per useful task. Less plausible in a single generation: human-level reasoning across all domains, full autonomy on real-world tasks, robust generalisation to unfamiliar formats. Hedge expectations appropriately.
Glossary
- Chatbot / AI assistant — A product that lets you have a conversation with a large language model. ChatGPT, Claude, Gemini, Copilot.
- Context window — How much text the chatbot can hold in one conversation, measured in tokens.
- Hallucination — When the chatbot confidently makes something up.
- Knowledge cutoff — The date when the model stopped reading. It doesn't know about anything after that, unless connected to search.
- Large language model (LLM) — The mathematical model under the hood. The "brain" of the chatbot.
- Memory — A feature where the product remembers facts about you across conversations.
- Prompt — Your message to the chatbot. Also the system instructions the product gives the model behind the scenes.
- Token — A chunk of text the model sees. Roughly a word, sometimes part of a word.
- Training — The months-long process of teaching the model from internet text, then teaching it to be a helpful assistant.
- Reasoning model — A model trained to think step by step before answering. Slower, more expensive, better at math and complex problems. Examples: OpenAI o3 / o4, Claude with extended thinking, Gemini Deep Think.
- System prompt — Hidden instructions the product gives the model before your conversation starts. Shapes the model's personality and behaviour. Different products have very different system prompts.
- Fine-tuning — Additional training of a model on a specific dataset, to specialise it for a task or style.
- Agent — A chatbot that can use tools (search, code, calendars) and take multi-step actions, not just produce text.