What an LLM Is and How It Predicts the Next Word - Xap.es

ChatGPT, Claude, Gemini, Llama, Mistral. They are all variants of the same technological family: Large Language Models, or LLMs. The most important question you can ask about them is not how many parameters they have or how much they cost to train. It is this: what are they actually doing when they generate text?

The answer is simpler than it seems, and more important than most people assume.

The central idea

An LLM is, in its most stripped-down essence, a text-completion machine. Given a fragment of text — the context — it predicts what the most probable continuation is.

That is all. It does not reason, understand, or consult a database of verified facts. It takes the text it has and calculates, for each next position, a probability distribution over all possible words (or tokens). Then it chooses one — sometimes the most probable, sometimes one of the most probable with some randomness — and adds it to the context. Then it repeats the process.

This iterative prediction mechanism, applied with sufficient sophistication and trained on enough data, produces text that can seem reasoned, informed, even brilliant. But the mechanism has not changed: it is token prediction.
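The loop described above can be sketched in a few lines of Python. This is only an illustration: TOY_MODEL is a hypothetical hand-written table standing in for a trained network, and unlike a real LLM it looks only at the last token instead of the whole context.

```python
import random

# Hypothetical stand-in for a trained model: for each last token, a
# probability distribution over possible next tokens.
TOY_MODEL = {
    "The": {"capital": 0.9, "city": 0.1},
    "capital": {"of": 1.0},
    "city": {"of": 1.0},
    "of": {"France": 0.8, "Spain": 0.2},
    "France": {"is": 1.0},
    "Spain": {"is": 1.0},
    "is": {"Paris": 0.95, "Lyon": 0.05},
    "Paris": {"<end>": 1.0},
    "Lyon": {"<end>": 1.0},
}


def next_token_distribution(context):
    """Return the next-token probabilities for the given context."""
    # Simplification: a real model conditions on every token in the context.
    return TOY_MODEL[context[-1]]


def generate(context, max_new_tokens=10, greedy=True, rng=None):
    """Repeatedly predict one token and append it to the context."""
    rng = rng or random.Random(0)
    context = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        if greedy:
            # Always take the most probable token.
            token = max(dist, key=dist.get)
        else:
            # Sample a token in proportion to its probability.
            candidates = list(dist)
            weights = [dist[t] for t in candidates]
            token = rng.choices(candidates, weights=weights)[0]
        if token == "<end>":
            break
        context.append(token)
    return context


print(" ".join(generate(["The"])))  # The capital of France is Paris
```

With greedy=False the same loop samples from each distribution, so it can occasionally produce a fluent but false sentence such as "The capital of Spain is Paris".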

From tokens to probabilities

The LLM does not process words directly. It processes tokens: text fragments of variable length defined by a process called tokenisation. A token might be a complete word, part of a word, or even punctuation. As a rough general rule, 1 token ≈ 0.75 words in English.
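To make the idea of variable-length tokens concrete, here is a minimal sketch of a greedy longest-match tokeniser. The vocabulary is hypothetical and written by hand. Real tokenisers learn their vocabulary from data (for example with byte-pair encoding) and split text differently, so this should be read as an illustration of the idea rather than any specific model's tokeniser.

```python
# Hypothetical hand-written vocabulary; real vocabularies are learned from data.
VOCAB = {"token", "isation", "un", "believ", "able", "is", " ", "."}


def tokenise(text, vocab=VOCAB):
    """Split text into tokens by always taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenise("tokenisation is unbelievable."))
# ['token', 'isation', ' ', 'is', ' ', 'un', 'believ', 'able', '.']
```

Three words become nine tokens here, which is why token counts and word counts differ.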

A model’s vocabulary typically contains from tens of thousands to a few hundred thousand tokens. For each position in the sequence, the model calculates a score (a logit) for every token in the vocabulary, converts those scores to probabilities using a function called softmax, and samples from the resulting distribution.
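The conversion from scores to probabilities can be sketched as follows. The scores are invented for illustration and cover only four tokens; a real model produces one score for every entry in its vocabulary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the context "The capital of France is".
scores = {"Paris": 6.0, "Lyon": 2.0, "a": 1.5, "the": 1.3}
probs = dict(zip(scores, softmax(list(scores.values()))))

for token, p in probs.items():
    print(f"{token!r}: {p:.3f}")
```

Softmax preserves the ranking of the scores: the token with the highest score always receives the highest probability.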

Temperature is the parameter that controls how much randomness is injected into that sampling. The scores are divided by the temperature before softmax is applied. With temperature 0, which implementations treat as a special case, the model always chooses the most probable token, so outputs are predictable and often repetitive. With high temperature the distribution becomes flatter, so outputs are more varied and creative but also more prone to errors.
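A minimal sketch of temperature sampling, assuming the common convention that scores are divided by the temperature before softmax and that temperature 0 means "pick the top token". The scores and function names are hypothetical.

```python
import math
import random


def probabilities(scores, temperature):
    """Softmax over scores divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive here")
    scaled = [s / temperature for s in scores]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, scores, temperature, rng):
    """Pick the next token; temperature 0 is treated as greedy selection."""
    if temperature == 0:
        return tokens[scores.index(max(scores))]
    weights = probabilities(scores, temperature)
    return rng.choices(tokens, weights=weights)[0]


# Hypothetical scores for four candidate tokens.
tokens = ["Paris", "Lyon", "a", "the"]
scores = [6.0, 2.0, 1.5, 1.3]

for t in (0.5, 1.0, 2.0):
    top = probabilities(scores, t)[0]
    print(f"temperature {t}: P('Paris') = {top:.3f}")

rng = random.Random(0)
print(sample_token(tokens, scores, 0, rng))  # Paris
```

Raising the temperature lowers the probability of the top token and spreads the remainder over the alternatives, which is where the extra variety and the extra errors both come from.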

Simplified token prediction example:

Context: "The capital of France is"
Probabilities (simplified):
  "Paris"    → 94.2%
  "Lyon"     → 1.8%
  "a"        → 1.1%
  "the"      → 0.9%
  [others]   → 2.0%

The model chooses "Paris". The new context is:
"The capital of France is Paris"
→ The process repeats for the next token.

The transformer inside

The architecture that makes all this possible is the transformer. Its critical component is the attention mechanism: a system that allows the model, when predicting each token, to take into account all previous tokens in the context and weight which ones are most relevant to that prediction.

When the model processes “the bank where I keep my money was closed,” the attention mechanism can associate “closed” with “bank” (the financial institution) and not confuse it with a riverbank. It can make that association even when the tokens are separated by several words.
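The weighting step can be sketched as scaled dot-product attention in plain Python. The two-dimensional query, key, and value vectors below are invented by hand so that "closed" points at "bank". In a real transformer they are produced by learned projections and have hundreds or thousands of dimensions.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """For each position, mix the values of that position and all earlier ones."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Position i may only look at positions 0..i (the previous tokens).
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for three tokens.
tokens = ["bank", "was", "closed"]
queries = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(queries, keys, values)
for token, w in zip(tokens, weights[2]):
    print(f"'closed' attends to {token!r} with weight {w:.2f}")
```

With these vectors, "closed" puts most of its weight on "bank" even though another token sits between them, because the weight depends on how well the query matches each key and not on distance.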

Modern transformers have many attention “heads” in parallel, each learning to capture different types of relationships between tokens: syntactic, semantic, referential. The output of all those heads is combined to produce a rich representation of each token in its context.
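As a rough sketch of how heads run in parallel, the example below splits each token vector into equal slices, runs attention separately on each slice, and concatenates the results. It is deliberately simplified: real transformers give every head its own learned query, key, and value projections and apply a final output projection, all of which are omitted here. The vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Causal self-attention using the vectors as queries, keys, and values."""
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * vectors[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ])
    return outputs


def multi_head_attention(vectors, num_heads):
    """Run attention independently on each slice, then concatenate."""
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        head_outputs.append(attention([v[start:stop] for v in vectors]))
    # Concatenate the per-head outputs for each position.
    return [
        [x for head in head_outputs for x in head[i]]
        for i in range(len(vectors))
    ]


# Hypothetical 4-dimensional vectors for three tokens, split across 2 heads.
vectors = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 1.0, 0.0], [2.0, 0.0, 0.0, 1.0]]
for row in multi_head_attention(vectors, num_heads=2):
    print([round(x, 3) for x in row])
```

Because each head sees a different slice, each one computes its own attention weights, which is what allows different heads to track different relationships.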

Scale: the secret ingredient

What turned transformers from a promising architecture into the technology that is redefining entire sectors was scale.

OpenAI researchers published a study on “scaling laws” in 2020 that showed something surprising: the loss of language models falls in a smooth and predictable way, approximately as a power law, when three quantities are increased together: number of parameters, amount of training data, and compute.
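One way to see what "predictable" means is a power law of the form L(N) = (N_c / N)^alpha, where N is the number of parameters. The sketch below uses that form with illustrative constants chosen for this example. They are assumptions, not the fitted values from the study, and the printed losses should not be read as real measurements.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss predicted by L(N) = (n_c / N) ** alpha.

    n_c and alpha are illustrative assumptions, not published fits.
    """
    return (n_c / n_params) ** alpha


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> predicted loss {power_law_loss(n):.3f}")

# Under a power law, every tenfold increase in size multiplies the loss
# by the same factor, 10 ** -alpha, however large the model already is.
factor = power_law_loss(1e9) / power_law_loss(1e8)
print(f"loss factor per 10x parameters: {factor:.3f}")
```

The constant factor per tenfold increase is what made it possible to estimate the benefit of a larger model before training it.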

This was different from what had happened with previous architectures, where gains plateaued beyond a certain point. With transformers, more was more. That triggered the model race that characterised 2020–2024.

GPT-3 (175 billion parameters) made visible what came to be called emergence: capabilities that were weak or absent in smaller models appeared without the model being explicitly trained for them, including simple arithmetic, analogies, and basic reasoning. The model was not taught to do sums. It could do simple ones because the patterns were in its training text.

What prediction is not

Understanding that an LLM predicts tokens has practical consequences that matter every time you use one of these models.

There is no fact-checking. The model does not consult any database when generating text. Its “knowledge” is encoded in its parameters, which are fixed once training ends. If it tells you something incorrect with total confidence, it is not “wrong” in the human sense: it chose the most probable tokens given the context, and those tokens happened not to correspond to the facts.
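A toy model makes this point tangible. The sketch below "trains" a bigram model by counting which word follows which in a tiny, deliberately skewed corpus invented for this example. It is far simpler than an LLM, but it shares the relevant property: it reports what was frequent in its training text, not what is true.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict(counts, word):
    """Return next-word probabilities, most probable first."""
    following = counts[word]
    total = sum(following.values())
    return {w: c / total for w, c in following.most_common()}


# Hypothetical training text in which a common mistake outnumbers the fact.
corpus = [
    "the capital of Australia is Sydney",
    "the capital of Australia is Sydney",
    "the capital of Australia is Sydney",
    "the capital of Australia is Canberra",
]

counts = train_bigram(corpus)
print(predict(counts, "is"))  # {'Sydney': 0.75, 'Canberra': 0.25}
```

The model answers "Sydney" with 75% probability. Nothing in the prediction step checks that Canberra is the actual capital, and the counts do not change after training.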

There is no guaranteed reasoning. When the model “reasons” out loud, what it does is generate text that resembles human reasoning, because human reasoning was in its training data. Sometimes that process produces the correct answer. Sometimes it does not. There is no guarantee.

There is sensitivity to context. What the model generates depends strongly on how the context — the prompt — is formulated. The same question formulated differently can produce radically different responses. That is not a bug: it is a direct consequence of the context-based prediction mechanism.

The next time you use an LLM and are surprised by its apparent intelligence, remember: it is predicting tokens. That this produces outputs that seem intelligent is extraordinary. That it is not the same as real intelligence also matters.