How LLMs Work — Interactive Lab
A language model cannot process text. It processes numbers. Tokenization is the conversion layer: raw text in, integer sequences out. Every character you type passes through a tokenizer before the model sees it.
Most modern LLMs use Byte Pair Encoding (BPE). The algorithm starts with individual characters as tokens, then iteratively merges the most frequent adjacent pair into a new token. After tens of thousands of merges, common words become single tokens while rare words are split into subword pieces.
The vocabulary is fixed at training time. Modern frontier models typically use vocabularies of 100,000 to 200,000 tokens. The tokenizer is not learned during training; it is built beforehand from the training corpus statistics.
BPE operates on byte sequences, not Unicode characters. This is why it handles any language or encoding without an explicit character set. The merge table is a deterministic ordered list: merge #1 might be t + h → th, merge #2 might be th + e → the, and so on. At inference time, the tokenizer applies merges greedily in priority order.
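A minimal sketch of the greedy application, assuming a toy merge table (real tables hold tens of thousands of ranked merges and operate on bytes):

```python
# Toy BPE encoder. The merge table is illustrative, not from a real tokenizer;
# rank = position in the list, and lower-rank merges are applied first.
merges = [("t", "h"), ("th", "e"), ("e", "r")]

def bpe_encode(word: str) -> list[str]:
    tokens = list(word)              # start from single characters
    for pair in merges:              # priority order: rank 0 first
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i : i + 2] = ["".join(pair)]   # merge in place
            else:
                i += 1
    return tokens

print(bpe_encode("there"))   # ['the', 'r', 'e']
```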
Token boundaries matter for model behaviour. The same word tokenized differently (because of surrounding context, capitalisation, or spacing) produces different embeddings and therefore different downstream computations. This is one reason LLMs handle code, URLs, and non-English text inconsistently.
Token count ≠ word count. English averages roughly 1.3 tokens per word. Languages with longer words or non-Latin scripts can need 3-4 tokens per word or more, consuming more context window for the same semantic content.
Try it: Type text in the input to see how it splits into tokens. Hover over any token to see its integer ID.
Each token ID maps to a dense vector, typically of dimension 4,096 to 12,288 depending on model size. This mapping is a simple lookup table: token 5,293 retrieves row 5,293 from a learned matrix. There is no computation here, only retrieval.
The embedding vector is where meaning lives. Tokens that appear in similar contexts during training end up with similar vectors. "King" and "queen" are closer in this space than "king" and "bicycle".
The model also needs to know where each token sits in the sequence. Early transformers added a positional encoding to the token embedding; most modern models instead use Rotary Position Embeddings (RoPE), which encode relative position directly inside the attention computation.
The embedding matrix has shape [vocab_size, d_model]. For a 100K vocabulary and 4096-dimensional model, this single matrix contains ~400 million parameters.
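A sketch of the lookup with toy dimensions (real models use the sizes above):

```python
import numpy as np

vocab_size, d_model = 1_000, 64          # toy sizes; real: ~100K x 4,096
embedding = np.random.randn(vocab_size, d_model).astype(np.float32)

token_ids = np.array([523, 17, 902])     # output of the tokenizer
vectors = embedding[token_ids]           # pure row retrieval, no computation
print(vectors.shape)                     # (3, 64)

# At real scale the table alone is ~400M parameters:
print(f"{100_000 * 4096 / 1e6:.0f}M")    # 410M
```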
RoPE applies a rotation matrix to the query and key vectors at each position. The rotation angle is a function of position and dimension index, using sinusoidal frequencies at different scales. The dot product between a query at position i and a key at position j depends only on the relative distance i - j.
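A minimal numpy sketch of the rotation for a single head vector, using the interleaved-pair formulation (real implementations vectorise across the whole sequence and cache the angles):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10_000.0) -> np.ndarray:
    d = x.shape[-1]
    freqs = base ** (-2 * np.arange(d // 2) / d)   # one frequency per dim pair
    theta = pos * freqs                            # angle grows with position
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                      # rotate even/odd pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.randn(64), np.random.randn(64)
# The dot product depends only on relative distance (5-3 == 7-5):
print(np.dot(rope(q, 5), rope(k, 3)), np.dot(rope(q, 7), rope(k, 5)))
```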
Self-attention is the core operation. For each token, the model computes how much it should "attend to" every other token, then produces a weighted combination of their representations.
Three linear projections transform each token's embedding into a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The attention weight between two tokens is the dot product of one's Query with the other's Key, scaled and normalised.
In decoder-only models (GPT, Claude, Llama), a causal mask prevents tokens from attending to future positions. Token 5 can see tokens 1-5 but not token 6. This is what makes autoregressive generation possible.
The scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V.
The scaling factor 1/√d_k prevents dot products from growing large in magnitude as dimension increases, which would push softmax into vanishing-gradient regions.
Multi-head attention runs h parallel attention operations, each with its own Q, K, V projections of dimension d_k = d_model / h. Outputs are concatenated and projected back. Different heads learn different types of relationships: syntactic, semantic, positional.
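A single-head sketch with the causal mask, using random weights (multi-head attention runs h copies of this with separate projections and concatenates the outputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project embeddings
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot products
    mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)    # hide future tokens
    return softmax(scores) @ V                       # weighted mix of Values

seq_len, d_model, d_k = 5, 32, 8
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
print(causal_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```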
Try it: Enter a short sentence to see the attention heatmap. Toggle the causal mask. Select different heads to see varied attention patterns.
A transformer is a stack of identical blocks. The number varies by model size: GPT-3 used 96 blocks, Llama 3 405B uses 126, and frontier models like GPT-5 and Claude Opus are believed to use hundreds. Each block applies the same sequence: normalisation, attention, residual connection, normalisation, feed-forward network, residual connection.
The residual connections are what make deep transformers trainable. Instead of each layer transforming the representation completely, it adds a small update. Information flows both through the layers and around them via the residual stream.
The MLP in each block typically expands the dimension by 4x, applies a nonlinearity (GeLU or SiLU), and projects back down. This is where most factual knowledge is stored, as opposed to attention layers which manage information routing.
The modern transformer block (pre-norm variant), as a minimal sketch (attn stands in for the attention sub-layer; weight names are illustrative):
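```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    # normalise by root-mean-square only; no recentering, unlike LayerNorm
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps) * g

def silu(x):
    return x / (1 + np.exp(-x))

def block(x, attn, W1, Wgate, W2, g1, g2):
    x = x + attn(rmsnorm(x, g1))           # attention sub-layer + residual
    h = rmsnorm(x, g2)
    h = (h @ W1) * silu(h @ Wgate)         # SwiGLU: value path gated by SiLU
    return x + h @ W2                      # MLP sub-layer + residual
```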
RMSNorm replaced LayerNorm in most modern architectures. It normalises by RMS without recentering. The MLP uses SwiGLU gating: MLP(x) = (xW₁ ⊙ SiLU(xW_gate))W₂.
After all transformer blocks, the model produces one score (logit) per vocabulary token. These are converted to probabilities via softmax. The sampling strategy determines how the next token is chosen.
Temperature scales logits before softmax. Temperature < 1 sharpens the distribution (more deterministic). Temperature > 1 flattens it (more random). At temperature = 0, the model always picks the top token.
After temperature scaling, top-k restricts sampling to the k highest-probability tokens. Top-p (nucleus sampling) includes tokens until cumulative probability exceeds p. Top-p adapts to distribution shape: fewer candidates when confident, more when uncertain.
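A sketch combining all three controls in the order described (real decoders vary in filter order and edge-case handling):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-8)   # T<1 sharpens, T>1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1][:top_k]    # k most probable candidates
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]   # nucleus cut-off

    p = probs[keep] / probs[keep].sum()        # renormalise the survivors
    return np.random.choice(keep, p=p)

print(sample(np.random.randn(1000), temperature=0.7))
```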
Try it: Adjust temperature, top-k, and top-p to see how the probability distribution changes.
Autoregressive generation produces one token at a time. The KV cache stores Key and Value projections for all past tokens so they are computed only once. Each new token only needs its own Q, K, V computed, then attends to the full cache.
The cost: memory. For a large model generating a long sequence, the KV cache alone can consume tens of gigabytes of GPU memory. Modern context windows extend to 200K tokens (standard) or 1M+ tokens (Claude Opus 4.6 beta, Gemini 2.5 Pro). Context length is a hardware constraint, not just an architecture choice.
KV cache per layer: 2 (K and V) × batch × seq_len × n_kv_heads × d_head × 2 bytes (fp16). As a concrete example, Llama 3 70B (80 layers, 8 KV heads via GQA, d_head = 128): each token adds 4 KB per layer, about 0.33 MB across all 80 layers; without GQA (64 full KV heads) it would be roughly 2.6 MB. Frontier models with longer context windows and more layers consume proportionally more.
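A quick check of those numbers (the parameters are the published Llama 3 70B values):

```python
n_layers, n_kv_heads, d_head, fp16 = 80, 8, 128, 2     # bytes per value

per_token = 2 * n_kv_heads * d_head * fp16 * n_layers  # K and V, all layers
print(f"{per_token / 1e6:.2f} MB per token")           # 0.33 MB
print(f"{per_token * 128_000 / 1e9:.1f} GB at 128K context")  # ~42 GB
```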
Grouped-Query Attention (GQA) shares KV heads across multiple Query heads. Multi-Query Attention (MQA) uses a single KV head for all queries. PagedAttention manages cache like virtual memory pages.
Pretraining: trillions of tokens, next-token prediction. Grammar, facts, reasoning patterns absorbed. Cost: tens to hundreds of millions of dollars for frontier models, and rising with each generation.
Supervised Fine-Tuning: human-written examples of ideal responses. The model learns format and style of a helpful assistant.
RLHF: a reward model trained on human preferences, then used to optimise the LLM. This is where sycophantic tendencies emerge: humans prefer agreement, so the model learns to agree.
The RLHF objective: maximise E[r(x, y)] − β · D_KL(π_θ ‖ π_ref), the expected reward under the policy minus a penalty for drifting from the reference model.
The KL penalty prevents reward hacking. DPO bypasses the reward model, optimising directly on preference pairs. Constitutional AI uses self-critique against a set of principles.
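As a sketch, the DPO loss on a single preference pair (log-probs are summed over response tokens; β and the variable names here are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # increase the policy's margin for the chosen (w) over the rejected (l)
    # response, relative to the reference model's margin
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1 / (1 + np.exp(-margin)))   # -log sigmoid(margin)
```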
RAG: encode query as vector, search document database, inject top matches into prompt. The model "knows" things it never saw during training because the information is in context.
Tool use: model outputs structured function calls instead of text. The system executes them and feeds results back.
Agents: combine tools with planning. The model reasons, acts, observes, iterates. Generation-verification asymmetry is most acute here: actions execute at computational speed, consequences may be irreversible.
RAG retrieval typically uses dense passage retrieval: both queries and documents are encoded by a bi-encoder (E5, BGE) into the same vector space. Similarity is computed via cosine similarity or dot product. FAISS or similar approximate nearest-neighbour libraries handle search at scale.
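A minimal sketch of the retrieval step (embed() here is a toy stand-in for a real bi-encoder; at scale the brute-force dot products are replaced by an approximate nearest-neighbour index):

```python
import numpy as np

def embed(text: str, dim: int = 384) -> np.ndarray:
    # toy stand-in: a deterministic random unit vector per text, so that
    # dot product == cosine similarity; a real system calls the bi-encoder
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = ["BPE merges frequent pairs", "RoPE rotates queries and keys",
        "KV cache stores past projections"]
index = np.stack([embed(d) for d in docs])    # [n_docs, dim]

query = embed("how does tokenization work?")
scores = index @ query                        # cosine similarity per doc
print(docs[int(np.argmax(scores))])           # top match goes into the prompt
```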
Chunking strategy matters enormously. Too small and you lose context. Too large and you dilute relevance. Typical chunk sizes are 256-512 tokens with 10-20% overlap. Hierarchical chunking (document, section, paragraph) with metadata filtering is current best practice.
Tool use is implemented by fine-tuning the model on examples of (query, tool_call, result, response) sequences. The model learns when to use tools and how to format calls. Structured output typically uses JSON or a function-calling schema defined in the system prompt.
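A sketch of the shapes involved (schema and field names are hypothetical; each provider defines its own):

```python
# Tool schema the model is shown, typically in the system prompt:
weather_tool = {
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Instead of prose, the model emits a structured call...
model_output = {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "Lisbon"}}}

# ...the system executes it and feeds the result back as context:
tool_result = {"role": "tool", "name": "get_weather",
               "content": '{"temp_c": 21, "condition": "clear"}'}
```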
Agentic loops introduce compounding error risk. Each step depends on the output of the previous step. A small misinterpretation at step 2 propagates and amplifies through steps 3, 4, 5. The Distributed Error Propagation that the Mirror Effect framework describes in institutional contexts applies within a single agent's reasoning chain.
Everything in Tabs 1-8 describes a machine that produces plausible, confident, well-structured text. The institutional risks are structural consequences of the design.
Sycophancy is an alignment artefact. RLHF optimises for human preference. Humans prefer agreement. The optimisation finds agreement as the path of least resistance.
Confidence is the default mode. Softmax always produces a distribution. The model always picks a token. There is no architectural "I don't know". Uncertainty must be trained in, against the grain.
Verification cannot scale with generation. A 5,000-word report in seconds. Checking it requires domain expertise, primary sources, and time. The gap widens with every capability improvement.
The RLHF reward signal is a scalar preference: response A is better than response B. But "better" conflates helpfulness, accuracy, tone, and agreement. The reward model cannot decompose these. When a user prefers a response that agrees with their incorrect premise over one that corrects it, the reward model learns: agreement is rewarded. Sharma et al. (2024) measured roughly 13% accuracy erosion per conversational turn from this dynamic, with 71% of users unable to distinguish sycophantic from non-sycophantic responses.
The Coupled Feedback Loop is the core mechanism. Human confirmation bias and AI sycophancy are each manageable in isolation. The coupled system produces emergent dynamics qualitatively different from either component: the human's satisfaction reinforces the model's approach, the model's agreement deepens the human's confidence, and the coupling is asymmetric by design because we have veto power over the AI's challenges but no equivalent mechanism for overriding its agreement.
At the institutional level, Proxy Collapse means that production difficulty no longer signals competence. A well-written report used to mean the author understood the subject. That inferential link is now structurally threatened. Even when AI is used thoughtfully, the output no longer carries the same signal about understanding that equivalent output carried five years ago.
This is the Mirror Effect: the interaction between these mechanical properties and human cognitive tendencies, operating inside institutional structures never designed for conditions AI creates. The machine does not need to be wrong. It needs only to be plausible, confident, and agreeable.
The same sentence tokenized in 19 languages. Same model, same API price per token, different cost per meaning. Click to run the comparison.