Tokenization & Prompt to Tokens
Text is split into subword tokens before entering the model. This is how language becomes numbers.
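Subword splitting can be sketched with a toy greedy longest-match tokenizer. This is illustrative only: real tokenizers (BPE, WordPiece, etc.) learn their vocabulary from data, and the tiny vocabulary below is invented for the example.

```python
# Toy vocabulary mapping subword pieces to token ids (hypothetical).
TOY_VOCAB = {"un": 0, "break": 1, "able": 2, "the": 4, " ": 5}

def tokenize(text, vocab):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1                          # skip characters not in the vocab
    return ids

print(tokenize("unbreakable", TOY_VOCAB))  # → [0, 1, 2]: un / break / able
```

The word "unbreakable" never appears in the vocabulary, yet it still becomes numbers, which is exactly why subword tokenization handles unseen words gracefully.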
Context length matters downstream: self-attention has O(n²) complexity, so doubling the context length quadruples attention compute.
Logits to Sampling
The model outputs raw logits (scores) for every token in the vocabulary. Sampling selects the next token.
Logits are raw unnormalized scores. Softmax converts them to a valid probability distribution that sums to 1. Temperature divides logits before softmax: p_i = exp(z_i / T) / Σ exp(z_j / T).
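The formula above can be implemented directly. A minimal sketch (standard library only, with the usual max-subtraction trick for numerical stability):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))   # peaked on the top logit
print(softmax_with_temperature(logits, T=10.0))  # flatter, closer to uniform
```

Low temperature sharpens the distribution toward the highest logit; high temperature flattens it, making sampling more random.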
Self-Attention & Causal Mask
Multi-head attention lets each token attend to all previous tokens. The causal mask prevents attending to future tokens.
The heatmap shows the attention weight from each query (row) to each key (column); brighter cells indicate higher weight.
Self-attention is O(n²·d) where n is sequence length and d is head dimension. Each head learns different patterns — some attend locally, some to specific syntactic roles.
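A single head of causally masked scaled dot-product attention can be sketched in a few lines of NumPy. The random Q, K, V matrices stand in for the learned projections of real token embeddings:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: (n, d) arrays for a single head."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) similarities
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                              # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
out, w = causal_attention(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)))
# Each row of w sums to 1, and w is lower-triangular: token t only
# attends to tokens 0..t, never to the future.
```

Setting masked scores to −∞ before the softmax is what makes those weights exactly zero afterward.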
Transformer Block Internals
A single transformer block applies attention and a feed-forward network with residual connections. Modern LLMs stack 32–128 of these.
The input token embeddings enter the transformer block. Each token is a d-dimensional vector.
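The whole block can be sketched as a pre-norm forward pass: attention and a feed-forward network, each wrapped in a residual connection. This is a simplified single-head version with random weights; real blocks use multiple heads, learned LayerNorm parameters, and gated activations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones((n, n), bool), 1)] = -np.inf    # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    x = x + (w @ V) @ Wo                                   # residual 1: attention
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2                     # residual 2: ReLU FFN
    return x

d = 16
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=s)
          for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
y = transformer_block(rng.normal(size=(5, d)), *params)
# Output has the same shape as the input: one d-dimensional vector per token,
# so blocks can be stacked 32-128 deep.
```

Because the block maps (n, d) to (n, d), stacking is just repeated function composition.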
KV Cache & Decoding Efficiency
During generation, the KV cache stores previously computed keys and values, avoiding redundant computation.
Each layer stores K and V matrices per head. Memory = 2 × layers × heads × seq_len × d_head × bytes per element.
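Plugging numbers into that formula shows why cache size matters. The configuration below is a hypothetical 7B-class model (32 layers, 32 heads, d_head = 128, fp16), not any specific released model:

```python
def kv_cache_bytes(layers, heads, seq_len, d_head, bytes_per_elem=2):
    """Memory = 2 (K and V) x layers x heads x seq_len x d_head x bytes/elem."""
    return 2 * layers * heads * seq_len * d_head * bytes_per_elem

# Hypothetical 7B-class config at a 4096-token context, fp16 (2 bytes).
gib = kv_cache_bytes(layers=32, heads=32, seq_len=4096, d_head=128) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB for a single sequence
```

Memory grows linearly with context length and batch size, which is why long-context serving is often memory-bound rather than compute-bound.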
Training & Alignment Pipeline
Modern LLMs undergo multiple training stages to become helpful, harmless, and honest.
The model learns to predict the next token on trillions of tokens from the internet. This self-supervised objective teaches language patterns, facts, and reasoning.
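The pretraining objective is cross-entropy on the next token. A toy sketch with a 3-token vocabulary (the probabilities are made up for illustration):

```python
import math

def next_token_loss(probs, target_ids):
    """Average cross-entropy: -log p(correct next token) over positions.
    probs[t] is the model's distribution over the vocabulary at position t."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

# Two positions over a toy 3-token vocabulary.
probs = [[0.7, 0.2, 0.1],   # model is fairly sure the next token is id 0
         [0.1, 0.1, 0.8]]   # ...and that the one after is id 2
print(next_token_loss(probs, [0, 2]))  # low loss: predictions match targets
print(next_token_loss(probs, [1, 1]))  # higher loss: predictions were wrong
```

Training simply pushes this loss down by gradient descent over enormous text corpora; everything else the model appears to "know" falls out of that single objective.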
This is a simplified conceptual representation. Actual training involves billions of parameters, months of compute, and complex engineering.
RAG & Tool Calling
Retrieval-Augmented Generation grounds model outputs in external knowledge. Tool calling lets the model interact with APIs.
Documents are chunked and embedded into vectors. Cosine similarity finds the most relevant chunks: sim(q, d) = (q · d) / (||q|| × ||d||).
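The retrieval step is just that similarity formula plus a sort. The 3-dimensional vectors below are hypothetical stand-ins for a real embedding model's output (which would typically have hundreds of dimensions):

```python
import math

def cosine_sim(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||)"""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in d)))

def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine_sim(query_vec, c["vec"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

# Hypothetical chunk embeddings (a real system would embed document text).
chunks = [{"text": "KV cache sizing",     "vec": [0.9, 0.1, 0.0]},
          {"text": "softmax temperature", "vec": [0.1, 0.9, 0.1]},
          {"text": "attention masking",   "vec": [0.8, 0.2, 0.1]}]
print(retrieve([1.0, 0.1, 0.0], chunks, k=2))
# → ['KV cache sizing', 'attention masking']
```

The retrieved chunks are then pasted into the prompt, grounding the model's answer in external text instead of its parametric memory alone.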

