Tokenization & Prompt to Tokens
Text is split into subword tokens before entering the model. This is how language becomes numbers.
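Subword splitting can be sketched with a toy greedy longest-match tokenizer. This is illustrative only: real tokenizers (BPE, WordPiece, etc.) learn their vocabulary from data, and the tiny vocabulary below is invented for the example.

```python
# Toy vocabulary mapping subword pieces to token ids (hypothetical).
TOY_VOCAB = {"un": 0, "break": 1, "able": 2, "the": 4, " ": 5}

def tokenize(text, vocab):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1                          # skip characters not in the vocab
    return ids

print(tokenize("unbreakable", TOY_VOCAB))  # → [0, 1, 2]: un / break / able
```

The word "unbreakable" never appears in the vocabulary, yet it still becomes numbers, which is exactly why subword tokenization handles unseen words gracefully.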
Context length matters downstream: self-attention has O(n²) complexity, so doubling the context length quadruples attention compute.
Logits to Sampling
The model outputs raw logits (scores) for every token in the vocabulary. Sampling selects the next token.
Logits are raw unnormalized scores. Softmax converts them to a valid probability distribution that sums to 1. Temperature divides logits before softmax: p_i = exp(z_i / T) / Σ exp(z_j / T).
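The formula above can be implemented directly. A minimal sketch (standard library only, with the usual max-subtraction trick for numerical stability):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))   # peaked on the top logit
print(softmax_with_temperature(logits, T=10.0))  # flatter, closer to uniform
```

Low temperature sharpens the distribution toward the highest logit; high temperature flattens it, making sampling more random.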
Self-Attention & Causal Mask
Multi-head attention lets each token attend to all previous tokens. The causal mask prevents attending to future tokens.
The heatmap shows the attention weight from each query (row) to each key (column); brighter cells indicate higher weight.
Self-attention is O(n²·d) where n is sequence length and d is head dimension. Each head learns different patterns — some attend locally, some to specific syntactic roles.
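A single head of causally masked scaled dot-product attention can be sketched in a few lines of NumPy. The random Q, K, V matrices stand in for the learned projections of real token embeddings:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: (n, d) arrays for a single head."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) similarities
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                              # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
out, w = causal_attention(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)))
# Each row of w sums to 1, and w is lower-triangular: token t only
# attends to tokens 0..t, never to the future.
```

Setting masked scores to −∞ before the softmax is what makes those weights exactly zero afterward.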
Transformer Block Internals
A single transformer block applies attention and a feed-forward network with residual connections. Modern LLMs stack 32–128 of these.
The input token embeddings enter the transformer block. Each token is a d-dimensional vector.
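The whole block can be sketched as a pre-norm forward pass: attention and a feed-forward network, each wrapped in a residual connection. This is a simplified single-head version with random weights; real blocks use multiple heads, learned LayerNorm parameters, and gated activations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones((n, n), bool), 1)] = -np.inf    # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    x = x + (w @ V) @ Wo                                   # residual 1: attention
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2                     # residual 2: ReLU FFN
    return x

d = 16
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=s)
          for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
y = transformer_block(rng.normal(size=(5, d)), *params)
# Output has the same shape as the input: one d-dimensional vector per token,
# so blocks can be stacked 32-128 deep.
```

Because the block maps (n, d) to (n, d), stacking is just repeated function composition.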
KV Cache & Decoding Efficiency
During generation, the KV cache stores previously computed keys and values, avoiding redundant computation.
Each layer stores K and V matrices per head. Memory = 2 × layers × heads × seq_len × d_head × bytes per element.
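Plugging numbers into that formula shows why cache size matters. The configuration below is a hypothetical 7B-class model (32 layers, 32 heads, d_head = 128, fp16), not any specific released model:

```python
def kv_cache_bytes(layers, heads, seq_len, d_head, bytes_per_elem=2):
    """Memory = 2 (K and V) x layers x heads x seq_len x d_head x bytes/elem."""
    return 2 * layers * heads * seq_len * d_head * bytes_per_elem

# Hypothetical 7B-class config at a 4096-token context, fp16 (2 bytes).
gib = kv_cache_bytes(layers=32, heads=32, seq_len=4096, d_head=128) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB for a single sequence
```

Memory grows linearly with context length and batch size, which is why long-context serving is often memory-bound rather than compute-bound.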
Training & Alignment Pipeline
Modern LLMs undergo multiple training stages to become helpful, harmless, and honest.
The model learns to predict the next token on trillions of tokens from the internet. This self-supervised objective teaches language patterns, facts, and reasoning.
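The pretraining objective is cross-entropy on the next token. A toy sketch with a 3-token vocabulary (the probabilities are made up for illustration):

```python
import math

def next_token_loss(probs, target_ids):
    """Average cross-entropy: -log p(correct next token) over positions.
    probs[t] is the model's distribution over the vocabulary at position t."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

# Two positions over a toy 3-token vocabulary.
probs = [[0.7, 0.2, 0.1],   # model is fairly sure the next token is id 0
         [0.1, 0.1, 0.8]]   # ...and that the one after is id 2
print(next_token_loss(probs, [0, 2]))  # low loss: predictions match targets
print(next_token_loss(probs, [1, 1]))  # higher loss: predictions were wrong
```

Training simply pushes this loss down by gradient descent over enormous text corpora; everything else the model appears to "know" falls out of that single objective.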
This is a simplified conceptual representation. Actual training involves billions of parameters, months of compute, and complex engineering.
RAG & Tool Calling
Retrieval-Augmented Generation grounds model outputs in external knowledge. Tool calling lets the model interact with APIs.
Documents are chunked and embedded into vectors. Cosine similarity finds the most relevant chunks: sim(q, d) = (q · d) / (||q|| × ||d||).
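The retrieval step is just that similarity formula plus a sort. The 3-dimensional vectors below are hypothetical stand-ins for a real embedding model's output (which would typically have hundreds of dimensions):

```python
import math

def cosine_sim(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||)"""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in d)))

def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine_sim(query_vec, c["vec"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

# Hypothetical chunk embeddings (a real system would embed document text).
chunks = [{"text": "KV cache sizing",     "vec": [0.9, 0.1, 0.0]},
          {"text": "softmax temperature", "vec": [0.1, 0.9, 0.1]},
          {"text": "attention masking",   "vec": [0.8, 0.2, 0.1]}]
print(retrieve([1.0, 0.1, 0.0], chunks, k=2))
# → ['KV cache sizing', 'attention masking']
```

The retrieved chunks are then pasted into the prompt, grounding the model's answer in external text instead of its parametric memory alone.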

