How LLMs Actually Work

Why This Post Exists

There’s a lot of hand-wavy explanation out there about large language models. “It predicts the next word” is technically true but tells you almost nothing useful. If you want to actually understand what’s going on — enough to make real engineering decisions about deploying these things — you need to go deeper.

This is my attempt at a ground-up explanation that’s honest about the mechanics without requiring a PhD to follow.

Foundations

A neural network is a function: numbers in, numbers out. Between the input and the output are layers of neurons, and each neuron does three things: multiplies its inputs by learned weights, adds the results together along with a bias, and passes the sum through an activation function — a simple nonlinearity that lets the network learn curved, complex relationships instead of just straight lines.
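A single neuron, sketched in Python — the weights and inputs here are made up purely for illustration:

```python
def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus a bias, then a nonlinearity (ReLU here).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)  # ReLU: pass positives through, clamp negatives to zero

# Toy example: two inputs, two learned weights, one bias.
out = neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```

A real layer runs thousands of these in parallel as one matrix multiplication, but the per-neuron arithmetic is exactly this.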

Stack many layers of these together and you get a deep neural network. “Deep” just means many layers. The depth lets the network learn increasingly abstract patterns — early layers might recognize word fragments, middle layers recognize phrases, deep layers recognize intent.

Every weight and bias is a parameter. When someone says “a 70 billion parameter model,” they mean 70 billion individual numbers that were learned during training. Those numbers collectively encode everything the model “knows.”

How Numbers Represent Words

Neural networks only work with numbers. So every word (or piece of a word) gets converted into an embedding — a vector of, say, 4,096 numbers. These vectors aren’t random. During training, the model learns to place similar words close together in this high-dimensional space. “King” and “queen” end up near each other. “Cat” and “democracy” end up far apart.

Models don’t understand meaning the way we do. They understand geometric relationships between vectors. Similarity is proximity.
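"Similarity is proximity" is usually measured with cosine similarity — the angle between two vectors. A sketch with made-up 3-dimensional vectors (real embeddings have thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths: 1.0 means
    # "pointing the same direction", near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen to illustrate the geometry.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
democracy = [0.1, 0.2, 0.9]
```

With these toy values, `cosine_similarity(king, queen)` is much higher than `cosine_similarity(king, democracy)` — that gap is all the model "knows" about relatedness.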

Tokenization

A token is the smallest unit of text a model works with, and it’s not always a whole word. “The” is one token. “Understanding” might be split into “under” + “standing” — two tokens. Punctuation, spaces, rare characters — all tokens. Roughly, one token ≈ ¾ of an English word.

Most modern LLMs use Byte Pair Encoding (BPE). The idea: start with every character as its own token, then iteratively merge the most frequent adjacent pairs. “t” + “h” → “th”. “th” + “e” → “the”. Repeat thousands of times until you hit your target vocabulary (typically 32K–100K+ tokens). Common words become single tokens, rare words get split into known subwords. The model never encounters a truly unknown word.
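A toy version of the BPE merge loop (real tokenizers learn merges from a huge corpus, not one string, and handle bytes rather than characters):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair of tokens in the sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with a single merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start with characters, merge the most frequent pair three times.
tokens = list("the theme then")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After a few merges, "the" has collapsed into a single token because it appears so often — the same dynamic, repeated thousands of times, produces a production vocabulary.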

This matters more than people realize:

  • A “128K context window” is 128K tokens, not words. Actual word capacity is ~75% of that.
  • API pricing is per-token. Inefficient tokenization costs more.
  • Some languages tokenize less efficiently than English — more tokens per word, less effective context.
  • Models reason at the token level. Weird subword splits can cause weird behavior.

The Transformer

The transformer architecture (from the 2017 “Attention Is All You Need” paper) is the foundation of every modern LLM. GPT, Claude, Gemini, Llama, Qwen, DeepSeek — all transformers.

What It Replaced

Before transformers, the dominant approach was Recurrent Neural Networks (RNNs), which process text one word at a time, sequentially. Two major problems: they were slow (no parallelism — word 50 waits for words 1–49) and they forgot things (by word 500, word 1 has been diluted through hundreds of steps).

Transformers solved both problems with a single mechanism: attention.

How Attention Works

Attention lets every token look at every other token simultaneously and decide how much to focus on each one. Instead of passing information down a sequential chain, every token can directly access every other token. The word “it” can look back across the entire sequence and figure out that “it” refers to “the server” from 40 tokens ago — no chain required.

Mechanically, for each token the model creates three vectors from learned weight matrices:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I contain?”
  • Value (V): “What information do I provide?”

Think of a library search. The Query is your search term, the Key is the title of each book, and the Value is the actual content. Match your Query against all Keys to find relevant books, then read their Values.

The model computes a dot product between each token’s Query and every other token’s Key, scales it, passes the result through softmax to get a probability distribution, then takes the weighted sum of all Value vectors. Tokens deemed relevant contribute more; irrelevant tokens contribute almost nothing.
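That computation — dot products, scaling, softmax, weighted sum — fits in a few lines of plain Python. This is a single-head sketch with toy vectors and no causal masking:

```python
import math

def attention(Q, K, V):
    # Q, K, V: one vector per token. Returns one output vector per token.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax turns scores into a probability distribution.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

Running it with `Q = K = V = [[1.0, 0.0], [0.0, 1.0]]` shows each token weighting its own value most heavily, since its query matches its own key best.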

Multi-Head Attention

Language has many types of relationships happening simultaneously — syntax, semantics, coreference, topic, negation. Multi-head attention runs multiple attention mechanisms in parallel, each with its own Q/K/V weights. One head might learn subject-verb agreement, another might track pronoun references, another might track topic continuity.

The outputs are concatenated and projected back down. This is why transformers are so powerful — they simultaneously track dozens of different linguistic relationships.

A Full Transformer Block

A single transformer block:

  1. Multi-Head Self-Attention — every token attends to every other token
  2. Add & Normalize — attention output added back to the input (residual connection) and normalized
  3. Feed-Forward Network — each token passes independently through a small 2-layer neural network (this is where a lot of “knowledge storage” happens)
  4. Add & Normalize again

A model like Llama 3 70B stacks 80 of these blocks on top of each other. Tokens flow through all of them sequentially, and with each block the representation gets richer and more abstract.
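The four sub-steps can be sketched with stand-in sub-layers. Scalar "vectors" and an identity normalization are used here purely to show the shape of the data flow, not real math:

```python
def transformer_block(tokens, attend, feed_forward, normalize):
    # Step 1-2: attention sub-layer, residual connection, then normalization.
    attended = attend(tokens)
    tokens = [normalize(t + a) for t, a in zip(tokens, attended)]
    # Step 3-4: feed-forward applied per token, again residual + normalize.
    tokens = [normalize(t + feed_forward(t)) for t in tokens]
    return tokens

# Toy stand-ins: "attention" averages the sequence, the FFN halves its input,
# normalization is the identity. Real sub-layers are learned matrices.
out = transformer_block(
    [1.0, 3.0],
    attend=lambda ts: [sum(ts) / len(ts)] * len(ts),
    feed_forward=lambda t: t * 0.5,
    normalize=lambda t: t,
)
```

Stacking 80 of these means calling this function 80 times, each block's output feeding the next.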

Decoder-Only and Autoregressive Generation

Modern LLMs use a decoder-only transformer with causal masking: when generating token 5, the model can only attend to tokens 1–4. It cannot look ahead. This makes the model autoregressive — it generates one token at a time, left to right, each conditioned on everything before it.

This is why LLMs “type” their responses. They’re literally computing one token at a time.
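The generation loop itself is simple. With a stand-in model (here, a hypothetical function that just increments the last token), it can be sketched as:

```python
def generate(model, prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # The model sees only the tokens so far (causal masking) and
        # returns its pick for the next token.
        next_token = model(tokens)
        tokens.append(next_token)
    return tokens

# Toy "model" for illustration: always predicts last token + 1.
result = generate(lambda ts: ts[-1] + 1, [0], n_new=3)
```

Every real inference server runs some version of this loop; all the engineering below is about making each iteration fast.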

How a Model Learns

Training happens in phases. Think of it like education: grade school, college, then mentorship.

Pre-Training

The most expensive phase. The objective is deceptively simple: predict the next token. The model sees trillions of tokens from the internet — books, code, Wikipedia, papers — and for each token, tries to predict what comes next. When wrong, the error adjusts all parameters through backpropagation.

Over billions of iterations, the model learns that “The capital of France is” should be followed by “Paris,” that def __init__(self, should be followed by Python syntax, and complex reasoning patterns that emerge from predicting text across diverse domains.

The scale is staggering: thousands of GPUs, weeks to months, trillions of tokens, tens to hundreds of millions of dollars. The result is a base model — great at predicting text, but it doesn’t behave like an assistant. It might answer a question with another question. A raw engine without steering.

Fine-Tuning

Supervised fine-tuning (SFT) shows the model examples of desired behavior — prompt-response pairs. “When a user asks X, a good response looks like Y.” This turns the base model into an assistant. The same base can be fine-tuned into a coding assistant, a medical Q&A system, or a creative writing tool.

You can also do domain-specific fine-tuning to teach specialized knowledge. The tradeoff: fine-tuning on a narrow domain can cause catastrophic forgetting, where the model gets better at the target but worse at general tasks. Smaller models are more susceptible.

RLHF

Reinforcement Learning from Human Feedback is what makes models pleasant to interact with. After SFT, the model follows instructions but might still give verbose or subtly wrong responses. RLHF works by collecting human preference rankings on multiple model outputs, training a reward model to predict what humans prefer, then optimizing the LLM to maximize that reward signal.

RLHF is what makes models say “I don’t know” instead of confidently hallucinating, what makes them appropriately cautious, and what makes responses feel natural rather than robotic.

The Full Pipeline

Raw Text → Pre-Training → Base Model → SFT → Instruction-Following Model → RLHF → Aligned Assistant

You can’t skip phases. No pre-training, no knowledge. No SFT, no instruction following. No RLHF, functional but rough.

Inference: Actually Running the Thing

Inference is what happens when you use the model. Understanding it matters because it determines your hardware requirements and speed.

Two Phases

Prefill processes your entire prompt at once in parallel — compute-heavy but fast. Decode generates tokens one at a time autoregressively — inherently sequential and the slow part.

The KV Cache

During decoding, every new token needs to attend to all previous tokens. Without optimization, generating token 1,000 would recompute attention for all 999 previous tokens from scratch. The KV cache solves this by storing Key and Value vectors from all processed tokens. Each new token only computes its own Q/K/V and looks up cached values.

The catch: the KV cache grows linearly with sequence length. A 70B model with a 128K context might need 20–40 GB of VRAM just for the cache, on top of the weights. A 1M token context multiplies that by ~8x. This is why “supports 1M tokens” needs serious hardware to actually run at length.
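A back-of-envelope check on that 20–40 GB figure, using Llama-3-70B-like architecture assumptions (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache — these numbers are assumptions, not quoted specs):

```python
# KV cache bytes = 2 (one K and one V vector) x layers x kv_heads
#                  x head_dim x bytes per value x sequence length.
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # bytes per token
cache_gb = per_token * 128_000 / 1e9                          # 128K-token context
```

This lands at roughly 42 GB — the top of the quoted range. Without grouped-query attention (64 KV heads instead of 8) it would be 8x larger, which is exactly why modern architectures shrink the KV head count.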

The Key Insight

Inference is memory-bandwidth bound, not compute bound. For each token, the model reads all its weights from memory. The speed at which VRAM can be read (GB/s or TB/s) is usually the bottleneck, not multiplication speed. This is why memory bandwidth specs matter so much for inference hardware.

Precision and Quantization

Every parameter is stored as a floating-point number. The precision — how many bits per number — determines memory usage:

  Precision    Bytes/Param    70B Model (Weights Only)
  FP32         4              280 GB
  FP16/BF16    2              140 GB
  FP8          1              70 GB
  FP4          0.5            35 GB

At FP32, a 70B model doesn’t fit on any single GPU. At FP4, it fits on a single 96 GB card with room for the KV cache.
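The table is just parameter count times bytes per parameter — worth internalizing, since it lets you size any model on the back of a napkin:

```python
def weight_memory_gb(n_params, bytes_per_param):
    # Memory for the weights alone; KV cache and activations come on top.
    return n_params * bytes_per_param / 1e9

sizes = {name: weight_memory_gb(70e9, b)
         for name, b in [("FP32", 4), ("FP16", 2), ("FP8", 1), ("FP4", 0.5)]}
```

The same function works for any model: a 8B model at FP16 is `weight_memory_gb(8e9, 2)` = 16 GB, which is why 8B models are the sweet spot for 24 GB consumer cards.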

Quantization converts a model from higher to lower precision after training. The quality tradeoff is real but often overstated:

  • FP16 → FP8: Usually less than 1% degradation on benchmarks.
  • FP16 → Q5/Q6: Very minor. Hard to notice in practice.
  • FP16 → Q4: Noticeable on complex reasoning and code. Fine for general conversation.
  • FP16 → Q2/Q3: Significant. Model starts making obvious errors.

The important rule: larger models tolerate quantization better. A 70B model at Q4 often outperforms a 13B at FP16. “Run the biggest model that fits” is usually the right call.

Mixture of Experts

In a standard “dense” transformer, every parameter is used for every token. A 70B model activates all 70 billion parameters for every single token. That’s wasteful.

Mixture of Experts (MoE) replaces each transformer block’s feed-forward network with multiple parallel FFNs (“experts”) and a small router that picks which experts to activate per token. A model might have 400B total parameters but only activate ~40B per token. (In practice, expert specialization is rarely as clean as “the code expert” and “the math expert” — routing is learned and often opaque — but the intuition holds: different tokens take different paths through the network.)

The critical nuance: MoE saves compute, not memory. You still need VRAM to hold all 400B parameters, even though only 40B are active at any time. You get faster inference because fewer parameters are computed per token, but the memory footprint is the full model size.
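The router itself is tiny — a sketch of top-k expert selection, assuming the router has already produced one score per expert for the current token:

```python
def route(expert_scores, k=2):
    # Keep only the k highest-scoring experts for this token; all other
    # experts are skipped entirely, which is where the compute saving comes from.
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    return ranked[:k]

# Hypothetical scores over 4 experts; top-2 routing picks experts 1 and 3.
experts_used = route([0.1, 2.3, -0.5, 1.7], k=2)
```

The token's output is then a weighted combination of just those k experts' FFN outputs — but note that all four experts' weights must still sit in VRAM for the router to have the choice.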

RAG: Retrieval Augmented Generation

How do you give a model access to knowledge outside its training data without retraining?

RAG has two phases. Offline: split your documents into chunks, embed each chunk as a vector, and store the vectors in a vector database. At query time: embed the user’s question, retrieve the most similar chunks, insert them into the model’s context, and generate a response.

It’s giving someone an open-book exam instead of asking them to memorize the textbook.
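The retrieval step can be sketched as a similarity search over pre-embedded chunks. The 2-dimensional embeddings and chunk texts here are made up for illustration:

```python
def retrieve(query_vec, chunks, top_k=2):
    # chunks: list of (text, embedding) pairs. Rank by dot-product
    # similarity against the query embedding, return the best matches.
    scored = sorted(chunks,
                    key=lambda c: sum(q * x for q, x in zip(query_vec, c[1])),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]

chunks = [("refund policy", [0.9, 0.1]),
          ("shipping times", [0.1, 0.9]),
          ("returns FAQ", [0.8, 0.3])]
top = retrieve([1.0, 0.0], chunks)
```

Production systems swap the sorted list for an approximate nearest-neighbor index, but the contract is the same: query vector in, most-similar texts out, pasted into the prompt.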

RAG vs. Fine-Tuning

RAG is best for factual recall from specific documents, knowledge that changes frequently, and situations where you don’t have access to model weights. Low compute, easy to update, but can retrieve irrelevant chunks.

Fine-tuning is best for teaching new behaviors, styles, or domain-specific language. Higher compute, harder to update, risks catastrophic forgetting on small models.

In practice, the best systems use both — fine-tune for domain language and format, RAG for specific facts at inference time.

Agentic AI

A regular LLM interaction is stateless: prompt in, response out. An agentic system breaks this by giving the model the ability to plan multi-step workflows, call external tools, observe results, and iterate.

The core loop: Think → Act → Observe → Think → Act → …

When a model “uses a tool,” it generates structured text (usually JSON) describing which tool to call and with what arguments. The framework — not the model — actually executes the call and returns the result. The model is the brain; the framework is the body.

Maturity Levels

A rough framework the industry is converging on:

  1. Transactional — single prompt, single response, no tools. A chatbot.
  2. Tool-Augmented — model can call tools but doesn’t plan multi-step workflows. A RAG chatbot.
  3. Goal-Oriented — model plans across multiple steps, recovers from errors.
  4. Multi-Agent — specialized agents collaborate and delegate.
  5. Full Autonomy — general problem-solving, minimal human oversight. Not reliably achieved yet.

Misconceptions Worth Correcting

“LLMs understand language.” They learn statistical patterns between tokens. They produce text that appears to demonstrate understanding, but they’re computing vector operations. This matters when setting expectations.

“More parameters always means better.” A well-trained 70B model often beats a poorly trained 200B. Architecture, data quality, and training methodology matter more than raw count. MoE models further complicate the comparison.

“Fine-tuning teaches new facts.” Fine-tuning primarily changes behavior and style. Stuffing facts in via fine-tuning is unreliable. RAG is better for factual recall.

“Quantization ruins quality.” For most use cases, 8-bit and 5–6 bit quantization is nearly indistinguishable from full precision. A quantized large model usually beats a full-precision small model.

“Models learn from conversations.” They don’t. Weights are fixed after training. Every conversation starts fresh. In-context learning is temporary and doesn’t persist.

“Temperature controls creativity.” Temperature controls randomness in token selection. At 0, the model always picks the highest-probability token. Higher values let lower-probability tokens through. This looks like creativity but it’s sampling noise. Crank it too high and you get incoherent output, not better ideas.
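A minimal sketch of temperature sampling over raw logits makes the mechanism concrete:

```python
import math
import random

def sample(logits, temperature):
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing logits by temperature before softmax flattens (T > 1) or
    # sharpens (T < 1) the distribution; then we sample from it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]
```

At temperature 0, `sample([1.0, 3.0, 0.5], 0)` always returns index 1; at high temperatures the three tokens approach equal probability — more variety, not more intelligence.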

“Small models are useless.” For narrow, well-defined tasks with fine-tuning, they can be very effective. For general reasoning, coding, and complex instruction following, they’re substantially worse. The mistake is deploying small models for tasks that need large model capabilities.


That’s the whole picture — or at least enough of it to have informed conversations and make real engineering decisions. The field moves fast, but these fundamentals are stable. Tokenization, attention, the training pipeline, inference mechanics, quantization tradeoffs — this is the bedrock everything else is built on.