Hands-on introduction to LLMs
A short four-week introduction to how modern decoder-only LLMs work end-to-end. Students will build a minimal GPT step by step and experiment with open-weight models to understand architecture design choices and post-training objectives. Lectures take an implementation-first approach and include guided coding exercises in PyTorch.
Learning outcomes
- End-to-end understanding of decoder LLM training and inference.
- Familiarity with architectural design choices in open-weight models.
- Practical understanding of inference and long-context bottlenecks.
- Ability to run a minimal post-training pipeline (SFT + preferences).
Simon Vary
Email: simon.vary@stats.ox.ac.uk
Web: simonvary.github.io
Place: Mathematical Institute, Univ. of Oxford
Bring: laptop with Python + PyTorch.
Registration link: forms.gle/VQ7zM99qwAeQ8YRD9.
Schedule
| Date | Time / Room | Lecture | Materials |
|---|---|---|---|
| Wed 4th March | 15:00–17:00, L4, MI | 1 — simpleGPT & basics: history, tokenizer, tensors, causal self-attention, training, metrics | Slides • Code |
| Wed 11th March | 15:00–17:00, L4, MI | 2 — Architecture design choices / open-weight models: recap of MHA + Transformer, position encoding (RoPE), normalization (pre/post-norm), dimensions | Slides • Code |
| Wed 18th March | 15:00–17:00, L5, MI | 3 — Inference: KV-cache & long context: prefill vs decode, KV-cache (compute vs memory), attention variants (MQA, GQA, MLA), long-context bottlenecks, speculative decoding | Slides |
| Wed 25th March | 15:00–17:00, L4, MI | 4 — Post-training: objectives, PEFT, and preferences: SFT, parameter-efficient fine-tuning (LoRA), preference learning (DPO), verifier-based rewards (RLVR), chain-of-thought (CoT) | Slides |
Lectures
Lecture 1 — simpleGPT & basics
- Topics: A short history of probabilistic language models; Tokenization (BPE); Attention and seq2seq; Transformers and causal self-attention; training, cross-entropy, perplexity.
- Code examples: Tokenizing text with tiktoken; PyTorch tensor basics; masking and scaled dot-product attention; implementing multi-head self-attention; building a minimal GPT-style model.
- References: Andrej Karpathy: Let's build the GPT Tokenizer, CS336 Lecture 1 [slides, lecture], tiktokenizer, Sutton's Bitter Lesson, A History of Large Language Models, The Illustrated Transformer
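The masking and scaled dot-product steps above can be sketched as a minimal single-head causal self-attention in PyTorch (the random weight matrices stand in for learned linear projections; this is an illustration, not the course's exact code):

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, d) sequence."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(d)                 # scaled dot products
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    weights = F.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ v

torch.manual_seed(0)
T, d = 4, 8
x = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```

Because of the causal mask, position 0 can only attend to itself, so its output equals its own value vector — a quick sanity check when debugging the mask.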
Lecture 2 — Architecture design choices / open-weight models
- Topics: Recap of Transformer + MHA; Positional encodings (RoPE); pre-norm, LayerNorm, RMSNorm; Activations (ReLU, GeLU, GLU, SwiGLU); common design ratios for width, depth, heads, and FFN size.
- Code examples: Adding RoPE to multi-head attention; Inspecting Qwen, generating text and examining the KV cache.
- References: CS336 Lecture 3 on Architectures [lecture], RoFormer [paper], On Layer Normalization in the Transformer Architecture [paper], Scaling Laws for Neural Language Models [paper]
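The RoPE exercise can be sketched as follows, rotating each consecutive pair of dimensions by a position-dependent angle in the style of RoFormer (a minimal sketch on a (T, d) tensor; real implementations apply this inside each attention head to q and k):

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (T, d), d even."""
    T, d = x.shape
    pos = torch.arange(T, dtype=torch.float32)               # positions 0..T-1
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)  # one frequency per pair
    angles = pos[:, None] * inv_freq[None, :]                # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # split into pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                       # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
x = torch.randn(6, 8)
y = rope(x)
```

Two properties worth checking: position 0 is left unrotated (all angles are zero), and the per-token norm is preserved, since each pair undergoes a pure rotation.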
Lecture 3 — Inference: KV-cache & long context
- Topics: Prefill vs decode; inference metrics (TTFT, latency, throughput); KV-cache and MQA, GQA, MLA, CLA; speculative decoding; sparse/local attention; extending context with YaRN.
- References: CS336 Lecture 10 [lecture], GQA [paper], Speculative Decoding [paper], MQA [paper], H2O [paper], YaRN [paper]
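The compute-vs-memory trade-off of the KV-cache can be illustrated with a toy decode loop: rather than recomputing keys and values for the whole prefix at every step, each new token's k/v rows are appended to a growing cache (the random tensors are stand-ins for the model's projections; this is a sketch, not a full decoder):

```python
import math
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Attend a single query vector over all cached keys/values."""
    scores = (q @ k.T) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d, steps = 8, 5
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
for t in range(steps):
    q = torch.randn(d)                       # new token's query projection
    k_new, v_new = torch.randn(1, d), torch.randn(1, d)
    k_cache = torch.cat([k_cache, k_new])    # cache grows one row per step
    v_cache = torch.cat([v_cache, v_new])
    out = attend(q, k_cache, v_cache)        # O(t) work per step, not O(t^2)
```

The cache turns decode into a memory-bound problem: its size grows linearly in sequence length (times layers and heads), which is exactly what MQA, GQA, and MLA aim to shrink.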
Lecture 4 — Post-training: objectives, PEFT, and preferences
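The parameter-efficient fine-tuning part of this lecture can be previewed with a minimal LoRA-style sketch: freeze a pretrained linear layer and learn only a low-rank update (the class name, rank, and scaling here are illustrative defaults, not the course code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update
    W x + (alpha / r) * B A x, with B zero-initialised."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

torch.manual_seed(0)
base = nn.Linear(16, 16)
lora = LoRALinear(base)
x = torch.randn(2, 16)
y = lora(x)
```

Because B starts at zero, the adapter is a no-op before training, so fine-tuning begins exactly at the pretrained model; only A and B (128 parameters here, versus 272 in the base layer) receive gradients.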
References
- Stanford CS336 (Spring 2025): cs336.stanford.edu/spring2025/
- nanoGPT (Karpathy): github.com/karpathy/nanoGPT