Hands-on introduction to LLMs
A short four-week introduction to how modern decoder-only LLMs work end-to-end. Students will build a minimal GPT step by step and experiment with open-weight models to understand architecture design choices and post-training objectives. Lectures take an implementation-first approach and include guided coding exercises in PyTorch.
Learning outcomes
- End-to-end understanding of decoder LLM training and inference.
- Familiarity with architectural design choices in open-weight models.
- Practical understanding of inference and long-context bottlenecks.
- Ability to run a minimal post-training pipeline (SFT + preferences).
Simon Vary
Email: simon.vary@stats.ox.ac.uk
Web: simonvary.github.io
Place: Mathematical Institute, Univ. of Oxford
Bring: laptop with Python + PyTorch.
Registration link: forms.gle/VQ7zM99qwAeQ8YRD9.
Schedule
| Date | Time / Room | Lecture | Materials |
|---|---|---|---|
| Wed 4th March | 15:00–17:00, L4, MI | 1 — simpleGPT & basics: history, tokenizer, tensors, causal self-attention, training, metrics | Slides • Code |
| Wed 11th March | 15:00–17:00, L4, MI | 2 — Architecture design choices / open-weight models: recap of MHA + Transformer, position encoding (RoPE), normalization (pre/post-norm), dimensions | Slides • Code |
| Wed 18th March | 15:00–17:00, L5, MI | 3 — Inference: KV-cache & long context: prefill vs decode, KV-cache (compute vs memory), attention variants (MQA, GQA, MLA), long-context bottlenecks, speculative decoding | Slides |
| Wed 25th March | 15:00–17:00, L4, MI | 4 — Post-training: objectives, PEFT, and preferences: SFT, parameter-efficient fine-tuning (LoRA), preference learning (DPO), verifier-based rewards (RLVR), chain-of-thought (CoT) | Slides |
Lectures
Lecture 1 — simpleGPT & basics
- Topics: A short history of probabilistic language models; Tokenization (BPE); Attention and seq2seq; Transformers and causal self-attention; training, cross-entropy, perplexity.
- Code examples: Tokenizing text with tiktoken; PyTorch tensor basics; masking and scaled dot-product attention; implementing multi-head self-attention; building a minimal GPT-style model.
- References: Andrej Karpathy: Let's build the GPT Tokenizer, CS336 Lecture 1 [slides, lecture], tiktokenizer, Sutton's Bitter Lesson, A History of Large Language Models, The Illustrated Transformer
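The masking and scaled dot-product steps above can be sketched as a minimal single-head causal self-attention in PyTorch (the random weight matrices stand in for learned linear projections; this is an illustration, not the course's exact code):

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, d) sequence."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(d)                 # scaled dot products
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    weights = F.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ v

torch.manual_seed(0)
T, d = 4, 8
x = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```

Because of the causal mask, position 0 can only attend to itself, so its output equals its own value vector — a quick sanity check when debugging the mask.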
Lecture 2 — Architecture design choices / open-weight models
- Topics: Recap of Transformer + MHA; Positional encodings (RoPE); pre-norm, LayerNorm, RMSNorm; Activations (ReLU, GeLU, GLU, SwiGLU); common design ratios for width, depth, heads, and FFN size.
- Code examples: Adding RoPE to multi-head attention; Inspecting Qwen, generating text and examining the KV cache.
- References: CS336 Lecture 3 on Architectures [lecture], RoFormer [paper], On Layer Normalization in the Transformer Architecture [paper], Scaling Laws for Neural Language Models [paper]
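The RoPE exercise can be sketched as follows, rotating each consecutive pair of dimensions by a position-dependent angle in the style of RoFormer (a minimal sketch on a (T, d) tensor; real implementations apply this inside each attention head to q and k):

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (T, d), d even."""
    T, d = x.shape
    pos = torch.arange(T, dtype=torch.float32)               # positions 0..T-1
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)  # one frequency per pair
    angles = pos[:, None] * inv_freq[None, :]                # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # split into pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                       # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
x = torch.randn(6, 8)
y = rope(x)
```

Two properties worth checking: position 0 is left unrotated (all angles are zero), and the per-token norm is preserved, since each pair undergoes a pure rotation.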
Lecture 3 — Inference: KV-cache & long context
- Topics: Prefill vs decode; inference metrics (TTFT, latency, throughput); KV-cache and MQA, GQA, MLA, CLA; speculative decoding; sparse/local attention; extending context with YaRN.
- References: CS336 Lecture 10 [lecture], GQA [paper], Speculative Decoding [paper], MQA [paper], H2O [paper], YaRN [paper]
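The compute-vs-memory trade-off of the KV-cache can be illustrated with a toy decode loop: rather than recomputing keys and values for the whole prefix at every step, each new token's k/v rows are appended to a growing cache (the random tensors are stand-ins for the model's projections; this is a sketch, not a full decoder):

```python
import math
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Attend a single query vector over all cached keys/values."""
    scores = (q @ k.T) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d, steps = 8, 5
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
for t in range(steps):
    q = torch.randn(d)                       # new token's query projection
    k_new, v_new = torch.randn(1, d), torch.randn(1, d)
    k_cache = torch.cat([k_cache, k_new])    # cache grows one row per step
    v_cache = torch.cat([v_cache, v_new])
    out = attend(q, k_cache, v_cache)        # O(t) work per step, not O(t^2)
```

The cache turns decode into a memory-bound problem: its size grows linearly in sequence length (times layers and heads), which is exactly what MQA, GQA, and MLA aim to shrink.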
Lecture 4 — Post-training: objectives, PEFT, and preferences
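The parameter-efficient fine-tuning part of this lecture can be previewed with a minimal LoRA-style sketch: freeze a pretrained linear layer and learn only a low-rank update (the class name, rank, and scaling here are illustrative defaults, not the course code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update
    W x + (alpha / r) * B A x, with B zero-initialised."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

torch.manual_seed(0)
base = nn.Linear(16, 16)
lora = LoRALinear(base)
x = torch.randn(2, 16)
y = lora(x)
```

Because B starts at zero, the adapter is a no-op before training, so fine-tuning begins exactly at the pretrained model; only A and B (128 parameters here, versus 272 in the base layer) receive gradients.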
References
- Stanford CS336 (Spring 2025): cs336.stanford.edu/spring2025/
- nanoGPT (Karpathy): github.com/karpathy/nanoGPT