LLM Introduction Course (MI, University of Oxford)

Hands-on introduction to LLMs

A short four-week introduction to how modern decoder-only LLMs work end-to-end. Students will build a minimal GPT step by step and experiment with open-weight models to understand architecture design choices and post-training objectives. Lectures take an implementation-first approach and include guided coding exercises in PyTorch.

Learning outcomes

  • Understand decoder-LLM training and inference end-to-end.
  • Recognise architectural design choices in open-weight models.
  • Identify inference and long-context bottlenecks in practice.
  • Run a minimal post-training pipeline (SFT + preference learning).

Simon Vary

Email: simon.vary@stats.ox.ac.uk
Web: simonvary.github.io
Place: Mathematical Institute, Univ. of Oxford


Bring: laptop with Python + PyTorch.
Registration link: forms.gle/VQ7zM99qwAeQ8YRD9.

Schedule

Wed 4th March, 15:00–17:00, L4 (MI)
Lecture 1 — simpleGPT & basics: history, tokenizer, tensors, causal self-attention, training, metrics.
Materials: Slides, Code

Wed 11th March, 15:00–17:00, L4 (MI)
Lecture 2 — Architecture design choices / open-weight models: recap of MHA + Transformer, positional encoding (RoPE), normalization (pre- vs post-norm), dimensions.
Materials: Slides, Code

Wed 18th March, 15:00–17:00, L5 (MI)
Lecture 3 — Inference: KV-cache & long context: prefill vs decode, KV-cache (compute vs memory), attention variants (MQA, GQA, MLA), long-context bottlenecks, speculative decoding.
Materials: Slides

Wed 25th March, 15:00–17:00, L4 (MI)
Lecture 4 — Post-training: objectives, PEFT, and preferences: SFT, parameter-efficient fine-tuning (LoRA), preference learning (DPO), verifier-based rewards (RLVR), chain-of-thought (CoT).
Materials: Slides

Lectures

Lecture 1 — simpleGPT & basics
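The schedule lists causal self-attention among this lecture's topics. As a flavour of the guided exercises, here is a minimal single-head sketch in PyTorch (not the course code; the function and weight names are illustrative): a causal mask blocks each position from attending to later tokens.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq, dim); single-head scaled dot-product attention
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    seq = x.size(1)
    # upper-triangular mask: position i may not attend to positions j > i
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because of the mask, perturbing a later token leaves the outputs at earlier positions unchanged, which is what makes left-to-right training and generation possible.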

Lecture 2 — Architecture design choices / open-weight models

  • Topics: Recap of Transformer + MHA; positional encodings (RoPE); normalization placement (pre- vs post-norm) with LayerNorm and RMSNorm; activations (ReLU, GeLU, GLU, SwiGLU); common design ratios for width, depth, heads, and FFN size.
  • Code examples: Adding RoPE to multi-head attention; Inspecting Qwen, generating text and examining the KV cache.
  • References: CS336 Lecture 3 on Architectures [lecture], RoFormer [paper], On Layer Normalization in the Transformer Architecture [paper], Scaling Laws for Neural Language Models [paper]
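The "adding RoPE to multi-head attention" exercise above can be sketched as follows (a simplified stand-in, not the course solution; the split-half channel layout is one common convention): each pair of channels is rotated by an angle proportional to the token's position, so the q·k dot product depends only on relative position.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by a
    # position-dependent angle (RoPE, RoFormer-style)
    b, h, seq, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1[i], x2[i]) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

The key property, and a good thing to unit-test in the exercise, is that the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n.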

Lecture 3 — Inference: KV-cache & long context

  • Topics: Prefill vs decode; inference metrics (TTFT, latency, throughput); KV-cache and MQA, GQA, MLA, CLA; speculative decoding; sparse/local attention; extending context with YaRN.
  • References: CS336 Lecture 10 [lecture], GQA [paper], Speculative Decoding [paper], MQA [paper], H2O [paper], YaRN [paper]
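The KV-cache idea in the topics above can be demonstrated in a few lines (a toy single-head sketch with illustrative names, assuming hidden states are already given): during decode, only the new token's q/k/v are computed, and the cache supplies keys and values for everything already processed. The result is identical to a full causal recompute, just without the redundant work.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, mask=None):
    # standard scaled dot-product attention
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def decode_step(x_new, k_cache, v_cache, w_q, w_k, w_v):
    # one decode step: compute q/k/v only for the NEW token,
    # append its k/v to the cache, attend over the full cache
    q = x_new @ w_q
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=0)
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=0)
    return attend(q, k_cache, v_cache), k_cache, v_cache
```

Prefill corresponds to filling the cache for the whole prompt in one batched pass; decode then appends one row per generated token, trading memory (the growing cache) for compute.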

Lecture 4 — Post-training: objectives, PEFT, and preferences
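The schedule lists LoRA among this lecture's topics. As a minimal sketch of the idea (illustrative class and parameter names, not the course code): the pretrained weight is frozen and a trainable low-rank update W + (α/r)·BA is learned instead, with B zero-initialized so the model starts out unchanged.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank adapter."""

    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # zero init for B: the adapter contributes nothing at the start
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

The trainable parameter count is r·(in_features + out_features), typically a small fraction of the frozen weight, which is what makes per-task fine-tuning cheap.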
