Pre-training Objectives
Pre-training is the phase where a model learns general language understanding from massive, unlabelled text corpora. The training objective defines what the model is optimised to predict – this choice fundamentally shapes what the model learns and what it is naturally suited for.
| Objective | Model Type | Data Needed | Strength |
|---|---|---|---|
| Causal LM (CLM) / Next-token prediction | Decoder-only (GPT, Llama, Mistral) | Any text – predict the next token given all previous tokens | Natural fit for generation; any text is training signal; scales extremely well; self-supervised on raw internet data |
| Masked Language Modelling (MLM) | Encoder-only (BERT, RoBERTa) | Randomly mask ~15% of tokens; predict the masked tokens using both left and right context | Rich bidirectional representations; excellent for understanding and classification tasks; sample-efficient (multiple predictions per sequence) |
| Span Corruption | Encoder-decoder (T5) | Replace contiguous spans with sentinel tokens; decoder reconstructs the original spans | Trains both understanding (encoder) and generation (decoder) simultaneously; naturally denoising; strong on structured output tasks |
| Replaced Token Detection (RTD) | Encoder-only (ELECTRA) | A small generator replaces some tokens; a discriminator predicts which tokens are fake – a harder task, and every token is labelled | More efficient than MLM (~4× training efficiency); better downstream performance per FLOP; smaller models can match much larger BERT models |
Why CLM Dominates Modern LLMs
Next-token prediction has a beautiful property: any text document is automatically a training example with no annotation required. The model predicts token N+1 given tokens 1 through N. The "label" is always the next token in the document. This means petabytes of internet text, books, code, and scientific papers can all be training data with zero human labelling. It also turns out that accurately predicting the next token requires learning a deep model of the world โ grammar, facts, reasoning, and style all help predict what comes next.
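This property is easy to see in code. A minimal sketch (the function name and token IDs are illustrative, not from any real tokenizer):

```python
def next_token_pairs(token_ids):
    """Causal LM training pairs: position t predicts token t+1,
    so inputs are ids[:-1] and the 'free' labels are ids[1:]."""
    return token_ids[:-1], token_ids[1:]

ids = [12, 7, 99, 3, 41]               # a tokenised document
inputs, labels = next_token_pairs(ids)
# inputs = [12, 7, 99, 3], labels = [7, 99, 3, 41]
```

Every document yields one training example per token, with no annotation step anywhere.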
The Pre-training Process
Pre-training an LLM is one of the most compute-intensive tasks in existence. The process involves curating hundreds of terabytes of text, processing it through distributed training across thousands of GPUs, and monitoring for instabilities over weeks or months.
Dataset Scale & Sources
Modern LLMs are trained on trillions of tokens drawn from diverse sources:
- Common Crawl: petabyte-scale web text snapshots – the largest source but the noisiest
- Books (Books3, Gutenberg): long-form coherent reasoning; literary language
- GitHub / Stack: code; reasoning in structured, executable form
- Wikipedia: factual, encyclopedic; high signal-to-noise ratio
- ArXiv / PubMed: scientific reasoning; specialised domains
- StackExchange, Reddit: Q&A and conversational patterns
Data Cleaning Pipeline
Raw web data is extremely noisy. Pre-processing steps typically include:
- Language detection: filter to target languages; balance multilingual mix
- Quality filtering: perplexity filtering, heuristic rules (min token count, max repeat ratio)
- Deduplication: exact and near-duplicate removal (MinHash, suffix arrays); prevents memorisation of repeated content
- PII removal: strip emails, phone numbers, credentials
- Toxicity filtering: reduce harmful content in training data
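The deduplication step can be illustrated with a toy MinHash sketch in pure Python (the function names are invented for illustration; production pipelines use optimised libraries and tuned parameters):

```python
import hashlib

def shingles(text, n=5):
    """Character n-grams ('shingles') of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=64, n=5):
    """Toy MinHash: for each seeded hash function, keep the minimum
    hash value over the document's shingle set."""
    grams = shingles(text, n)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature("the quick brown fox jumps over the lazy dog")
sig_b = minhash_signature("the quick brown fox jumps over the lazy cat")
# near-duplicates share most shingles, so most signature slots match
```

Documents whose estimated similarity exceeds a threshold are flagged as near-duplicates; comparing compact signatures is far cheaper than comparing full shingle sets pairwise.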
Training Instability
Long pre-training runs encounter instabilities: loss spikes (sudden jumps in training loss), gradient explosions, and catastrophic divergence. Common mitigations:
- Gradient clipping: cap gradient norm to 1.0
- Learning rate warmup: slowly ramp up LR from 0 over ~2000 steps
- Checkpoint averaging: average weights from recent checkpoints
- Loss spike recovery: roll back to last stable checkpoint and skip bad data batch
- Low precision issues: BF16 preferred over FP16 (wider exponent range)
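As a sketch of the first mitigation, global-norm gradient clipping treats all gradients as one vector and rescales it when the norm spikes (illustrative pure-Python version operating on a flat list of gradient values):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Gradient clipping: if the combined L2 norm of all gradients exceeds
    max_norm, scale every gradient down so the norm equals max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

# A spiking gradient [3.0, 4.0] has norm 5.0; clipped to norm 1.0 it
# keeps its direction (approximately [0.6, 0.8])
clipped = clip_by_global_norm([3.0, 4.0])
```

Because the whole vector is rescaled uniformly, the update direction is preserved – only its magnitude is capped.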
Chinchilla Scaling Laws and Beyond
Hoffmann et al. (DeepMind, 2022) published the landmark "Chinchilla" paper showing that earlier large models were significantly undertrained relative to their size. For a given training budget, the compute-optimal allocation uses roughly 20 tokens of training data per parameter: a 7B-parameter model is optimally trained on ~140B tokens, while GPT-3 (175B params) was trained on only 300B tokens – far fewer than the ~3.5T the Chinchilla law suggests. Llama 3 8B was trained on over 15 trillion tokens, far beyond Chinchilla-optimal, because the resulting model performs better at inference even though it costs more to train. DeepSeek-V3 was trained on approximately 14.8 trillion tokens; Qwen2.5 models used 18 trillion. The 2024–2025 consensus: for models that will be served at inference time (where inference cost matters as much as training cost), significant "over-training" beyond Chinchilla-optimal is beneficial – a lesson confirmed across the Llama, Mistral, Phi-4, and DeepSeek model families.
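The Chinchilla arithmetic is simple to check (the helper name is ours):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

assert chinchilla_optimal_tokens(7e9) == 140e9       # 7B params -> 140B tokens
assert chinchilla_optimal_tokens(175e9) == 3.5e12    # GPT-3 size -> 3.5T tokens
```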
Instruction Fine-tuning (SFT)
A pre-trained base model is a powerful text completion engine – it will continue any prompt in a plausible way. But it isn't naturally an assistant. Given "What is the capital of France?", a base model might continue: "What is the capital of Germany? What is the capital of Spain?" – completing a list-style question format it has seen, rather than answering. Instruction fine-tuning (supervised fine-tuning, SFT) teaches the model to respond helpfully to instructions.
The SFT Data Format
SFT training uses (instruction, response) pairs – demonstrations of the desired behaviour formatted as conversations. Modern chat models use special tokens to delimit turns:
```
<|system|>
You are a helpful assistant.
<|user|>
Explain gradient descent in simple terms.
<|assistant|>
Gradient descent is an optimisation algorithm that finds the minimum of a function by repeatedly taking small steps in the direction that reduces the function value most steeply...
<|end|>
```
Only the assistant turns contribute to the loss – the model learns to predict the high-quality response given the instruction.
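A common way to implement this is the label-masking convention used by PyTorch-style cross-entropy, where positions labelled -100 are ignored. A toy sketch with invented token IDs:

```python
IGNORE_INDEX = -100  # the label value PyTorch's cross-entropy ignores by convention

def mask_non_assistant(token_ids, is_assistant):
    """Keep token ids as labels only inside assistant turns; every other
    position is masked out and contributes nothing to the loss."""
    return [t if keep else IGNORE_INDEX for t, keep in zip(token_ids, is_assistant)]

ids  = [5, 6, 7, 8, 9]
mask = [False, False, True, True, True]   # last three tokens are the reply
# mask_non_assistant(ids, mask) -> [-100, -100, 7, 8, 9]
```

The model still attends over the full conversation; only the gradient signal is restricted to the response tokens.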
Dataset Quality vs Quantity
Early SFT research (Alpaca, Dolly) showed that even small high-quality datasets (Alpaca's ~52k examples) dramatically improve instruction following. Zhou et al. (LIMA, 2023) demonstrated that 1,000 carefully curated examples can produce competitive results – the base model already "knows" the information; SFT just teaches the format.
Key quality signals for SFT data: diverse task coverage, factual accuracy, appropriate length, no contradictions with other examples, and natural conversational flow. Popular open-source datasets: OpenHermes 2.5, Ultrachat, WizardLM/Evol-Instruct, ShareGPT.
Fine-tuning Hyperparameters
SFT uses much lower learning rates than pre-training to avoid catastrophic forgetting of the base model's knowledge:
- Learning rate: 1e-5 to 2e-5 (vs 3e-4 for pre-training)
- Epochs: 1–3 (more risks overfitting on small datasets)
- Cosine decay: LR decays to ~10% of peak by end
- Batch size: 64–256 sequences, often with sequence packing
- Max length: match pre-training context length
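The warmup-plus-cosine schedule from the list above can be sketched as a single function (names and defaults are illustrative):

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100, floor_frac=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay down to
    floor_frac * peak_lr by the final step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

The LR is 0 at step 0, hits the peak exactly when warmup ends, and glides to ~10% of peak at the final step.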
RLHF & Alignment
Instruction fine-tuning teaches a model to respond to prompts, but it doesn't teach the model to be good – helpful, harmless, and honest. SFT on human demonstrations is limited by the quality of those demonstrations and the difficulty of articulating what "good" means as data. Reinforcement Learning from Human Feedback (RLHF) instead learns a model of human preferences and optimises for it.
Step 1: Reward Model Training
Human labellers compare pairs of model responses to the same prompt and indicate which is better. A separate reward model (typically a smaller transformer) is trained on these preference pairs to predict human preference scores. The reward model learns a scalar measure of response quality that correlates with human judgement across diverse scenarios: helpfulness, accuracy, harmlessness, and style.
```
# Preference pair example
prompt: "Write a poem about AI"
response_A: "Circuits hum and data streams,
             A mind of silicon dreams..."
response_B: "AI AI AI AI AI AI AI..."
# Human label: A is better
# Reward model learns: score(A) > score(B)
```
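Reward models are typically trained with a Bradley-Terry-style pairwise loss; a minimal sketch of its shape (the function name is ours):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(score_chosen - score_rejected).
    Widening the correct margin pushes the loss toward zero."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the correct margin, the smaller the loss
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
```

Note the reward model only needs to produce *relative* scores; the absolute scale is fixed implicitly by the sigmoid.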
Step 2: PPO Optimisation
The SFT model is used as the starting policy. PPO (Proximal Policy Optimisation) generates completions, scores them with the reward model, and updates the policy to produce higher-scoring outputs. A KL divergence penalty against the original SFT model prevents the policy from deviating too far (reward hacking: outputting nonsensical text that fools the reward model but is worthless to humans).
PPO for LLMs is complex and unstable: it requires loading the policy model, reward model, value model, and reference model simultaneously – four copies of model weights.
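The KL-penalised reward can be sketched as follows (a simplified per-token estimator; real implementations vary in how the KL term is estimated and applied):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RLHF reward shaping sketch: reward-model score minus a KL-style
    penalty for tokens where the policy drifts from the frozen reference."""
    kl_estimate = logp_policy - logp_ref   # simple per-token KL estimator
    return rm_score - beta * kl_estimate

# No drift -> no penalty: shaped_reward(1.0, -2.0, -2.0) == 1.0
# One nat of drift costs beta: shaped_reward(1.0, -1.0, -2.0) ~= 0.9
```

The coefficient beta controls the trade-off: too low and the policy reward-hacks; too high and it barely moves from the SFT model.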
DPO: Simpler Alternative
Direct Preference Optimisation (Rafailov et al., 2023) reformulates RLHF as a simple supervised learning problem. Given preference pairs (chosen, rejected), DPO directly updates the language model to increase the probability of chosen responses and decrease that of rejected ones, using a mathematical equivalence between the RLHF objective and a classification loss. No reward model, no PPO, no value function – just two model copies and a cross-entropy-style loss. DPO has become the default alignment method for most open-source models.
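The core of the DPO objective fits in a few lines (a simplified sketch taking summed sequence log-probs as scalar inputs; the function name is ours):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO sketch: a logistic loss on how much the policy (pi) shifts
    log-probability toward the chosen response, measured relative to the
    frozen reference model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is 0 and the loss is log 2; shifting probability mass toward chosen responses (relative to the reference) drives it down.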
| Property | SFT Only | SFT + RLHF (PPO) | SFT + DPO |
|---|---|---|---|
| Data required | Demonstration pairs (instruction → response) | Demonstrations + preference comparisons (to train the reward model) | Demonstrations + preference comparison pairs |
| Implementation complexity | Low – standard supervised fine-tuning | Very high – 4 model copies, reward model training, PPO stabilisation | Low–medium – 2 model copies, standard gradient descent |
| Memory requirement | 1× model | 4× model (policy, value, reward, reference) | 2× model (policy + reference) |
| Result quality | Good instruction following; may still produce harmful outputs | Highest alignment; nuanced preference satisfaction; ChatGPT, Claude use this | Good alignment; slightly less expressive than PPO but much cheaper; Llama 3, Mistral Instruct |
GRPO: Reasoning Model Training (DeepSeek-R1)
Group Relative Policy Optimisation (GRPO), introduced in DeepSeek's DeepSeekMath paper (2024) and central to DeepSeek-R1 (2025), is a PPO variant designed to train reasoning models. The key innovation: instead of using a separate value/critic network, GRPO estimates the baseline by sampling a group of responses for each prompt and computing their average reward. Each response is then optimised relative to its group peers – high-reward responses are reinforced, low-reward ones are suppressed.
DeepSeek-R1-Zero demonstrated that applying GRPO directly on a base model without SFT warm-up produces emergent reasoning behaviour: the model spontaneously develops long chain-of-thought patterns, self-reflection, and backtracking – purely from a reward signal on correctness-verifiable tasks (maths, code). This pure-RL approach challenges the assumption that supervised demonstration data is required to bootstrap reasoning. OpenAI's o1/o3 series is believed to use a similar RL-based process, though the exact method is undisclosed. The practical implication: for domains where answers can be automatically verified (maths, code execution), RL training can produce far stronger reasoning than SFT on human demonstrations alone.
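The group-relative baseline at the heart of GRPO is a few lines of arithmetic (a simplified sketch; the full algorithm combines these advantages with a PPO-style clipped policy update):

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO's critic-free baseline: normalise each sampled response's reward
    by the mean and standard deviation of its own group of samples."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, reward 1 if the final answer verifies
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, wrong ones negative
```

Because the baseline comes from the group itself, no value network is trained or stored – one of the four PPO model copies simply disappears.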
Parameter-Efficient Fine-tuning (PEFT)
Full fine-tuning updates every parameter in the model. For a 70B model, that means computing and storing gradients for 70 billion parameters at every step, requiring multiple 80GB GPUs just to hold the optimiser states. PEFT methods dramatically reduce the number of trained parameters while achieving comparable results.
LoRA: Low-Rank Adaptation
Hu et al. (2021) observed that the weight updates during fine-tuning have low intrinsic rank – the change ΔW can be well approximated by a low-rank matrix product BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with r ≪ min(d, k).
```
# Original weight: W (frozen)
# LoRA adds: ΔW = B × A
# Forward pass: h = W·x + (B·A)·x · (alpha/r)
# Only B and A are trained
# W stays frozen – original knowledge preserved
# r=8: 2×(4096×8) = 65,536 params
# vs full W: 4096×4096 = 16.7M params
# Reduction: 256×
```
At inference, LoRA can be merged back: W' = W + BA, so there is zero extra latency. Multiple LoRA adapters can be swapped at runtime for different task specialisations.
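The merge identity W' = W + BA can be verified with toy matrices (naive pure-Python linear algebra, with rank-1 numbers chosen purely for illustration):

```python
def matmul(X, Y):
    """Naive matrix product, just for the sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def madd(X, Y):
    """Element-wise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
B = [[0.5], [1.0]]             # LoRA down-projection (2x1, rank 1)
A = [[2.0, 0.0]]               # LoRA up-projection (1x2)
x = [[3.0], [4.0]]             # an input vector

delta = matmul(B, A)                              # ΔW = B·A
unmerged = madd(matmul(W, x), matmul(delta, x))   # W·x + (B·A)·x
merged = matmul(madd(W, delta), x)                # (W + B·A)·x
# Both paths give the same output, so merging adds zero inference latency
```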
QLoRA: Quantisation + LoRA
Dettmers et al. (2023) combined LoRA with 4-bit quantisation of the frozen base model weights. The base model is stored in NF4 (4-bit NormalFloat), consuming ~4GB for a 7B model (vs ~14GB in FP16). LoRA adapters remain in full BF16 precision. Double quantisation further reduces memory by quantising the quantisation constants.
Result: fine-tuning a 65B Llama model on a single 48GB GPU – previously requiring 8× A100s. QLoRA democratised fine-tuning for researchers and practitioners without access to multi-GPU infrastructure.
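The memory arithmetic is straightforward (raw weight storage only; real usage adds activations, adapter weights, and quantisation-constant overhead):

```python
def weight_gigabytes(n_params, bits_per_param):
    """Approximate raw weight storage in GB, ignoring all overheads."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = weight_gigabytes(7e9, 16)   # 14.0 GB for a 7B model in FP16
nf4  = weight_gigabytes(7e9, 4)    # 3.5 GB at 4 bits per weight
```

The 4× shrink in base-model weights is what moves large-model fine-tuning from clusters onto single GPUs.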
| Method | Trainable Params | Memory vs Full FT | Quality | Best For |
|---|---|---|---|---|
| Full Fine-tuning | 100% (all parameters) | Baseline (highest) | Best – upper bound | When you have compute budget and want maximum quality |
| LoRA (r=16) | ~0.1–1% of total | ~3–4× reduction in optimiser states | Near-full quality on most tasks | Multi-GPU fine-tuning; task adaptation; style tuning |
| QLoRA (4-bit + r=64) | ~0.1–1% of total | ~8–10× reduction total | Slightly below LoRA; still strong | Single-GPU fine-tuning; consumer hardware; rapid prototyping |
| Prompt Tuning / Prefix Tuning | <0.01% (just soft prompt tokens) | Minimal – base model frozen | Decent for some tasks; poor generalisation | Multi-tenant serving: swap prompt tokens per task without changing the model |
LoRA Is the Default for Local Fine-tuning
The combination of Hugging Face's peft library, trl (for SFT, DPO, and GRPO training), and bitsandbytes (for quantisation) has made QLoRA fine-tuning accessible to anyone with a single GPU with 8GB+ VRAM. A typical workflow: download a model such as Phi-4 14B (or Llama 3.3 70B, given enough VRAM), apply 4-bit NF4 quantisation, attach LoRA adapters to the Q and V projection matrices (and optionally K, O, and the FFN up/gate projections), train for 1–3 epochs on a domain-specific dataset, optionally merge the adapters back into the weights, and deploy. The entire process requires no custom CUDA code, and a 14B model fine-tunes comfortably on a consumer RTX 4090. As of 2025, DoRA (Weight-Decomposed Low-Rank Adaptation) is an increasingly popular extension that decomposes weight updates into magnitude and direction components, further improving convergence stability over standard LoRA.