โฑ 12 min read ๐Ÿ“Š Intermediate ๐Ÿ—“ Updated Jan 2025

🎯 Pre-training Objectives

Pre-training is the phase where a model learns general language understanding from massive, unlabelled text corpora. The training objective defines what the model is optimised to predict — this choice fundamentally shapes what the model learns and what it is naturally suited for.

Objective, model type, data needed, and strength:

  • Causal LM (CLM) / next-token prediction, decoder-only (GPT, Llama, Mistral). Data: any text — predict the next token given all previous. Strength: natural fit for generation; any text is training signal; scales extremely well; self-supervised on raw internet data.
  • Masked Language Modelling (MLM), encoder-only (BERT, RoBERTa). Data: randomly mask ~15% of tokens; predict the masked tokens using both left and right context. Strength: rich bidirectional representations; excellent for understanding and classification tasks; sample-efficient (multiple predictions per sequence).
  • Span Corruption, encoder-decoder (T5). Data: replace contiguous spans with sentinel tokens; the decoder reconstructs the original spans. Strength: trains both understanding (encoder) and generation (decoder) simultaneously; naturally denoising; strong on structured-output tasks.
  • Replaced Token Detection (RTD), encoder-only (ELECTRA). Data: a small generator replaces some tokens; a discriminator predicts which tokens are fake — a harder task in which every token is labelled. Strength: more efficient than MLM (~4× training efficiency); better downstream performance per FLOP; smaller models can match much larger BERT models.
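
The MLM corruption step can be sketched in a few lines (a simplified version: the real BERT recipe also leaves some selected tokens unchanged or swaps in random ones):

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Mask ~15% of positions; return the corrupted sequence plus a map
    of position -> original token, so loss is computed only where masked."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    return corrupted, targets
```

The model sees `corrupted` and is trained to recover `targets` at the masked positions using context from both sides.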

Why CLM Dominates Modern LLMs

Next-token prediction has a beautiful property: any text document is automatically a training example with no annotation required. The model predicts token N+1 given tokens 1 through N. The "label" is always the next token in the document. This means petabytes of internet text, books, code, and scientific papers can all be training data with zero human labelling. It also turns out that accurately predicting the next token requires learning a deep model of the world — grammar, facts, reasoning, and style all help predict what comes next.
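
Concretely, each position in a document is its own training example; a toy sketch (the token ids here are made-up values):

```python
def next_token_examples(token_ids):
    """Every position in a document is a free training example:
    predict token i from the prefix tokens [0, i)."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical token ids for a short document
examples = next_token_examples([17, 204, 99, 31])
# -> [([17], 204), ([17, 204], 99), ([17, 204, 99], 31)]
```

In practice a transformer computes all of these predictions in one forward pass via causal masking, but the training signal is exactly this shifted pairing.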

๐Ÿ‹๏ธ The Pre-training Process

Pre-training an LLM is one of the most compute-intensive tasks in existence. The process involves curating hundreds of terabytes of text, processing it through distributed training across thousands of GPUs, and monitoring for instabilities over weeks or months.

Dataset Scale & Sources

Modern LLMs are trained on trillions of tokens drawn from diverse sources:

  • Common Crawl: petabyte-scale web text snapshots — largest source but noisiest
  • Books (Books3, Gutenberg): long-form coherent reasoning; literary language
  • GitHub / Stack: code; reasoning in structured, executable form
  • Wikipedia: factual, encyclopedic; high signal-to-noise ratio
  • ArXiv / PubMed: scientific reasoning; specialised domains
  • StackExchange, Reddit: Q&A and conversational patterns

Data Cleaning Pipeline

Raw web data is extremely noisy. Pre-processing steps typically include:

  • Language detection: filter to target languages; balance multilingual mix
  • Quality filtering: perplexity filtering, heuristic rules (min token count, max repeat ratio)
  • Deduplication: exact and near-duplicate removal (MinHash, suffix arrays); prevents memorisation of repeated content
  • PII removal: strip emails, phone numbers, credentials
  • Toxicity filtering: reduce harmful content in training data
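
To illustrate the near-duplicate step, documents can be compared by the Jaccard similarity of their character-shingle sets, which is exactly the quantity MinHash approximates at scale (a naive all-pairs sketch, not a production pipeline):

```python
def shingles(text, n=5):
    """Character n-gram shingles of a whitespace-normalised document."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.8):
    """Naive all-pairs detection; MinHash approximates the same
    similarity without the quadratic number of set comparisons."""
    dupes = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(shingles(docs[i]), shingles(docs[j])) >= threshold:
                dupes.append((i, j))
    return dupes
```

Exact deduplication (hashing whole documents or suffix-array substring matching) catches verbatim repeats; shingle similarity catches the boilerplate-with-small-edits pages that dominate web crawls.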

Training Instability

Long pre-training runs encounter instabilities: loss spikes (sudden jumps in training loss), gradient explosions, and catastrophic divergence. Common mitigations:

  • Gradient clipping: cap gradient norm to 1.0
  • Learning rate warmup: slowly ramp up LR from 0 over ~2000 steps
  • Checkpoint averaging: average weights from recent checkpoints
  • Loss spike recovery: roll back to last stable checkpoint and skip bad data batch
  • Low precision issues: BF16 preferred over FP16 (wider exponent range)
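
The first two mitigations are simple enough to sketch directly; these are pure-Python stand-ins for what training frameworks do per tensor:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return list(grads)

def warmup_lr(step, peak_lr=3e-4, warmup_steps=2000):
    """Linear warmup from 0 to the peak LR over the first ~2000 steps."""
    return peak_lr * min(1.0, step / warmup_steps)
```

Clipping rescales the entire gradient rather than truncating individual components, so the update direction is preserved while its magnitude is capped.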

Chinchilla Scaling Laws and Beyond

Hoffmann et al. (DeepMind, 2022) published the landmark "Chinchilla" paper showing that earlier large models were significantly undertrained relative to their size. For a given training budget, the compute-optimal allocation uses roughly 20 tokens of training data per parameter: a 7B-parameter model is optimally trained on ~140B tokens, while GPT-3 (175B params) was trained on just 300B tokens — far fewer than the ~3.5T the Chinchilla law suggests.

Modern practice goes well beyond this. Llama 3 8B was trained on over 15 trillion tokens, far beyond Chinchilla-optimal, because the resulting model performs better at inference — even if it cost more to train. DeepSeek-V3 was trained on approximately 14.8 trillion tokens; Qwen2.5 models used 18 trillion tokens. The 2024–2025 consensus: for models that will be served heavily at inference time, significant "over-training" beyond Chinchilla-optimal is worthwhile, because extra training compute buys quality at a fixed inference cost — a lesson confirmed across the Llama, Mistral, Phi-4, and DeepSeek model families.
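
The rule of thumb above reduces to a one-line calculation:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(7e9) / 1e9)     # 7B params  -> 140.0 (billion tokens)
print(chinchilla_optimal_tokens(175e9) / 1e12)  # GPT-3 175B -> 3.5 (trillion tokens)
```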

🎓 Instruction Fine-tuning (SFT)

A pre-trained base model is a powerful text completion engine — it will continue any prompt in a plausible way. But it isn't naturally an assistant. Given "What is the capital of France?", a base model might continue: "What is the capital of Germany? What is the capital of Spain?" — completing a list-style question format it has seen, rather than answering. Instruction fine-tuning (supervised fine-tuning, SFT) teaches the model to respond helpfully to instructions.

The SFT Data Format

SFT training uses (instruction, response) pairs — demonstrations of the desired behaviour formatted as conversations. Modern chat models use special tokens to delimit turns:

<|system|>
You are a helpful assistant.
<|user|>
Explain gradient descent in simple terms.
<|assistant|>
Gradient descent is an optimisation algorithm
that finds the minimum of a function by
repeatedly taking small steps in the direction
that reduces the function value most steeply...
<|end|>

Only the assistant turns contribute to the loss — the model learns to predict the high-quality response given the instruction.
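
This loss masking is typically implemented by setting non-assistant positions to an ignore index (-100 is the PyTorch cross-entropy convention); a minimal sketch:

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(token_ids, assistant_mask):
    """Copy the input ids as labels, but mask out every token that is
    not part of an assistant turn, so only responses are learned."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(token_ids, assistant_mask)]

# Hypothetical ids: first two tokens are the user turn, last two the reply
labels = build_labels([5, 6, 7, 8], [False, False, True, True])
# -> [-100, -100, 7, 8]
```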

Dataset Quality vs Quantity

Early SFT research (Alpaca, Dolly) showed that even small high-quality datasets (~52k examples) dramatically improve instruction following. Zhou et al. (2023, LIMA) demonstrated that 1,000 carefully curated examples can produce competitive results — the base model already "knows" the information; SFT just teaches the format.

Key quality signals for SFT data: diverse task coverage, factual accuracy, appropriate length, no contradictions with other examples, and natural conversational flow. Popular open-source datasets: OpenHermes 2.5, UltraChat, WizardLM/Evol-Instruct, ShareGPT.

Fine-tuning Hyperparameters

SFT uses much lower learning rates than pre-training to avoid catastrophic forgetting of the base model's knowledge:

  • Learning rate: 1e-5 to 2e-5 (vs 3e-4 for pre-training)
  • Epochs: 1–3 (more risks overfitting on small datasets)
  • Cosine decay: LR decays to ~10% of peak by end
  • Batch size: 64–256 sequences, often with sequence packing
  • Max length: match pre-training context length
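
Sequence packing, mentioned above, concatenates short examples into full-length rows so padding doesn't waste compute. A greedy first-fit sketch (it assumes each example fits within max_len; real packers also handle attention masking across example boundaries):

```python
def pack_sequences(seqs, max_len):
    """Greedily pack tokenised examples into rows of at most max_len tokens."""
    rows, current = [], []
    for seq in seqs:
        if len(current) + len(seq) > max_len and current:
            rows.append(current)   # current row is full; start a new one
            current = []
        current = current + seq
    if current:
        rows.append(current)
    return rows

rows = pack_sequences([[1, 2], [3, 4, 5], [6]], max_len=4)
# -> [[1, 2], [3, 4, 5, 6]]
```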

๐Ÿค RLHF & Alignment

Instruction fine-tuning teaches a model to respond to prompts, but it doesn't teach the model to be good — helpful, harmless, and honest. SFT on human demonstrations is limited by the quality of those demonstrations and the difficulty of articulating what "good" means as data. Reinforcement Learning from Human Feedback (RLHF) instead learns a model of human preferences and optimises for it.

Step 1: Reward Model Training

Human labellers compare pairs of model responses to the same prompt and indicate which is better. A separate reward model (typically a smaller transformer) is trained on these preference pairs to predict human preference scores. The reward model learns a scalar measure of response quality that correlates with human judgement across diverse scenarios: helpfulness, accuracy, harmlessness, and style.

# Preference pair example
prompt: "Write a poem about AI"

response_A: "Circuits hum and data streams,
             A mind of silicon dreams..."

response_B: "AI AI AI AI AI AI AI..."

# Human label: A is better
# Reward model learns: score(A) > score(B)
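
The reward model is typically trained with a Bradley-Terry style pairwise loss, minimised when the preferred response scores higher; a scalar sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_pair_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(score_chosen - score_rejected).
    Driving the chosen score above the rejected score reduces the loss."""
    return -math.log(sigmoid(score_chosen - score_rejected))
```

Only the score difference matters, so the reward model learns a relative ordering of responses rather than an absolute quality scale.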

Step 2: PPO Optimisation

The SFT model is used as the starting policy. PPO (Proximal Policy Optimisation) generates completions, scores them with the reward model, and updates the policy to produce higher-scoring outputs. A KL divergence penalty against the original SFT model prevents the policy from deviating too far (reward hacking: outputting nonsensical text that fools the reward model but is worthless to humans).

PPO for LLMs is complex and unstable: it requires loading the policy model, reward model, value model, and reference model simultaneously — 4 copies of model weights.
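
Schematically, the quantity PPO maximises combines the reward model's score with the KL penalty; a per-token sketch where beta trades off reward against drift from the SFT reference:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward used in PPO-style RLHF: the reward model score minus a
    KL-style penalty for drifting from the reference (SFT) model."""
    kl_term = logp_policy - logp_ref  # > 0 where policy raises prob vs reference
    return rm_score - beta * kl_term
```

If the policy inflates the probability of tokens the reference model considers unlikely (the signature of reward hacking), the penalty term eats into the reward.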

DPO: Simpler Alternative

Direct Preference Optimisation (Rafailov et al., 2023) reformulates RLHF as a simple supervised learning problem. Given preference pairs (chosen, rejected), DPO directly updates the language model to increase the probability of chosen responses and decrease rejected ones — using a mathematical equivalence between the RLHF objective and a classification loss. No reward model, no PPO, no value function — just two model copies and cross-entropy loss. DPO has become the default alignment method for most open-source models.
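
The DPO objective on a single preference pair can be written directly from sequence log-probabilities under the policy and the frozen reference model (a scalar sketch):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one pair: push the policy's (chosen - rejected)
    log-prob margin above the reference model's margin."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy still equals the reference, the margin is zero and the loss is log 2; widening the margin on the chosen side drives it down, which is exactly the implicit reward-maximisation of RLHF in supervised form.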

Comparison of the three recipes:

  • SFT only. Data required: demonstration pairs (instruction → response). Implementation complexity: low (standard supervised fine-tuning). Memory: 1× model. Result quality: good instruction following; may still produce harmful outputs.
  • SFT + RLHF (PPO). Data required: demonstrations + preference comparisons + reward-model labels. Implementation complexity: very high (4 model copies, reward model training, PPO stabilisation). Memory: 4× model + reward model. Result quality: highest alignment; nuanced preference satisfaction; ChatGPT and Claude use this.
  • SFT + DPO. Data required: demonstrations + preference comparison pairs. Implementation complexity: low-medium (2 model copies, standard gradient descent). Memory: 2× model (policy + reference). Result quality: good alignment; slightly less expressive than PPO but much cheaper; used by Llama 3 and Mistral Instruct.

GRPO: Reasoning Model Training (DeepSeek-R1)

Group Relative Policy Optimisation (GRPO), introduced in DeepSeek's DeepSeekMath paper (2024) and used to train DeepSeek-R1 (2025), is a PPO variant specifically designed to train reasoning models. The key innovation: instead of using a separate value/critic network, GRPO estimates the baseline by sampling a group of responses for each prompt and computing their average reward. Each response is then optimised relative to its group peers — high-reward responses are reinforced, low-reward ones are suppressed.
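
The group-relative baseline amounts to normalising each sampled response's reward against its own group's statistics, with no critic network involved (a sketch; the exact normalisation details in DeepSeek's implementation may differ):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each response relative to the mean
    (and std) of the group sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group's average get positive advantage and are reinforced; below-average responses are suppressed, all without training a value model.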

DeepSeek-R1-Zero demonstrated that applying GRPO directly to a base model, without SFT warm-up, can elicit emergent reasoning behaviour: the model spontaneously develops long chain-of-thought patterns, self-reflection, and backtracking — purely from the reward signal on correctness-verifiable tasks (maths, code). This "cold start" RL approach challenges the assumption that supervised demonstration data is required to bootstrap reasoning. OpenAI's o1/o3 series is believed to use a similar RL-based process, though the exact method is undisclosed. The practical implication: in domains where answers can be automatically verified (maths, code execution), RL training can produce substantially stronger reasoning than SFT on human demonstrations alone.

🔧 Parameter-Efficient Fine-tuning (PEFT)

Full fine-tuning updates every parameter in the model. For a 70B model, that means computing and storing gradients for all 70 billion parameters every step, requiring multiple 80GB GPUs just to hold the optimiser states. PEFT methods dramatically reduce the number of trained parameters while achieving comparable results.

LoRA: Low-Rank Adaptation

Hu et al. (2021) observed that the weight updates during fine-tuning have low intrinsic rank — the change ΔW can be well approximated by a low-rank matrix product BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with r ≪ min(d, k).

# Original weight: W (frozen)
# LoRA adds: ΔW = B × A

# Forward pass:
h = W·x + (B·A)·x · (alpha/r)

# Only B and A are trained
# W stays frozen — original knowledge preserved

# r=8: 2×(4096×8) = 65,536 params
# vs full W: 4096×4096 ≈ 16.7M params
# Reduction: 256×

At inference, LoRA can be merged back: W' = W + (alpha/r)·BA, so there is zero extra latency. Multiple LoRA adapters can be swapped at runtime for different task specialisations.
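
Merging is a plain matrix update; a dependency-free sketch with nested lists standing in for weight tensors, including the alpha/r scale used in the forward pass:

```python
def matmul(X, Y):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, B, A, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha/r) * B @ A.
    After merging, inference uses a single dense matrix -> no extra latency."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Unmerging (subtracting the same scaled product) is equally cheap, which is what makes runtime adapter swapping practical.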

QLoRA: Quantisation + LoRA

Dettmers et al. (2023) combined LoRA with 4-bit quantisation of the frozen base model weights. The base model is stored in NF4 (4-bit NormalFloat), consuming ~4GB for a 7B model (vs ~14GB in FP16). LoRA adapters remain in full BF16 precision. Double quantisation further reduces memory by quantising the quantisation constants.

Result: fine-tuning a 65B Llama model on a single 48GB GPU — previously requiring 8× A100s. QLoRA democratised fine-tuning for researchers and practitioners without access to multi-GPU infrastructure.

Method, trainable params, memory vs full fine-tuning, quality, and best use:

  • Full Fine-tuning. Trainable params: 100% (all parameters). Memory: baseline (highest). Quality: best (the upper bound). Best for: when you have the compute budget and want maximum quality.
  • LoRA (r=16). Trainable params: ~0.1–1% of total. Memory: ~3–4× reduction in optimiser states. Quality: near-full quality on most tasks. Best for: multi-GPU fine-tuning; task adaptation; style tuning.
  • QLoRA (4-bit + r=64). Trainable params: ~0.1–1% of total. Memory: ~8–10× total reduction. Quality: slightly below LoRA; still strong. Best for: single-GPU fine-tuning; consumer hardware; rapid prototyping.
  • Prompt Tuning / Prefix Tuning. Trainable params: <0.01% (just soft prompt tokens). Memory: minimal (base model frozen). Quality: decent for some tasks; poor generalisation. Best for: multi-tenant serving, swapping prompt tokens per task without changing the model.

LoRA Is the Default for Local Fine-tuning

The combination of Hugging Face's peft library, trl (for SFT, DPO, and GRPO training), and bitsandbytes (for quantisation) has made QLoRA fine-tuning accessible to anyone with a single GPU with 8GB+ VRAM. A typical workflow: download a model such as Phi-4 14B (or, with a 48GB-class GPU, Llama 3.3 70B), apply 4-bit NF4 quantisation, attach LoRA adapters to the Q and V projection matrices (and optionally K, O, FFN up/gate), train for 1–3 epochs on a domain-specific dataset, optionally merge the adapters back into the weights, and deploy. The entire process requires no custom CUDA code, and models in the 7–14B range run on a consumer RTX 4090. As of 2025, DoRA (Weight-Decomposed Low-Rank Adaptation) is an increasingly popular extension that decomposes weight updates into magnitude and direction components, further improving convergence stability over standard LoRA.
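
The quantisation-plus-adapter setup described above looks roughly like this with the named libraries; treat it as a configuration sketch (the model id and every hyperparameter are illustrative choices, and the APIs should be checked against the current transformers/peft documentation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights with double quantisation (QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",                     # example model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the Q and V projections (K/O/FFN are optional extras)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

From here, a trl trainer (e.g. for SFT or DPO) takes the wrapped model like any other; only the adapter weights receive gradients.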