โฑ 12 min read ๐Ÿ“Š Intermediate ๐Ÿ—“ Updated Jan 2025

🟢 OpenAI GPT Series

OpenAI's GPT (Generative Pre-trained Transformer) series established the decoder-only, autoregressive pre-training paradigm that now dominates the field. Since 2024 OpenAI has split its lineup into two categories: standard models (GPT-4o family) optimised for speed and cost, and reasoning models (o1, o3, o4-mini) that perform extended chain-of-thought before answering, dramatically improving performance on complex tasks.

| Model | Params (est.) | Context Window | Key Capability / Milestone |
|---|---|---|---|
| GPT-1 (2018) | 117M | 512 tokens | First demonstration that unsupervised pre-training on text followed by fine-tuning outperforms task-specific models across NLP benchmarks |
| GPT-2 (2019) | 1.5B | 1,024 tokens | Zero-shot text generation quality that OpenAI initially withheld as "too dangerous"; demonstrated in-context learning; open-sourced after concerns proved manageable |
| GPT-3 (2020) | 175B | 2,048 tokens | Few-shot learning emerges at scale; code generation, complex reasoning; launched the era of LLM-as-a-service APIs; required no task-specific fine-tuning |
| GPT-3.5 / ChatGPT (2022) | ~175B (tuned) | 4,096 tokens | RLHF-aligned conversational assistant; reached 100 million users in 2 months, the fastest product adoption to that date; dramatically improved instruction following |
| GPT-4 (2023) | Undisclosed (MoE) | 8k → 128k tokens | Multimodal (vision + text); passes the bar exam (top 10%), SAT, and medical licensing exams; significant safety improvements; GPT-4 Turbo extended context to 128k |
| GPT-4o (May 2024) | Undisclosed | 128k tokens | "Omni": natively multimodal (text, audio, vision in a single model); faster and cheaper than GPT-4 Turbo; real-time voice mode; current flagship standard API model; supports structured outputs and function calling |
| o1 (Sep 2024) | Undisclosed | 128k tokens | First OpenAI reasoning model; performs extended internal chain-of-thought before responding; PhD-level performance on science benchmarks and AIME competition math; reasoning tokens are hidden from users |
| o3-mini (Jan 2025) | Undisclosed | 200k tokens | Efficient reasoning model with adjustable thinking effort (low/medium/high); outperforms o1 on coding and math at lower cost; strong performance on SWE-bench (software engineering) |
| o3 (Apr 2025) | Undisclosed | 200k tokens | Full reasoning model successor to o1; top scores on the ARC-AGI benchmark; frontier-level performance on complex multi-step reasoning, research-grade science, and advanced coding tasks |
| o4-mini (Apr 2025) | Undisclosed | 200k tokens | Efficient reasoning model with multimodal input (image + text reasoning); best cost-to-capability ratio in OpenAI's lineup for STEM tasks; native tool use including code execution |

Reasoning Models: A New Paradigm

OpenAI's o1/o3/o4 series introduced a fundamentally different inference strategy: the model is trained (via reinforcement learning) to generate a long internal "thinking" scratchpad before producing its final answer. This test-time compute scaling allows the model to explore multiple solution paths, verify intermediate steps, and backtrack from errors, behaviour that emerges from RL training rather than being hand-programmed. The key insight: for hard problems, spending more tokens on reasoning at inference time yields better answers than simply using a larger base model. This trades token cost for quality, with adjustable depth of thinking.
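OpenAI's RL-trained hidden chain-of-thought is proprietary, but the test-time compute trade-off can be illustrated with the simplest open technique in the family, self-consistency sampling: draw several independent solutions and majority-vote. In this sketch the model is a hypothetical stub (`sample_answer` and its probabilities are invented for illustration):

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> int:
    # Hypothetical stand-in for one stochastic model sample: the correct
    # answer (42) appears 70% of the time, two distractors split the rest.
    return rng.choices([42, 41, 43], weights=[0.7, 0.15, 0.15])[0]

def self_consistency(k: int, seed: int = 0) -> int:
    """Spend more inference-time compute: sample k independent solutions
    and return the most common (majority-vote) answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(k))
    return votes.most_common(1)[0][0]

# A single sample can be wrong; a large vote almost never is.
print(self_consistency(k=201))
```

Reasoning models internalise a richer version of this trade-off (exploring and verifying within one long scratchpad), but the cost structure is the same: more tokens per question buys higher reliability.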

GPT-4: Mixture of Experts (Widely Believed)

OpenAI has not officially disclosed GPT-4's architecture. However, multiple credible leaks and analyses suggest GPT-4 is a Mixture of Experts (MoE) model with approximately 8 experts of ~220B parameters each (~1.8T total), activating ~2 experts per token for ~440B active parameters per forward pass. MoE allows a much larger total parameter count (more stored knowledge) while keeping per-token compute similar to a dense 440B model, an efficient trade-off later validated openly by Mistral AI with Mixtral and by DeepSeek with their V3 model.
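The routing idea itself is simple enough to sketch. Below is a minimal top-2 MoE forward pass in pure Python (a toy, not GPT-4's undisclosed implementation): a router scores eight stand-in "experts", only the top two run, and their outputs are mixed with gate weights renormalised over the selected pair.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Eight toy "experts": each is a scalar function standing in for an FFN block.
EXPERTS = [lambda x, k=k: (k + 1) * x for k in range(8)]

def moe_forward(x: float, router_logits: list[float], top_k: int = 2) -> float:
    """Route input x to the top_k highest-scoring experts and mix their
    outputs, renormalising the gate weights over the selected experts."""
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)
    # Only top_k experts execute: compute scales with active, not total, params.
    return sum(gates[i] / norm * EXPERTS[i](x) for i in top)

logits = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.1, 0.4]
print(moe_forward(1.0, logits))  # mixes experts 1 and 3 only
```

Production MoE layers add load-balancing losses and batched expert dispatch, but the capacity-versus-compute split shown here is the core mechanism.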

🦙 Meta Llama Series

Meta's Llama models are the most impactful open-weights LLMs in history. By releasing model weights publicly, Meta enabled a global community of researchers and developers to fine-tune, study, and build on frontier-class models without API costs or usage restrictions.

| Model | Sizes | Context | Licence |
|---|---|---|---|
| Llama 1 (Feb 2023) | 7B, 13B, 33B, 65B | 2,048 tokens | Research only (non-commercial) |
| Llama 2 (Jul 2023) | 7B, 13B, 70B + Chat variants (a 34B model was trained but never released) | 4,096 tokens | Llama 2 Community Licence (commercial use allowed, with restrictions on competing AI services) |
| Code Llama (Aug 2023) | 7B, 13B, 34B, 70B | 16k tokens (stable up to ~100k); supports infilling | Llama 2 Community Licence |
| Llama 3 (Apr 2024) | 8B, 70B + Instruct variants | 8,192 tokens | Llama 3 Community Licence (broadly permissive) |
| Llama 3.1 (Jul 2024) | 8B, 70B, 405B + Instruct variants | 128k tokens | Llama 3.1 Community Licence (permissive; explicitly allows distillation from larger Llamas) |
| Llama 3.2 (Sep 2024) | 1B, 3B (text); 11B, 90B (vision) | 128k tokens | Llama 3.2 Community Licence |
| Llama 3.3 (Dec 2024) | 70B | 128k tokens | Llama 3.3 Community Licence (broadly permissive) |

Architecture Advances: Llama 3

Llama 3 introduced several key improvements over Llama 2: a 128k-vocabulary byte-level BPE tokeniser (vs 32k), Grouped-Query Attention (GQA) in both the 8B and 70B models, and a 15T+ token training corpus, far beyond Chinchilla-optimal, to produce a better inference-time model. The result: Llama 3 8B outperforms Llama 2 70B on most benchmarks despite having ~9× fewer parameters. Llama 3.3 70B further closes the gap with 405B-level performance through improved post-training.
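The practical payoff of GQA is a smaller KV cache: 32 query heads share only 8 key/value heads, so the cached K/V tensors shrink 4×. A back-of-envelope calculation using Llama 3 8B's published shapes (32 layers, head dimension 128, FP16):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim values per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama-3-8B-like shapes: 32 layers, head_dim 128, FP16 cache.
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)  # no sharing
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)   # 4 Q heads per KV head
print(mha, gqa, mha / gqa)  # 524288 131072 4.0  (512 KiB vs 128 KiB per token)
```

At 8k context that 4× saving is the difference between ~4 GiB and ~1 GiB of cache per sequence, which is why GQA matters most for long-context serving.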


Impact on the Open-Source Ecosystem

Llama releases triggered an explosion of derivative models: Alpaca, Vicuna, WizardLM, and thousands of community fine-tunes on Hugging Face, while Mistral's early models adopted a Llama-style architecture. Frameworks like Ollama, llama.cpp, and vLLM were built specifically around Llama-compatible architectures, enabling efficient deployment on anything from a Raspberry Pi to an 8×H100 cluster. Llama 3.3 70B delivers near-405B quality at a fraction of the inference cost.


Llama 3.1 405B and Llama 3.3 70B Rival GPT-4 Class

Llama 3.1 405B achieves near-parity with GPT-4 Turbo and Claude 3 Opus on major benchmarks: MMLU (87.3%), HumanEval (89.0%), MT-Bench (9.1). For the first time, a fully open-weights model competes with closed frontier models across general reasoning, coding, and multilingual tasks. Llama 3.3 70B narrows this further: it matches most 405B benchmark scores using improved instruction tuning and alignment, while requiring ~6× less GPU memory to serve. This has profound implications for on-premise deployment, data privacy, and organisations that cannot send data to third-party APIs.

🔷 Google Models

Google has produced multiple model families across different size and deployment targets, from research models (PaLM) to production APIs (Gemini) to open-weights releases (Gemma). Google's key differentiators are multimodal capability, extremely long context, and rapid iteration: the Gemini 2.x series delivered major quality and speed improvements within months of the 1.5 generation.

| Model | Size / Variants | Multimodal | Use Case |
|---|---|---|---|
| PaLM 2 (2023) | Gecko, Otter, Bison, Unicorn (undisclosed sizes) | Text only | Multilingual reasoning; backed Google Workspace AI features (Duet AI); 100+ language support |
| Gemini 1.0 Ultra/Pro/Nano (2023–2024) | Ultra (undisclosed), Pro (mid-size), Nano (1.8B–3.25B) | Text, image, audio, video | Ultra: frontier tasks; Pro: API and Workspace; Nano: on-device (Pixel phones) |
| Gemini 1.5 Pro / Flash (2024) | MoE architecture, ~1T total params (est.) | Text, image, audio, video, PDF, code | 1M-token context window; long-document analysis; video understanding (1-hour+ videos); Pro for quality, Flash for speed/cost |
| Gemini 2.0 Flash (Feb 2025) | Undisclosed (efficient MoE) | Text, image, audio, video, code; native tool use | General-purpose workhorse: 2× faster than 1.5 Flash with higher quality; 1M-token context; multimodal output (can generate images); native agentic capabilities with built-in tool use; default model in the Gemini API |
| Gemini 2.5 Pro (Mar 2025) | Undisclosed | Text, image, audio, video, code | Extended thinking / reasoning model; 1M+ token context; top scores on coding benchmarks (SWE-bench); frontier-level multimodal reasoning; designed for complex agentic tasks and long-context document work |
| Gemma 2 (2024) / Gemma 3 (Mar 2025) | Gemma 2: 2B, 9B, 27B; Gemma 3: 1B, 4B, 12B, 27B | Gemma 3: text + image (vision); Gemma 2: text only | Open weights; fine-tuning; local deployment; Gemma 3 27B competitive with models 3× its size; Gemma 3 supports 128k context; multilingual across 140+ languages |

Gemini's 1M+ Token Context Window

Gemini 1.5 Pro introduced the first production 1M-token context window; the Gemini 2.0 and 2.5 models maintain it. 1,000,000 tokens is roughly 750,000 words, several long novels' worth of text. This is made possible by a Mixture of Experts architecture with efficient attention and Google's custom TPU infrastructure. Demonstrated capabilities include analysing hour-long videos, processing complete codebases in-context, and reasoning over the full Apollo 11 mission transcript. The implication: for many retrieval tasks, you can replace a complex RAG pipeline with direct in-context document loading. Gemini 2.5 Pro extends this further with reasoning capabilities on top of long context.
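A rough capacity check makes the RAG-versus-context decision concrete. The ~4 characters per token figure is a common heuristic for English text, and the function and its `reserve` parameter are illustrative, not part of any SDK:

```python
def fits_in_context(doc_chars: int, context_tokens: int = 1_000_000,
                    chars_per_token: float = 4.0, reserve: int = 8_192) -> bool:
    """Rough check: can a document go straight into the prompt instead of
    through a RAG pipeline? `reserve` leaves headroom for the instructions
    and the model's answer."""
    est_tokens = doc_chars / chars_per_token
    return est_tokens <= context_tokens - reserve

# A ~3M-character corpus (~750k tokens):
print(fits_in_context(3_000_000))                          # fits a 1M window
print(fits_in_context(3_000_000, context_tokens=128_000))  # needs RAG/chunking
```

For precise budgeting you would use the provider's own token counter rather than a character heuristic, since tokenisation varies by model and language.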

🟣 Anthropic Claude Series

Anthropic's Claude models are trained with Constitutional AI (CAI) and RLHF-based safety techniques, producing models that are notably strong at long-document analysis, instruction following, coding, and nuanced writing. Claude's large context windows (up to 200k tokens) and low hallucination rates have made it a popular choice for enterprise document workflows.

| Model | Context Window | Key Capabilities | Position |
|---|---|---|---|
| Claude 3 Haiku (Mar 2024) | 200k tokens | Fastest and cheapest Claude 3; strong on classification, summarisation, and simple instruction following; near-instant responses | Cost-optimised / high-throughput |
| Claude 3 Sonnet (Mar 2024) | 200k tokens | Balanced speed and capability; strong coding and analysis; multimodal (image input); replaced by 3.5 Sonnet as the default | Mid-tier (superseded) |
| Claude 3 Opus (Mar 2024) | 200k tokens | Highest capability of the Claude 3 generation; top on complex reasoning and nuanced tasks; slow and expensive; multimodal | Frontier (Claude 3 generation) |
| Claude 3.5 Sonnet (Jun / Oct 2024) | 200k tokens | Surpassed Opus on most benchmarks at Sonnet pricing; exceptional coding (top on SWE-bench Verified); strong at multi-step agentic tasks; "computer use" capability (the Oct 2024 version can control a desktop UI); multimodal | Flagship / recommended default |
| Claude 3.5 Haiku (Nov 2024) | 200k tokens | Fastest in the 3.5 generation; stronger than Claude 3 Sonnet at a fraction of the cost; strong coding for a small model; tool use and vision capable | Fast / cost-optimised |
| Claude 3.7 Sonnet (Feb 2025) | 200k tokens | Hybrid reasoning model: standard response mode plus an optional "extended thinking" mode in which the model generates a long CoT scratchpad before answering; strongest Anthropic model; top SWE-bench scores; excellent at multi-step agentic coding workflows | Current flagship (2025) |

Claude 3.7 Extended Thinking

Claude 3.7 Sonnet introduced Anthropic's first extended thinking mode, analogous to OpenAI's o1 reasoning tokens. When enabled (via the API with a thinking parameter and a budget_tokens limit), Claude generates an internal CoT scratchpad before producing its response. The thinking content is visible to the user (unlike OpenAI's hidden reasoning tokens), providing transparency into the model's reasoning process. Extended thinking significantly improves performance on competition math, complex coding, and multi-step scientific reasoning, with quality scaling with the thinking token budget. This positions Claude 3.7 as Anthropic's answer to the reasoning model category pioneered by OpenAI's o1.
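A sketch of what such a request looks like, built as a plain dict with no network call. The field names follow the `thinking` / `budget_tokens` parameters described above, but verify against Anthropic's current API reference before relying on them:

```python
# Request body for Anthropic's Messages API with extended thinking enabled.
# Constructed locally (no API call); the model ID and exact field names
# should be checked against the current Anthropic documentation.
request = {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 16_000,          # must exceed the thinking budget
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8_000,    # cap on internal reasoning tokens
    },
    "messages": [
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
}
print(request["thinking"]["budget_tokens"])
```

A larger `budget_tokens` buys deeper reasoning at higher cost and latency, mirroring the adjustable thinking effort of OpenAI's o3-mini.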

⚡ Specialised & Open Models

Beyond the major model families, a rich ecosystem of specialised and community models pushes the frontier in efficiency, coding, reasoning, and task-specific performance. The 2024–2025 period saw dramatic advances from Chinese AI labs (DeepSeek, Alibaba) and continued refinement from Mistral and Microsoft; many of these models punch well above their parameter count.

| Model | Organisation | Specialty | Licence |
|---|---|---|---|
| Mistral Small 3 (Jan 2025) | Mistral AI | 24B params; state-of-the-art for its size class; strong instruction following and coding; low latency; designed as a cost-efficient alternative to larger models for production workloads | Apache 2.0 (fully open) |
| Mixtral 8×22B | Mistral AI | Sparse MoE: 141B total params, ~39B active per token; 64k context; strong multilingual; outperforms Llama 2 70B on most benchmarks; native function calling | Apache 2.0 |
| Phi-4 (Dec 2024) | Microsoft Research | 14B params; trained heavily on synthetic "textbook-quality" data; outperforms larger open models on STEM and reasoning benchmarks; strong on MMLU, MATH, HumanEval; designed for on-device and edge inference | MIT Licence |
| DeepSeek-V3 (Dec 2024) | DeepSeek AI | MoE: 671B total / 37B active params; trained at remarkably low cost ($5.6M reported); Multi-head Latent Attention (MLA) for KV-cache compression; FP8 mixed-precision training; rivals GPT-4o on many benchmarks with openly available weights | DeepSeek Model Licence (open weights, research & commercial use) |
| DeepSeek-R1 (Jan 2025) | DeepSeek AI | Reasoning model trained with GRPO (Group Relative Policy Optimisation); its R1-Zero variant showed reasoning can emerge from RL alone, without SFT warm-up; 671B MoE base; rivals o1 on AIME math, Codeforces, and GPQA science; fully open weights; distilled versions (1.5B–70B) retain strong reasoning at small scale | MIT Licence (fully open) |
| Qwen2.5 (Sep 2024) | Alibaba Cloud (Qwen Team) | Dense models from 0.5B to 72B; strong multilingual (29+ languages); 128k context; Qwen2.5-72B competitive with Llama 3.1 405B on many benchmarks; Qwen2.5-Coder-32B competitive with GPT-4o on HumanEval; Qwen2.5-Math specialised for mathematical reasoning | Apache 2.0 (most sizes; 3B and 72B use the Qwen licence) |

DeepSeek: The Open-Source Inflection Point

DeepSeek-V3 and DeepSeek-R1 (released December 2024 and January 2025) caused significant industry disruption. DeepSeek reported training V3 for approximately $5.6M, roughly 50–100× less than comparable closed frontier models, through aggressive efficiency innovations: FP8 mixed-precision, MLA attention, auxiliary-loss-free load balancing, and a highly optimised training pipeline on H800 GPUs. DeepSeek-R1's precursor, R1-Zero, further demonstrated that reinforcement learning alone (without a supervised fine-tuning warm-up) can produce emergent reasoning behaviour, challenging assumptions about what training signals are necessary for reasoning capability; R1 itself adds a small cold-start SFT stage for readability. Both models ship fully open weights, with R1 under the MIT Licence, setting a new bar for what the open-source community can achieve.
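The group-relative part of GRPO is simple enough to sketch: sample a group of completions for one prompt, score them, and use each reward's z-score within the group as its advantage, removing the separate value network PPO needs. This is a simplified fragment of the idea; the full objective also includes a clipped policy ratio and a KL penalty:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core of GRPO's advantage estimate: score each sampled completion
    relative to its group's mean and standard deviation (simplified; no
    clipping or KL term shown)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]: correct samples are reinforced, wrong ones penalised
```

Because the baseline comes from the group itself, a verifiable reward (did the math answer check out? did the code pass tests?) is all the training signal required.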

Model Size Does Not Equal Capability

Phi-4 (14B parameters) outperforms many 70B open-source models on reasoning and STEM benchmarks. Microsoft's insight: instead of scaling parameters, scale data quality; training on curated, synthetic "textbook-quality" data teaches the model to reason well within a small parameter budget. Similarly, DeepSeek-R1's distilled 70B version matches or exceeds much larger models on competition math problems. The lesson: parameter count is a proxy for capability, not a guarantee. Always benchmark on your specific task before choosing a model size.

🎯 Choosing the Right Model

With dozens of competitive models available, the choice should be driven by your specific requirements: data privacy, latency, cost, task type, and available hardware. This decision table summarises common use cases.

| Use Case | Recommended Model(s) | Why |
|---|---|---|
| Local / private deployment (no data leaves your infrastructure) | Llama 3.3 70B, Llama 3.1 8B, Mistral Small 3 (24B), Gemma 3 27B | Open weights with permissive licences; run with Ollama or vLLM; no API calls; GDPR-friendly; llama.cpp enables CPU inference; Llama 3.3 70B delivers near-frontier quality locally |
| Code generation & completion | GPT-4o, o4-mini, DeepSeek-V3, Qwen2.5-Coder-32B | GPT-4o and o4-mini lead on complex coding tasks; DeepSeek-V3 rivals them open-source at very low API cost; Qwen2.5-Coder-32B is the best open-weight coding specialist |
| Complex reasoning / math / science | o3, o4-mini, Gemini 2.5 Pro, DeepSeek-R1 | Reasoning models dominate here; o3 and Gemini 2.5 Pro lead closed rankings; DeepSeek-R1 is the fully open-weight alternative; all use extended chain-of-thought at inference |
| Long document analysis (contracts, codebases, research papers) | Gemini 2.0 Flash / 2.5 Pro (1M+ context), Claude 3.7 Sonnet (200k), GPT-4o (128k) | Context window is the primary constraint; Gemini 2.x models lead with 1M tokens; Claude 3.7 at 200k is strong for legal/document work; GPT-4o at 128k covers most cases |
| Cost-sensitive production API | Gemini 2.0 Flash, GPT-4o mini, DeepSeek-V3 API, Mistral Small 3 (self-hosted) | Gemini 2.0 Flash is extremely cheap with frontier-class quality; the DeepSeek API costs a fraction of OpenAI pricing; self-hosted Mistral Small 3 has near-zero marginal cost at 24B scale |
| Best overall quality (no budget constraint) | o3, Gemini 2.5 Pro, Claude 3.7 Sonnet | All three consistently top LMSYS Chatbot Arena and major 2025 benchmarks; the choice depends on task: o3 for reasoning/code, Gemini 2.5 Pro for long context + multimodal, Claude 3.7 for writing and instruction following |
| On-device / edge inference (phone, laptop, IoT) | Phi-4 (14B, quantised), Gemma 3 4B, Llama 3.2 1B/3B, Qwen2.5 1.5B | Quantised to 4-bit, these need roughly 1–8GB; Phi-4 at 4-bit delivers remarkable reasoning quality for its size; Gemma 3 4B includes vision; Llama 3.2 1B is specifically designed for edge deployment |
| Multilingual applications | Qwen2.5 72B, Gemini 2.0 Flash, Gemma 3 27B (140+ languages) | Qwen2.5 is optimised for Chinese/English and 29+ languages; Gemma 3 explicitly supports 140+ languages; Gemini 2.0 Flash offers strong multilingual performance at very low cost |
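The on-device row above comes down to weight-memory arithmetic. A rough estimator, assuming ~10% overhead for embeddings kept at higher precision, quantisation scales, and runtime buffers (the 10% figure is an assumption, and the KV cache, which grows with context length, is excluded):

```python
def weight_memory_gb(params_b: float, bits: int, overhead: float = 1.1) -> float:
    """Approximate memory for model weights alone: parameter count
    (in billions) times bits per weight, plus a fixed overhead factor."""
    return params_b * 1e9 * bits / 8 / 1e9 * overhead

print(round(weight_memory_gb(14, bits=4), 1))   # Phi-4 at 4-bit: ~7.7 GB
print(round(weight_memory_gb(14, bits=16), 1))  # Phi-4 at FP16: ~30.8 GB
print(round(weight_memory_gb(3, bits=4), 1))    # Llama 3.2 3B at 4-bit: ~1.7 GB
```

This is why 4-bit quantisation is the default for laptops and phones: it cuts weight memory ~4× versus FP16 at a modest quality cost.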

On Benchmarks and Their Limitations

Model selection is often guided by benchmark scores, but these have important caveats:

  • MMLU (Massive Multitask Language Understanding): 57 academic subjects; tests factual recall. Limitation: multiple-choice format doesn't capture generation quality or instruction following.
  • HumanEval: 164 Python programming problems; measures code correctness. Limitation: small, public set; models may have trained on solutions; doesn't test complex multi-file projects.
  • MT-Bench: GPT-4-judged multi-turn conversation quality. Limitation: GPT-4 as judge has biases; may favour similar writing styles.
  • LMSYS Chatbot Arena: Human preference ratings via blind pairwise comparisons; currently the most reliable real-world quality signal because it uses actual humans on diverse open-ended tasks.
  • Contamination risk: benchmark test sets may be present in training data, inflating scores. New "contamination-free" benchmarks (LiveBench, MMLU-Pro) are addressing this.

Bottom line: always evaluate on a sample of your specific task data rather than relying solely on public benchmarks. A model that ranks third on MMLU may rank first on your domain-specific task.
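That evaluation loop can be as small as a few lines. Below is a minimal sketch with a hypothetical stub model and exact-match scoring; in practice you would swap in a real API or local-inference call and a task-appropriate scorer (unit tests for code, a rubric-based judge for prose):

```python
def evaluate(model_fn, dataset: list[tuple[str, str]]) -> float:
    """Tiny task-specific eval: fraction of prompts where the model's
    answer exactly matches the expected one."""
    hits = sum(model_fn(prompt) == expected for prompt, expected in dataset)
    return hits / len(dataset)

# Hypothetical stub model, for illustration only.
def toy_model(prompt: str) -> str:
    return "4" if "2+2" in prompt else "?"

dataset = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
print(evaluate(toy_model, dataset))  # 0.5
```

Even 50–100 held-out examples from your own domain, scored this way, give a more decision-relevant ranking than a public leaderboard.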

The MoE Revolution

Mixture of Experts models (Mixtral, GPT-4, Gemini 1.5/2.x, DeepSeek-V3) have changed the efficiency frontier. By activating only a fraction of parameters per token, MoE models can hold many times more total parameters than a dense model while keeping inference compute similar. DeepSeek-V3 (671B total / 37B active, an ~18× ratio) demonstrates this clearly: near-GPT-4o quality while being far cheaper to serve than a dense model of comparable benchmark performance.


The Open vs Closed Trade-off

Closed models (GPT-4o, o3, Claude 3.7, Gemini 2.5) generally lead on the highest-capability tasks but require trusting a third party with your data, carry ongoing API costs, and can change without notice. Open models (Llama 3.3, DeepSeek-V3/R1, Mistral Small 3, Gemma 3) dramatically narrowed the gap in 2024–2025; DeepSeek-R1 openly matches o1 on reasoning benchmarks. For many production use cases, open-weight models now deliver acceptable quality with full control, reproducibility, fine-tuning freedom, and on-premise deployment.
