OpenAI GPT Series
OpenAI's GPT (Generative Pre-trained Transformer) series established the decoder-only, autoregressive pre-training paradigm that now dominates the field. Since 2024 OpenAI has split its lineup into two categories: standard models (GPT-4o family) optimised for speed and cost, and reasoning models (o1, o3, o4-mini) that perform extended chain-of-thought before answering, dramatically improving performance on complex tasks.
| Model | Params (est.) | Context Window | Key Capability / Milestone |
|---|---|---|---|
| GPT-1 (2018) | 117M | 512 tokens | First demonstration that unsupervised pre-training on text followed by fine-tuning outperforms task-specific models across NLP benchmarks |
| GPT-2 (2019) | 1.5B | 1,024 tokens | Zero-shot text generation quality that OpenAI initially withheld as "too dangerous"; demonstrated in-context learning; open-sourced after concerns proved manageable |
| GPT-3 (2020) | 175B | 2,048 tokens | Few-shot learning emerges at scale; code generation, complex reasoning; launched the era of LLM-as-a-service APIs; required no task-specific fine-tuning |
| GPT-3.5 / ChatGPT (2022) | ~175B (tuned) | 4,096 tokens | RLHF-aligned conversational assistant; reached 100 million users in 2 months, the fastest product adoption in history; instruction following dramatically improved |
| GPT-4 (2023) | Undisclosed (MoE) | 8k–128k tokens | Multimodal (vision + text); passes the bar exam (top 10%), SAT, medical licensing exams; significant safety improvements; GPT-4 Turbo extends context to 128k |
| GPT-4o (May 2024) | Undisclosed | 128k tokens | "Omni": natively multimodal (text, audio, vision in a single model); faster and cheaper than GPT-4 Turbo; real-time voice mode; current flagship API model; supports structured outputs and function calling |
| o1 (Sep 2024) | Undisclosed | 128k tokens | First OpenAI reasoning model; performs extended internal chain-of-thought before responding; PhD-level performance on science benchmarks and AIME competition math; reasoning tokens are hidden from users |
| o3-mini (Jan 2025) | Undisclosed | 200k tokens | Efficient reasoning model with adjustable thinking effort (low/medium/high); outperforms o1 on coding and math at lower cost; strong performance on SWE-bench (software engineering) |
| o3 (Apr 2025) | Undisclosed | 200k tokens | Full reasoning model successor to o1; top scores on ARC-AGI benchmark; frontier-level performance on complex multi-step reasoning, research-grade science, and advanced coding tasks |
| o4-mini (Apr 2025) | Undisclosed | 200k tokens | Efficient reasoning model with multimodal input (image + text reasoning); best cost-to-capability ratio in OpenAI's lineup for STEM tasks; native tool use including code execution |
Reasoning Models: A New Paradigm
OpenAI's o1/o3/o4 series introduced a fundamentally different inference strategy: the model is trained (via reinforcement learning) to generate a long internal "thinking" scratchpad before producing its final answer. This test-time compute scaling allows the model to explore multiple solution paths, verify intermediate steps, and backtrack from errors, behaviour that emerges from RL training rather than being hand-programmed. The key insight: for hard problems, spending more tokens on reasoning at inference time yields better answers than simply using a larger base model. This trades token cost for quality, with adjustable depth of thinking.
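The adjustable depth of thinking is exposed as an API setting. A minimal sketch of assembling such a request, assuming the low/medium/high reasoning-effort levels described for o3-mini; the payload is built but never sent, and the model id is illustrative:

```python
# Sketch: choosing the depth of thinking via a low/medium/high
# reasoning-effort setting. We only assemble the request payload here
# (no API call); the model id is illustrative.

def build_reasoning_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a chat-completions-style payload for a reasoning model."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "o3-mini",          # reasoning model with adjustable effort
        "reasoning_effort": effort,  # higher effort = more hidden thinking tokens
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_reasoning_request("Prove that sqrt(2) is irrational.", effort="high")
```

Higher effort settings raise latency and token cost, so in practice the effort level is chosen per task difficulty rather than fixed globally.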
GPT-4: Mixture of Experts (Widely Believed)
OpenAI has not officially disclosed GPT-4's architecture. However, multiple credible leaks and analyses suggest GPT-4 is a Mixture of Experts (MoE) model with approximately 8 experts of ~220B parameters each (~1.8T total), activating ~2 experts per token for ~440B active parameters per forward pass. MoE allows a much larger total parameter count (more stored knowledge) while keeping per-token compute similar to a dense ~440B model, an efficient trade-off later demonstrated openly by Mistral with Mixtral and by DeepSeek with V3.
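A toy sketch of the top-2 routing idea behind this rumoured architecture, with stand-in scalar "experts"; a real MoE routes each token to learned feed-forward networks via a trained router:

```python
import math

# Toy sketch of top-2 Mixture-of-Experts routing: 8 experts, 2 active per
# token, matching the widely reported GPT-4 recipe. Experts here are
# stand-in scalar functions; a real MoE uses feed-forward networks.

EXPERTS = [(lambda x, k=k: (k + 1) * x) for k in range(8)]  # hypothetical experts

def top2_route(x: float, router_logits: list[float]) -> float:
    """Mix the outputs of the two highest-scoring experts by softmax weight."""
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i])[-2:]
    weights = [math.exp(router_logits[i]) for i in top2]
    total = sum(weights)
    # Only 2 of the 8 experts actually execute, so per-token compute stays
    # low even though total parameters (all 8 experts) are much larger.
    return sum((w / total) * EXPERTS[i](x) for w, i in zip(weights, top2))

logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # router scores per expert
y = top2_route(1.0, logits)  # dominated by the two highest-scoring experts
```

The router is what makes the trade-off work: capacity scales with the number of experts, while latency scales only with the number activated per token.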
Meta Llama Series
Meta's Llama models are the most impactful open-weights LLMs in history. By releasing model weights publicly, Meta enabled a global community of researchers and developers to fine-tune, study, and build on frontier-class models without API costs or usage restrictions.
| Model | Sizes | Context | Licence |
|---|---|---|---|
| Llama 1 (Feb 2023) | 7B, 13B, 33B, 65B | 2,048 tokens | Research only (non-commercial) |
| Llama 2 (Jul 2023) | 7B, 13B, 70B + Chat variants (a 34B was trained but never released) | 4,096 tokens | Llama 2 Community Licence (commercial use allowed, restrictions on competing AI services) |
| Code Llama (Aug 2023) | 7B, 13B, 34B, 70B | 16k tokens trained (stable up to ~100k via extrapolation); supports infilling | Llama 2 Community Licence |
| Llama 3 (Apr 2024) | 8B, 70B + Instruct variants | 8,192 tokens | Llama 3 Community Licence (broadly permissive) |
| Llama 3.1 (Jul 2024) | 8B, 70B, 405B + Instruct variants | 128k tokens | Llama 3.1 Community Licence (permissive; explicitly allows using model outputs to train and distil other models) |
| Llama 3.2 (Sep 2024) | 1B, 3B (text); 11B, 90B (vision) | 128k tokens | Llama 3.2 Community Licence |
| Llama 3.3 (Dec 2024) | 70B | 128k tokens | Llama 3.3 Community Licence (broadly permissive) |
Architecture Advances: Llama 3
Llama 3 introduced several key improvements over Llama 2: a 128k-vocabulary byte-level BPE tokeniser (vs 32k), Grouped-Query Attention (GQA) in both the 8B and 70B models, and a 15T+ token training corpus, far beyond the Chinchilla-optimal point, trading extra training compute for a stronger model at inference time. The result: Llama 3 8B outperforms Llama 2 70B on most benchmarks at a ~9× smaller parameter count. Llama 3.3 70B further closes the gap with 405B performance through improved post-training.
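A back-of-envelope sketch of why GQA matters at serving time, assuming the commonly cited Llama 3 8B configuration (32 query heads sharing 8 KV heads, 32 layers, head dimension 128); the numbers are illustrative rather than authoritative:

```python
# Sketch: how Grouped-Query Attention shrinks the KV cache. Several query
# heads share one key/value head, so only the KV heads need caching.
# Head counts below are the commonly cited Llama 3 8B configuration.

def kv_cache_bytes(seq_len: int, n_kv_heads: int, head_dim: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache keys AND values (factor of 2) for one fp16 sequence."""
    return 2 * seq_len * n_kv_heads * head_dim * n_layers * bytes_per_elem

# Plain multi-head attention would cache all 32 heads; GQA caches only 8.
mha_bytes = kv_cache_bytes(seq_len=8192, n_kv_heads=32, head_dim=128, n_layers=32)
gqa_bytes = kv_cache_bytes(seq_len=8192, n_kv_heads=8, head_dim=128, n_layers=32)
saving = mha_bytes // gqa_bytes  # 4x smaller KV cache at identical sequence length
```

The KV cache, not the weights, often dominates memory at long context and large batch sizes, which is why GQA appears in nearly every recent open model.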
Impact on the Open-Source Ecosystem
Llama releases triggered an explosion of open-model activity: direct fine-tunes such as Alpaca, Vicuna, and WizardLM, Llama-inspired architectures such as Mistral, and thousands of community fine-tunes on Hugging Face. Frameworks like Ollama, llama.cpp, and vLLM were built around Llama-compatible architectures, enabling efficient deployment everywhere from a Raspberry Pi to an 8×H100 cluster. Llama 3.3 70B delivers near-405B quality at a fraction of the inference cost.
Llama 3.1 405B and Llama 3.3 70B Rival GPT-4 Class
Llama 3.1 405B achieves near-parity with GPT-4 Turbo and Claude 3 Opus on major benchmarks: MMLU (87.3%), HumanEval (89.0%), MT-Bench (9.1). For the first time, a fully open-weights model competes with closed frontier models across general reasoning, coding, and multilingual tasks. Llama 3.3 70B narrows this further: it matches most 405B benchmark scores using improved instruction tuning and alignment, while requiring ~6× less GPU memory to serve. This has profound implications for on-premise deployment, data privacy, and organisations that cannot send data to third-party APIs.
Google Models
Google has produced multiple model families across different size and deployment targets: research models (PaLM), production APIs (Gemini), and open-weights releases (Gemma). Google's key differentiators are multimodal capability, extremely long context, and rapid iteration: the Gemini 2.x series delivered major quality and speed improvements within months of the 1.5 generation.
| Model | Size / Variants | Multimodal | Use Case |
|---|---|---|---|
| PaLM 2 (2023) | Gecko, Otter, Bison, Unicorn (undisclosed sizes) | Text only | Multilingual reasoning; backs Google Workspace AI features (Duet AI); 100+ language support |
| Gemini 1.0 Ultra/Pro/Nano (2023–2024) | Ultra (undisclosed), Pro (mid-size), Nano (1.8B–3.25B) | Text, image, audio, video | Ultra: frontier tasks; Pro: API and Workspace; Nano: on-device (Pixel phones) |
| Gemini 1.5 Pro / Flash (2024) | MoE architecture, ~1T total params (est.) | Text, image, audio, video, PDF, code | 1M token context window; long-document analysis; video understanding (1 hour+ videos); Pro for quality, Flash for speed/cost |
| Gemini 2.0 Flash (Feb 2025) | Undisclosed (efficient MoE) | Text, image, audio, video, code; native tool use | General-purpose workhorse: 2× faster than 1.5 Flash with higher quality; 1M token context; multimodal output (can generate images); native agentic capabilities with built-in tool use; default model in the Gemini API |
| Gemini 2.5 Pro (Mar 2025) | Undisclosed | Text, image, audio, video, code | Extended thinking / reasoning model; 1M+ token context; top scores on coding benchmarks (SWE-bench); frontier-level multimodal reasoning; designed for complex agentic tasks and long-context document work |
| Gemma 2 (2024) / Gemma 3 (Mar 2025) | Gemma 2: 2B, 9B, 27B; Gemma 3: 1B, 4B, 12B, 27B | Gemma 3: text + image (vision); Gemma 2: text only | Open weights; fine-tuning; local deployment; Gemma 3 27B competitive with models 3× its size; Gemma 3 supports 128k context; multilingual across 140+ languages |
Gemini's 1M+ Token Context Window
Gemini 1.5 Pro introduced the first production 1M-token context window; the Gemini 2.0 and 2.5 models maintain it. 1,000,000 tokens is roughly 750,000 words, about the length of an entire long novel series. This is made possible by a Mixture of Experts architecture with efficient attention and Google's custom TPU infrastructure. Real-world capabilities: analysing hour-plus videos, processing complete codebases in-context, and reasoning over the full 402-page Apollo 11 mission transcript in a single prompt. The implication: for many retrieval tasks, you can replace a complex RAG pipeline with direct in-context document loading. Gemini 2.5 Pro extends this further with reasoning capabilities on top of long context.
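A rough feasibility check for the "load everything into context" approach, using the common (approximate) four-characters-per-token heuristic for English prose; all numbers are illustrative:

```python
# Rough feasibility check for direct in-context document loading, using
# the ~4-characters-per-token heuristic for English prose. Real token
# counts depend on the tokeniser; this is an estimate only.

def estimated_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(docs: list[str], window: int = 1_000_000,
                    reply_reserve: int = 8_000) -> bool:
    """True if all documents plus a reply budget fit in one context window."""
    return sum(estimated_tokens(d) for d in docs) + reply_reserve <= window

corpus = ["lorem ipsum " * 50_000]  # ~600k characters, roughly 150k tokens
fits = fits_in_context(corpus)      # easily inside a 1M-token window
```

When a corpus fails this check (or cost per call matters), a retrieval step remains the pragmatic fallback; long context and RAG are complements, not strict substitutes.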
Anthropic Claude Series
Anthropic's Claude models are trained with Constitutional AI (CAI) and RLHF-based safety techniques, producing models that are notably strong at long-document analysis, instruction following, coding, and nuanced writing. Claude's large context windows (up to 200k tokens) and low hallucination rates have made it a popular choice for enterprise document workflows.
| Model | Context Window | Key Capabilities | Position |
|---|---|---|---|
| Claude 3 Haiku (Mar 2024) | 200k tokens | Fastest and cheapest Claude 3; strong on classification, summarisation, and simple instruction following; near-instant responses | Cost-optimised / high-throughput |
| Claude 3 Sonnet (Mar 2024) | 200k tokens | Balanced speed and capability; strong coding and analysis; multimodal (image input); replaced by 3.5 Sonnet as the default | Mid-tier (superseded) |
| Claude 3 Opus (Mar 2024) | 200k tokens | Highest capability of Claude 3 generation; top on complex reasoning and nuanced tasks; slow and expensive; multimodal | Frontier (Claude 3 generation) |
| Claude 3.5 Sonnet (Jun / Oct 2024) | 200k tokens | Surpassed Opus on most benchmarks at Sonnet pricing; exceptional coding (top on SWE-bench verified); strong at multi-step agentic tasks; "computer use" capability (Oct 2024 version can control desktop UI); multimodal | Flagship / recommended default |
| Claude 3.5 Haiku (Nov 2024) | 200k tokens | Fastest in the 3.5 generation; stronger than Claude 3 Sonnet at a fraction of the cost; strong coding for a small model; tool use and vision capable | Fast / cost-optimised |
| Claude 3.7 Sonnet (Feb 2025) | 200k tokens | Hybrid reasoning model: standard response mode plus optional "extended thinking" mode where the model generates a long CoT scratchpad before answering; strongest Anthropic model; top SWE-bench scores; excellent at multi-step agentic coding workflows | Current flagship (2025) |
Claude 3.7 Extended Thinking
Claude 3.7 Sonnet introduced Anthropic's first extended thinking mode, analogous to OpenAI's o1 reasoning tokens. When enabled (via the API with a thinking parameter and a budget_tokens limit), Claude generates an internal CoT scratchpad before producing its response. The thinking content is visible to the user (unlike OpenAI's hidden reasoning tokens), providing transparency into the model's reasoning process. Extended thinking significantly improves performance on competition math, complex coding, and multi-step scientific reasoning, with quality scaling with the thinking token budget. This positions Claude 3.7 as Anthropic's answer to the reasoning model category pioneered by OpenAI's o1.
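A minimal sketch of an extended-thinking request, using the thinking parameter and budget_tokens limit described above; the payload is assembled but never sent, and the model id is illustrative:

```python
# Sketch: an extended-thinking request payload using the `thinking`
# parameter and `budget_tokens` limit described above. Assembled only,
# never sent; the model id is illustrative.

def build_thinking_request(prompt: str, budget_tokens: int = 16_000,
                           max_tokens: int = 20_000) -> dict:
    """Payload with extended thinking enabled. The thinking budget must
    leave room inside the overall response budget (max_tokens)."""
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": "claude-3-7-sonnet-latest",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_thinking_request("How many primes are there below 10,000?")
```

Because quality scales with the thinking budget, a sensible pattern is to start with a small budget_tokens value and raise it only for problems the model gets wrong.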
Specialised & Open Models
Beyond the major model families, a rich ecosystem of specialised and community models pushes the frontier in efficiency, coding, reasoning, and task-specific performance. The 2024–2025 period saw dramatic advances from Chinese AI labs (DeepSeek, Alibaba) and continued refinement from Mistral and Microsoft; many of these models punch well above their parameter count.
| Model | Organisation | Specialty | Licence |
|---|---|---|---|
| Mistral Small 3 (Jan 2025) | Mistral AI | 24B params; state-of-the-art for its size class; strong instruction following and coding; low latency; designed as a cost-efficient alternative to larger models for production workloads | Apache 2.0 (fully open) |
| Mixtral 8×22B | Mistral AI | Sparse MoE: 141B total params, ~39B active per token; 64k context; strong multilingual; outperforms Llama 2 70B on most benchmarks; native function calling | Apache 2.0 |
| Phi-4 (Dec 2024) | Microsoft Research | 14B params; trained heavily on synthetic "textbook-quality" data; outperforms larger open models on STEM and reasoning benchmarks; strong on MMLU, MATH, HumanEval; designed for on-device and edge inference | MIT Licence |
| DeepSeek-V3 (Dec 2024) | DeepSeek AI | MoE: 671B total / 37B active params; trained at remarkably low cost ($5.6M reported); Multi-head Latent Attention (MLA) for KV-cache compression; FP8 mixed-precision training; rivals GPT-4o on many benchmarks while remaining open-weight | DeepSeek Licence (open weights, research & commercial use) |
| DeepSeek-R1 (Jan 2025) | DeepSeek AI | Reasoning model trained with GRPO (Group Relative Policy Optimisation); the R1-Zero variant shows reasoning emerging from RL alone, without SFT warm-up, while R1 adds a small cold-start SFT stage; 671B MoE base; rivals o1 on AIME math, Codeforces, and GPQA science; fully open weights; distilled versions (1.5B–70B) retain strong reasoning at small scale | MIT Licence (fully open) |
| Qwen2.5 (Sep 2024) | Alibaba Cloud (Qwen Team) | Dense models from 0.5B to 72B; strong multilingual (29+ languages); 128k context; Qwen2.5-72B competitive with Llama 3.1 405B on many benchmarks; Qwen2.5-Coder-32B competitive with GPT-4o on HumanEval; Qwen2.5-Math specialised for mathematical reasoning | Apache 2.0 for most sizes (the 3B and 72B use a separate Qwen licence) |
DeepSeek: The Open-Source Inflection Point
DeepSeek-V3 and DeepSeek-R1 (released December 2024 and January 2025) caused significant industry disruption. DeepSeek reported training V3 for approximately $5.6M, an order of magnitude or more below the estimated cost of comparable closed frontier models, through aggressive efficiency innovations: FP8 mixed-precision, MLA attention, auxiliary-loss-free load balancing, and a highly optimised training pipeline on H800 GPUs. DeepSeek-R1-Zero further demonstrated that reinforcement learning alone (without a supervised fine-tuning warm-up) can produce emergent reasoning behaviour, challenging assumptions about which training signals are necessary for reasoning capability. Both models are fully open-weight (R1 under the MIT Licence, V3 under DeepSeek's model licence), setting a new bar for what the open-source community can achieve.
Model Size Does Not Equal Capability
Phi-4 (14B parameters) outperforms many 70B open-source models on reasoning and STEM benchmarks. Microsoft's insight: instead of scaling parameters, scale data quality; training on synthetic, curated "textbook-quality" data teaches the model to reason well within a small parameter budget. Similarly, DeepSeek-R1's distilled 70B version matches or exceeds much larger models on competition math problems. The lesson: parameter count is a proxy for capability, not a guarantee. Always benchmark on your specific task before choosing a model size.
Choosing the Right Model
With dozens of competitive models available, the choice should be driven by your specific requirements: data privacy, latency, cost, task type, and available hardware. This decision table summarises common use cases.
| Use Case | Recommended Model(s) | Why |
|---|---|---|
| Local / Private deployment (no data leaves your infrastructure) | Llama 3.3 70B, Llama 3.1 8B, Mistral Small 3 (24B), Gemma 3 27B | Open weights with permissive licences; run with Ollama or vLLM; no API calls; GDPR-friendly; llama.cpp enables CPU inference; Llama 3.3 70B delivers near-frontier quality locally |
| Code generation & completion | GPT-4o, o4-mini, DeepSeek-V3, Qwen2.5-Coder-32B | GPT-4o and o4-mini lead on complex coding tasks; DeepSeek-V3 rivals them open-source at very low API cost; Qwen2.5-Coder-32B is the best open-weight coding specialist |
| Complex reasoning / math / science | o3, o4-mini, Gemini 2.5 Pro, DeepSeek-R1 | Reasoning models dominate here; o3 and Gemini 2.5 Pro lead closed rankings; DeepSeek-R1 is the fully open-weight alternative; all use extended chain-of-thought at inference |
| Long document analysis (contracts, codebases, research papers) | Gemini 2.0 Flash / 2.5 Pro (1M+ context), Claude 3.7 Sonnet (200k), GPT-4o (128k) | Context window is the primary constraint; Gemini 2.x models lead with 1M tokens; Claude 3.7 at 200k is strong for legal/document work; GPT-4o 128k covers most cases |
| Cost-sensitive production API | Gemini 2.0 Flash, GPT-4o mini, DeepSeek-V3 API, Mistral Small 3 (self-hosted) | Gemini 2.0 Flash is extremely cheap with frontier-class quality; DeepSeek API costs a fraction of OpenAI pricing; self-hosted Mistral Small 3 has near-zero marginal cost at 24B scale |
| Best overall quality (no budget constraint) | o3, Gemini 2.5 Pro, Claude 3.7 Sonnet | All three consistently top LMSYS Chatbot Arena and major 2025 benchmarks; the choice depends on task: o3 for reasoning/code, Gemini 2.5 Pro for long context + multimodal, Claude 3.7 for writing and instruction following |
| On-device / edge inference (phone, laptop, IoT) | Phi-4 (14B, quantised), Gemma 3 4B, Llama 3.2 1B/3B, Qwen2.5 1.5B | Quantised to 4-bit, these fit in roughly 0.5–8GB; Phi-4 at 4-bit delivers remarkable reasoning quality for its size; Gemma 3 4B includes vision; Llama 3.2 1B specifically designed for edge deployment |
| Multilingual applications | Qwen2.5 72B, Gemini 2.0 Flash, Gemma 3 27B (140+ languages) | Qwen2.5 optimised for Chinese/English and 29+ languages; Gemma 3 explicitly supports 140+ languages; Gemini 2.0 Flash offers strong multilingual with very low cost |
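The edge-inference row above rests on simple arithmetic: 4-bit quantisation stores each weight in half a byte. A sketch of that estimate (weights only; the KV cache and runtime overhead are ignored, so real requirements run somewhat higher):

```python
# Weight-memory estimate behind the edge-inference row: a 4-bit quantised
# model stores each parameter in half a byte. KV cache and runtime
# overhead are ignored, so real requirements run somewhat higher.

def weight_memory_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate weight storage in GB at the given quantisation width."""
    return params_billion * 1e9 * bits / 8 / 1e9

phi4_gb = weight_memory_gb(14)     # ~7 GB for Phi-4 at 4-bit
llama_1b_gb = weight_memory_gb(1)  # ~0.5 GB for Llama 3.2 1B at 4-bit
gemma_4b_gb = weight_memory_gb(4)  # ~2 GB for Gemma 3 4B at 4-bit
```

The same function at bits=16 shows why quantisation matters: fp16 weights need 4× the memory of a 4-bit build of the same model.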
On Benchmarks and Their Limitations
Model selection is often guided by benchmark scores, but these have important caveats:
- MMLU (Massive Multitask Language Understanding): 57 academic subjects; tests factual recall. Limitation: multiple-choice format doesn't capture generation quality or instruction following.
- HumanEval: 164 Python programming problems; measures code correctness. Limitation: small, well-known set, so models may have trained on solutions; doesn't test complex multi-file projects.
- MT-Bench: GPT-4-judged multi-turn conversation quality. Limitation: GPT-4 as judge has biases; may favour similar writing styles.
- LMSYS Chatbot Arena: Human preference ratings via blind pairwise comparisons; currently the most reliable real-world quality signal because it uses actual humans on diverse open-ended tasks.
- Contamination risk: benchmark test sets may be present in training data, inflating scores. New "contamination-free" benchmarks (LiveBench, MMLU-Pro) are addressing this.
Bottom line: always evaluate on a sample of your specific task data rather than relying solely on public benchmarks. A model that ranks third on MMLU may rank first on your domain-specific task.
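That bottom line can be operationalised with a very small harness: score each candidate on labelled samples from your own task and rank by accuracy. The models below are hypothetical stand-in functions; in practice each would wrap a real API call:

```python
# Tiny "benchmark on your own task" harness: rank candidate models by
# accuracy on labelled samples from your domain. The models here are
# stand-in functions; in practice each would wrap a real API call.

samples = [  # hypothetical domain-specific test cases
    ("Classify sentiment: 'great product'", "positive"),
    ("Classify sentiment: 'total waste of money'", "negative"),
    ("Classify sentiment: 'works fine'", "positive"),
]

def accuracy(model, dataset) -> float:
    """Fraction of samples where the model's answer matches the label."""
    return sum(model(p) == label for p, label in dataset) / len(dataset)

def always_positive(prompt: str) -> str:
    return "positive"  # naive baseline model

def keyword_model(prompt: str) -> str:
    return "negative" if "waste" in prompt else "positive"  # toy heuristic

candidates = {"always_positive": always_positive, "keyword": keyword_model}
ranking = sorted(candidates,
                 key=lambda name: accuracy(candidates[name], samples),
                 reverse=True)
```

Even 50 to 100 representative samples usually separate models more reliably for your use case than their public leaderboard positions.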
The MoE Revolution
Mixture of Experts models (Mixtral, GPT-4, Gemini 1.5/2.x, DeepSeek-V3) have changed the efficiency frontier. By activating only a fraction of parameters per token, MoE models can store many times more total parameters than a dense model while keeping per-token inference compute similar. DeepSeek-V3 (671B total / 37B active) demonstrates this clearly: near-GPT-4o quality while being far cheaper to serve than a dense model of comparable benchmark performance.
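The DeepSeek-V3 figures make the trade-off concrete:

```python
# DeepSeek-V3's reported parameter counts: per-token compute tracks the
# 37B active parameters, while total stored capacity is ~18x larger.

total_params_b = 671   # total parameters, in billions
active_params_b = 37   # parameters activated per token, in billions

capacity_ratio = total_params_b / active_params_b   # ~18x more stored knowledge
active_fraction = active_params_b / total_params_b  # ~5.5% of weights per token
```

In other words, V3 pays roughly the serving cost of a 37B dense model for the knowledge capacity of a 671B one.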
The Open vs Closed Trade-off
Closed models (GPT-4o, o3, Claude 3.7, Gemini 2.5) generally lead on the highest-capability tasks, but they require trusting a third party with your data, carry ongoing API costs, and can change without notice. Open models (Llama 3.3, DeepSeek-V3/R1, Mistral Small 3, Gemma 3) dramatically narrowed the gap in 2024–2025; DeepSeek-R1 matches o1 on several reasoning benchmarks with fully open weights. For many production use cases, open-weight models now deliver acceptable quality with full control, reproducibility, fine-tuning freedom, and on-premise deployment.