OpenAI GPT Series
OpenAI's GPT (Generative Pre-trained Transformer) series established the decoder-only, autoregressive pre-training paradigm that now dominates the field. Since 2024 OpenAI has split its lineup into two categories: standard models (GPT-4o family) optimised for speed and cost, and reasoning models (o1, o3, o4-mini) that perform extended chain-of-thought before answering, dramatically improving performance on complex tasks.
| Model | Params (est.) | Context Window | Key Capability / Milestone |
|---|---|---|---|
| GPT-1 (2018) | 117M | 512 tokens | First demonstration that unsupervised pre-training on text followed by fine-tuning outperforms task-specific models across NLP benchmarks |
| GPT-2 (2019) | 1.5B | 1,024 tokens | Zero-shot text generation quality that OpenAI initially withheld as "too dangerous"; demonstrated in-context learning; open-sourced after concerns proved manageable |
| GPT-3 (2020) | 175B | 2,048 tokens | Few-shot learning emerges at scale; code generation, complex reasoning; launched the era of LLM-as-a-service APIs; required no task-specific fine-tuning |
| GPT-3.5 / ChatGPT (2022) | ~175B (tuned) | 4,096 tokens | RLHF-aligned conversational assistant; reached 100 million users in 2 months, the fastest product adoption in history; instruction following dramatically improved |
| GPT-4 (2023) | Undisclosed (MoE) | 8k–128k tokens | Multimodal (vision + text); passes the bar exam (top 10%), SAT, medical licensing exams; significant safety improvements; GPT-4 Turbo extends context to 128k |
| GPT-4o (May 2024) | Undisclosed | 128k tokens | "Omni": natively multimodal (text, audio, vision in a single model); faster and cheaper than GPT-4 Turbo; real-time voice mode; current flagship API model; supports structured outputs and function calling |
| o1 (Sep 2024) | Undisclosed | 128k tokens | First OpenAI reasoning model; performs extended internal chain-of-thought before responding; PhD-level performance on science benchmarks and AIME competition math; reasoning tokens are hidden from users |
| o3-mini (Jan 2025) | Undisclosed | 200k tokens | Efficient reasoning model with adjustable thinking effort (low/medium/high); outperforms o1 on coding and math at lower cost; strong performance on SWE-bench (software engineering) |
| o3 (Apr 2025) | Undisclosed | 200k tokens | Full reasoning model successor to o1; top scores on ARC-AGI benchmark; frontier-level performance on complex multi-step reasoning, research-grade science, and advanced coding tasks |
| o4-mini (Apr 2025) | Undisclosed | 200k tokens | Efficient reasoning model with multimodal input (image + text reasoning); best cost-to-capability ratio in OpenAI's lineup for STEM tasks; native tool use including code execution |
Reasoning Models: A New Paradigm
OpenAI's o1/o3/o4 series introduced a fundamentally different inference strategy: the model is trained (via reinforcement learning) to generate a long internal "thinking" scratchpad before producing its final answer. This test-time compute scaling allows the model to explore multiple solution paths, verify intermediate steps, and backtrack from errors, behaviour that emerges from RL training rather than being hand-programmed. The key insight: for hard problems, spending more tokens on reasoning at inference time yields better answers than simply using a larger base model. This trades token cost for quality, with adjustable depth of thinking.
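The adjustable depth of thinking is exposed as an API setting. A minimal sketch of assembling such a request, assuming the low/medium/high reasoning-effort levels described for o3-mini; the payload is built but never sent, and the model id is illustrative:

```python
# Sketch: choosing the depth of thinking via a low/medium/high
# reasoning-effort setting. We only assemble the request payload here
# (no API call); the model id is illustrative.

def build_reasoning_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a chat-completions-style payload for a reasoning model."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "o3-mini",          # reasoning model with adjustable effort
        "reasoning_effort": effort,  # higher effort = more hidden thinking tokens
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_reasoning_request("Prove that sqrt(2) is irrational.", effort="high")
```

Higher effort settings raise latency and token cost, so in practice the effort level is chosen per task difficulty rather than fixed globally.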
GPT-4: Mixture of Experts (Widely Believed)
OpenAI has not officially disclosed GPT-4's architecture. However, multiple credible leaks and analyses suggest GPT-4 is a Mixture of Experts (MoE) model with approximately 8 experts of ~220B parameters each (~1.8T total), activating ~2 experts per token for ~440B active parameters per forward pass. MoE allows a much larger total parameter count (more stored knowledge) while keeping per-token compute similar to a dense ~440B model, an efficient trade-off later demonstrated openly by Mistral with Mixtral and by DeepSeek with V3.
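A toy sketch of the top-2 routing idea behind this rumoured architecture, with stand-in scalar "experts"; a real MoE routes each token to learned feed-forward networks via a trained router:

```python
import math

# Toy sketch of top-2 Mixture-of-Experts routing: 8 experts, 2 active per
# token, matching the widely reported GPT-4 recipe. Experts here are
# stand-in scalar functions; a real MoE uses feed-forward networks.

EXPERTS = [(lambda x, k=k: (k + 1) * x) for k in range(8)]  # hypothetical experts

def top2_route(x: float, router_logits: list[float]) -> float:
    """Mix the outputs of the two highest-scoring experts by softmax weight."""
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i])[-2:]
    weights = [math.exp(router_logits[i]) for i in top2]
    total = sum(weights)
    # Only 2 of the 8 experts actually execute, so per-token compute stays
    # low even though total parameters (all 8 experts) are much larger.
    return sum((w / total) * EXPERTS[i](x) for w, i in zip(weights, top2))

logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # router scores per expert
y = top2_route(1.0, logits)  # dominated by the two highest-scoring experts
```

The router is what makes the trade-off work: capacity scales with the number of experts, while latency scales only with the number activated per token.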
Meta Llama Series
Meta's Llama models are the most impactful open-weights LLMs in history. By releasing model weights publicly, Meta enabled a global community of researchers and developers to fine-tune, study, and build on frontier-class models without API costs or usage restrictions.
| Model | Sizes | Context | Licence |
|---|---|---|---|
| Llama 1 (Feb 2023) | 7B, 13B, 33B, 65B | 2,048 tokens | Research only (non-commercial) |
| Llama 2 (Jul 2023) | 7B, 13B, 70B + Chat variants (a 34B was trained but never released) | 4,096 tokens | Llama 2 Community Licence (commercial use allowed, restrictions on competing AI services) |
| Code Llama (Aug 2023) | 7B, 13B, 34B, 70B | 16k tokens trained (stable up to ~100k via extrapolation); supports infilling | Llama 2 Community Licence |
| Llama 3 (Apr 2024) | 8B, 70B + Instruct variants | 8,192 tokens | Llama 3 Community Licence (broadly permissive) |
| Llama 3.1 (Jul 2024) | 8B, 70B, 405B + Instruct variants | 128k tokens | Llama 3.1 Community Licence (permissive; explicitly allows using model outputs to train and distil other models) |
| Llama 3.2 (Sep 2024) | 1B, 3B (text); 11B, 90B (vision) | 128k tokens | Llama 3.2 Community Licence |
| Llama 3.3 (Dec 2024) | 70B | 128k tokens | Llama 3.3 Community Licence (broadly permissive) |
Architecture Advances: Llama 3
Llama 3 introduced several key improvements over Llama 2: a 128k-vocabulary byte-level BPE tokeniser (vs 32k), Grouped-Query Attention (GQA) in both the 8B and 70B models, and a 15T+ token training corpus, far beyond the Chinchilla-optimal point, trading extra training compute for a stronger model at inference time. The result: Llama 3 8B outperforms Llama 2 70B on most benchmarks at a ~9× smaller parameter count. Llama 3.3 70B further closes the gap with 405B performance through improved post-training.
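A back-of-envelope sketch of why GQA matters at serving time, assuming the commonly cited Llama 3 8B configuration (32 query heads sharing 8 KV heads, 32 layers, head dimension 128); the numbers are illustrative rather than authoritative:

```python
# Sketch: how Grouped-Query Attention shrinks the KV cache. Several query
# heads share one key/value head, so only the KV heads need caching.
# Head counts below are the commonly cited Llama 3 8B configuration.

def kv_cache_bytes(seq_len: int, n_kv_heads: int, head_dim: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache keys AND values (factor of 2) for one fp16 sequence."""
    return 2 * seq_len * n_kv_heads * head_dim * n_layers * bytes_per_elem

# Plain multi-head attention would cache all 32 heads; GQA caches only 8.
mha_bytes = kv_cache_bytes(seq_len=8192, n_kv_heads=32, head_dim=128, n_layers=32)
gqa_bytes = kv_cache_bytes(seq_len=8192, n_kv_heads=8, head_dim=128, n_layers=32)
saving = mha_bytes // gqa_bytes  # 4x smaller KV cache at identical sequence length
```

The KV cache, not the weights, often dominates memory at long context and large batch sizes, which is why GQA appears in nearly every recent open model.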
Impact on the Open-Source Ecosystem
Llama releases triggered an explosion of open-model activity: direct fine-tunes such as Alpaca, Vicuna, and WizardLM, Llama-inspired architectures such as Mistral, and thousands of community fine-tunes on Hugging Face. Frameworks like Ollama, llama.cpp, and vLLM were built around Llama-compatible architectures, enabling efficient deployment everywhere from a Raspberry Pi to an 8×H100 cluster. Llama 3.3 70B delivers near-405B quality at a fraction of the inference cost.
Llama 3.1 405B and Llama 3.3 70B Rival GPT-4 Class
Llama 3.1 405B achieves near-parity with GPT-4 Turbo and Claude 3 Opus on major benchmarks: MMLU (87.3%), HumanEval (89.0%), MT-Bench (9.1). For the first time, a fully open-weights model competes with closed frontier models across general reasoning, coding, and multilingual tasks. Llama 3.3 70B narrows this further: it matches most 405B benchmark scores using improved instruction tuning and alignment, while requiring ~6× less GPU memory to serve. This has profound implications for on-premise deployment, data privacy, and organisations that cannot send data to third-party APIs.
Google Models
Google has produced multiple model families across different size and deployment targets: research models (PaLM), production APIs (Gemini), and open-weights releases (Gemma). Google's key differentiators are multimodal capability, extremely long context, and rapid iteration: the Gemini 2.x series delivered major quality and speed improvements within months of the 1.5 generation.
| Model | Size / Variants | Multimodal | Use Case |
|---|---|---|---|
| PaLM 2 (2023) | Gecko, Otter, Bison, Unicorn (undisclosed sizes) | Text only | Multilingual reasoning; backs Google Workspace AI features (Duet AI); 100+ language support |
| Gemini 1.0 Ultra/Pro/Nano (2023–2024) | Ultra (undisclosed), Pro (mid-size), Nano (1.8B–3.25B) | Text, image, audio, video | Ultra: frontier tasks; Pro: API and Workspace; Nano: on-device (Pixel phones) |
| Gemini 1.5 Pro / Flash (2024) | MoE architecture, ~1T total params (est.) | Text, image, audio, video, PDF, code | 1M token context window; long-document analysis; video understanding (1 hour+ videos); Pro for quality, Flash for speed/cost |
| Gemini 2.0 Flash (Feb 2025) | Undisclosed (efficient MoE) | Text, image, audio, video, code; native tool use | General-purpose workhorse: 2× faster than 1.5 Flash with higher quality; 1M token context; multimodal output (can generate images); native agentic capabilities with built-in tool use; default model in the Gemini API |
| Gemini 2.5 Pro (Mar 2025) | Undisclosed | Text, image, audio, video, code | Extended thinking / reasoning model; 1M+ token context; top scores on coding benchmarks (SWE-bench); frontier-level multimodal reasoning; designed for complex agentic tasks and long-context document work |
| Gemma 2 (2024) / Gemma 3 (Mar 2025) | Gemma 2: 2B, 9B, 27B; Gemma 3: 1B, 4B, 12B, 27B | Gemma 3: text + image (vision); Gemma 2: text only | Open weights; fine-tuning; local deployment; Gemma 3 27B competitive with models 3× its size; Gemma 3 supports 128k context; multilingual across 140+ languages |
Gemini's 1M+ Token Context Window
Gemini 1.5 Pro introduced the first production 1M-token context window; the Gemini 2.0 and 2.5 models maintain it. 1,000,000 tokens is roughly 750,000 words, about the length of an entire long novel series. This is made possible by a Mixture of Experts architecture with efficient attention and Google's custom TPU infrastructure. Real-world capabilities: analysing hour-plus videos, processing complete codebases in-context, and reasoning over the full 402-page Apollo 11 mission transcript in a single prompt. The implication: for many retrieval tasks, you can replace a complex RAG pipeline with direct in-context document loading. Gemini 2.5 Pro extends this further with reasoning capabilities on top of long context.
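A rough feasibility check for the "load everything into context" approach, using the common (approximate) four-characters-per-token heuristic for English prose; all numbers are illustrative:

```python
# Rough feasibility check for direct in-context document loading, using
# the ~4-characters-per-token heuristic for English prose. Real token
# counts depend on the tokeniser; this is an estimate only.

def estimated_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(docs: list[str], window: int = 1_000_000,
                    reply_reserve: int = 8_000) -> bool:
    """True if all documents plus a reply budget fit in one context window."""
    return sum(estimated_tokens(d) for d in docs) + reply_reserve <= window

corpus = ["lorem ipsum " * 50_000]  # ~600k characters, roughly 150k tokens
fits = fits_in_context(corpus)      # easily inside a 1M-token window
```

When a corpus fails this check (or cost per call matters), a retrieval step remains the pragmatic fallback; long context and RAG are complements, not strict substitutes.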
Anthropic Claude Series
Anthropic's Claude models are trained with Constitutional AI (CAI) and RLHF-based safety techniques, producing models that are notably strong at long-document analysis, instruction following, coding, and nuanced writing. Claude's large context windows (up to 200k tokens) and low hallucination rates have made it a popular choice for enterprise document workflows.
| Model | Context Window | Key Capabilities | Position |
|---|---|---|---|
| Claude 3 Haiku (Mar 2024) | 200k tokens | Fastest and cheapest Claude 3; strong on classification, summarisation, and simple instruction following; near-instant responses | Cost-optimised / high-throughput |
| Claude 3 Sonnet (Mar 2024) | 200k tokens | Balanced speed and capability; strong coding and analysis; multimodal (image input); replaced by 3.5 Sonnet as the default | Mid-tier (superseded) |
| Claude 3 Opus (Mar 2024) | 200k tokens | Highest capability of Claude 3 generation; top on complex reasoning and nuanced tasks; slow and expensive; multimodal | Frontier (Claude 3 generation) |
| Claude 3.5 Sonnet (Jun / Oct 2024) | 200k tokens | Surpassed Opus on most benchmarks at Sonnet pricing; exceptional coding (top on SWE-bench verified); strong at multi-step agentic tasks; "computer use" capability (Oct 2024 version can control desktop UI); multimodal | Flagship / recommended default |
| Claude 3.5 Haiku (Nov 2024) | 200k tokens | Fastest in the 3.5 generation; stronger than Claude 3 Sonnet at a fraction of the cost; strong coding for a small model; tool use and vision capable | Fast / cost-optimised |
| Claude 3.7 Sonnet (Feb 2025) | 200k tokens | Hybrid reasoning model: standard response mode plus optional "extended thinking" mode where the model generates a long CoT scratchpad before answering; strongest Anthropic model; top SWE-bench scores; excellent at multi-step agentic coding workflows | Current flagship (2025) |
Claude 3.7 Extended Thinking
Claude 3.7 Sonnet introduced Anthropic's first extended thinking mode, analogous to OpenAI's o1 reasoning tokens. When enabled (via the API with a thinking parameter and a budget_tokens limit), Claude generates an internal CoT scratchpad before producing its response. The thinking content is visible to the user (unlike OpenAI's hidden reasoning tokens), providing transparency into the model's reasoning process. Extended thinking significantly improves performance on competition math, complex coding, and multi-step scientific reasoning, with quality scaling with the thinking token budget. This positions Claude 3.7 as Anthropic's answer to the reasoning model category pioneered by OpenAI's o1.
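A minimal sketch of an extended-thinking request, using the thinking parameter and budget_tokens limit described above; the payload is assembled but never sent, and the model id is illustrative:

```python
# Sketch: an extended-thinking request payload using the `thinking`
# parameter and `budget_tokens` limit described above. Assembled only,
# never sent; the model id is illustrative.

def build_thinking_request(prompt: str, budget_tokens: int = 16_000,
                           max_tokens: int = 20_000) -> dict:
    """Payload with extended thinking enabled. The thinking budget must
    leave room inside the overall response budget (max_tokens)."""
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": "claude-3-7-sonnet-latest",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_thinking_request("How many primes are there below 10,000?")
```

Because quality scales with the thinking budget, a sensible pattern is to start with a small budget_tokens value and raise it only for problems the model gets wrong.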
Specialised & Open Models
Beyond the major model families, a rich ecosystem of specialised and community models pushes the frontier in efficiency, coding, reasoning, and task-specific performance. The 2024–2025 period saw dramatic advances from Chinese AI labs (DeepSeek, Alibaba) and continued refinement from Mistral and Microsoft; many of these models punch well above their parameter count.
| Model | Organisation | Specialty | Licence |
|---|---|---|---|
| Mistral Small 3 (Jan 2025) | Mistral AI | 24B params; state-of-the-art for its size class; strong instruction following and coding; low latency; designed as a cost-efficient alternative to larger models for production workloads | Apache 2.0 (fully open) |
| Mixtral 8×22B | Mistral AI | Sparse MoE: 141B total params, ~39B active per token; 64k context; strong multilingual; outperforms Llama 2 70B on most benchmarks; native function calling | Apache 2.0 |
| Phi-4 (Dec 2024) | Microsoft Research | 14B params; trained heavily on synthetic "textbook-quality" data; outperforms larger open models on STEM and reasoning benchmarks; strong on MMLU, MATH, HumanEval; designed for on-device and edge inference | MIT Licence |
| DeepSeek-V3 (Dec 2024) | DeepSeek AI | MoE: 671B total / 37B active params; trained at remarkably low cost ($5.6M reported); Multi-head Latent Attention (MLA) for KV-cache compression; FP8 mixed-precision training; rivals GPT-4o on many benchmarks while remaining open-weight | DeepSeek Licence (open weights, research & commercial use) |
| DeepSeek-R1 (Jan 2025) | DeepSeek AI | Reasoning model trained with GRPO (Group Relative Policy Optimisation); the R1-Zero variant shows reasoning emerging from RL alone, without SFT warm-up, while R1 adds a small cold-start SFT stage; 671B MoE base; rivals o1 on AIME math, Codeforces, and GPQA science; fully open weights; distilled versions (1.5B–70B) retain strong reasoning at small scale | MIT Licence (fully open) |
| Qwen2.5 (Sep 2024) | Alibaba Cloud (Qwen Team) | Dense models from 0.5B to 72B; strong multilingual (29+ languages); 128k context; Qwen2.5-72B competitive with Llama 3.1 405B on many benchmarks; Qwen2.5-Coder-32B competitive with GPT-4o on HumanEval; Qwen2.5-Math specialised for mathematical reasoning | Apache 2.0 for most sizes (the 3B and 72B use a separate Qwen licence) |
DeepSeek: The Open-Source Inflection Point
DeepSeek-V3 and DeepSeek-R1 (released December 2024 and January 2025) caused significant industry disruption. DeepSeek reported training V3 for approximately $5.6M, an order of magnitude or more below the estimated cost of comparable closed frontier models, through aggressive efficiency innovations: FP8 mixed-precision, MLA attention, auxiliary-loss-free load balancing, and a highly optimised training pipeline on H800 GPUs. DeepSeek-R1-Zero further demonstrated that reinforcement learning alone (without a supervised fine-tuning warm-up) can produce emergent reasoning behaviour, challenging assumptions about which training signals are necessary for reasoning capability. Both models are fully open-weight (R1 under the MIT Licence, V3 under DeepSeek's model licence), setting a new bar for what the open-source community can achieve.
Model Size Does Not Equal Capability
Phi-4 (14B parameters) outperforms many 70B open-source models on reasoning and STEM benchmarks. Microsoft's insight: instead of scaling parameters, scale data quality; training on synthetic, curated "textbook-quality" data teaches the model to reason well within a small parameter budget. Similarly, DeepSeek-R1's distilled 70B version matches or exceeds much larger models on competition math problems. The lesson: parameter count is a proxy for capability, not a guarantee. Always benchmark on your specific task before choosing a model size.
Choosing the Right Model
With dozens of competitive models available, the choice should be driven by your specific requirements: data privacy, latency, cost, task type, and available hardware. This decision table summarises common use cases.
| Use Case | Recommended Model(s) | Why |
|---|---|---|
| Local / Private deployment (no data leaves your infrastructure) | Llama 3.3 70B, Llama 3.1 8B, Mistral Small 3 (24B), Gemma 3 27B | Open weights with permissive licences; run with Ollama or vLLM; no API calls; GDPR-friendly; llama.cpp enables CPU inference; Llama 3.3 70B delivers near-frontier quality locally |
| Code generation & completion | GPT-4o, o4-mini, DeepSeek-V3, Qwen2.5-Coder-32B | GPT-4o and o4-mini lead on complex coding tasks; DeepSeek-V3 rivals them open-source at very low API cost; Qwen2.5-Coder-32B is the best open-weight coding specialist |
| Complex reasoning / math / science | o3, o4-mini, Gemini 2.5 Pro, DeepSeek-R1 | Reasoning models dominate here; o3 and Gemini 2.5 Pro lead closed rankings; DeepSeek-R1 is the fully open-weight alternative; all use extended chain-of-thought at inference |
| Long document analysis (contracts, codebases, research papers) | Gemini 2.0 Flash / 2.5 Pro (1M+ context), Claude 3.7 Sonnet (200k), GPT-4o (128k) | Context window is the primary constraint; Gemini 2.x models lead with 1M tokens; Claude 3.7 at 200k is strong for legal/document work; GPT-4o 128k covers most cases |
| Cost-sensitive production API | Gemini 2.0 Flash, GPT-4o mini, DeepSeek-V3 API, Mistral Small 3 (self-hosted) | Gemini 2.0 Flash is extremely cheap with frontier-class quality; DeepSeek API costs a fraction of OpenAI pricing; self-hosted Mistral Small 3 has near-zero marginal cost at 24B scale |
| Best overall quality (no budget constraint) | o3, Gemini 2.5 Pro, Claude 3.7 Sonnet | All three consistently top LMSYS Chatbot Arena and major 2025 benchmarks; the choice depends on task: o3 for reasoning/code, Gemini 2.5 Pro for long context + multimodal, Claude 3.7 for writing and instruction following |
| On-device / edge inference (phone, laptop, IoT) | Phi-4 (14B, quantised), Gemma 3 4B, Llama 3.2 1B/3B, Qwen2.5 1.5B | Quantised to 4-bit, these fit in roughly 0.5–8GB; Phi-4 at 4-bit delivers remarkable reasoning quality for its size; Gemma 3 4B includes vision; Llama 3.2 1B specifically designed for edge deployment |
| Multilingual applications | Qwen2.5 72B, Gemini 2.0 Flash, Gemma 3 27B (140+ languages) | Qwen2.5 optimised for Chinese/English and 29+ languages; Gemma 3 explicitly supports 140+ languages; Gemini 2.0 Flash offers strong multilingual with very low cost |
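The edge-inference row above rests on simple arithmetic: 4-bit quantisation stores each weight in half a byte. A sketch of that estimate (weights only; the KV cache and runtime overhead are ignored, so real requirements run somewhat higher):

```python
# Weight-memory estimate behind the edge-inference row: a 4-bit quantised
# model stores each parameter in half a byte. KV cache and runtime
# overhead are ignored, so real requirements run somewhat higher.

def weight_memory_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate weight storage in GB at the given quantisation width."""
    return params_billion * 1e9 * bits / 8 / 1e9

phi4_gb = weight_memory_gb(14)     # ~7 GB for Phi-4 at 4-bit
llama_1b_gb = weight_memory_gb(1)  # ~0.5 GB for Llama 3.2 1B at 4-bit
gemma_4b_gb = weight_memory_gb(4)  # ~2 GB for Gemma 3 4B at 4-bit
```

The same function at bits=16 shows why quantisation matters: fp16 weights need 4× the memory of a 4-bit build of the same model.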
On Benchmarks and Their Limitations
Model selection is often guided by benchmark scores, but these have important caveats:
- MMLU (Massive Multitask Language Understanding): 57 academic subjects; tests factual recall. Limitation: multiple-choice format doesn't capture generation quality or instruction following.
- HumanEval: 164 Python programming problems; measures code correctness. Limitation: small, well-known set, so models may have trained on solutions; doesn't test complex multi-file projects.
- MT-Bench: GPT-4-judged multi-turn conversation quality. Limitation: GPT-4 as judge has biases; may favour similar writing styles.
- LMSYS Chatbot Arena: Human preference ratings via blind pairwise comparisons; currently the most reliable real-world quality signal because it uses actual humans on diverse open-ended tasks.
- Contamination risk: benchmark test sets may be present in training data, inflating scores. New "contamination-free" benchmarks (LiveBench, MMLU-Pro) are addressing this.
Bottom line: always evaluate on a sample of your specific task data rather than relying solely on public benchmarks. A model that ranks third on MMLU may rank first on your domain-specific task.
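That bottom line can be operationalised with a very small harness: score each candidate on labelled samples from your own task and rank by accuracy. The models below are hypothetical stand-in functions; in practice each would wrap a real API call:

```python
# Tiny "benchmark on your own task" harness: rank candidate models by
# accuracy on labelled samples from your domain. The models here are
# stand-in functions; in practice each would wrap a real API call.

samples = [  # hypothetical domain-specific test cases
    ("Classify sentiment: 'great product'", "positive"),
    ("Classify sentiment: 'total waste of money'", "negative"),
    ("Classify sentiment: 'works fine'", "positive"),
]

def accuracy(model, dataset) -> float:
    """Fraction of samples where the model's answer matches the label."""
    return sum(model(p) == label for p, label in dataset) / len(dataset)

def always_positive(prompt: str) -> str:
    return "positive"  # naive baseline model

def keyword_model(prompt: str) -> str:
    return "negative" if "waste" in prompt else "positive"  # toy heuristic

candidates = {"always_positive": always_positive, "keyword": keyword_model}
ranking = sorted(candidates,
                 key=lambda name: accuracy(candidates[name], samples),
                 reverse=True)
```

Even 50 to 100 representative samples usually separate models more reliably for your use case than their public leaderboard positions.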
The MoE Revolution
Mixture of Experts models (Mixtral, GPT-4, Gemini 1.5/2.x, DeepSeek-V3) have changed the efficiency frontier. By activating only a fraction of parameters per token, MoE models can store many times more total parameters than a dense model while keeping per-token inference compute similar. DeepSeek-V3 (671B total / 37B active) demonstrates this clearly: near-GPT-4o quality while being far cheaper to serve than a dense model of comparable benchmark performance.
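The DeepSeek-V3 figures make the trade-off concrete:

```python
# DeepSeek-V3's reported parameter counts: per-token compute tracks the
# 37B active parameters, while total stored capacity is ~18x larger.

total_params_b = 671   # total parameters, in billions
active_params_b = 37   # parameters activated per token, in billions

capacity_ratio = total_params_b / active_params_b   # ~18x more stored knowledge
active_fraction = active_params_b / total_params_b  # ~5.5% of weights per token
```

In other words, V3 pays roughly the serving cost of a 37B dense model for the knowledge capacity of a 671B one.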
The Open vs Closed Trade-off
Closed models (GPT-4o, o3, Claude 3.7, Gemini 2.5) generally lead on the highest-capability tasks, but they require trusting a third party with your data, carry ongoing API costs, and can change without notice. Open models (Llama 3.3, DeepSeek-V3/R1, Mistral Small 3, Gemma 3) dramatically narrowed the gap in 2024–2025; DeepSeek-R1 matches o1 on several reasoning benchmarks with fully open weights. For many production use cases, open-weight models now deliver acceptable quality with full control, reproducibility, fine-tuning freedom, and on-premise deployment.