Why We Tokenise
Neural networks are mathematical functions over real numbers. Natural language is sequences of characters. Tokenisation is the process of mapping text to a discrete sequence of integer IDs that the model can look up in an embedding table. The granularity of these units (characters, words, or subwords) profoundly affects what the model can learn and how efficiently it learns it.
Character-Level Tokenisation
Split text into individual characters (a, b, c, space, punctuation, …). Very small vocabulary (~100–300 tokens), handles any word without out-of-vocabulary issues, but sequences become very long. A 100-word paragraph becomes ~500 character tokens. Attention must span far more steps to capture word-level meaning, and the model must learn word boundaries from scratch. Used in some early models and character-level language models, but rarely in LLMs.
Word-Level Tokenisation
Split on whitespace and punctuation. Short sequences, human-interpretable tokens. But vocabulary size explodes: English alone has 170,000+ words, and with inflections, compounds, technical jargon, and code, you quickly need millions of entries. Worse, any word not in the training vocabulary becomes Out-Of-Vocabulary (OOV): the model sees it as [UNK] and loses all information about it. Morphologically rich languages (Finnish, Turkish) or mixed-language text are especially hurt.
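A toy comparison makes the trade-off concrete (illustrative Python; the small vocabulary and the [UNK] convention are invented for the example):

```python
sentence = "Tokenisation maps text to integer IDs."
char_tokens = list(sentence)     # character-level: one token per character
word_tokens = sentence.split()   # crude word-level: split on whitespace

# character-level sequences are several times longer than word-level ones
print(len(char_tokens), len(word_tokens))  # 38 vs 6

# word-level OOV: any word missing from a fixed vocabulary is lost as [UNK]
vocab = {"tokenisation", "maps", "text", "to", "ids."}
segmented = [w if w.lower() in vocab else "[UNK]" for w in word_tokens]
# "integer" is not in the toy vocabulary, so it becomes [UNK]
```

A subword tokeniser would instead break "integer" into known fragments, losing nothing.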
Subword Tokenisation (The Sweet Spot)
Modern LLMs split text into subword units: fragments that are commonly occurring substrings. Common words stay as single tokens ("the", "is"); rare or novel words split into meaningful pieces ("unhappiness" → "un" + "happ" + "iness"). Vocabulary is manageable (32k–130k tokens), sequences are reasonable length, and OOV is nearly eliminated because any word can be decomposed to individual bytes if necessary. BPE, WordPiece, and SentencePiece/Unigram are the three main approaches.
Vocabulary Size Trade-offs
A larger vocabulary means longer, more meaningful tokens (fewer tokens per sentence) but a larger embedding matrix (memory) and output projection layer. A smaller vocabulary means shorter tokens but longer sequences. The embedding matrix alone costs vocab_size × d_model × 4 bytes in fp32: for a 100k-token vocabulary at a GPT-3-scale d_model of 12288, that's ~5GB just for embeddings. Llama 3 uses 128k tokens, trading memory for better multilingual coverage and fewer tokens per code snippet.
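The memory arithmetic can be sketched directly (fp32 assumed here; production checkpoints often use bf16/fp16, halving these figures):

```python
def embedding_bytes(vocab_size, d_model, bytes_per_param=4):
    # storage for the vocab_size x d_model embedding matrix
    return vocab_size * d_model * bytes_per_param

# 100k vocabulary at d_model = 12288
gb = embedding_bytes(100_000, 12_288) / 1e9       # ~4.9 GB

# Llama 3 8B: 128,256-token vocabulary at d_model = 4096
llama3_gb = embedding_bytes(128_256, 4_096) / 1e9  # ~2.1 GB
```

With weight tying (see below) the output projection reuses this matrix, so the cost is paid once rather than twice.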
Tokenisation Algorithms
All subword tokenisers follow the same basic idea: learn a vocabulary of common substrings from a large training corpus, then segment new text greedily using that vocabulary. They differ in how the vocabulary is constructed and how text is segmented.
| Algorithm | How it Works | Used By | Typical Vocab Size |
|---|---|---|---|
| Byte-Pair Encoding (BPE) | Start with individual characters (or bytes) as the vocab. Iteratively merge the most frequent adjacent pair into a new token. Repeat until the target vocab size is reached. Segment new text by applying merges greedily in learned order. | GPT-2, GPT-3, GPT-4, Llama 2, Falcon, BART | 32k – 100k |
| WordPiece | Similar to BPE but merges are chosen to maximise the likelihood of the training data under a language model (not just frequency). Subword tokens that don't start a word are prefixed with ## to indicate continuation. | BERT, DistilBERT, ELECTRA, mBERT | 30k |
| SentencePiece + Unigram | Language-agnostic: operates on raw Unicode text, treating whitespace as a regular symbol (no pre-tokenisation needed). The Unigram variant starts with a large vocabulary and prunes tokens that contribute least to training-data likelihood. | Llama 1, T5, ALBERT, XLNet, many multilingual models | 32k |
| Byte-level BPE | BPE operating on UTF-8 bytes rather than characters. Guarantees zero OOV: any text can be encoded as raw bytes, enabling lossless encoding of any Unicode text, emoji, and binary-adjacent content. | GPT-2 onward (OpenAI), GPT-4, Llama 3, Mistral | 50k – 128k |
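The BPE training loop described in the table can be sketched as a toy trainer (illustrative only; real implementations also use the recorded merge order to segment new text, and operate on bytes with frequency-weighted words):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn merge rules from a word list, starting from characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs in the current segmentation
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append((a, b))
        # apply the merge everywhere in the corpus
        for word in corpus:
            i = 0
            while i < len(word) - 1:
                if word[i] == a and word[i + 1] == b:
                    word[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

merges, segmented = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
# learns "l"+"o", then "lo"+"w", then "low"+"e"; "low" becomes one token
```

Each merge creates one new vocabulary entry, so the loop runs until the target vocabulary size is reached.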
A concrete example of byte-level BPE tokenisation:
# Tokenising with the cl100k_base BPE encoding (used by GPT-4) via tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "unhappiness"
tokens = enc.encode(text)
pieces = [enc.decode([t]) for t in tokens]
# a rare word splits into a few subword pieces (e.g. "un" + "happ" + "iness";
# the exact split depends on the learned vocabulary)
text2 = "Transformers are amazing!"
pieces2 = [enc.decode([t]) for t in enc.encode(text2)]
# Note that a leading space is part of the token: BPE models
# treat " amazing" as a different token from "amazing"
text3 = "xkcd42_neural_net_zzqrpv"  # invented word
tokens3 = enc.encode(text3)
# falls back to smaller byte-level pieces, so there is no OOV error
Why Tokenisation Affects Model Behaviour
Tokenisation choices have non-obvious downstream effects. Numbers are often split digit-by-digit ("1024" → ["1", "0", "2", "4"]), making arithmetic harder because the model never sees the number as a unit. Languages with fewer training examples get split into more tokens per word, effectively making them "harder" for the model. Code often tokenises efficiently because identifiers match learned subwords. Understanding your tokeniser helps debug unexpected model failures.
Token Embeddings
After tokenisation produces a sequence of integer IDs, the model must convert each ID into a dense floating-point vector, an embedding. These vectors live in a high-dimensional space where geometric relationships encode semantic relationships.
The Embedding Matrix
The embedding layer is a lookup table: a matrix of shape vocab_size × d_model. Each row is the learnable embedding vector for one token. Given a token ID, the model simply retrieves that row. The entire matrix is trained end-to-end via gradient descent; no hand-crafting is required. The same matrix is often reused (transposed) as the final output projection layer to reduce parameters, a technique called weight tying.
import torch
import torch.nn as nn
E = nn.Embedding(50257, 768)  # GPT-2 small: 50257 × 768 ≈ 38.6M parameters in the embedding layer alone
x_ids = torch.tensor([1234, 567, 890])  # token IDs
x_emb = E(x_ids)  # shape: [3, 768]
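Weight tying can be sketched framework-agnostically in NumPy (a toy sketch with invented sizes; real models apply the projection to the final hidden states, not the raw embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
E = rng.normal(size=(vocab_size, d_model)).astype(np.float32)  # embedding matrix

ids = np.array([12, 345, 7])
x = E[ids]           # embedding lookup: shape [3, 64]

# weight tying: reuse E (transposed) as the output projection
h = x                # stand-in for the final hidden states
logits = h @ E.T     # shape [3, 1000]: one score per vocabulary entry
```

Because h here is just the looked-up embedding, each position's own token gets the highest logit; in a real model, intermediate layers transform h so the highest logit predicts the next token instead.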
Embedding Dimensions by Model
The embedding dimension d_model scales with model size:
- GPT-2 Small: d_model = 768
- GPT-2 XL: d_model = 1600
- GPT-3 175B: d_model = 12288
- Llama 3 8B: d_model = 4096
- Llama 3 70B: d_model = 8192
- Llama 3.1 405B: d_model = 16384
Larger d_model means a richer representational space (more "axes" to encode meaning along) but proportionally more compute and memory.
The KingβQueen Analogy
Word2Vec (2013) famously showed that static word embeddings encode semantic analogies as geometric relationships:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") − vec("France") + vec("Germany") ≈ vec("Berlin")
This works because embeddings are trained to place similar words near each other. Directions in embedding space correspond to semantic dimensions (gender, geography, tense). LLM embeddings carry similar structure, but enriched by context.
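The analogy arithmetic can be reproduced with hand-made toy vectors (the 2-D coordinates below are invented purely for illustration; real embeddings have hundreds of dimensions and are learned, not designed):

```python
import numpy as np

# toy 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
    "car":   np.array([-1.0, 0.5]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [1, -1]

def nearest(v, exclude):
    # nearest neighbour by Euclidean distance, skipping the query words
    return min((w for w in vecs if w not in exclude),
               key=lambda w: np.linalg.norm(vecs[w] - v))

analogy = nearest(target, exclude={"king", "man", "woman"})  # "queen"
```

Excluding the query words is standard practice: the unmodified "king" vector is otherwise often the nearest point to the target.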
Why Embeddings Are Learned (Not Hand-Designed)
Early NLP used hand-crafted features (bag-of-words, TF-IDF, WordNet synsets). Learned embeddings eliminated this: the model discovers what representational structure is useful for its task. The embedding layer parameters receive gradients through backpropagation and update to minimise the training loss, meaning the embedding space organises itself around what matters for predicting next tokens. This self-organisation produces surprisingly structured representations without any explicit supervision about word meaning.
Contextual vs Static Embeddings
Static embeddings (Word2Vec, GloVe, FastText) assign every word a single fixed vector regardless of context. The word "bank" has one vector, whether it appears as a financial institution or a river bank. Contextual embeddings, produced by running text through a Transformer, assign vectors that depend on the surrounding sentence.
| Property | Static Embeddings (Word2Vec, GloVe) | Contextual Embeddings (BERT, GPT) |
|---|---|---|
| Polysemy handling | One vector per word-form: "bank" always maps to the same point regardless of meaning | Different vectors for the same word in different contexts: "bank" in "river bank" vs "investment bank" are separate points |
| OOV handling | Words not in the training vocabulary have no embedding (OOV); subword models partially mitigate | Tokeniser splits unknown words; each subword gets an embedding, so nothing is truly OOV |
| Computation | O(1) lookup: just retrieve the row from the matrix | O(n²) attention: must run the full Transformer forward pass to get contextual vectors |
| Downstream task quality | Good baseline but limited; fine-tuning is not standard | State of the art; fine-tuning pretrained contextual models dominates NLP benchmarks |
| Storage | Small: vocab_size × d; ~3.6GB for 3M words at 300-dim in fp32 | Large: full model; BERT-base is ~440MB; Llama 3 8B is ~16GB |
Polysemy: The "Bank" Problem
The word "bank" has at least two major senses. In static embeddings, the single vector is a blurry average of all senses, useful for neither. With contextual embeddings from BERT or GPT, the vector for "bank" in "I deposited money at the bank" clusters with financial words (credit, account, loan), while "bank" in "we sat on the river bank" clusters with geographic words (river, shore, water). The surrounding context flows into the representation through attention.
How Context Enters the Embedding
In a Transformer, the initial embedding is a static lookup, identical to Word2Vec at layer 0. But each subsequent layer updates the representation by mixing in information from other positions via attention. By layer 12 of BERT-base, the "bank" token's vector has been enriched by attending to "river" or "money" depending on context. The final hidden state is the contextual embedding. Different layers capture different aspects: lower layers encode syntactic patterns, higher layers encode semantic/pragmatic content.
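A minimal sketch of this mixing, using one round of toy self-attention (random vectors stand in for learned embeddings, and Q = K = V = the raw embeddings; real layers apply learned projections first):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
emb = {w: rng.normal(size=d) for w in ["bank", "river", "money", "the"]}

def attend(tokens):
    """One round of self-attention mixing over a token sequence."""
    X = np.stack([emb[t] for t in tokens])         # static lookup: layer 0
    scores = X @ X.T / np.sqrt(d)                  # pairwise attention scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
    return weights @ X                             # context-mixed vectors

ctx_a = attend(["the", "river", "bank"])[2]   # "bank" with "river" mixed in
ctx_b = attend(["the", "money", "bank"])[2]   # "bank" with "money" mixed in
# the same token id now has two different contextual vectors
```

Stacking many such layers (with learned projections and non-linearities) is what turns the layer-0 static lookup into a rich contextual embedding.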
Embedding Spaces & Similarity
Once text is embedded into vectors, we can do powerful things: find semantically similar documents, cluster texts by topic, retrieve relevant passages for a question, or detect anomalies. The key tool is measuring distance in the embedding space.
Cosine Similarity
The standard metric for comparing embedding vectors. Measures the angle between two vectors, ignoring magnitude. Ranges from −1 (opposite) to +1 (identical direction):
cos_sim(a, b) = (a · b) / (||a|| × ||b||)
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )
# Illustrative values (actual numbers depend on the embedding model):
# cosine_similarity(emb("dog"), emb("cat")) ≈ 0.92 (close)
# cosine_similarity(emb("dog"), emb("justice")) ≈ 0.21 (far)
Sentence Embeddings
Token-level embeddings represent individual tokens. For document-level tasks (semantic search, clustering), we need a single vector for an entire sentence or paragraph. Three common approaches:
- Mean Pooling: average all token embeddings in the sequence
- CLS Token: BERT prepends a special [CLS] token; its final embedding is used as the sequence representation
- Sentence-BERT: fine-tuned to produce high-quality sentence vectors via siamese networks
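Mean pooling can be sketched in a few lines (a minimal NumPy version with a padding mask; libraries like Sentence-Transformers implement the same idea batched):

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings, ignoring padding positions (mask = 0)."""
    mask = attention_mask[:, None].astype(float)   # [seq_len, 1]
    return (token_embs * mask).sum(axis=0) / mask.sum()

rng = np.random.default_rng(0)
token_embs = rng.normal(size=(5, 16))   # 5 tokens, 16-dim embeddings
mask = np.array([1, 1, 1, 0, 0])        # last two positions are padding
sent_vec = mean_pool(token_embs, mask)  # shape: (16,)
```

Masking matters: averaging over padding vectors would pull every short sentence toward the same point.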
Nearest-Neighbour Search
Given a query embedding, find the k vectors in a database most similar to it. Naive brute-force is O(n×d) per query, which is too slow at scale. Approximate Nearest Neighbour (ANN) algorithms like FAISS (Facebook), ScaNN (Google), and HNSW trade a small accuracy loss for massive speedups, enabling billion-scale search in milliseconds. These power the retrieval step in RAG (Retrieval-Augmented Generation) pipelines.
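The brute-force baseline that ANN libraries approximate is easy to write (toy NumPy sketch; the database and query here are random stand-ins for real embeddings):

```python
import numpy as np

def top_k_cosine(query, database, k=3):
    """Exact (brute-force) nearest-neighbour search by cosine similarity."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                 # one dot product per database row: O(n*d)
    idx = np.argsort(-sims)[:k]   # indices of the k most similar rows
    return idx, sims[idx]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))            # 10k vectors, 64-dim
query = db[42] + 0.01 * rng.normal(size=64)   # near-duplicate of row 42
idx, sims = top_k_cosine(query, db)
# row 42 ranks first with similarity close to 1.0
```

Every query scans all n rows, which is exactly the cost ANN indexes avoid by partitioning or graph-walking the database.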
Dedicated Embedding Models (2024–2025)
While LLMs can produce embeddings from their hidden states, dedicated embedding models are trained specifically to maximise similarity quality for retrieval tasks. Key models to know as of 2025:
- text-embedding-3-small / text-embedding-3-large (OpenAI, Jan 2024): successor to ada-002; significantly better on the MTEB benchmark; support variable-dimension output via Matryoshka Representation Learning (MRL), so you can truncate to 256 or 512 dims with minimal quality loss; 1536 / 3072 max dimensions
- Gemini text-embedding (Google, 2024): 768-dimension embeddings via the models/text-embedding-004 API; strong multilingual performance; integrated into Google Cloud Vertex AI for RAG pipelines
- mxbai-embed-large-v1 (MixedBread AI, 2024): 335M-param BERT-large based; tops the MTEB leaderboard at its size class; fully open-source (Apache 2.0); 512 tokens max; excellent out-of-the-box quality for English retrieval
- nomic-embed-text-v1.5 (Nomic, 2024): fully open-source with reproducible training; supports Matryoshka embeddings (64–768 dims); long-context variant (8192 tokens); competitive with OpenAI text-embedding-3-small at zero cost
- BGE-M3 (BAAI, 2024): multilingual (100+ languages), supports dense, sparse (BM25-style), and multi-vector retrieval in a single model; 8192 token context; the most versatile open embedding model
- all-MiniLM-L6-v2 (SBERT): tiny (22M params), fast, good quality; popular for on-device and low-latency use; best throughput-per-dollar for high-volume applications where top accuracy is not required
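Matryoshka truncation, as mentioned for text-embedding-3 and nomic-embed above, is typically applied client-side: keep a prefix of the vector and re-normalise (a sketch under the assumption that the model was trained with an MRL objective, so prefixes remain meaningful; the 1536-dim vector below is a random stand-in):

```python
import numpy as np

def truncate_matryoshka(emb, dims):
    """Keep the first `dims` components, then re-normalise to unit length."""
    v = emb[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=1536)  # stand-in full embedding
short = truncate_matryoshka(full, 256)             # 256-dim, unit length
```

Truncated vectors cut index storage and search cost roughly in proportion to the dimension, at a small quality loss.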
| Use Case | Embedding Approach | Typical Dimension | Notes |
|---|---|---|---|
| Semantic search / RAG | Sentence embedding model (BGE, E5, OpenAI) | 768 – 3072 | Query and passages embedded separately; cosine similarity |
| Duplicate detection | Sentence-BERT with cosine threshold | 384 – 768 | Cosine similarity above ~0.85 typically indicates near-duplicates |
| Topic clustering | Mean-pooled embeddings + K-means / UMAP + HDBSCAN | Any | Reduce to 2D with UMAP for visualisation |
| Anomaly / drift detection | Mean embedding of reference set; Mahalanobis distance | Any | Detect out-of-distribution inputs in production pipelines |