Why We Tokenise
Neural networks are mathematical functions over real numbers. Natural language is sequences of characters. Tokenisation is the process of mapping text to a discrete sequence of integer IDs that the model can look up in an embedding table. The granularity of these units (characters, words, or subwords) profoundly affects what the model can learn and how efficiently it learns it.
Character-Level Tokenisation
Split text into individual characters (a, b, c, space, punctuation, …). Very small vocabulary (~100–300 tokens), handles any word without out-of-vocabulary issues, but sequences become very long. A 100-word paragraph becomes ~500 character tokens. Attention must span far more steps to capture word-level meaning, and the model must learn word boundaries from scratch. Used in some early models and character-level language models, but rarely in LLMs.
Word-Level Tokenisation
Split on whitespace and punctuation. Short sequences, human-interpretable tokens. But vocabulary size explodes: English alone has 170,000+ words, and with inflections, compounds, technical jargon, and code, you quickly need millions of entries. Worse, any word not in the training vocabulary becomes Out-Of-Vocabulary (OOV): the model sees it as [UNK] and loses all information about it. Morphologically rich languages (Finnish, Turkish) or mixed-language text are especially hurt.
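A toy comparison makes the trade-off concrete (illustrative Python; the small vocabulary and the [UNK] convention are invented for the example):

```python
sentence = "Tokenisation maps text to integer IDs."
char_tokens = list(sentence)     # character-level: one token per character
word_tokens = sentence.split()   # crude word-level: split on whitespace

# character-level sequences are several times longer than word-level ones
print(len(char_tokens), len(word_tokens))  # 38 vs 6

# word-level OOV: any word missing from a fixed vocabulary is lost as [UNK]
vocab = {"tokenisation", "maps", "text", "to", "ids."}
segmented = [w if w.lower() in vocab else "[UNK]" for w in word_tokens]
# "integer" is not in the toy vocabulary, so it becomes [UNK]
```

A subword tokeniser would instead break "integer" into known fragments, losing nothing.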
Subword Tokenisation (The Sweet Spot)
Modern LLMs split text into subword units: fragments that are commonly occurring substrings. Common words stay as single tokens ("the", "is"); rare or novel words split into meaningful pieces ("unhappiness" → "un" + "happ" + "iness"). Vocabulary is manageable (32k–130k tokens), sequences are reasonable length, and OOV is nearly eliminated because any word can be decomposed to individual bytes if necessary. BPE, WordPiece, and SentencePiece/Unigram are the three main approaches.
Vocabulary Size Trade-offs
A larger vocabulary means longer, more meaningful tokens (fewer tokens per sentence) but a larger embedding matrix (memory) and output projection layer. A smaller vocabulary means shorter tokens but longer sequences. The embedding matrix alone costs vocab_size × d_model × 4 bytes in fp32: for a 100k-token vocabulary at a GPT-3-scale d_model of 12288, that's ~5GB just for embeddings. Llama 3 uses 128k tokens, trading memory for better multilingual coverage and fewer tokens per code snippet.
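The memory arithmetic can be sketched directly (fp32 assumed here; production checkpoints often use bf16/fp16, halving these figures):

```python
def embedding_bytes(vocab_size, d_model, bytes_per_param=4):
    # storage for the vocab_size x d_model embedding matrix
    return vocab_size * d_model * bytes_per_param

# 100k vocabulary at d_model = 12288
gb = embedding_bytes(100_000, 12_288) / 1e9       # ~4.9 GB

# Llama 3 8B: 128,256-token vocabulary at d_model = 4096
llama3_gb = embedding_bytes(128_256, 4_096) / 1e9  # ~2.1 GB
```

With weight tying (see below) the output projection reuses this matrix, so the cost is paid once rather than twice.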
Tokenisation Algorithms
All subword tokenisers follow the same basic idea: learn a vocabulary of common substrings from a large training corpus, then segment new text greedily using that vocabulary. They differ in how the vocabulary is constructed and how text is segmented.
| Algorithm | How it Works | Used By | Typical Vocab Size |
|---|---|---|---|
| Byte-Pair Encoding (BPE) | Start with individual characters (or bytes) as the vocab. Iteratively merge the most frequent adjacent pair into a new token. Repeat until the target vocab size is reached. Segment new text by applying merges greedily in learned order. | GPT-2, GPT-3, GPT-4, Llama 2, Falcon, BART | 32k – 100k |
| WordPiece | Similar to BPE but merges are chosen to maximise the likelihood of the training data under a language model (not just frequency). Subword tokens that don't start a word are prefixed with ## to indicate continuation. | BERT, DistilBERT, ELECTRA, mBERT | 30k |
| SentencePiece + Unigram | Language-agnostic: operates on raw Unicode text, treating whitespace as a regular symbol (no pre-tokenisation needed). The Unigram variant starts with a large vocabulary and prunes tokens that contribute least to training-data likelihood. | Llama 1, T5, ALBERT, XLNet, many multilingual models | 32k |
| Byte-level BPE | BPE operating on UTF-8 bytes rather than characters. Guarantees zero OOV: any text can be encoded as raw bytes, enabling lossless encoding of any Unicode text, emoji, and binary-adjacent content. | GPT-2 onward (OpenAI), GPT-4, Llama 3, Mistral | 50k – 128k |
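The BPE training loop described in the table can be sketched as a toy trainer (illustrative only; real implementations also use the recorded merge order to segment new text, and operate on bytes with frequency-weighted words):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn merge rules from a word list, starting from characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs in the current segmentation
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append((a, b))
        # apply the merge everywhere in the corpus
        for word in corpus:
            i = 0
            while i < len(word) - 1:
                if word[i] == a and word[i + 1] == b:
                    word[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

merges, segmented = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
# learns "l"+"o", then "lo"+"w", then "low"+"e"; "low" becomes one token
```

Each merge creates one new vocabulary entry, so the loop runs until the target vocabulary size is reached.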
A concrete example of byte-level BPE tokenisation:
# Tokenising with the cl100k_base BPE encoding (used by GPT-4) via tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "unhappiness"
tokens = enc.encode(text)
pieces = [enc.decode([t]) for t in tokens]
# a rare word splits into a few subword pieces (e.g. "un" + "happ" + "iness";
# the exact split depends on the learned vocabulary)
text2 = "Transformers are amazing!"
pieces2 = [enc.decode([t]) for t in enc.encode(text2)]
# Note that a leading space is part of the token: BPE models
# treat " amazing" as a different token from "amazing"
text3 = "xkcd42_neural_net_zzqrpv"  # invented word
tokens3 = enc.encode(text3)
# falls back to smaller byte-level pieces, so there is no OOV error
Why Tokenisation Affects Model Behaviour
Tokenisation choices have non-obvious downstream effects. Numbers are often split digit-by-digit ("1024" → ["1", "0", "2", "4"]), making arithmetic harder because the model never sees the number as a unit. Languages with fewer training examples get split into more tokens per word, effectively making them "harder" for the model. Code often tokenises efficiently because identifiers match learned subwords. Understanding your tokeniser helps debug unexpected model failures.
Token Embeddings
After tokenisation produces a sequence of integer IDs, the model must convert each ID into a dense floating-point vector, an embedding. These vectors live in a high-dimensional space where geometric relationships encode semantic relationships.
The Embedding Matrix
The embedding layer is a lookup table: a matrix of shape vocab_size × d_model. Each row is the learnable embedding vector for one token. Given a token ID, the model simply retrieves that row. The entire matrix is trained end-to-end via gradient descent; no hand-crafting is required. The same matrix is often reused (transposed) as the final output projection layer to reduce parameters, a technique called weight tying.
import torch
import torch.nn as nn
E = nn.Embedding(50257, 768)  # GPT-2 small: 50257 × 768 ≈ 38.6M parameters in the embedding layer alone
x_ids = torch.tensor([1234, 567, 890])  # token IDs
x_emb = E(x_ids)  # shape: [3, 768]
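Weight tying can be sketched framework-agnostically in NumPy (a toy sketch with invented sizes; real models apply the projection to the final hidden states, not the raw embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64
E = rng.normal(size=(vocab_size, d_model)).astype(np.float32)  # embedding matrix

ids = np.array([12, 345, 7])
x = E[ids]           # embedding lookup: shape [3, 64]

# weight tying: reuse E (transposed) as the output projection
h = x                # stand-in for the final hidden states
logits = h @ E.T     # shape [3, 1000]: one score per vocabulary entry
```

Because h here is just the looked-up embedding, each position's own token gets the highest logit; in a real model, intermediate layers transform h so the highest logit predicts the next token instead.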
Embedding Dimensions by Model
The embedding dimension d_model scales with model size:
- GPT-2 Small: d_model = 768
- GPT-2 XL: d_model = 1600
- GPT-3 175B: d_model = 12288
- Llama 3 8B: d_model = 4096
- Llama 3 70B: d_model = 8192
- Llama 3.1 405B: d_model = 16384
Larger d_model means a richer representational space (more "axes" to encode meaning along) but proportionally more compute and memory.
The KingβQueen Analogy
Word2Vec (2013) famously showed that static word embeddings encode semantic analogies as geometric relationships:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") − vec("France") + vec("Germany") ≈ vec("Berlin")
This works because embeddings are trained to place similar words near each other. Directions in embedding space correspond to semantic dimensions (gender, geography, tense). LLM embeddings carry similar structure, but enriched by context.
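The analogy arithmetic can be reproduced with hand-made toy vectors (the 2-D coordinates below are invented purely for illustration; real embeddings have hundreds of dimensions and are learned, not designed):

```python
import numpy as np

# toy 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
    "car":   np.array([-1.0, 0.5]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [1, -1]

def nearest(v, exclude):
    # nearest neighbour by Euclidean distance, skipping the query words
    return min((w for w in vecs if w not in exclude),
               key=lambda w: np.linalg.norm(vecs[w] - v))

analogy = nearest(target, exclude={"king", "man", "woman"})  # "queen"
```

Excluding the query words is standard practice: the unmodified "king" vector is otherwise often the nearest point to the target.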
Why Embeddings Are Learned (Not Hand-Designed)
Early NLP used hand-crafted features (bag-of-words, TF-IDF, WordNet synsets). Learned embeddings eliminated this: the model discovers what representational structure is useful for its task. The embedding layer parameters receive gradients through backpropagation and update to minimise the training loss, meaning the embedding space organises itself around what matters for predicting next tokens. This self-organisation produces surprisingly structured representations without any explicit supervision about word meaning.
Contextual vs Static Embeddings
Static embeddings (Word2Vec, GloVe, FastText) assign every word a single fixed vector regardless of context. The word "bank" has one vector, whether it appears as a financial institution or a river bank. Contextual embeddings, produced by running text through a Transformer, assign vectors that depend on the surrounding sentence.
| Property | Static Embeddings (Word2Vec, GloVe) | Contextual Embeddings (BERT, GPT) |
|---|---|---|
| Polysemy handling | One vector per word-form: "bank" always maps to the same point regardless of meaning | Different vectors for the same word in different contexts: "bank" in "river bank" vs "investment bank" are separate points |
| OOV handling | Words not in the training vocabulary have no embedding (OOV); subword models partially mitigate | Tokeniser splits unknown words; each subword gets an embedding, so nothing is truly OOV |
| Computation | O(1) lookup: just retrieve the row from the matrix | O(n²) attention: must run the full Transformer forward pass to get contextual vectors |
| Downstream task quality | Good baseline but limited; fine-tuning is not standard | State of the art; fine-tuning pretrained contextual models dominates NLP benchmarks |
| Storage | Small: vocab_size × d; ~3.6GB for 3M words at 300-dim in fp32 | Large: full model; BERT-base is ~440MB; Llama 3 8B is ~16GB |
Polysemy: The "Bank" Problem
The word "bank" has at least two major senses. In static embeddings, the single vector is a blurry average of all senses, useful for neither. With contextual embeddings from BERT or GPT, the vector for "bank" in "I deposited money at the bank" clusters with financial words (credit, account, loan), while "bank" in "we sat on the river bank" clusters with geographic words (river, shore, water). The surrounding context flows into the representation through attention.
How Context Enters the Embedding
In a Transformer, the initial embedding is a static lookup, identical to Word2Vec at layer 0. But each subsequent layer updates the representation by mixing in information from other positions via attention. By layer 12 of BERT-base, the "bank" token's vector has been enriched by attending to "river" or "money" depending on context. The final hidden state is the contextual embedding. Different layers capture different aspects: lower layers encode syntactic patterns, higher layers encode semantic/pragmatic content.
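A minimal sketch of this mixing, using one round of toy self-attention (random vectors stand in for learned embeddings, and Q = K = V = the raw embeddings; real layers apply learned projections first):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
emb = {w: rng.normal(size=d) for w in ["bank", "river", "money", "the"]}

def attend(tokens):
    """One round of self-attention mixing over a token sequence."""
    X = np.stack([emb[t] for t in tokens])         # static lookup: layer 0
    scores = X @ X.T / np.sqrt(d)                  # pairwise attention scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
    return weights @ X                             # context-mixed vectors

ctx_a = attend(["the", "river", "bank"])[2]   # "bank" with "river" mixed in
ctx_b = attend(["the", "money", "bank"])[2]   # "bank" with "money" mixed in
# the same token id now has two different contextual vectors
```

Stacking many such layers (with learned projections and non-linearities) is what turns the layer-0 static lookup into a rich contextual embedding.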
Embedding Spaces & Similarity
Once text is embedded into vectors, we can do powerful things: find semantically similar documents, cluster texts by topic, retrieve relevant passages for a question, or detect anomalies. The key tool is measuring distance in the embedding space.
Cosine Similarity
The standard metric for comparing embedding vectors. Measures the angle between two vectors, ignoring magnitude. Ranges from −1 (opposite) to +1 (identical direction):
cos_sim(a, b) = (a · b) / (||a|| × ||b||)
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )
# Illustrative values (actual numbers depend on the embedding model):
# cosine_similarity(emb("dog"), emb("cat")) ≈ 0.92 (close)
# cosine_similarity(emb("dog"), emb("justice")) ≈ 0.21 (far)
Sentence Embeddings
Token-level embeddings represent individual tokens. For document-level tasks (semantic search, clustering), we need a single vector for an entire sentence or paragraph. Three common approaches:
- Mean Pooling: average all token embeddings in the sequence
- CLS Token: BERT prepends a special [CLS] token; its final embedding is used as the sequence representation
- Sentence-BERT: fine-tuned to produce high-quality sentence vectors via siamese networks
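Mean pooling can be sketched in a few lines (a minimal NumPy version with a padding mask; libraries like Sentence-Transformers implement the same idea batched):

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings, ignoring padding positions (mask = 0)."""
    mask = attention_mask[:, None].astype(float)   # [seq_len, 1]
    return (token_embs * mask).sum(axis=0) / mask.sum()

rng = np.random.default_rng(0)
token_embs = rng.normal(size=(5, 16))   # 5 tokens, 16-dim embeddings
mask = np.array([1, 1, 1, 0, 0])        # last two positions are padding
sent_vec = mean_pool(token_embs, mask)  # shape: (16,)
```

Masking matters: averaging over padding vectors would pull every short sentence toward the same point.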
Nearest-Neighbour Search
Given a query embedding, find the k vectors in a database most similar to it. Naive brute-force is O(n×d) per query, which is too slow at scale. Approximate Nearest Neighbour (ANN) algorithms like FAISS (Facebook), ScaNN (Google), and HNSW trade a small accuracy loss for massive speedups, enabling billion-scale search in milliseconds. These power the retrieval step in RAG (Retrieval-Augmented Generation) pipelines.
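The brute-force baseline that ANN libraries approximate is easy to write (toy NumPy sketch; the database and query here are random stand-ins for real embeddings):

```python
import numpy as np

def top_k_cosine(query, database, k=3):
    """Exact (brute-force) nearest-neighbour search by cosine similarity."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                 # one dot product per database row: O(n*d)
    idx = np.argsort(-sims)[:k]   # indices of the k most similar rows
    return idx, sims[idx]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))            # 10k vectors, 64-dim
query = db[42] + 0.01 * rng.normal(size=64)   # near-duplicate of row 42
idx, sims = top_k_cosine(query, db)
# row 42 ranks first with similarity close to 1.0
```

Every query scans all n rows, which is exactly the cost ANN indexes avoid by partitioning or graph-walking the database.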
Dedicated Embedding Models (2024–2025)
While LLMs can produce embeddings from their hidden states, dedicated embedding models are trained specifically to maximise similarity quality for retrieval tasks. Key models to know as of 2025:
- text-embedding-3-small / text-embedding-3-large (OpenAI, Jan 2024): successor to ada-002; significantly better on the MTEB benchmark; support variable-dimension output via Matryoshka Representation Learning (MRL), so you can truncate to 256 or 512 dims with minimal quality loss; 1536 / 3072 max dimensions
- Gemini text-embedding (Google, 2024): 768-dimension embeddings via the models/text-embedding-004 API; strong multilingual performance; integrated into Google Cloud Vertex AI for RAG pipelines
- mxbai-embed-large-v1 (MixedBread AI, 2024): 335M-param BERT-large based; tops the MTEB leaderboard at its size class; fully open-source (Apache 2.0); 512 tokens max; excellent out-of-the-box quality for English retrieval
- nomic-embed-text-v1.5 (Nomic, 2024): fully open-source with reproducible training; supports Matryoshka embeddings (64–768 dims); long-context variant (8192 tokens); competitive with OpenAI text-embedding-3-small at zero cost
- BGE-M3 (BAAI, 2024): multilingual (100+ languages), supports dense, sparse (BM25-style), and multi-vector retrieval in a single model; 8192 token context; the most versatile open embedding model
- all-MiniLM-L6-v2 (SBERT): tiny (22M params), fast, good quality; popular for on-device and low-latency use; best throughput-per-dollar for high-volume applications where top accuracy is not required
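Matryoshka truncation, as mentioned for text-embedding-3 and nomic-embed above, is typically applied client-side: keep a prefix of the vector and re-normalise (a sketch under the assumption that the model was trained with an MRL objective, so prefixes remain meaningful; the 1536-dim vector below is a random stand-in):

```python
import numpy as np

def truncate_matryoshka(emb, dims):
    """Keep the first `dims` components, then re-normalise to unit length."""
    v = emb[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=1536)  # stand-in full embedding
short = truncate_matryoshka(full, 256)             # 256-dim, unit length
```

Truncated vectors cut index storage and search cost roughly in proportion to the dimension, at a small quality loss.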
| Use Case | Embedding Approach | Typical Dimension | Notes |
|---|---|---|---|
| Semantic search / RAG | Sentence embedding model (BGE, E5, OpenAI) | 768 – 3072 | Query and passages embedded separately; cosine similarity |
| Duplicate detection | Sentence-BERT with cosine threshold | 384 – 768 | Cosine similarity above ~0.85 typically indicates near-duplicates |
| Topic clustering | Mean-pooled embeddings + K-means / UMAP + HDBSCAN | Any | Reduce to 2D with UMAP for visualisation |
| Anomaly / drift detection | Mean embedding of reference set; Mahalanobis distance | Any | Detect out-of-distribution inputs in production pipelines |