00 · Word Embeddings: From Tokens to Geometry

GPT-2 · Qwen3-0.8B · word2vec · nomic-embed

Author: Intro to LLMs
Published: April 14, 2026

What is an Embedding?

Core idea. Every token in a language model is mapped to a high-dimensional vector of real numbers — its embedding. These vectors are learned during training so that semantically similar words end up geometrically close to each other. Geometry encodes meaning.
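Concretely, the embedding layer is nothing more than a lookup table: a (vocab_size × d_model) matrix whose rows are indexed by token id. A toy sketch (random values, hypothetical token ids):

import numpy as np

vocab_size, d_model = 50_257, 768                 # GPT-2-sized shapes; values here are random placeholders
rng = np.random.default_rng(0)
W_E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

token_ids = [464, 3290]                           # hypothetical token ids
vectors = W_E[token_ids]                          # "embedding lookup" = plain row indexing
print(vectors.shape)                              # (2, 768)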


1 · A Toy 2-D Embedding Space

Before looking at a real model, let’s build intuition with a tiny 2-dimensional example. Imagine we train embeddings for a 12-word vocabulary on a text corpus; after training, each word has an (x, y) position. Words that appear in similar contexts cluster together.

Show the code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

words = [
    "king",  "queen",  "man",   "woman",
    "dog",   "cat",    "puppy", "kitten",
    "Paris", "London", "Tokyo", "Berlin",
]
coords = np.array([
    [ 0.95,  0.90], [ 0.85,  0.70], [ 0.75,  0.80], [ 0.65,  0.60],
    [-0.80, -0.50], [-0.70, -0.65], [-0.90, -0.40], [-0.85, -0.75],
    [ 0.10, -0.90], [ 0.30, -0.80], [ 0.50, -0.95], [ 0.20, -0.70],
])
categories = ["Royalty / People"] * 4 + ["Animals"] * 4 + ["Cities"] * 4
cat_colors = {"Royalty / People": "#E63946", "Animals": "#2A9D8F", "Cities": "#E9C46A"}
colors = [cat_colors[c] for c in categories]

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(coords[:, 0], coords[:, 1], c=colors, s=180, zorder=3,
           edgecolors="white", linewidths=1.2)
for i, word in enumerate(words):
    ax.annotate(word, coords[i], textcoords="offset points",
                xytext=(8, 4), fontsize=10, fontweight="bold")
legend_handles = [mpatches.Patch(color=v, label=k) for k, v in cat_colors.items()]
ax.legend(handles=legend_handles, loc="lower right", framealpha=0.9)
ax.set_title("Toy 2-D Embedding Space", fontsize=14, fontweight="bold", pad=12)
ax.set_xlabel("Dimension 1");  ax.set_ylabel("Dimension 2")
ax.axhline(0, color="lightgrey", lw=0.8);  ax.axvline(0, color="lightgrey", lw=0.8)
plt.tight_layout();  plt.show()

Toy 2-D embedding space. Colour = semantic category.

The direction from man → woman is approximately the same as king → queen — the famous word analogy property.
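We can check this directly in the toy space — the two offset vectors are (nearly) identical in direction:

idx = {w: i for i, w in enumerate(words)}
d_gender  = coords[idx["woman"]] - coords[idx["man"]]     # man → woman offset
d_royalty = coords[idx["queen"]] - coords[idx["king"]]    # king → queen offset
cos = d_gender @ d_royalty / (np.linalg.norm(d_gender) * np.linalg.norm(d_royalty))
print(f"cos(man→woman, king→queen) = {cos:.3f}")          # 1.000 in this hand-made example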


2 · The Four Models

#   Model                            Type                         Dim    Training objective
1   word2vec (Google News, 2013)     Static word vectors          300    Predict surrounding words (skip-gram)
2   GPT-2 (117M, 2019)               LLM token embeddings         768    Next-token prediction
3   Qwen3-0.8B (2025)                LLM token embeddings         1024   Next-token prediction
4   nomic-embed-text-v1.5 (2024)     Dedicated embedding model    768    Contrastive learning

Key distinction: word2vec and nomic-embed were designed so that vector geometry is directly meaningful — word2vec for word arithmetic, nomic-embed for cosine-based retrieval. GPT-2 and Qwen3 learned their token embeddings as a side effect of next-token prediction.

2a · GPT-2

Show the code
import torch
from transformers import GPT2Model, GPT2Tokenizer

def _load(cls, name, **kw):
    try:
        return cls.from_pretrained(name, local_files_only=True, **kw)
    except Exception:
        return cls.from_pretrained(name, **kw)

gpt2_tok   = _load(GPT2Tokenizer, "gpt2")
gpt2_model = _load(GPT2Model, "gpt2")
gpt2_model.eval()
W_gpt2 = gpt2_model.wte.weight.detach().to(torch.float32).numpy()
print(f"GPT-2:  vocab {W_gpt2.shape[0]:,}  dim {W_gpt2.shape[1]}  ({W_gpt2.nbytes/1e6:.0f} MB)")

gpt2_decode = lambda i: gpt2_tok.decode([i])

def gpt2_embed(word):
    ids = gpt2_tok.encode(" " + word, add_special_tokens=False) or \
          gpt2_tok.encode(word, add_special_tokens=False)
    return W_gpt2[ids].mean(axis=0)
GPT-2:  vocab 50,257  dim 768  (154 MB)

2b · Qwen3-0.8B

Show the code
from transformers import AutoModelForCausalLM, AutoTokenizer

qwen_name  = "Qwen/Qwen3-0.8B"
qwen_tok   = _load(AutoTokenizer, qwen_name)
qwen_model = _load(AutoModelForCausalLM, qwen_name, dtype=torch.float32, device_map="cpu")
qwen_model.eval()
W_qwen = qwen_model.model.embed_tokens.weight.detach().to(torch.float32).numpy()
print(f"Qwen3-0.8B:  vocab {W_qwen.shape[0]:,}  dim {W_qwen.shape[1]}  ({W_qwen.nbytes/1e6:.0f} MB)")

qwen_decode = lambda i: qwen_tok.decode([i])

def qwen_embed(word):
    ids = qwen_tok.encode(" " + word, add_special_tokens=False) or \
          qwen_tok.encode(word, add_special_tokens=False)
    return W_qwen[ids].mean(axis=0)
Qwen3-0.8B:  vocab 248,320  dim 1024  (1017 MB)

Architecture note. GPT-2 stores its embedding matrix at model.wte.weight; Qwen3 uses model.model.embed_tokens.weight. Both are lookup tables: row i is the d_model-dimensional vector for token i before the transformer sees it.
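To confirm the lookup-table picture, index the weight matrix with a token id and compare against the embedding layer’s output (a quick sanity check; " The" happens to be a single GPT-2 token, but any single-token string works):

tid = gpt2_tok.encode(" The", add_special_tokens=False)[0]
layer_out = gpt2_model.wte(torch.tensor([tid])).detach().numpy()[0]
print(tid, np.allclose(W_gpt2[tid], layer_out))   # True — the layer just returns row `tid`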

2c · word2vec (Google News)

Show the code
import gensim.downloader as gensim_api

print("Downloading word2vec-google-news-300 (~1.7 GB, cached after first run)…")
wv    = gensim_api.load("word2vec-google-news-300")
W_w2v = wv.vectors                           # shape (≈3 M, 300)
print(f"word2vec:  vocab {len(wv):,}  dim {wv.vector_size}")

def w2v_embed(word):
    return wv[word]
Downloading word2vec-google-news-300 (~1.7 GB, cached after first run)…
word2vec:  vocab 3,000,000  dim 300

2d · nomic-embed-text-v1.5

Show the code
from sentence_transformers import SentenceTransformer

try:
    nomic = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                                trust_remote_code=True, local_files_only=True)
except Exception:
    nomic = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
print(f"nomic-embed:  dim {nomic.get_sentence_embedding_dimension()}")

def nomic_embed(word):
    # nomic-embed uses a task prefix for best results; outputs unit-normalized vectors
    return nomic.encode(f"search_document: {word}", normalize_embeddings=True)
nomic-embed:  dim 768
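Three small helpers are used throughout the rest of the notebook — cosine_similarity, angle_between, and top_k_from_matrix. They are defined in the notebook’s setup cell, which is not shown here; a minimal sketch consistent with how they are called below:

def cosine_similarity(a, b):
    # cosine of the angle between two vectors (magnitude-free similarity)
    a = np.asarray(a, dtype=np.float64);  b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def angle_between(a, b):
    # angle in degrees between two vectors
    return float(np.degrees(np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0))))

def top_k_from_matrix(query, W, decode_fn, k=10):
    # top-k rows of W by cosine similarity to `query`, returned as (token, sim) pairs
    q    = query / (np.linalg.norm(query) + 1e-9)
    W_u  = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)
    sims = W_u @ q
    idx  = np.argsort(-sims)[:k]
    return [(decode_fn(int(i)), float(sims[i])) for i in idx]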

3 · Are Embedding Matrices Normalized?

A natural question: does each token vector have unit length? Do dimensions have zero mean across the vocabulary? The answers matter for choosing between cosine similarity and Euclidean distance.

Show the code
vocab_matrices = {
    "word2vec":   W_w2v,
    "GPT-2":      W_gpt2,
    "Qwen3-0.8B": W_qwen,
}

fig, axes = plt.subplots(2, 3, figsize=(15, 9))

for col, (name, W) in enumerate(vocab_matrices.items()):
    norms     = np.linalg.norm(W, axis=1)
    col_means = W.mean(axis=0)

    ax = axes[0, col]
    ax.hist(norms, bins=100, color="#457B9D", edgecolor="white", lw=0.4)
    ax.axvline(1.0, color="#E63946", lw=1.5, ls="--", label="unit norm (1.0)")
    ax.axvline(norms.mean(), color="#FFB703", lw=1.5, ls="-",
               label=f"mean = {norms.mean():.2f}")
    ax.set_title(f"{name}  ({W.shape[0]:,} × {W.shape[1]})", fontweight="bold")
    ax.set_xlabel("L2 norm of token vector")
    if col == 0:
        ax.set_ylabel("token count  (row norms)")
    ax.legend(fontsize=8)

    ax = axes[1, col]
    ax.hist(col_means, bins=60, color="#2A9D8F", edgecolor="white", lw=0.4)
    ax.axvline(0, color="#E63946", lw=1.5, ls="--", label="zero mean")
    ax.set_xlabel("per-dimension mean across vocab")
    if col == 0:
        ax.set_ylabel("# dimensions  (column means)")
    ax.legend(fontsize=8)

plt.suptitle("Are embedding matrices normalized?",
             fontsize=14, fontweight="bold", y=1.01)
plt.tight_layout();  plt.show()

Top row: L2 norm of each token vector. Bottom row: mean value per dimension across all tokens. Red dashed line = reference value (unit norm / zero mean).
print(f"{'Model':15s}  {'norm mean':>10s}  {'norm std':>9s}  "
      f"{'norm min':>9s}  {'norm max':>9s}  {'unit-norm?':>11s}  {'col-mean ≈ 0?':>14s}")
print("-" * 85)
for name, W in vocab_matrices.items():
    norms     = np.linalg.norm(W, axis=1)
    col_means = W.mean(axis=0)
    is_unit   = np.allclose(norms, 1.0, atol=0.01)
    is_zero   = np.allclose(col_means, 0.0, atol=0.05)
    print(f"  {name:13s}  {norms.mean():>10.3f}  {norms.std():>9.3f}  "
          f"{norms.min():>9.3f}  {norms.max():>9.3f}  {str(is_unit):>11s}  {str(is_zero):>14s}")

# nomic-embed outputs unit-normalized vectors by construction
v = nomic_embed("king")
print(f"\n  nomic-embed  output norm = {np.linalg.norm(v):.6f}  (normalize_embeddings=True)")
Model             norm mean   norm std   norm min   norm max   unit-norm?   col-mean ≈ 0?
-------------------------------------------------------------------------------------
  word2vec            2.040      1.077      0.015     21.108        False           False
  GPT-2               3.959      0.434      2.454      6.316        False           False
  Qwen3-0.8B          0.627      0.062      0.347      1.057        False           False

  nomic-embed  output norm = 1.000000  (normalize_embeddings=True)

Key findings:

  • None of the three models with explicit vocabulary matrices store unit-normalized token vectors. Norms vary, sometimes substantially.
  • Euclidean distance between two tokens depends on both the direction of their vectors and their magnitudes. A high-norm token will be Euclidean-far from almost everything even if directionally similar.
  • Cosine similarity is immune to magnitude — it measures direction only. This is why it is the standard metric for semantic comparisons.
  • nomic-embed is an exception by design: encode(..., normalize_embeddings=True) always returns unit-norm vectors.
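A one-line consequence worth verifying: rescaling a vector changes its Euclidean distances but leaves its cosine similarities untouched.

v = gpt2_embed("king")
w = 3.0 * v                                           # same direction, 3× the norm
print(f"cosine    = {cosine_similarity(v, w):.4f}")   # 1.0000 — unchanged
print(f"euclidean = {np.linalg.norm(v - w):.2f}")     # grows with the rescaling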

4 · Background Geometry: Random Token Pairs

Before interpreting any similarity score, we need the baseline: what does a typical random pair look like? Any meaningful similarity must stand out from this background distribution.

Show the code
np.random.seed(42)
N = 5_000

fig, axes = plt.subplots(3, 2, figsize=(12, 11))

for row, (name, W) in enumerate(vocab_matrices.items()):
    idx_a  = np.random.choice(len(W), N, replace=False)
    idx_b  = np.random.choice(len(W), N, replace=False)
    A, B   = W[idx_a], W[idx_b]

    A_u    = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-9)
    B_u    = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-9)
    cos    = np.sum(A_u * B_u, axis=1)
    euclid = np.linalg.norm(A - B, axis=1)

    ax = axes[row, 0]
    ax.hist(cos, bins=80, color="#2A9D8F", edgecolor="white", lw=0.4)
    ax.axvline(cos.mean(), color="#E63946", lw=1.5, ls="--",
               label=f"mean = {cos.mean():.3f}")
    ax.axvline(0, color="black", lw=0.8, ls=":")
    ax.set_title(f"{name} — cosine similarity", fontweight="bold")
    ax.set_xlabel("cosine similarity");  ax.set_ylabel("count")
    ax.legend(fontsize=9)

    ax = axes[row, 1]
    ax.hist(euclid, bins=80, color="#457B9D", edgecolor="white", lw=0.4)
    ax.axvline(euclid.mean(), color="#E63946", lw=1.5, ls="--",
               label=f"mean = {euclid.mean():.1f}")
    ax.set_title(f"{name} — Euclidean distance", fontweight="bold")
    ax.set_xlabel("Euclidean distance");  ax.set_ylabel("count")
    ax.legend(fontsize=9)

plt.suptitle(f"Background: cosine similarity and Euclidean distance for {N:,} random token pairs",
             fontsize=13, fontweight="bold", y=1.01)
plt.tight_layout();  plt.show()

Cosine similarity (left) and Euclidean distance (right) for 5,000 random token pairs in each model.

The curse of dimensionality. In high dimensions (768-D for GPT-2, 1024-D for Qwen3), random vectors are nearly orthogonal — cosine ≈ 0. This means:

  • A cosine similarity of 0.2 between two tokens is already non-trivial; 0.5+ is strongly related.
  • word2vec (300-D) shows more spread and a slightly higher mean cosine.
  • Euclidean distances differ across models primarily because token norms differ (see Section 3), not because the geometry is fundamentally different — another reason to prefer cosine.
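The near-orthogonality even has a back-of-envelope form: for isotropic random vectors, the cosine of a random pair has mean 0 and standard deviation ≈ 1/√d. A quick check with Gaussian vectors (an idealisation — trained embeddings are not perfectly isotropic, as the histograms above show):

rng = np.random.default_rng(0)
for d in (2, 300, 768, 1024):
    a = rng.standard_normal((5000, d));  b = rng.standard_normal((5000, d))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    print(f"d={d:5d}  mean={cos.mean():+.4f}  std={cos.std():.4f}  1/√d={1/np.sqrt(d):.4f}")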

5 · Cosine Similarity Between Word Pairs

Show the code
pairs = [
    ("king",   "queen"),   ("dog",    "cat"),    ("Paris",  "London"),
    ("happy",  "joyful"),  ("run",    "sprint"),
    ("king",   "castle"),  ("dog",    "park"),   ("France", "Paris"),
    ("king",   "banana"),  ("dog",    "algebra"),("happy",  "concrete"),
]
pair_labels  = [f"{a} ↔ {b}" for a, b in pairs]
similarities = [cosine_similarity(gpt2_embed(a), gpt2_embed(b)) for a, b in pairs]
colors_bar   = ["#2A9D8F" if s > 0.5 else ("#E9C46A" if s > 0.25 else "#E63946")
                for s in similarities]

fig, ax = plt.subplots(figsize=(9, 5))
bars = ax.barh(pair_labels, similarities, color=colors_bar, edgecolor="white", height=0.65)
ax.axvline(0, color="black", lw=0.8)
ax.axvline(0.5,  color="#2A9D8F", lw=1, ls="--", alpha=0.5, label="High (> 0.5)")
ax.axvline(0.25, color="#E9C46A", lw=1, ls="--", alpha=0.5, label="Moderate (> 0.25)")
ax.set_xlim(-0.15, 1.0)
ax.set_xlabel("Cosine Similarity")
ax.set_title("Cosine Similarity Between Word Pairs (GPT-2)", fontweight="bold")
ax.legend(fontsize=9)
for bar, val in zip(bars, similarities):
    ax.text(val + 0.01, bar.get_y() + bar.get_height() / 2,
            f"{val:.3f}", va="center", fontsize=9)
plt.tight_layout();  plt.show()

Cosine similarity between hand-picked word pairs (GPT-2). Green = high, yellow = moderate, red = low.

Discuss: How do these cosine values compare to the random-pair background from Section 4? At what threshold does a similarity score become “meaningful”?
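One way to make “meaningful” precise: convert a pair’s cosine into a z-score against the Section 4 background. A sketch for GPT-2 (the background is recomputed here so the snippet stands alone):

rng = np.random.default_rng(0)
A = W_gpt2[rng.choice(len(W_gpt2), 5000, replace=False)]
B = W_gpt2[rng.choice(len(W_gpt2), 5000, replace=False)]
A_u = A / np.linalg.norm(A, axis=1, keepdims=True)
B_u = B / np.linalg.norm(B, axis=1, keepdims=True)
bg  = np.sum(A_u * B_u, axis=1)                   # background cosine distribution

s = cosine_similarity(gpt2_embed("king"), gpt2_embed("queen"))
print(f"king/queen cosine = {s:.3f}  →  {(s - bg.mean()) / bg.std():.1f} SDs above background")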


6 · Nearest Neighbours

For any word, we can ask: which tokens live closest in embedding space?

Show the code
probe_words = ["king", "dog", "Paris", "happy"]
fig, axes   = plt.subplots(1, 4, figsize=(14, 5))
fig.suptitle("Nearest Neighbours in GPT-2 Embedding Space", fontsize=14, fontweight="bold")

for ax, word in zip(axes, probe_words):
    neighbours = top_k_from_matrix(gpt2_embed(word), W_gpt2, gpt2_decode, k=9)
    labels     = [n[0] for n in neighbours]
    sims       = [n[1] for n in neighbours]
    bar_colors = ["#E63946" if lbl.strip() == word else "#457B9D" for lbl in labels]
    ax.barh(labels[::-1], sims[::-1], color=bar_colors[::-1], edgecolor="white")
    ax.set_title(f'"{word}"', fontweight="bold", fontsize=11)
    ax.set_xlabel("Cosine Sim");  ax.set_xlim(0.5, 1.02)

plt.tight_layout();  plt.show()

Top-9 nearest neighbours for four seed words in GPT-2.

7 · Semantic Clusters

7a · The Embedding Matrix

A 40-token slice of GPT-2’s embedding matrix. Different tokens activate different dimensions; similar tokens share similar activation patterns.

Show the code
import seaborn as sns

np.random.seed(42)
sample_ids   = np.random.choice(W_gpt2.shape[0], 40, replace=False)
sample_slice = W_gpt2[sample_ids, :64]
sample_words = [gpt2_tok.decode([i]).strip() or f"<{i}>" for i in sample_ids]

fig, ax = plt.subplots(figsize=(14, 8))
sns.heatmap(sample_slice, ax=ax, cmap="RdBu_r", center=0,
            xticklabels=[f"d{i}" for i in range(64)],
            yticklabels=sample_words, linewidths=0.0,
            cbar_kws={"label": "Value", "shrink": 0.6})
ax.set_title("GPT-2 Embedding Matrix — 40 random tokens × first 64 dims",
             fontweight="bold", pad=12)
plt.tight_layout();  plt.show()

Heatmap of 40 random GPT-2 tokens × first 64 dimensions.

7b · Semantic Categories

Show the code
import plotly.express as px
from sklearn.decomposition import PCA

# PALETTE is set in the notebook's hidden setup cell; any qualitative palette works here
PALETTE = px.colors.qualitative.Set2

vocab_groups = {
    "Royalty":  ["king", "queen", "prince", "princess", "throne", "crown", "noble", "lord"],
    "Animals":  ["dog", "cat", "horse", "lion", "tiger", "wolf", "bear", "fox"],
    "Cities":   ["Paris", "London", "Tokyo", "Berlin", "Rome", "Madrid", "Seoul", "Cairo"],
    "Emotions": ["happy", "sad", "angry", "fear", "joy", "love", "hate", "calm"],
    "Tech":     ["computer", "software", "internet", "data", "algorithm", "neural", "code", "model"],
    "Food":     ["apple", "bread", "soup", "pizza", "coffee", "sugar", "salt", "rice"],
    "Sports":   ["football", "tennis", "swim", "run", "race", "goal", "team", "ball"],
    "Science":  ["physics", "biology", "chemistry", "atom", "gene", "planet", "force", "energy"],
}

all_words, all_groups, all_vecs = [], [], []
for group, words in vocab_groups.items():
    for w in words:
        all_words.append(w);  all_groups.append(group)
        all_vecs.append(gpt2_embed(w))
X = np.array(all_vecs)

try:
    import umap
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=6, min_dist=0.3)
    X_2d    = reducer.fit_transform(X);  method = "UMAP"
except ImportError:
    pca2  = PCA(n_components=2, random_state=42)
    X_2d  = pca2.fit_transform(X);  method = "PCA"

fig = px.scatter(
    x=X_2d[:, 0], y=X_2d[:, 1],
    text=all_words, color=all_groups,
    color_discrete_sequence=PALETTE,
    title=f"{method} Projection of GPT-2 Embeddings by Semantic Category",
    labels={"x": f"{method} 1", "y": f"{method} 2", "color": "Category"},
    width=860, height=580,
)
fig.update_traces(textposition="top center",
                  marker=dict(size=10, opacity=0.85, line=dict(width=1, color="white")))
fig.show()

PCA / UMAP projection of GPT-2 embeddings coloured by semantic category.

Semantically related words cluster together even though the model never received explicit category labels — structure emerges entirely from predicting the next token.


8 · The Analogy: king − man + woman ≈ ?

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

8a · Top-20 Nearest Neighbours

We compute king − man + woman in three models and find the closest tokens by cosine similarity.

Show the code
source_words = {"king", "man", "woman"}

lm_models_list = [
    ("word2vec",   w2v_embed,  W_w2v,  lambda i: wv.index_to_key[i]),
    ("GPT-2",      gpt2_embed, W_gpt2, gpt2_decode),
    ("Qwen3-0.8B", qwen_embed, W_qwen, qwen_decode),
]

fig, axes = plt.subplots(1, 3, figsize=(16, 12))

for ax, (name, embed_fn, W, decode_fn) in zip(axes, lm_models_list):
    result   = embed_fn("king") - embed_fn("man") + embed_fn("woman")
    neighbors = top_k_from_matrix(result, W, decode_fn, k=20)
    labels    = [n[0] for n in neighbors]
    sims      = [n[1] for n in neighbors]

    bar_colors = []
    for lbl in labels:
        l = lbl.lower().strip()
        if l == "queen":
            bar_colors.append("#E63946")
        elif l in source_words:
            bar_colors.append("#FFB703")
        else:
            bar_colors.append("#457B9D")

    ax.barh(labels[::-1], sims[::-1], color=bar_colors[::-1], edgecolor="white")
    ax.set_title(f"{name}", fontweight="bold", fontsize=13)
    ax.set_xlabel("cosine similarity to analogy vector")

    queen_rank = next((i + 1 for i, lbl in enumerate(labels)
                       if lbl.lower().strip() == "queen"), "—")
    cos_q = cosine_similarity(result, embed_fn("queen"))
    cos_k = cosine_similarity(result, embed_fn("king"))
    ax.text(0.02, 0.02,
            f"queen rank: {queen_rank}\ncos(queen) = {cos_q:.3f}\ncos(king)  = {cos_k:.3f}",
            transform=ax.transAxes, fontsize=9, va="bottom",
            bbox=dict(facecolor="white", alpha=0.75, edgecolor="lightgrey"))

legend_handles = [
    mpatches.Patch(color="#E63946", label="queen"),
    mpatches.Patch(color="#FFB703", label="source words  (king / man / woman)"),
    mpatches.Patch(color="#457B9D", label="other tokens"),
]
fig.legend(handles=legend_handles, loc="lower center", ncol=3,
           fontsize=10, frameon=False, bbox_to_anchor=(0.5, -0.02))
plt.suptitle("Top-20 nearest tokens to (king − man + woman)",
             fontsize=14, fontweight="bold", y=1.01)
plt.tight_layout();  plt.show()

Top-20 tokens nearest to (king − man + woman). Red = queen, gold = source words, blue = other.

8b · Full-Vocabulary Cosine Histograms

Where does the analogy vector land relative to every token in each model’s vocabulary?

Show the code
import math   # used by log_y_ax below

landmark_cats = [
    ("source",    ["king", "man", "woman"],                                    "#FFB703"),
    ("royalty",   ["queen", "princess", "prince", "empress", "emperor",
                   "monarch", "duchess", "duke"],                              "#E63946"),
    ("power",     ["ruler", "sovereign", "lord", "leader", "chief"],           "#6A0572"),
    ("unrelated", ["dog", "computer"],                                         "#888888"),
]

def log_y_ax(frac, ax):
    lo = math.log10(max(ax.get_ylim()[0], 0.5))
    hi = math.log10(ax.get_ylim()[1])
    return 10 ** (lo + frac * (hi - lo))

row_fracs = [0.88, 0.65, 0.42, 0.22]

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for ax, (name, embed_fn, W, _) in zip(axes, lm_models_list):
    result_v = embed_fn("king") - embed_fn("man") + embed_fn("woman")
    result_u = result_v / (np.linalg.norm(result_v) + 1e-9)
    W_u      = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)
    all_sims = W_u @ result_u

    ax.hist(all_sims, bins=120, color="#CCCCCC", edgecolor="white", log=True)
    ax.set_xlabel("cosine similarity to (king − man + woman)")
    ax.set_ylabel("tokens  (log scale)")
    ax.set_title(f"{name}{len(all_sims):,} tokens", fontweight="bold", pad=8)

    all_lm = []
    for _cat, words, color in landmark_cats:
        for w in words:
            try:
                sim = cosine_similarity(result_v, embed_fn(w))
                all_lm.append((w, sim, color))
            except Exception:
                pass
    all_lm.sort(key=lambda t: t[1])

    for i, (w, sim, color) in enumerate(all_lm):
        ax.axvline(sim, color=color, lw=1.5, alpha=0.9)
        ax.annotate(f"{w}\n{sim:.3f}",
                    xy=(sim, log_y_ax(row_fracs[i % len(row_fracs)], ax)),
                    xytext=(4, 0), textcoords="offset points",
                    fontsize=8, fontweight="bold", color=color, va="center")

legend_handles = [
    mpatches.Patch(color="#FFB703", label="source words"),
    mpatches.Patch(color="#E63946", label="royalty"),
    mpatches.Patch(color="#6A0572", label="power"),
    mpatches.Patch(color="#888888", label="unrelated"),
]
fig.legend(handles=legend_handles, loc="lower center", ncol=4,
           fontsize=9, frameon=False, bbox_to_anchor=(0.5, -0.06))
plt.suptitle("Cosine similarity of (king − man + woman) to full vocabulary",
             fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout();  plt.show()

Cosine similarity of the analogy vector to every vocab token. Vertical lines mark annotated landmark words.

What to look for. In word2vec, royalty words (red) cluster tightly at the right tail, and once the three source words are excluded — the standard convention for analogy evaluation — queen is the top remaining token: the analogy “works.” In GPT-2 and Qwen3, king dominates: the analogy vector still points mostly toward king, and queen trails it by a similar margin in both models (see the Section 10 summary table).

8c · Cosine vs. Euclidean Scatter

Plotting both metrics simultaneously reveals where they agree and where they diverge.

Show the code
landmark_colors = {
    "queen":    "#E63946",
    "king":     "#FFB703", "woman": "#FFB703", "man": "#FFB703",
    "emperor":  "#6A0572", "empress": "#6A0572",
    "princess": "#457B9D", "prince":  "#457B9D",
    "dog":      "#888888", "computer": "#888888",
}
landmarks = list(landmark_colors)

fig, axes = plt.subplots(1, 2, figsize=(16, 7))

for ax, (name, embed_fn, W, _) in zip(axes, lm_models_list[1:]):   # GPT-2, Qwen3
    result_v = embed_fn("king") - embed_fn("man") + embed_fn("woman")
    result_u = result_v / (np.linalg.norm(result_v) + 1e-9)
    W_u      = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)
    all_sims  = W_u @ result_u
    all_dists = np.linalg.norm(W - result_v, axis=1)

    ax.scatter(all_sims, all_dists, s=3, color="#CCCCCC", alpha=0.3,
               edgecolors="none", rasterized=True)
    for w in landmarks:
        try:
            v = embed_fn(w)
            s = cosine_similarity(result_v, v)
            d = float(np.linalg.norm(v - result_v))
            c = landmark_colors[w]
            ax.scatter(s, d, s=90, color=c, edgecolor="white", lw=1.2, zorder=5)
            ax.annotate(f"{w}\ncos={s:.2f}  d={d:.1f}",
                        xy=(s, d), xytext=(7, 5), textcoords="offset points",
                        fontsize=9, fontweight="bold", color=c)
        except Exception:
            pass
    ax.set_xlabel("cosine similarity to analogy vector")
    ax.set_ylabel("Euclidean distance to analogy vector")
    ax.set_title(f"{name}", fontweight="bold", pad=10)
    ax.axvline(0, color="#EEEEEE", lw=0.8)

plt.suptitle("Cosine vs. Euclidean — every vocab token",
             fontsize=14, fontweight="bold")
plt.tight_layout();  plt.show()

Each grey dot is one vocab token. Highlighted landmarks show where cosine and Euclidean rankings diverge.

When cosine and Euclidean disagree. A token with high cosine but large Euclidean distance points in the right direction but has a different magnitude. Because LLM embedding norms vary widely (Section 3), Euclidean distance can penalise or reward a token just for having an unusual norm — this is why cosine is preferred for semantic retrieval.

8d · Semantic-Axis Projection (GPT-2)

Project words onto two interpretable axes: royalty (commoner → royalty) and gender (male → female).

Show the code
man_v   = gpt2_embed("man");   woman_v = gpt2_embed("woman")
king_v  = gpt2_embed("king");  queen_v = gpt2_embed("queen")

gender_axis  = woman_v - man_v;  gender_axis /= np.linalg.norm(gender_axis)
royalty_raw  = (king_v + queen_v) / 2 - (man_v + woman_v) / 2
royalty_axis = royalty_raw - np.dot(royalty_raw, gender_axis) * gender_axis
royalty_axis /= np.linalg.norm(royalty_axis)

def proj(vec):
    return np.dot(vec, royalty_axis), np.dot(vec, gender_axis)
def proj_norm(word):
    v = gpt2_embed(word);  return proj(v / (np.linalg.norm(v) + 1e-9))

focus_words  = ["man", "woman", "king", "queen", "prince", "princess", "emperor", "empress"]
neg_controls = ["dog", "Paris", "computer", "banana", "happy", "physics"]
focus_colors = {
    "man": "#457B9D", "woman": "#457B9D",
    "king": "#E63946", "queen": "#E63946",
    "prince": "#E63946", "princess": "#E63946",
    "emperor": "#6A0572", "empress": "#6A0572",
}
result_vec = king_v - man_v + woman_v
cos_q = cosine_similarity(result_vec, queen_v)
print(f"cos(king−man+woman, queen) = {cos_q:.4f}")

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for ax, use_norm, title in zip(
    axes,
    [False, True],
    ["Euclidean projection\n(raw vectors)", "Cosine projection\n(unit-normalized)"],
):
    pf  = (lambda w: proj(gpt2_embed(w))) if not use_norm else proj_norm
    rp  = proj(result_vec) if not use_norm else proj(result_vec / (np.linalg.norm(result_vec) + 1e-9))

    for w in neg_controls:
        x, y = pf(w)
        ax.scatter(x, y, color="#CCCCCC", s=80, zorder=2)
        ax.annotate(w, (x, y), textcoords="offset points", xytext=(6, 3),
                    fontsize=9, color="#999999")
    for w in focus_words:
        x, y = pf(w)
        ax.scatter(x, y, color=focus_colors[w], s=120, zorder=3)
        ax.annotate(w, (x, y), textcoords="offset points", xytext=(6, 3),
                    fontsize=11, fontweight="bold", color=focus_colors[w])
    ax.scatter(*rp, color="#FFB703", s=220, zorder=5, marker="*")
    ax.annotate("king−man+woman", rp, textcoords="offset points",
                xytext=(8, 3), fontsize=10, fontweight="bold", color="#FFB703")
    for w1, w2 in [("man", "king"), ("woman", "queen")]:
        ax.annotate("", xy=pf(w2), xytext=pf(w1),
                    arrowprops=dict(arrowstyle="->", color="#BBBBBB", lw=1.2))
    ax.axhline(0, color="#EEEEEE", lw=0.8);  ax.axvline(0, color="#EEEEEE", lw=0.8)
    ax.set_xlabel("Royalty axis  →", fontsize=11)
    ax.set_ylabel("Gender axis  →", fontsize=11)
    ax.set_title(title, fontweight="bold", pad=10)

plt.suptitle("Royalty × Gender projection — raw vs. unit-normalized (GPT-2)",
             fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout();  plt.show()
cos(king−man+woman, queen) = 0.7085

Left: raw vector projections. Right: unit-normalized (cosine geometry only). The ★ is the analogy result king−man+woman.

The raw projection (left) overshoots queen on the gender axis because the analogy adds the full man → woman vector. After unit-normalization (right), magnitude is stripped out — the ★ lands much closer to queen. This is why cosine geometry is the right frame for semantic similarity.
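The overshoot can be read directly off the gender-axis coordinate (a quick numeric check using the projections defined above):

r_unit = result_vec / np.linalg.norm(result_vec)
print(f"queen    gender coord — raw: {proj(queen_v)[1]:+.3f}   unit: {proj_norm('queen')[1]:+.3f}")
print(f"analogy  gender coord — raw: {proj(result_vec)[1]:+.3f}   unit: {proj(r_unit)[1]:+.3f}")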


9 · Plural and Number Directions

If embeddings encode grammar as geometry, we should find a consistent “plural direction”: the vector cats − cat should point roughly the same way as dogs − dog, kings − king, and so on. Similarly, number words (one, two, three …) should form a structured sequence.

Show the code
# --- plural pairs ---
plural_pairs = [
    ("cat",    "cats"),
    ("dog",    "dogs"),
    ("king",   "kings"),
    ("queen",  "queens"),
    ("word",   "words"),
    ("token",  "tokens"),
    ("city",   "cities"),
    ("country","countries"),
    ("man",    "men"),
    ("woman",  "women"),
]

# plural direction vectors (unit-normalised)
def unit(v):
    return v / (np.linalg.norm(v) + 1e-9)

plural_vecs = []
for sing, plur in plural_pairs:
    try:
        d = gpt2_embed(plur) - gpt2_embed(sing)
        plural_vecs.append((sing, plur, unit(d)))
    except Exception:
        pass

# mean plural direction
mean_plural = unit(np.mean([v for *_, v in plural_vecs], axis=0))

# cosine of each pair's direction with the mean
print("Cosine similarity of each plural direction with the mean plural direction:")
print(f"  {'singular → plural':22s}  {'cosine':>8s}")
print("  " + "-" * 34)
for sing, plur, d in plural_vecs:
    c = float(np.dot(d, mean_plural))
    print(f"  {sing + ' → ' + plur:22s}  {c:8.4f}")
Cosine similarity of each plural direction with the mean plural direction:
  singular → plural         cosine
  ----------------------------------
  cat → cats                0.5947
  dog → dogs                0.6812
  king → kings              0.5837
  queen → queens            0.6277
  word → words              0.3947
  token → tokens            0.4444
  city → cities             0.6040
  country → countries       0.5485
  man → men                 0.5895
  woman → women             0.6246
Show the code
all_embed_fns = {
    "word2vec":    w2v_embed,
    "GPT-2":       gpt2_embed,
    "Qwen3-0.8B":  qwen_embed,
    "nomic-embed": nomic_embed,
}
labels = [f"{s} → {p}" for s, p in plural_pairs]

fig, axes = plt.subplots(2, 2, figsize=(18, 16))
axes = axes.flatten()
mean_scores = {}

for ax, (model_name, embed_fn) in zip(axes, all_embed_fns.items()):
    vecs = []
    valid_labels = []
    for sing, plur in plural_pairs:
        try:
            d = embed_fn(plur) - embed_fn(sing)
            vecs.append(unit(d))
            valid_labels.append(f"{sing} → {plur}")
        except Exception:
            pass
    mat = np.array(vecs) @ np.array(vecs).T
    n = len(vecs)
    # mean off-diagonal = consistency score
    off_diag = mat[np.triu_indices(n, k=1)]
    score = float(off_diag.mean())
    mean_scores[model_name] = score

    im = ax.imshow(mat, cmap="RdYlGn", vmin=-0.2, vmax=1.0)
    ax.set_xticks(range(n)); ax.set_xticklabels(valid_labels, rotation=45, ha="right", fontsize=8)
    ax.set_yticks(range(n)); ax.set_yticklabels(valid_labels, fontsize=8)
    for i in range(n):
        for j in range(n):
            ax.text(j, i, f"{mat[i,j]:.2f}", ha="center", va="center", fontsize=7,
                    color="white" if abs(mat[i,j]) > 0.6 else "black")
    ax.set_title(f"{model_name}  (mean off-diag = {score:.3f})", fontweight="bold", pad=10)
    plt.colorbar(im, ax=ax, label="Cosine sim", shrink=0.75)

plt.suptitle("Plural direction similarity — do all models agree on what 'plural' means geometrically?",
             fontsize=13, fontweight="bold", y=1.01)
plt.tight_layout();  plt.show()

Pairwise cosine similarity between plural direction vectors across four models. Each cell is cos( (plural−singular)_i , (plural−singular)_j ). Higher = more consistent plural direction.
Show the code
fig, ax = plt.subplots(figsize=(7, 4))
colors = [f"C{i}" for i in range(len(mean_scores))]
bars = ax.barh(list(mean_scores.keys()), list(mean_scores.values()), color=colors, height=0.5)
ax.bar_label(bars, fmt="%.3f", padding=4, fontsize=11)
ax.set_xlim(0, max(mean_scores.values()) * 1.25)
ax.set_xlabel("Mean pairwise cosine similarity of (plural − singular) vectors", fontsize=11)
ax.set_title("Plural direction consistency by model", fontweight="bold", pad=10)
plt.tight_layout();  plt.show()

Mean off-diagonal cosine similarity of plural direction vectors — a single ‘plural consistency score’ per model.

Discuss: Which model has the most consistent plural direction? Does the ranking match your intuition from the analogy results? What does it say about each model’s training objective?

Show the code
# number words — are they ordered?
number_words = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

num_vecs = np.array([gpt2_embed(w) for w in number_words])

# pairwise cosine similarity
num_unit = num_vecs / (np.linalg.norm(num_vecs, axis=1, keepdims=True) + 1e-9)
num_sim  = num_unit @ num_unit.T

fig, ax = plt.subplots(figsize=(8, 7))
im2 = ax.imshow(num_sim, cmap="Blues", vmin=0, vmax=1.0)
ax.set_xticks(range(len(number_words))); ax.set_xticklabels(number_words, rotation=30, ha="right", fontsize=10)
ax.set_yticks(range(len(number_words))); ax.set_yticklabels(number_words, fontsize=10)
for i in range(len(number_words)):
    for j in range(len(number_words)):
        ax.text(j, i, f"{num_sim[i,j]:.2f}", ha="center", va="center", fontsize=8,
                color="white" if num_sim[i,j] > 0.6 else "black")
plt.colorbar(im2, ax=ax, label="Cosine similarity", shrink=0.8)
ax.set_title("Number-word cosine similarity (GPT-2)", fontweight="bold", pad=12)
plt.tight_layout();  plt.show()

Pairwise cosine similarity between number words in GPT-2 embedding space.

Discuss: Are adjacent numbers more similar to each other than to distant ones (e.g. two ↔ three vs. two ↔ nine)? Does the similarity matrix reveal any groupings (small vs. large numbers, or even vs. odd)?
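One way to start on the first question: average the similarity over all pairs at a fixed index distance |i − j|, using num_sim from above. If adjacency matters, the mean should fall as the gap grows:

for gap in range(1, 6):
    vals = [num_sim[i, i + gap] for i in range(len(number_words) - gap)]
    print(f"|i-j| = {gap}:  mean cosine = {np.mean(vals):.3f}")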


10 · Cross-Model Summary

Show the code
all_models = {
    "word2vec":    w2v_embed,
    "GPT-2":       gpt2_embed,
    "Qwen3-0.8B":  qwen_embed,
    "nomic-embed": nomic_embed,
}
candidates = ["queen", "king", "princess", "prince", "empress", "emperor",
              "woman", "man", "girl", "boy", "lady", "dog", "computer", "Paris"]

print(f"{'Model':15s}  {'cos(analogy,queen)':>20s}  {'cos(analogy,king)':>19s}  "
      f"{'queen rank†':>12s}  {'angle°(Δking,Δqueen)':>22s}")
print("-" * 95)

for name, embed_fn in all_models.items():
    try:
        king_v  = embed_fn("king");  man_v   = embed_fn("man")
        woman_v = embed_fn("woman"); queen_v = embed_fn("queen")
        analogy = king_v - man_v + woman_v

        cos_q = cosine_similarity(analogy, queen_v)
        cos_k = cosine_similarity(analogy, king_v)
        cand_sims = sorted(
            [(w, cosine_similarity(analogy, embed_fn(w))) for w in candidates],
            key=lambda t: t[1], reverse=True
        )
        queen_rank = next((i + 1 for i, (w, _) in enumerate(cand_sims) if w == "queen"), "?")
        d1  = king_v  - man_v
        d2  = queen_v - woman_v
        ang = angle_between(d1, d2)
        print(f"  {name:13s}  {cos_q:>20.4f}  {cos_k:>19.4f}  {queen_rank:>12}  {ang:>22.1f}°")
    except Exception as e:
        print(f"  {name:13s}  error: {e}")

print("\n† rank among curated candidates: queen, king, princess, prince, empress, emperor, "
      "woman, man, girl, boy, lady, dog, computer, Paris")
Model              cos(analogy,queen)    cos(analogy,king)   queen rank†    angle°(Δking,Δqueen)
-----------------------------------------------------------------------------------------------
  word2vec                     0.7301               0.8449             2                    40.7°
  GPT-2                        0.7085               0.7758             2                    49.3°
  Qwen3-0.8B                   0.5782               0.6446             2                    56.2°
  nomic-embed                  0.8293               0.8972             2                    50.3°

† rank among curated candidates: queen, king, princess, prince, empress, emperor, woman, man, girl, boy, lady, dog, computer, Paris

Reading the table:

  • word2vec — skip-gram optimizes directly for analogy arithmetic. Queen ranks #2 among the curated candidates (behind king itself), and the angle between (king−man) and (queen−woman) is the smallest of the four models (≈ 41°).
  • GPT-2 — cos(analogy, queen) ≈ 0.71, but king is still higher at ≈ 0.78; angle ≈ 49°.
  • Qwen3-0.8B — cos(analogy, queen) ≈ 0.58 vs. king ≈ 0.64. The king–queen gap is about the same as GPT-2’s; king still wins.
  • nomic-embed — trained explicitly for cosine similarity via contrastive learning. It posts the highest cos(analogy, queen) ≈ 0.83, though king (≈ 0.90) still ranks first.

The core lesson: training objective shapes geometry more than model size. A 2013 model (word2vec) outperforms a 2019 117M-parameter model (GPT-2) on analogy arithmetic — because skip-gram was optimized for exactly that geometry. Scale and better recipes improve things (GPT-2 → Qwen3), but the objective is the dominant factor.


11 · Category Similarity Heatmap

How similar are whole categories of words to each other on average?

Show the code
import itertools

groups_list = list(vocab_groups.keys())
group_mats  = {g: np.array([gpt2_embed(w) for w in ws]) for g, ws in vocab_groups.items()}

sim_matrix = np.zeros((len(groups_list), len(groups_list)))
for i, g1 in enumerate(groups_list):
    for j, g2 in enumerate(groups_list):
        M1 = group_mats[g1] / (np.linalg.norm(group_mats[g1], axis=1, keepdims=True) + 1e-9)
        M2 = group_mats[g2] / (np.linalg.norm(group_mats[g2], axis=1, keepdims=True) + 1e-9)
        sim_matrix[i, j] = (M1 @ M2.T).mean()

fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(sim_matrix, cmap="YlOrRd", vmin=0.0, vmax=sim_matrix.max())
ax.set_xticks(range(len(groups_list))); ax.set_yticks(range(len(groups_list)))
ax.set_xticklabels(groups_list, rotation=35, ha="right", fontsize=11)
ax.set_yticklabels(groups_list, fontsize=11)
for i, j in itertools.product(range(len(groups_list)), repeat=2):
    ax.text(j, i, f"{sim_matrix[i,j]:.3f}", ha="center", va="center", fontsize=9,
            color="white" if sim_matrix[i,j] > 0.5 * sim_matrix.max() else "black")
plt.colorbar(im, ax=ax, label="Mean Cosine Similarity", shrink=0.8)
ax.set_title("Inter-Category Cosine Similarity (GPT-2)", fontweight="bold", pad=12)
plt.tight_layout();  plt.show()

Mean cosine similarity between all pairs of semantic categories (GPT-2 embeddings).

Discuss: Which pairs of categories are more similar than you’d expect? Can you explain why from the model’s training data?


Key Takeaways

Summary
Concept                   What we saw
Embedding matrix W_E      A (vocab × d_model) lookup table — GPT-2 is 50,257 × 768; Qwen3 is larger
Not normalized            Raw token embeddings have variable L2 norms — cosine similarity removes this
Background distribution   Random pairs have cosine ≈ 0 in high dims; anything > 0.2 is already meaningful
Cosine vs. Euclidean      Agree on direction; diverge when norms vary — prefer cosine for semantic comparisons
Semantic clustering       Similar words cluster without explicit labels — emerges from next-token prediction
Analogy arithmetic        Training objective matters more than model size: word2vec > GPT-2 for analogies
Plural direction          cats−cat ≈ dogs−dog — grammar encodes as a consistent vector direction
Number words              Similar numbers cluster; adjacency in meaning ≈ proximity in embedding space
Scale helps               Qwen3-0.8B improves over GPT-2; modern training recipes push toward cleaner geometry
Dedicated models          nomic-embed is explicitly trained for cosine similarity — it wins on retrieval tasks

Suggested extensions

  • Token frequency vs. norm — scatter token rank (by frequency) against L2 norm: common tokens often have larger norms in word2vec and GPT-2 (a starter sketch follows this list).
  • PCA scree plot per model — compare how many principal components capture 90% of variance: reveals the effective dimensionality of each embedding space.
  • Cross-model neighbor overlap — compute the Jaccard similarity of the top-20 neighbor sets across models for the same query word: which models agree on semantic proximity?
  • Hubness analysis — count how often each token appears in other tokens’ top-K neighbors: high-hubness tokens signal geometry collapse in high dimensions.
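As a starting point for the frequency-vs-norm extension — a sketch that assumes, as holds for the Google News vectors, that gensim stores the vocabulary in roughly descending frequency order:

ranks = np.arange(len(W_w2v))
norms = np.linalg.norm(W_w2v, axis=1)

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(ranks[::1000], norms[::1000], s=4, alpha=0.4, color="#457B9D")
ax.set_xlabel("token rank (≈ descending frequency)")
ax.set_ylabel("L2 norm of embedding")
ax.set_title("word2vec — frequency rank vs. embedding norm", fontweight="bold")
plt.tight_layout();  plt.show()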

What Happens Next in the Model?

Input tokens
    │
    ▼
 Embedding Lookup (W_E)          ← explored in this notebook
    │
    ▼
 Positional Encoding  +  Residual Stream
    │
    ▼
 Self-Attention Layers  (Q, K, V)
    │
    ▼
 Feed-Forward Layers
    │
    ▼
 Un-embedding  (W_E^T)  →  Logits  →  Next-token probabilities

The embedding matrix serves as both the first and last layer in many transformer architectures (weight tying).
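For GPT-2 the tying is easy to verify: in the full GPT2LMHeadModel, the un-embedding matrix shares storage with wte (a quick check; weights load from cache if already downloaded):

from transformers import GPT2LMHeadModel

lm = _load(GPT2LMHeadModel, "gpt2")
tied = lm.lm_head.weight.data_ptr() == lm.transformer.wte.weight.data_ptr()
print(f"lm_head tied to wte: {tied}")             # True — one matrix in, same matrix out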

© HakyImLab and Listed Authors - CC BY 4.0 License