The Scribe Reads the Room — Part 2
[This is Part 2. Part 1 — The Forgetful Scribe covers autoencoders.]
Mathityahu’s aunt Rivka was a very good listener. At any family gathering, she could track fourteen conversations simultaneously — who said what to whom, who owed whom an apology, which cousin still hadn’t called his mother. She did not need to wait until the end of dinner to understand the beginning of dinner. Everything informed everything else, in real time.
“This,” said Mathityahu, “is what we need the machine to do.”
The autoencoder, as we saw in Part 1, forgets the order. It compresses everything into a single small representation and reconstructs from that. For images, this works. For language — for anything where sequence matters — it doesn’t. A sentence is not a bag of words. It is a structure.
The field first tried to solve this with sequential models: recurrent networks such as RNNs and LSTMs. Read left to right, carry the memory forward. But the memory fades. By the time you reach the end of a long sentence, the beginning is a blur, like trying to remember the first course at a Pesach seder after you’ve already reached the afikomen.
The solution was something else entirely. Instead of reading left to right and forgetting, look at everything at once — and learn what to look at.
This is self-attention.
The Bottleneck Problem
Before we solve it, let us be precise about what is broken.
Imagine encoding the sentence “The matzah, which the bubbe made from scratch and which took her three hours and two arguments with Uncle Shimon, was excellent” into a single fixed-size vector.
By the time you encode “was excellent”, the model must somehow still remember “matzah” — the subject — despite everything that came in between. In practice, it doesn’t. It remembers the most recent things well, and the earlier things poorly.
This is the bottleneck. Not size this time — time. The information has to travel through a long chain, and it degrades.
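A toy way to see the degradation, as a sketch only: suppose each step of a sequential model keeps a fixed fraction of its previous memory and then adds the new token. The function name `first_token_share` and the value `decay=0.8` are assumptions chosen for illustration. A real recurrent network learns its gates rather than using a constant, but the geometric fading is the same idea.

```python
def first_token_share(seq_len, decay=0.8):
    """How much of the first token survives in the final memory.

    Toy linear recurrence (an assumption, not a real RNN):
        h_t = decay * h_{t-1} + x_t
    Only the first token carries signal, so the returned value is
    the first token's coefficient after seq_len steps.
    """
    h = 0.0
    for t in range(seq_len):
        x_t = 1.0 if t == 0 else 0.0  # signal only at the first position
        h = decay * h + x_t
    return h

for n in (2, 5, 10, 25):
    print(f"{n:>2} tokens: first token weight = {first_token_share(n):.4f}")
```

The longer the sentence, the smaller the share of the final memory that still belongs to its first word.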
The Fix: Look at Everything at Once
Self-attention abandons the sequential constraint entirely. Every position in the sequence can attend directly to every other position. No chain. No degradation.
For each token in the sequence, we ask three questions:
- What am I looking for? → the Query
- What do I have to offer? → the Key
- What information do I actually carry? → the Value
Each token produces a Query, a Key, and a Value — three vectors, each a learned projection of the token’s embedding. Then:
- Compute how much each Query matches each Key (dot product, divided by √d, where d is the dimension of the Query and Key vectors)
- Turn those scores into weights with softmax
- Take a weighted sum of all Values
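In matrix form, the three steps above are commonly written as follows. The notation is assumed here rather than taken from the post: X stacks the token embeddings row by row, W_Q, W_K and W_V are the learned projections, and d is the dimension of the Query and Key vectors.

```latex
\[
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
\]
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\]
```

The softmax is applied to each row separately, so the weights that one token assigns to all tokens sum to 1.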
The result: each token gets a new representation that is a blend of the Values of every token in the sequence, itself included, weighted by relevance.
The key insight: the weights are not fixed. They are computed fresh for every input. A word that is relevant to the current token gets a high weight. A word that is irrelevant gets a low weight. The network learns what relevance means.
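To make the three steps concrete, here is a hand-sized sketch in plain Python, written without PyTorch so that every step is visible. Two things are assumptions for illustration: the three 2-dimensional token vectors are made up, and Query, Key and Value are all set equal to the input (identity projections), which a trained model would not do.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Step 1: how well does this Query match each Key?
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        # Step 2: turn the scores into weights
        weights = softmax(scores)
        # Step 3: weighted sum of all Values
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up embeddings for three tokens (illustrative only)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Editing any entry of `tokens` and rerunning will generally change every row of `weights`: the weights are recomputed from the input, not stored.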
The Code
In PyTorch, self-attention is a clean computation:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Scaled dot-product attention
        scores = torch.bmm(Q, K.transpose(1, 2)) / (self.d_model ** 0.5)
        weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
        output = torch.bmm(weights, V)       # (batch, seq_len, d_model)
        return output, weights
Let us run it on a toy example — five tokens, embedding size 16:
batch_size, seq_len, d_model = 1, 5, 16
x = torch.randn(batch_size, seq_len, d_model)
attn = SelfAttention(d_model)
output, weights = attn(x)
print(f"Input shape: {x.shape}") # (1, 5, 16)
print(f"Output shape: {output.shape}") # (1, 5, 16)
print(f"Weights shape: {weights.shape}") # (1, 5, 5) — each token attends to all tokens
Each token in the output is a blend of all tokens in the input. The sequence length is preserved. The information is not bottlenecked into a single vector.
What the Weights Look Like
Below is an illustrative attention map — what a trained model might learn for the sentence “The bubbe made matzah ball soup again”:
Read it row by row. Each row is one query token — the word that is “looking.” Each column is a key token — the word being “looked at.” Darker means stronger attention.
Notice: “ball” attends strongly to “matzah” — because “ball” alone means nothing; its meaning depends on what preceded it. “made” attends to “bubbe” — the verb looks for its subject.
This is not programmed. It is learned.
The Remaining Problem
Rivka could track fourteen conversations, but she knew who said what when. The words arrived in order. She knew which story came first.
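Self-attention, as written above, has no such knowledge. If you shuffle the tokens, say into “soup again bubbe matzah made ball The”, every token still receives exactly the same output vector; the outputs are simply shuffled in the same way. Nothing in the computation depends on where a token sits. Position has no meaning.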
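This is easy to check numerically. The sketch below assumes PyTorch is installed, as in the rest of the post, and uses random, untrained matrices as stand-ins for the learned projections. The helper name `attend` is hypothetical. What stays the same under shuffling is each token's output vector; the rows of the output are reordered along with the input.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 7, 16

# Random stand-ins for the learned projections (untrained, illustration only).
# float64 keeps rounding differences well below the comparison tolerance.
W_q, W_k, W_v = (
    torch.randn(d_model, d_model, dtype=torch.float64) / d_model ** 0.5
    for _ in range(3)
)

def attend(x):
    """Single-head scaled dot-product self-attention, x: (seq_len, d_model)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = F.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
    return weights @ V

x = torch.randn(seq_len, d_model, dtype=torch.float64)
perm = torch.randperm(seq_len)  # a random shuffle of the positions

out_original = attend(x)
out_shuffled = attend(x[perm])

# Shuffling the input only shuffles the output rows in the same way.
same = torch.allclose(out_original[perm], out_shuffled)
print(same)
```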
For text, this is catastrophic. “The dog bit the man” must be different from “The man bit the dog.” If the model cannot tell position 1 from position 5, it cannot tell subject from object.
We need a way to tell the model: “this token is first, that one is sixth.” We need to bake position into the representation itself.
And once we solve that — once every token knows both what it is and where it sits — we have everything we need to build something much larger.
“Call everyone in,” said Mathityahu. “Multi-head. All of them.”
[Continue to Part 3 — The Whole Room Is Listening]
All code in this post runs on CPU with no training required. Full code: Jewpyter notebook repository.