What changes after this microlearning
Why O(n²) was the bottleneck
You can explain why attention complexity kept context windows small for years, and what that cost in practice.
How attention actually works
You can describe how transformers compute attention between tokens using queries, keys, and values.
What SSA claims, and why it matters
You can assess what SSA claims to do differently from all previous approaches, and why all four properties together matter.
How to evaluate the claim critically
You know what evidence to look for when judging whether the claimed breakthrough holds at scale.
The constraint that shaped everything
Key concepts unpacked
Each term below appears repeatedly in discussions of transformer models.
In a transformer, "attention" is the mechanism by which each token in a sequence can look at every other token to decide what's relevant. For each token, the model computes a query (what am I looking for?), a key (what do I represent?), and a value (what should I contribute?). The attention score between two tokens is the dot product of query and key. Doing this for all pairs is where O(n²) comes from: n tokens × n tokens = n² comparisons.
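The query, key, and value steps described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not any particular model's implementation: the helper names (`matmul`, `softmax`, `random_matrix`), the toy sizes, and the random weights are all assumptions made for the example. Dividing by the square root of the head size and applying a softmax are the standard transformer formulation; the text above mentions only the dot product.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over n token vectors."""
    q = matmul(x, w_q)  # queries: what is each token looking for?
    k = matmul(x, w_k)  # keys: what does each token represent?
    v = matmul(x, w_v)  # values: what does each token contribute?
    d = len(q[0])
    # One dot product per (query token, key token) pair: n * n scores in total.
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical helper: a matrix of Gaussian noise standing in for real data."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_head = 6, 8, 4  # toy sizes chosen for illustration
x = random_matrix(n, d_model, rng)
w_q = random_matrix(d_model, d_head, rng)
w_k = random_matrix(d_model, d_head, rng)
w_v = random_matrix(d_model, d_head, rng)

output, weights = attention(x, w_q, w_k, w_v)
print(len(weights), "x", len(weights[0]), "attention weights")
```

The `weights` matrix has one row and one column per token, so its size grows as n² no matter how small each individual dot product is.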
O(n²) is a description of how compute grows with input size. If processing 1,000 tokens costs one unit of compute, then 2,000 tokens costs four units, not two. At 128,000 tokens (a common current limit), the attention computation alone requires roughly 16 billion pairwise comparisons. That is why longer contexts are expensive, slow, and until recently, rare.
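The arithmetic above is easy to check directly. The short sketch below counts pairwise comparisons for the three sequence lengths mentioned; `attention_pairs` is a hypothetical helper name, and the count ignores per-head and per-layer multipliers.

```python
def attention_pairs(n_tokens):
    """Number of query-key comparisons when every token attends to every token."""
    return n_tokens * n_tokens

base = attention_pairs(1_000)  # the "one unit of compute" reference point
for n in (1_000, 2_000, 128_000):
    pairs = attention_pairs(n)
    print(f"{n:>7,} tokens: {pairs:>14,} pairs ({pairs / base:,.0f}x the 1,000-token cost)")
```

At 128,000 tokens the count is 16,384,000,000, which is the "roughly 16 billion" figure, and 16,384 times the cost of the 1,000-token case.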
The context window is the maximum amount of text a model can "see" at once. Anything outside it is invisible. This is why models lose the thread of long documents, why RAG systems have to chunk codebases into fragments, and why a model cannot reason over an entire book in a single pass. The O(n²) constraint is the main reason context windows stayed small for so long.
Subquadratic means growing slower than n². A method that is O(n log n) or O(n · √n) is subquadratic: doubling the context length increases compute by much less than a factor of four. SSA, Subquadratic Sparse Attention, is Subquadratic's claimed mechanism. It selects which token pairs actually need to attend to each other based on content (not a fixed pattern), computes those interactions exactly rather than approximately, and achieves this selection in subquadratic time. If all three properties hold simultaneously at scale, that would be a first: no earlier method has demonstrated the combination.
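To make "grows slower than n²" concrete, the sketch below compares what happens to cost when the context length doubles under three growth rates. The cost functions are idealized (constant factors are ignored), and `doubling_factor` is a hypothetical helper name. It says nothing about SSA's actual complexity, which the text above does not specify beyond "subquadratic".

```python
import math

# Idealized cost models; real systems also have constant factors and memory effects.
growth = {
    "n^2": lambda n: n ** 2,
    "n*sqrt(n)": lambda n: n * math.sqrt(n),
    "n*log(n)": lambda n: n * math.log2(n),
}

def doubling_factor(cost, n):
    """How many times more expensive 2n tokens are than n tokens."""
    return cost(2 * n) / cost(n)

n = 128_000
for name, cost in growth.items():
    print(f"{name:<10} doubling factor at n={n:,}: {doubling_factor(cost, n):.2f}")
```

Quadratic cost always quadruples. The n · √n model grows by about 2.83 times, and n log n by about 2.12 times at this length, which is why subquadratic methods pull further ahead the longer the context gets.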
Sparse attention methods (like Longformer or BigBird) use fixed patterns: local windows, global tokens, random connections. They are fast but ignore content. Linear attention approximates the full attention matrix and loses exactness. Sliding window attention is local only. Recurrent models like Mamba are not transformers at all and require retraining from scratch. The common theme: every prior approach traded one desirable property to gain another. SSA claims to trade none of them.
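A fixed pattern is easy to picture in code. The sketch below builds a sliding-window mask, the local component that methods of this kind rely on; it is an illustration only, not Longformer's or BigBird's actual implementation, and the function names and toy sizes are assumptions. Note that the mask is built from positions alone and never looks at the tokens themselves.

```python
def sliding_window_mask(n_tokens, window):
    """mask[i][j] is True when token i may attend to token j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n_tokens)] for i in range(n_tokens)]

def count_pairs(mask):
    """Number of token pairs the mask allows."""
    return sum(sum(row) for row in mask)

n, w = 8, 1  # toy sizes chosen for illustration
mask = sliding_window_mask(n, w)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
print(count_pairs(mask), "of", n * n, "pairs computed")
```

With 8 tokens and a window of 1, only 22 of the 64 pairs are computed, and the count grows linearly with n. The saving comes at the price the text names: the first and last tokens can never attend to each other, however relevant they are, because the pattern ignores content.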
The claim in detail
Background: How attention works
What SSA claims: four properties at once
Claim 1
How does SSA decide which token pairs to compute?
Which pairs are computed depends on the actual tokens, not a fixed template. The model attends to what matters.
Claim 2
Does SSA approximate attention values or compute them exactly?
The computed attention values are identical to full attention for the selected pairs. No approximation error.
Claim 3
What context length has SSA been demonstrated at?
SSA is claimed to have been demonstrated at 12 million tokens. For comparison, GPT-4 launched with a context window of about 8,000 tokens. Current models reach 128k to 1M.
Claim 4
What makes SSA different from earlier sparse attention methods?
The selection mechanism itself is subquadratic. Previous sparse methods solved the compute problem but not the selection problem.
One open question
"If the claim holds, it is the most important architectural advance since the 2017 paper itself."
The key word is if. Independent benchmarks and peer review will determine whether all four properties hold simultaneously at scale. That work is ongoing.
Inside the attention mechanism
Each word (or sub-word) in a sentence becomes a token. The attention mechanism must consider every possible pair. With n tokens, that is n × n pairs, giving O(n²) complexity.
The darker a cell, the stronger the attention score between that pair of tokens. Standard transformers compute all n² cells. At 128k tokens, that is over 16 billion values, just for attention.
SSA computes only the pairs the content demands, shown in magenta. The grey cells are skipped entirely. The key claim: the selection itself also runs in subquadratic time, and the computed values are exact, not approximate.
Check your understanding
1. If you double the context length in a standard transformer, the attention compute increases by a factor of…
One answer correct
2. Which of the following are consequences of the O(n²) attention constraint?
Multiple answers correct
3. What makes SSA different from previous sparse attention approaches like Longformer or BigBird?
One answer correct
What to take with you
The O(n²) constraint is not a detail. It is the reason AI systems still cannot read a full book, track a multi-day conversation, or reason over a large codebase in a single pass. If SSA holds up under scrutiny, these limits start to move.
Three things to watch for in the coming months:
- Independent benchmarks comparing SubQ to Llama, Gemini, and GPT on long-context tasks
- Peer-reviewed analysis of the SSA selection algorithm's actual complexity
- Whether existing models adopt the mechanism via fine-tuning, or require retraining from scratch
You made it.
Microlearning complete.
You now understand why the O(n²) constraint shaped AI for nine years, and what SSA claims to change. That context is rare. Most people miss it entirely.