
Microlearning

How transformers think, and what just changed

From "Attention Is All You Need" to Subquadratic Sparse Attention: the constraint that shaped AI for nine years, and what just changed.

6 sections, ~40 min, Intermediate

What changes after this microlearning

Why O(n²) was the bottleneck

You can explain why attention complexity kept context windows small for years, and what that cost in practice.

How attention actually works

You can describe how transformers compute attention between tokens using queries, keys, and values.

What SSA claims, and why it matters

You can assess what SSA claims to do differently from all previous approaches, and why all four properties together matter.

How to evaluate the claim critically

You know what evidence to look for when judging whether the claimed breakthrough holds at scale.

The constraint that shaped everything


Slide 1 of 4

"Attention Is All You Need"

In 2017, a team at Google published a paper with that title. It introduced the transformer architecture. Almost every large language model built since then, including the ones behind ChatGPT, Claude, and Gemini, is built on that foundation.

Slide 2 of 4

One constraint, enormous consequences

The transformer came with a built-in cost: attention scales as O(n²). Double the context length, quadruple the compute. Triple it, and compute grows ninefold. Almost every context window limit you have ever hit is, at least in part, a downstream consequence of this single fact.

Slide 3 of 4

Nine years of partial answers

Researchers have tried sliding windows, global tokens, recurrent state spaces, sparse hybrids, and linear approximations. Each approach reduced the cost but traded something: exactness, or content-awareness, or true end-to-end subquadratic scaling. None solved it without giving something up.

Slide 4 of 4

May 5, 2026

A startup called Subquadratic emerged from stealth. Their model, SubQ, uses an attention mechanism called SSA, Subquadratic Sparse Attention. The claim: content-dependent, exact, capable of a 12-million-token context, and genuinely subquadratic end to end. If the claim holds, it is the most important architectural advance since the 2017 paper itself.

Key concepts unpacked

Each term below appears repeatedly in discussions of transformer models.

In a transformer, "attention" is the mechanism by which each token in a sequence can look at every other token to decide what's relevant. For each token, the model computes a query (what am I looking for?), a key (what do I represent?), and a value (what should I contribute?). The attention score between two tokens is the dot product of query and key. Doing this for all pairs is where O(n²) comes from: n tokens × n tokens = n² comparisons.
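The query, key, and value steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not production code: the vectors are made-up toy numbers, and in a real transformer the queries, keys, and values come from learned projections of the token embeddings. The division by the square root of the vector size and the softmax step follow the scaled dot-product attention of the 2017 paper; the function names are chosen here for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(queries, keys, values):
    """Scaled dot-product attention over every (query, key) pair."""
    d = len(keys[0])
    outputs = []
    comparisons = 0
    for q in queries:                       # n queries ...
        scores = []
        for key_vec in keys:                # ... times n keys = n * n dot products
            scores.append(dot(q, key_vec) / math.sqrt(d))
            comparisons += 1
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, comparisons

# Toy example: 3 tokens with 2-dimensional vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, comparisons = full_attention(Q, K, V)
print(comparisons)  # 9, i.e. 3 tokens x 3 tokens
```

The nested loop is the point: every query is compared with every key, so the number of dot products is n × n.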

O(n²) is a description of how compute grows with input size. If processing 1,000 tokens costs one unit of compute, then 2,000 tokens costs four units, not two. At 128,000 tokens (a common current limit), the attention computation alone requires roughly 16 billion pairwise comparisons. That is why longer contexts are expensive, slow, and until recently, rare.
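The arithmetic in this paragraph can be checked directly. The short sketch below counts pairwise comparisons only; the helper names and the 1,000-token baseline are chosen for illustration, and real compute cost also depends on hardware, model size, and implementation.

```python
def pairwise_comparisons(n_tokens):
    """Number of (query, key) pairs in full attention."""
    return n_tokens * n_tokens

def relative_cost(n_tokens, baseline_tokens=1_000):
    """Attention cost relative to a baseline context length."""
    return pairwise_comparisons(n_tokens) / pairwise_comparisons(baseline_tokens)

print(relative_cost(2_000))            # 4.0
print(relative_cost(3_000))            # 9.0
print(pairwise_comparisons(128_000))   # 16384000000, roughly 16 billion
```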

The context window is the maximum amount of text a model can "see" at once. Anything outside it is invisible. This is why models lose the thread of long documents, why RAG systems have to chunk codebases into fragments, and why a model cannot reason over an entire book in a single pass. The O(n²) constraint is the main reason context windows stayed small for so long.

Subquadratic means growing slower than n². If a method is O(n log n) or O(n · √n), it is subquadratic: doubling the context length increases compute by much less than a factor of four. SSA, Subquadratic Sparse Attention, is Subquadratic's claimed mechanism. It selects which token pairs need to attend to each other based on content (not a fixed pattern), computes those interactions exactly rather than approximately, and performs the selection itself in subquadratic time. No previous method has been shown to combine all three properties at scale.
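To make "subquadratic" concrete, the sketch below compares what happens to cost when the context length doubles under three growth rates. The growth functions are idealised formulas with constant factors ignored, not measurements of any real model, and the starting length of 128,000 tokens is just an example.

```python
import math

# Idealised cost formulas (constant factors ignored).
GROWTH = {
    "n^2": lambda n: n * n,
    "n * sqrt(n)": lambda n: n * math.sqrt(n),
    "n * log(n)": lambda n: n * math.log2(n),
}

def doubling_factor(cost, n):
    """How much the cost grows when the context length goes from n to 2n."""
    return cost(2 * n) / cost(n)

n = 128_000
for name, cost in GROWTH.items():
    print(f"{name}: x{doubling_factor(cost, n):.2f}")
```

At this starting length the factors come out at about 4.00 for n², 2.83 for n · √n, and 2.12 for n log n, which is what "much less than a factor of four" means in practice.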

Sparse attention methods (like Longformer or BigBird) use fixed patterns: local windows, global tokens, random connections. They are fast but ignore content. Linear attention approximates the full attention matrix and loses exactness. Sliding window attention is local only. Recurrent models like Mamba are not transformers at all and require retraining from scratch. The common theme: every prior approach traded one desirable property to gain another. SSA claims to trade none of them.
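A fixed pattern of the kind described above can be sketched as a function that lists which token pairs get computed. This is a simplified illustration in the spirit of sliding-window and global-token patterns, not the actual Longformer or BigBird implementation; the function name, window size, and choice of global token are assumptions made for the example.

```python
def fixed_pattern_pairs(n_tokens, window=2, global_tokens=(0,)):
    """Return the (query, key) index pairs a fixed pattern would compute.

    The pattern depends only on positions, never on what the tokens say.
    """
    pairs = set()
    for q in range(n_tokens):
        # Local sliding window: each token sees its nearest neighbours.
        for k in range(max(0, q - window), min(n_tokens, q + window + 1)):
            pairs.add((q, k))
        # Global tokens: attend to, and are attended by, every position.
        for g in global_tokens:
            pairs.add((q, g))
            pairs.add((g, q))
    return pairs

n = 1_000
pairs = fixed_pattern_pairs(n)
print(len(pairs), "of", n * n, "pairs computed")
print((500, 900) in pairs)  # False: a distant pair is skipped whatever its content
```

The number of computed pairs grows roughly linearly with n, which is why such patterns are fast. The cost is visible in the last line: a distant pair is never computed, even if those two tokens happen to be the most relevant to each other.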

The claim in detail


What SSA claims: four properties at once

Claim 1

How does SSA decide which token pairs to compute?

Content-dependent

Which pairs are computed depends on the actual tokens, not a fixed template. The model attends to what matters.

Claim 2

Does SSA approximate attention values or compute them exactly?

Exact

The computed attention values are identical to full attention for the selected pairs. No approximation error.

Claim 3

What context length has SSA been demonstrated at?

12M-token context

Demonstrated at 12 million tokens, according to the company. For comparison, GPT-4 launched with a context window of about 8,000 tokens (8,192). Current models reach 128k to 1M.

Claim 4

What makes SSA different from earlier sparse attention methods?

End-to-end subquadratic

The selection mechanism itself is subquadratic. Previous sparse methods solved the compute problem but not the selection problem.
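The difference between "content-dependent", "exact", and "end-to-end subquadratic" can be seen in a toy example. The sketch below is explicitly not the SSA algorithm, which is not described here. It is a hypothetical brute-force top-k selection, with names and toy vectors invented for illustration. It picks pairs based on content and reuses the unmodified scores for the chosen pairs, but it has to score every pair to make the choice, so the selection step is still quadratic. How SSA selects pairs, and how it normalises over them, are assumptions left open.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def topk_sparse_attention(queries, keys, values, k=2):
    """Hypothetical top-k sparse attention; NOT the SSA algorithm."""
    d = len(keys[0])
    outputs, selected_pairs, scored_pairs = [], [], 0
    for qi, q in enumerate(queries):
        # Selection: brute force scores every key, so this is O(n^2) overall.
        scores = [dot(q, key_vec) / math.sqrt(d) for key_vec in keys]
        scored_pairs += len(keys)
        chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        selected_pairs.extend((qi, ki) for ki in chosen)
        # Exact step: the chosen pairs keep their true scores (no approximation);
        # the softmax is taken over the chosen pairs only.
        peak = max(scores[i] for i in chosen)
        exps = {i: math.exp(scores[i] - peak) for i in chosen}
        total = sum(exps.values())
        out = [sum(exps[i] / total * values[i][j] for i in chosen)
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, selected_pairs, scored_pairs

# Toy example: 4 tokens (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
V = [[1.0], [2.0], [3.0], [4.0]]
outputs, selected_pairs, scored_pairs = topk_sparse_attention(Q, K, V, k=2)
print(len(selected_pairs), "pairs used;", scored_pairs, "pairs scored to choose them")
```

Only 8 pairs contribute to the outputs, yet all 16 were scored to find them. Claim 4 amounts to saying that SSA avoids this hidden quadratic step.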

One open question

"If the claim holds, it is the most important architectural advance since the 2017 paper itself."

The key word is if. Independent benchmarks and peer review will determine whether all four properties hold simultaneously at scale. That work is ongoing.

Inside the attention mechanism

Diagram: a grid of n tokens against n tokens. Full attention computes all n × n cells, O(n²). SSA computes only the content-selected pairs and skips the rest (the subquadratic saving).
Input tokens

Each word (or sub-word) in a sentence becomes a token. The attention mechanism must consider every possible pair. With n tokens, that is n × n pairs, giving O(n²) complexity.

Full attention: O(n²)

The darker a cell, the stronger the attention score between that pair of tokens. Standard transformers compute all n² cells. At 128k tokens, that is over 16 billion values, just for attention.

SSA: content-selected pairs

SSA computes only the pairs the content demands, shown in magenta. The grey cells are skipped entirely. The key claim: the selection itself also runs in subquadratic time, and the computed values are exact, not approximate.

Check your understanding

1. If you double the context length in a standard transformer, the attention compute increases by a factor of…

One answer correct

2. Which of the following are consequences of the O(n²) attention constraint?

Multiple answers correct

3. What makes SSA different from previous sparse attention approaches like Longformer or BigBird?

One answer correct

What to take with you

The O(n²) constraint is not a detail. It is the main reason AI systems still struggle to read a full book, track a multi-day conversation, or reason over a large codebase in a single pass. If SSA holds up under scrutiny, these limits start to move.

Three things to watch for in the coming months:

  • Independent benchmarks comparing SubQ to Llama, Gemini, and GPT on long-context tasks
  • Peer-reviewed analysis of the SSA selection algorithm's actual complexity
  • Whether existing models adopt the mechanism via fine-tuning, or require retraining from scratch
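One simple way to read benchmark results for the second point is to estimate the scaling exponent from measurements: plot cost against context length on log-log axes and look at the slope. The sketch below does this with a least-squares fit. The data is synthetic and made up for illustration, not results from SubQ or any other model, and the function name is an assumption.

```python
import math

def scaling_exponent(context_lengths, costs):
    """Least-squares slope of log(cost) against log(context length)."""
    xs = [math.log(n) for n in context_lengths]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic measurements (made up), not results from any real model.
lengths = [1_000, 2_000, 4_000, 8_000, 16_000]
quadratic_costs = [n ** 2 for n in lengths]
subquadratic_costs = [n ** 1.5 for n in lengths]

print(round(scaling_exponent(lengths, quadratic_costs), 2))     # 2.0
print(round(scaling_exponent(lengths, subquadratic_costs), 2))  # 1.5
```

An exponent that stays clearly below 2 across a wide range of context lengths is consistent with subquadratic scaling, although a finite benchmark cannot prove how a method behaves at even larger sizes.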

You made it.

Microlearning complete.

You now understand why the O(n²) constraint shaped AI for nine years, and what SSA claims to change. That context is rare. Most people miss it entirely.