What changes after this microlearning
Why some tasks have hard capability thresholds
You can explain why certain task types require a minimum model size, and why prompt engineering cannot compensate for a missing capability.
How token costs actually accumulate
You can describe how iteration multiplies resource cost, and why vague prompts at large model tiers are the most expensive combination.
How to match a task to a model tier
You can apply a concrete decision question to a real task and justify your model choice with a technical argument, not just intuition.
When a larger model is actually cheaper
You understand the reliability gradient: why cost per correct result is a better measure than cost per query for complex automated tasks.
What model size actually means
Emergent capabilities: the threshold effect
The key concept in this microlearning.
Some capabilities in large language models don't improve gradually with model size. They appear abruptly past a certain parameter threshold. Below that threshold, the model fails consistently on a given task type. Above it, performance jumps sharply. This is called emergence. Examples include multi-step arithmetic reasoning, logical chain-of-thought, and nuanced contextual interpretation of ambiguous input. The capability doesn't grow — it arrives.
Prompt engineering adjusts how the model processes a problem. It cannot create processing capacity that doesn't exist in the model's weights. If the capability hasn't emerged yet, the model lacks the internal representation needed to perform the task reliably. You can refine a marginal result. You cannot generate a missing capability. This is why switching to a more detailed prompt on a too-small model produces inconsistent improvement rather than a clear fix.
Thresholds typically apply to four kinds of task: multi-step reasoning where each step depends on the previous one, weighing contradictory information simultaneously, synthesis across unrelated knowledge domains, and interpreting subtle shifts in register, intent, or interpersonal context. Simple extraction, bounded reformatting, and pattern-based classification typically don't trigger thresholds. The clearest signal: if the task requires the model to hold several competing considerations at once, a threshold likely applies.
A larger model is not better for every task. On well-defined tasks like extraction, reformatting, or classification, smaller models perform comparably to larger ones. No emergent capability is required, so the additional parameters add cost without adding quality. The goal is not maximum capability. It's minimum sufficient capability: the smallest model that can reliably complete this task. Oversizing is a resource cost with no performance benefit.
Emergence is formally documented. Wei et al. (2022) analyzed over 100 model capabilities across training scales and found that many showed a sharp phase transition: near-zero performance at smaller scales, followed by sudden improvement past a threshold. The phenomenon was unexpected and remains only partially explained. What's clear for practitioners: you cannot extrapolate from small-model behavior to predict large-model behavior on complex tasks. Below the threshold, improvement is marginal. Above it, the capability is there.
Watch and anchor
Before applying the concepts, watch this video to anchor the scaling behavior visually. The key moment: notice how capability improvements are not smooth across scale.
Emergent abilities in large language models
The jump you're watching is not a rendering artifact. It reflects the underlying threshold effect: capability that wasn't there, and then suddenly is. This is what makes model selection a binary decision for some tasks, not a quality preference.
Two decisions, four outcomes
Your resource cost per query is determined by two independent decisions: how precise your prompt is, and which model tier you choose. Together they give four possible combinations, each with its own cost driver.
Q1: Vague prompt, oversized model. Every iteration runs at the highest token rate. What should cost 280 tokens accumulates to roughly 1,900 across five rounds: the 6.8× scenario. This is the most expensive combination, and the most common for practitioners who haven't thought about model selection.
Q2: Precise prompt, oversized model. The task completes in one round, but you paid for reasoning depth the task didn't require. For extraction, classification, or fixed-format output, a smaller model performs equally well. The extra parameters add cost, not quality.
Q3: Vague prompt, right-sized model. Cheaper per token than Q1, but iteration still multiplies cost. The fix is not a different model, because the tier is already appropriate. The fix is a more precise prompt. Iteration cost applies regardless of model tier.
Q4: Precise prompt, right-sized model. One round at the correct tier. The optimization target: the smallest model that reliably completes this task, with a prompt that requires no back-and-forth. Every query in this zone delivers the same result as Q1 at a fraction of the cost.
The 6.8× is not a single factor. It is two independent decisions that compound.
Factor 1: Iteration
A vague prompt requires multiple rounds to produce the right result. Five rounds instead of one multiplies token cost by 5.
Factor 2: Model tier
An oversized model adds a ~1.36× cost multiplier per token on top. Both factors compound: 5 × 1.36 ≈ 6.8.
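The arithmetic behind the two factors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 280-token baseline, the five rounds, and the ~1.36× tier multiplier come from the text, while the Q1 to Q4 labels for the four combinations, the idea that every round costs the same as the first, and the "token-equivalents" unit are assumptions made for the example.

```python
# Figures from the text: 280 baseline tokens, five rounds, ~1.36x tier multiplier.
BASELINE_TOKENS = 280
ROUNDS_VAGUE = 5        # a vague prompt needs five rounds
ROUNDS_PRECISE = 1      # a precise prompt needs one
TIER_OVERSIZED = 1.36   # per-token cost multiplier of the oversized tier
TIER_RIGHT_SIZED = 1.0  # reference tier


def relative_cost(rounds: int, tier_multiplier: float) -> float:
    """Cost relative to one round on the right-sized model.

    Assumption: every round costs as much as the first one.
    """
    return rounds * tier_multiplier


# Assumption: Q1..Q4 label the four prompt/tier combinations in this order.
quadrants = {
    "Q1 vague prompt, oversized model": relative_cost(ROUNDS_VAGUE, TIER_OVERSIZED),
    "Q2 precise prompt, oversized model": relative_cost(ROUNDS_PRECISE, TIER_OVERSIZED),
    "Q3 vague prompt, right-sized model": relative_cost(ROUNDS_VAGUE, TIER_RIGHT_SIZED),
    "Q4 precise prompt, right-sized model": relative_cost(ROUNDS_PRECISE, TIER_RIGHT_SIZED),
}

for name, factor in quadrants.items():
    token_equivalents = BASELINE_TOKENS * factor
    print(f"{name}: {factor:.2f}x, about {token_equivalents:.0f} token-equivalents")
```

For the vague-prompt, oversized-model case this gives 5 × 1.36 = 6.8 and 280 × 6.8 = 1,904, which the text rounds to 1,900.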
Task to model: the decision
Apply the decision question to each task. What model tier does it require, and why?
Task A
Summarize a meeting transcript in five bullet points.
Fixed format, bounded task. No competing angles to weigh. Reliable below the threshold.
Task B
Identify blind spots in a growth plan.
Competing perspectives, cross-domain reasoning, inference about what's missing. Emergent threshold.
Task C
Classify 500 support tickets by topic.
Pattern classification. Add few-shot examples for accuracy. Scales cheaply.
Task D
Draft a response to a difficult client escalation.
Tone, accuracy, and stakeholder impact simultaneously. Edge cases have real consequences.
Task E
Extract all dates and deadlines from a contract.
Extraction, no reasoning chain. Predictable output. A larger model adds cost, not quality.
Task F
Explain to a new employee why onboarding is structured the way it is.
Context synthesis, recipient adaptation, unspoken assumptions. Not retrieval.
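The decision question can also be written down as a small rule. The sketch below is a hypothetical illustration, not a definitive classifier: the four flags restate the threshold signals named earlier in this microlearning, while the `Task` class, the function names, and the flags assigned to Tasks A to F are one reading of the cards above and should be treated as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A task described by the four threshold signals (hypothetical structure)."""
    name: str
    dependent_multi_step_reasoning: bool = False
    contradictory_information: bool = False
    cross_domain_synthesis: bool = False
    subtle_register_or_intent: bool = False


def threshold_likely(task: Task) -> bool:
    """Return True if any of the four threshold signals is present."""
    return any((
        task.dependent_multi_step_reasoning,
        task.contradictory_information,
        task.cross_domain_synthesis,
        task.subtle_register_or_intent,
    ))


def recommend_tier(task: Task) -> str:
    """Minimum sufficient capability: go larger only if a threshold likely applies."""
    return "larger" if threshold_likely(task) else "smaller"


# Assumption: these flags are one interpretation of the task cards.
TASKS = [
    Task("A: summarize a meeting transcript"),
    Task("B: identify blind spots in a growth plan",
         contradictory_information=True, cross_domain_synthesis=True),
    Task("C: classify support tickets"),
    Task("D: respond to a client escalation",
         contradictory_information=True, subtle_register_or_intent=True),
    Task("E: extract dates from a contract"),
    Task("F: explain onboarding to a new employee",
         subtle_register_or_intent=True),
]

for task in TASKS:
    print(f"{task.name} -> {recommend_tier(task)} model")
```

The rule is deliberately conservative in one direction only: a task with no signal defaults to the smaller model, which matches the minimum-sufficient-capability target.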
Check your understanding
1. A team automates a weekly reporting task using a small model. It gets the correct result 87% of the time. In 13% of cases, a human must review and correct the output. When would it make sense to switch to a larger model?
One answer correct
2. Which of these task properties suggest a capability threshold applies and a larger model is likely necessary?
Multiple answers correct
3. A colleague says: "I just use the largest available model for everything. That way I'm always safe." What is the technical problem with this approach?
One answer correct
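The scenario in question 1 can be explored with a rough cost-per-correct-result calculation. This is a minimal sketch under loud assumptions: only the 87% success rate comes from the question. The query costs, the larger model's 98% success rate, and the human review costs are hypothetical values chosen for illustration, and the model assumes that every failed output is caught and corrected by a human.

```python
def cost_per_correct_result(query_cost: float, success_rate: float,
                            review_cost: float) -> float:
    """Expected cost per task when every failed output is fixed by a human."""
    return query_cost + (1.0 - success_rate) * review_cost


def break_even_review_cost(small_cost: float, small_rate: float,
                           large_cost: float, large_rate: float) -> float:
    """Review cost above which the larger model is cheaper per correct result."""
    if large_rate <= small_rate:
        raise ValueError("the larger model must be more reliable to break even")
    return (large_cost - small_cost) / (large_rate - small_rate)


SMALL_COST = 1.0   # assumption: one cost unit per small-model query
SMALL_RATE = 0.87  # from question 1
LARGE_COST = 1.36  # assumption: reuses the ~1.36x tier multiplier
LARGE_RATE = 0.98  # assumption: hypothetical reliability of the larger model

threshold = break_even_review_cost(SMALL_COST, SMALL_RATE, LARGE_COST, LARGE_RATE)
print(f"Break-even human review cost: {threshold:.2f} query units")

for review_cost in (1.0, 10.0):  # hypothetical review costs in query units
    small = cost_per_correct_result(SMALL_COST, SMALL_RATE, review_cost)
    large = cost_per_correct_result(LARGE_COST, LARGE_RATE, review_cost)
    cheaper = "larger" if large < small else "smaller"
    print(f"review cost {review_cost}: small={small:.2f}, large={large:.2f}, "
          f"cheaper: {cheaper} model")
```

With these assumed numbers the break-even point sits at roughly 3.3 query units: once a human correction costs more than that, the larger model is cheaper per correct result even though it is more expensive per query.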
What stays with you
Two variables determine your token cost per query: prompt precision and model tier. Both are your decision. Most practitioners either always oversize or never question their default. Here's what changes:
- Model size determines capability, not quality. Parameters set what a model can and cannot do. That is a technical argument, not a preference.
- Capability thresholds are binary. Below a certain model size, some capabilities do not exist. Prompt engineering cannot substitute for missing capacity.
- Token cost compounds. The iterations caused by a vague prompt (5×) multiplied by an oversized tier (~1.36×) give the 6.8× scenario. The two factors are independent, and each is yours to control.
- Minimum sufficient capability is the optimization target. If the task does not require weighing competing considerations, the smaller model is the correct choice.
You made it.
Microlearning complete.
You now understand why model selection is a technical decision, not a preference. The capability threshold concept is what most practitioners miss — and it changes how you evaluate every AI tool you use.