The right model for the right task

What changes after this microlearning

Why some tasks have hard capability thresholds

You can explain why certain task types require a minimum model size, and why prompt engineering cannot compensate for a missing capability.

How token costs actually accumulate

You can describe how iteration multiplies resource cost, and why vague prompts at large model tiers are the most expensive combination.

How to match a task to a model tier

You can apply a concrete decision question to a real task and justify your model choice with a technical argument, not just intuition.

When a larger model is actually cheaper

You understand the reliability gradient: why cost per correct result is a better measure than cost per query for complex automated tasks.

What model size actually means

Slide 1 of 4

Not all intelligence is equal

When you ask a model to format a spreadsheet, you don't need the same capability as when you ask it to identify blind spots in a business strategy. The model doesn't know the difference. You need to. The question isn't "which model is best." It's "which model is sufficient for this task."

Slide 2 of 4

Parameters are compressed knowledge

A model's size is measured in parameters: the weights, learned during training, that encode relationships between concepts. More parameters mean finer distinctions, more abstract relationships, and more complex patterns recognized. They also mean more compute per query, higher cost, and more energy. The relationship between parameter count and capability is not linear. That non-linearity is what makes model selection a technical decision.

Slide 3 of 4

The cost of every query

Every time you send a prompt, the model runs inference: it processes your input and generates output one token at a time. A larger model processes more parameters per token. The bill is the sum of input tokens (your prompt plus any context sent with it) and output tokens, multiplied by the model tier rate. Iteration multiplies everything. A vague prompt that requires five rounds costs roughly five times as much as a precise one, and if you chose an oversized model for a simple task, you pay both penalties at once.
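The billing logic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a real pricing model: the `rate_per_1k` values and token counts are made-up assumptions, and it uses a single rate per tier as the slide does (real price lists often charge input and output tokens differently). It also treats every round as the same size, which is the simplification behind the "five times" figure.

```python
def query_cost(rate_per_1k: float, input_tokens: int, output_tokens: int) -> float:
    """Cost of one round: (input + output tokens) billed at the tier rate."""
    return (input_tokens + output_tokens) / 1000 * rate_per_1k


def session_cost(rate_per_1k: float, input_tokens: int,
                 output_tokens: int, rounds: int) -> float:
    """Cost of reaching a usable result after `rounds` equally sized rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return rounds * query_cost(rate_per_1k, input_tokens, output_tokens)


# Hypothetical tier rates per 1,000 tokens; illustrative numbers only.
SMALL_TIER_RATE = 1.0
LARGE_TIER_RATE = 4.0

precise_small = session_cost(SMALL_TIER_RATE, 500, 300, rounds=1)
vague_large = session_cost(LARGE_TIER_RATE, 500, 300, rounds=5)

print(f"precise prompt, small tier: {precise_small:.2f}")
print(f"vague prompt, large tier:   {vague_large:.2f}")
```

In a chat where earlier turns are resent as context, later rounds carry more input tokens than the first, so the flat multiplication here is best read as a lower bound.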

Slide 4 of 4

One decision question

Before selecting a model tier, ask: does this task require holding multiple competing considerations simultaneously? Evaluating contradictory evidence, synthesizing across unrelated domains, navigating nuanced human context — these are the markers. If yes, a capability threshold likely applies and a larger model is necessary. If not, a smaller, faster, cheaper model will deliver the same result. The following sections give you the technical reasoning behind that question.
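As a rough sketch, the decision question can be written as a small checklist function. The `TaskProfile` class, its field names, and the `"small"` / `"large"` labels are hypothetical names chosen for this illustration; the three markers come from the slide above. Judging whether a marker applies to a given task remains your call, not the code's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical checklist: which threshold markers does the task show?"""
    evaluates_contradictory_evidence: bool = False
    synthesizes_across_domains: bool = False
    navigates_nuanced_human_context: bool = False


def holds_competing_considerations(task: TaskProfile) -> bool:
    """True if at least one marker from the decision question applies."""
    return any((
        task.evaluates_contradictory_evidence,
        task.synthesizes_across_domains,
        task.navigates_nuanced_human_context,
    ))


def choose_tier(task: TaskProfile) -> str:
    """Return the minimum sufficient tier under the decision question."""
    return "large" if holds_competing_considerations(task) else "small"


# The two tasks from the first slide.
format_spreadsheet = TaskProfile()
strategy_blind_spots = TaskProfile(
    evaluates_contradictory_evidence=True,
    synthesizes_across_domains=True,
)

print("format a spreadsheet:", choose_tier(format_spreadsheet))
print("find strategy blind spots:", choose_tier(strategy_blind_spots))
```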

Emergent capabilities: the threshold effect

The key concept in this microlearning.

Some capabilities in large language models don't improve gradually with model size. They appear abruptly past a certain parameter threshold. Below that threshold, the model fails consistently on a given task type. Above it, performance jumps sharply. This is called emergence. Examples include multi-step arithmetic reasoning, logical chain-of-thought, and nuanced contextual interpretation of ambiguous input. The capability doesn't grow — it arrives.

Prompt engineering adjusts how the model processes a problem. It cannot create processing capacity that doesn't exist in the model's weights. If the capability hasn't emerged yet, the model lacks the internal representation needed to perform the task reliably. You can refine a marginal result. You cannot generate a missing capability. This is why switching to a more detailed prompt on a too-small model produces inconsistent improvement rather than a clear fix.

Thresholds typically apply to four task types: multi-step reasoning where each step depends on the previous one, simultaneous weighing of contradictory information, synthesis across unrelated knowledge domains, and interpretation of subtle shifts in register, intent, or interpersonal context. Simple extraction, bounded reformatting, and pattern-based classification typically don't trigger thresholds. The clearest signal: if the task requires the model to hold several competing considerations at once, a threshold likely applies.

A larger model is not better for all tasks. On well-defined tasks like extraction, reformatting, or classification, smaller models perform comparably to larger ones. No emergent capability is required, so the additional parameters add cost without adding quality. The goal is not maximum capability. It's minimum sufficient capability: the smallest model that can reliably complete this task. Oversizing is a resource cost with no performance benefit.
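One way to make "reliably" measurable is cost per correct result: the cost of one query divided by the share of queries that succeed. The sketch below assumes each attempt succeeds independently with the same probability, so the expected number of attempts per correct result is 1 / success rate. All costs and success rates are hypothetical, chosen only to show the two cases.

```python
def cost_per_correct_result(cost_per_query: float, success_rate: float) -> float:
    """Expected spend per correct result, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_query / success_rate


# Hypothetical per-query costs for two tiers.
SMALL_COST = 1.0
LARGE_COST = 4.0

# Well-defined task: both tiers succeed almost always (assumed rates).
extraction_small = cost_per_correct_result(SMALL_COST, 0.95)
extraction_large = cost_per_correct_result(LARGE_COST, 0.97)

# Threshold task: the small tier rarely succeeds (assumed rates).
synthesis_small = cost_per_correct_result(SMALL_COST, 0.10)
synthesis_large = cost_per_correct_result(LARGE_COST, 0.80)

print(f"extraction: small {extraction_small:.2f}, large {extraction_large:.2f}")
print(f"synthesis:  small {synthesis_small:.2f}, large {synthesis_large:.2f}")
```

With these assumed numbers the small tier wins on the well-defined task, while on the threshold task the larger tier is cheaper per correct result even though each query costs more.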

Emergence is formally documented. Wei et al. (2022) analyzed over 100 model capabilities across training scales and found that many showed a sharp phase transition: near-zero performance at smaller scales, followed by sudden improvement past a threshold. The phenomenon was unexpected and remains only partially explained. What's clear for practitioners: you cannot interpolate from small model behavior to predict large model behavior on complex tasks. Below the threshold, improvement is marginal. Above it, the capability is there.

Watch and anchor

Before applying the concepts, watch this video to anchor the scaling behavior visually. The key moment: notice how capability improvements are not smooth across scale.

Emergent abilities in large language models

The jump you're watching is not a rendering artifact. It reflects the underlying threshold effect: capability that wasn't there, and then suddenly is. This is what makes model selection a binary decision for some tasks, not a quality preference.

Two decisions, four outcomes

Your resource cost per query is determined by two independent decisions: how precise your prompt is, and which model tier you chose. Together they give four possible combinations.

The 6.8× cost gap between the most expensive combination and the optimal one is not a single factor. It is two independent decisions that compound.

Factor 1: Iteration

A vague prompt requires multiple rounds to produce the right result. Five rounds instead of one multiplies token cost by 5.

Factor 2: Model tier

An oversized model adds a ~1.36× cost multiplier per token on top. Both factors compound: 5 × 1.36 ≈ 6.8.
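The compounding of the two factors can be checked with a short calculation over all four combinations. The figures 5 and 1.36 come from the text above; the baseline of 1 round and a 1.0 tier multiplier for the precise, sufficient case is an assumption of this sketch.

```python
# Figures from the text above.
ROUNDS_VAGUE_PROMPT = 5
TIER_MULTIPLIER_OVERSIZED = 1.36

# Assumed baseline for a precise prompt on a sufficient model.
ROUNDS_PRECISE_PROMPT = 1
TIER_MULTIPLIER_SUFFICIENT = 1.0


def cost_multiplier(vague_prompt: bool, oversized_model: bool) -> float:
    """Relative cost of one combination versus the baseline."""
    rounds = ROUNDS_VAGUE_PROMPT if vague_prompt else ROUNDS_PRECISE_PROMPT
    tier = TIER_MULTIPLIER_OVERSIZED if oversized_model else TIER_MULTIPLIER_SUFFICIENT
    return rounds * tier


for vague in (True, False):
    for oversized in (True, False):
        prompt_label = "vague prompt" if vague else "precise prompt"
        model_label = "oversized model" if oversized else "sufficient model"
        multiplier = cost_multiplier(vague, oversized)
        print(f"{prompt_label}, {model_label}: {multiplier:.2f}x")
```

Fixing only one of the two decisions still leaves a 5× or a 1.36× penalty in place; only the precise prompt on a sufficient model reaches the baseline.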

Q1 is the maximum-cost combination: a vague prompt on an oversized model. Q4 is the optimal combination: a precise prompt on a sufficient model.

Task to model: the decision

Apply the decision question to each task. What model tier does it require, and why?

Task A

Summarize a meeting transcript in five bullet points.

Small model

Fixed format, bounded task. No competing angles to weigh. Reliable below the threshold.

Task B

Identify blind spots in a growth plan.

Large model

Competing perspectives, cross-domain reasoning, inference about what's missing. Emergent threshold.

Task C

Classify 500 support tickets by topic.

Small model

Pattern classification. Add few-shot examples for accuracy. Scales cheaply.

Task D

Draft a response to a difficult client escalation.

Large model

Tone, accuracy, and stakeholder impact simultaneously. Edge cases have real consequences.

Task E

Extract all dates and deadlines from a contract.

Small model

Extraction, no reasoning chain. Predictable output. A larger model adds cost, not quality.

Task F

Explain to a new employee why onboarding is structured the way it is.

Large model

Context synthesis, recipient adaptation, unspoken assumptions. Not retrieval.

What stays with you

Two variables determine your token cost per query: prompt precision and model tier. Both are your decision. Most practitioners either always oversize or never question their default. Here's what changes:

Model size determines capability, not quality. Parameters set what a model can and cannot do. That is a technical argument, not a preference.
Capability thresholds are binary. Below a certain model size, some capabilities do not exist. Prompt engineering cannot substitute for missing capacity.
Token cost compounds. A vague prompt (five iterations) × an oversized tier (~1.36× per token) ≈ the 6.8× scenario. The two factors are independent, and each is yours to control.
Minimum sufficient capability is the optimization target. If the task does not require weighing competing considerations, the smaller model is the correct choice.

You made it.

Microlearning complete.

You now understand why model selection is a technical decision, not a preference. The capability threshold concept is what most practitioners miss — and it changes how you evaluate every AI tool you use.
