Saving tokens in GitHub Copilot

Mental model

Cost does not come only from what you type into chat. It adds up across many layers:

Always-on instructions — AGENTS.md, custom instructions, hooks
Selected and open files in context
Chat history and summaries between turns
MCP tool definitions and their JSON schemas — even the ones you do not use
Tool call results replayed as input in the next step
Model output including reasoning/thinking tokens (output tokens are the most expensive)
Retries, subagents, loops in agent mode

Three token types — why they matter:

Input — everything you send to the model for the first time (prompt, context, tool results)
Cached input — a repeated prefix the model has already seen in a previous turn of the same session. ~10× cheaper than fresh input.
Output — what the model generates, often including internal reasoning/thinking tokens. ~6× more expensive than fresh input, ~60× more expensive than cached input.

Example from current real pricing (USD per 1M tokens; "Short" and "Long" are the short- and long-context tiers):

Model	Short cache	Short input	Short output	Long cache	Long input	Long output
gpt-5.5	$0.50	$5.00	$30.00	$1.00	$10.00	$45.00
gpt-5.4	$0.25	$2.50	$15.00	$0.50	$5.00	$22.50
gpt-5.4-mini	$0.075	$0.75	$4.50	—	—	—
gpt-5.4-nano	$0.02	$0.20	$1.25	—	—	—

gpt-5.4-mini and gpt-5.4-nano have no long-context tier.

Simplified rules:

Cache : Input : Output ≈ 1 : 10 : 60
Each tier down is ~2–4× cheaper (gpt-5.5 → gpt-5.4 is 2×; the mini and nano steps are ~3–4×)
Long context makes input/cache ~2× more expensive and output ~1.5× more expensive
Cache is why /compact is not free: the summary is billed as output and the cached prefix is invalidated
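The rules above can be checked with a few lines of arithmetic. This is a minimal sketch that copies the short-context prices from the table; the names Price, PRICES and turn_cost are illustrative, and the token counts in the example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, short-context tier."""
    cached: float
    fresh: float
    output: float


# Prices copied from the table above; the dictionary itself is illustrative.
PRICES = {
    "gpt-5.5": Price(cached=0.50, fresh=5.00, output=30.00),
    "gpt-5.4": Price(cached=0.25, fresh=2.50, output=15.00),
    "gpt-5.4-mini": Price(cached=0.075, fresh=0.75, output=4.50),
    "gpt-5.4-nano": Price(cached=0.02, fresh=0.20, output=1.25),
}


def turn_cost(model: str, cached: int, fresh: int, output: int) -> float:
    """Return the USD cost of one turn from its three token counts."""
    p = PRICES[model]
    return (cached * p.cached + fresh * p.fresh + output * p.output) / 1_000_000


# Hypothetical turn: 40k cached prefix, 2k new input, 1k output.
print(f"{turn_cost('gpt-5.5', 40_000, 2_000, 1_000):.4f}")  # prints 0.0600
```

In this example the 1k output tokens ($0.03) cost more than the 40k cached tokens ($0.02), which is the 1 : 10 : 60 rule in practice.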

Easy wins for everyone

If you are not sure why you need a premium reasoning model, start with Auto.

It routes by task complexity, availability and system health
You will not keep an expensive model pinned for routine work
10% discount on the multiplier for paid plans
Manual override for architecture or hard debugging remains available

In practice this means two things: Auto picks a suitable model for the session, and paid plans get a 10% multiplier discount. Microsoft's HyDRA paper (arXiv) reports that on SWE-Bench Verified the approach can match the quality of Sonnet 4.6 while cutting cost by 54.1%; in peak-quality mode it beats Sonnet 4.6 and still saves 12.9%.

Expensive

Understand this repository and fix the login problem.

Cheap

Focus on src\auth\login.ts and tests\auth\login.test.ts. Bug: validateEmail rejects user+tag@example.com. Add a test and fix only this behavior.

Rule: concrete file paths are cheaper than exploring the whole repository.

Bonus: batch related tasks into one prompt.

In src\auth\login.ts:
1. validateEmail: allow user+tag@example.com
2. add a test in tests\auth\login.test.ts
3. add a docstring to the function
4. record the change in CHANGELOG.md

Instead of "do you understand the whole thing?", ask:

Find the smallest set of files I need to understand the event-driven flow. List only paths and one sentence why. Do not summarize the whole repo.

Only then give the task — knowing what is relevant.

Expensive

Explain in detail everything you changed.

Cheap

List: changed files, why, tests. Max 5 bullets.

"Caveman" output style for reports:

Done. List: files, why, validation, risks. No intro. ≤5 bullets.

Set reasoning effort as deliberately as response length.

Setting	When it makes sense	Token impact
Low	quick questions, small edits, format conversions	little hidden output
Medium	normal agentic work	usually best cost/performance
High	architecture, hard debugging, unclear multi-step problem	more reasoning tokens, higher latency

The same information in three versions. It shows the "sweet spot" between readability and compactness.

Hi sweetheart, mom will come home around six in the evening. Please do your math and Czech homework first, okay? Once you are done, you can play on the tablet for an hour. There is tomato soup in the fridge; heat it in the microwave for three minutes on high. Please do not forget to walk Bertik around five o'clock and then put kibble in his bowl. I love you very much, mom 💕

Metric	Value
Characters	388
Tokens (GPT-5)	140
Readability	high
For whom	people with time

Home at 18:00. Homework: math + Czech → then tablet 1h. Soup in fridge → microwave 3 min. Dog out 17:00 + kibble. Mom 💕

Metric	Value
Characters	110 (−72 %)
Tokens (GPT-5)	48 (−66 %)
Readability	still good
For whom	a human who knows the context

Metric	Value
Characters	57 (−85 %)
Tokens (GPT-5)	29 (−79 %)
Readability	poor
For whom	agent → agent

For terminal output, a screenshot is an antipattern. Instead of 5,000 lines of CI log, send only the relevant excerpt:

Command: npm test
Exit code: 1
Relevant error: TypeError: Cannot read property 'id' of undefined
  at UserService.findById (src/services/user.ts:42)
Last 30 lines: ...
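Cutting the excerpt is deterministic work, so it does not need a model. A minimal sketch, assuming the first line matching a simple error pattern is the relevant one; log_excerpt and ERROR_PATTERN are illustrative names, and the pattern is a heuristic that will need tuning per toolchain.

```python
import re

# Heuristic, case-sensitive pattern (an assumption, not a standard).
ERROR_PATTERN = re.compile(r"Error|Exception|FAILED")


def log_excerpt(log_text: str, tail: int = 30, context: int = 2) -> str:
    """Return the first error with a little context, plus the last `tail` lines."""
    lines = log_text.splitlines()
    first_error = next(
        (i for i, line in enumerate(lines) if ERROR_PATTERN.search(line)), None
    )
    parts = []
    if first_error is not None:
        start = max(0, first_error - context)
        end = min(len(lines), first_error + context + 1)
        parts.append("Relevant error:")
        parts.extend(lines[start:end])
    last = lines[-tail:] if tail > 0 else []
    parts.append(f"Last {len(last)} lines:")
    parts.extend(last)
    return "\n".join(parts)
```

Piping a 5,000-line CI log through a helper like this leaves a few dozen lines to paste into chat.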

For visual questions, it is more efficient to use Playwright MCP or browser canvas.

Tokenizers are trained mostly on English text, so the same content in another language usually costs more tokens. But this is not dogma:

Structured format erases most of the difference
Quality comes before saving
Cost of a mistake > cost of tokens

Create a POST endpoint /api/users that validates required fields name and email, returns 400 on error and 201 with the created user on success.

Language	Characters	Tokens	vs EN
English	148	31	1.00×
Czech	148	50	1.61×

POST /api/users
Validate: name req, email req+valid
400 errors
201 user

Language	Characters	Tokens	vs EN
English	71	20	1.00×
Czech	70	23	1.15×

Please, could you create a new HTTP endpoint of type POST at path /api/users that accepts JSON with fields name and email, validates that both values are present and the email is in the correct format, and on success returns 201 Created with the created user object, while on validation error it returns 400 Bad Request with error details?

Version	Tokens	vs structured
Verbose CZ	~95	+313 %
Normal CZ	50	+117 %
Structured CZ	23	baseline

One more layer: models do not share the same vocabulary. Two models can have the same price per million tokens without having the same price for the same text. Anthropic says Claude Sonnet 5's new tokenizer produces approximately 30% more tokens for the same text than Claude Sonnet 4.6; per-token pricing is unchanged, but an equivalent request can cost more (Anthropic docs). Models from different vendors can differ in the same way.
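The effect is plain arithmetic. A minimal sketch, assuming a hypothetical price of $3.00 per 1M tokens and a 100k-token request; only the ~30% inflation figure comes from the text above, and effective_cost is an illustrative name.

```python
def effective_cost(
    tokens_old: int, inflation: float, price_per_million: float
) -> tuple[float, float]:
    """Return (old_cost, new_cost) in USD for the same text under two tokenizers.

    `inflation` is the fraction of extra tokens the new tokenizer produces
    (0.30 means 30% more tokens); the per-token price is the same for both.
    """
    old_cost = tokens_old * price_per_million / 1_000_000
    new_cost = tokens_old * (1 + inflation) * price_per_million / 1_000_000
    return old_cost, new_cost


# Hypothetical numbers: 100k tokens at $3.00 per 1M, 30% more tokens.
old, new = effective_cost(100_000, 0.30, 3.00)
print(f"old={old:.2f} USD, new={new:.2f} USD")
```

The price list does not change, but the bill for the same text grows by the inflation factor.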

Advanced techniques

Always-on instructions are a recurring tax — you pay them on every turn.

Context type	Where it belongs	Rule
Always-on (small)	`AGENTS.md`	only facts the agent cannot infer
Path-specific	`.github\instructions\*.instructions.md`	loaded only for relevant files
Workflow-specific	prompt files	invoked on demand
Detailed capability	`.github\skills\`	progressive reveal — only when the topic appears
Live data	MCP server	fetch on demand

Let the agent write context for the future.

Write into .github\skills\auth-flow\SKILL.md a concise (≤60 lines) description of the auth flow as we have just understood it. Focus on: entry points, key files, pitfalls, what to do when changing it. No prose, only lists and links.

MCP has three hidden cost layers:

Tool definitions and JSON schemas loaded into context
Tool call arguments as output tokens
Tool results replayed as input in the next step

Use MCP as search → select → fetch:
1. search or list candidates
2. choose one
3. fetch only the detail you need for the decision
4. summarize the result before continuing
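The four steps can be sketched as plain functions. This is not an MCP server and not any real MCP API: the issue data, the function names and the truncating summarizer are all hypothetical stand-ins that only show which data crosses the boundary at each step.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    id: int
    title: str
    body: str  # potentially large


# Hypothetical data standing in for whatever the server fronts.
ISSUES = [
    Issue(1, "Login rejects plus-addressed email",
          "Steps to reproduce and a long stack trace. " * 50),
    Issue(2, "Login page layout broken on mobile", "Screenshot notes. " * 50),
    Issue(3, "Export to CSV is slow", "Profiling output. " * 50),
]


def search(query: str, limit: int = 5) -> list[dict]:
    """Step 1: return lightweight candidates (id and title only)."""
    q = query.lower()
    hits = [i for i in ISSUES if q in i.title.lower()]
    return [{"id": i.id, "title": i.title} for i in hits[:limit]]


def fetch(issue_id: int, fields: tuple[str, ...] = ("title",)) -> dict:
    """Step 3: return only the requested fields of one item."""
    issue = next(i for i in ISSUES if i.id == issue_id)
    return {f: getattr(issue, f) for f in fields}


def summarize(text: str, max_chars: int = 200) -> str:
    """Step 4: stand-in summary by truncation."""
    return text if len(text) <= max_chars else text[: max_chars - 3] + "..."


candidates = search("login")          # step 1: small list
chosen = candidates[0]["id"]          # step 2: choose one
detail = fetch(chosen, ("body",))     # step 3: only the needed field
print(summarize(detail["body"]))      # step 4: shrink before continuing
```

The point of the design is that the large body field is never returned by search, so it only enters context for the one item that was chosen.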

The good news: Copilot has started doing part of this work for you. The VS Code team describes tool search, where the model initially receives only lightweight tool metadata and full JSON schemas are loaded on demand (VS Code blog). For OpenAI GPT-5.4/5.5, the experiment reduced median total tokens per turn by 8.61–9.81% and median session tokens by 8.97–10.92%. For Anthropic models, deferring tool definitions reduced total tokens for the median user by roughly 18%.

The principle still holds: tool search saves on tool definitions, not on result volume. An MCP server should still return compact candidate lists, filter on its side, and send details only on demand.

If an algorithm exists that solves the task exactly, do not force the model to do it across several turns.

Typical candidates:

JSON → XML / CSV conversion
Token counting, log slicing, schema validation
Dependency graph extraction
Sort, filter, dedupe large data
ID generators, parameterized templates
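As one example of the candidates above, a JSON → CSV conversion with deduplication fits in a short standard-library script. A minimal sketch, assuming the input is a JSON array of flat objects with scalar values; json_to_csv is an illustrative name.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects to CSV, dropping duplicate rows."""
    rows = json.loads(json_text)
    # Column order follows first appearance across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    seen = set()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        writer.writerow(row)  # missing keys are written as empty cells
    return out.getvalue()


print(json_to_csv('[{"id": 1, "name": "a"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'))
```

Asking the agent to write and run a script like this once costs a few hundred output tokens; asking it to convert the data itself costs output tokens for every row and can silently drop or alter values.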

The economics are always the same: old cache is cheapest, fresh input ~10×, output ~60×.

Situation	Command	What happens
Short off-topic question	`/ask` (`/btw`)	cache stays, answer does not grow into history
Bad last turn	`/undo` (`/rewind`)	removes changes and the context of the last turn
Sidequest on the same base	`/fork`	branches share the cached prefix
Return after a break	`/resume`	often a cache hit; if the cache has expired, the history is billed as fresh input
New topic	`/new`	old session saved, new one without cache
End permanently	`/clear`	session discarded, file changes are not
Context grew	`/compact`	expensive operation: the summary is billed as output and the cached prefix is invalidated
Export session	`/share`	input for analysis and improvement
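Whether /compact pays off depends on how many turns follow it. The sketch below is a deliberately simplified model, not a description of how Copilot bills compaction: it assumes the old history is read once from cache to write the summary, the summary is paid as output, then as fresh input on the next turn, then as cached input afterwards. All names and token counts are illustrative.

```python
from typing import Optional


def break_even_turns(
    history: int,
    summary: int,
    cached: float,
    fresh: float,
    output: float,
    max_turns: int = 1000,
) -> Optional[int]:
    """Return the first turn count at which compacting is cheaper than keeping.

    `history` and `summary` are token counts; the three prices only need to be
    in the same unit (for example USD per 1M tokens), since only ratios matter.
    """
    for n in range(1, max_turns + 1):
        # Keep: the whole history is re-read from cache on each of n turns.
        keep = n * history * cached
        # Compact: read history once, write the summary, pay it as fresh input
        # once, then re-read it from cache on the remaining n - 1 turns.
        compact = (
            history * cached
            + summary * (output + fresh)
            + (n - 1) * summary * cached
        )
        if compact < keep:
            return n
    return None  # never cheaper within max_turns


# Hypothetical session: 100k-token history, 5k-token summary, gpt-5.5 prices.
print(break_even_turns(100_000, 5_000, cached=0.50, fresh=5.00, output=30.00))  # prints 5
```

Under these assumptions compaction only wins if at least five more turns follow; for a session that is about to end, it is pure cost.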

/chronicle tips, /chronicle cost-tips and /chronicle improve serve as regular self-reflection.

Spend governance is another layer. Copilot CLI can set a soft session limit with /limits set max-ai-credits NUMBER or --max-ai-credits NUMBER in non-interactive runs (docs). Admins can govern user-level budgets, cost center per-user limits, cost center budgets, organization/enterprise budgets, and hard stops when budgets are exhausted (docs).

Parallel agents can save wall-clock time, but multiply input tokens if they read the same files.

Use a subagent when:

The work is truly independent
Context can be sharded by service / file area
A cheaper model is enough
The result can come back as a short summary

Do not use when:

All agents need the same context
The task is sequential
Coordination overhead dominates
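The two lists above come down to how much context each agent has to read. A minimal sketch with made-up token counts, assuming agents do not share a cache; total_input_tokens is an illustrative name.

```python
def total_input_tokens(shared_context: int, shard_sizes: list[int]) -> int:
    """Each agent reads the shared context plus its own shard."""
    return sum(shared_context + shard for shard in shard_sizes)


# Hypothetical: four agents that all need the same 80k-token context.
same_context = total_input_tokens(80_000, [0, 0, 0, 0])

# Hypothetical: a 2k-token brief shared by four agents with 20k-token shards.
sharded = total_input_tokens(2_000, [20_000] * 4)

print(same_context, sharded)  # prints 320000 88000
```

With identical context, four agents read four times what one agent would; with sharded context, the total stays close to the single-agent cost while the work runs in parallel.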

A subagent does not have to mean "another big model in another window". The direction GitHub/VS Code describes is specialized subagents for narrow tasks: workspace search, command execution and result summarization. The goal is to move noisy work out of the main agent and run it on the smallest model that can handle it (VS Code blog).

Token efficiency is not a one-time audit — it is an engineering discipline with a feedback loop.

Measure:

input, output, cached tokens
number of turns
tool calls
latency
retry rate
quality of result

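# Sketch of a GitHub Actions workflow; assumes the copilot-token-lab CLI is installed on the runner
on:
  pull_request:
    paths: ['.github/skills/**', 'AGENTS.md', '.github/prompts/**']

jobs:
  token-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: copilot-token-lab run --scenarios skills-regression --iterations 3
      - run: copilot-token-lab compare --baseline main --head HEAD
      # fail if: weighted_units_delta > +10% AND quality_score < baseline
      # then post report.md as a PR comment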
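The fail condition can be written down explicitly. A minimal sketch, assuming weighted units follow the 1 : 10 : 60 cache : input : output rule of thumb; the weights, the Run class and should_fail are illustrative, not part of any existing tool.

```python
from dataclasses import dataclass

# Assumed weights, following the 1 : 10 : 60 rule of thumb.
WEIGHTS = {"cached": 1, "input": 10, "output": 60}


@dataclass
class Run:
    cached: int
    input: int
    output: int
    quality_score: float

    @property
    def weighted_units(self) -> int:
        """Token counts weighted by their approximate relative price."""
        return (
            self.cached * WEIGHTS["cached"]
            + self.input * WEIGHTS["input"]
            + self.output * WEIGHTS["output"]
        )


def should_fail(baseline: Run, head: Run, max_delta: float = 0.10) -> bool:
    """Fail only when cost rises by more than max_delta AND quality drops."""
    delta = (head.weighted_units - baseline.weighted_units) / baseline.weighted_units
    return delta > max_delta and head.quality_score < baseline.quality_score


# Hypothetical measurements for the base branch and the pull request.
baseline = Run(cached=100_000, input=10_000, output=2_000, quality_score=0.80)
head = Run(cached=100_000, input=12_000, output=3_000, quality_score=0.75)
print(should_fail(baseline, head))  # prints True
```

Requiring both conditions means a change that costs more but improves quality still passes, which matches the idea of optimizing return rather than consumption.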

Summary

Easy

Auto model as the default
Do not leave reasoning effort unnecessarily high
Name exact files and done criteria
Find files as the first step
Limit output
Log excerpts, not dumps
Structured format erases the language difference

Advanced

Small AGENTS.md + skills + path instructions
MCP as search → select → fetch
Account for different tokenizers across models
Deterministic tooling
/ask, /fork, /resume consciously, /compact carefully
Subagents only for independent work; discovery/search subagents are a promising direction
Measure in CI and ask the agent for self-improvement

Tokens are not a cost to throttle. They are an investment. Optimize return, not consumption.