Why should you care about tokens? If you want to improve your AI workflow, reduce costs, and get better results from each prompt, tokens are the core metric to understand.
GitHub Copilot now uses a usage-based pricing model, where cost is tied directly to token consumption. That sounds simple in theory, but in practice it raises a few important questions: how do you actually measure token usage, what should you optimise, and does reducing tokens hurt output quality?
I ran a series of small experiments in VS Code to explore this in a practical way.
What are tokens?
Tokens are chunks of text that models use to process input and generate output. There is no fixed conversion, such as "three letters equal one token". Tokens can be whole words, parts of words, or symbols.
In tooling like Copilot or APIs, you'll typically see usage broken down into:
- Input (prompt) tokens: Everything sent to the model, including your prompt, system instructions, and any additional context (files, docs, etc.)
- Output (completion) tokens: The model's response, including generated code or explanations
- Cached tokens: Previously processed context reused at a reduced cost (provider-dependent)
- Reasoning tokens: Internal tokens used by some models; not always visible or billed separately
A rough rule of thumb is ~3–4 characters per token in English, but this varies widely, especially in code.
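To make the rule of thumb concrete, here is a minimal Python sketch of a character-based estimate. The function name `estimate_tokens` and the example prompt are assumptions made for illustration. This is a heuristic, not a real tokenizer, so actual counts will differ by model and by content.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


# Hypothetical prompt, used only to show the range the rule of thumb gives.
prompt = "Change the icon in the header component and update its test."
low = estimate_tokens(prompt, chars_per_token=4.0)
high = estimate_tokens(prompt, chars_per_token=3.0)
print(f"{len(prompt)} characters, roughly {low}-{high} tokens")
```

Treat the result as a ballpark figure for planning, not as a number to compare against billed usage.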
Where do tokens actually come from?
One important realisation is that what you type in the chat is only a small part of the total input.
In VS Code, tools like Copilot Chat may automatically include:
- Open files
- Related files in the project
- Instruction or "skills" documents
- Nearby code or dependencies
This means your short question can result in a very large prompt under the hood. In many cases, this additional context makes up the majority of input tokens.
That said, your chat input still counts toward the total. It is usually a small share of the tokens, but a very important one.
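As a rough illustration of that proportion, the sketch below compares the size of a typed question with the context attached to it. The numbers are made-up placeholders, not measurements, and `context_share` is a hypothetical helper name.

```python
def context_share(typed_tokens: int, context_tokens: int) -> float:
    """Fraction of input tokens that come from automatically added context."""
    total = typed_tokens + context_tokens
    if total == 0:
        return 0.0
    return context_tokens / total


# Placeholder numbers for illustration only, not measured values.
typed_tokens = 20        # a short chat question
context_tokens = 5_000   # open files, instructions, related code

share = context_share(typed_tokens, context_tokens)
print(f"Context share of input: {share:.1%}")
```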
What should you benchmark?
My goal was not just to measure tokens, but to understand efficiency:
- How many tokens are used per task?
- How many prompts are needed to reach a correct solution?
- Does more context improve results or just increase cost?
I designed three test cases:
- Simple task: Change an icon in an existing component, including updating the related test. Clear file paths provided. Expected to complete in one prompt.
- Medium task: Modify multiple files, follow existing patterns, and place a new component correctly. Tested with a short, precise prompt and a longer "rubber-ducking" prompt with extra context and reasoning.
- Vague task: Build a new view using existing patterns and mock data. Again tested with both short and verbose prompts.
The idea was to run the same tasks repeatedly while adjusting project instructions, file structure, and context to see how results and token usage changed.
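One way to keep such repeated runs organised is to enumerate the combinations up front. The sketch below is an assumption about how that could look, not the setup actually used: the `BenchmarkCase` class, the "direct" style label for the simple task, and the setup labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    task: str          # which of the three tasks is being run
    prompt_style: str  # how the prompt is written
    setup: str         # hypothetical label for the project configuration under test


# Prompt styles per task, following the task descriptions. The style label
# for the simple task ("direct") is an assumption.
PROMPT_STYLES = {
    "simple": ["direct"],
    "medium": ["short", "rubber-ducking"],
    "vague": ["short", "rubber-ducking"],
}

# Hypothetical setup labels; replace with whatever variations you compare.
SETUPS = ["baseline", "adjusted-instructions"]


def build_matrix() -> list[BenchmarkCase]:
    """Enumerate every task/style/setup combination to run."""
    return [
        BenchmarkCase(task, style, setup)
        for task, styles in PROMPT_STYLES.items()
        for style in styles
        for setup in SETUPS
    ]


for case in build_matrix():
    print(case)
```

Listing the cases explicitly makes it easier to spot which combinations have not been run yet.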
Where to find token data in VS Code
This turned out to be the trickiest part.
Log files exist (e.g. under /User/workspaceStorage/), but they're not very usable directly. Most third-party extensions I found focused on estimated cost rather than actual token usage. Some used overly simplistic calculations.
The most reliable method I found was the Chat Debug View in VS Code. It provides detailed data for each request, shown as a series of blocks.
Each block represents a request/response cycle. A single user prompt can trigger multiple such cycles depending on how the tool operates.
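Because one prompt can fan out into several cycles, the per-prompt cost is the sum over all of them. Here is a minimal Python sketch of that aggregation. The `RequestCycle` class and its field names are hypothetical and do not reflect the actual format of the debug view, and the values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RequestCycle:
    prompt_tokens: int      # hypothetical field: input tokens for this request
    completion_tokens: int  # hypothetical field: output tokens for this request


def total_for_prompt(cycles: list[RequestCycle]) -> dict[str, int]:
    """Sum token counts across all cycles triggered by one user prompt."""
    prompt = sum(c.prompt_tokens for c in cycles)
    completion = sum(c.completion_tokens for c in cycles)
    return {
        "requests": len(cycles),
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }


# Placeholder values for illustration only, not measured data.
cycles = [
    RequestCycle(prompt_tokens=4_000, completion_tokens=150),
    RequestCycle(prompt_tokens=4_300, completion_tokens=400),
]
print(total_for_prompt(cycles))
```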
How I processed the data
Since there's no built-in way to analyse this easily, I created a small Node.js script.
The workflow:
- Copy debug blocks into files (e.g. temp/sessions/session-1.md)
- Run a script that:
  - Asks for task name
  - Labels it (simple/complex)
  - Records success rate
  - Counts number of prompts
  - Adds notes
  - Outputs everything into a CSV file for analysis in Excel
It's quite manual, but it works. The script also combines all those blocks and sums them into one CSV row, so it is not too bad.
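The original script was written in Node.js and is not shown here. As a rough idea of the final step, this Python sketch writes one run as a single CSV row. The column names, the `append_row` helper, and the sample values are assumptions, and the interactive questions are replaced by a hard-coded dictionary to keep the example short.

```python
import csv
import io

# Hypothetical column names; the original script's columns are not shown.
FIELDS = ["task", "label", "success", "prompts", "total_tokens", "notes"]


def append_row(stream, row: dict, write_header: bool = False) -> None:
    """Write one benchmark run as a single CSV row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# Placeholder values for illustration only.
run = {
    "task": "change-icon",
    "label": "simple",
    "success": True,
    "prompts": 1,
    "total_tokens": 8850,
    "notes": "completed in one prompt",
}

# An in-memory buffer keeps the example self-contained; a real run would
# pass a file opened with newline="" instead.
buffer = io.StringIO()
append_row(buffer, run, write_header=True)
print(buffer.getvalue())
```

One row per run keeps the spreadsheet flat, which makes filtering by task or label straightforward.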
Key finding
The most interesting result came from comparing short prompts vs "rubber-ducking" prompts.
For the same task:
- The short, minimal prompt produced a flawed result:
  - Incorrect structure
  - Missing tests
  - Required follow-up prompts
- The longer, more detailed prompt:
  - Solved everything in one go
  - Required no fixes
  - Used fewer tokens overall
This highlights an important point: token efficiency is not about writing the shortest prompt.
Short prompts may reduce the number of input tokens but often lead to worse results and more follow-ups. Those extra iterations increase total token usage. Clear, slightly longer prompts can reduce total cost by improving first-pass accuracy.
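The arithmetic behind this is simple: the cost of a task is the sum over every prompt needed to finish it, not the cost of the first prompt alone. The sketch below uses invented placeholder numbers, not the measurements from these experiments, and `total_tokens` is a hypothetical helper.

```python
def total_tokens(iterations: list[tuple[int, int]]) -> int:
    """Sum (input, output) token pairs over every prompt needed to finish a task."""
    return sum(inp + out for inp, out in iterations)


# Illustrative placeholder numbers, NOT the measurements from these experiments.
short_prompt_run = [
    (5_000, 800),  # first attempt with a minimal prompt
    (6_000, 600),  # follow-up to fix the structure
    (6_500, 500),  # follow-up to add missing tests
]
detailed_prompt_run = [
    (5_400, 1_000),  # one longer prompt, correct on the first pass
]

print("short prompt + follow-ups:", total_tokens(short_prompt_run))
print("detailed prompt, one pass:", total_tokens(detailed_prompt_run))
```

With numbers shaped like these, each follow-up resends a large context, so a few extra input tokens in the first prompt are cheap by comparison.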
Admittedly, I only did a few runs, so this is not conclusive, but I saw a massive difference in output for the same task depending on the prompt style.
Side note: this was not a "garbage in, garbage out" experiment. Both prompts contained complete information and hints for solving the task.
Final thoughts
This is still a fairly manual and imperfect process. There's definitely room for better tooling and automation.
However, even this basic setup is enough to start answering useful questions:
- Are your project instructions helping or adding noise?
- Is extra context improving results or just increasing cost?
- Are you optimising for fewer tokens, or fewer iterations?
Understanding this can make a real difference in both cost and output quality.
Here is the post which helped me understand how to count those little buggers: Decoding Copilot Token Costs Using VS Code by Ken Muse.