How to Count Tokens in VS Code (and Why It Matters)

June, 2026

Why should you care about tokens? If you want to improve your AI workflow, reduce costs, and get better results from each prompt, tokens are the core metric to understand.

GitHub Copilot now uses a usage-based pricing model, where cost is tied directly to token consumption. That sounds simple in theory, but in practice it raises a few important questions: how do you actually measure token usage, what should you optimise, and does reducing tokens hurt output quality?

I ran a series of small experiments in VS Code to explore this in a practical way.

What are tokens?

Tokens are the chunks of text that models use to process input and generate output. There is no fixed ratio, such as three letters always equalling one token. A token can be a whole word, part of a word, or a symbol.

In tooling like Copilot or APIs, you'll typically see usage broken down into:

  • Input (prompt) tokens: Everything sent to the model, including your prompt, system instructions, and any additional context (files, docs, etc.)
  • Output (completion) tokens: The model's response, including generated code or explanations
  • Cached tokens: Previously processed context reused at a reduced cost (provider-dependent)
  • Reasoning tokens: Internal tokens used by some models; not always visible or billed separately

A rough rule of thumb is ~3–4 characters per token in English, but this varies widely, especially in code.
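
To make the rule of thumb concrete, here is a minimal sketch of a character-based estimator. The function names and the default ratio are my own assumptions for illustration. It is a heuristic, not a tokenizer, so it will not match the counts a real model reports, especially for code.

```javascript
// Heuristic only: real tokenizers are model-specific and give different counts.
// charsPerToken is an assumed average (the rule of thumb is roughly 3-4 for English).
// Note: text.length counts UTF-16 code units, which is close enough for an estimate.
function estimateTokens(text, charsPerToken = 4) {
  if (charsPerToken <= 0) {
    throw new RangeError("charsPerToken must be positive");
  }
  return Math.ceil(text.length / charsPerToken);
}

// Returns a low/high range by applying both ends of the 3-4 rule of thumb.
function estimateTokenRange(text) {
  return {
    low: estimateTokens(text, 4), // more characters per token means fewer tokens
    high: estimateTokens(text, 3),
  };
}
```

Treat the result as a ballpark figure for comparing prompts with each other, not as something to calculate a bill from.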

Where do tokens actually come from?

One important realisation is that what you type in the chat is only a small part of the total input.

In VS Code, tools like Copilot Chat may automatically include:

  • Open files
  • Related files in the project
  • Instruction or "skills" documents
  • Nearby code or dependencies

This means your short question can result in a very large prompt under the hood. In many cases, this additional context makes up the majority of input tokens.

That said, your chat input still counts toward the total. It is usually a tiny share of the tokens, but a very important one.

What should you benchmark?

My goal was not just to measure tokens, but to understand efficiency:

  • How many tokens are used per task?
  • How many prompts are needed to reach a correct solution?
  • Does more context improve results or just increase cost?

I designed three test cases:

  1. Simple task
    Change an icon in an existing component, including updating the related test. Clear file paths provided. Expected to complete in one prompt.
  2. Medium task
    Modify multiple files, follow existing patterns, and place a new component correctly. Tested with a short, precise prompt and a longer "rubber-ducking" prompt with extra context and reasoning.
  3. Vague task
    Build a new view using existing patterns and mock data. Again tested with both short and verbose prompts.

The idea was to run the same tasks repeatedly while adjusting project instructions, file structure, and context to see how results and token usage changed.

Where to find token data in VS Code

This turned out to be the trickiest part.

Log files exist (e.g. under /User/workspaceStorage/), but they're not very usable directly. Most third-party extensions I found focused on estimated cost rather than actual token usage. Some used overly simplistic calculations.

The most reliable method I found was the Chat Debug View in VS Code. It provides detailed request data, including:

requestType      : ChatMessages
model            : claude-sonnet-4.5
maxPromptTokens  : 167997
maxResponseTokens: 32000
location         : 7
otherOptions     : {"stream":true}
intent           : undefined
startTime        : 2026-06-***
endTime          : 2026-06-***
duration         : 7765ms
ourRequestId     : ***
requestId        : ***
serverRequestId  : ***
timeToFirstToken : 2215ms
resolved model   : claude-sonnet
usage            : {
  "prompt_tokens":32956,
  "completion_tokens":332,
  "total_tokens":33288,
  "prompt_tokens_details": {}
}
copilotUsage    : 12.86 AIC (12855825000 nano-AIU)

Each block represents a request/response cycle. A single user prompt can trigger multiple such cycles depending on how the tool operates.
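
A block like this can be turned into structured data with a few lines of JavaScript. The sketch below assumes the block is pasted as plain text in the `key : value` layout shown above. The function names are hypothetical, and the layout is simply what the debug view printed for me, not a documented format, so expect to adjust the patterns.

```javascript
// Drop lines that contain only a number. These are viewer line numbers that can
// end up in the text when a block is copied from a rendered page.
function stripLineNumbers(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !/^\s*\d+\s*$/.test(line))
    .join("\n");
}

// Extract the JSON object that follows "usage :" by matching braces.
// Assumes there are no braces inside JSON string values.
// JSON.parse throws if the pasted block is malformed.
function extractUsage(text) {
  const match = /^usage\s*:\s*\{/m.exec(text);
  if (!match) return null;
  const start = match.index + match[0].length - 1; // index of the opening "{"
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}") {
      depth--;
      if (depth === 0) return JSON.parse(text.slice(start, i + 1));
    }
  }
  return null; // unbalanced braces: the block was probably truncated
}

// Turn one pasted debug block into a small record. Missing values become 0/null.
function parseDebugBlock(rawText) {
  const text = stripLineNumbers(rawText);
  const usage = extractUsage(text);
  const model = /^model\s*:\s*(\S+)/m.exec(text);
  return {
    model: model ? model[1] : null,
    promptTokens: usage?.prompt_tokens ?? 0,
    completionTokens: usage?.completion_tokens ?? 0,
    totalTokens: usage?.total_tokens ?? 0,
  };
}
```

Anchoring the patterns to the start of a line keeps `model` from matching the `resolved model` line and `usage` from matching `copilotUsage`.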

How I processed the data

Since there's no built-in way to analyse this easily, I created a small Node.js script.

The workflow:

  • Copy debug blocks into files (e.g. temp/sessions/session-1.md)
  • Run a script that:
    • Asks for task name
    • Labels it (simple/complex)
    • Records success rate
    • Counts number of prompts
    • Adds notes
  • Outputs everything into a CSV file for analysis in Excel

It's quite manual, but it works. The script also sums all the blocks from a session into a single CSV row automatically, so it is not too bad.
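
The summing and CSV step could look roughly like the sketch below. It is not the actual script: the column names, the record shape (`promptTokens`, `completionTokens`, `totalTokens`) and the function names are all assumptions made for this illustration.

```javascript
// Sum the usage of every request/response cycle that belongs to one session.
// Each item is assumed to look like { promptTokens, completionTokens, totalTokens }.
function summariseSession(requests) {
  return requests.reduce(
    (sum, r) => ({
      requests: sum.requests + 1,
      promptTokens: sum.promptTokens + r.promptTokens,
      completionTokens: sum.completionTokens + r.completionTokens,
      totalTokens: sum.totalTokens + r.totalTokens,
    }),
    { requests: 0, promptTokens: 0, completionTokens: 0, totalTokens: 0 }
  );
}

// Quote a field when it contains a comma, a double quote or a line break,
// doubling any embedded quotes (the usual CSV convention, which Excel accepts).
function csvField(value) {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Hypothetical column order: manual answers first, then the summed token counts.
const CSV_HEADER = [
  "task", "label", "success", "prompts",
  "requests", "promptTokens", "completionTokens", "totalTokens", "notes",
];

// meta holds the manually entered answers; summary comes from summariseSession.
function toCsvRow(meta, summary) {
  const record = { ...meta, ...summary };
  return CSV_HEADER.map((key) => csvField(record[key] ?? "")).join(",");
}
```

Keeping `prompts` (what you typed) separate from `requests` (cycles the tool made) matters, because one prompt can trigger several requests.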

Key finding

The most interesting result came from comparing short prompts vs "rubber-ducking" prompts.

For the same task:

  • The short, minimal prompt produced a flawed result:
    • Incorrect structure
    • Missing tests
    • Required follow-up prompts
  • The longer, more detailed prompt:
    • Solved everything in one go
    • Required no fixes
    • Used fewer tokens overall

This highlights an important point: token efficiency is not about writing the shortest prompt.

Short prompts may reduce the number of input tokens but often lead to worse results and more follow-ups. Those extra iterations increase total token usage. Clear, slightly longer prompts can reduce total cost by improving first-pass accuracy.
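
The arithmetic behind this is simple: the cost of a task is the sum over every prompt it took, not just the first one. The sketch below uses invented numbers purely to show the shape of the calculation. They are not measurements from my runs, and the field names are assumptions.

```javascript
// Total tokens for one attempt at a task: every prompt (the initial one plus
// any follow-ups) contributes its own input and output tokens.
function totalTokensForTask(prompts) {
  return prompts.reduce((sum, p) => sum + p.inputTokens + p.outputTokens, 0);
}

// INVENTED numbers for illustration only, not results from the experiments.
const shortPromptRun = [
  { inputTokens: 30000, outputTokens: 400 }, // short initial prompt
  { inputTokens: 31000, outputTokens: 500 }, // follow-up to fix structure
  { inputTokens: 32000, outputTokens: 300 }, // follow-up to add tests
];
const detailedPromptRun = [
  { inputTokens: 30500, outputTokens: 900 }, // longer prompt, solved in one go
];

console.log(totalTokensForTask(shortPromptRun)); // 94200
console.log(totalTokensForTask(detailedPromptRun)); // 31400
```

In this made-up example the extra 500 input tokens of the detailed prompt are tiny next to the cost of two follow-ups, since each follow-up sends a large context again.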

Obviously, I only did a few runs, so this is far from conclusive. Still, I saw a massive difference in output for the same task depending on the style of the prompt.

Side note: this was not a "garbage in, garbage out" experiment. Both prompts contained the complete information and hints needed to solve the task!

Final thoughts

This is still a fairly manual and imperfect process. There's definitely room for better tooling and automation.

However, even this basic setup is enough to start answering useful questions:

  • Are your project instructions helping or adding noise?
  • Is extra context improving results or just increasing cost?
  • Are you optimising for fewer tokens, or fewer iterations?

Understanding this can make a real difference in both cost and output quality.

Here is the post which helped me understand how to count those little buggers: Decoding Copilot Token Costs Using VS Code by Ken Muse.

Andris Švarcs

Somehow, I've survived over 15 years as a web developer without losing my interest in the craft. Quite the opposite: with so many great improvements in web standards, what was nearly impossible is now easy to build.

My career has been a wild ride through small agencies and big corporations, building everything from finance apps to health dashboards.

I'm that annoying person who needs to understand products beyond just slinging code. I ask questions like 'Why is this feature important?' and 'How will this improve the customer journey?' – you know, the kind of questions that make project managers reach for the pint aspirin. This curiosity has led me down the rabbit holes of design, accessibility, and SEO. Because apparently, making websites pretty, usable, and findable wasn't challenging enough on its own.

P.S. If this bio sounds too polished, blame my evil AI twin. I'm still working on teaching it sarcasm.

Copyright © since 2021, Andris Švarcs. All rights reserved.
