AI Agent Cost Breakdown: Tokens, Inference, Embeddings, and API Spend

Keito Team
6 April 2026 · 12 min read

Full cost breakdown of AI agents — tokens, inference, embeddings, API calls, and monitoring. Know where every pound goes.

AI Agent Cost & Billing

AI agent costs break down into six components: token costs (30–40% of total spend), inference compute (15–20%), embeddings and vector storage (8–18%), tool and API calls (10–15%), fine-tuning (10–15%), and monitoring (3–5%). Token costs get the headlines, but they are less than half the total.

When your firm’s monthly AI bill arrives showing £7,200, most of that is not tokens. Inference compute, embedding storage, tool calls, and fine-tuning collectively account for 60–70% of the total. Yet according to Gartner’s 2026 AI Cost Management Report, 78% of organisations only track token costs. They are missing the majority of their spend.

This guide breaks down every component of AI agent costs — what it is, what it costs, why it matters, and how to control it. If your firm tracks AI agent costs at the project level, understanding these components tells you where the money actually goes.

Key Takeaway: Tokens are only 30–40% of AI agent costs. Track all six components to see the full picture and find real savings.

What Makes Up the Cost of Running an AI Agent?

An AI agent’s cost is the sum of every resource it consumes while executing a task. Six components contribute to the total, each with different cost drivers and optimisation levers.

Cost Distribution Overview

| Component | % of Total Spend | Monthly Range (5,000 tasks) | Primary Cost Driver |
| --- | --- | --- | --- |
| Token/API costs | 30–40% | £960–£5,200 | Input/output volume, model choice |
| Inference compute | 15–20% | £480–£2,600 | GPU time, model size |
| Embeddings + vector DB | 8–18% | £256–£2,340 | Document volume, query frequency |
| Tool and API calls | 10–15% | £320–£1,950 | External service pricing |
| Fine-tuning | 10–15% | £320–£1,950 | Training data volume, frequency |
| Monitoring/observability | 3–5% | £96–£650 | Log volume, trace depth |
| Total | 100% | £3,200–£13,000 | |

(Each component range applies its percentage band to the total range, so the columns do not sum exactly.)

These ranges reflect a mid-tier professional services deployment — a firm running AI agents across five to ten client engagements simultaneously. Smaller firms will sit at the lower end. Firms with heavy research or document processing workloads will push toward the upper end.

The distribution is not fixed. A firm doing mostly document review will have higher embedding costs. A firm running complex multi-step reasoning agents will have higher token and inference costs. Understanding your firm’s specific distribution is the first step toward targeted cost reduction.

How Do Token Costs Work?

Token costs are the most visible component of AI spend. They appear directly on provider invoices and are the easiest to understand — which is partly why they get disproportionate attention.

What Is a Token?

A token is a sub-word unit that language models use to process text. Roughly four characters of English text equal one token. Depending on the tokeniser, a word like “professional” may be a single token or split into several, and the sentence “The contract review is complete” comes to around six to eight tokens.

Every request to a language model consumes tokens in three ways:

  • Input tokens: The prompt, system instructions, and context you send to the model
  • Output tokens: The response the model generates
  • Reasoning tokens: Internal thinking steps in reasoning models (not visible in the output but still billed)
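For budgeting purposes, the four-characters-per-token rule of thumb can be turned into a quick estimator. A minimal sketch; real tokenisers split text differently per model, so treat the result as an approximation, not an exact count:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    Real tokenisers vary by model; this is a budgeting approximation,
    not an exact count.
    """
    return max(1, round(len(text) / 4))

# A 31-character sentence estimates to roughly 8 tokens
estimate_tokens("The contract review is complete")
```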

Why Output Tokens Cost More

Output tokens cost three to five times more than input tokens across most providers. In April 2026, typical rates for a frontier model are £2.50 per million input tokens and £10 per million output tokens.

The cost asymmetry exists because generating output requires more compute than processing input. The model must predict each token sequentially, running a full inference pass for every word it writes. Processing input is partially parallelisable.

For professional services firms, this means tasks that generate long outputs (document drafts, research reports, code) cost significantly more per token than tasks that process long inputs (document review, classification, summarisation with short outputs).
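The asymmetry is easy to quantify. A minimal sketch using the indicative frontier-model rates from the text (£2.50 input, £10 output per million tokens; not any provider's live price list):

```python
def request_cost_gbp(input_tokens: int, output_tokens: int,
                     in_rate: float = 2.50, out_rate: float = 10.00) -> float:
    """Cost of one request in pounds.

    Default rates are the indicative frontier-model figures from the
    text, in GBP per million tokens (assumed, not a live price list).
    """
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Long-input review task vs long-output drafting task
review = request_cost_gbp(10_000, 300)  # heavy input, short output
draft = request_cost_gbp(500, 5_000)    # short input, long output
```

With these assumed volumes, the drafting task costs nearly double the review task despite consuming half as many total tokens, because its spend sits on the expensive output side.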

Reasoning Tokens: The Hidden Multiplier

Reasoning models — those that “think” before responding — consume five to twenty times more tokens per request than standard models. The reasoning happens in internal token sequences that do not appear in the visible output but still incur charges.

A standard model might use 500 input tokens and 200 output tokens to answer a question. A reasoning model might use the same 500 input tokens, generate 3,000 internal reasoning tokens, and produce 200 output tokens. The total token count increases more than fivefold, and because reasoning tokens bill at output-level rates, the cost increase is larger still, even though the visible output is identical.

This makes reasoning models powerful but expensive. Use them for complex analytical tasks where the reasoning quality justifies the cost. Use standard models for routine work.

Context Window Impact

Every request includes a system prompt and any context provided. A 2,000-token system prompt is charged on every single request. Over 1,000 daily requests, that is 2 million tokens — roughly £5 per day — just for the system prompt.

Prompt caching reduces this. Many providers now cache frequently repeated prompt prefixes, charging cached tokens at 75–90% less than uncached tokens. If your agents use consistent system prompts, caching can reduce token costs by 15–25%.
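The system-prompt arithmetic above can be sketched as a helper that also models the caching discount (0.75–0.90 per the text; exact discounts vary by provider):

```python
def daily_system_prompt_cost(prompt_tokens: int = 2_000,
                             requests_per_day: int = 1_000,
                             rate_per_m: float = 2.50,
                             cached_discount: float = 0.0) -> float:
    """Daily GBP cost of re-sending a fixed system prompt on every request.

    cached_discount is the fraction saved on cached prefix tokens
    (0.75-0.90 per the text; assumed, varies by provider).
    """
    full = prompt_tokens * requests_per_day * rate_per_m / 1_000_000
    return full * (1 - cached_discount)

daily_system_prompt_cost()                      # £5/day uncached
daily_system_prompt_cost(cached_discount=0.90)  # £0.50/day with a 90% discount
```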

Token Cost by Provider (April 2026 Indicative Rates)

| Provider Category | Input (per 1M tokens) | Output (per 1M tokens) | Reasoning (per 1M tokens) |
| --- | --- | --- | --- |
| Frontier models | £2.00–£3.00 | £8.00–£15.00 | £12.00–£20.00 |
| Mid-tier models | £0.50–£1.50 | £1.50–£5.00 | N/A |
| Lightweight models | £0.05–£0.25 | £0.15–£0.75 | N/A |
| Open-weight (self-hosted) | Compute cost only | Compute cost only | Compute cost only |

The right model depends on the task. A token cost calculator can help estimate spend by model and task type. Most firms benefit from routing simple tasks to lightweight models and reserving frontier models for complex work.
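A routing layer need not be elaborate. A minimal sketch, with placeholder tier names, mid-range rates from the table above, and an assumed task taxonomy:

```python
# Indicative per-million GBP rates taken from the table above;
# tier labels are placeholders, not real model names.
TIER_RATES = {
    "lightweight": {"input": 0.15, "output": 0.45},
    "mid-tier":    {"input": 1.00, "output": 3.00},
    "frontier":    {"input": 2.50, "output": 10.00},
}

def route_model(task_type: str) -> str:
    """Crude routing policy: cheap models for routine work,
    frontier models only for complex analysis.

    The task taxonomy here is an assumption for illustration.
    """
    simple = {"classification", "extraction", "summarisation"}
    moderate = {"drafting", "correspondence"}
    if task_type in simple:
        return "lightweight"
    if task_type in moderate:
        return "mid-tier"
    return "frontier"
```

In production this decision would sit in front of the LLM client, so every request passes through the policy before a model is chosen.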

What Are Inference and Compute Costs?

Inference is the computation required to process your request — the GPU time spent running the model. When you pay for tokens, you are partly paying for inference. But inference costs also include overhead that is not captured in per-token pricing.

What Inference Covers

  • GPU allocation: The physical hardware processing your request
  • Queue and scheduling: Time spent waiting for available compute
  • Memory: Loading model weights and context into GPU memory
  • Networking: Data transfer between components in distributed inference

For API-based providers, inference costs are bundled into per-token pricing. You do not see a separate “inference” line item. But the cost is embedded — it is why output tokens cost more than input tokens (they require sequential GPU passes).

Real-Time vs Batch Inference

Real-time inference processes requests immediately. Batch inference queues requests and processes them during off-peak periods.

Batch inference typically costs 40–60% less than real-time. The trade-off is latency — batch results might take minutes or hours instead of seconds.

For professional services, batch processing works for overnight document review, scheduled report generation, and non-urgent research. Real-time processing is necessary for interactive workflows and time-sensitive client deliverables.

Self-Hosted vs API Inference

Self-hosting models on your own infrastructure makes financial sense at scale. The breakeven point varies, but industry benchmarks suggest self-hosting becomes cost-effective at approximately 500,000 to 1,000,000 requests per month for mid-tier models.

Below that threshold, API pricing is more cost-effective because you avoid the capital expenditure and operational overhead of managing GPU infrastructure. Most professional services firms should use API providers unless their volume justifies the investment.
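The breakeven comparison is a simple fixed-versus-variable cost calculation. A sketch with assumed figures; it ignores engineering time and idle-capacity waste, which in practice push the true breakeven higher:

```python
def self_host_breakeven(monthly_infra_cost_gbp: float,
                        api_cost_per_request_gbp: float) -> float:
    """Requests per month at which fixed infrastructure spend equals
    the equivalent API spend. Simplified: excludes staffing and the
    cost of idle GPU capacity.
    """
    return monthly_infra_cost_gbp / api_cost_per_request_gbp
```

With an assumed £5,000 a month of GPU capacity against £0.008 per API request, the breakeven lands at 625,000 requests, inside the 500,000 to 1,000,000 range above.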

What Do Embeddings and Vector Databases Cost?

Embeddings power retrieval-augmented generation (RAG) — the technique that lets AI agents search your firm’s documents and knowledge base. This component has two cost layers: creating the embeddings and storing/querying them.

Embedding Model Costs

Embedding models convert text into numerical vectors. They are priced per token, like generation models, but at significantly lower rates — typically £0.01–£0.10 per million tokens.

The cost comes from volume. A firm with 50,000 documents needs to embed all of them. If each document averages 2,000 tokens, that is 100 million tokens to embed — a one-time cost of £1–£10. Re-embedding happens when documents change or when you switch embedding models.

Ongoing costs come from embedding queries. Every time an agent searches the knowledge base, the query is embedded. At 1,000 queries per day, embedding costs are negligible — a few pence.
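Both the one-time corpus embed and the ongoing query embeds reduce to the same tokens-times-rate calculation. A sketch using an assumed £0.05 per million midpoint from the £0.01–£0.10 range above:

```python
def embed_cost_gbp(total_tokens: int, rate_per_m: float = 0.05) -> float:
    """GBP cost of embedding a token volume.

    rate_per_m is GBP per million tokens; 0.05 is an assumed midpoint
    of the text's 0.01-0.10 range.
    """
    return total_tokens * rate_per_m / 1_000_000

# One-time corpus embed: 50,000 docs x 2,000 tokens = 100M tokens
corpus = embed_cost_gbp(50_000 * 2_000)  # ~£5 at the midpoint rate
# Daily queries: 1,000 searches x ~20 tokens each
queries = embed_cost_gbp(1_000 * 20)     # a fraction of a penny
```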

Vector Database Costs

Vector databases store embeddings and serve similarity queries. Costs include:

  • Storage: £0.10–£0.50 per GB per month. A firm with 50,000 embedded documents might need 5–20 GB, costing £0.50–£10 per month.
  • Query fees: £0.01–£0.10 per 1,000 queries. At 30,000 queries per month, that is £0.30–£3.
  • Compute: Dedicated instances for low-latency queries range from £25 to £1,750 per month depending on scale and performance requirements.

For most professional services firms, vector database costs are modest — £50–£500 per month. Costs escalate for firms with large document corpora or high query volumes.
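The three line items combine into a simple monthly estimate. A sketch with assumed mid-range rates from the figures above:

```python
def vector_db_monthly_gbp(storage_gb: float, queries_per_month: int,
                          storage_rate: float = 0.30,
                          query_rate_per_k: float = 0.05,
                          compute_gbp: float = 0.0) -> float:
    """Monthly vector database cost estimate.

    Default rates are assumed midpoints of the ranges above:
    storage_rate in GBP/GB/month, query_rate_per_k in GBP per 1,000
    queries. compute_gbp is any fixed dedicated-instance cost.
    """
    return (storage_gb * storage_rate
            + (queries_per_month / 1_000) * query_rate_per_k
            + compute_gbp)

# Assumed example: 10 GB, 30,000 queries, a £100/month instance
vector_db_monthly_gbp(10, 30_000, compute_gbp=100.0)
```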

Optimisation Strategies

Batch embedding updates rather than re-embedding individual documents reduces API calls. Hierarchical chunking — creating summaries of document sections rather than embedding every paragraph — reduces storage and query costs. Query caching prevents repeated embedding of identical search queries.
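Query caching can be as simple as memoising the embedding call. A minimal sketch in which the embedding function is a toy stand-in for a real, billed provider call:

```python
from functools import lru_cache

calls = {"embed": 0}  # counts how many "billed" embedding calls occur

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple:
    """Stand-in for a billed embedding API call (a real provider call
    would go here). lru_cache ensures an identical search string is
    embedded, and billed, only once per process.
    """
    calls["embed"] += 1
    return tuple(float(ord(c)) for c in query)  # toy vector, not a real embedding

embed_query("force majeure clause")
embed_query("force majeure clause")  # served from cache; no second billed call
```

A production version would use a shared cache (e.g. Redis) keyed on a normalised query string, so the saving holds across processes and restarts.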

What Do Tool and API Calls Cost?

AI agents do not just think — they act. Tool calls are how agents interact with the outside world: searching the web, executing code, querying databases, and calling third-party services.

The Compounding Effect

A single agent task might trigger five to twenty tool calls. A research task could involve:

  1. Three web search queries (£0.005–£0.02 each)
  2. Five webpage retrievals (£0.001–£0.01 each)
  3. Two code execution runs (£0.01–£0.05 each)
  4. One database query (£0.001–£0.005)
  5. One file write operation (£0.001)

Individual call costs are small. But multiply by hundreds of tasks per day, and tool calls accumulate to 10–15% of total spend.
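The compounding is just counts times rates, summed per task. A sketch using assumed midpoints of the per-call ranges in the list above:

```python
# Midpoints of the per-call GBP ranges listed above (assumed figures).
RESEARCH_TASK_CALLS = [
    (3, 0.0125),  # web search queries
    (5, 0.0055),  # webpage retrievals
    (2, 0.0300),  # code execution runs
    (1, 0.0030),  # database query
    (1, 0.0010),  # file write
]

def task_tool_cost_gbp(calls=RESEARCH_TASK_CALLS) -> float:
    """Total tool spend for one task: call count x per-call rate, summed."""
    return sum(n * rate for n, rate in calls)
```

At these assumed rates one research task costs about £0.13 in tool calls; at 500 tasks a day that is roughly £65 a day from tooling alone.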

Common Tool Call Costs

| Tool Type | Typical Cost Per Call | Daily Volume (Mid-Tier) | Monthly Cost |
| --- | --- | --- | --- |
| Web search | £0.005–£0.02 | 200–500 | £30–£300 |
| Code execution | £0.01–£0.05 | 50–200 | £15–£300 |
| Database queries | £0.001–£0.005 | 500–2,000 | £15–£300 |
| Document retrieval | £0.001–£0.01 | 300–1,000 | £9–£300 |
| External API calls | £0.01–£0.10 | 50–200 | £15–£600 |

Integration and Maintenance Costs

Beyond per-call fees, tool integrations require maintenance. Connectors to CRM, PSA, ERP, and billing systems need updating when APIs change. Authentication tokens need refreshing. Error handling needs monitoring.

These are operational costs that do not appear on any API bill but consume engineering time. Factor them into total cost of ownership calculations.

What About Fine-Tuning and Model Adaptation?

Fine-tuning adapts a general-purpose model to your firm’s specific needs — its terminology, writing style, domain knowledge, and quality standards.

Fine-Tuning Cost Structure

Fine-tuning costs include:

  • Training compute: GPU time to process your training data. A fine-tuning run with 10,000 examples typically costs £20–£200 depending on model size and provider.
  • Training data preparation: Curating, formatting, and validating training examples. This is primarily a human cost, not an API cost.
  • Iteration cycles: Most fine-tuning requires three to five iterations to achieve acceptable quality. Multiply the per-run cost accordingly.
  • Ongoing retraining: Models need periodic retraining as your firm’s knowledge evolves. Monthly or quarterly retraining runs add recurring cost.

When Fine-Tuning Is Worth the Cost

Fine-tuning makes sense when:

  • Your agents handle domain-specific tasks where general models underperform
  • You need consistent output formatting or tone
  • You can reduce prompt length (and therefore token costs) by encoding instructions into the model

It does not make sense when prompt engineering achieves the same result, or when your task volume is too low to justify the investment.

For professional services firms, fine-tuning is most valuable for document generation tasks where output must match firm standards — client reports, legal memos, audit findings.

How Do You Track Each Cost Component?

Tracking aggregate AI spend is a starting point. Tracking component-level costs is where cost control begins.

Instrument at the API Layer

Most LLM providers return cost data in API responses — token counts, model used, processing time. Capture this data on every request. Store it alongside attribution metadata (client, project, task).
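Capture can be a thin wrapper around the provider response. A sketch in which the usage field names mirror what most providers return but are illustrative, not any specific provider's schema:

```python
import json
import time

def record_usage_event(usage: dict, client: str, project: str, task: str) -> str:
    """Serialise one request's cost data with attribution metadata.

    The 'usage' dict mimics the usage object most providers return;
    the field names here are illustrative, not a specific provider's
    schema.
    """
    event = {
        "timestamp": time.time(),
        "client": client,
        "project": project,
        "task": task,
        "model": usage.get("model"),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
    }
    return json.dumps(event)  # in practice: append to an event log or queue
```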

Separate Tool Call Costs

Tool calls often go through different billing systems than LLM calls. Web search has its own billing. Code execution has its own billing. Aggregate these into your central cost tracking system.

Tag each tool call with the same attribution metadata used for LLM calls. This ensures tool costs flow into the same client, project, and task buckets as token and inference costs.

Build Component-Level Reports

Finance and operations teams need different views:

  • Finance: Total cost by component, per client, per month — for billing and profitability analysis
  • Operations: Cost per component per agent — for identifying which agents are expensive and why
  • Engineering: Per-request cost breakdown — for optimising prompts, model selection, and tool usage

These reports surface hidden costs that aggregate tracking misses. A firm might discover that 25% of its total AI spend comes from tool calls in a single agent, pointing to an optimisation opportunity.

Automate Cost Aggregation

Manual cost tracking breaks down at scale. Automate the pipeline: capture raw cost events, enrich with attribution metadata, aggregate by component, client, project, and time period, and push to dashboards and billing systems.
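The aggregation step of that pipeline can be sketched in a few lines, assuming an illustrative event schema (dicts with client, component, and cost keys, which is not any specific tool's format):

```python
from collections import defaultdict

def aggregate_costs(events):
    """Roll raw cost events up by (client, component).

    Each event is a dict with 'client', 'component', and 'cost' keys,
    an illustrative schema for this sketch.
    """
    totals = defaultdict(float)
    for e in events:
        totals[(e["client"], e["component"])] += e["cost"]
    return dict(totals)
```

The same grouping would be extended with project, task, and time-period keys before pushing to dashboards and billing systems.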

The goal is a system where component-level cost data is available within minutes of an agent completing a task — not at the end of the month when the bill arrives.

Frequently Asked Questions

What are the main cost components of an AI agent?

Six components: token/API costs (30–40% of total), inference compute (15–20%), embeddings and vector databases (8–18%), tool and API calls (10–15%), fine-tuning and model adaptation (10–15%), and monitoring/observability (3–5%). A mid-tier agent handling 5,000 tasks per month costs £3,200–£13,000.

How much do AI tokens cost compared to inference?

Token costs represent 30–40% of total AI agent spend. Inference compute represents 15–20%. However, for API-based providers, inference costs are partly embedded in per-token pricing. The distinction matters most for self-hosted deployments where token processing and GPU compute are separate cost lines.

What percentage of AI agent costs are token costs?

Token costs account for 30–40% of total AI agent spend. This means 60–70% of costs come from other components — inference, embeddings, tool calls, fine-tuning, and monitoring. Firms that only track token costs miss the majority of their AI spend.

How much do vector databases cost for AI agents?

Vector database costs range from £50 to £500 per month for most professional services firms. This includes storage (£0.10–£0.50 per GB), query fees (£0.01–£0.10 per 1,000 queries), and compute for low-latency queries (£25–£1,750/month at scale). Costs depend on document volume and query frequency.

What are AI agent tool call costs?

Tool calls are how agents interact with external services — web searches, code execution, database queries, and API integrations. Individual calls cost £0.001–£0.10 each, but a single task can trigger five to twenty calls. Tool calls typically represent 10–15% of total agent spend.

How can firms reduce AI agent costs?

Target each component: use lighter models for routine tasks (token savings), batch non-urgent work (inference savings), cache frequent queries (embedding savings), limit unnecessary tool calls (API savings), and fine-tune only when prompt engineering is insufficient. Component-level tracking reveals where the biggest savings opportunities exist.


Keito tracks AI agent costs at the component level — tokens, inference, embeddings, tool calls — attributed to clients and projects. See Cost Breakdowns →

Know exactly what your AI agents cost

Real-time cost tracking, client billing, and profitability analysis.