Token spend grew 3×.
Your AI roadmap didn’t.
Most LLM features ship without cost discipline. We audit where token spend is leaking, then engineer the routing, caching, prompt, and retrieval changes that take 30–80% out — without dropping answer quality or NPS. Eval-gated. Provider-agnostic.
Token budgets stop being a footnote at around $50K/month. By then the routing decisions, prompt patterns, retrieval shape, and model selection that worked at prototype scale are quietly burning enterprise cash. Cost optimization is engineering work — not vendor negotiation, not “use a cheaper model.” We instrument what you’re actually paying for, then ship the architecture changes that hold the savings.
Reductions, not benchmarks.
A cost audit
Every route, intent, and customer segment instrumented. Per-token costs attributed to features. Projected savings band before a line of code changes.
A routing layer
Cheaper model when context allows. Premium model only when eval shows it's needed. Per-request observable, per-tenant tunable, per-intent cost-capped.
Caching that actually hits
Exact-match + semantic + structured-response cache. Hit rate as a KPI on your dashboard, not a footnote in your runbook.
Prompt and context discipline
System prompts and few-shots audited and tightened. RAG that retrieves what the LLM will actually use, not what scored highly on a benchmark.
An eval harness gated on quality AND cost
Every routing or prompt change passes the same regression bar as a model upgrade. Cost-per-resolved-intent is a tracked metric, not a quarterly estimate.
Specifics, because ‘use a cheaper model’ isn’t a strategy.
- Token attribution
- Instrument what's actually spent per route, intent, customer, and feature. Every dollar attributed before any change is shipped.
- Model routing
- Provider-aware routing across OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, and open-weights via Together / Groq / Replicate. Failover and cost caps per provider.
- Prompt compression
- System prompts audited line-by-line. Few-shots replaced with structured output where it saves more. Token counts measured per intent, not eyeballed.
- Context window discipline
- Retrieve only what the LLM will use. Aggressive top-k tuning. Chunk-level relevance gates so the generation step isn't paying for noise.
- RAG cost reduction
- Smaller embedding models when corpus quality allows. Lower-tier vector store when latency permits. Hybrid scoring tuned to cut downstream LLM calls.
- Caching layer
- Exact-match + semantic + response-shape cache. TTLs tied to underlying data freshness. Cache invalidation owned, not hand-waved.
- Inference routing
- Bedrock and Vertex enterprise contracts honoured where price is right. Together / Groq for open-weights latency. Load-balanced with health checks and cost caps.
How it runs
A Tier-1 retailer's customer-service AI burned $480K/month in OpenAI tokens. We instrumented per-intent spend, routed 62% of queries to a smaller model behind eval gates, and added a response-shape cache. Spend fell to $115K/month over six weeks; CSAT moved up 4 points in the same window.
Tier-1 retailer · Customer service AI · OpenAI + open-weights routing
What buyers actually ask
What's the typical reduction range, and where will we land?
Will answer quality drop?
Does this work with our Azure OpenAI, Bedrock, or Vertex enterprise contracts?
What about open-weights routing?
How long until we see real savings?
What's NOT in scope?
How do you make sure the savings persist after handover?
Talk to an engineer, not a salesperson.
30 minutes. No slides. Bring an architecture, a stalled roadmap, or a vendor proposal you want a second opinion on. We'll tell you what we'd do.