ENGINEERING

Building an AI sales agent for $99/mo: the honest economics.

Written by Mohamed Wajahat

The question every founder asks first.

"How are you not losing money at $99 a month?"

It's the second question every prospect asks during demos, right after "what does it do?" And it's the right question to ask. The AI infrastructure layer is not cheap. Anthropic's API isn't free. Our customers run real conversations with real AI inference happening in real time. So how does the math work?

This is a longer post than usual because the answer is genuinely interesting. If you're building anything in the AI tooling space, the cost engineering matters as much as the product engineering.

The naive math, before optimization.

A typical ClientX conversation involves:

  • An initial system prompt with knowledge base, FAQs, agent personality, and guardrails (~10,000 tokens of context)

  • 5 to 12 message exchanges with the visitor

  • Each AI response is 200 to 400 tokens of output

  • End-of-conversation: visitor memory, handoff summary, deal intelligence, follow-up email draft (~4 additional Sonnet calls)

  • Async: suggestion chips after each response (Haiku), real-time enrichment (third-party APIs)

If you priced this naively, with no optimization, the numbers look bad:

  • Sonnet 4.5: $3 per million input tokens, $15 per million output tokens

  • Total per conversation: ~100,000 input tokens (the ~10,000-token context is resent on every exchange, plus the end-of-conversation calls) and ~4,400 output tokens

  • Cost: (100,000 × $3/M) + (4,400 × $15/M) ≈ $0.30 + $0.07 = $0.37 per conversation

500 conversations a month at $0.37 is $185 in API spend. On a $99 plan. We'd lose money on the first conversation.
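If you want to poke at these numbers yourself, the back-of-envelope math fits in a few lines of Python (the token counts are the rough averages above, not measured constants):

# Naive per-conversation cost at Sonnet 4.5 list prices.
SONNET_INPUT_PER_MTOK = 3.00    # USD per million input tokens
SONNET_OUTPUT_PER_MTOK = 15.00  # USD per million output tokens

input_tokens = 100_000   # ~10k-token context resent across the conversation
output_tokens = 4_400    # replies plus end-of-conversation calls

cost = (input_tokens / 1e6) * SONNET_INPUT_PER_MTOK + (output_tokens / 1e6) * SONNET_OUTPUT_PER_MTOK
print(f"${cost:.2f} per conversation")                  # ~$0.37
print(f"${cost * 500:.0f} per 500-conversation month")  # ~$185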

Obviously this isn't what we do. Here's how we got it down to roughly $0.05 to $0.08 per conversation.

Lever 1: Aggressive prompt caching.

Anthropic offers prompt caching that reduces cached input from $3 per million tokens to $0.30 per million. A 10x reduction on the parts of the prompt that don't change between calls.

The trick is structuring your prompt so the stable parts come first and the variable parts come last:

[CACHE BREAKPOINT]
- Agent identity and personality (stable)
- Guardrails (stable)
- FAQ content (stable per workspace)
- Knowledge base chunks (stable per workspace)
[/CACHE BREAKPOINT]

- Conversation history (variable per session)
- Current visitor message (variable per turn)

Roughly 80% of the input tokens live in the cached block. After the first cache miss in a session, every subsequent message hits the cache. In theory that works out to 0.2 × $3 + 0.8 × $0.30 = $0.84 per million input tokens instead of $3, a 72% cut; after cache writes (billed at a premium over the base input rate) and the occasional miss, it nets out to roughly a 65% reduction on input costs alone.

But verify that caching actually hits in production. Read the cache_read_input_tokens field on every API response. If it's consistently zero, your prompt structure is broken.
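Here's a minimal sketch of that structure with the Anthropic Python SDK. The placeholder strings and the model alias are ours for illustration, not ClientX's production code:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder workspace content; in the real product this comes from
# the workspace's saved configuration and knowledge base.
STABLE_CONTEXT = (
    "Agent identity and personality...\n"
    "Guardrails...\n"
    "FAQ content...\n"
    "Knowledge base chunks...\n"
)

conversation_history = []  # prior turns, variable per session
visitor_message = "How does your pricing work?"

response = client.messages.create(
    model="claude-sonnet-4-5",  # alias for illustration; pin a dated snapshot in production
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            # Cache breakpoint: everything up to and including this block
            # is cached and billed at the reduced read rate on later calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=conversation_history + [{"role": "user", "content": visitor_message}],
)

# The verification step: if cache_read_input_tokens is consistently zero,
# the "stable" prefix is changing between calls and the cache never hits.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)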

Lever 2: Right-sized model selection.

Not every AI task in ClientX needs Sonnet. Sonnet is great at reasoning, especially for the deal briefing where the model has to extract pain points and infer competitor evaluation from a conversation. But several tasks don't require reasoning at all:

  • Generating 3 follow-up suggestion chips after each AI reply

  • Summarizing a conversation as a 3-bullet visitor memory

  • Drafting a templated follow-up email

  • Matching natural language to a calendar slot

These are all good fits for Haiku 4.5, which costs $1 per million input tokens and $5 per million output. About 1/3 the cost of Sonnet for tasks where Haiku produces equivalent output.

We swapped 3 of the 4 end-of-conversation calls to Haiku. End-of-conversation cost dropped from $0.06 to $0.022 per conversation.
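In code, right-sizing can be as simple as a per-task model map. The task names and model aliases below are illustrative, not our internal identifiers:

# Only reasoning-heavy tasks pay Sonnet prices; everything else gets Haiku.
MODEL_FOR_TASK = {
    "chat_reply": "claude-sonnet-4-5",       # core conversation: needs reasoning
    "deal_briefing": "claude-sonnet-4-5",    # pain-point extraction: needs reasoning
    "suggestion_chips": "claude-haiku-4-5",  # 3 short chips
    "visitor_memory": "claude-haiku-4-5",    # 3-bullet summary
    "followup_email": "claude-haiku-4-5",    # templated draft
    "calendar_match": "claude-haiku-4-5",    # slot matching
}

def model_for(task: str) -> str:
    # Default cheap: a task has to earn its way onto Sonnet.
    return MODEL_FOR_TASK.get(task, "claude-haiku-4-5")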

Lever 3: Knowledge base retrieval, not full-context dump.

Most AI chat tools dump the entire knowledge base into the system prompt. For a customer with a 100-page site crawl, that's 40,000+ tokens per request. Even cached, that's a lot of bytes flowing through every reply.

We use embeddings-based retrieval instead. On knowledge base ingest, we chunk the content and generate embeddings using a lightweight model. At query time, we embed the visitor's question, retrieve the top 3 to 5 chunks (about 2,000 tokens), and inject only those into the prompt.

This drops average input tokens from ~10,000 to ~4,000 per turn. Total impact: ~40% additional cost reduction on top of caching.
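A sketch of that retrieval path, using sentence-transformers as a stand-in for the embedding model (we don't name our actual one here) and plain numpy for similarity:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in lightweight model

def build_index(chunks: list[str]) -> np.ndarray:
    # Run once at knowledge base ingest; persist alongside the chunks.
    return np.asarray(embedder.encode(chunks, normalize_embeddings=True))

def retrieve(question: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    # Embed the visitor's question and take the k most similar chunks.
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Only the retrieved ~2,000 tokens are injected into the prompt,
# not the full 40,000-token crawl.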

Lever 4: Hard rate limits and conversation caps.

The most expensive customer is one bad-faith user who realizes they can chat with your bot for an hour. Without limits, one curious developer can rack up $20 in API spend in a single session.

We enforce three layers:

  • Per visitor per minute: 5 messages

  • Per visitor per session: 20 messages

  • Per visitor per 24 hours: 60 messages

And conversation length itself caps at 12 AI replies. After 12, the AI gracefully says "let me connect you with the team" and captures an email.
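A sketch of all three layers as atomic Redis counters, plus the reply cap. The key names are ours, and we approximate a session as a one-hour window for illustration:

import redis

r = redis.Redis()

# (window in seconds, max messages) per enforcement layer
LIMITS = {
    "minute": (60, 5),
    "session": (60 * 60, 20),      # session approximated as a 1-hour window
    "day": (60 * 60 * 24, 60),
}

MAX_AI_REPLIES = 12  # after this, hand off and capture an email

def allow_message(visitor_id: str) -> bool:
    # INCR is atomic; the TTL starts when the counter is first created.
    # This counts attempts rather than accepted messages, which is
    # exactly what you want for abuse control.
    for name, (window, limit) in LIMITS.items():
        key = f"rl:{name}:{visitor_id}"
        count = r.incr(key)
        if count == 1:
            r.expire(key, window)
        if count > limit:
            return False
    return True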

Lever 5: Smart defaults that prevent expensive workspaces.

The hidden cost in any AI product is the customer who configures it badly:

  • Customer crawls their entire 500-page site as the knowledge base

  • Customer sets the system prompt to "be extremely detailed and thorough"

  • Customer enables every end-of-conversation feature regardless of plan tier

We constrain these by default. Knowledge base sources are capped per plan. System prompt customization is gated behind Scale. The defaults keep costs bounded; customers can buy more headroom by upgrading.
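Concretely, this can live in a per-plan limits object consulted at configuration and ingest time. The cap values below are illustrative, not our actual numbers:

from dataclasses import dataclass

@dataclass(frozen=True)
class PlanLimits:
    max_kb_sources: int         # knowledge base sources allowed
    max_pages_per_crawl: int    # stops the 500-page-site problem
    custom_system_prompt: bool  # gated behind Scale

PLAN_LIMITS = {
    "starter": PlanLimits(max_kb_sources=3, max_pages_per_crawl=100, custom_system_prompt=False),
    "scale": PlanLimits(max_kb_sources=10, max_pages_per_crawl=500, custom_system_prompt=True),
}

def can_add_source(plan: str, current_count: int) -> bool:
    return current_count < PLAN_LIMITS[plan].max_kb_sources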

The actual numbers, today.

With all five levers in place, our costs per conversation come out to roughly:

  • Sonnet inference (cached): $0.03 to $0.05

  • Haiku for chips, memory, email: $0.005 to $0.01

  • End-of-conversation Sonnet: $0.01 to $0.02

  • Apollo enrichment: customer's API key, not ours

  • Total: $0.05 to $0.08 per conversation

500 conversations a month on the Starter plan: $25 to $40 in actual cost. On a $99 plan, that's 60% to 75% gross margin.
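Spelled out:

# Gross margin on the $99 Starter plan at 500 conversations/month.
for cost_per_conv in (0.05, 0.08):
    monthly_cost = cost_per_conv * 500
    margin = (99 - monthly_cost) / 99
    print(f"${cost_per_conv:.2f}/conv -> ${monthly_cost:.0f}/mo, {margin:.0%} margin")
# $0.05/conv -> $25/mo, 75% margin
# $0.08/conv -> $40/mo, 60% margin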

What this means for builders.

If you're building AI products and pricing them affordably, the cost engineering is non-negotiable. Five things that matter, in order:

  1. Prompt caching, verified in production

  2. Right-sized model selection per task

  3. Retrieval over full-context dumps

  4. Hard rate limits at the visitor and session level

  5. Default configurations that prevent expensive misuse

Without all five, $99/mo isn't viable for AI chat. With them, it's a healthy business at scale.

What's next.

The biggest remaining cost lever is fine-tuning a smaller model for the high-volume, low-complexity tasks. That's a 6-12 month project. For now, the math works.
