A website chatbot built on the Claude API takes a user's message, optionally injects relevant content chunks from your site (RAG), and streams a grounded reply — all within a few hundred milliseconds. The difference between a toy demo and a production chatbot is mostly system prompt discipline, retrieval quality, and knowing when to hand off to a human.
What architecture does a production Claude chatbot actually need?
Skip the single-file demos. A production chatbot has four distinct layers: a retrieval layer that fetches relevant content from your site, a prompt construction layer that assembles context + history + user message, an inference layer (the Claude API call itself), and a delivery layer that streams tokens back to the browser. Each layer has failure modes that will bite you in production.
Here's the minimal stack we use at TopSyde:
| Layer | Implementation | Why |
|---|---|---|
| Retrieval | pgvector on Postgres + OpenAI ada-002 embeddings | Colocated with app DB, no extra infra |
| Prompt construction | Server-side TypeScript function | Keeps token budget logic out of client |
| Inference | Claude claude-haiku-3-5 (default), Sonnet escalation | Cost-optimized with quality fallback |
| Delivery | Vercel Edge Function + ReadableStream | Sub-100ms TTFB globally |
| Session storage | Redis (Upstash) | TTL-based, no PII at rest |
The retrieval and inference layers are where most implementations get sloppy. We'll dig into both.
How to ground a chatbot in your own site content (RAG)
RAG — retrieval-augmented generation — means you pull relevant chunks of your own content, inject them into the prompt, and instruct Claude to answer only from that context. Without it, Claude answers from its training data, which knows nothing about your pricing, your specific product features, or your policies.
Step 1: Chunk and embed your content
Split your site content into ~500-token chunks with 50-token overlap. Overlap preserves context across chunk boundaries — a sentence that straddles two chunks won't get orphaned.
// Rough chunking logic
function chunkContent(text: string, chunkSize = 500, overlap = 50): string[] {
const words = text.split(' ');
const chunks: string[] = [];
let i = 0;
while (i < words.length) {
chunks.push(words.slice(i, i + chunkSize).join(' '));
i += chunkSize - overlap;
}
return chunks;
}
Embed each chunk with your embedding model and store the vector alongside the raw text in pgvector. We re-embed on every content publish via a webhook.
Step 2: Retrieve at query time
When a user sends a message, embed the query and run a cosine similarity search against your content vectors. Pull the top 5 chunks.
SELECT content, 1 - (embedding <=> $1::vector) AS similarity
FROM site_content
ORDER BY embedding <=> $1::vector
LIMIT 5;
Five chunks at 500 tokens each is 2,500 tokens of context — usually enough to answer a specific question without blowing your token budget.
Step 3: Inject into the system prompt
Pass retrieved chunks as a <context> block in the system prompt, before the conversation history. This is critical: Claude weighs earlier prompt content more heavily, so context should precede history.
This same pattern — chunking, embedding, retrieval, injection — is what powers our AI website audit tool, though that pipeline runs batch rather than streaming.
How to write a system prompt that doesn't lie to your users
The system prompt is your only reliable guardrail. Streaming filters, post-processing regex — they all have edge cases. A well-constructed system prompt handles the hard cases before the model ever generates a token.
Here's the core structure we use:
You are [Company]'s support assistant. You help visitors understand our
products and services.
GROUNDING RULES:
- Answer only from the <context> block provided. If the answer isn't there,
say so explicitly and offer to connect them with the team.
- Never quote specific prices unless they appear verbatim in <context>.
- Never make commitments about timelines, SLAs, or features unless
explicitly stated in <context>.
- If asked about a competitor, describe our own product only — do not
disparage competitors.
ESCALATION RULES:
- If the user expresses frustration, anger, or uses words like "cancel",
"refund", "legal", or "complaint", respond with empathy and immediately
offer to connect them with a human agent.
- If you cannot confidently answer after reviewing context, offer escalation
rather than guessing.
TONE:
- Concise, direct, technically accurate. Match the user's register.
- No excessive affirmations ("Great question!"). Just answer.
<context>
{retrievedChunks}
</context>
The pricing guardrail is non-negotiable. Claude has no way to know your current pricing from training data, and even if it guesses correctly today, it will be wrong after your next price change. The explicit rule "never quote prices unless they appear in context" plus RAG retrieval of your pricing page is the only safe pattern. We cover a related problem — what happens when a site's product catalog changes out from under an AI integration — in our WooCommerce Claude AI integration guide.
How to implement streaming responses
Streaming makes your chatbot feel fast. A 300-token response at non-streaming takes ~2 seconds to return; with streaming, the user sees the first tokens in under 300ms. According to Anthropic's own latency benchmarks, streaming with Haiku 3.5 delivers time-to-first-token under 500ms for most regions (Anthropic, 2024).
The Claude API's streaming interface via the Python SDK:
import anthropic
client = anthropic.Anthropic()
def stream_response(messages: list, system: str):
with client.messages.stream(
model="claude-haiku-3-5",
max_tokens=1024,
system=system,
messages=messages,
) as stream:
for text in stream.text_stream:
yield text
On the frontend, consume the stream via fetch with a ReadableStream decoder:
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ message, sessionId }),
headers: { 'Content-Type': 'application/json' },
});
const reader = response.body!.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
appendToChat(chunk); // Update UI incrementally
}
One gotcha: if you're running on a serverless platform with response timeout limits, make sure your streaming endpoint won't hit the wall-clock limit on long responses. We hit this on Vercel with Sonnet on complex queries — solution was to move those to a longer-timeout route.
How to handle human escalation
Escalation is the most underbuilt part of most chatbot implementations. Two signal types matter: explicit triggers (user says "I want to talk to a person") and implicit triggers (frustration language, repeated failed answers, sensitive topics).
const EXPLICIT_ESCALATION_KEYWORDS = [
'human', 'agent', 'person', 'support ticket', 'call', 'phone',
'cancel', 'refund', 'legal', 'complaint', 'sue'
];
const IMPLICIT_ESCALATION_KEYWORDS = [
'frustrated', 'angry', 'useless', 'broken', 'never works',
'worst', 'terrible', 'disaster'
];
function shouldEscalate(message: string, failedAttempts: number): boolean {
const lower = message.toLowerCase();
const explicitMatch = EXPLICIT_ESCALATION_KEYWORDS.some(k => lower.includes(k));
const implicitMatch = IMPLICIT_ESCALATION_KEYWORDS.some(k => lower.includes(k));
return explicitMatch || implicitMatch || failedAttempts >= 2;
}
When escalation triggers, don't just drop a "contact us" link. Capture the full conversation history and route it to your support system with context. We POST the session to a Slack webhook with the transcript, so the human agent knows exactly what was already tried. This keeps the handoff from feeling like starting over — which, according to Salesforce's 2024 State of Service report, is one of the top two frustrations customers report when escalating from chatbot to human (Salesforce, 2024).
What do Claude API chatbots actually cost at scale?
Token cost math is important to get right before you ship. Underestimating it is how you end up with surprise bills.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Typical session cost* |
|---|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4.00 | ~$0.015 |
| Claude Sonnet 4 | $3.00 | $15.00 | ~$0.06 |
| Claude Opus 4 | $15.00 | $75.00 | ~$0.30 |
*Assumes 10-turn session: 2,000 input tokens/turn (context + history), 300 output tokens/turn
At 1,000 sessions/day on Haiku, you're at ~$15/day — $450/month. That's a real number, but it's also well within reason for a site doing meaningful support volume. The usual optimization path: run Haiku by default, escalate to Sonnet only when the question complexity score (based on entity count + query length) exceeds a threshold.
Cache your retrieved content chunks aggressively. If a user asks three questions about the same topic in a session, the retrieved chunks for Q2 and Q3 are often identical to Q1 — no need to re-embed and re-query. Session-level chunk caching cuts retrieval latency and avoids redundant embedding API calls.
The TopSyde pricing page has current plan details if you're evaluating whether to build this in-house versus using a platform that already ships it.
What guardrails prevent hallucinated pricing and promises?
Beyond system prompt rules, three technical guardrails give you defense in depth:
1. Context-only citation requirement Instruct Claude to cite which context chunk it's drawing from when making specific claims. This surfaces hallucinations — if Claude can't cite a chunk, it's improvising.
2. Post-generation validation for numbers Run a regex pass over the output before returning it to the user. Flag any response containing currency symbols, percentages, or time commitments that don't appear verbatim in the retrieved context. Reject and regenerate, or flag for human review.
import re
SENSITIVE_PATTERN = re.compile(r'\$[\d,]+|[\d]+%|\d+ (day|hour|week|month)')
def validate_response(response: str, context_chunks: list[str]) -> bool:
matches = SENSITIVE_PATTERN.findall(response)
context_text = ' '.join(context_chunks)
return all(match in context_text for match in matches)
3. Confidence scoring Add a second, minimal Claude call (or a smaller local model) to score whether the primary response is actually grounded in the provided context. Responses scoring below 0.7 trigger a "I'm not certain — want me to connect you with the team?" fallback. This costs a few extra tokens but prevents the worst hallucination cases.
This reliability layer matters especially if your chatbot surfaces near cost-sensitive decisions. The same principle applies to any AI system making consequential recommendations — which is why our WordPress 7 AI features guide emphasizes keeping humans in the loop on AI-generated SEO recommendations before they go live.
Deployment and monitoring checklist
Before you ship to production:
- Rate limit by session ID (prevent prompt injection spam)
- Log every conversation with session ID, model used, token counts, and escalation events
- Set
max_tokenshard limits per response (we use 1024 — longer answers belong in docs) - Configure spend alerts in Anthropic Console ($50/day threshold is reasonable to start)
- Test your escalation path end-to-end — does the Slack/CRM notification actually fire?
- Audit system prompt for PII leakage (don't include internal config in context chunks)
- Verify your Redis session TTL matches your support SLA window
Uptime and reliability planning for the backend that powers your chatbot is the same discipline as hosting planning — if the server is down, the chatbot is down. The same principles we cover in what website downtime actually costs your business apply: chatbot outages during peak hours have direct revenue impact for sales-facing bots.
For the hosting layer itself, our technical stack details show what infrastructure TopSyde runs its own chatbot on, which you can use as a reference architecture.
Frequently Asked Questions
How many tokens does a typical chatbot session use?
A 10-turn conversation with RAG context typically uses 20,000–25,000 input tokens and 3,000–4,000 output tokens. Input cost dominates because each turn re-sends the full conversation history plus retrieved context chunks. Use context window compression (summarize older turns) for sessions that run long.
Can Claude refuse to answer questions outside my site's content?
Yes — this is controlled entirely by your system prompt. Instruct Claude to respond with a specific fallback message ("I don't have information on that — would you like to speak with our team?") when the retrieved context doesn't contain a relevant answer. The key is making the fallback feel helpful, not like an error.
What's the difference between using Claude Haiku vs Sonnet for a chatbot?
Haiku is faster and roughly 4x cheaper than Sonnet, and it handles the majority of support and FAQ queries well. Sonnet is worth the cost for technically complex questions, multi-step reasoning, or when you need richer, more nuanced responses. The practical pattern is to start all sessions on Haiku and escalate to Sonnet when query complexity — measured by entity count, question depth, or explicit user frustration — exceeds a threshold you tune based on your own session logs.
How do I prevent the chatbot from making promises about pricing or timelines?
The most reliable combination is: (1) explicit system prompt rules prohibiting price/timeline statements unless present in retrieved context, (2) RAG retrieval that actually includes your current pricing page, and (3) a post-generation regex validator that flags numeric outputs not found in the context. No single layer is sufficient — defense in depth is the right posture for anything customer-facing.
What happens if the Claude API goes down?
Implement a graceful degradation path: catch API errors, display a friendly "our chat is temporarily unavailable" message, and show your standard contact options. Never let an uncaught exception surface to the user. For high-volume deployments, maintain a simple FAQ fallback (static responses to

Founder & Lead Developer
20+ years full-stack development, WordPress, AI tools & agents
Colton is the founder of TopSyde with 20+ years of full-stack development experience spanning WordPress, cloud infrastructure, and AI-powered tooling. He specializes in performance optimization, server architecture, and building AI agents for automated site management.

