From Stateless Chatbot to Persistent Agent: Building an Internal AI Chat That Actually Remembers
Every internal chatbot starts the same way. Someone wires up an LLM endpoint, slaps a text input on a page, and ships it. For about a week, it feels magical. Then someone asks a follow-up question and the bot has no idea what they’re talking about.
That was us. Our ops portal had a chat feature - a single Lambda (one of the TypeScript Lambdas we migrated from Rust) that took a message, sent it to Claude Haiku via Bedrock, and returned the response. No history. No memory. Every message was a fresh start. Staff would ask about a contractor, get a great answer, then ask “what about their financials?” and the bot would stare blankly into the void.
We needed persistence. We needed multi-turn context. We needed conversations that multiple staff members could share and continue over days. And we needed it without blowing our token budget on a 200K context window or introducing a vector database just for chat history.
This is the story of how we got there.
The Data Model: Two Entities, One Table
We were already running a single-table DynamoDB design with ElectroDB for the rest of the ops platform. Adding chat meant two new entities: Conversation and Message.
A Conversation carries the metadata: a ULID for the ID (time-sortable, which matters for listing), a title (auto-generated from the first message), a status, a rolling summary, a summaryUpToMessageId pointer, and a message count. The summary fields are the interesting ones - we will come back to those.
A Message has the basics - role, content, author, timestamp - plus tokenEstimate (a rough char/4 approximation), toolCalls (the full tool call payload, stored as a JSON blob), and a chunkId field that is null for now. That last one is a future upgrade path for chunked streaming responses. We put it in the schema now because an unused attribute costs nothing in DynamoDB, whereas regretting your access patterns later does.
The partition key layout keeps things simple: CONVERSATION#{id} as the PK, with the conversation metadata on SK CONVERSATION and messages on SK MESSAGE#{messageId}. ULIDs as message IDs give us chronological ordering for free on the sort key. A GSI on status lets us list active conversations without scanning the table.
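To make the key layout concrete, here is a minimal sketch of how the two key shapes could be built by hand. The helper names are hypothetical, and in practice ElectroDB composes the keys from the entity definition, so treat this as an illustration of the layout rather than the actual code.

```typescript
// Hypothetical helpers illustrating the key layout described above.
interface ItemKey {
  pk: string;
  sk: string;
}

// Conversation metadata sits at a fixed sort key inside its own partition.
function conversationKey(conversationId: string): ItemKey {
  return { pk: `CONVERSATION#${conversationId}`, sk: "CONVERSATION" };
}

// Messages share the conversation's partition. Because ULIDs sort
// lexicographically by creation time, the sort key orders them chronologically.
function messageKey(conversationId: string, messageId: string): ItemKey {
  return { pk: `CONVERSATION#${conversationId}`, sk: `MESSAGE#${messageId}` };
}
```

A single query on the partition key returns the metadata item and every message, and a sort-key prefix of MESSAGE# narrows it to messages only.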
Context Engineering: The Hard Part
Here is the core problem. A long-running conversation might have 200 messages. Claude’s context window could technically hold all of them. But should it?
No. For three reasons:
- Cost scales linearly with input tokens. Every message in context is tokens you pay for on every turn. A 200-message conversation with tool call results could easily hit 50K tokens of input - and you are paying that on every single response.
- Latency scales with input size. More tokens in, more time to first token out. For an internal tool where staff expect snappy responses, we are targeting sub-5-second round trips. Dumping the entire history into context makes that impossible.
- Relevance decays with distance. Message 3 from last Tuesday is almost certainly less relevant than message 198 from five minutes ago. Treating all context as equally important is a waste of the model’s attention.
So we built a two-layer context system: a sliding window of recent messages, plus a rolling summary of everything that came before.
The Token Budget
We carved up the input like this:
- System prompt + tool definitions: ~2K tokens (fixed cost)
- Rolling summary: ~1.5K tokens (grows slowly, bounded by summarisation prompt)
- Recent message window: ~8K tokens (the sliding window)
- New user message: ~0.5K tokens (bounded by input validation)
- Tool call headroom: ~3K tokens (for tool results in the current turn)
Total input: roughly 15K tokens per turn. That is well within the 200K limit, fast to process, and cheap to run. The sliding window gives the model full fidelity on recent context, and the summary gives it the gist of everything older.
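Written out as constants, the budget looks something like the sketch below. The object and property names are made up for illustration; only the numbers come from the list above.

```typescript
// Illustrative names; the figures are the approximate per-turn budgets listed above.
const tokenBudget = {
  systemPromptAndTools: 2_000,
  rollingSummary: 1_500,
  recentMessageWindow: 8_000,
  newUserMessage: 500,
  toolCallHeadroom: 3_000,
};

const totalInputTokens =
  tokenBudget.systemPromptAndTools +
  tokenBudget.rollingSummary +
  tokenBudget.recentMessageWindow +
  tokenBudget.newUserMessage +
  tokenBudget.toolCallHeadroom;

// Fraction of a 200K context window that a single turn uses.
const contextWindowTokens = 200_000;
const shareOfWindow = totalInputTokens / contextWindowTokens;
```

The total comes to 15,000 tokens, or 7.5% of the available window.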
Building the Window
The sliding window is token-budgeted, not fixed-count. We walk backwards through the message history, accumulating token estimates until we hit the 8K budget:
function buildSlidingWindow(messages, budget) {
  // `messages` arrive newest-first, so index 0 is the most recent message.
  let accumulated = 0;
  const window = [];
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    let content = msg.content;
    let cost = msg.tokenEstimate;
    // Tool call results older than 3 messages get collapsed to a stub
    if (msg.toolCalls && msg.toolCalls.length > 0 && i >= 3) {
      content = msg.toolCalls
        .map(tc => `[tool: ${tc.toolName ?? 'unknown'}]`)
        .join(', ');
      // Charge the budget for the stub, not for the full payload it replaces
      cost = Math.ceil(content.length / 4);
    }
    if (accumulated + cost > budget) break;
    accumulated += cost;
    window.push({ role: msg.role, content });
  }
  // Restore chronological order for the prompt
  return window.reverse();
}
Two things worth noting. First, messages come in from DynamoDB in reverse chronological order (we query with order: 'desc'), so we are naturally walking from newest to oldest. The final reverse() puts them back in chronological order for the prompt.
Second, tool call results get special treatment. The most recent three turns keep their full tool call payloads - the model needs those to understand the current thread of reasoning. Older tool calls get collapsed to a stub like [tool: searchContractors]. The full results are still in DynamoDB if anyone needs them, but they would just waste tokens in the prompt. A single contractor search result can easily be 2K tokens. Three of those from earlier in the conversation and your window is half gone.
Prompt Assembly
The final prompt is assembled in layers:
- The system prompt (static, defines the agent’s role and available tools)
- If a rolling summary exists, it gets injected as a system message: <summary>...</summary>
- The sliding window of recent messages
- The new user message
This gives the model a clear information hierarchy: who am I, what happened before (summary), what just happened (window), what does the user want now.
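The layering can be sketched as a small pure function. This is an illustration under assumptions: the function name, the message shape, and the choice to pass the summary as a second system message are ours for the example, and the real code hands the result to the AI SDK rather than returning a bare array.

```typescript
type Role = "system" | "user" | "assistant";

interface PromptMessage {
  role: Role;
  content: string;
}

// Hypothetical helper: assembles the prompt layers in the order described above.
function assemblePrompt(
  systemPrompt: string,
  rollingSummary: string | null,
  recentWindow: PromptMessage[],
  newUserMessage: string,
): PromptMessage[] {
  const prompt: PromptMessage[] = [{ role: "system", content: systemPrompt }];
  // The summary layer is only present once a summary has been generated.
  if (rollingSummary) {
    prompt.push({ role: "system", content: `<summary>${rollingSummary}</summary>` });
  }
  // Recent messages are already in chronological order.
  for (const message of recentWindow) {
    prompt.push(message);
  }
  prompt.push({ role: "user", content: newUserMessage });
  return prompt;
}
```

A brand-new conversation simply skips the summary layer, so the same function covers both the first turn and the two-hundredth.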
Rolling Summaries: The Long-Term Memory
The sliding window handles the last few minutes of conversation. But what about the last few days?
That is what the rolling summary does. After every ~10 turns, we fire an EventBridge event (conversation.summary.needed) that triggers a dedicated summary Lambda. This Lambda runs asynchronously - the user gets their response immediately, and the summary generates in the background.
The summary Lambda fetches all messages, finds everything between the last summary point and the current window, and asks a stronger model to produce a condensed summary. The key detail: we use Sonnet for summaries and Haiku for conversation turns.
This is a deliberate cost/quality trade-off. Haiku is fast and cheap - perfect for the interactive loop where latency matters. But summarisation is a harder task. You need to identify what matters, preserve named entities and decisions, and discard noise. Sonnet does this meaningfully better than Haiku, and since summaries only generate every ~10 turns (not on every message), the cost impact is negligible.
The summarisation prompt is specific about what to preserve:
Key decisions and conclusions. Important facts and data points. Action items or open questions. Named entities - people, companies, projects.
If a previous summary exists, it gets included as context, so the new summary is an update rather than a fresh take. This means the summary evolves incrementally - it does not lose information from 50 messages ago just because the most recent batch did not mention it.
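As a rough sketch of that incremental update, the prompt builder might look like the following. The wording of the instructions, the function name, and the message shape are all assumptions for illustration; only the list of things to preserve and the idea of feeding in the previous summary come from the description above.

```typescript
interface StoredMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical prompt builder: folds the previous summary into the new request
// so the model updates it instead of starting from scratch.
function buildSummaryPrompt(previousSummary: string | null, messages: StoredMessage[]): string {
  const instructions = [
    "Summarise the conversation below. Preserve:",
    "- Key decisions and conclusions",
    "- Important facts and data points",
    "- Action items or open questions",
    "- Named entities: people, companies, projects",
  ].join("\n");

  const prior = previousSummary
    ? `\n\nPrevious summary (update it, keeping earlier details that still matter):\n${previousSummary}`
    : "";

  const transcript = messages.map((m) => `${m.role}: ${m.content}`).join("\n");

  return `${instructions}${prior}\n\nNew messages:\n${transcript}`;
}
```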
Concurrency Safety
What happens if two users send messages at the same time and both trigger a summary? DynamoDB conditional writes handle this. The summary Lambda writes the new summary along with a summaryUpToMessageId pointer. If two summary invocations race, only one write succeeds - the other sees a stale pointer and either retries or no-ops. An EventBridge dead-letter queue catches anything that falls through.
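The logic of that conditional write can be shown with an in-memory stand-in. This is not the DynamoDB call itself: in the real system the check would live in a ConditionExpression on the update, and the exact condition is an assumption here. The record shape and function name are illustrative.

```typescript
interface ConversationRecord {
  summary: string | null;
  summaryUpToMessageId: string | null;
}

// In-memory stand-in for a conditional write: the update only applies if the
// stored pointer still matches the one this invocation read before it started.
function writeSummaryIfUnchanged(
  record: ConversationRecord,
  expectedPointer: string | null,
  newSummary: string,
  newPointer: string,
): boolean {
  if (record.summaryUpToMessageId !== expectedPointer) {
    // Another invocation got there first; the caller can retry or no-op.
    return false;
  }
  record.summary = newSummary;
  record.summaryUpToMessageId = newPointer;
  return true;
}

// Two invocations both read the pointer as null, then race to write.
const conversation: ConversationRecord = { summary: null, summaryUpToMessageId: null };
const firstWon = writeSummaryIfUnchanged(conversation, null, "Summary A", "01MSG10");
const secondWon = writeSummaryIfUnchanged(conversation, null, "Summary B", "01MSG11");
```

Here the first write succeeds and the second is rejected, so the stored summary stays consistent with its pointer.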
Tools, Not RAG
The original chat had an automatic RAG step: embed the user’s message, search Pinecone, stuff the results into context, then send everything to the model. Every single message triggered a vector search, whether relevant or not.
We replaced this with tool-based retrieval. The model gets tool definitions - searchContractors, getContractorDetail, queryDataWarehouse - and decides when to call them. The Vercel AI SDK’s generateText with maxSteps: 5 handles the tool calling loop: the model requests a tool call, we execute it, feed the result back, and the model continues.
This is better for three reasons:
- Precision. The model calls tools when it actually needs data, not on every turn. “What’s 2+2?” does not trigger a contractor database search.
- Transparency. Tool calls are explicit and logged. We store the full toolCalls array on the assistant message. Staff can see exactly what the bot looked up.
- Composability. Adding a new data source means adding a new tool definition - a Zod schema and an execute function. We recently added queryDataWarehouse, a text-to-SQL tool that queries business data via DuckDB against Parquet exports. It took about a day to add. With RAG, adding a new data source means figuring out embeddings, chunking strategy, index configuration, and retrieval thresholds.
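For readers who have not seen a tool-calling loop, here is a simplified, synchronous sketch of the idea. It is not the Vercel AI SDK's implementation or API: the model is replaced by a scripted function, real calls are asynchronous, and every type and name below is invented for the example.

```typescript
interface ToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

type ModelStep =
  | { type: "tool-call"; call: ToolCall }
  | { type: "text"; text: string };

interface Tool {
  description: string;
  execute: (args: Record<string, unknown>) => string;
}

// `model` stands in for an LLM call: it sees the tool results so far and
// returns either another tool request or a final answer.
function runToolLoop(
  model: (toolResults: string[]) => ModelStep,
  tools: Record<string, Tool>,
  maxSteps: number,
): { text: string; toolCalls: ToolCall[] } {
  const toolResults: string[] = [];
  const toolCalls: ToolCall[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const result = model(toolResults);
    if (result.type === "text") {
      return { text: result.text, toolCalls };
    }
    const tool = tools[result.call.toolName];
    toolCalls.push(result.call); // kept so the calls can be stored and audited
    toolResults.push(tool ? tool.execute(result.call.args) : `Unknown tool: ${result.call.toolName}`);
  }
  return { text: "", toolCalls }; // step limit reached without a final answer
}

const tools: Record<string, Tool> = {
  searchContractors: {
    description: "Search contractors by name",
    execute: (args) => `1 match for ${String(args["name"])}`,
  },
};

// Scripted model: asks for one lookup, then answers using the result.
const scriptedModel = (results: string[]): ModelStep =>
  results.length === 0
    ? { type: "tool-call", call: { toolName: "searchContractors", args: { name: "Acme" } } }
    : { type: "text", text: `Found: ${results[0]}` };

const outcome = runToolLoop(scriptedModel, tools, 5);
```

The step limit is what keeps a confused model from looping forever, and the collected toolCalls array is what makes the lookups visible afterwards.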
The queryDataWarehouse tool is particularly fun. The model writes SQL, we validate it is SELECT-only, resolve table names to Parquet file paths (auto-discovering the latest export date), enforce a row LIMIT, and run it through DuckDB. The model gets to query accounts, users, projects, quotes, and partner profiles with full SQL semantics - JOINs, aggregations, date functions, the lot.
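A deliberately conservative sketch of the validation and row-limit steps is below. The function name, the default limit of 100, and the wrap-in-a-subquery approach are assumptions for illustration, not the production code. String checks like these are a first line of defence only, so the table-name resolution step still matters for keeping queries on known Parquet files.

```typescript
const DEFAULT_ROW_LIMIT = 100; // assumed value for the example

// Hypothetical guard: accepts a single SELECT statement and caps the row count.
function prepareQuery(sql: string, maxRows: number = DEFAULT_ROW_LIMIT): string {
  // Drop one trailing semicolon, then reject anything that still contains one.
  // This also rejects semicolons inside string literals, which errs on the safe side.
  const trimmed = sql.trim().replace(/;\s*$/, "");
  if (trimmed.indexOf(";") !== -1) {
    throw new Error("Only a single statement is allowed");
  }
  if (!/^select\b/i.test(trimmed)) {
    throw new Error("Only SELECT queries are allowed");
  }
  // Wrapping in an outer SELECT caps rows whatever LIMIT the model wrote itself.
  return `SELECT * FROM (${trimmed}) AS q LIMIT ${maxRows}`;
}
```

Rejecting rather than repairing a suspicious query keeps the failure mode simple: the model sees an error message and can try again with a cleaner statement.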
Three Lambdas, Two Patterns
The backend is split into three Lambdas with very different profiles:
conversation-api handles CRUD operations: list conversations, get a conversation, update title/status, list messages with cursor-based pagination. It is a pure data Lambda - no LLM calls, no tool execution. Timeout: 30 seconds (generous for DynamoDB reads).
conversation-chat handles the actual AI interaction: create a conversation with a first message, or send a message to an existing one. It builds the context window, calls Bedrock, executes tools, stores messages, and fires summary events. Timeout: 60 seconds (tool calling loops can take a while).
summary-generator runs async via EventBridge. It fetches the full message history, builds the summarisation prompt, calls Sonnet, and writes the result back. Timeout: 120 seconds (Sonnet is slower, and large conversations mean large prompts).
We use a handler factory pattern - createApiHandler for HTTP request/response Lambdas, createEventHandler for EventBridge-triggered ones - that handles routing, body parsing, auth claims extraction, and error wrapping. The actual handler code reads like a script: get data, do thing, return result.
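A stripped-down sketch of what such a factory could look like is below. The event and response shapes are simplified stand-ins rather than the real API Gateway types, and the routing-by-string-key scheme is an assumption made for the example.

```typescript
// Simplified stand-ins for the real Lambda event and response types.
interface ApiEvent {
  httpMethod: string;
  path: string;
  body: string | null;
  claims: Record<string, string>;
}

interface ApiResponse {
  statusCode: number;
  body: string;
}

interface RequestContext {
  body: unknown;
  userId: string | null;
}

type RouteHandler = (ctx: RequestContext) => Promise<unknown>;

// Hypothetical factory: routing, body parsing, claims extraction and error
// wrapping live here, so each route handler stays a short script.
function createApiHandler(routes: Record<string, RouteHandler>) {
  return async (event: ApiEvent): Promise<ApiResponse> => {
    const route = routes[`${event.httpMethod} ${event.path}`];
    if (!route) {
      return { statusCode: 404, body: JSON.stringify({ error: "Not found" }) };
    }
    try {
      const body: unknown = event.body ? JSON.parse(event.body) : null;
      const userId = event.claims["sub"] || null;
      const result = await route({ body, userId });
      return { statusCode: 200, body: JSON.stringify(result) };
    } catch (err) {
      const message = err instanceof Error ? err.message : "Unknown error";
      return { statusCode: 500, body: JSON.stringify({ error: message }) };
    }
  };
}

const handler = createApiHandler({
  "GET /conversations": async () => ({ conversations: [] }),
  "POST /conversations": async (ctx) => ({ createdBy: ctx.userId }),
});
```

An event-handler variant would follow the same shape, minus the routing and HTTP status codes.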
Shared Conversations
One of the requirements that shaped the whole design was shared conversations. In our ops team, someone starts investigating a contractor issue, gets halfway through, and then hands it off when their shift ends. Or a question escalates from one person to another.
Because conversations are just DynamoDB entities with no user-scoped partition key, any authenticated staff member can see and continue any conversation. The author field on each message tracks who said what, but access is shared by default. The conversation list endpoint filters by status (active/archived), not by user.
This is a deliberate choice. In a larger org you would want access controls. For a small ops team, the shared-by-default model means anyone can pick up where someone else left off, and the rolling summary means they get caught up without reading 50 messages.
What We Would Do Differently
Token estimation is a hack. Math.ceil(text.length / 4) is a rough approximation that works well enough for English text but falls apart for structured data, code, or anything with lots of whitespace. A proper tokeniser would be more accurate, but it also adds a dependency and latency to every message. The approximation has not caused problems yet - the window just ends up slightly under or over budget - but it is technical debt we are aware of.
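For reference, the whole estimator is a one-liner. The function name is ours; the formula is the one quoted above.

```typescript
// Rough estimate: about four characters per token for English prose.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Indentation is charged one token per four characters, which is one reason
// whitespace-heavy structured data throws the estimate off.
const proseEstimate = estimateTokens("What about their financials?");
const indentedEstimate = estimateTokens("        " + "What about their financials?");
```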
The summary threshold is fixed. Every 10 turns, regardless of conversation density. A conversation where someone is firing off quick clarification questions generates summaries at the same rate as one with long, detailed exchanges. An adaptive threshold based on token accumulation rather than message count would be smarter.
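To show the difference, here is a sketch of both triggers side by side. The modulo check is our guess at how a fixed every-10 rule might be expressed, and the token-based variant is the possible improvement, not something we have built. All names and the threshold value are illustrative.

```typescript
interface PendingMessage {
  tokenEstimate: number;
}

// Fixed rule (assumed form): fire on every Nth message, whatever its size.
function fixedTrigger(messageCount: number, every: number = 10): boolean {
  return messageCount > 0 && messageCount % every === 0;
}

// Adaptive alternative: fire once the not-yet-summarised messages add up to
// a token threshold, so dense conversations summarise sooner.
function adaptiveTrigger(unsummarised: PendingMessage[], tokenThreshold: number): boolean {
  let total = 0;
  for (const message of unsummarised) {
    total += message.tokenEstimate;
  }
  return total >= tokenThreshold;
}

// Ten quick clarifications versus four long, tool-heavy exchanges.
const quickQuestions: PendingMessage[] = Array.from({ length: 10 }, () => ({ tokenEstimate: 20 }));
const longExchanges: PendingMessage[] = Array.from({ length: 4 }, () => ({ tokenEstimate: 1_500 }));
```

With a 4,000-token threshold, the ten quick questions (200 tokens in total) would not trigger a summary, while the four long exchanges (6,000 tokens) would, which is the opposite of what the fixed rule does.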
Streaming. We use generateText, not streamText. The original implementation used streaming, but tool calling with streams added complexity we did not need for an internal tool. Users get the full response after the model finishes. For external-facing products, streaming would be non-negotiable - but for an internal ops chat where responses take 2-4 seconds, the simplicity trade-off is worth it.
The Outcome
What we shipped: a persistent, multi-turn AI agent with tool use, shared conversations, rolling context summaries, and a token-budgeted sliding window. Three Lambdas, two ElectroDB entities, one DynamoDB table (that we were already paying for), and an EventBridge rule.
The cost model is straightforward. Haiku for conversation turns is cheap - a few cents per conversation per day at our volume. Sonnet summaries happen infrequently and are bounded in size. DynamoDB costs are negligible (single-digit RCU/WCU). The most expensive part is the tool calling - each tool call is an additional Bedrock invocation - but the model is judicious about when it calls tools, which is the whole point of using tools instead of automatic RAG.
Staff now start a conversation, ask a series of related questions, leave, come back the next day, and the bot knows what they were talking about. Someone else can jump in and continue the thread. The summary means even conversations from weeks ago have useful context when reopened.
It is not a particularly novel architecture. Sliding window plus summary is a well-known pattern. Tool use instead of RAG is increasingly standard. DynamoDB single-table design is old hat. But the combination - assembled with attention to the token budget, the cost/quality split between models, the concurrency safety of summary generation, and the practical reality of a small team sharing conversations - works well enough that people actually use it. And for an internal tool, that is the only metric that matters.
Building an AI Chat That Actually Remembers What You Were Talking About
Every internal chatbot starts the same way. Someone connects an AI to a text box, ships it, and for a week it feels impressive. Then someone asks a follow-up question and the bot stares blankly into the void.
That was us. Our operations portal had a chat feature — a single server function that took a message, sent it to an AI, and returned the response. No history. No memory. Every message was a fresh start. Staff would ask about a contractor, get a useful answer, then ask “what about their financials?” and the bot would have no idea what “their” referred to.
We needed something better. We needed conversations that lasted more than one question, that multiple team members could share, and that didn’t cost a fortune to run.
The Core Problem: Context Has a Price
Here’s the thing about AI context. You can technically feed an entire conversation history into every message you send. For a long-running conversation, that might be 200 messages. The AI could handle it. But should you?
No, for a few practical reasons.
Every piece of text you send to an AI costs money. It is billed in tokens, which are chunks of text typically a little shorter than a word. Every message in a long conversation sitting in that context means you’re paying for it on every single response. A long conversation with lots of detailed responses could easily hit 50,000 tokens of input, and you pay that cost on every turn.
More words in also means slower responses. For an internal tool where people expect quick answers, this matters.
And honestly, what someone said 50 messages ago is usually less relevant than what they said two minutes ago. Treating all context as equally important wastes the AI’s attention on things that don’t matter.
So we built a two-layer approach: a window of recent messages, plus a summary of everything older.
How the Memory Works
Think of it like briefing a colleague who’s joining a conversation halfway through. You don’t hand them a transcript of everything that was said. You give them a one-paragraph summary — “we’ve been investigating a contractor invoicing issue, found that their rate hadn’t been updated since January, and we’re trying to figure out who approved the last contract” — and then they read the last few exchanges themselves.
That’s exactly what we do. For every conversation turn, the AI gets:
- A summary of everything older than the recent window (built automatically in the background)
- The full recent conversation — however many messages fit within a token budget
- The new question
The summary is built separately by a stronger AI model, because summarisation is a harder task than answering questions and is worth spending slightly more on. It runs asynchronously in the background while the user gets their response, so there’s no waiting. Every 10 or so messages, a new summary gets built that incorporates everything.
One subtle detail: tool call results — where the AI looked something up — get compressed after a few messages. The full results are still stored, but they stop taking up expensive context space. The AI keeps a note that says “looked up contractor search results” rather than including all the actual results.
Looking Things Up vs. Searching Automatically
The original version of our chat had an automatic search step. Every message would trigger a search through our data before the AI even saw the question. Every single message, regardless of whether it needed data.
We replaced this with something called tool use. Instead of always searching, the AI decides when it needs information and asks for it by name. “Give me the details for this contractor.” “Run this query against the business data.” We tell the AI what tools exist and what they can do, and it calls them when it needs them.
This is better in a few ways. Asking “what’s 2+2?” doesn’t trigger a search through contractor records. When the AI does look something up, it’s logged explicitly — staff can see exactly what it retrieved. Adding new data sources means adding a new tool definition, not rearchitecting how search works.
The most interesting tool is one that lets the AI write database queries against our business data. It writes the query, we validate it’s read-only, and we run it. The AI can answer questions involving actual numbers from actual records rather than having to apologise that it doesn’t have access to that information.
Shared by Default
One of the requirements that shaped everything was sharing. In a small operations team, someone might start investigating an issue, hand it off when their shift ends, and have a colleague pick it up. Or a question escalates from one person to another.
So conversations aren’t private. Any team member can see and continue any conversation. Each message has an author attached to it so you know who said what, but access is shared. When someone picks up a conversation from yesterday, the rolling summary means they don’t need to read 50 messages to understand the context.
For a larger organisation, you’d want access controls. For a small team, shared by default means anyone can pick up where someone left off without a handover call.
The Architecture in Plain Terms
Three server functions handle everything.
One manages conversations as data — listing them, updating their titles, fetching message history. No AI involvement, just database operations.
One handles the actual AI interaction — receiving a message, building the context, calling the AI, running any tool lookups, storing the results. This one can take up to a minute because tool calling can involve multiple back-and-forth steps.
One runs in the background, triggered by an event rather than a user request, and generates the rolling summaries. It has a longer timeout because it’s dealing with more data and using the stronger model.
Everything lives in the same database we were already using, so no new infrastructure to pay for.
What We’d Do Differently
The token counting is rough. We estimate based on text length rather than using a precise counter. It works well enough for normal English text but gets less accurate for structured data. It hasn’t caused real problems, but we know it’s approximate.
The summary trigger is fixed at every 10 messages. A conversation with lots of short questions gets summarised as often as one with long, detailed exchanges. Something that adapts to how much text is accumulating would be smarter.
We also chose not to stream responses — the AI generates the whole response, then sends it. Streaming (where text appears word by word as it’s generated) adds complexity we didn’t need for an internal tool. At 2-4 second response times, it doesn’t noticeably matter. For a customer-facing product, we’d think harder about this.
What We Got
Persistent conversations that multiple people can share and continue. A rolling summary that keeps the AI oriented even in week-old conversations. Tools that let it look up real data when it needs to. And a cost model that stays reasonable: cheap AI for conversation, better AI for summaries only when needed, no vector search infrastructure, storage costs that are negligible.
Staff now start a conversation, ask a series of related questions over days, come back to it later, and the bot knows what they were working on. Someone else can jump in and continue. The work accumulates instead of disappearing every time you close the tab.
For an internal tool, the only metric that matters is whether people actually use it. They do.