
From Stateless Chatbot to Persistent Agent: Building an Internal AI Chat That Actually Remembers

12 min read

Every internal chatbot starts the same way. Someone wires up an LLM endpoint, slaps a text input on a page, and ships it. For about a week, it feels magical. Then someone asks a follow-up question and the bot has no idea what they’re talking about.

That was us. Our ops portal had a chat feature - a single Lambda (one of the TypeScript Lambdas we migrated from Rust) that took a message, sent it to Claude Haiku via Bedrock, and returned the response. No history. No memory. Every message was a fresh start. Staff would ask about a contractor, get a great answer, then ask “what about their financials?” and the bot would stare blankly into the void.

We needed persistence. We needed multi-turn context. We needed conversations that multiple staff members could share and continue over days. And we needed it without blowing our token budget on a 200K context window or introducing a vector database just for chat history.

This is the story of how we got there.

The Data Model: Two Entities, One Table

We were already running a single-table DynamoDB design with ElectroDB for the rest of the ops platform. Adding chat meant two new entities: Conversation and Message.

A Conversation carries the metadata: a ULID for the ID (time-sortable, which matters for listing), a title (auto-generated from the first message), a status, a rolling summary, a summaryUpToMessageId pointer, and a message count. The summary fields are the interesting ones - we will come back to those.

A Message has the basics - role, content, timestamp - plus tokenEstimate (a rough char/4 approximation), toolCalls (the full tool call payload, stored as a JSON blob), and a chunkId field that is null for now. That last one is a future upgrade path for chunked streaming responses. We put it in the schema now because adding attributes to a DynamoDB entity after the fact is free, but regretting your access patterns is not.
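As a rough sketch, the two entities could be typed along these lines. These are plain TypeScript shapes, not the ElectroDB schema itself. The `payload` field on tool calls, the ISO timestamp format, and the `newMessage` helper are assumptions made for illustration.

```typescript
type ConversationStatus = "active" | "archived";

interface ConversationItem {
  conversationId: string;              // ULID, time-sortable
  title: string;                       // auto-generated from the first message
  status: ConversationStatus;
  summary: string | null;              // rolling summary, null until one exists
  summaryUpToMessageId: string | null; // last message covered by the summary
  messageCount: number;
}

interface ToolCallRecord {
  toolName: string;
  payload: unknown; // assumed name for the stored tool call data
}

interface MessageItem {
  conversationId: string;
  messageId: string;                   // ULID
  role: "user" | "assistant";
  content: string;
  author: string | null;
  timestamp: string;                   // assumed to be ISO 8601
  tokenEstimate: number;               // rough char/4 approximation
  toolCalls: ToolCallRecord[] | null;
  chunkId: string | null;              // reserved for chunked streaming, null for now
}

// Hypothetical helper: fills in the derived and reserved fields.
function newMessage(
  input: Pick<MessageItem, "conversationId" | "messageId" | "role" | "content" | "author">,
  toolCalls: ToolCallRecord[] | null = null,
): MessageItem {
  return {
    ...input,
    timestamp: new Date().toISOString(),
    tokenEstimate: Math.ceil(input.content.length / 4),
    toolCalls,
    chunkId: null,
  };
}

const exampleConversation: ConversationItem = {
  conversationId: "01HZZZZZZZ0000000000000000", // made-up ULID
  title: "Contractor financials",
  status: "active",
  summary: null,
  summaryUpToMessageId: null,
  messageCount: 0,
};
```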

The partition key layout keeps things simple: CONVERSATION#{id} as the PK, with the conversation metadata on SK CONVERSATION and messages on SK MESSAGE#{messageId}. ULIDs as message IDs give us chronological ordering for free on the sort key. A GSI on status lets us list active conversations without scanning the table.
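The key layout can be sketched as two small builder functions. The attribute names `pk` and `sk` and the sample IDs are assumptions; ElectroDB generates the real key strings from the entity definition.

```typescript
interface TableKey {
  pk: string;
  sk: string;
}

function conversationKey(conversationId: string): TableKey {
  return { pk: `CONVERSATION#${conversationId}`, sk: "CONVERSATION" };
}

function messageKey(conversationId: string, messageId: string): TableKey {
  return { pk: `CONVERSATION#${conversationId}`, sk: `MESSAGE#${messageId}` };
}

// DynamoDB returns items within a partition ordered by sort key.
// A plain string sort is a close stand-in for that ordering here.
function sortKeysInQueryOrder(keys: TableKey[]): string[] {
  return keys.map((k) => k.sk).sort();
}

// Made-up ULIDs: the second message was created after the first.
const sampleConversationId = "01HZZZZZZZ0000000000000000";
const sampleKeys = [
  messageKey(sampleConversationId, "01HZZZZZZZ0000000000000002"),
  conversationKey(sampleConversationId),
  messageKey(sampleConversationId, "01HZZZZZZZ0000000000000001"),
];
const orderedSortKeys = sortKeysInQueryOrder(sampleKeys);
```

Because ULIDs sort lexicographically by creation time, the metadata item comes first and the messages follow in chronological order.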

Context Engineering: The Hard Part

Here is the core problem. A long-running conversation might have 200 messages. Claude’s context window could technically hold all of them. But should it?

No. For three reasons:

  1. Cost scales linearly with input tokens. Every message in context is tokens you pay for on every turn. A 200-message conversation with tool call results could easily hit 50K tokens of input - and you are paying that on every single response.

  2. Latency scales with input size. More tokens in, more time to first token out. For an internal tool where staff expect snappy responses, we are targeting sub-5-second round trips. Dumping the entire history into context makes that impossible.

  3. Relevance decays with distance. Message 3 from last Tuesday is almost certainly less relevant than message 198 from five minutes ago. Treating all context as equally important is a waste of the model’s attention.

So we built a two-layer context system: a sliding window of recent messages, plus a rolling summary of everything that came before.

The Token Budget

We carved up the input like this:

  • System prompt + tool definitions: ~2K tokens (fixed cost)
  • Rolling summary: ~1.5K tokens (grows slowly, bounded by summarisation prompt)
  • Recent message window: ~8K tokens (the sliding window)
  • New user message: ~0.5K tokens (bounded by input validation)
  • Tool call headroom: ~3K tokens (for tool results in the current turn)

Total input: roughly 15K tokens per turn. That is well within the 200K limit, fast to process, and cheap to run. The sliding window gives the model full fidelity on recent context, and the summary gives it the gist of everything older.
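The budget is easy to keep honest if it lives in one place as constants. A minimal sketch, with the constant and function names chosen for illustration:

```typescript
interface TokenBudget {
  systemAndTools: number;
  rollingSummary: number;
  recentWindow: number;
  newUserMessage: number;
  toolHeadroom: number;
}

const TOKEN_BUDGET: TokenBudget = {
  systemAndTools: 2000, // system prompt + tool definitions
  rollingSummary: 1500,
  recentWindow: 8000,   // the sliding window
  newUserMessage: 500,
  toolHeadroom: 3000,   // tool results in the current turn
};

const CONTEXT_LIMIT = 200000;

function totalInputTokens(budget: TokenBudget): number {
  return (
    budget.systemAndTools +
    budget.rollingSummary +
    budget.recentWindow +
    budget.newUserMessage +
    budget.toolHeadroom
  );
}

// 15000 tokens, which is 7.5% of the 200K limit.
const plannedInput = totalInputTokens(TOKEN_BUDGET);
const shareOfContext = plannedInput / CONTEXT_LIMIT;
```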

Building the Window

The sliding window is token-budgeted, not fixed-count. We walk backwards through the message history, accumulating token estimates until we hit the 8K budget:

function buildSlidingWindow(messages, budget) {
  // `messages` must be ordered newest-first; the result is oldest-first.
  let accumulated = 0;
  const window = [];

  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    let content = msg.content;
    let cost = msg.tokenEstimate;

    // Tool calls beyond the 3 most recent messages are collapsed to a stub
    if (msg.toolCalls && i >= 3) {
      const stub = msg.toolCalls
        .map(tc => `[tool: ${tc.toolName ?? 'unknown'}]`)
        .join(', ');
      if (stub) {
        content = stub;
        // Charge the budget for the stub, not the original payload
        cost = Math.ceil(stub.length / 4);
      }
    }

    if (accumulated + cost > budget) break;
    accumulated += cost;
    window.push({ role: msg.role, content });
  }

  return window.reverse();
}

Two things worth noting. First, messages come in from DynamoDB in reverse chronological order (we query with order: 'desc'), so we are naturally walking from newest to oldest. The final reverse() puts them back in chronological order for the prompt.

Second, tool call results get special treatment. The three most recent messages keep their full tool call payloads - the model needs those to understand the current thread of reasoning. Older tool calls get collapsed to a stub like [tool: searchContractors]. The full results are still in DynamoDB if anyone needs them, but they would just waste tokens in the prompt. A single contractor search result can easily be 2K tokens. Three of those from earlier in the conversation and your window is half gone.

Prompt Assembly

The final prompt is assembled in layers:

  1. The system prompt (static, defines the agent’s role and available tools)
  2. If a rolling summary exists, it gets injected as a system message: <summary>...</summary>
  3. The sliding window of recent messages
  4. The new user message

This gives the model a clear information hierarchy: who am I, what happened before (summary), what just happened (window), what does the user want now.
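The four layers above can be sketched as a single function. The `PromptMessage` shape and the function name are assumptions; the real message format depends on the SDK in use.

```typescript
interface PromptMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function assemblePrompt(
  systemPrompt: string,
  summary: string | null,
  recentWindow: PromptMessage[],
  userMessage: string,
): PromptMessage[] {
  // Layer 1: static system prompt
  const prompt: PromptMessage[] = [{ role: "system", content: systemPrompt }];

  // Layer 2: rolling summary, only if one has been generated
  if (summary) {
    prompt.push({ role: "system", content: `<summary>${summary}</summary>` });
  }

  // Layer 3: sliding window, already in chronological order
  for (const message of recentWindow) {
    prompt.push(message);
  }

  // Layer 4: the new user message
  prompt.push({ role: "user", content: userMessage });
  return prompt;
}
```

A conversation with no summary yet simply skips the second layer, so new conversations need no special case.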

Rolling Summaries: The Long-Term Memory

The sliding window handles the last few minutes of conversation. But what about the last few days?

That is what the rolling summary does. After every ~10 turns, we fire an EventBridge event (conversation.summary.needed) that triggers a dedicated summary Lambda. This Lambda runs asynchronously - the user gets their response immediately, and the summary generates in the background.
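A minimal sketch of the trigger decision, assuming the turn count is tracked on the conversation and the event detail carries only the conversation ID. Both are assumptions, and the function returns the event rather than publishing it to EventBridge.

```typescript
const SUMMARY_EVERY_TURNS = 10;

interface SummaryNeededEvent {
  detailType: "conversation.summary.needed";
  detail: { conversationId: string };
}

// Returns the event to publish after this turn, or null if no summary is due.
function summaryEventFor(conversationId: string, turnCount: number): SummaryNeededEvent | null {
  if (turnCount <= 0 || turnCount % SUMMARY_EVERY_TURNS !== 0) {
    return null;
  }
  return {
    detailType: "conversation.summary.needed",
    detail: { conversationId },
  };
}
```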

The summary Lambda fetches all messages, finds everything between the last summary point and the current window, and asks a stronger model to produce a condensed summary. The key detail: we use Sonnet for summaries and Haiku for conversation turns.

This is a deliberate cost/quality trade-off. Haiku is fast and cheap - perfect for the interactive loop where latency matters. But summarisation is a harder task. You need to identify what matters, preserve named entities and decisions, and discard noise. Sonnet does this meaningfully better than Haiku, and since summaries only generate every ~10 turns (not on every message), the cost impact is negligible.

The summarisation prompt is specific about what to preserve:

Key decisions and conclusions. Important facts and data points. Action items or open questions. Named entities - people, companies, projects.

If a previous summary exists, it gets included as context, so the new summary is an update rather than a fresh take. This means the summary evolves incrementally - it does not lose information from 50 messages ago just because the most recent batch did not mention it.
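The incremental update could be assembled roughly like this. Only the four preserved categories come from the prompt described above; the surrounding wording and the function name are invented for the sketch.

```typescript
interface TranscriptLine {
  role: string;
  content: string;
}

function buildSummaryPrompt(
  previousSummary: string | null,
  unsummarised: TranscriptLine[],
): string {
  const sections: string[] = [
    "Summarise the conversation below. Preserve:",
    "- Key decisions and conclusions",
    "- Important facts and data points",
    "- Action items or open questions",
    "- Named entities: people, companies, projects",
  ];

  // Including the old summary makes the result an update, not a fresh take.
  if (previousSummary) {
    sections.push("", "Existing summary (update it, keep what is still relevant):", previousSummary);
  }

  sections.push("", "New messages:");
  for (const line of unsummarised) {
    sections.push(`${line.role}: ${line.content}`);
  }
  return sections.join("\n");
}
```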

Concurrency Safety

What happens if two users send messages at the same time and both trigger a summary? DynamoDB conditional writes handle this. The summary Lambda writes the new summary along with a summaryUpToMessageId pointer. If two summary invocations race, only one write succeeds - the other sees a stale pointer and either retries or no-ops. An EventBridge dead-letter queue catches anything that falls through.
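The race can be modelled in memory as a compare-and-set on the pointer. This illustrates the idea rather than the DynamoDB call itself: in DynamoDB the check would be a ConditionExpression on the pointer attribute, and the exact condition used here is an assumption.

```typescript
interface SummaryState {
  summary: string | null;
  summaryUpToMessageId: string | null;
}

class ConversationSummaryStore {
  private state: SummaryState = { summary: null, summaryUpToMessageId: null };

  read(): SummaryState {
    return { ...this.state };
  }

  // Succeeds only if the pointer is still the one the caller read earlier.
  writeSummary(expectedPointer: string | null, summary: string, newPointer: string): boolean {
    if (this.state.summaryUpToMessageId !== expectedPointer) {
      return false; // another invocation got there first
    }
    this.state = { summary, summaryUpToMessageId: newPointer };
    return true;
  }
}
```

Two invocations that read the same pointer cannot both win: whichever writes second fails its condition and can re-read the state or give up.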

Tools, Not RAG

The original chat had an automatic RAG step: embed the user’s message, search Pinecone, stuff the results into context, then send everything to the model. Every single message triggered a vector search, whether relevant or not.

We replaced this with tool-based retrieval. The model gets tool definitions - searchContractors, getContractorDetail, queryDataWarehouse - and decides when to call them. The Vercel AI SDK’s generateText with maxSteps: 5 handles the tool calling loop: the model requests a tool call, we execute it, feed the result back, and the model continues.
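The shape of that loop, stripped of the SDK, looks roughly like the sketch below. It is a dependency-free illustration, not the Vercel AI SDK's implementation: the model and tools are plain synchronous functions here, whereas the real calls are awaited, and all type names are hypothetical.

```typescript
type ModelStep =
  | { kind: "tool-call"; toolName: string; args: Record<string, unknown> }
  | { kind: "text"; text: string };

type ToolFn = (args: Record<string, unknown>) => unknown;
type ModelFn = (transcript: string[]) => ModelStep;

interface LoopResult {
  text: string;
  toolCalls: string[];
}

function runToolLoop(
  model: ModelFn,
  tools: Record<string, ToolFn>,
  userMessage: string,
  maxSteps: number = 5,
): LoopResult {
  const transcript: string[] = ["user: " + userMessage];
  const toolCalls: string[] = [];

  for (let step = 0; step < maxSteps; step++) {
    const next = model(transcript);
    if (next.kind === "text") {
      return { text: next.text, toolCalls }; // final answer, stop looping
    }
    const tool = tools[next.toolName];
    const result = tool ? tool(next.args) : { error: "unknown tool " + next.toolName };
    toolCalls.push(next.toolName);
    // Feed the tool result back so the model can continue from it.
    transcript.push("tool " + next.toolName + ": " + JSON.stringify(result));
  }

  // Step limit reached without a final answer.
  return { text: "", toolCalls };
}
```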

This is better for three reasons:

  1. Precision. The model calls tools when it actually needs data, not on every turn. “What’s 2+2?” does not trigger a contractor database search.

  2. Transparency. Tool calls are explicit and logged. We store the full toolCalls array on the assistant message. Staff can see exactly what the bot looked up.

  3. Composability. Adding a new data source means adding a new tool definition - a Zod schema and an execute function. We recently added queryDataWarehouse, a text-to-SQL tool that queries business data via DuckDB against Parquet exports. It took about a day to add. With RAG, adding a new data source means figuring out embeddings, chunking strategy, index configuration, and retrieval thresholds.

The queryDataWarehouse tool is particularly fun. The model writes SQL, we validate it is SELECT-only, resolve table names to Parquet file paths (auto-discovering the latest export date), enforce a row LIMIT, and run it through DuckDB. The model gets to query accounts, users, projects, quotes, and partner profiles with full SQL semantics - JOINs, aggregations, date functions, the lot.
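One way to implement those guard rails is to wrap the model's SQL rather than rewrite it. In this sketch the `exports/<date>/<table>.parquet` layout, the 1000-row cap, and the function names are all assumptions, and table names are assumed to be simple identifiers. A prefix check like this is not a complete sandbox, since a SELECT can still call file-reading functions.

```typescript
const MAX_ROWS = 1000; // hypothetical cap

// ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
function latestExport(dates: string[]): string {
  return dates.slice().sort()[dates.length - 1];
}

// exportsByTable maps each logical table name to its available export dates.
function prepareQuery(sql: string, exportsByTable: Record<string, string[]>): string {
  const body = sql.trim().replace(/;\s*$/, "");
  if (!/^select\b/i.test(body)) {
    throw new Error("Only SELECT statements are allowed");
  }
  if (body.indexOf(";") !== -1) {
    throw new Error("Multiple statements are not allowed");
  }

  // Expose each referenced table as a CTE over its latest Parquet export.
  const ctes: string[] = [];
  for (const table of Object.keys(exportsByTable)) {
    const dates = exportsByTable[table];
    const referenced = new RegExp("\\b" + table + "\\b", "i").test(body);
    if (!referenced || dates.length === 0) continue;
    const path = `exports/${latestExport(dates)}/${table}.parquet`;
    ctes.push(`${table} AS (SELECT * FROM read_parquet('${path}'))`);
  }

  const withClause = ctes.length > 0 ? `WITH ${ctes.join(", ")} ` : "";
  // The outer LIMIT caps the result even if the inner query asks for more.
  return `${withClause}SELECT * FROM (${body}) AS limited LIMIT ${MAX_ROWS}`;
}
```

Wrapping keeps aliases and JOINs in the model's SQL untouched, at the cost of rejecting queries that start with their own WITH clause.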

Three Lambdas, Two Patterns

The backend is split into three Lambdas with very different profiles:

conversation-api handles CRUD operations: list conversations, get a conversation, update title/status, list messages with cursor-based pagination. It is a pure data Lambda - no LLM calls, no tool execution. Timeout: 30 seconds (generous for DynamoDB reads).

conversation-chat handles the actual AI interaction: create a conversation with a first message, or send a message to an existing one. It builds the context window, calls Bedrock, executes tools, stores messages, and fires summary events. Timeout: 60 seconds (tool calling loops can take a while).

summary-generator runs async via EventBridge. It fetches the full message history, builds the summarisation prompt, calls Sonnet, and writes the result back. Timeout: 120 seconds (Sonnet is slower, and large conversations mean large prompts).

We use a handler factory pattern - createApiHandler for HTTP request/response Lambdas, createEventHandler for EventBridge-triggered ones - that handles routing, body parsing, auth claims extraction, and error wrapping. The actual handler code reads like a script: get data, do thing, return result.
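To make the pattern concrete, here is a simplified sketch of what a factory like createApiHandler might look like. The request and response shapes, the `sub` claim, and the example handler are assumptions. Routing is left out, and the handler is synchronous where a real Lambda handler would be async.

```typescript
interface ApiRequest {
  body: string | null;
  claims: Record<string, string>;
}

interface ApiResponse {
  statusCode: number;
  body: string;
}

interface HandlerInput<T> {
  payload: T;
  userId: string;
}

function jsonResponse(statusCode: number, data: unknown): ApiResponse {
  return { statusCode, body: JSON.stringify(data) };
}

function createApiHandler<T, R>(
  handler: (input: HandlerInput<T>) => R,
): (req: ApiRequest) => ApiResponse {
  return (req) => {
    // Auth claims extraction
    const userId = req.claims["sub"];
    if (!userId) return jsonResponse(401, { error: "Unauthorised" });

    // Body parsing
    let payload: T;
    try {
      payload = (req.body ? JSON.parse(req.body) : {}) as T;
    } catch {
      return jsonResponse(400, { error: "Invalid JSON body" });
    }

    // Error wrapping around the business logic
    try {
      return jsonResponse(200, handler({ payload, userId }));
    } catch (err) {
      const message = err instanceof Error ? err.message : "Unknown error";
      return jsonResponse(500, { error: message });
    }
  };
}

// The handler itself stays script-like: get data, do thing, return result.
const renameConversation = createApiHandler<{ title: string }, { title: string; updatedBy: string }>(
  ({ payload, userId }) => ({ title: payload.title.trim(), updatedBy: userId }),
);
```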

Shared Conversations

One of the requirements that shaped the whole design was shared conversations. In our ops team, someone starts investigating a contractor issue, gets halfway through, and then hands it off when their shift ends. Or a question escalates from one person to another.

Because conversations are just DynamoDB entities with no user-scoped partition key, any authenticated staff member can see and continue any conversation. The author field on each message tracks who said what, but access is shared by default. The conversation list endpoint filters by status (active/archived), not by user.

This is a deliberate choice. In a larger org you would want access controls. For a small ops team, the shared-by-default model means anyone can pick up where someone else left off, and the rolling summary means they get caught up without reading 50 messages.

What We Would Do Differently

Token estimation is a hack. Math.ceil(text.length / 4) is a rough approximation that works well enough for English text but falls apart for structured data, code, or anything with lots of whitespace. A proper tokeniser would be more accurate, but it also adds a dependency and latency to every message. The approximation has not caused problems yet - the window just ends up slightly under or over budget - but it is technical debt we are aware of.
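The weakness is easy to demonstrate: the estimate depends on character count alone, so the same data produces different numbers depending on how it is formatted. The sample record below is made up.

```typescript
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const proseEstimate = estimateTokens("hello world"); // 11 chars -> 3

// Identical data, two formattings: the indented version scores higher
// only because of the extra whitespace.
const sampleRecord = { contractor: "Acme", quotes: [1, 2, 3] };
const compactEstimate = estimateTokens(JSON.stringify(sampleRecord));
const prettyEstimate = estimateTokens(JSON.stringify(sampleRecord, null, 2));
```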

The summary threshold is fixed. Summaries trigger every 10 turns, regardless of conversation density. A conversation where someone is firing off quick clarification questions generates summaries at the same rate as one with long, detailed exchanges. An adaptive threshold based on token accumulation rather than turn count would be smarter.
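A token-based trigger might look like the sketch below. It is a possible direction rather than something that exists in the system today, and the 6000-token threshold is a placeholder value.

```typescript
const SUMMARY_TOKEN_THRESHOLD = 6000; // hypothetical value

// Takes the tokenEstimate of every message since the last summary.
function shouldSummarise(
  tokenEstimatesSinceLastSummary: number[],
  threshold: number = SUMMARY_TOKEN_THRESHOLD,
): boolean {
  let total = 0;
  for (const estimate of tokenEstimatesSinceLastSummary) {
    total += estimate;
  }
  return total >= threshold;
}
```

Twenty short clarification messages at around 50 tokens each stay under the threshold, while four long tool-heavy turns at 2000 tokens each cross it.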

Streaming. We use generateText, not streamText. The original implementation used streaming, but tool calling with streams added complexity we did not need for an internal tool. Users get the full response after the model finishes. For external-facing products, streaming would be non-negotiable - but for an internal ops chat where responses take 2-4 seconds, the simplicity trade-off is worth it.

The Outcome

What we shipped: a persistent, multi-turn AI agent with tool use, shared conversations, rolling context summaries, and a token-budgeted sliding window. Three Lambdas, two ElectroDB entities, one DynamoDB table (that we were already paying for), and an EventBridge rule.

The cost model is straightforward. Haiku for conversation turns is cheap - a few cents per conversation per day at our volume. Sonnet summaries happen infrequently and are bounded in size. DynamoDB costs are negligible (single-digit RCU/WCU). The most expensive part is the tool calling - each tool call is an additional Bedrock invocation - but the model is judicious about when it calls tools, which is the whole point of using tools instead of automatic RAG.

Staff now start a conversation, ask a series of related questions, leave, come back the next day, and the bot knows what they were talking about. Someone else can jump in and continue the thread. The summary means even conversations from weeks ago have useful context when reopened.

It is not a particularly novel architecture. Sliding window plus summary is a well-known pattern. Tool use instead of RAG is increasingly standard. DynamoDB single-table design is old hat. But the combination - assembled with attention to the token budget, the cost/quality split between models, the concurrency safety of summary generation, and the practical reality of a small team sharing conversations - works well enough that people actually use it. And for an internal tool, that is the only metric that matters.