When Your Read Endpoints Secretly Write: Refactoring Audit Logging with EventBridge

9 min read

A routine IAM tightening broke staging. The root cause? Two read endpoints were secretly writing to DynamoDB. Fixing it properly taught us the difference between gates and side effects — and led to an architecture we should have had from the start.


The IAM Change That Broke Nothing (and Everything)

We were tightening IAM policies across our Lambda fleet. Standard security hygiene: each function gets the minimum permissions it needs, no more. PR #302 was straightforward. Read endpoints get read permissions. Write endpoints get write permissions. What could go wrong?

Staging broke.

Specifically, two endpoints stopped working: getGlobalUsers and getGlobalUserAnalytics. These are admin-facing read endpoints that staff members use to search across user accounts and view analytics. They should need nothing more than dynamodb:GetItem and dynamodb:Query.

But they also needed dynamodb:PutItem. Buried in the handler logic, after fetching the data, these “read” endpoints were synchronously writing AuditLog and DataPrivacyAudit records directly to DynamoDB. Every time someone searched for a user, the Lambda would dutifully record who searched for what, validate privacy consent, and write the audit trail, all in the hot path of the response.

The IAM tightening removed write permissions from read Lambdas. The audit writes failed. The handlers threw. Users got 500s.
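To make the failure concrete, here is a minimal sketch of what the tightened policy amounts to. The statement shape mirrors an IAM policy document, but both table ARNs and the isAllowed helper are hypothetical, and the check is deliberately simplified: exact action and resource matches only, with no wildcards, conditions, or explicit denies.

```typescript
// Simplified model of an IAM policy statement. Real IAM evaluation also
// handles wildcards, conditions, and explicit denies; this sketch does not.
interface Statement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
}

// Hypothetical ARNs, for illustration only.
const USERS_TABLE_ARN = 'arn:aws:dynamodb:us-east-1:123456789012:table/Users';
const AUDIT_TABLE_ARN = 'arn:aws:dynamodb:us-east-1:123456789012:table/AuditLog';

// A read endpoint after the tightening: read actions only.
const readEndpointPolicy: Statement[] = [
  {
    Effect: 'Allow',
    Action: ['dynamodb:GetItem', 'dynamodb:Query'],
    Resource: USERS_TABLE_ARN,
  },
];

// Exact-match check: does any statement allow this action on this resource?
function isAllowed(policy: Statement[], action: string, resource: string): boolean {
  return policy.some(
    (s) => s.Effect === 'Allow' && s.Resource === resource && s.Action.includes(action),
  );
}

// The search itself is still permitted...
console.log(isAllowed(readEndpointPolicy, 'dynamodb:Query', USERS_TABLE_ARN)); // true
// ...but the hidden audit write is not, so the handler throws.
console.log(isAllowed(readEndpointPolicy, 'dynamodb:PutItem', AUDIT_TABLE_ARN)); // false
```

The policy was correct for what the endpoint claimed to do. The handler was the part that disagreed.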


The Uncomfortable Truth About Coupling

The quick fix was obvious: add PutItem back to the read endpoints. PR #304 did exactly that. Staging was green again.

But the underlying problem was now impossible to ignore. We had read-only endpoints carrying write permissions because of a side effect that had nothing to do with their primary responsibility. This created several problems:

IAM policies lied. If you looked at the permissions for getGlobalUsers, you’d see DynamoDB write access and reasonably conclude it mutates data. It doesn’t. It reads user data. The writes are audit bookkeeping that the endpoint shouldn’t know about.

Failures cascaded wrongly. If the audit table had a throughput issue, or a schema change broke the audit write, the user search endpoint would fail. A staff member searching for a user would get a 500 error because an audit record couldn’t be written. That’s backwards. Audit logging is important, but it should never prevent the operation it’s auditing.

Testing was painful. Integration tests for getGlobalUsers had to mock DynamoDB write operations that were conceptually unrelated to the feature under test. The test setup was full of audit-related fixtures that obscured what the test was actually verifying.

The core issue was a conceptual one: we had conflated two very different things.


Gates vs Side Effects

This is where the insight crystallised. Our handlers were doing two kinds of “extra” work beyond their primary job:

Gates are synchronous checks that determine whether the operation should proceed. validateAccess() is a gate. It checks privacy consent, evaluates data access rules, and decides which fields the requester is allowed to see. If it fails, the request must fail. Gates are load-bearing. They belong in the hot path.

Side effects are things that happen because of an operation but don’t affect its outcome. recordPrivacyAudit() is a side effect. It writes a record of what happened for compliance purposes. If it fails, the user should still get their search results. The audit record can be retried, written later, or at worst flagged as missing. Side effects are important but not load-bearing. They don’t belong in the hot path.

Once we drew this line, the architecture became obvious. Gates stay synchronous. Side effects go async.
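The difference in failure semantics can be sketched in a few lines. The function names validateAccess and recordPrivacyAudit come from our handlers, but the signatures, the stubbed bodies, and the 'staff-1' identifier are assumptions made for illustration. The side effect is still inline here; the point is only to show which errors propagate and which are contained.

```typescript
type SearchResult = { users: string[] };

// Gate: if this throws, the request must fail. (Stubbed rule for the sketch.)
async function validateAccess(requesterId: string): Promise<void> {
  if (requesterId !== 'staff-1') throw new Error('Access denied');
}

// Side effect: stubbed to always fail, to show it cannot change the outcome.
async function recordPrivacyAudit(_requesterId: string): Promise<void> {
  throw new Error('Audit store unavailable');
}

async function searchUsers(requesterId: string): Promise<SearchResult> {
  // Gate: no try/catch, so an error here fails the whole request.
  await validateAccess(requesterId);

  const result: SearchResult = { users: ['alice', 'bob'] };

  // Side effect: the error is logged and contained; the result is untouched.
  try {
    await recordPrivacyAudit(requesterId);
  } catch (error) {
    console.error('Audit failed; response unaffected', error);
  }

  return result;
}
```

Calling searchUsers('staff-1') resolves with both users even though the audit stub throws, while any other requester is rejected at the gate.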


The Middleware Pattern

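We use middy for our Lambda middleware stack. Middy has a clean lifecycle: before runs pre-handler, the handler runs, then after runs post-handler. The after phase is a natural home for side effects: it runs once the handler has produced its response, so a failure there can be contained without changing what the user gets.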

The solution has three parts:

1. Handlers declare audit intent. Instead of calling audit services directly, handlers set an audit context on the request object. It’s a lightweight declaration of what happened:

setAuditContext(rawEvent, {
  audit: {
    action: 'SEARCH',
    entityType: 'global_user_search',
    entityId: userContext.userId,
    dataAfter: {
      searchQuery: search || 'no-search-term',
      resultCount: result.data.length,
    },
    // ... actor and metadata fields
  },
  privacy: {
    operationType: 'GLOBAL_SEARCH',
    endpoint: '/global/users',
    dataFieldsAccessed: privacyResult.dataFieldsAllowed,
    recordCount: result.data.length,
    riskLevel: privacyResult.riskLevel,
    // ... consent and compliance fields
  },
});

The handler returns its response normally. It never imports an audit service, never awaits a DynamoDB write, never needs write permissions.
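For readers wondering what setAuditContext does, a minimal sketch follows. The __auditContext property name matches what our middleware reads, but the type shapes are trimmed-down assumptions, and merging repeated calls (rather than overwriting) is a design choice made for this sketch, not a description of the real helper.

```typescript
// Hypothetical shapes; the real payloads carry more fields than shown here.
interface AuditContext {
  audit?: Record<string, unknown>;
  privacy?: Record<string, unknown>;
}

// The raw Lambda event, widened with the slot the middleware reads from.
interface EventWithAudit {
  __auditContext?: AuditContext;
  [key: string]: unknown;
}

// Attach the declaration to the event. There is no I/O here, so this call
// cannot fail the request and needs no IAM permissions.
function setAuditContext(event: EventWithAudit, context: AuditContext): void {
  // Assumption: later calls merge with earlier ones instead of replacing them.
  event.__auditContext = { ...event.__auditContext, ...context };
}

const rawEvent: EventWithAudit = { path: '/global/users' };

setAuditContext(rawEvent, {
  audit: { action: 'SEARCH', entityType: 'global_user_search' },
});
setAuditContext(rawEvent, {
  privacy: { operationType: 'GLOBAL_SEARCH', endpoint: '/global/users' },
});

console.log(Object.keys(rawEvent.__auditContext ?? {})); // [ 'audit', 'privacy' ]
```

Because the helper only mutates an in-memory object, handler tests can assert on the declared context directly without mocking any AWS client.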

2. Middleware emits events. The withAuditEvents middleware runs in the after phase. It reads the audit context from the request, and if present, publishes events to EventBridge:

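import type { MiddlewareObj } from '@middy/core';
// publishAuditEvent, publishPrivacyAuditEvent, and logger come from the project's own modules.

export function withAuditEvents(): MiddlewareObj {
  return {
    after: async (request) => {
      const auditContext = request.event.__auditContext;
      if (!auditContext) return;

      try {
        const promises = [];
        if (auditContext.audit) promises.push(publishAuditEvent(auditContext.audit));
        if (auditContext.privacy) promises.push(publishPrivacyAuditEvent(auditContext.privacy));

        // allSettled never rejects; failures are reported in the results.
        const results = await Promise.allSettled(promises);
        for (const result of results) {
          if (result.status === 'rejected') {
            logger.error('Failed to emit audit event', { error: result.reason });
          }
        }
      } catch (error) {
        // Only reached if a publisher throws synchronously.
        logger.error('Failed to emit audit events', { error });
        // Swallowed. Audit failure must never break the response.
      }
    },
  };
}

Two things to note. First, Promise.allSettled, not Promise.all: we want both events to attempt even if one fails. Because allSettled never rejects, the middleware has to inspect each result and log the rejected ones; the surrounding catch only covers a publisher that throws synchronously. Second, failures are swallowed. By the time this middleware runs, the response is already formed, so the user gets their data regardless.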
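If the difference between the two combinators is unfamiliar, this small comparison may help. The two publishers are stand-ins invented for the sketch (one always succeeds, one always fails); only the Promise behaviour is the point.

```typescript
// Stand-ins for the two publishers: one succeeds, one fails.
const publishOk = (): Promise<string> => Promise.resolve('published');
const publishFail = (): Promise<string> =>
  Promise.reject(new Error('PutEvents throttled'));

async function compare(): Promise<{ allRejected: boolean; statuses: string[] }> {
  // Promise.all rejects as soon as any input rejects, and the caller
  // never sees the outcome of the other publishers.
  let allRejected = false;
  try {
    await Promise.all([publishFail(), publishOk()]);
  } catch {
    allRejected = true;
  }

  // Promise.allSettled always fulfils, with one outcome per input, in order.
  // Nothing is thrown, so failures must be read from the results.
  const results = await Promise.allSettled([publishFail(), publishOk()]);
  const statuses = results.map((r) => r.status);

  return { allRejected, statuses };
}

compare().then((outcome) => console.log(outcome));
// allRejected is true; statuses is ['rejected', 'fulfilled']
```

A try/catch around an awaited allSettled call will therefore never fire for a rejected publish, which is why failures have to be logged from the results.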

3. Consumer Lambdas write to DynamoDB. Two separate Lambdas subscribe to EventBridge events and handle the actual writes. auditEventConsumer listens for audit.ActionLogged events and writes to the AuditLog table. privacyAuditConsumer listens for privacy.AccessRecorded events and writes to the DataPrivacyAudit table.

These consumers are single-purpose. They receive an event, write a record, done. They have DynamoDB write permissions because writing is literally their only job. If they fail, EventBridge retries with backoff. If they keep failing, events land in a dead-letter queue for investigation. No user-facing request is affected.
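A consumer along these lines can be sketched as follows. The envelope fields (id, time, detail-type, detail) are standard EventBridge event fields, but the detail schema, the injected putItem function, and the choice of the event id as the item key are assumptions for illustration. The write is injected so the sketch runs without the AWS SDK.

```typescript
// Minimal EventBridge envelope; only the fields this sketch uses.
interface EventBridgeEvent<TDetail> {
  id: string;
  time: string;
  'detail-type': string;
  detail: TDetail;
}

// Hypothetical detail payload; the real schema has more fields.
interface AuditDetail {
  action: string;
  entityType: string;
  entityId: string;
}

// Hypothetical writer signature, standing in for a DynamoDB put.
type PutItem = (tableName: string, item: Record<string, unknown>) => Promise<void>;

function createAuditEventConsumer(putItem: PutItem, tableName: string) {
  return async (event: EventBridgeEvent<AuditDetail>): Promise<void> => {
    // Assumption: keying on the event id means a redelivered event
    // overwrites the same item instead of creating a duplicate.
    // No try/catch: a failed write should fail the invocation so it is retried.
    await putItem(tableName, { id: event.id, loggedAt: event.time, ...event.detail });
  };
}

// Usage with an in-memory writer in place of DynamoDB.
const written: Record<string, unknown>[] = [];
const consumer = createAuditEventConsumer(async (_table, item) => {
  written.push(item);
}, 'AuditLog');
```

Note the inversion compared with the middleware: the consumer deliberately lets errors propagate, because here a thrown error is what triggers the retry and dead-letter handling.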


The Infrastructure Was Already There (Mostly)

Here’s the satisfying part. When we went to build this, we discovered that most of the infrastructure already existed.

Our platform already had an EventBridge bus for notifications. The AuditEventPublisher service existed — it could publish audit.ActionLogged events. The auditEventConsumer Lambda existed too, with its DynamoDB write logic fully implemented. Someone had built all of this months earlier as part of a notifications infrastructure push.

What was missing? The EventBridge rule connecting the publisher to the consumer. The Lambda existed. The bus existed. The event schema existed. Nobody had wired them together. We had a fully functional postal service with no mail routes configured.

For the privacy audit side, we needed to create the publisher, the event types, and the consumer Lambda from scratch. But the pattern was already proven on the audit side.

The CDK changes added EventBridge rules, DLQs for both consumers, appropriate IAM policies, and a baseline events:PutEvents permission for all business Lambdas so any handler could publish audit events through the middleware.
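The routing that the rules provide can be modelled in a few lines. This is not CDK code and not how EventBridge evaluates patterns in full; it is a simplified sketch that matches on detail-type only, with hypothetical rule names, to show what a missing rule means in practice.

```typescript
// Simplified EventBridge-style rule: match on detail-type only.
// Real event patterns can also match on source, detail fields, prefixes, etc.
interface Rule {
  name: string; // hypothetical rule names
  detailTypes: string[];
  target: string; // consumer Lambda name
}

const rules: Rule[] = [
  { name: 'AuditRule', detailTypes: ['audit.ActionLogged'], target: 'auditEventConsumer' },
  {
    name: 'PrivacyAuditRule',
    detailTypes: ['privacy.AccessRecorded'],
    target: 'privacyAuditConsumer',
  },
];

// Return every target whose rule matches the event's detail-type.
function targetsFor(detailType: string, routes: Rule[]): string[] {
  return routes
    .filter((rule) => rule.detailTypes.includes(detailType))
    .map((rule) => rule.target);
}

// With the rules in place, each event type reaches its consumer.
console.log(targetsFor('audit.ActionLogged', rules)); // [ 'auditEventConsumer' ]

// With no rules, the publish still succeeds but nothing is invoked.
console.log(targetsFor('audit.ActionLogged', [])); // []
```

The empty result in the second call is the state we had been in: the publisher reported success and the consumer was deployed, yet no event ever arrived.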


The Rollout

We took a deliberate, phased approach:

Phase 1: Read endpoints only. The two endpoints that broke staging — getGlobalUsers and getGlobalUserAnalytics — were migrated first. These were the most obviously wrong (read handlers with write side effects) and the most well-understood.

Phase 2: Write handlers. Two follow-up PRs migrated all write handlers (create, update, delete operations) to the same pattern. These were less urgent — a write handler having write permissions isn’t surprising — but the consistency and decoupling benefits still applied.

Phase 3: Test cleanup. Integration tests that previously had to mock DynamoDB audit writes could drop those mocks entirely. Tests for getGlobalUsers now tested user search logic. Tests for createProject now tested project creation logic. Audit logging was tested once, in the consumer’s own test suite.

Each phase was a separate PR. Each was deployed to staging and verified independently. We didn’t try to do it all at once.


What We Learned

Label your operations. The distinction between gates and side effects isn’t always obvious, but it’s always important. If an operation failing should fail the request, it’s a gate. If the request should succeed regardless, it’s a side effect. Side effects don’t belong in the hot path.

Check what you already have. We almost built a new event bus before discovering the existing one. We almost wrote a new consumer Lambda before finding one that was already deployed and waiting for events. Before building new infrastructure, inventory your existing infrastructure. You might be closer to done than you think.

IAM policies are documentation. When a read endpoint has write permissions, that’s a code smell. The permissions are telling you something about the architecture, and if they don’t match the endpoint’s stated purpose, something is coupled that shouldn’t be.

Swallow errors in the right places. The middleware catches and swallows audit publish failures. This felt wrong at first — shouldn’t we know about failures? We do. The error is logged, and the consumer Lambda has its own error handling with DLQ. But the user’s request succeeds. There’s a difference between “we need to know about this” and “this should block the user.” Logging and alerting handle the first. Only gates should do the second.

Half-built infrastructure is invisible infrastructure. An EventBridge rule that doesn’t exist is indistinguishable from an EventBridge bus that doesn’t exist. A Lambda that’s deployed but never invoked might as well not be deployed. When building event-driven systems, the wiring matters as much as the components. Document your event routes the way you document your API routes.


A routine PR to tighten IAM policies shouldn’t break staging. The fact that it did told us something important about our architecture. The fix wasn’t to relax the policies — it was to make the architecture match what the policies assumed: that read endpoints only read.

Sometimes the best refactors start with a broken build and the question: why does this endpoint need these permissions?