When Your Read Endpoints Secretly Write: Refactoring Audit Logging with EventBridge
A routine IAM tightening broke staging. The root cause? Two read endpoints were secretly writing to DynamoDB. Fixing it properly taught us the difference between gates and side effects — and led to an architecture we should have had from the start.
The IAM Change That Broke Nothing (and Everything)
We were tightening IAM policies across our Lambda fleet. Standard security hygiene: each function gets the minimum permissions it needs, no more. PR #302 was straightforward. Read endpoints get read permissions. Write endpoints get write permissions. What could go wrong?
Staging broke.
Specifically, two endpoints stopped working: getGlobalUsers and getGlobalUserAnalytics. These are admin-facing read endpoints. Staff members use them to search across user accounts, view analytics, that sort of thing. They should need nothing more than dynamodb:GetItem and dynamodb:Query.
But they also needed dynamodb:PutItem. Buried in the handler logic, after fetching the data, these “read” endpoints were synchronously writing AuditLog and DataPrivacyAudit records directly to DynamoDB. Every time someone searched for a user, the Lambda would dutifully record who searched for what, validate privacy consent, and write the audit trail, all in the hot path of the response.
The IAM tightening removed write permissions from read Lambdas. The audit writes failed. The handlers threw. Users got 500s.
The Uncomfortable Truth About Coupling
The quick fix was obvious: add PutItem back to the read endpoints. PR #304 did exactly that. Staging was green again.
But the underlying problem was now impossible to ignore. We had read-only endpoints carrying write permissions because of a side effect that had nothing to do with their primary responsibility. This created several problems:
IAM policies lied. If you looked at the permissions for getGlobalUsers, you’d see DynamoDB write access and reasonably conclude it mutates data. It doesn’t. It reads user data. The writes are audit bookkeeping that the endpoint shouldn’t know about.
Failures cascaded wrongly. If the audit table had a throughput issue, or a schema change broke the audit write, the user search endpoint would fail. A staff member searching for a user would get a 500 error because an audit record couldn’t be written. That’s backwards. Audit logging is important, but it should never prevent the operation it’s auditing.
Testing was painful. Integration tests for getGlobalUsers had to mock DynamoDB write operations that were conceptually unrelated to the feature under test. The test setup was full of audit-related fixtures that obscured what the test was actually verifying.
The core issue was a conceptual one: we had conflated two very different things.
Gates vs Side Effects
This is where the insight crystallised. Our handlers were doing two kinds of “extra” work beyond their primary job:
Gates are synchronous checks that determine whether the operation should proceed. validateAccess() is a gate. It checks privacy consent, evaluates data access rules, and decides which fields the requester is allowed to see. If it fails, the request must fail. Gates are load-bearing. They belong in the hot path.
Side effects are things that happen because of an operation but don’t affect its outcome. recordPrivacyAudit() is a side effect. It writes a record of what happened for compliance purposes. If it fails, the user should still get their search results. The audit record can be retried, written later, or at worst flagged as missing. Side effects are important but not load-bearing. They don’t belong in the hot path.
Once we drew this line, the architecture became obvious. Gates stay synchronous. Side effects go async.
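A minimal sketch of that line, using hypothetical names and synchronous stand-ins for what would be async calls in a real Lambda. `runWithGates`, `searchUsers`, and the request shape are assumptions made for illustration. The only point is that a failing gate aborts the request, while a failing side effect is collected and ignored.

```typescript
// Hypothetical helper: gates may abort the request, side effects may not.
type Gate<Req> = (req: Req) => void; // throws to reject the request
type SideEffect<Req, Res> = (req: Req, res: Res) => void;

interface Outcome<Res> {
  response: Res;
  sideEffectErrors: unknown[];
}

function runWithGates<Req, Res>(
  req: Req,
  gates: Gate<Req>[],
  operation: (req: Req) => Res,
  sideEffects: SideEffect<Req, Res>[],
): Outcome<Res> {
  // Gates are load-bearing: any throw propagates and fails the request.
  for (const gate of gates) gate(req);

  const response = operation(req);

  // Side effects are not load-bearing: collect failures, never rethrow.
  const sideEffectErrors: unknown[] = [];
  for (const effect of sideEffects) {
    try {
      effect(req, response);
    } catch (error) {
      sideEffectErrors.push(error);
    }
  }
  return { response, sideEffectErrors };
}

// Stand-ins for validateAccess() and recordPrivacyAudit().
interface SearchRequest {
  role: string;
  query: string;
}

const validateAccess: Gate<SearchRequest> = (req) => {
  if (req.role !== 'staff') throw new Error('Forbidden');
};

const recordPrivacyAudit: SideEffect<SearchRequest, string[]> = () => {
  // Simulate the audit write failing after the IAM tightening.
  throw new Error('AccessDenied: dynamodb:PutItem');
};

const searchUsers = (req: SearchRequest): string[] => [`user matching ${req.query}`];

// The audit write fails, but the staff member still gets results.
const outcome = runWithGates(
  { role: 'staff', query: 'ada' },
  [validateAccess],
  searchUsers,
  [recordPrivacyAudit],
);
```

In this sketch the failed audit write ends up in `outcome.sideEffectErrors` instead of turning into a 500, while a request from a non-staff role would still throw at the gate.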
The Middleware Pattern
We use middy for our Lambda middleware stack. Middy has a clean lifecycle: before runs pre-handler, the handler runs, then after runs post-handler. The after phase is perfect for fire-and-forget side effects.
The solution has three parts:
1. Handlers declare audit intent. Instead of calling audit services directly, handlers attach an audit context to the incoming event. It’s a lightweight declaration of what happened:
    setAuditContext(rawEvent, {
      audit: {
        action: 'SEARCH',
        entityType: 'global_user_search',
        entityId: userContext.userId,
        dataAfter: {
          searchQuery: search || 'no-search-term',
          resultCount: result.data.length,
        },
        // ... actor and metadata fields
      },
      privacy: {
        operationType: 'GLOBAL_SEARCH',
        endpoint: '/global/users',
        dataFieldsAccessed: privacyResult.dataFieldsAllowed,
        recordCount: result.data.length,
        riskLevel: privacyResult.riskLevel,
        // ... consent and compliance fields
      },
    });
The handler returns its response normally. It never imports an audit service, never awaits a DynamoDB write, never needs write permissions.
2. Middleware emits events. The withAuditEvents middleware runs in the after phase. It reads the audit context from the request, and if present, publishes events to EventBridge:
    import type { MiddlewareObj } from '@middy/core';

    export function withAuditEvents(): MiddlewareObj {
      return {
        after: async (request) => {
          const auditContext = request.event.__auditContext;
          if (!auditContext) return;

          const promises: Promise<unknown>[] = [];
          if (auditContext.audit) promises.push(publishAuditEvent(auditContext.audit));
          if (auditContext.privacy) promises.push(publishPrivacyAuditEvent(auditContext.privacy));

          // allSettled never rejects, so failures are read from the results.
          const results = await Promise.allSettled(promises);
          for (const result of results) {
            if (result.status === 'rejected') {
              // Logged and swallowed. Audit failure must never break the response.
              logger.error('Failed to emit audit event', { error: result.reason });
            }
          }
        },
      };
    }
Two things to note. First, Promise.allSettled, not Promise.all. We want both events to attempt even if one fails. Second, failures are logged and swallowed. Promise.allSettled never rejects, so rather than relying on a catch block the middleware inspects each result and logs the rejected ones. By the time this middleware runs, the response is already formed. The user gets their data regardless.
3. Consumer Lambdas write to DynamoDB. Two separate Lambdas subscribe to EventBridge events and handle the actual writes. auditEventConsumer listens for audit.ActionLogged events and writes to the AuditLog table. privacyAuditConsumer listens for privacy.AccessRecorded events and writes to the DataPrivacyAudit table.
These consumers are single-purpose. They receive an event, write a record, done. They have DynamoDB write permissions because writing is literally their only job. If they fail, EventBridge retries with backoff. If they keep failing, events land in a dead-letter queue for investigation. No user-facing request is affected.
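The three parts can be sketched end to end in memory. Everything below is an assumption made for illustration: the `AuditIntent` shape is trimmed down, the bus name `platform-bus` and source `app.audit` are invented, and the consumer returns the item it would write instead of calling DynamoDB. The entry fields (`EventBusName`, `Source`, `DetailType`, `Detail`) follow the shape of an EventBridge PutEvents entry, and `id`, `time`, `detail-type`, and `detail` are fields of the event a Lambda target receives.

```typescript
// 1. Handler side: declare intent by attaching it to the event.
interface AuditIntent {
  action: string;
  entityType: string;
  entityId: string;
  dataAfter?: Record<string, unknown>;
}
interface AuditContext {
  audit?: AuditIntent;
}
type AuditableEvent = { __auditContext?: AuditContext };

function setAuditContext(target: AuditableEvent, context: AuditContext): void {
  target.__auditContext = context;
}

// 2. Middleware side: turn the intent into a PutEvents-style entry.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // PutEvents takes the detail as a JSON string
}

function toAuditEntry(audit: AuditIntent): PutEventsEntry {
  return {
    EventBusName: 'platform-bus', // assumed bus name
    Source: 'app.audit', // assumed source
    DetailType: 'audit.ActionLogged', // detail type the consumer listens for
    Detail: JSON.stringify(audit),
  };
}

// 3. Consumer side: map the delivered event to an AuditLog item.
interface DeliveredEvent {
  id: string;
  time: string;
  'detail-type': string;
  detail: AuditIntent;
}
interface AuditLogItem extends AuditIntent {
  eventId: string;
  recordedAt: string;
}

function toAuditLogItem(delivered: DeliveredEvent): AuditLogItem {
  if (delivered['detail-type'] !== 'audit.ActionLogged') {
    throw new Error(`Unexpected detail-type: ${delivered['detail-type']}`);
  }
  // Keeping the event id on the item lets a retried delivery overwrite
  // the same record (an assumption about the table's key design).
  return { ...delivered.detail, eventId: delivered.id, recordedAt: delivered.time };
}

// Walk one search through all three steps.
const rawEvent: AuditableEvent = {};
setAuditContext(rawEvent, {
  audit: { action: 'SEARCH', entityType: 'global_user_search', entityId: 'user-123' },
});

const entry = toAuditEntry(rawEvent.__auditContext!.audit!);

// Simulate what EventBridge would hand to the consumer Lambda.
const deliveredEvent: DeliveredEvent = {
  id: 'evt-1',
  time: '2024-01-01T00:00:00Z',
  'detail-type': entry.DetailType,
  detail: JSON.parse(entry.Detail) as AuditIntent,
};
const auditItem = toAuditLogItem(deliveredEvent);
```

Only the last function would touch DynamoDB in a real deployment, which is why only the consumer needs write permissions.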
The Infrastructure Was Already There (Mostly)
Here’s the satisfying part. When we went to build this, we discovered that most of the infrastructure already existed.
Our platform already had an EventBridge bus for notifications. The AuditEventPublisher service existed — it could publish audit.ActionLogged events. The auditEventConsumer Lambda existed too, with its DynamoDB write logic fully implemented. Someone had built all of this months earlier as part of a notifications infrastructure push.
What was missing? The EventBridge rule connecting the publisher to the consumer. The Lambda existed. The bus existed. The event schema existed. Nobody had wired them together. We had a fully functional postal service with no mail routes configured.
For the privacy audit side, we needed to create the publisher, the event types, and the consumer Lambda from scratch. But the pattern was already proven on the audit side.
The CDK changes added EventBridge rules, DLQs for both consumers, appropriate IAM policies, and a baseline events:PutEvents permission for all business Lambdas so any handler could publish audit events through the middleware.
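To make the missing piece concrete, here is a toy model of rule matching. It is not the EventBridge API and not the CDK code from our stack: `Rule`, `matchesPattern`, and `route` are hypothetical, the matcher only handles exact-value lists for `source` and `detail-type`, and the source name `app.audit` is invented. It shows why a bus with a publisher and a consumer, but no rule, delivers nothing.

```typescript
// Toy model: a rule is an event pattern plus the targets it routes to.
interface BusEvent {
  source: string;
  'detail-type': string;
}
interface Rule {
  pattern: { source?: string[]; 'detail-type'?: string[] };
  targets: string[];
}

// A field matches when the pattern omits it or lists the event's value.
function fieldMatches(allowed: string[] | undefined, value: string): boolean {
  return allowed === undefined || allowed.indexOf(value) !== -1;
}

function matchesPattern(pattern: Rule['pattern'], busEvent: BusEvent): boolean {
  return (
    fieldMatches(pattern.source, busEvent.source) &&
    fieldMatches(pattern['detail-type'], busEvent['detail-type'])
  );
}

// Return every target that should receive the event.
function route(rules: Rule[], busEvent: BusEvent): string[] {
  const targets: string[] = [];
  for (const rule of rules) {
    if (matchesPattern(rule.pattern, busEvent)) targets.push(...rule.targets);
  }
  return targets;
}

const auditEvent: BusEvent = { source: 'app.audit', 'detail-type': 'audit.ActionLogged' };

// Before: publisher and consumer exist, but no rule connects them.
const deliveredBefore = route([], auditEvent);

// After: one rule wires the detail type to the consumer.
const auditRule: Rule = {
  pattern: { 'detail-type': ['audit.ActionLogged'] },
  targets: ['auditEventConsumer'],
};
const deliveredAfter = route([auditRule], auditEvent);
```

With an empty rule list the event is accepted and then goes nowhere, which matches what we found: nothing errors, and nothing arrives.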
The Rollout
We took a deliberate, phased approach:
Phase 1: Read endpoints only. The two endpoints that broke staging — getGlobalUsers and getGlobalUserAnalytics — were migrated first. These were the most obviously wrong (read handlers with write side effects) and the most well-understood.
Phase 2: Write handlers. Two follow-up PRs migrated all write handlers (create, update, delete operations) to the same pattern. These were less urgent — a write handler having write permissions isn’t surprising — but the consistency and decoupling benefits still applied.
Phase 3: Test cleanup. Integration tests that previously had to mock DynamoDB audit writes could drop those mocks entirely. Tests for getGlobalUsers now tested user search logic. Tests for createProject now tested project creation logic. Audit logging was tested once, in the consumer’s own test suite.
Each phase was a separate PR. Each was deployed to staging and verified independently. We didn’t try to do it all at once.
What We Learned
Label your operations. The distinction between gates and side effects isn’t always obvious, but it’s always important. If an operation failing should fail the request, it’s a gate. If the request should succeed regardless, it’s a side effect. Side effects don’t belong in the hot path.
Check what you already have. We almost built a new event bus before discovering the existing one. We almost wrote a new consumer Lambda before finding one that was already deployed and waiting for events. Before building new infrastructure, inventory your existing infrastructure. You might be closer to done than you think.
IAM policies are documentation. When a read endpoint has write permissions, that’s a code smell. The permissions are telling you something about the architecture, and if they don’t match the endpoint’s stated purpose, something is coupled that shouldn’t be.
Swallow errors in the right places. The middleware catches and swallows audit publish failures. This felt wrong at first — shouldn’t we know about failures? We do. The error is logged, and the consumer Lambda has its own error handling with DLQ. But the user’s request succeeds. There’s a difference between “we need to know about this” and “this should block the user.” Logging and alerting handle the first. Only gates should do the second.
Half-built infrastructure is invisible infrastructure. An EventBridge rule that doesn’t exist is indistinguishable from an EventBridge bus that doesn’t exist. A Lambda that’s deployed but never invoked might as well not be deployed. When building event-driven systems, the wiring matters as much as the components. Document your event routes the way you document your API routes.
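One way to act on the “IAM policies are documentation” lesson is a small check that could run in CI. The sketch below is hypothetical: the `FunctionPolicy` shape, the get/list naming convention for read endpoints, and the list of write actions are all assumptions, and a real check would read the synthesized policies rather than a hand-written array.

```typescript
interface FunctionPolicy {
  functionName: string;
  actions: string[];
}
interface Finding {
  functionName: string;
  writeActions: string[];
}

// DynamoDB actions that mutate data (not exhaustive), lowercased
// because IAM action names are case-insensitive.
const WRITE_ACTIONS = [
  'dynamodb:putitem',
  'dynamodb:updateitem',
  'dynamodb:deleteitem',
  'dynamodb:batchwriteitem',
];

// Assumed convention: read endpoints are named get* or list*.
const READ_NAME = /^(get|list)[A-Z]/;

function findReadFunctionsWithWrites(policies: FunctionPolicy[]): Finding[] {
  const findings: Finding[] = [];
  for (const policy of policies) {
    if (!READ_NAME.test(policy.functionName)) continue;
    const writeActions = policy.actions.filter(
      (action) => WRITE_ACTIONS.indexOf(action.toLowerCase()) !== -1,
    );
    if (writeActions.length > 0) {
      findings.push({ functionName: policy.functionName, writeActions });
    }
  }
  return findings;
}

// The state after the quick fix: one read endpoint still carries PutItem.
const findings = findReadFunctionsWithWrites([
  { functionName: 'getGlobalUsers', actions: ['dynamodb:GetItem', 'dynamodb:Query', 'dynamodb:PutItem'] },
  { functionName: 'createProject', actions: ['dynamodb:PutItem'] },
  { functionName: 'getGlobalUserAnalytics', actions: ['dynamodb:Query'] },
]);
```

A non-empty result does not prove a bug, but it is the same question the broken build forced on us: why does this endpoint need these permissions?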
A routine PR to tighten IAM policies shouldn’t break staging. The fact that it did told us something important about our architecture. The fix wasn’t to relax the policies — it was to make the architecture match what the policies assumed: that read endpoints only read.
Sometimes the best refactors start with a broken build and the question: why does this endpoint need these permissions?
When Tightening Security Breaks Your App: A Lesson in Hidden Side Effects
Routine security work shouldn’t break your app. When it does, that’s the system telling you something important.
The PR That Broke Staging
We were doing standard security hygiene: tightening permissions so each function has exactly what it needs, nothing more. Read operations get read permissions. Write operations get write permissions. Simple.
Staging broke immediately.
Two features stopped working — both of them admin tools for searching across user accounts. These are read-only features. Staff enter a search, get results back. They should need nothing more than the ability to read from the database.
But they also needed write permissions. Because buried inside those read endpoints, after fetching the search results, the code was also writing audit records directly to the database. Every search secretly triggered two writes: one for the audit trail, one for privacy compliance.
Our security tightening removed the write permissions from read functions. The hidden writes failed. The whole endpoint failed. Staff got error screens.
The Quick Fix vs The Right Fix
The quick fix was obvious: add the write permissions back. We shipped that within the hour. Staging went green.
But now we couldn’t ignore the underlying problem. We had read-only features carrying write permissions because of hidden side effects that had nothing to do with what the feature was supposed to do.
This created three concrete problems:
The permissions were lying. Looking at a read endpoint with write permissions, any reasonable person would assume it mutates data. It didn’t. The writes were invisible bookkeeping happening after the main work was done.
Failures were cascading in the wrong direction. If the audit system had a problem, the user search broke. A staff member couldn’t find a user account because an audit record couldn’t be written. The audit trail is important, but it should never prevent the operation it’s auditing.
Tests were polluted. Every test for the search feature had to set up database mocking for audit writes that were conceptually irrelevant to what was being tested.
Gates vs Side Effects
This is where the solution became clear. Our endpoints were doing two very different kinds of “extra” work:
Gates are checks that determine whether the operation should happen. Validating that a user has permission to see certain data is a gate. If it fails, the request must fail. It belongs in the main path.
Side effects are things that happen because of an operation but don’t change whether it succeeds. Writing an audit record is a side effect. If it fails, the user should still get their search results. The audit record can be retried. It doesn’t belong in the main path.
Once we drew this line clearly, the solution was obvious. Gates stay synchronous. Side effects go asynchronous.
The New Architecture
Instead of writing audit records directly inside the handler, we split the system in two:
Handlers declare what happened. After finishing the main work, a handler attaches a lightweight note to the request: “this was a search, here’s what was searched, here’s what was returned.” Then the response goes out. No database writes, no audit imports, no write permissions needed.
A separate layer publishes events. Once the handler has built its response, middleware reads that note and publishes events onto our message bus (EventBridge). These events say “a search happened” and “data was accessed.” If the publish fails, it’s logged, and the user’s response goes out unchanged.
Dedicated consumers write to the database. Separate background functions subscribe to those events and do the actual database writes. They have write permissions because writing is literally their only job. If they fail, the message bus retries automatically. If they keep failing, the events go to a holding queue for investigation. No user-facing feature is ever affected.
The key design choice: if publishing the audit event fails, we swallow that error. This felt wrong at first. Shouldn’t we surface failures? We do — via logging and alerts, which the operations team can investigate. But the user’s request succeeds regardless. There’s a difference between “we need to know about this” and “this should block the user.”
The Infrastructure Was Already Half Built
Here’s the embarrassing-but-satisfying part. When we went to build this, we discovered that most of the infrastructure already existed.
We already had a message bus. The audit event publisher already existed. The consumer that writes to the audit table already existed — fully implemented, deployed to AWS, waiting for events.
Nobody had connected them. We had a fully functional postal service with no mail routes configured.
For the privacy audit side, we had to build from scratch. But the pattern was proven. We followed the same blueprint.
What We Took Away
The permission tightening that broke staging turned out to be valuable. It surfaced a design problem that had been quietly lurking. The correct response wasn’t to relax the permissions — it was to make the architecture match what the permissions assumed: that read endpoints only read.
The deeper lesson is about what IAM policies are actually telling you. When a read endpoint needs write permissions, that’s a signal. The permissions are documenting the architecture, and if they don’t match the stated purpose, something is coupled that shouldn’t be.
Infrastructure that’s “almost” wired up is invisible infrastructure. A deployed consumer Lambda that receives no events is indistinguishable from a Lambda that doesn’t exist. Event-driven systems are especially susceptible to this: all the components can be present but the connections between them are easy to miss. Document your event routes the way you document your API endpoints.
This post is part of a series on building out a startup’s backend platform. For the broader infrastructure context, see the runners post and the CloudFormation export lock.