When ESM Meets CJS: How Silent Bundling Failures Broke 98 Lambdas
We migrated our Lambda deployments from CDK to OpenTofu. Everything looked green — CI passed, staging deployed, production promoted. Days later, we discovered 98 out of 160 Lambda handlers were silently broken. Here’s how three invisible failures stacked up to create a production incident nobody saw coming.
The Setup
Our backend is a monorepo with ~160 Lambda handlers, all TypeScript, bundled with tsup/esbuild into CommonJS. Large shared dependencies — @aws-sdk, Sentry, ElectroDB, Pino, Zod — live in a Lambda Layer so they’re not duplicated across every handler.
As part of our road to continuous delivery, we moved to pre-built artifacts uploaded to S3 and deployed via OpenTofu. No more CDK’s NodejsFunction bundling at deploy time. Faster deploys, deterministic artifacts, separated concerns.
That separation is what got us.
The Silent Failure
Somewhere around May 7th — eleven days after production launch — a dependency update pulled in @middy/core v6, which is ESM-only. Its package.json has "type": "module". We’d been on v5 (CJS-compatible) until then. A routine version bump. Happens all the time.
Here’s what tsup/esbuild does when you tell it to produce CJS output and it encounters an ESM-only package:
Nothing.
No error. No warning. It silently leaves a require("@middy/core") in the output as if it were an intentional external — like @aws-sdk in our Lambda Layer. The bundle builds. The exit code is 0. The output looks right if you don’t squint.
The result: 98 handler bundles with unresolved require() calls that crash on Lambda cold start.
Runtime.ImportModuleError: Error: Cannot find module '@middy/core'
Every handler that used Middy was broken. We just didn’t know it yet.
The Triple Mask
Three independent systems conspired to hide the problem for over a week.
Mask 1: Turbo Remote Cache
We use Turborepo with S3 remote caching for build artifacts. Old builds — from before Middy went ESM-only — were cached and kept being served. As long as a handler’s source files didn’t change, Turbo returned the old working bundle.
This is the insidious part: the broken build config was already in place, but most handlers never got rebuilt. First cache miss? Broken bundle deployed. But you don’t know which deploy introduced it because the build tool says everything’s fine.
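The masking effect can be reproduced with a toy input-hash cache. The sketch below is a simplified model, not Turborepo's real hashing algorithm, and every name in it (`CacheInputs`, `cacheKey`, `buildWithCache`) is hypothetical. It assumes the cache key covers only the handler's source files and build config, which matches the behaviour described above.

```typescript
import { createHash } from "node:crypto";

// Toy model of an input-hash build cache. NOT Turborepo's algorithm;
// all names are hypothetical and exist only for illustration.
type CacheInputs = { sourceFiles: Record<string, string>; buildConfig: string };

function cacheKey(inputs: CacheInputs): string {
  const hash = createHash("sha256");
  // Sort by file name so the key does not depend on object key order.
  const files = Object.entries(inputs.sourceFiles).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  for (const [name, content] of files) {
    hash.update(name).update("\0").update(content).update("\0");
  }
  hash.update(inputs.buildConfig);
  return hash.digest("hex");
}

const cache = new Map<string, string>();

// Returns the cached artifact when the key matches; `build` only runs on a miss.
function buildWithCache(inputs: CacheInputs, build: () => string): string {
  const key = cacheKey(inputs);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const artifact = build();
  cache.set(key, artifact);
  return artifact;
}
```

In this model, a build that would now produce a broken bundle is never executed as long as the key is unchanged, so the old artifact keeps shipping until something the key covers is edited.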
Mask 2: Warm Lambda Instances
Lambda keeps instances warm for roughly 15 minutes after the last invocation. Broken bundles only crash on cold start — when the runtime loads index.js for the first time. Warm instances from before the broken deploy continued serving requests normally.
Gradual cold start failures looked like intermittent issues, not a systemic problem. CloudWatch showed occasional errors mixed with healthy invocations. The kind of noise you monitor but don’t panic about.
Mask 3: CI Passed
pnpm turbo run build # ✅ Turbo cache hit — returned old working bundle
pnpm turbo run typecheck # ✅ TypeScript doesn't check runtime module resolution
pnpm turbo run test # ✅ Vitest resolves modules differently than Lambda runtime
Every gate was green. No assertion in our pipeline checked that the built bundles were actually self-contained.
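One way to close that gap is a load-time smoke test that requires each built bundle in a fresh Node.js process, which approximates the module load Lambda performs on a cold start. This is a minimal sketch under assumptions: the helper names are hypothetical, bundles are assumed to be files named `index.js`, and `layerNodeModules` is an assumed path to an unpacked copy of the Layer's `node_modules`.

```typescript
import { spawnSync } from "node:child_process";
import * as fs from "node:fs";
import * as path from "node:path";

export interface LoadResult {
  bundle: string;
  ok: boolean;
  stderr: string;
}

// Hypothetical helper: require() one bundle in a fresh Node.js process.
// NODE_PATH lets externals resolve from an unpacked Layer, if one is given.
export function loadBundle(bundle: string, layerNodeModules?: string): LoadResult {
  const env = { ...process.env };
  if (layerNodeModules) env.NODE_PATH = layerNodeModules;
  const result = spawnSync(
    process.execPath,
    ["-e", "require(process.argv[1])", path.resolve(bundle)],
    { encoding: "utf8", env },
  );
  return { bundle, ok: result.status === 0, stderr: result.stderr ?? "" };
}

// Recursively collect files named index.js under `root` (assumed output layout).
export function findBundles(root: string): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) found.push(...findBundles(full));
    else if (entry.name === "index.js") found.push(full);
  }
  return found;
}

// Returns only the bundles that failed to load.
export function smokeTest(root: string, layerNodeModules?: string): LoadResult[] {
  return findBundles(root)
    .map((bundle) => loadBundle(bundle, layerNodeModules))
    .filter((r) => !r.ok);
}
```

Loading a handler also runs its top-level code, so a check like this can fail for unrelated reasons such as missing environment variables. Treat it as a coarse gate rather than a precise one.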
Discovery
The incident surfaced through an unrelated bug. We were fixing verification emails that pointed to the wrong URL — a separate task entirely. That fix pulled in @kitajs/html, another ESM-only package, to replace react-email in a Cognito trigger.
The @kitajs/html import broke the Turbo cache for that handler. Fresh build. esbuild couldn’t resolve @kitajs/html/jsx-runtime. That was the first honest error we’d seen.
Investigating that one handler led us to check the others. CloudWatch confirmed the blast radius: every Lambda deployed via OpenTofu that used @middy/core was crashing on cold start. 98 out of 160 handlers.
Timeline:
| Time | Event |
|---|---|
| 16:10 | Registration flow tested — customMessage trigger crashes |
| 16:27 | CloudWatch confirms: Cannot find module @middy/core |
| 17:08 | Blast radius confirmed: 98/160 handlers broken |
| 18:34 | PR #1 merged (Layer approach) — fails in staging |
| 19:00 | PR #2 merged (noExternal approach) — all Lambdas cold-start clean |
| 19:15 | Staging smoke tests pass, production promotion triggered |
Three hours from discovery to production fix. The incident had been silently brewing for eight days.
The Fix: Two Attempts
Attempt 1: Add ESM Packages to the Lambda Layer
The instinct was obvious: put @middy/core in the Lambda Layer alongside @aws-sdk, declare it as an external.
This failed immediately at runtime:
Error [ERR_REQUIRE_ESM]: require() of ES Module not supported.
Instead change the require to a dynamic import() which is available in all CommonJS modules.
On the Node.js runtime we were running, CJS handlers can’t require() ESM modules, as the error shows. The Layer approach only works for CJS-compatible packages. This was a dead end.
Attempt 2: Force-Bundle via noExternal ✅
The root cause was tsup’s default behaviour: it auto-externalizes everything in node_modules. For CJS-compatible packages, this is fine — they resolve at runtime from the Layer. For ESM-only packages, it’s silent failure.
The fix was explicit: tell tsup to bundle specific packages instead of externalizing them.
// packages/backend/shared/tsup.base.ts
import type { Options } from "tsup";

export const baseConfig: Options = {
  format: "cjs",
  noExternal: [
    "@middy/core",
    "@middy/http-json-body-parser",
    "@middy/http-error-handler",
    "@middy/input-output-logger",
    // Any ESM-only package must be listed here
  ],
  // esbuild inlines the ESM code and converts it to CJS
  // This is what esbuild is good at; it just needs to be told
};
esbuild handles the ESM-to-CJS conversion well when you explicitly tell it to bundle the package. The problem was never esbuild’s capability. It was tsup’s default of treating all node_modules as externals, and the fact that an externalized import produces no warning at build time, whether or not it can be resolved at runtime.
The CI Safety Net
Fixing the immediate problem was three hours of work. Making sure it never happens again was another hour. We enabled esbuild’s metafile option, which outputs a JSON file listing every dependency and whether it was bundled or left external.
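// tsup.base.ts
export const baseConfig: Options = {
  // ...
  // Assumption: tsup's own metafile flag is what writes metafile-<format>.json
  // into the output directory; verify this against your tsup version.
  metafile: true,
  esbuildOptions(options) {
    // Ask esbuild to record what it bundled and what it left external
    options.metafile = true;
  },
};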
Then a CI script that reads every metafile and validates that every external import is either a Node.js builtin or a package in our Lambda Layer:
#!/usr/bin/env bash
# .github/scripts/validate-bundle-deps.sh
set -euo pipefail  # a failing node call must fail the check, not pass it silently

LAYER_PACKAGES="@aws-sdk @sentry electrodb pino zod"
ERRORS=0

for metafile in $(find packages/backend -name "metafile-*.json"); do
  # Extract all external imports from esbuild's dependency graph.
  # In the metafile, each output's "imports" is an array of { path, kind, external }.
  externals=$(node -e "
    const meta = require(require('path').resolve(process.argv[1]));
    const exts = new Set();
    for (const info of Object.values(meta.outputs)) {
      for (const imp of info.imports || []) {
        if (imp.external) exts.add(imp.path);
      }
    }
    console.log([...exts].join('\n'));
  " "$metafile")

  while IFS= read -r ext; do
    [[ -z "$ext" ]] && continue
    # Allow Node builtins (exact name or subpath, so "fs-extra" is not mistaken for "fs")
    [[ "$ext" =~ ^(node:.+|(fs|path|crypto|util|stream|events|http|https|url|os|child_process)(/.*)?)$ ]] && continue
    # Allow Layer packages (exact name, or anything under that package or scope)
    for layer_pkg in $LAYER_PACKAGES; do
      [[ "$ext" == "$layer_pkg" || "$ext" == "$layer_pkg"/* ]] && continue 2
    done
    echo "ERROR: Unexpected external '$ext' in $metafile"
    ERRORS=$((ERRORS + 1))
  done <<< "$externals"
done

if [[ $ERRORS -gt 0 ]]; then
  echo "Found $ERRORS unexpected externals. These packages must be bundled (add to noExternal) or added to the Lambda Layer."
  exit 1
fi
169 handlers validated on every PR and deploy. This script would have caught the Middy issue on the very first PR that shipped a broken bundle — days before we discovered it manually.
The key insight: esbuild already knows exactly what it bundled and what it left external. The metafile is a complete dependency graph. No grep heuristics, no regex parsing of bundle output, no false positives. You’re reading esbuild’s own accounting.
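The same check can be written as a typed function, which makes the matching rules easier to unit test. This is a sketch rather than the script we run: `unexpectedExternals` and `packageName` are hypothetical names, and the `Metafile` interface covers only the slice of esbuild's metafile that the check reads.

```typescript
import { builtinModules } from "node:module";

// Minimal slice of esbuild's metafile: each output lists its imports, and
// imports that were left unbundled carry `external: true`.
interface Metafile {
  outputs: Record<string, { imports: { path: string; external?: boolean }[] }>;
}

// Reduce an import specifier to its package name:
// "@aws-sdk/client-s3/foo" -> "@aws-sdk/client-s3", "pino/file" -> "pino".
export function packageName(specifier: string): string {
  const parts = specifier.split("/");
  return parts.slice(0, specifier.startsWith("@") ? 2 : 1).join("/");
}

// Returns every external import that is neither a Node.js builtin nor covered
// by the allow-list of Layer packages.
export function unexpectedExternals(meta: Metafile, allowed: string[]): string[] {
  const builtins = new Set(builtinModules);
  const unexpected = new Set<string>();
  for (const output of Object.values(meta.outputs)) {
    for (const imp of output.imports) {
      if (!imp.external) continue;
      if (imp.path.startsWith("node:")) continue;
      const name = packageName(imp.path);
      if (builtins.has(name)) continue;
      // An allow-list entry matches a package exactly, or a whole scope when
      // it is written as a bare scope such as "@aws-sdk".
      const ok = allowed.some((a) => name === a || name.startsWith(a + "/"));
      if (!ok) unexpected.add(imp.path);
    }
  }
  return Array.from(unexpected).sort();
}
```

Matching on the package name rather than a string prefix means an entry like `zod` does not accidentally allow an unrelated package such as `zod-to-json-schema`.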
Lessons
1. esbuild’s silence is dangerous. It doesn’t distinguish between “I externalized this because you asked” and “I couldn’t bundle this so I gave up.” Both look identical in the output. Both produce exit code 0.
2. CJS can’t require() ESM. Putting ESM-only packages in a Lambda Layer doesn’t work. They must be bundled inline. esbuild handles the ESM-to-CJS conversion — it just needs to be told explicitly.
3. Build caches hide build breakage. If your CI only builds changed files and caches the rest, a broken build config can ship for days before a cache miss reveals it. We learned this lesson differently with the 502s — now it bit us from the other side.
4. Warm instances hide runtime breakage. Lambda cold starts are infrequent enough that broken deploys can survive for days on warm instances. Your monitoring shows intermittent errors, not a systemic outage.
5. Assert on build outputs, not just build success. exit 0 from the build tool doesn’t mean the output is correct. Scan the artifacts. The build succeeded — the bundle was wrong.
6. Metafile is your friend. ~65 lines of bash + inline Node.js. Validates 169 handlers. Catches phantom externals before they reach production. Trivial to add, impossible to justify not having after this incident.
The Uncomfortable Question
Check your tsup or esbuild config. Is format: 'cjs'? Are you relying on node_modules auto-externalization?
Now take the packages pinned in your package-lock.json or pnpm-lock.yaml and check whether any of them recently added "type": "module" to their package.json. Middy did it. Chalk did it years ago. More packages are going ESM-only every month.
Your build tool won’t tell you when it happens. Your tests won’t catch it. Your CI will stay green. Your Lambdas will keep running — until they don’t.
The fix takes an hour. The metafile validation takes another hour. The incident that finds you first takes longer.
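That audit can be scripted. The sketch below is a heuristic, not a definitive test: it flags a package as ESM-only when its package.json declares "type": "module" and its "exports" field offers no "require" condition. The function names and the `nodeModules` path are assumptions for illustration, and packages that ship CJS through other means would be misclassified.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Heuristic: "type": "module" with no "require" condition anywhere in "exports".
export function looksEsmOnly(pkgJson: unknown): boolean {
  if (typeof pkgJson !== "object" || pkgJson === null) return false;
  const pkg = pkgJson as { type?: unknown; exports?: unknown };
  if (pkg.type !== "module") return false;
  return !hasRequireCondition(pkg.exports);
}

// Walk the (possibly nested) "exports" object looking for a "require" key.
function hasRequireCondition(node: unknown): boolean {
  if (typeof node !== "object" || node === null) return false;
  return Object.entries(node).some(
    ([key, value]) => key === "require" || hasRequireCondition(value),
  );
}

// Check the named dependencies under an assumed node_modules directory and
// return the ones that look ESM-only. Packages that are not installed are skipped.
export function findEsmOnly(nodeModules: string, deps: string[]): string[] {
  return deps.filter((dep) => {
    const file = path.join(nodeModules, dep, "package.json");
    if (!fs.existsSync(file)) return false;
    return looksEsmOnly(JSON.parse(fs.readFileSync(file, "utf8")));
  });
}
```

Anything this returns is a candidate for noExternal, to be confirmed by hand before it is added.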
This post is part of an ongoing series about building a startup’s engineering platform. The Turbo cache 502 post covers an earlier encounter with our build caching setup, and Road to Continuous Delivery covers the pipeline evolution that set the stage for this incident.
The Invisible Bug That Broke 98 Backend Services — And Hid For Eight Days
A routine library update silently broke most of our backend services. Three separate systems conspired to hide the damage for over a week. Here’s how an unrelated bug report led us to discover that 98 out of 160 services had been quietly failing — and nobody noticed.
Setting the Scene
Our platform runs on about 160 small backend services (called Lambda functions) hosted on AWS. Each one handles a specific job: processing a registration, sending an email, running a payment check, handling a webhook.
These services are written in TypeScript and “bundled” before deployment — a process that packages all the code and dependencies into a single file that AWS can run. Think of it like compiling a recipe book where every ingredient is included in the package, so the kitchen doesn’t need to stock anything extra.
Eleven days after our production launch, we migrated how these services get deployed. The new system pre-builds everything and uploads it to cloud storage. Faster, more predictable, cleaner. What could go wrong?
The Silent Break
Around the same time, a routine dependency update pulled in a new major version of one of our key libraries — a popular middleware called Middy. This newer version uses a different JavaScript module format (ESM) that many libraries are adopting.
The problem: by default, our build tool leaves this dependency out of the package instead of converting it to the older format our services use. It raises no error. It silently skips the dependency and moves on. The build succeeds. The output looks normal. But the packaged service is missing a critical piece.
When AWS tries to start that service, it crashes immediately: “Cannot find module.” Like shipping a car without an engine — it looks complete until you turn the key.
98 out of 160 services were affected. Every one that used this middleware.
Three Systems That Hid the Problem
What made this particularly nasty is that three unrelated systems worked together to mask the damage.
The build cache served old, working versions. Our build system caches previous builds to save time. If a service’s source code hasn’t changed, it serves the cached version from before the break. Most services never got rebuilt — they kept running on old, working packages. The broken configuration was sitting there like a time bomb, detonating only when a service happened to get a fresh build.
Running services kept running. AWS keeps services “warm” for about 15 minutes after each use. A broken package only crashes when a service starts fresh — a “cold start.” Warm services from before the broken deployment kept handling requests normally. The errors looked intermittent, not systemic.
All automated checks passed. The build tool said success. The type checker said success. The tests said success. None of them actually verified that the final packaged output contained everything it needed. Every quality gate was green for a broken product.
How We Found It
An unrelated bug report broke the spell. We were fixing verification emails that pointed to the wrong URL — a completely separate issue. That fix happened to add a new dependency that also used the newer module format.
This time, the build cache couldn’t help. The service needed a fresh build, and the fresh build failed visibly for the first time. Investigating that one failure led us to check all the others. The damage was far worse than one service.
The timeline from discovery to fix: three hours.
We found the problem at 4:10pm. By 4:27pm we’d confirmed the root cause. By 5:08pm we understood the blast radius — 98 out of 160 services. Our first fix attempt failed (more on that in a moment). The second fix worked. By 7:15pm, all services were healthy and promoted to production.
Two Fix Attempts
First attempt: put the library in a shared location. Our services already share large dependencies through a common package (a “Lambda Layer”). The obvious fix was to add Middy there too. This failed — the older module format our services use literally cannot load newer-format modules from a shared location. A fundamental incompatibility, not a configuration issue.
Second attempt: force the build tool to include it. Instead of treating Middy as an external dependency, we told the build tool to pull it directly into each service’s package and convert it to the older format during bundling. The build tool is perfectly capable of this conversion — it just needs to be explicitly told to do it. This worked immediately.
Making Sure It Never Happens Again
Fixing the immediate problem took three hours. Preventing it from recurring took one more.
Our build tool can produce a detailed manifest — a complete list of what it included in each package and what it left out. We wrote a validation script that reads this manifest for all 169 services on every code change and deployment. It checks that everything left out of the package is either a built-in system module or something we’ve deliberately placed in the shared layer.
If an unexpected dependency gets left out — for any reason — the build fails immediately with a clear error message. No silent skipping. No hoping someone notices.
This script would have caught the Middy issue on the very first deployment that shipped a broken package — days before we discovered it by accident.
What This Taught Us
Build tools can lie by omission. Our build tool didn’t fail — it just silently produced incomplete output. “Build succeeded” doesn’t mean “build is correct.” This distinction matters enormously.
Caches hide breakage. If your build system caches old results, a broken configuration can ship for days before anyone builds fresh. The cache isn’t wrong — it’s faithfully serving old results. But it’s masking that new builds are broken. We’d seen caching issues before, but from the opposite direction.
Running systems mask broken deployments. Services that are already running don’t re-check their packages. A broken deployment can coast on healthy instances for days before enough fresh starts reveal the problem.
Verify the output, not just the process. Every quality check in our pipeline verified a step in the process (does it compile? do tests pass? does the build tool exit cleanly?) — but none verified the actual output. Adding that one check closed the gap entirely.
The Broader Pattern
This isn’t unique to our stack. Any team that bundles JavaScript services, uses build caching, and deploys to serverless platforms has this exact exposure. Libraries are moving to the newer module format every month. Build tools handle it inconsistently. The failure mode is always the same: silent, delayed, and discovered by accident.
The fix — validating build outputs against an expected dependency list — took about 65 lines of code. It’s the kind of thing that feels unnecessary until the day it would have saved you a week of invisible damage.
This post is part of an ongoing series about building a startup’s engineering platform. For more on the build caching challenges we’ve faced, see the Turbo cache post. For context on how our deployment pipeline evolved, see Road to Continuous Delivery.