From Click-and-Pray to Merge-and-Forget: Our Road to Continuous Delivery
The Starting Point: Four Repos, Four Different Stories
When I audited our CI/CD setup earlier this year, I expected some inconsistency. We have four independent git repos covering different operational concerns, each with its own GitHub Actions workflows. What I found was four completely different deployment philosophies coexisting under one roof.
Our main platform repo - a monorepo with multiple React/TypeScript frontends, an Astro marketing site, and an AWS CDK backend - was the most mature. Staging auto-deployed when you pushed to main. Production required you to manually create a GitHub release, then wait for an environment approval gate. It used affected-project detection to avoid deploying unchanged apps, which was clever. Integration tests gated staging but not production. Frontends went to S3 + CloudFront. Config lived in SSM Parameter Store. Respectable, if incomplete.
Our DevOps repo was, ironically, the gold standard. True continuous delivery - push to main, everything deploys, no human in the loop. It managed our self-hosted runner infrastructure and it practiced what it preached.
The operations module was manual-dispatch-only with dry-run as the default. You had to explicitly opt into doing something. Production required environment reviewer approval. Cautious, but intentional.
And then the organisation repo - the one that manages the entire AWS Organization structure. Intentionally manual. When a bad deploy there can reassign account ownership across your whole cloud estate, “click the button and pray” is actually the responsible choice.
So three of the four repos had deployment strategies that made sense for their blast radius. But the platform repo - the one shipping user-facing features every day - had the biggest gap between what it needed and what it had.
The Discovery That Kept Me Up at Night
While tracing the CDK deploy configuration, I found something that made my stomach drop.
The CDK_DEPLOY_ACCOUNT environment variable in the production workflow was pointing to the staging account.
Read that again. Production workloads were deploying to the staging account. The Lambdas that real users were hitting, the API Gateways routing real traffic - all of it was being provisioned in the wrong AWS account. Our “production” was staging wearing a fake moustache.
This is the kind of bug that hides in plain sight because everything works. The application runs. Users can log in. Data gets saved. You only notice when you check the AWS console and wonder why your staging account has suspiciously production-looking resources in it, or when you look at the bill and notice staging costs more than production.
The fix was a one-liner - change an account ID in a workflow file. But the implications were sobering. We had been operating in production for months with no account-level isolation between environments. If someone had torn down staging for maintenance, production would have gone with it.
This was the moment I decided: we need a pipeline we can actually trust, and “trust” means the machine does it the same way every time, with no room for a mistyped account ID to lurk undetected for months.
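One way to stop a mistyped account ID from lurking is to have the pipeline compare the deploy account against the account expected for the target environment before anything is provisioned. The following is a minimal sketch of that idea, not the check we actually run: the account IDs, the `EXPECTED_ACCOUNTS` map, and the `checkDeployAccount` helper are all hypothetical.

```typescript
type Environment = "staging" | "production";

// Hypothetical account IDs; real values would live in a reviewed config file.
const EXPECTED_ACCOUNTS: Record<Environment, string> = {
  staging: "111111111111",
  production: "222222222222",
};

// Returns an error message, or null when the account matches the environment.
function checkDeployAccount(
  env: Environment,
  deployAccount: string | undefined
): string | null {
  if (!deployAccount) {
    return `CDK_DEPLOY_ACCOUNT is not set for ${env}`;
  }
  const expected = EXPECTED_ACCOUNTS[env];
  if (deployAccount !== expected) {
    return `CDK_DEPLOY_ACCOUNT ${deployAccount} does not match the ${env} account ${expected}`;
  }
  return null;
}
```

A deploy job could call this with the value of CDK_DEPLOY_ACCOUNT and fail immediately on a non-null result, so a production workflow pointed at the staging account stops before it touches anything.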
Phase 1: Split Build from Deploy
The first thing we tackled was CDK performance. Our CDK synth (the step that generates CloudFormation templates from TypeScript) was running inside the deploy job. This meant if the deploy failed halfway through and you retried, it would re-synth everything from scratch. On a monorepo with six CDK domains, synth alone took 4-5 minutes.
We split the pipeline into two stages: synth and deploy. The synth job runs CDK against all domains, packages the resulting cdk.out directories into a tarball, and uploads it to an S3 artifacts bucket. The deploy job downloads that tarball and runs cdk deploy against the pre-synthesised output.
This gave us two things: faster retries (no re-synth on deploy failure) and artifact immutability. The CloudFormation templates that get deployed are exactly the ones that were synthesised - no chance of a source change sneaking in between synth and deploy.
We also parallelised the domain deploys. The shared-core domain deploys first (it contains cross-cutting resources like Cognito user pools), then the remaining five domains deploy in parallel. This cut backend deploy time from 15+ minutes to about 6.
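The "shared-core first, everything else in parallel" ordering can be derived from a dependency map rather than hard-coded. This is a rough sketch under assumptions: apart from shared-core, the domain names are placeholders, and `planDeployWaves` is a hypothetical helper rather than anything from our workflows.

```typescript
// Groups domains into waves. A domain joins the first wave in which all of
// its dependencies have already been deployed; domains in the same wave have
// no ordering constraint between them and can be deployed in parallel.
function planDeployWaves(dependsOn: Record<string, string[]>): string[][] {
  let remaining = Object.keys(dependsOn).sort();
  const done: string[] = [];
  const waves: string[][] = [];

  while (remaining.length > 0) {
    const wave = remaining.filter((domain) =>
      dependsOn[domain].every((dep) => done.indexOf(dep) !== -1)
    );
    if (wave.length === 0) {
      throw new Error("dependency cycle or unknown dependency");
    }
    done.push(...wave);
    remaining = remaining.filter((domain) => wave.indexOf(domain) === -1);
    waves.push(wave);
  }
  return waves;
}

// Placeholder domain names; only shared-core is named in this post.
const waves = planDeployWaves({
  "shared-core": [],
  "domain-a": ["shared-core"],
  "domain-b": ["shared-core"],
  "domain-c": ["shared-core"],
  "domain-d": ["shared-core"],
  "domain-e": ["shared-core"],
});
```

With this input, the first wave contains only shared-core and the second contains the other five domains. In a GitHub Actions workflow the same shape is usually expressed as one job followed by a matrix job that depends on it.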
Phase 2: Smoke Tests That Actually Smoke
Having a pipeline that deploys fast is useless if you cannot tell whether the deployment worked. We had zero post-deployment verification. The pipeline said “success” if the AWS API calls completed without errors. Whether the application actually served traffic was anyone’s guess.
We added smoke tests - a bash script that runs after every deployment. It checks:
- Frontend health: every portal returns HTTP 200
- Config integrity: every portal’s config.json is valid JSON with the required fields (environment, auth config, API base URL)
- API health: authenticated health-check endpoints on all three API services return a healthy status
- Backend availability: unauthenticated endpoints respond with expected status codes
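The config integrity check is the easiest of these to illustrate. Our real check is a bash script; the sketch below shows the same idea in TypeScript, and the field names in `REQUIRED_FIELDS` are guesses at what "environment, auth config, API base URL" might be called rather than our actual schema.

```typescript
// Assumed field names; adjust to whatever the real config.json uses.
const REQUIRED_FIELDS = ["environment", "auth", "apiBaseUrl"];

// Returns a list of problems; an empty list means the config looks usable.
function findConfigProblems(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["config.json is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["config.json is not a JSON object"];
  }
  const config = parsed as Record<string, unknown>;
  return REQUIRED_FIELDS.filter((field) => {
    const value = config[field];
    return value === undefined || value === null || value === "";
  }).map((field) => `missing required field: ${field}`);
}
```

Parsing the body matters more than it might seem. If the CDN is configured to serve index.html for unknown paths, a missing config.json can still come back as HTTP 200, and only the JSON parse reveals that the response is actually HTML.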
The authenticated checks were the interesting part. We created lightweight /health endpoints on each API that verify the full request path: API Gateway receives the request, Lambda cold-starts, the authoriser validates a token, and the handler queries DynamoDB. If any link in that chain is broken, the smoke test catches it.
The results print as a markdown table in the GitHub Actions summary. At a glance you can see which checks passed and which failed, with the exact URLs and status codes. No more “I think the deploy worked, the CDK output looked fine.”
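Rendering the table is a few lines of string formatting. The `CheckResult` shape and the column names below are assumptions for illustration; the only GitHub-specific part is that a step's summary is whatever Markdown gets appended to the file named by the GITHUB_STEP_SUMMARY environment variable.

```typescript
interface CheckResult {
  name: string;
  url: string;
  status: number;
  passed: boolean;
}

// Builds a Markdown table with one row per smoke check.
function renderSummary(results: CheckResult[]): string {
  const header = [
    "| Check | URL | Status | Result |",
    "| --- | --- | --- | --- |",
  ];
  const rows = results.map(
    (r) => `| ${r.name} | ${r.url} | ${r.status} | ${r.passed ? "pass" : "FAIL"} |`
  );
  return header.concat(rows).join("\n");
}

// Hypothetical results for two checks.
const summary = renderSummary([
  { name: "portal-a", url: "https://portal-a.example.test/", status: 200, passed: true },
  { name: "api-health", url: "https://api.example.test/health", status: 503, passed: false },
]);
```

Appending `summary` to the step summary file is enough to make the table appear on the workflow run page.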
Phase 3: The Auto-Promote Pipeline
With smoke tests proving that staging was healthy, the next step was obvious: if staging passes its smoke tests, automatically promote to production.
The staging workflow now ends with a trigger-promotion job. If smoke tests pass, it tags the commit as staging-last-success and fires a workflow_dispatch event to the promote-to-production workflow, passing along the list of affected projects and a generated changelog.
The production promotion workflow is where the “Build Once, Deploy Everywhere” (BODE) philosophy really shines. For frontends, production does not rebuild the applications. It downloads the exact bundles that were deployed to staging - the same JavaScript, the same CSS, the same assets. The only thing that changes is config.json, which gets regenerated with production-specific values (API URLs, Cognito pool IDs, feature flag keys) pulled from production SSM parameters at deploy time.
This is a critical distinction. If you rebuild for production, you are deploying a different artifact than the one you tested in staging. Different Node.js version on the runner, different dependency resolution timing, different environment variable values at build time - any of these could produce a subtly different bundle. By reusing the staging bundle, the thing you tested is exactly the thing that reaches users.
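The config regeneration step is what makes bundle reuse possible, so here is a sketch of it. The parameter paths, the `RuntimeConfig` fields, and the `buildRuntimeConfig` helper are all hypothetical, and the parameter values are passed in as a plain map so the example does not depend on how they are fetched from SSM.

```typescript
// Assumed shape of config.json; the real file may use different field names.
interface RuntimeConfig {
  environment: string;
  apiBaseUrl: string;
  cognitoUserPoolId: string;
  featureFlagKey: string;
}

// Builds the environment-specific config from already-fetched parameter values.
function buildRuntimeConfig(
  environment: string,
  params: Record<string, string>
): RuntimeConfig {
  const prefix = `/${environment}/portal`; // hypothetical parameter path layout
  const read = (name: string): string => {
    const value = params[`${prefix}/${name}`];
    if (value === undefined || value === "") {
      throw new Error(`missing SSM parameter ${prefix}/${name}`);
    }
    return value;
  };
  return {
    environment,
    apiBaseUrl: read("api-base-url"),
    cognitoUserPoolId: read("cognito-user-pool-id"),
    featureFlagKey: read("feature-flag-key"),
  };
}
```

Failing loudly on a missing parameter is deliberate in this sketch: a config.json with a blank API URL would deploy without error and break every portal at runtime.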
For the backend, we cannot reuse the staging CDK output directly because CDK templates contain account-specific references. But we still benefit from the synth/deploy split - the production synth runs against the same source code that was already validated in staging.
The full flow now looks like this:
- Developer merges PR to main
- Staging workflow detects affected projects
- Backend: CDK synth, package to S3, deploy, create Sentry release
- Frontends: Vite build, deploy to S3, invalidate CloudFront, create Sentry release with source maps
- Smoke tests run against staging
- If green: tag commit, trigger production promotion
- Production downloads staging frontend bundles, generates production config
- Backend: CDK synth for production account, deploy
- Frontends: deploy staging bundles with production config
- Smoke tests run against production
- If green: tag as production-last-success, post changelog to team chat
From merge to production: about 15 minutes for a frontend-only change, 20-25 for backend changes. No human intervention. No buttons to click. No prayers required.
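The two "if green" steps in that flow share the same decision logic, which might look something like the sketch below. The tag names come from the flow above; the `SmokeResult` and `GateDecision` types and the `evaluateGate` function are illustrative assumptions.

```typescript
type Stage = "staging" | "production";

interface SmokeResult {
  name: string;
  passed: boolean;
}

interface GateDecision {
  promote: boolean;      // true only when staging is green
  tag: string | null;    // tag to apply to the commit, if any
  failed: string[];      // names of failing checks, for the summary
}

function evaluateGate(stage: Stage, results: SmokeResult[]): GateDecision {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  // An empty result set counts as a failure: no evidence is not a pass.
  const green = results.length > 0 && failed.length === 0;
  return {
    promote: green && stage === "staging",
    tag: green ? `${stage}-last-success` : null,
    failed,
  };
}
```

Treating "no checks ran" as red is a design choice worth keeping: a smoke test step that silently does nothing would otherwise promote every build.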
Phase 4: Observability - Know What You Shipped
A pipeline that deploys automatically needs observability that keeps up. We integrated Sentry release tracking at every stage.
Each deployment - staging and production, backend and frontend - creates a Sentry release tagged with the git SHA. Frontend releases include source map uploads so that production stack traces show original TypeScript line numbers instead of minified gibberish. Backend releases associate commits so you can trace any error to the exact code change that introduced it.
The Sentry integration runs as a composite GitHub Action shared across all deploy jobs. It installs the Sentry CLI, creates the release, associates commits, uploads source maps if a build path is provided, and marks the deploy with the target environment. When something breaks in production, we can see not just the error, but which release introduced it and which commit to blame.
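The sequence the composite action runs can be written down as a list of CLI invocations. This sketch is an assumption about what those invocations look like: the subcommand spellings match the sentry-cli documentation as I understand it, but they have changed between CLI versions, so check them against `sentry-cli --help` for the version you install. The `SentryReleaseInput` type and `sentryCommands` function are hypothetical.

```typescript
interface SentryReleaseInput {
  release: string;        // the git SHA
  environment: string;    // "staging" or "production"
  sourceMapPath?: string; // only set for frontend builds
}

// Returns the commands as argv arrays, in the order they would run.
function sentryCommands(input: SentryReleaseInput): string[][] {
  const commands: string[][] = [
    ["sentry-cli", "releases", "new", input.release],
    ["sentry-cli", "releases", "set-commits", input.release, "--auto"],
  ];
  if (input.sourceMapPath) {
    // Assumed sentry-cli 2.x spelling; older versions used a different subcommand.
    commands.push([
      "sentry-cli", "sourcemaps", "upload",
      "--release", input.release, input.sourceMapPath,
    ]);
  }
  commands.push([
    "sentry-cli", "releases", "deploys", input.release, "new",
    "-e", input.environment,
  ]);
  return commands;
}
```

Backend jobs would omit `sourceMapPath` and get three commands; frontend jobs would pass their build output directory and get four.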
The Stuff That Broke Along the Way
This was not a clean, linear journey. Some highlights from the commit log:
The gh CLI didn’t work for us for cross-workflow dispatch. Our initial trigger-promotion step used gh workflow run to kick off the production workflow. It failed silently because of a permissions problem. We replaced it with a raw curl call to the GitHub API. Sometimes the low-level tool is the reliable one.
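For reference, the raw API call is a POST to GitHub's "create a workflow dispatch event" endpoint. The sketch below builds that request in TypeScript rather than curl. The endpoint path and the `ref`/`inputs` body are part of the GitHub REST API; the input names (`affected_projects`, `changelog`), the workflow file name, and the repository coordinates are assumptions.

```typescript
interface PromotionInputs {
  owner: string;
  repo: string;
  workflowFile: string; // e.g. the production promotion workflow's file name
  ref: string;
  token: string;
  affectedProjects: string[];
  changelog: string;
}

interface HttpRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildDispatchRequest(p: PromotionInputs): HttpRequest {
  return {
    url: `https://api.github.com/repos/${p.owner}/${p.repo}/actions/workflows/${p.workflowFile}/dispatches`,
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${p.token}`,
    },
    // workflow_dispatch inputs are strings, so the project list is joined.
    body: JSON.stringify({
      ref: p.ref,
      inputs: {
        affected_projects: p.affectedProjects.join(","),
        changelog: p.changelog,
      },
    }),
  };
}
```

A successful dispatch returns 204 with no body, so whatever sends the request should check the status code explicitly. That is what turns a silent permissions failure into a red job.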
Frontend deploys need to be blocked on backend failures. We had a window where the frontend deploy would succeed even if the backend CDK deploy had failed. Users would land on a freshly deployed frontend that pointed at a broken backend. We added explicit dependency conditions: frontend jobs check needs.deploy-backend.result before proceeding.
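The condition itself lives in workflow YAML, but its logic is simple enough to state as a function. In this sketch the four result values are the ones GitHub Actions reports for a job in `needs.<job>.result`; the `frontendAffected` flag and the decision to treat `skipped` as safe are assumptions about how such a gate might be set up.

```typescript
// The values GitHub Actions reports in needs.<job>.result.
type JobResult = "success" | "failure" | "cancelled" | "skipped";

function shouldDeployFrontend(
  backendResult: JobResult,
  frontendAffected: boolean
): boolean {
  if (!frontendAffected) {
    return false;
  }
  // Assumption: "skipped" means the backend was not affected by this change,
  // so the currently deployed backend is still the one the frontend expects.
  return backendResult === "success" || backendResult === "skipped";
}
```

The important cases are `failure` and `cancelled`: in both, the frontend stays on its previous version instead of pointing at a half-deployed backend.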
Sentry tokens in GitHub Actions need the env context, not secrets. We had step-level if conditions checking secrets.SENTRY_AUTH_TOKEN. GitHub’s documentation says secrets cannot be referenced directly in if conditionals, so the check never worked as a presence test. We had to map the token to an env variable first and check that instead.
Concurrency groups matter. Two rapid merges to main would trigger two staging deploys that would stomp on each other. We added cancel-in-progress: true for staging (latest wins) and cancel-in-progress: false for production (let it finish). We also replaced always() conditions with !cancelled() to prevent zombie workflow runs that kept executing after cancellation.
Where We Are Now
The before-and-after tells the story:
Before:
- Production deploy: create a GitHub release, wait for approval, watch the logs, check the site manually
- Time from merge to production: “whenever someone remembers to do it” (often days)
- Post-deploy verification: open the site in a browser, click around, hope for the best
- Confidence level: low (was production even hitting the right AWS account?)
After:
- Production deploy: merge your PR
- Time from merge to production: 15-25 minutes, fully automated
- Post-deploy verification: automated smoke tests checking frontends, configs, and authenticated API endpoints
- Confidence level: high (same artifact tested in staging, tracked in Sentry, verified by automation)
The pipeline is not done. We want to add Playwright E2E tests as a staging gate, canary deployments for the backend, and automated rollback when production smoke tests fail. But the foundation is solid: every merge to main reaches production through a deterministic, observable, verified path.
The best part? I have not manually deployed to production in weeks. I merge my PR, go make coffee, and by the time I sit back down, it is live. The machine does its job. I do mine.
That is what continuous delivery is supposed to feel like.
How We Got From 'Someone Remember to Deploy This' to 'Just Merge the PR'
Four Repos, Four Different Philosophies
When I audited our deployment setup earlier this year, I expected some inconsistency. We have four separate code repositories, each managing different parts of the platform.
What I found was four completely different approaches to getting code into production.
Our main platform repo — the one with all the web applications and backend services — was the most mature. Staging deployed automatically when you pushed to main. Production required you to manually create a release on GitHub, wait for someone to approve it, then watch the process run. Reasonable, if a bit ceremonial.
Our infrastructure repo was, ironically, the gold standard. Push to main, everything deploys, no human involvement required. The repo that manages our build infrastructure was actually practicing what continuous delivery preaches.
A third repo was manual-dispatch-only, with “dry run” (show me what would happen, don’t actually do it) as the default. You had to deliberately opt in to making changes. Cautious, but intentional.
And the fourth — the one that manages our entire cloud account structure — was intentionally manual. When a bad deployment can reassign ownership of cloud accounts across your whole organisation, “click the button carefully” is the responsible approach.
Three of the four made sense. The main platform repo — the one shipping user-facing features every day — had the biggest gap between what it needed and what it had.
The Bug That Kept Me Up at Night
While checking the deployment configuration, I found something that made my stomach drop.
The environment variable pointing to our production cloud account was set to the staging account value.
Our “production” was staging wearing a fake moustache.
The Lambdas that real users were hitting and the API services routing real traffic had all been provisioned in the staging environment for who knows how long. Everything worked because the staging environment had production-grade resources in it. You’d only notice by looking at the cloud console and wondering why staging had suspiciously busy-looking resources, or by checking the bill.
The fix was a one-liner: change one value in one file. But the implications were uncomfortable. We had been running in production for months with no actual isolation between environments. If staging had been taken down for maintenance, production would have gone with it.
This was the moment I decided: the machine needs to do this the same way every time. Humans make mistakes. Pipelines, if built correctly, don’t.
Making Deploys Faster and More Reliable
The first thing we tackled was our infrastructure deployment process. The tool we use to define cloud resources in code (CDK) has to compile your TypeScript definitions into cloud templates before it can deploy them. This compilation step was happening inside the deployment itself, which meant if anything went wrong halfway through and you had to retry, you’d start from scratch. On a codebase with six separate infrastructure domains, this compilation alone took 4-5 minutes.
We split it into two stages: compile first, save the output, then deploy from the saved output. This gave us faster retries, and more importantly, it meant the templates being deployed were the exact ones we’d reviewed — nothing could change between the compilation and the deployment.
We also parallelised the infrastructure deployments. One domain deploys first (it contains shared resources everything else depends on), then the other five deploy simultaneously. Backend deployment time dropped from 15+ minutes to about 6.
Actually Checking Whether Deploys Worked
A fast pipeline that ships broken software is just failing faster. We had zero automated verification of whether deployments actually worked. The pipeline declared success if the cloud provider accepted the instructions. Whether the application actually ran was anyone’s guess.
We added smoke tests. After every deployment, a script checks:
- Every web application returns a successful response
- Every application’s configuration file contains valid data with the required fields
- Every API service responds correctly, including requests that require authentication
That last one matters. An authenticated request to a health endpoint travels the full path: through the API gateway, into the server function, past the authentication check, and into the database. If any link in that chain is broken, the test catches it.
The results appear as a table in our CI dashboard. Green check per service. No more “I think the deploy worked.”
The Automatic Promotion Pipeline
With smoke tests confirming that staging was healthy, the next step was obvious: if staging passes its checks, promote it to production automatically.
The staging pipeline ends with a promotion trigger. If smoke tests pass, it tags the commit and fires off the production pipeline, passing along which parts of the codebase were affected.
For web applications, the production promotion doesn’t rebuild anything. It takes the exact bundles — the JavaScript, CSS, and assets — that were tested in staging, and puts them in production. The only thing that changes is a small configuration file that gets regenerated with production-specific values (the right API addresses, the right authentication settings) pulled from production’s configuration store.
This is deliberate. If you rebuild for production, you’re deploying something slightly different from what you tested. Different timing, different environment, potentially different output. By deploying the same artifacts, the thing users get is exactly the thing we tested.
For the infrastructure side, we can’t reuse the compiled templates directly because they contain account-specific references. But the same source code that was validated in staging is what gets compiled for production.
The full sequence now looks like this:
- Developer merges a pull request to main
- Automated pipeline detects which parts of the codebase changed
- Infrastructure compiles, packages, and deploys to staging
- Web applications build and deploy to staging
- Smoke tests run against staging
- If everything passes: tag the commit, trigger production
- Production grabs the staging web bundles, generates production configuration
- Infrastructure compiles and deploys to production
- Smoke tests run against production
- If everything passes: tag as confirmed, post a changelog to the team chat
From merge to production: about 15 minutes for web-only changes, 20-25 minutes when backend services are involved. No human intervention.
The Bits That Broke Along the Way
Cross-workflow triggers didn’t work the way we expected. Our initial approach to triggering the production pipeline from staging used a tool that turned out to have silent permission failures. We replaced it with a direct API call. Sometimes the simple approach is more reliable.
Backend failures weren’t blocking frontend deploys. For a while, web applications would successfully deploy to staging even if the backend infrastructure deployment had failed. Users would land on a fresh frontend pointing at a broken backend. We added explicit dependency checks: web deploys wait for backend deploys to succeed first.
Environment variable checks in CI are tricky. We had automated conditions that were supposed to skip a step when a credential wasn’t present. It turned out that checking whether a secret exists in GitHub Actions doesn’t work the way you’d expect: secrets cannot be referenced directly in a condition, so the check never did what we intended. We had to pass the secret to an intermediate variable first and check that.
Concurrent deployments stomped on each other. Two quick merges to main would trigger two staging pipelines running simultaneously, interfering with each other. We set staging to cancel the earlier run in favour of the newer one (latest wins), and production to always run to completion (never cancel mid-deploy). We also fixed a related issue where cancelled runs kept executing steps they should have skipped.
Where We Are Now
Before: deploying to production meant creating a release on GitHub, waiting for approval, watching the logs, manually checking the site, hoping nothing was broken, and also hoping we hadn’t made the same mistake as before where we’d been deploying to the wrong AWS account for months.
After: merge your pull request. Go make coffee. By the time you sit back down, it’s live in production, smoke-tested, and logged in our error tracking system.
The pipeline isn’t finished. We want end-to-end browser tests as a staging gate, gradual rollouts for infrastructure changes, and automatic rollback when production smoke tests fail. But the foundation is solid: every merge reaches production through the same deterministic, verified path.
I haven’t manually deployed to production in weeks. That’s the whole point.