From Click-and-Pray to Merge-and-Forget: Our Road to Continuous Delivery

The Starting Point: Four Repos, Four Different Stories

When I audited our CI/CD setup earlier this year, I expected some inconsistency. We have four independent git repos covering different operational concerns, each with its own GitHub Actions workflows. What I found was four completely different deployment philosophies coexisting under one roof.

Our main platform repo - a monorepo with multiple React/TypeScript frontends, an Astro marketing site, and an AWS CDK backend - was the most mature. Staging auto-deployed when you pushed to main. Production required you to manually create a GitHub release, then wait for an environment approval gate. It used affected-project detection to avoid deploying unchanged apps, which was clever. Integration tests gated staging but not production. Frontends went to S3 + CloudFront. Config lived in SSM Parameter Store. Respectable, if incomplete.

Our DevOps repo was, ironically, the gold standard. True continuous delivery - push to main, everything deploys, no human in the loop. It managed our self-hosted runner infrastructure and it practiced what it preached.

The operations module was manual-dispatch-only with dry-run as the default. You had to explicitly opt into doing something. Production required environment reviewer approval. Cautious, but intentional.

And then the organisation repo - the one that manages the entire AWS Organization structure. Intentionally manual. When a bad deploy there can reassign account ownership across your whole cloud estate, “click the button and pray” is actually the responsible choice.

So three of the four repos had deployment strategies that made sense for their blast radius. But the platform repo - the one shipping user-facing features every day - had the biggest gap between what it needed and what it had.

The Discovery That Kept Me Up at Night

While tracing the CDK deploy configuration, I found something that made my stomach drop.

The CDK_DEPLOY_ACCOUNT environment variable in the production workflow was pointing to the staging account.

Read that again. Production workloads were deploying to the staging account. The Lambdas that real users were hitting, the API Gateways routing real traffic - all of it was being provisioned in the wrong AWS account. Our “production” was staging wearing a fake moustache.

This is the kind of bug that hides in plain sight because everything works. The application runs. Users can log in. Data gets saved. You only notice when you check the AWS console and wonder why your staging account has suspiciously production-looking resources in it, or when you look at the bill and notice staging costs more than production.

The fix was a one-liner - change an account ID in a workflow file. But the implications were sobering. We had been operating in production for months with no account-level isolation between environments. If someone had torn down staging for maintenance, production would have gone with it.

This was the moment I decided: we need a pipeline we can actually trust, and “trust” means the machine does it the same way every time, with no room for a mistyped account ID to lurk undetected for months.
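
One way to make this class of mistake fail loudly is a guard that runs before any deploy and compares the target account with the one expected for the environment. The following is a minimal sketch, not the check from our pipeline: the function name and the lookup table are hypothetical, and the twelve-digit account IDs are placeholders.

```typescript
// Hypothetical guard: refuse to deploy when the target account does not
// match the account expected for the environment. IDs are placeholders.
type DeployEnvironment = "staging" | "production";

const EXPECTED_ACCOUNTS: Record<DeployEnvironment, string> = {
  staging: "111111111111",
  production: "222222222222",
};

function assertDeployAccount(
  environment: DeployEnvironment,
  deployAccount: string | undefined, // e.g. the value of CDK_DEPLOY_ACCOUNT
): string {
  const expected = EXPECTED_ACCOUNTS[environment];
  if (deployAccount === undefined || deployAccount === "") {
    throw new Error(`CDK_DEPLOY_ACCOUNT is not set for ${environment}`);
  }
  if (deployAccount !== expected) {
    throw new Error(
      `Refusing to deploy ${environment} to account ${deployAccount}; expected ${expected}`,
    );
  }
  return deployAccount;
}
```

A check like this, called at the start of a deploy, turns a silent misconfiguration into a failed run on the first attempt.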

Phase 1: Split Build from Deploy

The first thing we tackled was CDK performance. Our CDK synth (the step that generates CloudFormation templates from TypeScript) was running inside the deploy job. This meant if the deploy failed halfway through and you retried, it would re-synth everything from scratch. On a monorepo with six CDK domains, synth alone took 4-5 minutes.

We split the pipeline into two stages: synth and deploy. The synth job runs CDK against all domains, packages the resulting cdk.out directories into a tarball, and uploads it to an S3 artifacts bucket. The deploy job downloads that tarball and runs cdk deploy against the pre-synthesised output.

This gave us two things: faster retries (no re-synth on deploy failure) and artifact immutability. The CloudFormation templates that get deployed are exactly the ones that were synthesised - no chance of a source change sneaking in between synth and deploy.
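
The split can be sketched as two command lists. This is an illustration under stated assumptions rather than our actual workflow: the bucket name, the key layout keyed by git SHA, the one-directory-per-domain layout, and all helper names are hypothetical. The important detail is that the deploy side points `--app` at an already-synthesised cloud assembly directory, so no synthesis happens there.

```typescript
// Sketch only: bucket name, key layout, directory layout and helper names
// are hypothetical. Assumes each domain is its own CDK app directory.
interface SynthArtifact {
  tarball: string; // local file name of the packaged cdk.out directories
  s3Uri: string;   // where the synth job uploads it
}

function synthArtifactFor(bucket: string, gitSha: string): SynthArtifact {
  const tarball = `cdk-out-${gitSha}.tar.gz`;
  return { tarball, s3Uri: `s3://${bucket}/synth/${tarball}` };
}

// Synth job: synthesise every domain once, then package and upload.
function synthJobCommands(domains: string[], artifact: SynthArtifact): string[] {
  const synth = domains.map((d) => `(cd ${d} && npx cdk synth --quiet --output cdk.out)`);
  const outputs = domains.map((d) => `${d}/cdk.out`).join(" ");
  return synth.concat([
    `tar -czf ${artifact.tarball} ${outputs}`,
    `aws s3 cp ${artifact.tarball} ${artifact.s3Uri}`,
  ]);
}

// Deploy job: download the tarball and deploy the pre-synthesised output.
// Pointing --app at a cloud assembly directory skips synthesis entirely.
function deployJobCommands(domains: string[], artifact: SynthArtifact): string[] {
  const fetchArtifact = [
    `aws s3 cp ${artifact.s3Uri} ${artifact.tarball}`,
    `tar -xzf ${artifact.tarball}`,
  ];
  const deploy = domains.map(
    (d) => `(cd ${d} && npx cdk deploy --app cdk.out --all --require-approval never)`,
  );
  return fetchArtifact.concat(deploy);
}
```

Because a retry of the deploy job only repeats the download and the `cdk deploy` calls, the templates it applies are byte-for-byte the ones produced by the synth job.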

We also parallelised the domain deploys. The shared-core domain deploys first (it contains cross-cutting resources like Cognito user pools), then the remaining five domains deploy in parallel. This cut backend deploy time from 15+ minutes to about 6.
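
In a workflow this ordering is normally expressed through job dependencies, but the idea is easier to see in code. The sketch below is illustrative only: `deployDomain` is a stand-in for whatever actually runs the deploy, and every domain name other than shared-core is made up.

```typescript
// `deployDomain` is a placeholder for the real deploy step.
type DeployFn = (domain: string) => Promise<void>;

// Split domains into waves: the foundation domain alone, then everything else.
function deployWaves(domains: string[], foundation: string): string[][] {
  if (domains.indexOf(foundation) === -1) {
    throw new Error(`Unknown foundation domain: ${foundation}`);
  }
  return [[foundation], domains.filter((d) => d !== foundation)];
}

// Run the waves in order; domains inside a wave deploy concurrently.
async function deployAll(waves: string[][], deployDomain: DeployFn): Promise<void> {
  for (const wave of waves) {
    const failures: string[] = [];
    await Promise.all(
      wave.map((d) =>
        deployDomain(d).catch(() => {
          failures.push(d); // let the other domains in the wave finish
        }),
      ),
    );
    if (failures.length > 0) {
      throw new Error(`Deploy failed for: ${failures.join(", ")}`);
    }
  }
}
```

If the first wave fails, the loop throws before any dependent domain is attempted, which mirrors the reason for deploying the shared resources first.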

Phase 2: Smoke Tests That Actually Smoke

Having a pipeline that deploys fast is useless if you cannot tell whether the deployment worked. We had zero post-deployment verification. The pipeline said “success” if the AWS API calls completed without errors. Whether the application actually served traffic was anyone’s guess.

We added smoke tests - a bash script that runs after every deployment. It checks:

  • Frontend health: every portal returns HTTP 200
  • Config integrity: every portal’s config.json is valid JSON with the required fields (environment, auth config, API base URL)
  • API health: authenticated health-check endpoints on all three API services return a healthy status
  • Backend availability: unauthenticated endpoints respond with expected status codes
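
The config integrity check is the easiest of these to show in isolation. Our script is bash; the sketch below expresses the same idea in TypeScript, and the field names (`environment`, `auth`, `apiBaseUrl`) are assumptions about the shape of config.json rather than the real keys.

```typescript
// Hypothetical field names; the real config.json keys may differ.
const REQUIRED_CONFIG_FIELDS = ["environment", "auth", "apiBaseUrl"];

// Returns a list of problems; an empty list means the config passed.
function checkPortalConfig(rawBody: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return ["config.json is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["config.json is not a JSON object"];
  }
  const config = parsed as Record<string, unknown>;
  return REQUIRED_CONFIG_FIELDS
    .filter((field) => config[field] === undefined || config[field] === null || config[field] === "")
    .map((field) => `missing required field: ${field}`);
}
```

Parsing the body matters more than the status code here: depending on how the hosting is configured, a missing file can come back as an HTML page with a 200, and only the JSON parse notices.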

The authenticated checks were the interesting part. We created lightweight /health endpoints on each API that verify the full request path: API Gateway receives the request, Lambda cold-starts, the authoriser validates a token, and the handler queries DynamoDB. If any link in that chain is broken, the smoke test catches it.

The results print as a markdown table in the GitHub Actions summary. At a glance you can see which checks passed and which failed, with the exact URLs and status codes. No more “I think the deploy worked, the CDK output looked fine.”
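
Rendering that table takes very little code. A minimal sketch, assuming a hypothetical result shape and column layout (ours may differ):

```typescript
// Hypothetical result shape for one smoke check.
interface SmokeCheckResult {
  check: string;      // e.g. "Frontend health"
  url: string;        // the exact URL that was requested
  statusCode: number; // HTTP status actually returned
  passed: boolean;
}

// Build a markdown table with one row per check.
function renderSummaryTable(results: SmokeCheckResult[]): string {
  const header = ["| Check | URL | Status | Result |", "| --- | --- | --- | --- |"];
  const rows = results.map(
    (r) => `| ${r.check} | ${r.url} | ${r.statusCode} | ${r.passed ? "pass" : "FAIL"} |`,
  );
  return header.concat(rows).join("\n");
}
```

In a workflow, appending this string to the file named by the GITHUB_STEP_SUMMARY environment variable is what makes it appear on the run's summary page.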

Phase 3: The Auto-Promote Pipeline

With smoke tests proving that staging was healthy, the next step was obvious: if staging passes its smoke tests, automatically promote to production.

The staging workflow now ends with a trigger-promotion job. If smoke tests pass, it tags the commit as staging-last-success and fires a workflow_dispatch event to the promote-to-production workflow, passing along the list of affected projects and a generated changelog.

The production promotion workflow is where the “Build Once, Deploy Everywhere” (BODE) philosophy really shines. For frontends, production does not rebuild the applications. It downloads the exact bundles that were deployed to staging - the same JavaScript, the same CSS, the same assets. The only thing that changes is config.json, which gets regenerated with production-specific values (API URLs, Cognito pool IDs, feature flag keys) pulled from production SSM parameters at deploy time.

This is a critical distinction. If you rebuild for production, you are deploying a different artifact than the one you tested in staging. Different Node.js version on the runner, different dependency resolution timing, different environment variable values at build time - any of these could produce a subtly different bundle. By reusing the staging bundle, the thing you tested is exactly the thing that reaches users.
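
The per-environment part is small enough to sketch. The example below assumes the SSM values have already been fetched into a plain map; the parameter paths, the config shape, and the function name are all hypothetical.

```typescript
// Hypothetical shape of config.json and hypothetical SSM parameter names.
interface PortalConfig {
  environment: string;
  apiBaseUrl: string;
  auth: { userPoolId: string; clientId: string };
}

// `params` maps SSM parameter names to values fetched for the target environment.
function buildPortalConfig(environment: string, params: Record<string, string>): PortalConfig {
  const read = (key: string): string => {
    const value = params[`/${environment}/portal/${key}`];
    if (value === undefined || value === "") {
      throw new Error(`Missing SSM parameter: /${environment}/portal/${key}`);
    }
    return value;
  };
  return {
    environment,
    apiBaseUrl: read("api-base-url"),
    auth: { userPoolId: read("user-pool-id"), clientId: read("client-id") },
  };
}
```

The bundle never changes between environments; only the output of a function like this does, written next to the bundle as config.json at deploy time.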

For the backend, we cannot reuse the staging CDK output directly because CDK templates contain account-specific references. But we still benefit from the synth/deploy split - the production synth runs against the same source code that was already validated in staging.

The full flow now looks like this:

  1. Developer merges PR to main
  2. Staging workflow detects affected projects
  3. Backend: CDK synth, package to S3, deploy, create Sentry release
  4. Frontends: Vite build, deploy to S3, invalidate CloudFront, create Sentry release with source maps
  5. Smoke tests run against staging
  6. If green: tag commit, trigger production promotion
  7. Production downloads staging frontend bundles, generates production config
  8. Backend: CDK synth for production account, deploy
  9. Frontends: deploy staging bundles with production config
  10. Smoke tests run against production
  11. If green: tag as production-last-success, post changelog to team chat

From merge to production: about 15 minutes for a frontend-only change, 20-25 for backend changes. No human intervention. No buttons to click. No prayers required.

Phase 4: Observability - Know What You Shipped

A pipeline that deploys automatically needs observability that keeps up. We integrated Sentry release tracking at every stage.

Each deployment - staging and production, backend and frontend - creates a Sentry release tagged with the git SHA. Frontend releases include source map uploads so that production stack traces show original TypeScript line numbers instead of minified gibberish. Backend releases associate commits so you can trace any error to the exact code change that introduced it.

The Sentry integration runs as a composite GitHub Action shared across all deploy jobs. It installs the Sentry CLI, creates the release, associates commits, uploads source maps if a build path is provided, and marks the deploy with the target environment. When something breaks in production, we can see not just the error, but which release introduced it and which commit to blame.
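
The sequence the action runs can be sketched as a list of sentry-cli commands. This is an approximation, not the action itself: the option names are illustrative, and exact subcommands and flags vary between sentry-cli versions, so treat the strings as assumptions to check against the installed CLI. Organisation, project, and auth token are expected to come from the environment.

```typescript
// Hypothetical options for one release; names are illustrative.
interface SentryReleaseOptions {
  version: string;        // the git SHA
  environment: string;    // "staging" or "production"
  sourceMapPath?: string; // only set for frontend builds
}

// Build the command sequence: create release, associate commits,
// optionally upload source maps, then mark the deploy.
function sentryReleaseCommands(opts: SentryReleaseOptions): string[] {
  const commands = [
    `sentry-cli releases new ${opts.version}`,
    `sentry-cli releases set-commits ${opts.version} --auto`,
  ];
  if (opts.sourceMapPath) {
    commands.push(`sentry-cli sourcemaps upload --release ${opts.version} ${opts.sourceMapPath}`);
  }
  commands.push(`sentry-cli releases deploys ${opts.version} new -e ${opts.environment}`);
  return commands;
}
```

Backend jobs simply omit `sourceMapPath`, which is how one shared action can serve both kinds of deploy.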

The Stuff That Broke Along the Way

This was not a clean, linear journey. Some highlights from the commit log:

The gh CLI didn’t work for our cross-workflow dispatch. Our initial trigger-promotion step used gh workflow run to kick off the production workflow. It silently failed because of permissions issues. We replaced it with a raw curl to the GitHub API. Sometimes the low-level tool is the reliable one.
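
The raw API call goes to GitHub's workflow-dispatch REST endpoint. Here is the same request sketched in TypeScript rather than curl; the workflow file name and the input names (`affected_projects`, `changelog`) are hypothetical, and the function only builds the request without sending it.

```typescript
// Plain description of the HTTP request; nothing is sent here.
interface DispatchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildPromotionDispatch(
  repo: string,         // "owner/name"
  workflowFile: string, // hypothetical, e.g. "promote-to-production.yml"
  token: string,
  affectedProjects: string[],
  changelog: string,
): DispatchRequest {
  return {
    url: `https://api.github.com/repos/${repo}/actions/workflows/${workflowFile}/dispatches`,
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
      "X-GitHub-Api-Version": "2022-11-28",
    },
    body: JSON.stringify({
      ref: "main",
      // workflow_dispatch inputs are passed as strings
      inputs: { affected_projects: affectedProjects.join(","), changelog },
    }),
  };
}
```

The endpoint answers 204 with an empty body on success, so a caller should check the status code explicitly rather than look for a response payload.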

Frontend deploys need to be blocked on backend failures. We had a window where the frontend deploy would succeed even if the backend CDK deploy had failed. Users would land on a freshly deployed frontend that pointed at a broken backend. We added explicit dependency conditions: frontend jobs check needs.deploy-backend.result before proceeding.
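
The decision itself is a small predicate over the four values a job result can take. The sketch below models it in TypeScript; treating `skipped` as acceptable is an assumption (it covers runs where no backend project was affected), and the function name is made up.

```typescript
// The values GitHub Actions reports for needs.<job>.result.
type JobResult = "success" | "failure" | "cancelled" | "skipped";

// Frontend may proceed only if the backend job succeeded, or was skipped
// because no backend project was affected (an assumption about the pipeline).
function frontendMayDeploy(backendResult: JobResult): boolean {
  return backendResult === "success" || backendResult === "skipped";
}
```

In the workflow the same rule is an if expression on the frontend job comparing needs.deploy-backend.result against those values.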

Sentry tokens in GitHub Actions need the env context, not secrets. We had step-level if conditions checking secrets.SENTRY_AUTH_TOKEN. GitHub Actions does not make the secrets context available in if expressions, so that check could never behave the way we intended. We had to map the token into an env variable first and check that instead.

Concurrency groups matter. Two rapid merges to main would trigger two staging deploys that would stomp on each other. We added cancel-in-progress: true for staging (latest wins) and cancel-in-progress: false for production (let it finish). We also replaced always() conditions with !cancelled(), because always() makes a step run even after the run has been cancelled, which left zombie workflow runs executing after cancellation.

Where We Are Now

The before-and-after tells the story:

Before:

  • Production deploy: create a GitHub release, wait for approval, watch the logs, check the site manually
  • Time from merge to production: “whenever someone remembers to do it” (often days)
  • Post-deploy verification: open the site in a browser, click around, hope for the best
  • Confidence level: low (was production even hitting the right AWS account?)

After:

  • Production deploy: merge your PR
  • Time from merge to production: 15-25 minutes, fully automated
  • Post-deploy verification: automated smoke tests checking frontends, configs, and authenticated API endpoints
  • Confidence level: high (same artifact tested in staging, tracked in Sentry, verified by automation)

The pipeline is not done. We want to add Playwright E2E tests as a staging gate, canary deployments for the backend, and automated rollback when production smoke tests fail. But the foundation is solid: every merge to main reaches production through a deterministic, observable, verified path.

The best part? I have not manually deployed to production in weeks. I merge my PR, go make coffee, and by the time I sit back down, it is live. The machine does its job. I do mine.

That is what continuous delivery is supposed to feel like.