
What Production Taught Us About Self-Hosted Runners


Two Weeks Later, Everything Broke

Two weeks after shipping our self-hosted runner infrastructure, I opened Slack to a wall of failed CI checks. InsufficientInstanceCapacity. Not one job — all of them. Every PR across every repo was stuck, queued behind runners that couldn’t be provisioned.

It wasn’t a one-off. Over the next few days, capacity errors kept appearing at unpredictable intervals. Sometimes runners would provision fine for hours, then suddenly nothing. PRs accumulated. Developers (well, developer — it’s just me) sat waiting. The infrastructure we’d built in a weekend was failing at its one job: running CI.

Building self-hosted runners is a weekend project. Operating them is the real work. Here are the lessons production taught us.

Lesson 1: Your Spot Pool Is Shallower Than You Think

Our original fleet was straightforward: c7i.large spot instances in eu-west-2 (London). One instance type, three availability zones. The fallback chain tried each AZ in sequence, then fell back to on-demand. Simple, clean, worked great in testing.

The problem is that eu-west-2 is one of AWS’s smaller European regions. The spot capacity pool for any single instance type in any single AZ is finite, and when it’s gone, it’s gone. We were fishing in a pond, not the ocean.

The first fix (ENG-1136) was reactive: add more instance types to the fallback chain (c6i.large, c7a.large, m7i.large). We expanded the Step Functions state machine to try every type in every AZ, first on spot and then on-demand: 27 sequential attempts before giving up. It helped, but it was a band-aid on a structural problem.

The real fix was to change the pond entirely. I moved the runner fleet from eu-west-2 to eu-west-1 (Ireland) — AWS’s largest European region, with significantly deeper spot capacity pools and lower interruption rates. At the same time, we expanded from 4 instance types to 10:

| Category | Instance Types |
| --- | --- |
| AMD (lower demand, more available) | c5a.large, c6a.large, m5a.large |
| Older Intel (larger pools) | c5.large, c6i.large, m5.large |
| Current-gen (Intel, plus AMD c7a) | m7i.large, m6i.large, c7i.large, c7a.large |

Ten instance types across three availability zones gives us 30 distinct capacity pools. The AMD types are the secret weapon — they tend to have lower demand and higher availability because most teams default to the latest Intel generation.
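To make the pool arithmetic concrete, here is a minimal sketch that enumerates pools as (instance type, AZ) pairs. The AZ names and the `capacityPools` helper are illustrative assumptions, not taken from the fleet config described here.

```typescript
// The ten instance types listed in the table above.
const instanceTypes: string[] = [
  "c5a.large", "c6a.large", "m5a.large",
  "c5.large", "c6i.large", "m5.large",
  "m7i.large", "m6i.large", "c7i.large", "c7a.large",
];

// Assumed AZ names for eu-west-1; substitute the zones your subnets live in.
const availabilityZones: string[] = ["eu-west-1a", "eu-west-1b", "eu-west-1c"];

interface CapacityPool {
  instanceType: string;
  az: string;
}

// Hypothetical helper: each (type, AZ) pair is a separate spot capacity pool.
function capacityPools(types: string[], azs: string[]): CapacityPool[] {
  const pools: CapacityPool[] = [];
  for (const instanceType of types) {
    for (const az of azs) {
      pools.push({ instanceType, az });
    }
  }
  return pools;
}

const pools = capacityPools(instanceTypes, availabilityZones);
// pools.length is 30; the same helper yields 12 for a four-type fleet
```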

The lesson: region choice and fleet diversity matter more than retry logic. You can build the most sophisticated fallback chain in the world, but if you’re fishing in a shallow pool, you’ll still come up empty.

Lesson 2: Ephemeral Runners Break needs Graphs

This one blindsided us. After the region migration, multi-job workflows with needs dependencies started hanging indefinitely. Single-job workflows worked fine. Staging deploys — which chain build → deploy → smoke-test — were completely blocked.

The root cause is a subtle interaction between ephemeral runners and GitHub’s webhook model:

```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant R1 as Runner 1

    GH->>GH: Workflow starts
    GH->>WH: workflow_job: queued (Job A)
    Note over GH: Job B (needs: A) not yet queued
    WH->>R1: Provision runner
    R1->>GH: Register, pick up Job A
    R1->>R1: Job A completes
    R1->>R1: Self-terminate (ephemeral)
    GH->>GH: Job B becomes runnable
    Note over GH: No webhook fires for Job B
    GH-->>GH: Job B waits forever
```

When a workflow starts, GitHub only queues the jobs that are immediately runnable. Jobs with needs dependencies aren’t queued until their prerequisites complete. Our ephemeral runner picks up Job A, runs it, and self-terminates. When Job B becomes runnable, we saw no new workflow_job: queued webhook for it. The only queued event had fired when the workflow started, for Job A, and the runner provisioned for that event was already gone.

The fix had two parts:

  1. Handle waiting events. GitHub sends workflow_job with action waiting (not queued) when a dependent job becomes runnable. We updated the webhook Lambda to treat waiting as a trigger for runner provisioning.
  2. Job-retry Lambda. A belt-and-suspenders safety net: an EventBridge-triggered Lambda that runs every 60 seconds, polls the GitHub API for queued jobs with no assigned runner, and provisions runners for any orphans it finds.

```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant JR as Job Retry Lambda
    participant R1 as Runner 1
    participant R2 as Runner 2

    GH->>WH: workflow_job: queued (Job A)
    WH->>R1: Provision runner
    R1->>GH: Run Job A, self-terminate
    GH->>WH: workflow_job: waiting (Job B)
    WH->>R2: Provision runner
    R2->>GH: Run Job B
    Note over JR: Every 60s: poll for orphaned jobs
    JR->>GH: GET /actions/runs (queued, no runner?)
    JR-->>R2: Provision if webhook was missed
```
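
The webhook half of the fix can be sketched as a small predicate inside the Lambda. This is an illustration under assumptions: `WorkflowJobEvent` is a hand-written subset of the webhook payload, and `shouldProvision` is a hypothetical name rather than the actual handler.

```typescript
// Hand-written subset of the workflow_job webhook payload used by this sketch.
interface WorkflowJobEvent {
  action: "queued" | "waiting" | "in_progress" | "completed";
  workflow_job: { id: number; labels: string[] };
}

// Actions that trigger provisioning: queued for root jobs, waiting for dependents.
const PROVISIONING_ACTIONS: ReadonlySet<string> = new Set(["queued", "waiting"]);

// Hypothetical predicate: provision only for self-hosted jobs on a trigger action.
function shouldProvision(event: WorkflowJobEvent): boolean {
  const wantsSelfHosted = event.workflow_job.labels.includes("self-hosted");
  return wantsSelfHosted && PROVISIONING_ACTIONS.has(event.action);
}
```

A job that emits both actions would trigger two provisioning attempts with this predicate alone, so a real handler probably wants to deduplicate on `workflow_job.id`.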

The lesson: ephemeral runners and multi-job workflows have a hidden interaction that GitHub’s webhook model doesn’t fully cover. If you’re running --ephemeral with needs dependencies, you need a mechanism to provision runners for dependent jobs. The webhook alone isn’t enough.
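
The safety-net half can be sketched as a pure filter over job objects fetched from the GitHub API. The `Job` interface is an assumed subset of the REST API's job fields, and `findOrphanedJobs` is a hypothetical helper; the polling and provisioning calls are left out.

```typescript
// Assumed subset of the fields on a GitHub Actions job object.
interface Job {
  id: number;
  status: "queued" | "in_progress" | "completed" | "waiting";
  runner_id: number | null; // null until a runner has been assigned
  labels: string[];
}

// Hypothetical helper: queued self-hosted jobs that no runner has picked up.
function findOrphanedJobs(jobs: Job[]): Job[] {
  return jobs.filter(
    (job) =>
      job.status === "queued" &&
      job.runner_id === null &&
      job.labels.includes("self-hosted"),
  );
}
```

Because the webhook path may already be provisioning a runner for some of these jobs, a real version would likely need a grace period or a record of in-flight requests to avoid launching duplicates.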

Lesson 3: Your Orchestrator Is Doing Too Much

Our original Step Functions state machine was a thing of beauty — in the same way that a Rube Goldberg machine is a thing of beauty. It tried each instance type in each availability zone, one at a time, in sequence. After the ENG-1136 expansion, that was 27 sequential steps: 15 spot attempts (5 types × 3 AZs) followed by 12 on-demand attempts (4 types × 3 AZs).

Each step invoked the consumer Lambda, caught InsufficientInstanceCapacity, and fell through to the next. The state machine definition was 400+ lines of CDK code. Adding a new instance type meant adding it in the right position in the chain, updating the catch handlers, and hoping you didn’t break the fallback ordering.

We replaced it with a 2-step state machine:

```mermaid
flowchart LR
    A[Start] --> B[Try Spot]
    B -->|Success| D[Done]
    B -->|All spot failed| C[Try On-Demand]
    C -->|Success| D
    C -->|All failed| E[Fail]
```

The combinatorial complexity moved into the consumer Lambda itself. Instead of the state machine iterating through type/AZ combinations, the Lambda receives {market: 'spot'} or {market: 'on-demand'}, builds the list of all (type, subnet) combinations, shuffles them randomly, and tries each one until it gets an instance:

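```javascript
// Try every (type, subnet) pool for one market in random order.
// The wrapper name is illustrative; launchInstance and isCapacityError live elsewhere.
async function launchInAnyPool({ instanceTypes, subnetIds, market }) {
  const combos = instanceTypes.flatMap(type =>
    subnetIds.map(subnet => ({ type, subnet }))
  )
  // Fisher-Yates shuffle to avoid thundering herd on the same type/AZ
  for (let i = combos.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[combos[i], combos[j]] = [combos[j], combos[i]]
  }

  for (const { type, subnet } of combos) {
    try {
      return await launchInstance({ type, subnet, market })
    } catch (err) {
      // Capacity errors mean "try the next pool"; anything else is a real failure
      if (isCapacityError(err)) continue
      throw err
    }
  }
  // Every pool was out of capacity, so fail and let the state machine move on
  throw new Error(`No ${market} capacity in any of ${combos.length} pools`)
}
```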

The random shuffle is important. Without it, every runner request would hammer the same instance type and AZ first, creating artificial contention. With the shuffle, requests spread naturally across the fleet.
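
The isCapacityError check is not shown here, so the following is only a guess at its shape. It assumes the error code arrives in `name` (as with AWS SDK for JavaScript v3) or `code` (as with v2), and the set of codes treated as "try the next pool" beyond InsufficientInstanceCapacity is an assumption.

```typescript
// EC2 error codes treated as "this pool cannot serve the request, try another".
// Only InsufficientInstanceCapacity is named in the post; the rest are assumed.
const CAPACITY_ERROR_CODES: ReadonlySet<string> = new Set([
  "InsufficientInstanceCapacity",
  "SpotMaxPriceTooLow",
  "Unsupported",
]);

// Hypothetical implementation: inspect the fields where the SDK puts the code.
function isCapacityError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { name?: unknown; code?: unknown };
  return [candidate.name, candidate.code].some(
    (value) => typeof value === "string" && CAPACITY_ERROR_CODES.has(value),
  );
}
```

Keeping the set narrow matters: an error such as a missing IAM permission should surface immediately instead of being retried across every pool.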

The state machine went from 400+ lines to about 40. Adding a new instance type is now a one-line change in the config array. The consumer Lambda handles the complexity, the state machine handles the flow.

The lesson: push combinatorial complexity into code. Keep state machines for flow control, not iteration.
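
As a rough illustration of how little flow control is left, the two-step machine can be modelled in plain TypeScript. This is a stand-in for the Step Functions definition, not the CDK code itself; `TryMarket` and `provisionRunner` are hypothetical names, and returning null for "all pools exhausted" is an assumed convention.

```typescript
type Market = "spot" | "on-demand";

// Stand-in for invoking the consumer Lambda: resolves with an instance ID,
// or null when every pool in that market was out of capacity.
type TryMarket = (market: Market) => Promise<string | null>;

// Two steps only: spot first, then on-demand, then fail.
async function provisionRunner(tryMarket: TryMarket): Promise<string> {
  for (const market of ["spot", "on-demand"] as const) {
    const instanceId = await tryMarket(market);
    if (instanceId !== null) return instanceId;
  }
  throw new Error("No capacity in spot or on-demand");
}
```

All the type and AZ iteration lives behind `tryMarket`, which is why adding an instance type never touches this flow.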

Lesson 4: Production Deploys Deserve On-Demand

Not all CI jobs carry the same risk. A PR check that gets interrupted by spot reclamation? Minor inconvenience: re-run the job and move on. A CDK deploy to production that gets interrupted mid-stack-update? You’re now looking at a ROLLBACK_IN_PROGRESS CloudFormation stack and a potentially broken production environment.

We introduced label-based routing. Most jobs use [self-hosted, linux, x64, platform] and go to spot instances. Deploy-to-production jobs use [self-hosted, linux, x64, platform, on-demand] and go to dedicated on-demand instances via a separate SQS queue and orchestrator path.
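
The routing decision itself is small. A minimal sketch, assuming the webhook Lambda sees the job's labels as a string array; the `Route` names and the `routeJob` function are hypothetical stand-ins for the two SQS queues.

```typescript
// Hypothetical route names standing in for the two SQS queues.
type Route = "spot-queue" | "on-demand-queue";

// Labels every job handled by this fleet must carry.
const BASE_LABELS = ["self-hosted", "linux", "x64", "platform"];

// Returns the target queue, or null when the job is not meant for this fleet.
function routeJob(labels: string[]): Route | null {
  const isOurs = BASE_LABELS.every((label) => labels.includes(label));
  if (!isOurs) return null;
  return labels.includes("on-demand") ? "on-demand-queue" : "spot-queue";
}
```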

The cost math works out cleanly:

| Runner Type | Monthly Minutes | Cost/Month |
| --- | --- | --- |
| GitHub-hosted (before) | ~11,300 | ~$73 |
| Self-hosted spot (most jobs) | ~4,665 | ~$18 |
| Self-hosted on-demand (prod deploys) | ~450 | ~$13 |
| Total self-hosted | ~5,115 | ~$31 |

That’s a 58% cost reduction even with on-demand instances for production deploys. The savings from spot on the bulk of CI jobs more than offset the premium on the small number of production deployments.
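
For anyone checking the arithmetic, the percentage follows directly from the monthly figures in the table. The variable names are illustrative, and the inputs are the rounded dollar amounts shown above.

```typescript
// Approximate monthly costs in USD, taken from the table above.
const githubHostedCost = 73;
const spotCost = 18;
const onDemandCost = 13;

const selfHostedCost = spotCost + onDemandCost; // 31

// Reduction relative to the GitHub-hosted baseline, as a whole percentage.
const reductionPercent = Math.round((1 - selfHostedCost / githubHostedCost) * 100); // 58
```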

The lesson: match your runner type to the blast radius of the job. Cheap and interruptible for tests. Reliable for deployments. The routing is trivial; the risk reduction is not.

Lesson 5: The Fine-Tuning Never Stops

After the big migration, a stream of smaller operational issues surfaced. None were crises, but each one was a reminder that “deployed and working” is not the same as “operationally mature.”

Concurrency groups were too coarse. We had a single workflow-level concurrency group on the staging deploy with cancel-in-progress: true. Two rapid merges to main, and the first deploy got cancelled — including its in-flight Terraform apply. We split into per-job-category concurrency: backend and infrastructure deploys never get cancelled (cancel-in-progress: false), while frontend deploys to S3 can be safely superseded (cancel-in-progress: true).

S3 cache permissions were incomplete. The runner instance role could read the pnpm cache bucket but couldn’t write to it. Jobs passed when the cache was warm and failed intermittently when it needed updating — the worst kind of flaky test. Adding s3:PutObject and the multipart upload permissions fixed it.
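
A policy along these lines would cover both reads and writes. It is a sketch under assumptions: the bucket name is a placeholder, and the exact set of multipart actions the fix added is not stated, so the ones listed are a plausible guess.

```typescript
// Placeholder bucket name; substitute the real pnpm cache bucket.
const CACHE_BUCKET = "example-pnpm-cache";

// IAM policy document for the runner instance role (assumed shape).
const cachePolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "ListCacheBucket",
      Effect: "Allow",
      Action: ["s3:ListBucket", "s3:ListBucketMultipartUploads"],
      Resource: `arn:aws:s3:::${CACHE_BUCKET}`,
    },
    {
      Sid: "ReadWriteCacheObjects",
      Effect: "Allow",
      Action: [
        "s3:GetObject",
        "s3:PutObject", // the permission that was missing
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
      ],
      Resource: `arn:aws:s3:::${CACHE_BUCKET}/*`,
    },
  ],
};
```

Bucket-level actions go on the bucket ARN and object-level actions on the `/*` ARN; mixing the two up produces the same kind of intermittent failure.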

Runner metadata was invisible. When something went wrong, we couldn’t tell which instance type or AZ a job had run on. We added a log-runner-info step to every self-hosted job that queries the instance metadata service and logs the instance type, availability zone, and spot/on-demand status. You can’t optimise what you can’t see.
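
What such a step queries can be sketched as follows. The actual log-runner-info step is not shown here and may well be a few lines of shell; this version assumes IMDSv2 and takes the HTTP call as a parameter (`HttpRequest`, a hypothetical type) so it can run against a fake.

```typescript
// Minimal HTTP abstraction; a real step would wrap calls to http://169.254.169.254.
type HttpRequest = (
  method: "GET" | "PUT",
  path: string,
  headers: Record<string, string>,
) => Promise<string>;

interface RunnerInfo {
  instanceType: string;
  availabilityZone: string;
  lifecycle: string; // "spot" or "on-demand"
}

// Hypothetical helper: fetch the three values worth logging for every job.
async function getRunnerInfo(http: HttpRequest): Promise<RunnerInfo> {
  // IMDSv2 requires a short-lived session token before any metadata read.
  const token = await http("PUT", "/latest/api/token", {
    "X-aws-ec2-metadata-token-ttl-seconds": "60",
  });
  const auth = { "X-aws-ec2-metadata-token": token };
  const [instanceType, availabilityZone, lifecycle] = await Promise.all([
    http("GET", "/latest/meta-data/instance-type", auth),
    http("GET", "/latest/meta-data/placement/availability-zone", auth),
    http("GET", "/latest/meta-data/instance-life-cycle", auth),
  ]);
  return { instanceType, availabilityZone, lifecycle };
}
```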

Lightweight jobs were wasting capacity. Change detection, deploy verification, summary jobs — none of these need a self-hosted runner. We moved them to ubuntu-latest, freeing up spot capacity for the jobs that actually need it.

The lesson: operational maturity is a gradient, not a milestone. Every week in production reveals another edge case, another assumption that doesn’t hold, another thing to tune.

Where We Are Now

The before-and-after, compared to the original post:

| Metric | Original (Apr 30) | Now (May 14) |
| --- | --- | --- |
| Region | eu-west-2 (London) | eu-west-1 (Ireland) |
| Instance types | 4 | 10 |
| Capacity pools | 12 | 30 |
| Orchestrator steps | 27 (sequential) | 2 (spot → on-demand) |
| Dependent job handling | None (webhook only) | Retry Lambda + waiting events |
| Cost vs GitHub-hosted | ~60% less | ~58% less |
| Boot-to-job | ~35s | ~35s |

The architecture hasn’t changed. It’s still webhook → queue → orchestrator → consumer → ephemeral EC2. The original design was sound. What changed was everything around it: the region, the fleet composition, the error handling, the observability, the operational awareness.

Building infrastructure is the weekend project. Operating it is the actual job. And the job never finishes — we still have a backlog of improvements: a watchdog workflow for stale queued runs, smarter cache warming, and Lambda bundle size reduction to avoid OOM on constrained instances.

But the PRs are flowing. The runners are provisioning. And I haven’t seen InsufficientInstanceCapacity in a week.

That’ll do for now.