What Production Taught Us About Self-Hosted Runners

2026-05-14 · 10 min read

Two Weeks Later, Everything Broke

Two weeks after shipping our self-hosted runner infrastructure, I opened Slack to a wall of failed CI checks. InsufficientInstanceCapacity. Not one job — all of them. Every PR across every repo was stuck, queued behind runners that couldn’t be provisioned.

It wasn’t a one-off. Over the next few days, capacity errors kept appearing at unpredictable intervals. Sometimes runners would provision fine for hours, then suddenly nothing. PRs accumulated. Developers (well, developer — it’s just me) sat waiting. The infrastructure we’d built in a weekend was failing at its one job: running CI.

Building self-hosted runners is a weekend project. Operating them is the real work. Here are the lessons production taught us.

Lesson 1: Your Spot Pool Is Shallower Than You Think

Our original fleet was straightforward: c7i.large spot instances in eu-west-2 (London). One instance type, three availability zones. The fallback chain tried each AZ in sequence, then fell back to on-demand. Simple, clean, worked great in testing.

The problem is that eu-west-2 is one of AWS’s smaller European regions. The spot capacity pool for any single instance type in any single AZ is finite, and when it’s gone, it’s gone. We were fishing in a pond, not the ocean.

The first fix (ENG-1136) was reactive: add more instance types to the fallback chain. c6i.large, c7a.large, m7i.large. We expanded the Step Functions state machine to try every type in every AZ — 27 sequential attempts before giving up. It helped, but it was a band-aid on a structural problem.

The real fix was to change the pond entirely. I moved the runner fleet from eu-west-2 to eu-west-1 (Ireland) — AWS’s largest European region, with significantly deeper spot capacity pools and lower interruption rates. At the same time, we expanded from 4 instance types to 10:

Category	Instance Types
AMD (lower demand, more available)	c5a.large, c6a.large, m5a.large
Older Intel (larger pools)	c5.large, c6i.large, m5.large
Current-gen (Intel, plus AMD c7a)	m7i.large, m6i.large, c7i.large, c7a.large

Ten instance types across three availability zones gives us 30 distinct capacity pools. The AMD types are the secret weapon — they tend to have lower demand and higher availability because most teams default to the latest Intel generation.
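To make the pool arithmetic concrete, here is a minimal sketch that enumerates one capacity pool per (instance type, AZ) pair. The instance types are the ten listed in the table; the AZ names and the `capacityPools` helper are assumptions for illustration, not the production code.

```typescript
// Instance types from the table above.
const instanceTypes: string[] = [
  "c5a.large", "c6a.large", "m5a.large",
  "c5.large", "c6i.large", "m5.large",
  "m7i.large", "m6i.large", "c7i.large", "c7a.large",
];

// Assumed AZ names; the real ones depend on the account's subnets.
const availabilityZones: string[] = ["eu-west-1a", "eu-west-1b", "eu-west-1c"];

interface CapacityPool {
  instanceType: string;
  availabilityZone: string;
}

// Each (type, AZ) pair is a separate spot capacity pool.
function capacityPools(types: string[], zones: string[]): CapacityPool[] {
  const pools: CapacityPool[] = [];
  for (const instanceType of types) {
    for (const availabilityZone of zones) {
      pools.push({ instanceType, availabilityZone });
    }
  }
  return pools;
}

const pools = capacityPools(instanceTypes, availabilityZones);
console.log(pools.length); // 30
```

With a single instance type the same function yields only 3 pools, which is roughly where the original fleet started.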

The lesson: region choice and fleet diversity matter more than retry logic. You can build the most sophisticated fallback chain in the world, but if you’re fishing in a shallow pool, you’ll still come up empty.

Lesson 2: Ephemeral Runners Break `needs` Graphs

This one blindsided us. After the region migration, multi-job workflows with needs dependencies started hanging indefinitely. Single-job workflows worked fine. Staging deploys — which chain build → deploy → smoke-test — were completely blocked.

The root cause is a subtle interaction between ephemeral runners and GitHub’s webhook model:

sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant R1 as Runner 1

    GH->>GH: Workflow starts
    GH->>WH: workflow_job: queued (Job A)
    Note over GH: Job B (needs: A) not yet queued
    WH->>R1: Provision runner
    R1->>GH: Register, pick up Job A
    R1->>R1: Job A completes
    R1->>R1: Self-terminate (ephemeral)
    GH->>GH: Job B becomes runnable
    Note over GH: No webhook fires for Job B
    GH-->>GH: Job B waits forever

When a workflow starts, GitHub only queues the jobs that are immediately runnable. Jobs with needs dependencies aren’t queued until their prerequisites complete. Our ephemeral runner picks up Job A, runs it, and self-terminates. When Job B becomes runnable, GitHub doesn’t fire a new workflow_job: queued webhook for it. The only queued event we received was for Job A at workflow start, and the runner provisioned for that event is already gone.

The fix had two parts:

Handle waiting events. GitHub sends workflow_job with action waiting (not queued) when a dependent job becomes runnable. We updated the webhook Lambda to treat waiting as a trigger for runner provisioning.
Job-retry Lambda. A belt-and-suspenders safety net: an EventBridge-triggered Lambda that runs every 60 seconds, polls the GitHub API for queued jobs with no assigned runner, and provisions runners for any orphans it finds.
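The first part of the fix can be sketched as a small decision function inside the webhook handler. This is a minimal illustration based on the description above, not the actual Lambda: the `WorkflowJobEvent` type covers only the payload fields used here, and `shouldProvisionRunner` is a hypothetical helper name.

```typescript
// Minimal slice of a workflow_job webhook payload (only the fields used here).
interface WorkflowJobEvent {
  action: string; // e.g. "queued", "waiting", "in_progress", "completed"
  workflow_job: { id: number; labels: string[] };
}

// Actions that trigger provisioning. "queued" was the original trigger;
// "waiting" is the addition described above.
const PROVISIONING_ACTIONS: string[] = ["queued", "waiting"];

// Hypothetical helper: true when the event should provision a runner.
function shouldProvisionRunner(event: WorkflowJobEvent): boolean {
  // Ignore jobs that target GitHub-hosted runners.
  const targetsSelfHosted = event.workflow_job.labels.includes("self-hosted");
  return targetsSelfHosted && PROVISIONING_ACTIONS.includes(event.action);
}
```

Keeping the trigger list in one constant means a further action can be added later without touching the handler logic.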

sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant JR as Job Retry Lambda
    participant R1 as Runner 1
    participant R2 as Runner 2

    GH->>WH: workflow_job: queued (Job A)
    WH->>R1: Provision runner
    R1->>GH: Run Job A, self-terminate
    GH->>WH: workflow_job: waiting (Job B)
    WH->>R2: Provision runner
    R2->>GH: Run Job B
    Note over JR: Every 60s: poll for orphaned jobs
    JR->>GH: GET /actions/runs (queued, no runner?)
    JR-->>R2: Provision if webhook was missed

The lesson: ephemeral runners and multi-job workflows have a hidden interaction that GitHub’s webhook model doesn’t fully cover. If you’re running --ephemeral with needs dependencies, you need a mechanism to provision runners for dependent jobs. The webhook alone isn’t enough.
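The safety-net half of that mechanism comes down to a filter over the jobs GitHub reports. The sketch below assumes the job objects have already been fetched from the REST API and shows only the orphan check. `findOrphanedJobs` and the grace period are assumptions for illustration; the grace period is there so the poller does not race the webhook path for a job that was queued a moment ago.

```typescript
// Subset of the job object returned when listing jobs for a workflow run.
interface JobSummary {
  id: number;
  status: string;           // "queued", "in_progress", "completed", ...
  runner_id: number | null; // null until a runner picks the job up
  labels: string[];
  created_at: string;       // ISO 8601 timestamp
}

// Hypothetical helper: self-hosted jobs that are queued, have no runner,
// and have been waiting at least graceSeconds.
function findOrphanedJobs(
  jobs: JobSummary[],
  now: Date,
  graceSeconds: number = 60,
): JobSummary[] {
  return jobs.filter((job) => {
    const ageSeconds = (now.getTime() - Date.parse(job.created_at)) / 1000;
    return (
      job.status === "queued" &&
      job.runner_id === null &&
      job.labels.includes("self-hosted") &&
      ageSeconds >= graceSeconds
    );
  });
}
```

Each job this returns would be handed to the same provisioning path the webhook uses, so there is only one way to start a runner.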

Lesson 3: Your Orchestrator Is Doing Too Much

Our original Step Functions state machine was a thing of beauty — in the same way that a Rube Goldberg machine is a thing of beauty. It tried each instance type in each availability zone, one at a time, in sequence. After the ENG-1136 expansion, that was 27 sequential steps: 15 spot attempts (5 types × 3 AZs) followed by 12 on-demand attempts (4 types × 3 AZs).

Each step invoked the consumer Lambda, caught InsufficientInstanceCapacity, and fell through to the next. The state machine definition was 400+ lines of CDK code. Adding a new instance type meant adding it in the right position in the chain, updating the catch handlers, and hoping you didn’t break the fallback ordering.

We replaced it with a 2-step state machine:

flowchart LR
    A[Start] --> B[Try Spot]
    B -->|Success| D[Done]
    B -->|All spot failed| C[Try On-Demand]
    C -->|Success| D
    C -->|All failed| E[Fail]

The combinatorial complexity moved into the consumer Lambda itself. Instead of the state machine iterating through type/AZ combinations, the Lambda receives {market: 'spot'} or {market: 'on-demand'}, builds the list of all (type, subnet) combinations, shuffles them randomly, and tries each one until it gets an instance:

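// Build every (instance type, subnet) pair for the requested market
const combos = instanceTypes.flatMap(type =>
  subnetIds.map(subnet => ({ type, subnet }))
)
// Fisher-Yates shuffle to avoid thundering herd on the same type/AZ
for (let i = combos.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1))
  ;[combos[i], combos[j]] = [combos[j], combos[i]]
}

for (const { type, subnet } of combos) {
  try {
    return await launchInstance({ type, subnet, market })
  } catch (err) {
    // Capacity errors fall through to the next combination
    if (isCapacityError(err)) continue
    // Anything else is a real failure and should surface immediately
    throw err
  }
}
// Assumption: when every combination is out of capacity, fail the
// invocation so the state machine can move on to the next market
throw new Error(`No ${market} capacity in any type/subnet combination`)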

The random shuffle is important. Without it, every runner request would hammer the same instance type and AZ first, creating artificial contention. With the shuffle, requests spread naturally across the fleet.

The state machine went from 400+ lines to about 40. Adding a new instance type is now a one-line change in the config array. The consumer Lambda handles the complexity; the state machine handles the flow.

The lesson: push combinatorial complexity into code. Keep state machines for flow control, not iteration.
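For comparison, the whole of the remaining flow control fits in a few lines when written as ordinary code. This is only an illustration of the two-step shape, not the Step Functions definition: `provisionRunner` and the `TryMarket` callback are hypothetical stand-ins for the state machine and the consumer Lambda.

```typescript
type Market = "spot" | "on-demand";

// Stand-in for the consumer Lambda: resolves with an instance ID, or
// rejects when it could not launch anything in that market.
type TryMarket = (market: Market) => Promise<string>;

// The orchestrator's entire job: spot first, on-demand as the fallback.
// Simplification: any spot failure triggers the fallback here, whereas a
// real state machine would likely catch specific error types.
async function provisionRunner(tryMarket: TryMarket): Promise<string> {
  try {
    return await tryMarket("spot");
  } catch {
    // If on-demand also fails, the rejection propagates as the overall failure.
    return await tryMarket("on-demand");
  }
}
```

Everything about instance types and subnets stays behind the `tryMarket` callback, which is the point of the split.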

Lesson 4: Production Deploys Deserve On-Demand

Not all CI jobs carry the same risk. A PR check that gets interrupted by spot reclamation? Minor inconvenience: the job fails and gets re-run. A CDK deploy to production that gets interrupted mid-stack-update? You’re now looking at a ROLLBACK_IN_PROGRESS CloudFormation stack and a potentially broken production environment.

We introduced label-based routing. Most jobs use [self-hosted, linux, x64, platform] and go to spot instances. Deploy-to-production jobs use [self-hosted, linux, x64, platform, on-demand] and go to dedicated on-demand instances via a separate SQS queue and orchestrator path.
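The routing decision itself is small. A minimal sketch, assuming the webhook handler has the job's labels to hand; the queue names and the `routeJob` helper are hypothetical, and only the label sets come from the setup described above.

```typescript
// Hypothetical queue identifiers; the real queues are SQS queues.
type QueueName = "spot-queue" | "on-demand-queue";

// Decide which queue a job's runner request goes to, based on its labels.
// Returns null for jobs that do not target the self-hosted fleet.
function routeJob(labels: string[]): QueueName | null {
  if (!labels.includes("self-hosted")) return null;
  return labels.includes("on-demand") ? "on-demand-queue" : "spot-queue";
}

console.log(routeJob(["self-hosted", "linux", "x64", "platform"]));              // spot-queue
console.log(routeJob(["self-hosted", "linux", "x64", "platform", "on-demand"])); // on-demand-queue
```

Because the default is spot, a workflow has to opt in to on-demand explicitly by adding the extra label.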

The cost math works out cleanly:

Runner Type	Monthly Minutes	Cost/Month
GitHub-hosted (before)	~11,300	~$73
Self-hosted spot (most jobs)	~4,665	~$18
Self-hosted on-demand (prod deploys)	~450	~$13
Total self-hosted	~5,115	~$31

That’s a 58% cost reduction even with on-demand instances for production deploys. The savings from spot on the bulk of CI jobs more than offset the premium on the small number of production deployments.
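The percentage follows directly from the table. As a quick check, using the approximate monthly figures above (variable names are just for illustration):

```typescript
// Approximate monthly costs in USD, taken from the table above.
const githubHostedCost = 73;
const spotCost = 18;
const onDemandCost = 13;

const selfHostedCost = spotCost + onDemandCost; // 31
const reduction = (githubHostedCost - selfHostedCost) / githubHostedCost;

console.log(`${Math.round(reduction * 100)}%`); // 58%
```

Dropping the on-demand line entirely would only move the figure from roughly 58% to roughly 75%, which is the price being paid for uninterrupted production deploys.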

The lesson: match your runner type to the blast radius of the job. Cheap and interruptible for tests. Reliable for deployments. The routing is trivial; the risk reduction is not.

Lesson 5: The Fine-Tuning Never Stops

After the big migration, a stream of smaller operational issues surfaced. None were crises, but each one was a reminder that “deployed and working” is not the same as “operationally mature.”

Concurrency groups were too coarse. We had a single workflow-level concurrency group on the staging deploy with cancel-in-progress: true. Two rapid merges to main, and the first deploy got cancelled — including its in-flight Terraform apply. We split into per-job-category concurrency: backend and infrastructure deploys never get cancelled (cancel-in-progress: false), while frontend deploys to S3 can be safely superseded (cancel-in-progress: true).

S3 cache permissions were incomplete. The runner instance role could read the pnpm cache bucket but couldn’t write to it. Jobs passed when the cache was warm and failed intermittently when it needed updating — the worst kind of flaky test. Adding s3:PutObject and the multipart upload permissions fixed it.

Runner metadata was invisible. When something went wrong, we couldn’t tell which instance type or AZ a job had run on. We added a log-runner-info step to every self-hosted job that queries the instance metadata service and logs the instance type, availability zone, and spot/on-demand status. You can’t optimise what you can’t see.
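A step like this can be sketched against the EC2 instance metadata service (IMDSv2). The sketch below is not the actual log-runner-info step: `getRunnerInfo` and the injectable `Fetch` type are assumptions made so the logic can be exercised off EC2, while the metadata paths (`instance-type`, `placement/availability-zone`, `instance-life-cycle`) are the standard IMDS categories.

```typescript
// Link-local IMDS endpoint, the same on every EC2 instance.
const IMDS = "http://169.254.169.254/latest";

// Injectable fetch-like function so the sketch can run with a fake off EC2.
type Fetch = (
  url: string,
  init?: { method?: string; headers?: Record<string, string> },
) => Promise<{ ok: boolean; text(): Promise<string> }>;

interface RunnerInfo {
  instanceType: string;
  availabilityZone: string;
  lifecycle: string; // "spot" or "on-demand"
}

async function getRunnerInfo(fetchFn: Fetch): Promise<RunnerInfo> {
  // IMDSv2 requires a session token obtained with a PUT request.
  const tokenRes = await fetchFn(`${IMDS}/api/token`, {
    method: "PUT",
    headers: { "X-aws-ec2-metadata-token-ttl-seconds": "60" },
  });
  if (!tokenRes.ok) throw new Error("could not obtain IMDSv2 token");
  const token = await tokenRes.text();

  // Read one metadata category using the session token.
  const read = async (path: string): Promise<string> => {
    const res = await fetchFn(`${IMDS}/meta-data/${path}`, {
      headers: { "X-aws-ec2-metadata-token": token },
    });
    if (!res.ok) throw new Error(`metadata lookup failed: ${path}`);
    return (await res.text()).trim();
  };

  return {
    instanceType: await read("instance-type"),
    availabilityZone: await read("placement/availability-zone"),
    lifecycle: await read("instance-life-cycle"),
  };
}
```

On a runner with a recent Node.js, passing the built-in `fetch` and logging the result as one JSON line should be enough to make the instance type, AZ, and lifecycle searchable in the job logs.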

Lightweight jobs were wasting capacity. Change detection, deploy verification, summary jobs — none of these need a self-hosted runner. We moved them to ubuntu-latest, freeing up spot capacity for the jobs that actually need it.

The lesson: operational maturity is a gradient, not a milestone. Every week in production reveals another edge case, another assumption that doesn’t hold, another thing to tune.

Where We Are Now

The before-and-after, compared to the original post:

Metric	Original (Apr 30)	Now (May 14)
Region	eu-west-2 (London)	eu-west-1 (Ireland)
Instance types	4	10
Capacity pools	12	30
Orchestrator steps	27 (sequential)	2 (spot → on-demand)
Dependent job handling	None (webhook only)	Retry Lambda + `waiting` events
Cost vs GitHub-hosted	~60% less	~58% less
Boot-to-job	~35s	~35s

The architecture hasn’t changed. It’s still webhook → queue → orchestrator → consumer → ephemeral EC2. The original design was sound. What changed was everything around it: the region, the fleet composition, the error handling, the observability, the operational awareness.

Building infrastructure is the weekend project. Operating it is the actual job. And the job never finishes — we still have a backlog of improvements: a watchdog workflow for stale queued runs, smarter cache warming, and Lambda bundle size reduction to avoid OOM on constrained instances.

But the PRs are flowing. The runners are provisioning. And I haven’t seen InsufficientInstanceCapacity in a week.

That’ll do for now.

What Production Taught Us About Self-Hosted Runners

2026-05-14 · 9 min read

Two Weeks Later, Everything Broke

Two weeks after building our own CI server fleet, I opened Slack to a wall of failures. Every automated check across every code repository was stuck. The servers we’d set up to run those checks couldn’t be created — AWS simply didn’t have the capacity.

It wasn’t a one-off. Over the next few days, capacity errors kept appearing at unpredictable intervals. Sometimes servers would spin up fine for hours, then suddenly nothing. Pull requests accumulated. Nobody could merge. The infrastructure we’d built in a weekend was failing at its one job: running our automated checks.

Building CI servers is a weekend project. Operating them is the real work. Here are the lessons production taught us.

Lesson 1: When Cheap Servers Run Out

A quick bit of context: AWS offers “spot” servers — spare capacity sold at a steep discount. They’re perfect for CI because if one gets reclaimed, the job just retries. The catch is that availability depends on which type of server you ask for, in which region, at any given moment.

Our original setup asked for one specific server type in London (one of AWS’s smaller European regions). One type, three data centres. When that type was available, great. When it wasn’t, everything stopped.

The first fix was reactive: ask for more types. We expanded from one to four, and tried each one across each data centre — 27 attempts in total before giving up. It helped, but we were still fishing in a small pond.

The real fix was to move to a bigger pond. I migrated the entire fleet from London to Ireland — AWS’s largest European region, with significantly more spare capacity. At the same time, we expanded from 4 server types to 10, mixing different processor families and generations:

Category	Why It Helps
AMD processors (3 types)	Lower demand from other customers, more availability
Older Intel (3 types)	Larger pools, because everyone else wants the newest
Current-gen, mostly Intel (4 types)	Best performance when available

Ten types across three data centres gives us 30 different places to find a server. The AMD types turned out to be the secret weapon — most teams default to the latest Intel, so the AMD pools are less contested.

The lesson: where you look for servers and how many types you’re willing to use matters more than how cleverly you retry.

Lesson 2: The Timing Bug That Blocked All Deploys

This one blindsided us. After the region migration, any automated workflow with multiple steps started hanging. Simple one-step workflows worked fine. Our deploy pipeline — which chains build, deploy, and verification steps — was completely stuck.

The problem was a subtle timing issue with how we provision servers:

sequenceDiagram
    participant GH as GitHub
    participant WH as Our Webhook
    participant R1 as Server 1

    GH->>GH: Workflow starts
    GH->>WH: "Job A needs a server"
    Note over GH: Job B (depends on A) waits
    WH->>R1: Spin up a server
    R1->>GH: Run Job A, shut down
    GH->>GH: Job B is now ready
    Note over GH: But no notification is sent!
    GH-->>GH: Job B waits forever

Here’s what happens: when a workflow starts, GitHub only tells us about jobs that can run immediately. Jobs that depend on other jobs aren’t announced until their prerequisites finish. Our server runs Job A and shuts down (each server runs exactly one job, then self-destructs). When Job B becomes ready, GitHub doesn’t send a new “queued” notification for it. The only one we received was for Job A when the workflow started, and the server created for it is already gone.

The fix had two parts:

Listen for a different signal. GitHub sends a “waiting” notification (not “queued”) when a dependent job becomes ready. We updated our system to treat that as a trigger.
A safety net. A background process that runs every 60 seconds, checks GitHub for any jobs that are waiting but don’t have a server assigned, and spins one up. Belt and suspenders.

sequenceDiagram
    participant GH as GitHub
    participant WH as Our Webhook
    participant JR as Safety Net (every 60s)
    participant R1 as Server 1
    participant R2 as Server 2

    GH->>WH: "Job A needs a server"
    WH->>R1: Spin up server
    R1->>GH: Run Job A, shut down
    GH->>WH: "Job B is waiting"
    WH->>R2: Spin up server
    R2->>GH: Run Job B
    Note over JR: Also polling for orphaned jobs
    JR->>GH: Any stuck jobs? Spin up servers if so

The lesson: when each server runs only one job and then shuts down, you need a way to provision new servers for jobs that weren’t ready when the workflow started. The standard notification alone isn’t enough.

Lesson 3: Simplify the Decision-Making

Our original system for finding an available server was methodical but slow. It tried each server type in each data centre, one at a time, in a fixed order. After we expanded the fleet, that was 27 sequential attempts — try this combination, wait for the result, try the next one.

The orchestration service that managed this had grown to over 400 lines of configuration. Adding a new server type meant carefully inserting it in the right position and hoping you didn’t break the ordering.

We replaced it with a much simpler approach:

flowchart LR
    A[Start] --> B[Try cheap servers]
    B -->|Got one| D[Done]
    B -->|None available| C[Try standard servers]
    C -->|Got one| D
    C -->|None available| E[Fail]

Instead of the orchestration service iterating through 27 combinations, the server-provisioning code now receives a simple instruction (“try cheap” or “try standard”), builds the full list of possible combinations, shuffles them randomly, and tries each one until it finds availability.

The random shuffle is important — without it, every request would try the same combination first, creating artificial contention. With the shuffle, requests naturally spread across all available capacity.

The orchestration went from 400+ lines to about 40. Adding a new server type is now a one-line change. The complexity moved from a hard-to-maintain workflow into straightforward code.

The lesson: keep your orchestration simple. Let it decide what to do (try cheap, then try standard). Let regular code handle how to do it (shuffle and iterate through options).

Lesson 4: Not All Jobs Deserve Cheap Servers

Cheap spot servers can be reclaimed by AWS at any time. For a test suite, that’s a minor annoyance — the test just reruns. But for a production deployment, an interruption halfway through can leave your infrastructure in a partially updated state. That’s a genuine incident.

We introduced a routing system. Most jobs (tests, code checks, staging deploys) go to cheap spot servers. Production deployments go to standard servers that won’t be interrupted, using a separate queue and provisioning path. The distinction is based on a label in the workflow file — one extra word.

The cost still works out well:

Setup	Monthly Cost
GitHub’s servers (before)	~$73
Our spot servers (most jobs)	~$18
Our standard servers (prod deploys)	~$13
Our total	~$31

That’s a 58% cost reduction even with reliable servers for production work. The savings from cheap servers on the bulk of our work more than offset the premium on the small number of production deployments.

The lesson: match your server type to the consequences of interruption. Cheap and interruptible for tests. Reliable for deployments.

Lesson 5: The Fine-Tuning Never Stops

After the big migration, a stream of smaller issues surfaced. None were crises, but each one was a reminder that “deployed and working” is different from “operationally mature.”

Deploys were cancelling each other. Two rapid code merges would trigger two deployments, and the second would cancel the first — even if the first was halfway through updating infrastructure. We split the cancellation rules: infrastructure deployments never get cancelled (let them finish), while frontend deployments can be safely superseded.

Cache permissions were incomplete. Our servers could read the shared dependency cache but couldn’t write to it. Jobs passed when the cache was fresh and failed intermittently when it needed updating — the worst kind of inconsistent failure. A straightforward permissions fix.

We couldn’t see what was happening. When something went wrong, we had no way to tell which server type or data centre a job had used. We added a logging step to every job that records the server details. You can’t improve what you can’t measure.

Simple jobs were wasting capacity. Detection steps, summaries, and verification checks don’t need a powerful self-hosted server. We moved them to GitHub’s own hosted servers, freeing up capacity for the jobs that actually need it.

The lesson: operational maturity is a gradient, not a milestone. Every week in production reveals another assumption that doesn’t hold.

Where We Are Now

Compared to two weeks ago:

What Changed	Before	After
Region	London (smaller)	Ireland (largest in EU)
Server types	4	10
Available capacity pools	12	30
Decision steps	27 (one at a time)	2 (cheap → standard)
Multi-step workflow handling	Notifications only	Notifications + safety-net polling
Cost vs GitHub	~60% less	~58% less
Time to ready	~35 seconds	~35 seconds

The fundamental architecture hasn’t changed. It’s still: receive notification, queue the job, find a server, run the job, shut down the server. The original design was sound. What changed was everything around it — the region, the fleet composition, the error handling, the observability.

Building infrastructure is the weekend project. Operating it is the actual job. And the job doesn’t finish — we still have improvements planned for detecting stale jobs, smarter caching, and reducing memory pressure during builds.

But the pull requests are flowing. The servers are provisioning. And I haven’t seen a capacity error in a week.

That’ll do for now.
