What Production Taught Us About Self-Hosted Runners
Two Weeks Later, Everything Broke
Two weeks after shipping our self-hosted runner infrastructure, I opened Slack to a wall of failed CI checks. InsufficientInstanceCapacity. Not one job — all of them. Every PR across every repo was stuck, queued behind runners that couldn’t be provisioned.
It wasn’t a one-off. Over the next few days, capacity errors kept appearing at unpredictable intervals. Sometimes runners would provision fine for hours, then suddenly nothing. PRs accumulated. Developers (well, developer — it’s just me) sat waiting. The infrastructure we’d built in a weekend was failing at its one job: running CI.
Building self-hosted runners is a weekend project. Operating them is the real work. Here are the lessons production taught us.
Lesson 1: Your Spot Pool Is Shallower Than You Think
Our original fleet was straightforward: c7i.large spot instances in eu-west-2 (London). One instance type, three availability zones. The fallback chain tried each AZ in sequence, then fell back to on-demand. Simple, clean, worked great in testing.
The problem is that eu-west-2 is one of AWS’s smaller European regions. The spot capacity pool for any single instance type in any single AZ is finite, and when it’s gone, it’s gone. We were fishing in a pond, not the ocean.
The first fix (ENG-1136) was reactive: add more instance types to the fallback chain. c6i.large, c7a.large, m7i.large. We expanded the Step Functions state machine to try every type in every AZ — 27 sequential attempts before giving up. It helped, but it was a band-aid on a structural problem.
The real fix was to change the pond entirely. I moved the runner fleet from eu-west-2 to eu-west-1 (Ireland) — AWS’s largest European region, with significantly deeper spot capacity pools and lower interruption rates. At the same time, we expanded from 4 instance types to 10:
| Category | Instance Types |
|---|---|
| AMD (lower demand, more available) | c5a.large, c6a.large, m5a.large |
| Older Intel (larger pools) | c5.large, c6i.large, m5.large |
| Newer generations (Intel, plus AMD c7a) | m7i.large, m6i.large, c7i.large, c7a.large |
Ten instance types across three availability zones gives us 30 distinct capacity pools. The AMD types are the secret weapon — they tend to have lower demand and higher availability because most teams default to the latest Intel generation.
The lesson: region choice and fleet diversity matter more than retry logic. You can build the most sophisticated fallback chain in the world, but if you’re fishing in a shallow pool, you’ll still come up empty.
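To make the pool arithmetic concrete, here is a minimal sketch that enumerates one capacity pool per (instance type, AZ) pair. The instance types come from the table above; the AZ names are an assumption for illustration, since the post doesn't say which zones are in use, and `capacityPools` is a hypothetical helper.

```typescript
// Instance types from the table above.
const instanceTypes: string[] = [
  "c5a.large", "c6a.large", "m5a.large",              // AMD
  "c5.large", "c6i.large", "m5.large",                // older Intel
  "m7i.large", "m6i.large", "c7i.large", "c7a.large", // newer generations
];

// Assumed AZ names, for illustration only.
const availabilityZones: string[] = ["eu-west-1a", "eu-west-1b", "eu-west-1c"];

interface CapacityPool {
  instanceType: string;
  availabilityZone: string;
}

// Each (type, AZ) pair is a separate spot capacity pool.
function capacityPools(types: string[], zones: string[]): CapacityPool[] {
  const pools: CapacityPool[] = [];
  for (const instanceType of types) {
    for (const availabilityZone of zones) {
      pools.push({ instanceType, availabilityZone });
    }
  }
  return pools;
}

const pools = capacityPools(instanceTypes, availabilityZones);
```

With ten types and three zones, `pools.length` is 30. The earlier four-type fleet gives 4 × 3 = 12, which is where the "12 → 30" comparison comes from.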
Lesson 2: Ephemeral Runners Break needs Graphs
This one blindsided us. After the region migration, multi-job workflows with needs dependencies started hanging indefinitely. Single-job workflows worked fine. Staging deploys — which chain build → deploy → smoke-test — were completely blocked.
The root cause is a subtle interaction between ephemeral runners and GitHub’s webhook model:
```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant R1 as Runner 1
    GH->>GH: Workflow starts
    GH->>WH: workflow_job: queued (Job A)
    Note over GH: Job B (needs: A) not yet queued
    WH->>R1: Provision runner
    R1->>GH: Register, pick up Job A
    R1->>R1: Job A completes
    R1->>R1: Self-terminate (ephemeral)
    GH->>GH: Job B becomes runnable
    Note over GH: No webhook fires for Job B
    GH-->>GH: Job B waits forever
```
When a workflow starts, GitHub only queues the jobs that are immediately runnable. Jobs with needs dependencies aren’t queued until their prerequisites complete. Our ephemeral runner picks up Job A, runs it, and self-terminates. When Job B becomes runnable, GitHub doesn’t fire a new workflow_job: queued webhook — the event already fired when the workflow started, and the job wasn’t ready then.
The fix had two parts:
- Handle `waiting` events. GitHub sends `workflow_job` with action `waiting` (not `queued`) when a dependent job becomes runnable. We updated the webhook Lambda to treat `waiting` as a trigger for runner provisioning.
- Job-retry Lambda. A belt-and-suspenders safety net: an EventBridge-triggered Lambda that runs every 60 seconds, polls the GitHub API for queued jobs with no assigned runner, and provisions runners for any orphans it finds.
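A minimal sketch of the first part, under assumptions: the payload type is trimmed to the fields used here, the required labels are borrowed from the label sets this post uses for routing, and `shouldProvisionRunner` is a hypothetical helper name rather than the actual Lambda code.

```typescript
// Trimmed-down shape of a workflow_job webhook payload (only the fields used here).
interface WorkflowJobEvent {
  action: string; // e.g. "queued", "waiting", "in_progress", "completed"
  workflow_job: {
    id: number;
    labels: string[];
  };
}

// Actions that should trigger provisioning, per the fix described above.
const PROVISIONING_ACTIONS: string[] = ["queued", "waiting"];

// Labels a job must carry to be ours (assumed, based on the routing labels in this post).
const REQUIRED_LABELS: string[] = ["self-hosted", "platform"];

// Hypothetical helper: decide whether an event should result in a new runner.
function shouldProvisionRunner(event: WorkflowJobEvent): boolean {
  if (!PROVISIONING_ACTIONS.includes(event.action)) return false;
  return REQUIRED_LABELS.every((label) =>
    event.workflow_job.labels.includes(label)
  );
}
```

Since the same job could in principle show up under both actions, deduplicating on `workflow_job.id` before enqueueing is probably worth considering.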
```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Webhook Lambda
    participant JR as Job Retry Lambda
    participant R1 as Runner 1
    participant R2 as Runner 2
    GH->>WH: workflow_job: queued (Job A)
    WH->>R1: Provision runner
    R1->>GH: Run Job A, self-terminate
    GH->>WH: workflow_job: waiting (Job B)
    WH->>R2: Provision runner
    R2->>GH: Run Job B
    Note over JR: Every 60s: poll for orphaned jobs
    JR->>GH: GET /actions/runs (queued, no runner?)
    JR-->>R2: Provision if webhook was missed
```
The lesson: ephemeral runners and multi-job workflows have a hidden interaction that GitHub’s webhook model doesn’t fully cover. If you’re running --ephemeral with needs dependencies, you need a mechanism to provision runners for dependent jobs. The webhook alone isn’t enough.
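The safety-net check can be sketched as a pure filter over job records. This assumes the jobs have already been fetched from the GitHub API and trimmed to the fields below. `runnerId` mirrors the nullable `runner_id` on GitHub's job objects; `findOrphanedJobs`, `queuedAtMs` and the grace period are hypothetical names and values, not the real Lambda.

```typescript
interface JobRecord {
  id: number;
  status: string;          // "queued", "in_progress", "completed", ...
  runnerId: number | null; // null until a runner picks the job up
  queuedAtMs: number;      // when the job was first seen queued (epoch ms)
}

// Assumed grace period so the poller does not race the normal webhook path.
const GRACE_PERIOD_MS = 60000;

// A job is treated as orphaned if it is still queued, has no runner,
// and has been waiting at least as long as the grace period.
function findOrphanedJobs(
  jobs: JobRecord[],
  nowMs: number,
  gracePeriodMs: number = GRACE_PERIOD_MS
): JobRecord[] {
  return jobs.filter(
    (job) =>
      job.status === "queued" &&
      job.runnerId === null &&
      nowMs - job.queuedAtMs >= gracePeriodMs
  );
}
```

The grace period is a design choice: without it, the poller would also provision for jobs whose webhook-triggered runner is still booting, and you would pay for two instances per job.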
Lesson 3: Your Orchestrator Is Doing Too Much
Our original Step Functions state machine was a thing of beauty — in the same way that a Rube Goldberg machine is a thing of beauty. It tried each instance type in each availability zone, one at a time, in sequence. After the ENG-1136 expansion, that was 27 sequential steps: 15 spot attempts (5 types × 3 AZs) followed by 12 on-demand attempts (4 types × 3 AZs).
Each step invoked the consumer Lambda, caught InsufficientInstanceCapacity, and fell through to the next. The state machine definition was 400+ lines of CDK code. Adding a new instance type meant adding it in the right position in the chain, updating the catch handlers, and hoping you didn’t break the fallback ordering.
We replaced it with a 2-step state machine:
```mermaid
flowchart LR
    A[Start] --> B[Try Spot]
    B -->|Success| D[Done]
    B -->|All spot failed| C[Try On-Demand]
    C -->|Success| D
    C -->|All failed| E[Fail]
```
The combinatorial complexity moved into the consumer Lambda itself. Instead of the state machine iterating through type/AZ combinations, the Lambda receives {market: 'spot'} or {market: 'on-demand'}, builds the list of all (type, subnet) combinations, shuffles them randomly, and tries each one until it gets an instance:
```typescript
const combos = instanceTypes.flatMap(type =>
  subnetIds.map(subnet => ({ type, subnet }))
)

// Shuffle to avoid thundering herd on the same type/AZ
for (let i = combos.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1))
  ;[combos[i], combos[j]] = [combos[j], combos[i]]
}

for (const { type, subnet } of combos) {
  try {
    return await launchInstance({ type, subnet, market })
  } catch (err) {
    if (isCapacityError(err)) continue
    throw err
  }
}
```
The random shuffle is important. Without it, every runner request would hammer the same instance type and AZ first, creating artificial contention. With the shuffle, requests spread naturally across the fleet.
The state machine went from 400+ lines to about 40. Adding a new instance type is now a one-line change in the config array. The consumer Lambda handles the complexity, the state machine handles the flow.
The lesson: push combinatorial complexity into code. Keep state machines for flow control, not iteration.
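What's left in the state machine reduces to a small transition table. The sketch below is plain TypeScript rather than a Step Functions definition, and the state and outcome names are made up for illustration.

```typescript
type State = "TrySpot" | "TryOnDemand" | "Done" | "Fail";
type Outcome = "success" | "allFailed";

// Transition function for the two-step flow: spot first, then on-demand.
function nextState(state: State, outcome: Outcome): State {
  switch (state) {
    case "TrySpot":
      return outcome === "success" ? "Done" : "TryOnDemand";
    case "TryOnDemand":
      return outcome === "success" ? "Done" : "Fail";
    default:
      return state; // Done and Fail are terminal
  }
}

// Walk the flow given the outcome of each attempt, in order.
function run(outcomes: Outcome[]): State {
  let state: State = "TrySpot";
  for (const outcome of outcomes) {
    state = nextState(state, outcome);
  }
  return state;
}
```

For example, `run(["allFailed", "success"])` ends in `"Done"` via on-demand, and `run(["allFailed", "allFailed"])` ends in `"Fail"`. Nothing here knows about instance types or AZs, which is the point.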
Lesson 4: Production Deploys Deserve On-Demand
Not all CI jobs carry the same risk. A PR check that gets interrupted by spot reclamation? Minor inconvenience: the job fails and can be re-run. A CDK deploy to production that gets interrupted mid-stack-update? You’re now looking at a ROLLBACK_IN_PROGRESS CloudFormation stack and a potentially broken production environment.
We introduced label-based routing. Most jobs use [self-hosted, linux, x64, platform] and go to spot instances. Deploy-to-production jobs use [self-hosted, linux, x64, platform, on-demand] and go to dedicated on-demand instances via a separate SQS queue and orchestrator path.
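The routing decision itself is tiny. A sketch, assuming the presence of the `on-demand` label is the only thing that matters; the queue names are hypothetical, since the real SQS queue names aren't given here.

```typescript
type Market = "spot" | "on-demand";

// Label that opts a job into on-demand capacity (from the label sets above).
const ON_DEMAND_LABEL = "on-demand";

// Hypothetical router: pick the market from the job's runs-on labels.
function marketForLabels(labels: string[]): Market {
  return labels.includes(ON_DEMAND_LABEL) ? "on-demand" : "spot";
}

// Hypothetical queue names, for illustration only.
const QUEUE_FOR_MARKET: Record<Market, string> = {
  "spot": "runner-requests-spot",
  "on-demand": "runner-requests-on-demand",
};

function queueForJob(labels: string[]): string {
  return QUEUE_FOR_MARKET[marketForLabels(labels)];
}
```

Defaulting to spot means a workflow has to ask for on-demand explicitly, so the expensive path is opt-in.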
The cost math works out cleanly:
| Runner Type | Monthly Minutes | Cost/Month |
|---|---|---|
| GitHub-hosted (before) | ~11,300 | ~$73 |
| Self-hosted spot (most jobs) | ~4,665 | ~$18 |
| Self-hosted on-demand (prod deploys) | ~450 | ~$13 |
| Total self-hosted | ~5,115 | ~$31 |
That’s a 58% cost reduction even with on-demand instances for production deploys. The savings from spot on the bulk of CI jobs more than offset the premium on the small number of production deployments.
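The 58% figure follows directly from the table. A quick check in code, with the numbers copied from the table above and nothing new measured:

```typescript
// Monthly figures from the table above (USD, approximate).
const githubHostedCost = 73;
const spotCost = 18;
const onDemandCost = 13;

const selfHostedCost = spotCost + onDemandCost; // 31
const reduction = (githubHostedCost - selfHostedCost) / githubHostedCost;
const reductionPercent = Math.round(reduction * 100); // 58

// On-demand share of the self-hosted bill vs. its share of self-hosted minutes.
const onDemandCostShare = onDemandCost / selfHostedCost; // about 0.42
const onDemandMinuteShare = 450 / (4665 + 450);          // about 0.09
```

The last two lines show the premium plainly: production deploys are roughly 9% of self-hosted minutes but about 42% of the self-hosted bill, and the total still comes in well under GitHub-hosted.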
The lesson: match your runner type to the blast radius of the job. Cheap and interruptible for tests. Reliable for deployments. The routing is trivial; the risk reduction is not.
Lesson 5: The Fine-Tuning Never Stops
After the big migration, a stream of smaller operational issues surfaced. None were crises, but each one was a reminder that “deployed and working” is not the same as “operationally mature.”
Concurrency groups were too coarse. We had a single workflow-level concurrency group on the staging deploy with cancel-in-progress: true. Two rapid merges to main, and the first deploy got cancelled — including its in-flight Terraform apply. We split into per-job-category concurrency: backend and infrastructure deploys never get cancelled (cancel-in-progress: false), while frontend deploys to S3 can be safely superseded (cancel-in-progress: true).
S3 cache permissions were incomplete. The runner instance role could read the pnpm cache bucket but couldn’t write to it. Jobs passed when the cache was warm and failed intermittently when it needed updating — the worst kind of flaky test. Adding s3:PutObject and the multipart upload permissions fixed it.
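For reference, a sketch of what a read-write cache policy might look like, written as a plain object in the shape of an IAM policy document. The bucket name is hypothetical, and the exact action list is my assumption of what "the multipart upload permissions" covers, not a copy of the real policy.

```typescript
// Hypothetical bucket name; substitute the real cache bucket.
const CACHE_BUCKET = "example-pnpm-cache";

const cachePolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      // Object-level access: read, write, and multipart housekeeping.
      Effect: "Allow",
      Action: [
        "s3:GetObject",
        "s3:PutObject",               // the write permission that was missing
        "s3:AbortMultipartUpload",    // clean up failed multipart uploads
        "s3:ListMultipartUploadParts",
      ],
      Resource: `arn:aws:s3:::${CACHE_BUCKET}/*`,
    },
    {
      // Bucket-level access: listing keys and in-progress uploads.
      Effect: "Allow",
      Action: ["s3:ListBucket", "s3:ListBucketMultipartUploads"],
      Resource: `arn:aws:s3:::${CACHE_BUCKET}`,
    },
  ],
};
```

Note the split between object-level actions (resource ends in `/*`) and bucket-level actions (resource is the bucket itself); mixing those up is an easy way to end up with a policy that looks right and still denies.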
Runner metadata was invisible. When something went wrong, we couldn’t tell which instance type or AZ a job had run on. We added a log-runner-info step to every self-hosted job that queries the instance metadata service and logs the instance type, availability zone, and spot/on-demand status. You can’t optimise what you can’t see.
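A rough sketch of the data involved, assuming the step reads three standard instance metadata paths and prints one line. The actual step is likely a few lines of shell; the fetching is left out here, and `formatRunnerInfo` and the log format are hypothetical.

```typescript
// Instance metadata service paths for the three values mentioned above.
const METADATA_PATHS = {
  instanceType: "/latest/meta-data/instance-type",
  availabilityZone: "/latest/meta-data/placement/availability-zone",
  lifecycle: "/latest/meta-data/instance-life-cycle", // "spot" or "on-demand"
};

interface RunnerInfo {
  instanceType: string;
  availabilityZone: string;
  lifecycle: string;
}

// Hypothetical formatter: one greppable line per job.
function formatRunnerInfo(info: RunnerInfo): string {
  return (
    "runner-info" +
    " instance_type=" + info.instanceType +
    " az=" + info.availabilityZone +
    " lifecycle=" + info.lifecycle
  );
}
```

A single `key=value` line keeps it easy to search job logs for, say, every failure that ran on a particular AZ.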
Lightweight jobs were wasting capacity. Change detection, deploy verification, summary jobs — none of these need a self-hosted runner. We moved them to ubuntu-latest, freeing up spot capacity for the jobs that actually need it.
The lesson: operational maturity is a gradient, not a milestone. Every week in production reveals another edge case, another assumption that doesn’t hold, another thing to tune.
Where We Are Now
The before-and-after, compared to the original post:
| Metric | Original (Apr 30) | Now (May 14) |
|---|---|---|
| Region | eu-west-2 (London) | eu-west-1 (Ireland) |
| Instance types | 4 | 10 |
| Capacity pools | 12 | 30 |
| Orchestrator steps | 27 (sequential) | 2 (spot → on-demand) |
| Dependent job handling | None (webhook only) | Retry Lambda + waiting events |
| Cost vs GitHub-hosted | ~60% less | ~58% less |
| Boot-to-job | ~35s | ~35s |
The architecture hasn’t changed. It’s still webhook → queue → orchestrator → consumer → ephemeral EC2. The original design was sound. What changed was everything around it: the region, the fleet composition, the error handling, the observability, the operational awareness.
Building infrastructure is the weekend project. Operating it is the actual job. And the job never finishes — we still have a backlog of improvements: a watchdog workflow for stale queued runs, smarter cache warming, and Lambda bundle size reduction to avoid OOM on constrained instances.
But the PRs are flowing. The runners are provisioning. And I haven’t seen InsufficientInstanceCapacity in a week.
That’ll do for now.
What Production Taught Us About Self-Hosted Runners
Two Weeks Later, Everything Broke
Two weeks after building our own CI server fleet, I opened Slack to a wall of failures. Every automated check across every code repository was stuck. The servers we’d set up to run those checks couldn’t be created — AWS simply didn’t have the capacity.
It wasn’t a one-off. Over the next few days, capacity errors kept appearing at unpredictable intervals. Sometimes servers would spin up fine for hours, then suddenly nothing. Pull requests accumulated. Nobody could merge. The infrastructure we’d built in a weekend was failing at its one job: running our automated checks.
Building CI servers is a weekend project. Operating them is the real work. Here are the lessons production taught us.
Lesson 1: When Cheap Servers Run Out
A quick bit of context: AWS offers “spot” servers — spare capacity sold at a steep discount. They’re perfect for CI because if one gets reclaimed, the job just retries. The catch is that availability depends on which type of server you ask for, in which region, at any given moment.
Our original setup asked for one specific server type in London (one of AWS’s smaller European regions). One type, three data centres. When that type was available, great. When it wasn’t, everything stopped.
The first fix was reactive: ask for more types. We expanded from one to four and tried each one across each data centre, first as cheap spot servers and then as standard ones: 27 attempts in total before giving up. It helped, but we were still fishing in a small pond.
The real fix was to move to a bigger pond. I migrated the entire fleet from London to Ireland — AWS’s largest European region, with significantly more spare capacity. At the same time, we expanded from 4 server types to 10, mixing different processor families and generations:
| Category | Why It Helps |
|---|---|
| AMD processors (3 types) | Lower demand from other customers, more availability |
| Older Intel (3 types) | Larger pools, everyone else wants the newest |
| Current-gen Intel (4 types) | Best performance when available |
Ten types across three data centres gives us 30 different places to find a server. The AMD types turned out to be the secret weapon — most teams default to the latest Intel, so the AMD pools are less contested.
The lesson: where you look for servers and how many types you’re willing to use matters more than how cleverly you retry.
Lesson 2: The Timing Bug That Blocked All Deploys
This one blindsided us. After the region migration, any automated workflow with multiple steps started hanging. Simple one-step workflows worked fine. Our deploy pipeline — which chains build, deploy, and verification steps — was completely stuck.
The problem was a subtle timing issue with how we provision servers:
```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Our Webhook
    participant R1 as Server 1
    GH->>GH: Workflow starts
    GH->>WH: "Job A needs a server"
    Note over GH: Job B (depends on A) waits
    WH->>R1: Spin up a server
    R1->>GH: Run Job A, shut down
    GH->>GH: Job B is now ready
    Note over GH: But no notification is sent!
    GH-->>GH: Job B waits forever
```
Here’s what happens: when a workflow starts, GitHub only tells us about jobs that can run immediately. Jobs that depend on other jobs aren’t announced until their prerequisites finish. Our server runs Job A and shuts down (each server runs exactly one job, then self-destructs). When Job B becomes ready, GitHub doesn’t send a new notification — it already sent one when the workflow started, and Job B wasn’t ready then.
The fix had two parts:
- Listen for a different signal. GitHub sends a “waiting” notification (not “queued”) when a dependent job becomes ready. We updated our system to treat that as a trigger.
- A safety net. A background process that runs every 60 seconds, checks GitHub for any jobs that are waiting but don’t have a server assigned, and spins one up. Belt and suspenders.
```mermaid
sequenceDiagram
    participant GH as GitHub
    participant WH as Our Webhook
    participant JR as Safety Net (every 60s)
    participant R1 as Server 1
    participant R2 as Server 2
    GH->>WH: "Job A needs a server"
    WH->>R1: Spin up server
    R1->>GH: Run Job A, shut down
    GH->>WH: "Job B is waiting"
    WH->>R2: Spin up server
    R2->>GH: Run Job B
    Note over JR: Also polling for orphaned jobs
    JR->>GH: Any stuck jobs? Spin up servers if so
```
The lesson: when each server runs only one job and then shuts down, you need a way to provision new servers for jobs that weren’t ready when the workflow started. The standard notification alone isn’t enough.
Lesson 3: Simplify the Decision-Making
Our original system for finding an available server was methodical but slow. It tried each server type in each data centre, one at a time, in a fixed order. After we expanded the fleet, that was 27 sequential attempts — try this combination, wait for the result, try the next one.
The orchestration service that managed this had grown to over 400 lines of configuration. Adding a new server type meant carefully inserting it in the right position and hoping you didn’t break the ordering.
We replaced it with a much simpler approach:
```mermaid
flowchart LR
    A[Start] --> B[Try cheap servers]
    B -->|Got one| D[Done]
    B -->|None available| C[Try standard servers]
    C -->|Got one| D
    C -->|None available| E[Fail]
```
Instead of the orchestration service iterating through 27 combinations, the server-provisioning code now receives a simple instruction (“try cheap” or “try standard”), builds the full list of possible combinations, shuffles them randomly, and tries each one until it finds availability.
The random shuffle is important — without it, every request would try the same combination first, creating artificial contention. With the shuffle, requests naturally spread across all available capacity.
The orchestration went from 400+ lines to about 40. Adding a new server type is now a one-line change. The complexity moved from a hard-to-maintain workflow into straightforward code.
The lesson: keep your orchestration simple. Let it decide what to do (try cheap, then try standard). Let regular code handle how to do it (shuffle and iterate through options).
Lesson 4: Not All Jobs Deserve Cheap Servers
Cheap spot servers can be reclaimed by AWS at any time. For a test suite, that’s a minor annoyance — the test just reruns. But for a production deployment, an interruption halfway through can leave your infrastructure in a partially updated state. That’s a genuine incident.
We introduced a routing system. Most jobs (tests, code checks, staging deploys) go to cheap spot servers. Production deployments go to standard servers that won’t be interrupted, using a separate queue and provisioning path. The distinction is based on a label in the workflow file — one extra word.
The cost still works out well:
| Setup | Monthly Cost |
|---|---|
| GitHub’s servers (before) | ~$73 |
| Our spot servers (most jobs) | ~$18 |
| Our standard servers (prod deploys) | ~$13 |
| Our total | ~$31 |
That’s a 58% cost reduction even with reliable servers for production work. The savings from cheap servers on the bulk of our work more than offset the premium on the small number of production deployments.
The lesson: match your server type to the consequences of interruption. Cheap and interruptible for tests. Reliable for deployments.
Lesson 5: The Fine-Tuning Never Stops
After the big migration, a stream of smaller issues surfaced. None were crises, but each one was a reminder that “deployed and working” is different from “operationally mature.”
Deploys were cancelling each other. Two rapid code merges would trigger two deployments, and the second would cancel the first — even if the first was halfway through updating infrastructure. We split the cancellation rules: infrastructure deployments never get cancelled (let them finish), while frontend deployments can be safely superseded.
Cache permissions were incomplete. Our servers could read the shared dependency cache but couldn’t write to it. Jobs passed when the cache was fresh and failed intermittently when it needed updating — the worst kind of inconsistent failure. A straightforward permissions fix.
We couldn’t see what was happening. When something went wrong, we had no way to tell which server type or data centre a job had used. We added a logging step to every job that records the server details. You can’t improve what you can’t measure.
Simple jobs were wasting capacity. Detection steps, summaries, and verification checks don’t need a powerful self-hosted server. We moved them to GitHub’s free-tier servers, freeing up capacity for the jobs that actually need it.
The lesson: operational maturity is a gradient, not a milestone. Every week in production reveals another assumption that doesn’t hold.
Where We Are Now
Compared to two weeks ago:
| What Changed | Before | After |
|---|---|---|
| Region | London (smaller) | Ireland (largest in EU) |
| Server types | 4 | 10 |
| Available capacity pools | 12 | 30 |
| Decision steps | 27 (one at a time) | 2 (cheap → standard) |
| Multi-step workflow handling | Notifications only | Notifications + safety-net polling |
| Cost vs GitHub | ~60% less | ~58% less |
| Time to ready | ~35 seconds | ~35 seconds |
The fundamental architecture hasn’t changed. It’s still: receive notification, queue the job, find a server, run the job, shut down the server. The original design was sound. What changed was everything around it — the region, the fleet composition, the error handling, the observability.
Building infrastructure is the weekend project. Operating it is the actual job. And the job doesn’t finish — we still have improvements planned for detecting stale jobs, smarter caching, and reducing memory pressure during builds.
But the pull requests are flowing. The servers are provisioning. And I haven’t seen a capacity error in a week.
That’ll do for now.