Building Self-Hosted GitHub Actions Runners on AWS: From Webhook to Custom AMI

2026-04-30 · 12 min read

The Bill That Started It All

GitHub-hosted runners are a beautiful abstraction. You write runs-on: ubuntu-latest and a fresh VM materialises in someone else’s data centre. You don’t think about AMIs or instance types or security groups. It just works.

It also costs money. And when you’re a startup running a monorepo with CDK synth, TypeScript compilation, Vitest suites, and Playwright tests across multiple repos, those minutes add up. We were burning through GitHub Actions minutes at a rate that made the finance spreadsheet uncomfortable. Worse, the cold-start latency of GitHub-hosted runners was adding 30-60 seconds to every job just waiting for the VM to be ready.

So we did what any reasonable team does: we built our own.

The Architecture

The core idea is simple: when GitHub needs a runner, spin up an EC2 instance, run exactly one job, then kill it. Ephemeral, stateless, disposable. The implementation is… less simple.

Here’s the full flow:

GitHub (workflow_job: queued)
        │
        ▼
  API Gateway (REST)
        │
        ▼
  Webhook Lambda ──── HMAC-SHA256 verification
        │                    │
        ▼                    ▼
    SQS Queue          On-Demand Queue
        │                    │
        ▼                    ▼
  Poller Lambda        On-Demand Poller
        │                    │
        ▼                    ▼
  Step Functions       Step Functions
   (Spot chain)        (On-Demand chain)
        │                    │
        ▼                    ▼
  Consumer Lambda ──── EC2 RunInstances
        │
        ▼
  EC2 Instance (ephemeral)
   ├── Fetch runner token from SSM
   ├── Register with GitHub
   ├── Run exactly one job
   ├── Delete token from SSM
   └── Self-terminate

Five Lambda functions, two SQS queues, two Step Functions state machines, an API Gateway, and an EC2 fleet. All defined in CDK, all deployed to a dedicated DevOps AWS account that exists purely to run CI infrastructure.

The Webhook

GitHub fires a workflow_job event with action queued when a job needs a runner. We registered a GitHub App with a webhook pointing at our API Gateway endpoint. The webhook Lambda does three things:

Validates the HMAC-SHA256 signature using a shared secret stored in SSM Parameter Store
Checks whether the requested labels match our runner fleet (you don’t want to accidentally claim jobs meant for GitHub-hosted runners)
Routes the message to either the spot queue or the on-demand queue based on job labels
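
The signature check in the first step can be sketched as follows. This is a minimal illustration rather than our exact handler: it assumes the shared secret has already been loaded from Parameter Store and that the Lambda has access to the raw, unparsed request body. GitHub sends the signature in the X-Hub-Signature-256 header as sha256= followed by a hex digest; the function names here are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Computes the value GitHub places in the X-Hub-Signature-256 header. */
function computeSignature(secret: string, rawBody: string): string {
  const digest = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${digest}`;
}

/** Returns true only when the header matches the HMAC of the raw body. */
function verifySignature(
  secret: string,
  rawBody: string,
  header: string | undefined,
): boolean {
  if (header === undefined) return false;
  const expected = Buffer.from(computeSignature(secret, rawBody), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The constant-time comparison matters here: a plain string comparison can leak how many leading characters matched.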

The routing is a key design decision. Most CI jobs - PR checks, linting, tests - are fine on spot instances. If a spot instance gets reclaimed mid-job, the workflow retries and we spin up another one. But deploy jobs are different. A spot reclamation during cdk deploy can leave your infrastructure in a partially-deployed state. Those get routed to on-demand instances.
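
A routing function along these lines captures the label check and the spot versus on-demand split. It is a sketch under assumptions: the fleet labels are the ones the runners register with later in this post, while the on-demand marker label is a hypothetical name (a runner launched for such a job would also need to register with it).

```typescript
type Route = "spot" | "on-demand" | "ignore";

// Labels the fleet registers with (see the config.sh call in this post).
const FLEET_LABELS = new Set(["self-hosted", "linux", "x64", "platform"]);
// Hypothetical marker label for jobs that must not be interrupted.
const ON_DEMAND_LABEL = "on-demand";

function routeJob(action: string, requestedLabels: string[]): Route {
  // Only newly queued jobs need a runner.
  if (action !== "queued") return "ignore";

  const labels = requestedLabels.map((label) => label.toLowerCase());

  // Jobs without "self-hosted" are meant for GitHub-hosted runners.
  if (!labels.includes("self-hosted")) return "ignore";

  // Every requested label must be one this fleet can satisfy.
  const satisfiable = labels.every(
    (label) => FLEET_LABELS.has(label) || label === ON_DEMAND_LABEL,
  );
  if (!satisfiable) return "ignore";

  return labels.includes(ON_DEMAND_LABEL) ? "on-demand" : "spot";
}
```

Returning "ignore" instead of throwing keeps the webhook response a plain 200 for events that simply are not ours.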

The Step Functions Orchestrator

This is where it gets interesting. Spot instances are cheap but unreliable. You can’t just call RunInstances with spot and hope for the best - capacity varies by instance type and availability zone.

Our Step Functions state machine implements a fallback chain:

Spot m7i.large in eu-west-2a
  ├── Success → done
  └── InsufficientCapacity →
      Spot m6i.large in eu-west-2b
        ├── Success → done
        └── InsufficientCapacity →
            Spot c7i.large in eu-west-2c
              ├── Success → done
              └── InsufficientCapacity →
                  Spot r7i.large in eu-west-2a
                    ├── Success → done
                    └── InsufficientCapacity →
                        On-Demand m7i.large in eu-west-2a
                          ├── Success → done
                          └── All failed → ☠️

The chain tries four different spot instance types across three availability zones before falling back to on-demand. Each step catches InsufficientInstanceCapacity and MaxSpotInstanceCountExceeded errors with exponential backoff retries. The whole state machine has a 5-minute timeout.

The CDK code that builds this chain is satisfyingly terse:

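// Start from the last resort and wrap each spot attempt around it.
let spotFallback: sfn.IChainable = launchOnDemand
for (let i = spotTypes.length - 1; i >= 0; i--) {
  const task = makeSpotTask(
    `LaunchSpot${i}`,
    spotTypes[i],
    azs[i % azs.length], // rotate through the AZs
  )
  // On a capacity error, fall through to the next attempt in the chain.
  task.addCatch(spotFallback, {
    errors: ['InsufficientInstanceCapacity', 'MaxSpotInstanceCountExceeded'],
    resultPath: '$.error',
  })
  task.next(success)
  spotFallback = task
}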

It builds the chain backwards - the last fallback (on-demand) is defined first, then each spot attempt catches failures and falls through to the next. The result is a linear chain that reads top-to-bottom in the Step Functions console but was built bottom-to-top in code.
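
The same backwards construction can be shown without CDK. The sketch below uses a plain linked structure instead of Step Functions states (the LaunchAttempt type and helper names are illustrative, not part of our stack) and walks the result to confirm it reads in the same order as the diagram above.

```typescript
type Purchase = "spot" | "on-demand";

interface LaunchAttempt {
  name: string;
  purchase: Purchase;
  instanceType: string;
  az: string;
  onCapacityError?: LaunchAttempt; // next attempt to try, if any
}

function buildFallbackChain(
  spotTypes: string[],
  azs: string[],
  onDemandType: string,
): LaunchAttempt {
  if (azs.length === 0) throw new Error("at least one availability zone is required");

  // The last resort is created first...
  let fallback: LaunchAttempt = {
    name: "LaunchOnDemand",
    purchase: "on-demand",
    instanceType: onDemandType,
    az: azs[0],
  };

  // ...then each spot attempt is wrapped around whatever comes after it.
  for (let i = spotTypes.length - 1; i >= 0; i--) {
    fallback = {
      name: `LaunchSpot${i}`,
      purchase: "spot",
      instanceType: spotTypes[i],
      az: azs[i % azs.length],
      onCapacityError: fallback,
    };
  }
  return fallback;
}

/** Walks the chain from the first attempt to the last. */
function describeChain(head: LaunchAttempt): string[] {
  const steps: string[] = [];
  for (let step: LaunchAttempt | undefined = head; step !== undefined; step = step.onCapacityError) {
    steps.push(`${step.purchase} ${step.instanceType} ${step.az}`);
  }
  return steps;
}

const chain = buildFallbackChain(
  ["m7i.large", "m6i.large", "c7i.large", "r7i.large"],
  ["eu-west-2a", "eu-west-2b", "eu-west-2c"],
  "m7i.large",
);
console.log(describeChain(chain).join("\n"));
```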

The Consumer Lambda

The consumer Lambda is the one that actually calls the EC2 API. For each invocation it:

Generates a GitHub App JWT and exchanges it for an organisation-level runner registration token
Stores the registration token in SSM Parameter Store (not Secrets Manager - cheaper and we delete it within seconds)
Calls RunInstances with the specified instance type, AZ, and purchase option
Tags the instance with RunnerManagedBy: self-hosted-runner and the SSM parameter name so the instance can find its own token
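
The request the consumer sends to RunInstances might be assembled roughly like this. It is a sketch, not our exact code: the field names follow the EC2 RunInstances API, but the LaunchRequest shape, the RunnerTokenParameter tag key, and the idea of passing a subnet ID for the chosen AZ are assumptions made for illustration.

```typescript
interface LaunchRequest {
  amiId: string;
  instanceType: string;
  subnetId: string; // a subnet in the AZ chosen by the state machine (assumed)
  purchaseOption: "spot" | "on-demand";
  tokenParameterName: string; // SSM parameter holding the registration token
  userDataBase64: string;
}

interface RunInstancesParams {
  ImageId: string;
  InstanceType: string;
  MinCount: number;
  MaxCount: number;
  SubnetId: string;
  UserData: string;
  InstanceMarketOptions?: {
    MarketType: "spot";
    SpotOptions: {
      SpotInstanceType: "one-time";
      InstanceInterruptionBehavior: "terminate";
    };
  };
  TagSpecifications: {
    ResourceType: "instance";
    Tags: { Key: string; Value: string }[];
  }[];
}

function buildRunInstancesParams(req: LaunchRequest): RunInstancesParams {
  const params: RunInstancesParams = {
    ImageId: req.amiId,
    InstanceType: req.instanceType,
    MinCount: 1, // exactly one instance per job
    MaxCount: 1,
    SubnetId: req.subnetId,
    UserData: req.userDataBase64,
    TagSpecifications: [
      {
        ResourceType: "instance",
        Tags: [
          { Key: "RunnerManagedBy", Value: "self-hosted-runner" },
          // Hypothetical tag key; lets the instance locate its own token.
          { Key: "RunnerTokenParameter", Value: req.tokenParameterName },
        ],
      },
    ],
  };

  // On-demand is the default; spot has to be requested explicitly.
  if (req.purchaseOption === "spot") {
    params.InstanceMarketOptions = {
      MarketType: "spot",
      SpotOptions: {
        SpotInstanceType: "one-time",
        InstanceInterruptionBehavior: "terminate",
      },
    };
  }
  return params;
}
```

Keeping this as a pure function makes the spot and on-demand variants easy to unit test without touching AWS.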

The instance boots with a UserData script along these lines:

# Get instance ID from IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Fetch runner token from SSM (placed there by the Consumer Lambda)
RUNNER_TOKEN=$(aws ssm get-parameter \
  --name "$TOKEN_SECRET_NAME" \
  --with-decryption \
  --query Parameter.Value --output text)

# Register as ephemeral runner - one job, then exit
./config.sh --url "https://github.com/OurOrg" \
  --token "$RUNNER_TOKEN" \
  --labels "self-hosted,linux,x64,platform" \
  --name "runner-${INSTANCE_ID}" \
  --ephemeral

# Delete the token immediately
aws ssm delete-parameter --name "$TOKEN_SECRET_NAME"

# Disable auto-update (we control the version via AMI)
export RUNNER_CFG_DISABLE_AUTO_UPDATE=1
./run.sh

# Self-terminate after job completes
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

The --ephemeral flag is critical. It tells the runner agent to accept exactly one job and then exit. No lingering instances, no state leakage between jobs, no “why is my test seeing files from the previous run” debugging sessions.

The Cleanup Lambda

Even with ephemeral runners and self-termination, things go wrong. The runner process might crash. The EC2 instance might lose network connectivity. The self-terminate call might fail. So we run a cleanup Lambda every 15 minutes that scans for instances tagged RunnerManagedBy: self-hosted-runner that have been alive longer than 30 minutes and terminates them.
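
The selection logic of that cleanup pass is small enough to sketch as a pure function. The RunnerInstance shape below is a simplified, hypothetical stand-in for what DescribeInstances returns; the tag and the 30-minute threshold are the ones described above.

```typescript
interface RunnerInstance {
  instanceId: string;
  launchTime: Date;
  tags: Record<string, string>;
}

const MAX_RUNNER_AGE_MS = 30 * 60 * 1000; // 30 minutes

/** Returns the IDs of managed runners that have outlived the age limit. */
function findStaleRunners(
  instances: RunnerInstance[],
  now: Date,
  maxAgeMs: number = MAX_RUNNER_AGE_MS,
): string[] {
  return instances
    // Never touch instances this system did not launch.
    .filter((instance) => instance.tags["RunnerManagedBy"] === "self-hosted-runner")
    .filter((instance) => now.getTime() - instance.launchTime.getTime() > maxAgeMs)
    .map((instance) => instance.instanceId);
}
```

The returned IDs would then be passed to a single TerminateInstances call.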

Belt and suspenders. In infrastructure, paranoia is a feature.

The VPC

The runner VPC is intentionally minimal: three public subnets across three AZs, no private subnets, no NAT gateway. The runners only need outbound HTTPS (port 443) to talk to GitHub and AWS APIs. No NAT gateway means no $30/month/AZ charge for a resource that’s only used during CI runs.

The security group allows outbound 443 and nothing else. No SSH, no inbound traffic. If you need to debug a runner, you use SSM Session Manager (the instance role includes the AmazonSSMManagedInstanceCore policy).

The ARM64 Detour

With the basic architecture working, I made what seemed like an obvious optimisation: switch to ARM64 Graviton instances. AWS prices Graviton about 20% cheaper than equivalent x86 instances, and ARM64 has better power efficiency. The GitHub runner agent supports ARM64. Our Lambda functions were already running on ARM64. Why not the runners too?

We switched the fleet over. The m7g.large and m6g.large instances booted fine. CI jobs ran. Cache keys got an architecture component to prevent cross-architecture contamination:

key: pnpm-store-${{ runner.arch }}-${{ hashFiles('**/pnpm-lock.yaml') }}

This was important - a pnpm store cached from an x86 runner would cause native module failures on ARM64. Including runner.arch in the key kept the caches separate.

Everything looked great for about two weeks. Then Vitest started hanging.

Not failing. Not erroring. Just… stopping. Jobs would hit the 30-minute timeout with no output. We investigated for days. Different Vitest configurations, different test isolation modes, memory limits, CPU affinity. Nothing helped. The hangs were intermittent and unreproducible locally (our dev machines are x86).

We never found the root cause. Commit 2e1f217 in our git history tells the story in one line:

feat: switch runner fleet from ARM64 to x86 - vitest hangs on ARM64

Sometimes the right engineering decision is to retreat. We switched back to x86 (m7i.large, m6i.large, c7i.large, r7i.large) and the hangs disappeared. The 20% savings weren’t worth the reliability tax.

Lesson learned: cheaper is only cheaper if it works.

The AMI Pipeline

Even after the architecture was stable, we had a boot time problem. Fresh instances booted from a vanilla Amazon Linux 2023 AMI, and the UserData script had to:

Download the GitHub Actions runner binary (~90MB)
Install system dependencies via dnf (libicu, git, jq, and about 20 Playwright browser dependencies)
Install AWS CLI v2
Install Node.js and pnpm

That’s a lot of downloading and installing on every single job. Each runner took about 3 minutes from EC2 launch to “ready for work.” For a PR check that runs lint + typecheck + tests in parallel, that’s 3 minutes of idle waiting multiplied by every parallel job.

The fix was obvious: pre-bake everything into a custom AMI.

EC2 Image Builder

We built an AMI pipeline using EC2 Image Builder. Five components, executed in order:

┌─────────────────────────────────────────┐
│           AMI Build Pipeline            │
│         (Sunday 03:00 UTC)              │
├─────────────────────────────────────────┤
│ 1. update-linux (AWS managed)           │
│    └── dnf update, security patches     │
│                                         │
│ 2. install-deps                         │
│    └── libicu, git, jq, Playwright libs │
│                                         │
│ 3. install-aws-cli                      │
│    └── AWS CLI v2 from zip              │
│                                         │
│ 4. install-node                         │
│    └── Node.js 24 + pnpm via corepack   │
│                                         │
│ 5. install-runner                       │
│    └── GitHub Actions runner v2.333.1   │
│        extracted to /opt/actions-runner │
└─────────────────────────────────────────┘
              │
              ▼
    AMI: github-runner-2026-05-04
              │
              ▼
    SSM: /runners/ami/current = ami-0abc123...

The pipeline runs weekly on Sunday at 03:00 UTC. When the build succeeds, an EventBridge event triggers a success Lambda that writes the new AMI ID to an SSM parameter. The consumer Lambda reads this parameter when launching instances - so new instances automatically pick up the latest AMI without any code changes or deploys.

If the build fails? Nothing happens. The SSM parameter keeps its current value, and runners continue booting from the last known-good AMI. The failure shows up in CloudWatch, we investigate, and we fix it. No downtime, no manual intervention.

The boot time improvement was dramatic:

Phase	Before (vanilla AMI)	After (custom AMI)
EC2 launch to running	~30s	~30s
UserData: install deps	~60s	0s (pre-baked)
UserData: download runner	~30s	0s (pre-baked)
UserData: install Node/pnpm	~40s	0s (pre-baked)
Runner registration	~5s	~5s
Total boot-to-job	~3 min	~35s

An 80% reduction. Every CI job, every PR, every deploy - 2.5 minutes faster.

The Version Drift Trap

There’s a subtle footgun with pre-baked AMIs: version drift. The runner binary version baked into the AMI must match the version the UserData script expects. If GitHub releases a new runner version and you update the config but forget to rebuild the AMI (or vice versa), the runner registration fails with a cryptic error about version mismatch.

We solved this with a single source of truth:

// ami-config.ts
export const AMI_PIPELINE_CONFIG = {
  /** Single source of truth: read by both the AMI pipeline and the UserData script */
  runnerVersion: '2.333.1',
  schedule: 'cron(0 3 ? * SUN *)',
  // ...
}

The comment is load-bearing. When you update runnerVersion, you update it in one place, and the AMI pipeline and UserData script both read from it. But you still need to trigger a new AMI build and wait for it to complete before the new version is live. The weekly schedule handles routine updates; urgent version bumps require a manual pipeline trigger.
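
One way to have the UserData script read from the same value is to render it from a template at synth time. This is a hedged sketch: the placeholder token and the renderUserData helper are hypothetical, and the version check is only a sanity guard against typos.

```typescript
// Hypothetical placeholder that the UserData template would contain.
const RUNNER_VERSION_PLACEHOLDER = "__RUNNER_VERSION__";

function renderUserData(template: string, runnerVersion: string): string {
  // Guard against typos such as "v2.333.1" or an empty string.
  if (!/^\d+\.\d+\.\d+$/.test(runnerVersion)) {
    throw new Error(`invalid runner version: "${runnerVersion}"`);
  }
  // Fail loudly if the template no longer carries the placeholder.
  if (!template.includes(RUNNER_VERSION_PLACEHOLDER)) {
    throw new Error("UserData template is missing the runner version placeholder");
  }
  return template.split(RUNNER_VERSION_PLACEHOLDER).join(runnerVersion);
}

const rendered = renderUserData(
  'RUNNER_VERSION="__RUNNER_VERSION__"\ncd /opt/actions-runner',
  "2.333.1",
);
console.log(rendered);
```

With this shape, a version bump touches the config file only, and a stale template fails at build time rather than at runner registration.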

Bootstrap Chicken-and-Egg

When you first deploy the AMI stack, there’s no AMI yet. The SSM parameter contains PENDING_FIRST_BUILD. If a runner tries to launch with that “AMI ID,” it fails spectacularly. The solution is unglamorous: after the first deploy, you manually trigger the Image Builder pipeline, wait 15-20 minutes for it to build, and then your runners are live.

We could automate this with a custom resource that triggers the first build on stack creation. We haven’t. It’s a one-time operation per account, and the manual step serves as a forcing function to verify the pipeline works before trusting it with production CI.
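
Until the first build completes, a small guard in the consumer can turn the spectacular failure into a readable one. The sketch below assumes the parameter value has already been fetched from SSM; the function name is hypothetical, and the pattern reflects the two AMI ID lengths AWS uses (8 or 17 hex characters).

```typescript
const PLACEHOLDER_AMI = "PENDING_FIRST_BUILD";
const AMI_ID_PATTERN = /^ami-(?:[0-9a-f]{8}|[0-9a-f]{17})$/;

/** Validates the SSM parameter value before it reaches RunInstances. */
function resolveAmiId(parameterValue: string): string {
  const value = parameterValue.trim();
  if (value === PLACEHOLDER_AMI) {
    throw new Error("AMI pipeline has not produced its first image yet; trigger a build");
  }
  if (!AMI_ID_PATTERN.test(value)) {
    throw new Error(`unexpected AMI parameter value: "${value}"`);
  }
  return value;
}
```

Failing before the EC2 call keeps the error message in the consumer's logs instead of buried in a RunInstances exception.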

The Numbers

After all three phases - basic architecture, x86 stabilisation, custom AMI - here’s where we landed:

Boot-to-job time: ~35 seconds (down from 3+ minutes on the vanilla AMI, and in line with the 30-60 seconds we used to wait for a GitHub-hosted runner)
Cost per runner-minute: ~60% less than GitHub-hosted (spot pricing on m7i.large)
Spot success rate: ~95% (the remaining 5% fall through to on-demand)
Monthly runner cost: predictable and proportional to actual CI usage
Peak concurrent runners: 16 (during cross-repo deploy storms)

What We’d Do Differently

Start with x86. The ARM64 detour cost us two weeks of debugging before we retreated. Unless your entire stack is ARM-native and thoroughly tested, start with the architecture you know works.

Build the AMI pipeline from day one. We ran on vanilla AMIs for weeks before building the custom pipeline. Every developer on the team experienced the 3-minute boot delay on every PR. The AMI pipeline took about a day to build and saved cumulative hours in the first week.

Use SSM Parameter Store, not Secrets Manager, for ephemeral tokens. We initially used Secrets Manager for runner registration tokens. At $0.40 per secret per month, that’s fine for long-lived secrets. For tokens that exist for 30 seconds before being deleted, Parameter Store (free for standard parameters) is the right choice.

Wrapping Up

Self-hosted runners are not a weekend project. The webhook, queue, orchestrator, consumer, cleanup, and AMI pipeline are six distinct components, each with its own failure modes and operational concerns. But for a team running serious CI workloads, the control and cost savings are worth it.

The key insight is treating runners as cattle, not pets. Every instance is ephemeral, stateless, and disposable. The AMI is rebuilt weekly. Runner auto-updates are disabled because we pin the version in the AMI. If an instance misbehaves, the cleanup Lambda terminates it and the next job gets a fresh one.

It’s not glamorous infrastructure. But it’s the kind of infrastructure that lets you merge a PR and have it running in staging 8 minutes later instead of 20. And that compounds into something that matters.

We Built Our Own CI Machines on AWS (And It Was Worth It)

2026-04-30 · 6 min read

The Bill That Started It All

GitHub offers hosted build servers for running automated tests and deployments. You write one line saying what kind of machine you want, and a fresh server appears, does your work, and disappears. It’s a genuinely elegant service.

It also charges by the minute. And when you’re running a monorepo with multiple apps, several test suites, and deployments across multiple environments, those minutes compound quickly. The finance spreadsheet had started giving me looks.

Beyond cost, there was a performance problem: GitHub’s servers took 30 to 60 seconds just to become ready for each job. On a busy day with many parallel jobs, that’s a lot of idle waiting.

So we did what any team that enjoys making things harder for themselves does: we built our own.

The Core Idea

The concept is simple: when GitHub needs a machine to run a job, we spin up our own server on AWS, run exactly one job on it, and then shut it down. Stateless, disposable, no lingering state from previous jobs.

The implementation is less simple.

When a job needs to run, GitHub sends us a notification. Our system receives that notification, checks it’s genuine, and decides what kind of machine to spin up. The machine boots up, registers itself with GitHub, runs the job, and then terminates itself when done. If anything goes wrong, a cleanup process runs every 15 minutes looking for machines that have been running too long and shuts them down.

There’s a nice detail in how we handle different kinds of jobs: most jobs (running tests, checking code quality) run on cheaper “spot” servers, which are spare AWS capacity sold at a discount. These can occasionally be reclaimed mid-job, but CI jobs retry naturally. Deployment jobs are different though. If a deployment gets interrupted partway through, you can end up with infrastructure in a broken state. Those get routed to more reliable standard servers.

The Spot Instance Fallback Chain

Cheap spot servers are only available when AWS has spare capacity of a particular type. To handle situations where one type isn’t available, we built a fallback chain: try this server type first, if that’s full try a different type, if that’s full try a third, and so on. Only if everything is full do we fall back to a standard server.

This kind of setup requires some reasonably involved orchestration, but once it’s working it hums along invisibly. Our spot success rate is around 95%, meaning we only pay full price about one in twenty times.

The ARM64 Detour

Once the basic system was working, I spotted what seemed like an obvious improvement. AWS offers server types based on ARM64 processors (the same chip architecture used in modern phones and Apple’s M-series chips) at about 20% less cost than equivalent traditional servers. Our serverless functions already ran on ARM64. The CI runner software supports it. Why not?

We switched the fleet over. Jobs ran. Cache keys got updated to keep ARM and traditional caches separate (otherwise code compiled for one architecture might accidentally be used on the other, causing strange failures). Everything looked great for about two weeks.

Then our test runner started hanging. Not failing with an error. Just… stopping. Jobs would hit the 30-minute timeout with no output, no clues. We dug into it for days: different configurations, different isolation modes, different memory limits. Nothing helped. The hangs were intermittent and impossible to reproduce on our development machines, which use the traditional architecture.

We never found the root cause. The commit message when we switched back tells the story concisely:

feat: switch runner fleet from ARM64 to x86 - vitest hangs on ARM64

Sometimes the right call is to retreat. The 20% savings were not worth the reliability problem. Lesson: cheaper is only cheaper if it works.

The Custom Machine Image

Even after the architecture was stable, we had a boot time problem. Each new server had to spend its first three minutes downloading and installing a lot of software before it could run any CI jobs. That’s three minutes multiplied by every parallel job, every pull request, every deployment.

The fix was to pre-install everything onto a custom machine template. Instead of downloading software at runtime, the server boots with everything already there. The template gets rebuilt automatically every week to pick up security updates. When a new template is ready, a script updates a central record with the new template ID. Future servers automatically use the latest template without any manual intervention.

The difference was dramatic:

Phase	Before	After
Machine startup	30 seconds	30 seconds
Installing software	About 2.5 minutes	0 (pre-installed)
Registering with GitHub	5 seconds	5 seconds
Total time to ready	About 3 minutes	About 35 seconds

An 80% reduction in startup time, felt on every single CI job.

The Chicken-and-Egg Problem

There’s one slightly awkward aspect of this setup: when you first deploy it, the template doesn’t exist yet. The system references a template ID that is initially just a placeholder. The first time, you have to manually trigger a template build, wait 15-20 minutes for it to complete, and then the system becomes operational.

We could automate this. We haven’t. It’s a one-time setup step per environment, and doing it manually is a useful forcing function to verify the pipeline works before trusting it with real CI jobs.

The Numbers

After all three phases (basic setup, switching back to traditional architecture, custom machine templates):

Machines are ready for work in about 35 seconds
We pay roughly 60% less per build minute compared to GitHub’s hosted runners
About 95% of spot launches succeed, so only about one in twenty falls back to a full-price server
At peak, we’ve had 16 machines running simultaneously during busy deploy periods

What We’d Do Differently

Start with the traditional x86 architecture. The ARM64 detour cost two weeks and taught us nothing except “this doesn’t work reliably.” Unless you have a specific reason to use ARM64 and confidence that your tools work on it, use what you know works.

Build the custom machine template pipeline from the very beginning. We ran without it for weeks and every developer experienced those three-minute waits on every pull request. The template pipeline took about a day to build and paid back that investment within the first week.

Closing Thoughts

This is not a weekend project. There are half a dozen moving parts, each with its own failure modes. But for a team running serious CI workloads, the control and cost savings are worth it.

The key insight is thinking of each machine as completely disposable. They boot, do one job, and die. The template is rebuilt weekly. If anything misbehaves, the cleanup process handles it. The result is a system you can trust without thinking about it, which frees you up to think about what you’re actually building.

It’s not glamorous infrastructure. But “merge a pull request and have it running in staging 8 minutes later” compounds into something that matters enormously over a year of development.