Building Self-Hosted GitHub Actions Runners on AWS: From Webhook to Custom AMI

The Bill That Started It All

GitHub-hosted runners are a beautiful abstraction. You write runs-on: ubuntu-latest and a fresh VM materialises in someone else’s data centre. You don’t think about AMIs or instance types or security groups. It just works.

It also costs money. And when you’re a startup running a monorepo with CDK synth, TypeScript compilation, Vitest suites, and Playwright tests across multiple repos, those minutes add up. We were burning through GitHub Actions minutes at a rate that made the finance spreadsheet uncomfortable. Worse, the cold-start latency of GitHub-hosted runners was adding 30-60 seconds to every job just waiting for the VM to be ready.

So we did what any reasonable team does: we built our own.

The Architecture

The core idea is simple: when GitHub needs a runner, spin up an EC2 instance, run exactly one job, then kill it. Ephemeral, stateless, disposable. The implementation is… less simple.

Here’s the full flow:

GitHub (workflow_job: queued)
        │
        ▼
  API Gateway (REST)
        │
        ▼
  Webhook Lambda ──── HMAC-SHA256 verification
        │                    │
        ▼                    ▼
    SQS Queue          On-Demand Queue
        │                    │
        ▼                    ▼
  Poller Lambda        On-Demand Poller
        │                    │
        ▼                    ▼
  Step Functions       Step Functions
   (Spot chain)        (On-Demand chain)
        │                    │
        ▼                    ▼
  Consumer Lambda ──── EC2 RunInstances
        │
        ▼
  EC2 Instance (ephemeral)
   ├── Fetch runner token from SSM
   ├── Register with GitHub
   ├── Run exactly one job
   ├── Delete token from SSM
   └── Self-terminate

Five Lambda functions, two SQS queues, two Step Functions state machines, an API Gateway, and an EC2 fleet. All defined in CDK, all deployed to a dedicated DevOps AWS account that exists purely to run CI infrastructure.

The Webhook

GitHub fires a workflow_job event with action queued when a job needs a runner. We registered a GitHub App with a webhook pointing at our API Gateway endpoint. The webhook Lambda does three things:

  1. Validates the HMAC-SHA256 signature using a shared secret stored in SSM Parameter Store
  2. Checks whether the requested labels match our runner fleet (you don’t want to accidentally claim jobs meant for GitHub-hosted runners)
  3. Routes the message to either the spot queue or the on-demand queue based on job labels
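The signature check in step 1 can be sketched roughly as below. This is a minimal sketch rather than our Lambda's actual code: GitHub sends the digest in the X-Hub-Signature-256 header as sha256=<hex>, and the function name verifySignature is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

/**
 * Returns true when `signatureHeader` (the X-Hub-Signature-256 value,
 * formatted as "sha256=<hex digest>") matches the HMAC-SHA256 of the raw
 * request body under the shared webhook secret.
 */
function verifySignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith('sha256=')) return false
  const expected =
    'sha256=' + createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex')
  const received = Buffer.from(signatureHeader, 'utf8')
  const computed = Buffer.from(expected, 'utf8')
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === computed.length && timingSafeEqual(received, computed)
}
```

The one thing to get right is hashing the raw body exactly as it arrived, before any JSON parsing or re-serialisation, otherwise a valid delivery will fail verification.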

The routing is a key design decision. Most CI jobs - PR checks, linting, tests - are fine on spot instances. If a spot instance gets reclaimed mid-job, the workflow retries and we spin up another one. But deploy jobs are different. A spot reclamation during cdk deploy can leave your infrastructure in a partially-deployed state. Those get routed to on-demand instances.
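A routing rule along these lines would cover steps 2 and 3. Treat it as a sketch under assumptions: the workflow_job payload does carry the job's labels as an array, but the on-demand label name and the routeJob function are made up for illustration, and a runner can only pick up a job if it registers with every label the job asks for.

```typescript
type Route = 'spot' | 'on-demand' | 'ignore'

// Hypothetical label names. The post only shows the runner registering
// with "self-hosted,linux,x64,platform".
const FLEET_LABEL = 'platform'
const ON_DEMAND_LABEL = 'on-demand'

/** Decide which queue (if any) a workflow_job event belongs on. */
function routeJob(action: string, labels: readonly string[]): Route {
  // Only "queued" events need a new runner.
  if (action !== 'queued') return 'ignore'
  // Leave jobs aimed at GitHub-hosted runners (or other fleets) alone.
  if (!labels.includes('self-hosted') || !labels.includes(FLEET_LABEL)) return 'ignore'
  // Deploy-style jobs opt in to on-demand capacity via an extra label.
  return labels.includes(ON_DEMAND_LABEL) ? 'on-demand' : 'spot'
}
```

Keeping the decision in a pure function like this makes it easy to unit test without touching SQS.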

The Step Functions Orchestrator

This is where it gets interesting. Spot instances are cheap but unreliable. You can’t just call RunInstances with spot and hope for the best - capacity varies by instance type and availability zone.

Our Step Functions state machine implements a fallback chain:

Spot m7i.large in eu-west-2a
  ├── Success → done
  └── InsufficientCapacity →
      Spot m6i.large in eu-west-2b
        ├── Success → done
        └── InsufficientCapacity →
            Spot c7i.large in eu-west-2c
              ├── Success → done
              └── InsufficientCapacity →
                  Spot r7i.large in eu-west-2a
                    ├── Success → done
                    └── InsufficientCapacity →
                        On-Demand m7i.large in eu-west-2a
                          ├── Success → done
                          └── All failed → ☠️

The chain tries four different spot instance types across three availability zones before falling back to on-demand. Each step catches InsufficientInstanceCapacity and MaxSpotInstanceCountExceeded errors with exponential backoff retries. The whole state machine has a 5-minute timeout.

The CDK code that builds this chain is satisfyingly terse:

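// sfn is aws-cdk-lib/aws-stepfunctions; launchOnDemand, success,
// makeSpotTask, spotTypes and azs are defined elsewhere in the stack.
// Start from the final fallback (on-demand) and wrap each spot attempt
// around the fallback built so far.
let spotFallback: sfn.IChainable = launchOnDemand
for (let i = spotTypes.length - 1; i >= 0; i--) {
  const task = makeSpotTask(
    `LaunchSpot${i}`,
    spotTypes[i],
    azs[i % azs.length], // rotate through the AZs
  )
  // On a capacity error, fall through to the next attempt in the chain
  task.addCatch(spotFallback, {
    errors: ['InsufficientInstanceCapacity', 'MaxSpotInstanceCountExceeded'],
    resultPath: '$.error',
  })
  task.next(success)
  spotFallback = task
}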

It builds the chain backwards - the last fallback (on-demand) is defined first, then each spot attempt catches failures and falls through to the next. The result is a linear chain that reads top-to-bottom in the Step Functions console but was built bottom-to-top in code.
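If the backwards construction feels unintuitive, a plain-data model of the same idea may help. This sketch leaves CDK out entirely: Step, buildChain and walk are hypothetical names, and the on-demand step reuses the first instance type and AZ because that is what the diagram above shows.

```typescript
interface Step {
  label: string
  // Where the "catch" edge points; null for the final fallback.
  onCapacityError: Step | null
}

/** Build the chain last-to-first, returning the first step to attempt. */
function buildChain(spotTypes: readonly string[], azs: readonly string[]): Step {
  // The terminal fallback is created first, like launchOnDemand in the loop.
  let fallback: Step = {
    label: `on-demand ${spotTypes[0]} ${azs[0]}`,
    onCapacityError: null,
  }
  for (let i = spotTypes.length - 1; i >= 0; i--) {
    fallback = {
      label: `spot ${spotTypes[i]} ${azs[i % azs.length]}`,
      onCapacityError: fallback,
    }
  }
  return fallback
}

/** Follow the catch edges from the head to list attempts in execution order. */
function walk(head: Step): string[] {
  const labels: string[] = []
  for (let s: Step | null = head; s !== null; s = s.onCapacityError) {
    labels.push(s.label)
  }
  return labels
}
```

Walking the result for our four spot types and three AZs yields the same top-to-bottom order as the diagram: each iteration prepends a step, so the last one built is the first one tried.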

The Consumer Lambda

The consumer Lambda is the one that actually calls the EC2 API. For each invocation it:

  1. Generates a GitHub App JWT and exchanges it for an organisation-level runner registration token
  2. Stores the registration token in SSM Parameter Store (not Secrets Manager - cheaper and we delete it within seconds)
  3. Calls RunInstances with the specified instance type, AZ, and purchase option
  4. Tags the instance with RunnerManagedBy: self-hosted-runner and the SSM parameter name so the instance can find its own token
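Steps 3 and 4 boil down to assembling one RunInstances request. The sketch below builds that request as a plain object so it stays runnable without the AWS SDK. The field names follow the EC2 RunInstances API, but the LaunchRequest type, the function name and the RunnerTokenParam tag key are assumptions for illustration.

```typescript
interface LaunchRequest {
  amiId: string
  instanceType: string
  az: string
  purchase: 'spot' | 'on-demand'
  tokenParamName: string // SSM parameter holding the registration token
}

interface Tag {
  Key: string
  Value: string
}

// Subset of the EC2 RunInstances request shape used in this sketch.
interface RunInstancesInput {
  ImageId: string
  InstanceType: string
  MinCount: number
  MaxCount: number
  Placement: { AvailabilityZone: string }
  InstanceMarketOptions?: { MarketType: 'spot' }
  TagSpecifications: { ResourceType: 'instance'; Tags: Tag[] }[]
}

function buildRunInstancesInput(req: LaunchRequest): RunInstancesInput {
  const input: RunInstancesInput = {
    ImageId: req.amiId,
    InstanceType: req.instanceType,
    MinCount: 1, // exactly one instance per queued job
    MaxCount: 1,
    // In a custom VPC you would typically pass the SubnetId for this AZ instead.
    Placement: { AvailabilityZone: req.az },
    TagSpecifications: [
      {
        ResourceType: 'instance',
        Tags: [
          { Key: 'RunnerManagedBy', Value: 'self-hosted-runner' },
          // Hypothetical tag key: lets the instance find its own token.
          { Key: 'RunnerTokenParam', Value: req.tokenParamName },
        ],
      },
    ],
  }
  // Omitting InstanceMarketOptions gives an on-demand launch.
  if (req.purchase === 'spot') input.InstanceMarketOptions = { MarketType: 'spot' }
  return input
}
```

The only difference between the spot and on-demand paths is the InstanceMarketOptions field, which is why a single consumer Lambda can serve both state machines.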

The instance boots with a UserData script that:

# Get instance ID from IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Fetch runner token from SSM (placed there by the Consumer Lambda)
RUNNER_TOKEN=$(aws ssm get-parameter \
  --name "$TOKEN_SECRET_NAME" \
  --with-decryption \
  --query Parameter.Value --output text)

# Register as ephemeral runner - one job, then exit
./config.sh --url "https://github.com/OurOrg" \
  --token "$RUNNER_TOKEN" \
  --labels "self-hosted,linux,x64,platform" \
  --name "runner-${INSTANCE_ID}" \
  --ephemeral

# Delete the token immediately
aws ssm delete-parameter --name "$TOKEN_SECRET_NAME"

# Disable auto-update (we control the version via AMI)
export RUNNER_CFG_DISABLE_AUTO_UPDATE=1
./run.sh

# Self-terminate after job completes
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

The --ephemeral flag is critical. It tells the runner agent to accept exactly one job and then exit. No lingering instances, no state leakage between jobs, no “why is my test seeing files from the previous run” debugging sessions.

The Cleanup Lambda

Even with ephemeral runners and self-termination, things go wrong. The runner process might crash. The EC2 instance might lose network connectivity. The self-terminate call might fail. So we run a cleanup Lambda every 15 minutes that scans for instances tagged RunnerManagedBy: self-hosted-runner that have been alive longer than 30 minutes and terminates them.
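The selection logic is small enough to sketch as a pure function. The names here (RunnerInstance, selectStale) are hypothetical. In a real Lambda the input would come from an EC2 DescribeInstances call filtered on the RunnerManagedBy tag, and the returned IDs would go to TerminateInstances.

```typescript
interface RunnerInstance {
  instanceId: string
  launchTime: Date
}

const MAX_AGE_MS = 30 * 60 * 1000 // 30 minutes, per the rule above

/** Return the IDs of instances that have been alive longer than maxAgeMs. */
function selectStale(
  instances: readonly RunnerInstance[],
  now: Date,
  maxAgeMs: number = MAX_AGE_MS,
): string[] {
  return instances
    .filter((i) => now.getTime() - i.launchTime.getTime() > maxAgeMs)
    .map((i) => i.instanceId)
}
```

With a 15-minute schedule and a 30-minute threshold, a stuck instance can live up to roughly 45 minutes in the worst case. The threshold also acts as a hard ceiling on job length, so it has to be at least as long as the longest job timeout.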

Belt and suspenders. In infrastructure, paranoia is a feature.

The VPC

The runner VPC is intentionally minimal: three public subnets across three AZs, no private subnets, no NAT gateway. The runners only need outbound HTTPS (port 443) to talk to GitHub and AWS APIs. No NAT gateway means no $30/month/AZ charge for a resource that’s only used during CI runs.

The security group allows outbound 443 and nothing else. No SSH, no inbound traffic. If you need to debug a runner, you use SSM Session Manager (the instance role includes the AmazonSSMManagedInstanceCore policy).

The ARM64 Detour

With the basic architecture working, I made what seemed like an obvious optimisation: switch to ARM64 Graviton instances. AWS prices Graviton about 20% cheaper than equivalent x86 instances, and ARM64 has better power efficiency. The GitHub runner agent supports ARM64. Our Lambda functions were already running on ARM64. Why not the runners too?

We switched the fleet over. The m7g.large and m6g.large instances booted fine. CI jobs ran. Cache keys got an architecture segment to prevent cross-architecture contamination:

key: pnpm-store-${{ runner.arch }}-${{ hashFiles('**/pnpm-lock.yaml') }}

This was important - a pnpm store cached from an x86 runner would cause native module failures on ARM64. The runner.arch segment kept the caches separate.

Everything looked great for about two weeks. Then Vitest started hanging.

Not failing. Not erroring. Just… stopping. Jobs would hit the 30-minute timeout with no output. We investigated for days. Different Vitest configurations, different test isolation modes, memory limits, CPU affinity. Nothing helped. The hangs were intermittent and unreproducible locally (our dev machines are x86).

We never found the root cause. Commit 2e1f217 in our git history tells the story in one line:

feat: switch runner fleet from ARM64 to x86 - vitest hangs on ARM64

Sometimes the right engineering decision is to retreat. We switched back to x86 (m7i.large, m6i.large, c7i.large, r7i.large) and the hangs disappeared. The 20% savings weren’t worth the reliability tax.

Lesson learned: cheaper is only cheaper if it works.

The AMI Pipeline

Even after the architecture was stable, we had a boot time problem. Fresh instances booted from a vanilla Amazon Linux 2023 AMI, and the UserData script had to:

  1. Download the GitHub Actions runner binary (~90MB)
  2. Install system dependencies via dnf (libicu, git, jq, and about 20 Playwright browser dependencies)
  3. Install AWS CLI v2
  4. Install Node.js and pnpm

That’s a lot of downloading and installing on every single job. Each runner took about 3 minutes from EC2 launch to “ready for work.” For a PR check that runs lint + typecheck + tests in parallel, that’s 3 minutes of idle waiting multiplied by every parallel job.

The fix was obvious: pre-bake everything into a custom AMI.

EC2 Image Builder

We built an AMI pipeline using EC2 Image Builder. Five components, executed in order:

┌─────────────────────────────────────────┐
│           AMI Build Pipeline            │
│         (Sunday 03:00 UTC)              │
├─────────────────────────────────────────┤
│ 1. update-linux (AWS managed)           │
│    └── dnf update, security patches     │
│                                         │
│ 2. install-deps                         │
│    └── libicu, git, jq, Playwright libs │
│                                         │
│ 3. install-aws-cli                      │
│    └── AWS CLI v2 from zip              │
│                                         │
│ 4. install-node                         │
│    └── Node.js 24 + pnpm via corepack   │
│                                         │
│ 5. install-runner                       │
│    └── GitHub Actions runner v2.333.1   │
│        extracted to /opt/actions-runner │
└─────────────────────────────────────────┘
                     │
                     ▼
    AMI: github-runner-2026-05-04
                     │
                     ▼
    SSM: /runners/ami/current = ami-0abc123...

The pipeline runs weekly on Sunday at 03:00 UTC. When the build succeeds, an EventBridge event triggers a success Lambda that writes the new AMI ID to an SSM parameter. The consumer Lambda reads this parameter when launching instances - so new instances automatically pick up the latest AMI without any code changes or deploys.

If the build fails? Nothing happens. The SSM parameter keeps its current value, and runners continue booting from the last known-good AMI. The failure shows up in CloudWatch, we investigate, and we fix it. No downtime, no manual intervention.
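The "last known-good" behaviour falls out of one rule: only move the pointer on success. A minimal sketch of that rule follows. BuildResult is a simplified, hypothetical shape rather than the actual EventBridge payload, and nextAmiPointer is a made-up name.

```typescript
// Simplified build outcome. A real handler would derive this from the
// Image Builder event and look up the resulting AMI ID.
type BuildResult =
  | { status: 'AVAILABLE'; amiId: string }
  | { status: 'FAILED' }

/** Compute the value the AMI pointer parameter should hold after a build. */
function nextAmiPointer(current: string, result: BuildResult): string {
  // A failed build never touches the pointer.
  if (result.status !== 'AVAILABLE') return current
  // Defensive: ignore a "successful" event that carries no usable AMI ID.
  if (!result.amiId.startsWith('ami-')) return current
  return result.amiId
}
```

Because failure is a no-op rather than a rollback, there is no state to repair after a bad build.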

The boot time improvement was dramatic:

Phase                         Before (vanilla AMI)   After (custom AMI)
EC2 launch to running         ~30s                   ~30s
UserData: install deps        ~60s                   0s (pre-baked)
UserData: download runner     ~30s                   0s (pre-baked)
UserData: install Node/pnpm   ~40s                   0s (pre-baked)
Runner registration           ~5s                    ~5s
Total boot-to-job             ~3 min                 ~35s

An 80% reduction. Every CI job, every PR, every deploy - about 2.5 minutes faster.
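For anyone who wants to check the arithmetic, here is the sum over the phases listed in the table. The itemised "before" phases add up to 165 seconds, which the table rounds to ~3 min. The gap is presumably steps the table does not itemise, such as the AWS CLI install, though that is a guess.

```typescript
// Approximate per-phase timings in seconds, copied from the table.
const beforeSeconds: Record<string, number> = {
  launch: 30,
  installDeps: 60,
  downloadRunner: 30,
  installNode: 40,
  register: 5,
}
const afterSeconds: Record<string, number> = { launch: 30, register: 5 }

const sum = (phases: Record<string, number>): number =>
  Object.values(phases).reduce((total, s) => total + s, 0)

const before = sum(beforeSeconds) // 165
const after = sum(afterSeconds) // 35
const savedPerJob = before - after // 130
const reduction = 1 - after / before // just under 0.79

/** Idle runner time avoided when several jobs boot in parallel. */
function idleSecondsSaved(parallelJobs: number): number {
  return parallelJobs * savedPerJob
}

console.log({ before, after, savedPerJob, reduction: reduction.toFixed(2) })
console.log('3 parallel jobs save', idleSecondsSaved(3), 'seconds of boot time')
```

Measured against the rounded 3-minute figure instead, the reduction is 1 - 35/180, or just over 80%.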

The Version Drift Trap

There’s a subtle footgun with pre-baked AMIs: version drift. The runner binary version baked into the AMI must match the version the UserData script expects. If GitHub releases a new runner version and you update the config but forget to rebuild the AMI (or vice versa), the runner registration fails with a cryptic error about version mismatch.

We solved this with a single source of truth:

// ami-config.ts
export const AMI_PIPELINE_CONFIG = {
  /** Must match RUNNER_VERSION in scripts/userdata.sh */
  runnerVersion: '2.333.1',
  schedule: 'cron(0 3 ? * SUN *)',
  // ...
}

The comment is load-bearing. When you update runnerVersion, you update it in one place, and the AMI pipeline and UserData script both read from it. But you still need to trigger a new AMI build and wait for it to complete before the new version is live. The weekly schedule handles routine updates; urgent version bumps require a manual pipeline trigger.
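A comment can only do so much, so a cheap guard is to have CI compare the two values. The sketch below assumes the UserData script carries a line of the form RUNNER_VERSION=x.y.z, as the comment in ami-config.ts suggests. The function names are hypothetical and this is not a check we describe having built.

```typescript
/** Pull the RUNNER_VERSION value out of a shell script, or null if absent. */
function extractRunnerVersion(userdata: string): string | null {
  const match =
    /^\s*(?:export\s+)?RUNNER_VERSION=["']?([0-9]+\.[0-9]+\.[0-9]+)["']?\s*$/m.exec(userdata)
  return match?.[1] ?? null
}

/** Throw if the config and the script disagree about the runner version. */
function assertVersionsMatch(configVersion: string, userdata: string): void {
  const scriptVersion = extractRunnerVersion(userdata)
  if (scriptVersion === null) {
    throw new Error('RUNNER_VERSION not found in userdata script')
  }
  if (scriptVersion !== configVersion) {
    throw new Error(
      `Runner version drift: config has ${configVersion}, userdata has ${scriptVersion}`,
    )
  }
}
```

Run as a unit test, this turns the drift from a cryptic registration failure at boot into a failed PR check. It does not cover the other half of the trap, an AMI that has not been rebuilt yet.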

Bootstrap Chicken-and-Egg

When you first deploy the AMI stack, there’s no AMI yet. The SSM parameter contains PENDING_FIRST_BUILD. If a runner tries to launch with that “AMI ID,” it fails spectacularly. The solution is unglamorous: after the first deploy, you manually trigger the Image Builder pipeline, wait 15-20 minutes for it to build, and then your runners are live.
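One way to make that failure less spectacular is to validate the parameter value before calling EC2. This is a sketch under assumptions rather than a description of our consumer Lambda: resolveAmiId is a hypothetical helper, and the pattern relies on AMI IDs being ami- followed by 8 or 17 hex characters.

```typescript
const PLACEHOLDER = 'PENDING_FIRST_BUILD'
const AMI_ID_PATTERN = /^ami-(?:[0-9a-f]{8}|[0-9a-f]{17})$/

/** Return a usable AMI ID or throw with a message that says what to do. */
function resolveAmiId(paramValue: string): string {
  const value = paramValue.trim()
  if (value === PLACEHOLDER) {
    throw new Error('No AMI built yet: trigger the Image Builder pipeline and wait for it to finish')
  }
  if (!AMI_ID_PATTERN.test(value)) {
    throw new Error(`AMI parameter does not look like an AMI ID: ${value}`)
  }
  return value
}
```

The job still cannot run before the first build, but the error now points at the actual cause instead of surfacing as an EC2 API failure.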

We could automate this with a custom resource that triggers the first build on stack creation. We haven’t. It’s a one-time operation per account, and the manual step serves as a forcing function to verify the pipeline works before trusting it with production CI.

The Numbers

After all three phases - basic architecture, x86 stabilisation, custom AMI - here’s where we landed:

  • Boot-to-job time: ~35 seconds (down from ~3 minutes on the vanilla AMI; GitHub-hosted runners need no boot on our side, but jobs still waited 30-60 seconds for a VM)
  • Cost per runner-minute: ~60% less than GitHub-hosted (spot pricing on m7i.large)
  • Spot success rate: ~95% (the remaining 5% fall through to on-demand)
  • Monthly runner cost: predictable and proportional to actual CI usage
  • Peak concurrent runners: 16 (during cross-repo deploy storms)

What We’d Do Differently

Start with x86. The ARM64 detour gave us two good weeks, then days of debugging intermittent hangs before we retreated. Unless your entire stack is ARM-native and thoroughly tested, start with the architecture you know works.

Build the AMI pipeline from day one. We ran on vanilla AMIs for weeks before building the custom pipeline. Every developer on the team experienced the 3-minute boot delay on every PR. The AMI pipeline took about a day to build and saved cumulative hours in the first week.

Use SSM Parameter Store, not Secrets Manager, for ephemeral tokens. We initially used Secrets Manager for runner registration tokens. At $0.40 per secret per month, that’s fine for long-lived secrets. For tokens that exist for 30 seconds before being deleted, Parameter Store (free for standard parameters) is the right choice.

Wrapping Up

Self-hosted runners are not a weekend project. The webhook, queue, orchestrator, consumer, cleanup, and AMI pipeline are six distinct components, each with its own failure modes and operational concerns. But for a team running serious CI workloads, the control and cost savings are worth it.

The key insight is treating runners as cattle, not pets. Every instance is ephemeral, stateless, and disposable. The AMI is rebuilt weekly. Runner auto-updates are disabled because we control the version through the AMI. If anything goes wrong, the cleanup Lambda terminates the instance and the next job gets a fresh one.

It’s not glamorous infrastructure. But it’s the kind of infrastructure that lets you merge a PR and have it running in staging 8 minutes later instead of 20. And that compounds into something that matters.