
Kraftwork: Building a Workflow Engine for AI-Assisted Development


In Part 2 of the AI coding tools series, I described how we settled on Claude Code as the backbone of our development workflow. Subagents, skills, hooks, and a well-structured CLAUDE.md got us a long way. But as the projects grew and I started juggling multiple tickets across multiple repos, something started to crack.

The AI was brilliant at the micro level. Give it a focused task with clear context and it delivered. But the macro level was all on me: coordinating work across tickets, keeping specs alive during implementation, splitting large features into reviewable PRs, and remembering what we learned last week. I was the workflow engine, and I was dropping things.

So I built one.


The Problem: AI Without Structure

Here’s what unstructured AI-assisted development looks like on a Monday morning:

  1. Open Claude Code in a repo
  2. Tell it to work on PROJ-1234
  3. It makes changes directly on your working branch
  4. Meanwhile, PROJ-5678 needs a hotfix
  5. You stash, switch branches, context is gone
  6. Come back to PROJ-1234, half the changes conflict with something that landed while you were away
  7. Finally done, you create a PR with 47 files changed. Reviewer opens it and closes their laptop.

I was doing this daily. The AI could write excellent code, but the surrounding workflow (isolation, planning, review sizing, knowledge retention) was manual and fragile. Every time I started a new session, Claude had no memory of what we’d learned last week. Every time I finished a feature, the PR was too big. Every time I switched between tickets, I paid a context-switching tax that the AI couldn’t help with.

The individual tools existed. Git worktrees for isolation. Markdown specs for planning. GitHub CLI for PRs. But nothing tied them together into a coherent flow.


Kraftwork: The Idea

Kraftwork is a monorepo of Claude Code plugins that orchestrate the full development lifecycle. Not a new tool to learn, just a structured layer on top of the tools we already use.

The core insight: instead of the developer being the workflow engine, make the AI the workflow engine. Give it nine commands that cover the complete cycle from ticket to merge:

Command                  What It Does
/kraft-config            Initialize workspace, discover and configure providers
/kraft-work TICKET-123   Create or resume a worktree for a ticket
/kraft-plan              Brainstorm, spec, decompose into merge-sized tasks
/kraft-implement         Execute tasks from the spec
/kraft-split             Split a branch into stacked PRs
/kraft-sync              Pull latest across all repos
/kraft-import            Import remote branches into local workspace
/kraft-archive           Clean up completed worktrees
/kraft-retro             Post-merge retrospective

Each command is a skill: a markdown file that Claude Code loads as a prompt. No compiled code in the core. No framework. Just very opinionated instructions for how to approach each phase.


Worktree Isolation: One Ticket, One World

The first problem I solved was context bleeding between tickets. The answer was git worktrees, but automated.

When you run /kraft-work PROJ-1234, Kraftwork:

  1. Checks if a worktree already exists for that ticket (resume mode)
  2. If not, creates a fresh worktree branched from main
  3. Sets up the spec directory at docs/specs/PROJ-1234/
  4. Drops you into the worktree ready to plan
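The four steps above can be sketched in a few lines. This is a minimal illustration, not Kraftwork's actual implementation (the real skills are markdown prompts plus shell scripts). The function names, the `slug` argument, and the choice to name the branch after the worktree directory are all assumptions made for the example; only `git worktree add -b <branch> <path> <base>` is standard git.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def worktree_dir(workspace: Path, ticket: str, slug: str) -> Path:
    """Directory for a ticket's worktree, e.g. trees/PROJ-1234-add-auth."""
    return workspace / "trees" / f"{ticket}-{slug}"


def find_existing(workspace: Path, ticket: str) -> Optional[Path]:
    """Resume mode: return an existing worktree for this ticket, if any."""
    trees = workspace / "trees"
    if not trees.is_dir():
        return None
    matches = sorted(
        p for p in trees.iterdir()
        if p.is_dir() and p.name.startswith(f"{ticket}-")
    )
    return matches[0] if matches else None


def add_worktree_command(repo: Path, target: Path, branch: str,
                         base: str = "main") -> List[str]:
    """Build the git call that creates a worktree on a new branch off base."""
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", branch, str(target), base]


def open_ticket(workspace: Path, repo: Path, ticket: str, slug: str) -> Path:
    """Resume or create the worktree, then ensure the spec directory exists."""
    workspace = workspace.resolve()  # absolute paths, since git runs with -C
    target = find_existing(workspace, ticket)
    if target is None:
        target = worktree_dir(workspace, ticket, slug)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: the branch is named after the worktree directory.
        subprocess.run(
            add_worktree_command(repo, target, target.name), check=True
        )
    (workspace / "docs" / "specs" / ticket).mkdir(parents=True, exist_ok=True)
    return target
```

Calling `open_ticket(Path("workspace"), Path("workspace/modules/app"), "PROJ-1234", "add-auth")` would either return the existing `trees/PROJ-1234-add-auth` directory or create it; `modules/app` is a placeholder repo name.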

The workspace layout:

workspace/
├── modules/              Source repos (read-only reference)
├── trees/
│   ├── PROJ-1234-add-auth/       Worktree per ticket
│   ├── PROJ-5678-fix-validation/
│   └── PROJ-3782-MR1-persistence/
│       └── PROJ-3782-MR2-controller/   Stacked worktree
└── docs/
    └── specs/
        ├── PROJ-1234/
        │   ├── idea.md
        │   ├── spec.md
        │   └── tasks.md
        └── PROJ-5678/

The beauty of worktrees is that they share the same .git directory. No cloning, no disk waste, instant creation. And because each ticket lives in its own directory tree, you can switch between them without stashing, without conflict, without losing context.

Stacked work was the interesting extension. Sometimes PROJ-3782 is too big for one PR. /kraft-work has a stack mode that creates a child worktree branched from the current one, with metadata tracking the parent-child relationship. When it’s time to submit, /kraft-split creates the PRs in dependency order.
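Dependency order here is a topological sort over the parent-child metadata. The sketch below assumes that metadata can be reduced to a simple child-to-parent mapping; the real storage format is not described in this post, so the `parents` dictionary and both function names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional

# Hypothetical stack metadata: worktree -> parent (None = branched from main).
parents: Dict[str, Optional[str]] = {
    "PROJ-3782-MR2-controller": "PROJ-3782-MR1-persistence",
    "PROJ-3782-MR1-persistence": None,
}


def submission_order(parents: Dict[str, Optional[str]]) -> List[str]:
    """Return worktrees so that every parent comes before its children."""
    graph = {
        child: ({parent} if parent else set())
        for child, parent in parents.items()
    }
    return list(TopologicalSorter(graph).static_order())


def pr_base(branch: str, parents: Dict[str, Optional[str]],
            trunk: str = "main") -> str:
    """A stacked PR targets its parent's branch; the root targets the trunk."""
    return parents.get(branch) or trunk
```

With this ordering, the PR for MR1 is opened against `main` first, and the PR for MR2 is opened against MR1's branch, so each reviewer sees only the diff for one layer of the stack.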


Spec-Driven Planning

This is where Kraftwork gets opinionated, and where I think it adds the most value.

When you run /kraft-plan, the AI doesn’t start coding. It starts thinking. The process has three phases:

1. Brainstorm (idea.md): What are we building? Why? What are the constraints? What approaches could work? This is a conversation, not a document. The AI asks questions, proposes options, and captures the discussion.

2. Spec (spec.md): The agreed design, written up properly. Architecture, components, data flow, edge cases. This becomes the source of truth for implementation.

3. Tasks (tasks.md): The spec decomposed into merge-request-sized chunks. Each task should produce a reviewable, independently mergeable PR.

The key rule: once tasks.md exists, the spec is frozen. If something needs to change during implementation (and it always does), the change becomes a numbered change record (changes/001-revised-auth-flow.md) with an impact classification: additive, blocking, or replacing. This forces you to think about whether a mid-implementation change is really necessary or whether you’re just scope-creeping.
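As a rough sketch of the freeze rule and the numbered change records: the file naming (`changes/001-revised-auth-flow.md`) and the three impact values come from the description above, while the function names, the record's text layout, and the decision to reject change records before `tasks.md` exists are assumptions for illustration.

```python
import re
from pathlib import Path

IMPACTS = ("additive", "blocking", "replacing")


def is_frozen(spec_dir: Path) -> bool:
    """The spec counts as frozen once tasks.md exists."""
    return (spec_dir / "tasks.md").exists()


def next_number(changes_dir: Path) -> int:
    """Next sequence number, based on existing NNN-*.md files."""
    numbers = [0]
    if changes_dir.is_dir():
        for path in changes_dir.glob("*.md"):
            match = re.match(r"(\d{3})-", path.name)
            if match:
                numbers.append(int(match.group(1)))
    return max(numbers) + 1


def record_change(spec_dir: Path, title: str, impact: str, body: str) -> Path:
    """Write a numbered change record instead of editing the frozen spec."""
    if impact not in IMPACTS:
        raise ValueError(f"impact must be one of {IMPACTS}, got {impact!r}")
    if not is_frozen(spec_dir):
        # Assumption: before tasks.md exists, spec.md is edited directly.
        raise RuntimeError("no tasks.md yet: edit spec.md directly instead")
    changes_dir = spec_dir / "changes"
    changes_dir.mkdir(exist_ok=True)
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    path = changes_dir / f"{next_number(changes_dir):03d}-{slug}.md"
    # Hypothetical record layout: title, impact line, free-form body.
    path.write_text(f"{title}\n\nImpact: {impact}\n\n{body}\n", encoding="utf-8")
    return path
```

Forcing the caller to pick one of the three impact values is the point of the exercise: a change that cannot be classified is usually a change that has not been thought through.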

This sounds heavy. In practice, for a small feature, the brainstorm is two messages and the spec is half a page. The structure scales to the size of the work.

What surprised me was how much better the implementation phase went with a spec in place. Claude Code with a spec is a different animal to Claude Code with a vague prompt. It knows what it’s building, it knows what’s out of scope, and it can check its own work against the requirements. The spec pays for itself in the first hour.


Provider Interfaces: The Part I’m Most Proud Of

Here’s the problem I didn’t anticipate. At the startup, we use GitHub and ClickUp. At Aircall, I use GitLab and Jira. The workflow is identical. The tools are different. I didn’t want to maintain two copies of Kraftwork with different API calls hardcoded into every skill.

The solution was a provider interface system. Six categories of capability:

Category             What It Abstracts
git-hosting          Branches, PRs/MRs, repo operations
ci                   Pipeline status, job logs
ticket-management    Search, create, update tickets
document-storage     Read/write project docs
memory               Knowledge storage and recall
messaging            Notifications, chat

Each provider is a separate Claude Code plugin that implements one or more categories. kraftwork-github implements git-hosting. kraftwork-gitlab implements both git-hosting and ci. kraftwork-clickup implements ticket-management and document-storage.

When a core skill needs to, say, create a pull request, it doesn’t call GitHub. It looks up which provider is configured for git-hosting in workspace.json, constructs the fully qualified skill name (kraftwork-github:git-hosting-create), and invokes it. The core skill doesn’t know or care which provider is behind the interface.

Core skill needs "create PR"
    → reads workspace.json → providers["git-hosting"] = "kraftwork-github"
    → invokes kraftwork-github:git-hosting-create
    → GitHub PR created

Swapping from GitHub to GitLab means changing one line in workspace.json. Every core skill works identically.

The fallback design was the second insight. If no provider is configured for a category, Kraftwork falls back to local implementations. No ClickUp? Specs are stored as markdown files. No Jira? Ticket context is captured manually. The workflow degrades gracefully instead of breaking.
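The lookup and the fallback together amount to a very small resolver. The sketch below assumes workspace.json has a top-level `providers` object keyed by category, as the flow above suggests. The `local` fallback name and the `search` action are placeholders, since the post does not say what the built-in implementations or the other actions are called.

```python
import json

# Hypothetical name for the built-in implementations used when no provider
# is configured for a category.
LOCAL_FALLBACK = "local"


def resolve_skill(workspace: dict, category: str, action: str) -> str:
    """Build the fully qualified skill name for a category and action."""
    provider = workspace.get("providers", {}).get(category)
    if provider is None:
        provider = LOCAL_FALLBACK  # degrade gracefully instead of failing
    return f"{provider}:{category}-{action}"


# Example workspace.json content with only git-hosting configured.
workspace = json.loads('{"providers": {"git-hosting": "kraftwork-github"}}')
create_pr_skill = resolve_skill(workspace, "git-hosting", "create")
ticket_skill = resolve_skill(workspace, "ticket-management", "search")
```

Here `create_pr_skill` is `kraftwork-github:git-hosting-create`, while `ticket_skill` resolves to the local fallback because no ticket provider is configured. Core skills only ever see the resolved name, which is what keeps them provider-agnostic.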

I built a template plugin (kraftwork-template) that you can copy and rename to build a new provider. It includes a CHECKLIST.md walking through what to implement. Adding a new provider is a few hours of work, not a rewrite.


Local Intelligence: Memory That Persists

The most experimental part of Kraftwork is kraftwork-intel, the memory provider. It solves the problem that every Claude Code session starts from zero.

It has three layers:

SQLite metrics: Session statistics, skill usage counts, interaction patterns. Lightweight structured data. Useful for understanding how the workflow is actually being used.

LanceDB knowledge base: This is the interesting one. A vector database storing codebase learnings, debugging insights, architecture decisions, anything worth remembering. When a new session starts, relevant knowledge is retrieved by semantic similarity. “We’re working on the auth module” surfaces past learnings about auth gotchas, even if the exact words don’t match.

Embeddings are computed locally using all-MiniLM-L6-v2 via Hugging Face Transformers. No API calls, no cloud dependency, no data leaving the machine. This was a deliberate choice: developer knowledge is sensitive and shouldn’t live on someone else’s server.

Skill evaluations: Quality scoring for completed tasks using both heuristic checks (did the PR get approved without changes?) and optional LLM-based rubrics (via local Ollama with llama3.2:3b). The scores feed back into the knowledge base. Over time, Kraftwork gets a signal on which approaches work well and which don’t.
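The semantic recall in the knowledge base layer boils down to ranking stored notes by vector similarity to the current context. The sketch below shows that idea with plain Python lists and cosine similarity; it does not use LanceDB or a real embedding model. The three-dimensional vectors and the note texts are invented for the example (all-MiniLM-L6-v2 produces 384-dimensional embeddings).

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vec: Sequence[float],
           notes: List[Tuple[str, Sequence[float]]],
           top_k: int = 2) -> List[str]:
    """Return the top_k note texts most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in notes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy knowledge base: (learning, embedding). In practice the embedding would
# come from the local model rather than being written by hand.
notes = [
    ("Token refresh races with logout", [0.9, 0.1, 0.0]),
    ("CI cache key must include lockfile hash", [0.0, 0.2, 0.9]),
    ("Session cookies need SameSite=Lax", [0.8, 0.3, 0.1]),
]
auth_query = [1.0, 0.2, 0.0]  # stand-in for "working on the auth module"
relevant = recall(auth_query, notes)
```

The two auth-related notes rank above the CI note even though neither contains the word "auth", which is the property that makes similarity search more useful here than keyword matching.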

I’ll be honest: the intelligence layer is the least mature part. The metrics work well. The knowledge base is useful but needs curation (it captures too much noise). The skill evaluations are promising but I haven’t used them enough to know if the signal is real. It’s the part I’m most excited to iterate on.


What We Didn’t Expect

Writing prompts is harder than writing code

Each Kraftwork skill is a markdown prompt. The quality of the entire system depends on the quality of those prompts. I rewrote /kraft-plan four times before it consistently produced specs at the right level of detail. Too vague and the implementation phase falls apart. Too detailed and the planning phase takes longer than just writing the code.

The breakthrough was separating concerns: the brainstorm prompt encourages exploration, the spec prompt enforces structure, and the task decomposition prompt thinks in terms of PRs. Trying to do all three in one prompt produced mediocre results at every stage.

The provider interface was worth 10x the effort

I built the provider system because I needed GitHub and GitLab support. What I got was a clean separation between “what the workflow needs” and “how it’s done.” This made every core skill simpler to write, simpler to test, and simpler to reason about. The indirection felt like over-engineering until the third provider, when it started feeling like the only sane approach.

Worktrees changed how I think about work

Before Kraftwork, I’d work on one thing at a time because context-switching was expensive. With worktrees, switching is free. I now routinely have three or four tickets in flight, each in its own isolated world. When I’m blocked on a review for one, I /kraft-work into another. The mental model shifted from “I’m working on feature X” to “I have a workspace with multiple active threads.”

Specs survive contact with reality (mostly)

I expected specs to become stale the moment implementation started. The change record system helps, but the real reason specs survive is simpler: the AI reads them before every implementation step. A spec that’s referenced constantly stays relevant. A spec that’s written and forgotten dies. Kraftwork forces the former.


The Stack

  • Core: Markdown prompts (Claude Code skills), Bash scripts
  • Intelligence: Bun, SQLite, LanceDB, all-MiniLM-L6-v2, optionally Ollama
  • Providers: One plugin per vendor (GitHub, GitLab, Jira, ClickUp, Slack)
  • Dependencies: git, jq, and the relevant vendor CLI (gh, glab, etc.)

The core has no package dependencies beyond the command-line tools listed above. No npm install. No build step. Just markdown and shell scripts that Claude Code interprets. The intelligence layer is the only part with a real runtime (Bun + native dependencies for the vector database).


Is This Over-Engineered?

Probably. A solo developer building a plugin system with provider interfaces and a vector database for an AI coding assistant is, objectively, a lot.

But here’s the thing: I use this every day. At the startup and at Aircall. The worktree isolation alone saves me an hour a week in context-switching. The specs have caught scope creep that would have cost days. The provider system means I maintain one workflow across two completely different toolchains.

And the intelligence layer, even in its rough state, has already surfaced a past learning that saved me from repeating a mistake. Once. That one time probably paid for the entire effort.

The real question isn’t whether this is over-engineered. It’s whether the workflow patterns here should be built into the AI tools themselves. Spec-driven planning, worktree isolation, persistent memory: these aren’t niche needs. Every developer using AI assistance at scale runs into these problems.

Until the tools catch up, there’s Kraftwork.


Kraftwork is open source at github.com/filipeestacio/kraftwork. It’s a Claude Code plugin system, so you’ll need Claude Code to use it. If you’re building something similar for a different AI tool, the architecture patterns (provider interfaces, spec-driven planning, worktree isolation) are tool-agnostic.