Kraftwork: Building a Workflow Engine for AI-Assisted Development

2026-05-06 · 11 min read

In Part 2 of the AI coding tools series, I described how we settled on Claude Code as the backbone of our development workflow. Subagents, skills, hooks, and a well-structured CLAUDE.md got us a long way. But as the projects grew and I started juggling multiple tickets across multiple repos, something started to crack.

The AI was brilliant at the micro level. Give it a focused task with clear context and it delivered. But the macro level was all on me: coordinating work across tickets, keeping specs alive during implementation, splitting large features into reviewable PRs, and remembering what we learned last week. I was the workflow engine, and I was dropping things.

So I built one.

The Problem: AI Without Structure

Here’s what unstructured AI-assisted development looks like on a Monday morning:

1. Open Claude Code in a repo
2. Tell it to work on PROJ-1234
3. It makes changes directly on your working branch
4. Meanwhile, PROJ-5678 needs a hotfix
5. You stash, switch branches, context is gone
6. Come back to PROJ-1234, half the changes conflict with something that landed while you were away
7. Finally done, you create a PR with 47 files changed. Reviewer opens it and closes their laptop.

I was doing this daily. The AI could write excellent code, but the surrounding workflow (isolation, planning, review sizing, knowledge retention) was manual and fragile. Every time I started a new session, Claude had no memory of what we’d learned last week. Every time I finished a feature, the PR was too big. Every time I switched between tickets, I paid a context-switching tax that the AI couldn’t help with.

The individual tools existed. Git worktrees for isolation. Markdown specs for planning. GitHub CLI for PRs. But nothing tied them together into a coherent flow.

Kraftwork: The Idea

Kraftwork is a monorepo of Claude Code plugins that orchestrate the full development lifecycle. Not a new tool to learn, just a structured layer on top of the tools we already use.

The core insight: instead of the developer being the workflow engine, make the AI the workflow engine. Give it nine commands that cover the complete cycle from ticket to merge:

Command	What It Does
`/kraft-config`	Initialize workspace, discover and configure providers
`/kraft-work TICKET-123`	Create or resume a worktree for a ticket
`/kraft-plan`	Brainstorm, spec, decompose into merge-sized tasks
`/kraft-implement`	Execute tasks from the spec
`/kraft-split`	Split a branch into stacked PRs
`/kraft-sync`	Pull latest across all repos
`/kraft-import`	Import remote branches into local workspace
`/kraft-archive`	Clean up completed worktrees
`/kraft-retro`	Post-merge retrospective

Each skill is a markdown file that Claude Code loads as a prompt. No compiled code in the core. No framework. Just very opinionated instructions for how to approach each phase.

Worktree Isolation: One Ticket, One World

The first problem I solved was context bleeding between tickets. The answer was git worktrees, but automated.

When you run /kraft-work PROJ-1234, Kraftwork:

1. Checks if a worktree already exists for that ticket (resume mode)
2. If not, creates a fresh worktree branched from main
3. Sets up the spec directory at docs/specs/PROJ-1234/
4. Drops you into the worktree ready to plan
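
The resume-or-create decision above can be sketched in a few lines. This is a minimal illustration, not Kraftwork's actual implementation (the real skills are markdown prompts): the function name `plan_kraft_work`, the `slug` argument, and the returned dictionary shape are all hypothetical. It only builds the `git worktree add` command rather than running it, and it assumes that command would be executed from inside the relevant source repo.

```python
from pathlib import Path


def plan_kraft_work(workspace: Path, ticket: str, slug: str) -> dict:
    """Decide between resume and create for a ticket without touching git.

    Hypothetical helper: names and return shape are illustrative only.
    """
    trees = workspace / "trees"
    spec_dir = workspace / "docs" / "specs" / ticket

    # Resume mode: a worktree directory named "<ticket>-..." already exists.
    existing = sorted(p for p in trees.glob(f"{ticket}-*") if p.is_dir())
    if existing:
        return {
            "mode": "resume",
            "worktree": existing[0],
            "spec_dir": spec_dir,
            "git_command": None,
        }

    # Create mode: new branch and new directory, both branched from main.
    name = f"{ticket}-{slug}"
    worktree = trees / name
    git_command = ["git", "worktree", "add", "-b", name, str(worktree), "main"]
    return {
        "mode": "create",
        "worktree": worktree,
        "spec_dir": spec_dir,
        "git_command": git_command,
    }
```

Keeping the decision separate from the git call makes the resume logic easy to check on its own; a caller would run `git_command` and create `spec_dir` afterwards.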

The workspace layout:

workspace/
├── modules/              Source repos (read-only reference)
├── trees/
│   ├── PROJ-1234-add-auth/       Worktree per ticket
│   ├── PROJ-5678-fix-validation/
│   └── PROJ-3782-MR1-persistence/
│       └── PROJ-3782-MR2-controller/   Stacked worktree
└── docs/
    └── specs/
        ├── PROJ-1234/
        │   ├── idea.md
        │   ├── spec.md
        │   └── tasks.md
        └── PROJ-5678/

The beauty of worktrees is that they all share one repository: history and objects live in a single .git directory. No cloning, no duplicated history, near-instant creation. And because each ticket lives in its own directory tree, you can switch between them without stashing, without conflict, without losing context.

Stacked work was the interesting extension. Sometimes PROJ-3782 is too big for one PR. /kraft-work has a stack mode that creates a child worktree branched from the current one, with metadata tracking the parent-child relationship. When it’s time to submit, /kraft-split creates the PRs in dependency order.
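
To make "dependency order" concrete, here is a small sketch of how parent-child metadata could be turned into a submission order. The metadata format (a mapping from each branch to its parent, with `None` for the base of a stack) and the function name `submission_order` are assumptions for illustration; the article does not describe how Kraftwork stores this.

```python
from typing import Optional


def submission_order(parents: dict[str, Optional[str]]) -> list[str]:
    """Return branches so that every parent comes before its children.

    `parents` maps branch -> parent branch (None for the bottom of a stack).
    Hypothetical metadata shape, used only for this sketch.
    """
    ordered: list[str] = []
    placed: set[str] = set()

    def place(branch: str, trail: tuple[str, ...] = ()) -> None:
        if branch in placed:
            return
        if branch in trail:
            raise ValueError(f"cycle in stack metadata at {branch}")
        parent = parents.get(branch)
        # Only recurse into parents that are themselves part of the stack.
        if parent is not None and parent in parents:
            place(parent, trail + (branch,))
        placed.add(branch)
        ordered.append(branch)

    for branch in parents:
        place(branch)
    return ordered
```

With the stacked worktrees from the layout above, `submission_order({"PROJ-3782-MR2-controller": "PROJ-3782-MR1-persistence", "PROJ-3782-MR1-persistence": None})` puts the persistence branch first, so its PR exists before the controller PR that targets it.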

Spec-Driven Planning

This is where Kraftwork gets opinionated, and where I think it adds the most value.

When you run /kraft-plan, the AI doesn’t start coding. It starts thinking. The process has three phases:

1. Brainstorm (idea.md): What are we building? Why? What are the constraints? What approaches could work? This is a conversation, not a document. The AI asks questions, proposes options, and captures the discussion.

2. Spec (spec.md): The agreed design, written up properly. Architecture, components, data flow, edge cases. This becomes the source of truth for implementation.

3. Tasks (tasks.md): The spec decomposed into merge-request-sized chunks. Each task should produce a reviewable, independently mergeable PR.

The key rule: once tasks.md exists, the spec is frozen. If something needs to change during implementation (and it always does), the change becomes a numbered change record (changes/001-revised-auth-flow.md) with an impact classification: additive (it adds to the plan), blocking (it blocks progress), or replacing (it replaces something already decided). This forces you to think about whether a mid-implementation change is really necessary or whether you’re just scope-creeping.
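
As an illustration of the numbering scheme, the sketch below picks the next change record filename and validates the impact label. Only the `changes/001-revised-auth-flow.md` naming pattern and the three impact values come from the description above; the function name, the slug rule, and the `Impact:` header line are hypothetical.

```python
import re
from pathlib import Path

IMPACTS = {"additive", "blocking", "replacing"}


def next_change_record(changes_dir: Path, title: str, impact: str) -> tuple[Path, str]:
    """Return the path and a one-line header for the next change record.

    Hypothetical helper: the header format is an assumption, not Kraftwork's.
    """
    if impact not in IMPACTS:
        raise ValueError(f"impact must be one of {sorted(IMPACTS)}")

    # Collect the numeric prefixes of existing records, e.g. "001-..." -> 1.
    numbers = []
    if changes_dir.is_dir():
        for path in changes_dir.glob("*.md"):
            match = re.match(r"(\d{3})-", path.name)
            if match:
                numbers.append(int(match.group(1)))

    next_number = max(numbers, default=0) + 1
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return changes_dir / f"{next_number:03d}-{slug}.md", f"Impact: {impact}\n"
```

Rejecting an unknown impact label up front is the point of the exercise: the record cannot be created without deciding which of the three kinds of change it is.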

This sounds heavy. In practice, for a small feature, the brainstorm is two messages and the spec is half a page. The structure scales to the size of the work.

What surprised me was how much better the implementation phase went with a spec in place. Claude Code with a spec is a different animal to Claude Code with a vague prompt. It knows what it’s building, it knows what’s out of scope, and it can check its own work against the requirements. The spec pays for itself in the first hour.

Provider Interfaces: The Part I’m Most Proud Of

Here’s the problem I didn’t anticipate. At the startup, we use GitHub and ClickUp. At Aircall, I use GitLab and Jira. The workflow is identical. The tools are different. I didn’t want to maintain two copies of Kraftwork with different API calls hardcoded into every skill.

The solution was a provider interface system. Six categories of capability:

Category	What It Abstracts
`git-hosting`	Branches, PRs/MRs, repo operations
`ci`	Pipeline status, job logs
`ticket-management`	Search, create, update tickets
`document-storage`	Read/write project docs
`memory`	Knowledge storage and recall
`messaging`	Notifications, chat

Each provider is a separate Claude Code plugin that implements one or more categories. kraftwork-github implements git-hosting. kraftwork-gitlab implements both git-hosting and ci. kraftwork-clickup implements ticket-management and document-storage.

When a core skill needs to, say, create a pull request, it doesn’t call GitHub. It looks up which provider is configured for git-hosting in workspace.json, constructs the fully qualified skill name (kraftwork-github:git-hosting-create), and invokes it. The core skill doesn’t know or care which provider is behind the interface.

Core skill needs "create PR"
    → reads workspace.json → providers["git-hosting"] = "kraftwork-github"
    → invokes kraftwork-github:git-hosting-create
    → GitHub PR created

Swapping from GitHub to GitLab means changing one line in workspace.json. Every core skill works identically.

The fallback design was the second insight. If no provider is configured for a category, Kraftwork falls back to local implementations. No ClickUp? Specs are stored as markdown files. No Jira? Ticket context is captured manually. The workflow degrades gracefully instead of breaking.
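
The lookup and the fallback can be sketched together. The `providers` key and the `provider:category-action` naming follow the example above (`kraftwork-github:git-hosting-create`); the function name `resolve_skill` and the `local:` prefix for fallbacks are assumptions made for this illustration, since the article does not say how local implementations are named.

```python
import json


def resolve_skill(workspace_json: str, category: str, action: str) -> str:
    """Build the fully qualified skill name for a category and action.

    Falls back to a hypothetical "local:" skill when no provider is configured.
    """
    config = json.loads(workspace_json)
    provider = config.get("providers", {}).get(category)
    if provider is None:
        # No provider configured: degrade to a local implementation.
        return f"local:{category}-{action}"
    return f"{provider}:{category}-{action}"


workspace = json.dumps({"providers": {"git-hosting": "kraftwork-github"}})
print(resolve_skill(workspace, "git-hosting", "create"))
print(resolve_skill(workspace, "ticket-management", "search"))
```

The core skill only ever names a category and an action; which vendor answers is decided entirely by the configuration, which is why swapping providers is a one-line change.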

I built a template plugin (kraftwork-template) that you can copy and rename to build a new provider. It includes a CHECKLIST.md walking through what to implement. Adding a new provider is a few hours of work, not a rewrite.

Local Intelligence: Memory That Persists

The most experimental part of Kraftwork is kraftwork-intel, the memory provider. It solves the problem that every Claude Code session starts from zero.

It has three layers:

SQLite metrics: Session statistics, skill usage counts, interaction patterns. Lightweight structured data. Useful for understanding how the workflow is actually being used.
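
For a sense of how little is needed for this layer, here is a sketch of a skill usage counter on SQLite. It is written in Python for illustration (kraftwork-intel itself runs on Bun), and the table name, columns, and function names are all assumptions rather than the project's real schema.

```python
import sqlite3


def open_metrics(path: str = ":memory:") -> sqlite3.Connection:
    """Open the metrics database and create the (hypothetical) table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS skill_runs ("
        " skill TEXT NOT NULL,"
        " session_id TEXT NOT NULL,"
        " ran_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def record_run(db: sqlite3.Connection, skill: str, session_id: str) -> None:
    """Append one row per skill invocation."""
    db.execute(
        "INSERT INTO skill_runs (skill, session_id) VALUES (?, ?)",
        (skill, session_id),
    )
    db.commit()


def usage_counts(db: sqlite3.Connection) -> dict[str, int]:
    """Return how many times each skill has been run."""
    rows = db.execute(
        "SELECT skill, COUNT(*) FROM skill_runs GROUP BY skill"
    ).fetchall()
    return dict(rows)
```

An append-only log of runs keeps writes trivial and leaves every aggregate (per skill, per session, per week) to a query.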

LanceDB knowledge base: This is the interesting one. A vector database storing codebase learnings, debugging insights, architecture decisions, anything worth remembering. When a new session starts, relevant knowledge is retrieved by semantic similarity. “We’re working on the auth module” surfaces past learnings about auth gotchas, even if the exact words don’t match.

Embeddings are computed locally using all-MiniLM-L6-v2 via Hugging Face Transformers. No API calls, no cloud dependency, no data leaving the machine. This was a deliberate choice: developer knowledge is sensitive and shouldn’t live on someone else’s server.
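
Retrieval by semantic similarity comes down to ranking stored vectors by how close they are to the query vector. The sketch below shows that ranking with cosine similarity in plain Python. It uses tiny hand-written three-dimensional vectors as stand-ins for real embeddings (all-MiniLM-L6-v2 produces 384-dimensional ones) and a list in place of LanceDB; the note texts and the `recall` function are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vec: list[float], notes: list[dict], top_k: int = 2) -> list[str]:
    """Return the texts of the top_k notes most similar to the query."""
    ranked = sorted(notes, key=lambda n: cosine(query_vec, n["vector"]), reverse=True)
    return [n["text"] for n in ranked[:top_k]]


# Toy vectors standing in for embeddings; the values are made up.
notes = [
    {"text": "Token refresh races with logout", "vector": [0.9, 0.1, 0.0]},
    {"text": "CI cache key must include lockfile", "vector": [0.0, 0.2, 0.9]},
    {"text": "Login redirect drops query params", "vector": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.2, 0.0]  # stand-in for "We're working on the auth module"
print(recall(query, notes))
```

Because the ranking depends on vector direction rather than shared words, the two auth-related notes come back ahead of the CI note even though none of them contains the word "auth".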

Skill evaluations: Quality scoring for completed tasks using both heuristic checks (did the PR get approved without changes?) and optional LLM-based rubrics (via local Ollama with llama3.2:3b). The scores feed back into the knowledge base. Over time, Kraftwork gets a signal on which approaches work well and which don’t.

I’ll be honest: the intelligence layer is the least mature part. The metrics work well. The knowledge base is useful but needs curation (it captures too much noise). The skill evaluations are promising but I haven’t used them enough to know if the signal is real. It’s the part I’m most excited to iterate on.

What We Didn’t Expect

Writing prompts is harder than writing code

Each Kraftwork skill is a markdown prompt. The quality of the entire system depends on the quality of those prompts. I rewrote /kraft-plan four times before it consistently produced specs at the right level of detail. Too vague and the implementation phase falls apart. Too detailed and the planning phase takes longer than just writing the code.

The breakthrough was separating concerns: the brainstorm prompt encourages exploration, the spec prompt enforces structure, and the task decomposition prompt thinks in terms of PRs. Trying to do all three in one prompt produced mediocre results at every stage.

The provider interface was worth 10x the effort

I built the provider system because I needed GitHub and GitLab support. What I got was a clean separation between “what the workflow needs” and “how it’s done.” This made every core skill simpler to write, simpler to test, and simpler to reason about. The indirection felt like over-engineering until the third provider, when it started feeling like the only sane approach.

Worktrees changed how I think about work

Before Kraftwork, I’d work on one thing at a time because context-switching was expensive. With worktrees, switching is free. I now routinely have three or four tickets in flight, each in its own isolated world. When I’m blocked on a review for one, I /kraft-work into another. The mental model shifted from “I’m working on feature X” to “I have a workspace with multiple active threads.”

Specs survive contact with reality (mostly)

I expected specs to become stale the moment implementation started. The change record system helps, but the real reason specs survive is simpler: the AI reads them before every implementation step. A spec that’s referenced constantly stays relevant. A spec that’s written and forgotten dies. Kraftwork forces the former.

The Stack

Core: Markdown prompts (Claude Code skills), Bash scripts
Intelligence: Bun, SQLite, LanceDB, all-MiniLM-L6-v2, optionally Ollama
Providers: One plugin per vendor (GitHub, GitLab, Jira, ClickUp, Slack)
Dependencies: git, jq, and the relevant vendor CLI (gh, glab, etc.)

The core needs no package manager: no npm install, no build step, just markdown and shell scripts that Claude Code interprets, plus the command-line tools listed above. The intelligence layer is the only part with a real runtime (Bun + native dependencies for the vector database).

Is This Over-Engineered?

Probably. A solo developer building a plugin system with provider interfaces and a vector database for an AI coding assistant is, objectively, a lot.

But here’s the thing: I use this every day. At the startup and at Aircall. The worktree isolation alone saves me an hour a week in context-switching. The specs have caught scope creep that would have cost days. The provider system means I maintain one workflow across two completely different toolchains.

And the intelligence layer, even in its rough state, has already surfaced a past learning that saved me from repeating a mistake. Once. That one time probably paid for the entire effort.

The real question isn’t whether this is over-engineered. It’s whether the workflow patterns here should be built into the AI tools themselves. Spec-driven planning, worktree isolation, persistent memory: these aren’t niche needs. Every developer using AI assistance at scale runs into these problems.

Until the tools catch up, there’s Kraftwork.

Kraftwork is open source at github.com/filipeestacio/kraftwork. It’s a Claude Code plugin system, so you’ll need Claude Code to use it. If you’re building something similar for a different AI tool, the architecture patterns (provider interfaces, spec-driven planning, worktree isolation) are tool-agnostic.

Kraftwork: Building a Proper Workflow for AI-Assisted Development

2026-05-06 · 9 min read

In Part 2 of the AI coding tools series, I described how we settled on Claude Code as our main development tool. It works remarkably well. But as projects grew and I started juggling multiple pieces of work across multiple repositories, cracks started showing.

The AI was great at the detail level. Give it a focused task with clear context and it delivered. But at the coordination level — tracking what we’re building, keeping the plan alive when reality pushes back, splitting large features into reviewable chunks, remembering what we learned last month — all of that was still on me. I was acting as the workflow engine, and I was dropping things.

So I built one.

The Problem

Here’s what a typical Monday morning looked like before I fixed this:

Open the AI assistant on a working branch. Start on a ticket. Halfway through, an urgent fix is needed on a different ticket. Stash everything, switch branches, lose the thread. Come back to the first ticket to find half the changes conflict with something that landed while I was away. Finally submit a pull request with 47 files changed. The reviewer opens it, sighs visibly through the screen, and closes their laptop.

The AI could write excellent code. The surrounding workflow — keeping tickets isolated, planning before diving in, sizing pull requests for human review, building up shared knowledge over time — was all manual and fragile. Every new session started from zero. Every large feature became a review nightmare.

The individual tools existed. Isolated working directories for each ticket. Markdown documents for planning. The GitHub command-line tool for submitting work. Nothing tied them together.

Kraftwork

Kraftwork is a collection of plugins for Claude Code that handle the full development cycle from “I have a ticket” to “this is merged.”

The core idea: instead of the developer acting as workflow manager, the AI acts as workflow manager. Nine commands cover the complete cycle:

Command	What It Does
`/kraft-config`	Set up the workspace and connect to your tools
`/kraft-work TICKET-123`	Start or resume work on a ticket
`/kraft-plan`	Think before building: brainstorm, spec, break into tasks
`/kraft-implement`	Execute the plan
`/kraft-split`	Break a large branch into smaller, reviewable pieces
`/kraft-sync`	Pull in the latest changes from the team
`/kraft-import`	Bring in a branch from another repo or team member
`/kraft-archive`	Clean up after merging
`/kraft-retro`	Capture what we learned while it’s fresh

Each command is a carefully written set of instructions that Claude Code follows. No compiled software. No framework to install. Just very opinionated guidance for how to approach each phase of development.

One Ticket, One World

The first problem I solved was context bleeding between tickets.

When you run /kraft-work on a ticket, Kraftwork creates a completely isolated working environment for that ticket — separate directory, separate branch, fresh start. If that ticket already exists from a previous session, it resumes where you left off.

All tickets sit alongside each other in a workspace:

workspace/
├── trees/
│   ├── PROJ-1234-add-auth/
│   ├── PROJ-5678-fix-validation/
│   └── PROJ-3782-MR1-persistence/
└── docs/
    └── specs/
        ├── PROJ-1234/
        │   ├── idea.md
        │   ├── spec.md
        │   └── tasks.md
        └── PROJ-5678/

Switching between tickets is instant. No stashing, no conflicts, no lost context. This is git worktrees under the hood — a feature that lets you check out multiple branches at once without the overhead of cloning the repository multiple times.

For large features that need to be submitted in multiple installments, there’s a stacking mode: a second ticket can branch off the first, creating a chain of pull requests where each one builds on the previous. /kraft-split handles submitting them in the right order.

Plan Before You Build

This is where Kraftwork gets opinionated — and where it adds the most value.

When you run /kraft-plan, the AI doesn’t start coding. It starts thinking. Three phases:

Brainstorm: What are we building? Why? What are the constraints? What could go wrong? This is a conversation, not a deliverable. Questions get asked, options get explored, decisions get made.

Spec: The agreed design, written up properly. Architecture, components, how data flows, edge cases. This becomes the single source of truth for everything that follows.

Tasks: The spec broken down into pull-request-sized chunks. Each task should produce something small enough that a human reviewer can reasonably engage with it.

Once the task list exists, the spec is frozen. If something needs to change during implementation (and it always does), the change gets documented as a numbered record with a clear label: does this add to the plan, block progress, or replace something we’d decided? The forced documentation slows down scope creep just enough to be useful.

In practice, for a small feature, the brainstorm takes ten minutes and the spec is half a page. The structure scales to the work. What surprised me was how much better implementation went once there was a spec in place. Claude Code with a clear plan is a different animal to Claude Code with a vague description. It knows what it’s building, knows what’s out of scope, and can check its own work against the requirements.

Works Wherever You Work

Here’s a problem I didn’t anticipate. At our startup, we use GitHub for code and ClickUp for tickets. At another client, I use GitLab and Jira. The workflow is identical. The tools are different. I didn’t want to maintain two separate versions of Kraftwork with different API calls hardcoded everywhere.

The solution was a provider interface system. Kraftwork’s core commands don’t call any specific service. Instead, they look up which service is configured for each category of capability in a settings file, then call that service’s plugin.

Six categories cover everything the workflow needs:

Category	What It Handles
Code hosting	Branches, pull requests, repository operations
Build system	Pipeline status, build logs
Ticket management	Search, create, update tickets
Document storage	Read/write planning documents
Memory	Knowledge storage and recall
Messaging	Notifications, team communication

Swapping from GitHub to GitLab means changing one line in the settings file. Every core command works identically.

If no service is configured for a category, Kraftwork falls back to local alternatives. No ClickUp? Planning documents stay as local files. No Jira? Ticket context is captured manually. The workflow gets simpler but doesn’t break.

Memory That Persists

Every Claude Code session starts from zero. No memory of what worked last time, what patterns to avoid, what we discovered about this particular codebase three weeks ago. That’s the most frustrating limitation of AI coding tools at scale.

The memory provider in Kraftwork takes a run at solving this. It has three parts:

Usage tracking: Basic statistics on which commands get used, how sessions go. Lightweight and useful for understanding how the workflow is actually working in practice.

Knowledge base: The interesting one. A local database that stores learnings — debugging insights, architectural decisions, gotchas discovered the hard way. When a new session starts on a related area of the codebase, relevant past learnings surface automatically. “We’re working on the authentication module” retrieves past notes about authentication edge cases, even if the exact words don’t match. All search is done locally, with no data leaving the machine.

Quality signals: Quality scoring for completed work, drawing on simple checks (did the pull request get approved without requested changes?) and optional AI-based evaluation. The scores feed back into the knowledge base over time.

I’ll be honest about where this is at: the usage tracking works well. The knowledge base is useful but captures too much noise — it needs curation. The quality signals are promising but I haven’t used them enough to trust them yet. It’s the part I’m most excited about and the least confident in.

Surprises Along the Way

Writing instructions is harder than writing code. The quality of the whole system depends on the quality of the prompts. I rewrote the planning command four times before it consistently produced plans at the right level of detail. Too vague and implementation falls apart. Too detailed and planning takes longer than just building. The breakthrough was treating brainstorm, spec, and task decomposition as three separate concerns with three separate instructions, rather than trying to do all three at once.

The provider system was worth far more than the engineering effort. I built it because I needed GitHub and GitLab support. What I got was a clean separation between “what the workflow needs” and “how it’s done.” This made every core command simpler to write and easier to reason about. The abstraction felt like over-engineering until the third provider, when it started feeling like the only sensible approach.

Parallel work changed how I think about work itself. Before Kraftwork, I worked on one thing at a time because switching was expensive. With isolated working environments per ticket, switching is free. I routinely have three or four tickets in flight simultaneously. When I’m waiting for a review on one, I switch to another. The mental model shifted from “I’m working on feature X” to “I have a workspace with several active threads.”

Is This Overkill?

Probably. A solo developer building a plugin system with vendor-agnostic provider interfaces and a local AI-powered knowledge base for a coding assistant is, objectively, a lot of architecture for one person.

But I use this every day. The isolated working environments alone save me an hour a week in context-switching costs. The planning phase has caught scope creep that would have cost days to unpick. The provider system means I maintain one workflow across completely different toolchains at different clients.

And the memory layer, even in its rough state, has already surfaced a past learning that saved me from repeating a mistake. Once. That one time probably justified the entire effort.

The bigger question isn’t whether this is over-engineered. It’s whether these patterns — planning before building, keeping work isolated, memory that persists between sessions — should be built into AI coding tools by default. Every developer using AI assistance seriously runs into these problems eventually.

Until the tools catch up, there’s Kraftwork.

Kraftwork is open source at github.com/filipeestacio/kraftwork. It requires Claude Code to use. The architecture patterns — provider interfaces, spec-driven planning, worktree isolation — are not Claude-specific and could be adapted to other AI coding tools.