
Issue claim state machine

Full doc at docs/STATE_MACHINE.md. This page is the executive summary.

Two actors can race on the same issue:

  • Two agent firings (rare; with_lock serializes per codename, but cross-codename collisions exist).
  • One agent + you pushing a manual branch.
  • Two human users if you ever expand beyond solo (out of scope today, but the design accommodates it).

Without a coordination primitive, you get duplicate work. A real failure mode: one agent ships a quick PR for an issue in the morning, then you open a careful PR for the same issue later, neither aware of the other.

State is carried entirely in GitHub labels and structured HTML comments. There is no shared database, no shared filesystem, and no Slack lock; GitHub is the synchronisation point. The lifecycle is single-host today, but the contract works the same way if you ever spread the fleet across machines.

stateDiagram-v2
    [*] --> agent_implement : drake / human files

    agent_implement : agent:implement
    agent_implement --> agent_in_flight : claim_issue()
    agent_implement --> needs_human_scope : 3+ failed attempts

    agent_in_flight : agent:in-flight
    agent_in_flight --> plan_pending : Batman posts plan to Slack
    agent_in_flight --> agent_implement : release(transition_to=None)
    agent_in_flight --> agent_pr_open : release(transition_to=agent:pr-open)
    agent_in_flight --> agent_implement : stale-claim sweep (>4h)
    agent_in_flight --> race_yield : earlier claim detected
    race_yield --> agent_implement : yield + post race-yielded comment

    plan_pending : agent:plan-pending-approval
    plan_pending --> agent_in_flight : approver reacts yes
    plan_pending --> agent_implement : approver rejects or timeout

    agent_pr_open : agent:pr-open
    agent_pr_open --> agent_done : automerge / human merge
    agent_pr_open --> agent_implement : PR closed without merge

    agent_done : agent:done
    needs_human_scope : needs:human-scope
    do_not_pickup : do-not-pickup (sticky, orthogonal)

    agent_done --> [*]
    needs_human_scope --> [*]
| Label | Meaning | Set by |
| --- | --- | --- |
| agent:implement | Eligible for autonomous pickup | Drake (or human) |
| agent:in-flight | An agent is actively working it | claim_issue() |
| agent:plan-pending-approval | Plan posted for approval, waiting on go/no-go | Slack approval gate (lib/slack_approval.py) before reaction resolves |
| agent:pr-open | A PR exists for this issue | release_issue(transition_to=...) |
| agent:done | Closed and shipped | external (PR merge handler) |

The agent:plan-pending-approval label is set by Batman when it posts a plan to Slack and waits for an approval reaction. The Slack approval gate polls that one message's reactions. Once the approver responds, the issue either moves back to agent:in-flight (approved, ready for worker pickup) or returns to agent:implement (rejected or timed out). See docs/SLACK_APPROVAL.md for the full setup walkthrough.

| Label | Meaning |
| --- | --- |
| do-not-pickup | Operator override; agents skip this issue |
| needs:human-scope | Issue is too vague; not eligible for autonomous pickup |

A structured HTML comment is posted alongside every label change so the audit trail survives manual label edits:

<!-- agent-claim:codename=lucius firing_id=20260501-194217-643a ts=2026-05-01T19:42:33Z -->
<!-- agent-release:codename=lucius firing_id=20260501-194217-643a outcome=success pr=https://github.com/foo/bar/pull/42 ts=2026-05-01T20:08:11Z -->

find_stale_claims() reads these to decide who currently holds an in-flight claim and how old that claim is, without depending on label-event timestamps.
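To make the comment format concrete, here is a minimal sketch of how such a marker could be parsed. It assumes the shape shown in the two examples above (space-separated key=value pairs, UTC timestamps ending in Z); the names MARKER_RE and parse_marker are hypothetical and are not part of the shipped module.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumed marker shape, taken from the two example comments:
#   <!-- agent-claim:key=value key=value ... -->
MARKER_RE = re.compile(r"<!--\s*agent-(claim|release):(.*?)\s*-->", re.DOTALL)


def parse_marker(comment_body: str) -> Optional[dict]:
    """Return the marker's kind and fields, or None if the comment has no marker."""
    match = MARKER_RE.search(comment_body)
    if match is None:
        return None
    fields = {"kind": match.group(1)}
    for pair in match.group(2).split():
        # Split on the first "=" only, so values that contain "=" survive intact.
        key, _, value = pair.partition("=")
        fields[key] = value
    if "ts" in fields:
        # The examples use UTC timestamps with a trailing "Z".
        parsed = datetime.strptime(fields["ts"], "%Y-%m-%dT%H:%M:%SZ")
        fields["ts"] = parsed.replace(tzinfo=timezone.utc)
    return fields
```

Splitting on the first "=" keeps a value such as race-yielded-to=lucius:A in one piece under the outcome key.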

claim_issue():

  1. Reads current label set; refuses if any blocker label is present.
  2. Atomically adds agent:in-flight + posts the claim comment.
  3. Re-reads recent comments to detect any unreleased earlier claim.
  4. If an earlier claimant exists (by createdAt timestamp), the loser:
    • Removes its own agent:in-flight label
    • Restores agent:implement
    • Posts a release comment with outcome=race-yielded-to=<earlier_codename>:<earlier_firing_id>
  5. The earlier claimant keeps the issue uncontested.

The loser exits the firing without burning a Claude turn on duplicate work. The race window collapses from ~20 minutes (between agent pickup and PR open) to the sub-second gap between read-labels and add-label.
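The winner/loser decision in steps 3 to 5 can be sketched as a pure function over the unreleased claims. This is an illustration only: the Claim dataclass, earlier_claimant, and yield_outcome are hypothetical names, and the tie-break on firing_id for identical timestamps is an assumption the source does not describe.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Claim:
    codename: str
    firing_id: str
    created_at: datetime  # the claim comment's createdAt timestamp


def earlier_claimant(mine: Claim, unreleased: List[Claim]) -> Optional[Claim]:
    """Return the claim that beats `mine`, or None if `mine` is the earliest."""
    others = [c for c in unreleased if c.firing_id != mine.firing_id]
    if not others:
        return None
    # Assumption: ties on created_at are broken by firing_id so that both
    # racers reach the same verdict about who won.
    earliest = min(others, key=lambda c: (c.created_at, c.firing_id))
    if (earliest.created_at, earliest.firing_id) < (mine.created_at, mine.firing_id):
        return earliest
    return None


def yield_outcome(winner: Claim) -> str:
    """Build the outcome value the loser records in its release comment."""
    return f"race-yielded-to={winner.codename}:{winner.firing_id}"
```

Because both firings evaluate the same comment list with the same ordering rule, exactly one of them sees None and keeps the issue.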

sequenceDiagram
    participant L as Lucius (firing A)
    participant gh as GitHub issue #303
    participant B as Lucius (firing B)

    L->>gh: read labels (agent:implement)
    B->>gh: read labels (agent:implement)
    L->>gh: add agent:in-flight + claim comment (ts=T1)
    B->>gh: add agent:in-flight + claim comment (ts=T2)
    L->>gh: re-read comments
    Note over L: only my claim, T1 is earliest
    L->>gh: keep issue, make worktree
    B->>gh: re-read comments
    Note over B: A's claim at T1 < my T2 -> I lost
    B->>gh: remove my agent:in-flight, restore agent:implement
    B->>gh: post release: outcome=race-yielded-to=lucius:A
    Note over B: exit firing, 0 Claude turns spent

A runner crashing between claim_issue and release_issue would normally leave an issue blocked indefinitely. find_stale_claims() reads claim comments and surfaces any in-flight claim with no matching release after max_age_hours (default 4). force_release_stale_claim() then transitions the issue back to agent:implement so the queue picks it up again.
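The detection rule (a claim with no matching release that is older than max_age_hours) can be sketched over already-parsed markers. The dict shape used here (kind, codename, firing_id, ts) is an assumption for illustration, and stale_claims is a hypothetical stand-in, not the shipped find_stale_claims().

```python
from datetime import datetime, timedelta, timezone


def stale_claims(markers, now, max_age_hours=4):
    """Return claim markers that have no matching release and exceed the age limit.

    `markers` is a list of dicts with keys: kind ("claim" or "release"),
    codename, firing_id, and ts (a timezone-aware datetime).
    """
    # A claim counts as released once any release marker carries its firing_id.
    released = {m["firing_id"] for m in markers if m["kind"] == "release"}
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        m
        for m in markers
        if m["kind"] == "claim"
        and m["firing_id"] not in released
        and m["ts"] < cutoff
    ]
```

Passing now in explicitly keeps the function deterministic, which makes a --dry-run style report easy to test.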

Wire it into your fleet’s daily cleanup runner. The shipped bin/alfred-label-state.py binary exposes this as alfred-label-state sweep-claims [--max-age-hours N] [--dry-run]. deploy.sh copies it into $ALFRED_HOME/bin/ alongside the other alfred-* binaries.

Two ways to take an issue manually without racing an agent:

# Mark a single issue do-not-pickup
alfred-label-state claim your-org/your-backend#42
# ... do your work ...
alfred-label-state release your-org/your-backend#42

# Take a whole repo offline from the fleet
alfred-label-state repo pause your-backend
# ... refactor in peace ...
alfred-label-state repo resume your-backend

sweep-claims reads LABEL_STATE_SWEEP_REPOS (comma-separated) for its default repo set:

LABEL_STATE_SWEEP_REPOS="your-backend,your-frontend,your-mobile" \
alfred-label-state sweep-claims --max-age-hours 4 --dry-run
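For reference, a comma-separated variable like this one is typically read along the following lines. This is only a sketch: default_sweep_repos is a hypothetical helper, and trimming whitespace and dropping empty entries are assumptions about how the real binary behaves.

```python
import os


def default_sweep_repos(env=None):
    """Split LABEL_STATE_SWEEP_REPOS into a list of repo names."""
    env = os.environ if env is None else env
    raw = env.get("LABEL_STATE_SWEEP_REPOS", "")
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [name.strip() for name in raw.split(",") if name.strip()]
```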

The pre-push git hook (examples/git-hooks/pre-push) enforces the same check in the other direction, against your own pushes. If you push a branch whose commits reference Closes #N while that issue is in-flight or has a PR open, the push is refused.

Override per-push: git push --no-verify. Override globally: LABEL_STATE_SKIP_DEDUP_CHECK=1 in your shell rc.
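The core of such a hook can be sketched in a few lines. The sketch assumes that "in-flight or has a PR open" maps onto the agent:in-flight and agent:pr-open labels and that the reference is matched case-insensitively. The function names are hypothetical, and the shipped hook may instead rely on issue_dedup_check().

```python
import re

# "Closes #N" as it appears in commit messages; case-insensitive by assumption.
CLOSES_RE = re.compile(r"\bcloses\s+#(\d+)", re.IGNORECASE)

# Assumption: these two labels are what make an issue "taken".
BLOCKING_LABELS = frozenset({"agent:in-flight", "agent:pr-open"})


def referenced_issues(commit_messages):
    """Collect every issue number that a commit message closes."""
    return sorted({int(n) for msg in commit_messages for n in CLOSES_RE.findall(msg)})


def blocked_issues(commit_messages, labels_by_issue, env):
    """Return the issue numbers that should cause the push to be refused."""
    if env.get("LABEL_STATE_SKIP_DEDUP_CHECK") == "1":
        return []  # global override set in the shell rc
    return [
        num
        for num in referenced_issues(commit_messages)
        if BLOCKING_LABELS & set(labels_by_issue.get(num, ()))
    ]
```

A hook built on this would exit non-zero whenever blocked_issues() returns a non-empty list, which is what makes git refuse the push.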

# State transitions
claim_issue(repo, num, *, codename, firing_id) -> bool
release_issue(repo, num, *, codename, firing_id,
              outcome="success", transition_to=None, pr_url=None) -> bool
# Inspection
issue_dedup_check(repo, num) -> dict
find_stale_claims(repo, *, max_age_hours=4) -> list[dict]
# Recovery
force_release_stale_claim(repo, num, *, sweep_id,
                          released_codename=None,
                          released_firing_id=None) -> bool
# Operator overrides
is_repo_paused(repo) -> bool
list_paused_repos() -> list[str]
set_repo_paused(repo, paused) -> list[str]
# Constants
LIFECYCLE_LABELS: list[tuple[str, str, str]]
CLAIM_COMMENT_PREFIX: str
RELEASE_COMMENT_PREFIX: str
PAUSED_REPOS_FILE: Path

See agent_runner API reference for the full module surface.

Every label string lives in lib/labels.py. Import from there rather than duplicating string literals:

from labels import (
    IMPLEMENT,            # "agent:implement"
    IN_FLIGHT,            # "agent:in-flight"
    PR_OPEN,              # "agent:pr-open"
    DONE,                 # "agent:done"
    DO_NOT_PICKUP,        # "do-not-pickup"
    NEEDS_HUMAN_SCOPE,    # "needs:human-scope"
    AUTHORED,             # "agent:authored"
    LARGE_FEATURE,        # "agent:large-feature"
    bundle_label,         # builds "agent:bundle:<slug>"
    is_legal_transition,  # documents the state-machine moves
)
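As an illustration of what a transition check encodes, the moves drawn in the state diagram can be written as a lookup table. This sketch is derived from the diagram only: LEGAL_MOVES and is_legal_move are hypothetical names, and the real is_legal_transition may have a different signature or cover more cases.

```python
# Each key is a lifecycle label; the value is the set of labels the state
# diagram allows it to move to. race_yield is transient and has no label of
# its own, so it is folded into in-flight -> implement.
LEGAL_MOVES = {
    "agent:implement": frozenset({"agent:in-flight", "needs:human-scope"}),
    "agent:in-flight": frozenset(
        {"agent:plan-pending-approval", "agent:implement", "agent:pr-open"}
    ),
    "agent:plan-pending-approval": frozenset({"agent:in-flight", "agent:implement"}),
    "agent:pr-open": frozenset({"agent:done", "agent:implement"}),
}


def is_legal_move(current: str, target: str) -> bool:
    """Return True if the diagram has an edge from `current` to `target`."""
    return target in LEGAL_MOVES.get(current, frozenset())
```

Terminal labels such as agent:done have no entry, so any move out of them is reported as illegal.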

For multi-repo features (one issue, N PRs across N repos), Alfred ships:

  • lib/multi_worktree.py: MultiWorktree(requests, agent, feature_id) context manager that creates per-repo git worktrees with synchronised branch names and cleans them up on exit. Git interaction is injected via a GitRunner Protocol so tests don’t touch real worktrees.
  • lib/cross_repo_pr.py: CrossRepoPRChain plan/execute coordinator. chain.plan(...) returns a Plan dataclass (pure, no I/O); chain.execute(plan) opens each PR, persists state to $ALFRED_HOME/state/pr-chains/<feature_id>.json atomically, and refreshes earlier PR bodies as later siblings open so the cross-links stay current.
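Atomic persistence of a state file is commonly done by writing to a temporary file in the same directory and renaming it over the target. The following is a generic sketch of that technique, not the code in lib/cross_repo_pr.py; write_json_atomically is a hypothetical name.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write `payload` to `path` so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # swaps the new file into place in one step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A crash mid-write leaves the previous state file untouched, which matters when a later run resumes a partially opened PR chain.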

Both modules use Protocol-based dependency injection so consumers can swap the default subprocess implementation for tests or alternative GitHub clients.
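A minimal sketch of the Protocol-based injection pattern follows. The method name run and its signature are assumptions, since the real GitRunner Protocol is not shown here; SubprocessGit, RecordingGit, and add_worktree are hypothetical names used only for illustration.

```python
import subprocess
from typing import List, Protocol, Sequence, Tuple


class GitRunner(Protocol):
    """Anything with a compatible run() method satisfies this Protocol."""

    def run(self, args: Sequence[str], cwd: str) -> str: ...


class SubprocessGit:
    """Default implementation: shells out to the real git binary."""

    def run(self, args: Sequence[str], cwd: str) -> str:
        result = subprocess.run(
            ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
        )
        return result.stdout


class RecordingGit:
    """Test double: records calls instead of touching any repository."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Tuple[str, ...], str]] = []

    def run(self, args: Sequence[str], cwd: str) -> str:
        self.calls.append((tuple(args), cwd))
        return ""


def add_worktree(git: GitRunner, repo_dir: str, path: str, branch: str) -> None:
    """Create a worktree on a new branch through whichever runner was injected."""
    git.run(["worktree", "add", "-b", branch, path], cwd=repo_dir)
```

In a test, passing RecordingGit() lets you assert on the exact git arguments without creating a worktree on disk.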