Operating the fleet
Week one is install and verify. Week two is operating the thing. This is what week two looks like.
By the time you reach this page you have a deployed Alfred, an alfred status that comes back green, and at least one codename that has shipped a PR. The questions that remain are operational: how often should I look at Slack, which CLI command tells me what, what do the sentinels mean, and what do I do when the fleet goes quiet?
This page is the runbook. Full doc at docs/OPERATING_THE_FLEET.md.
Daily Slack rhythm
Section titled “Daily Slack rhythm”If you wired up the Slack webhook, three posts on the configured channel are your operating loop.
| Post | When | What it carries |
|---|---|---|
fleet-recap (morning) | 07:30 daily | Yesterday’s PRs shipped, in-flight work, doctor status, anything red. |
shipped-summary-daily | end of day | Today’s merged PRs, issues closed, LOC, model and config changes. |
fleet-recap (evening) | 22:00 daily | Per-agent spend, firing count, success rate, anything that paused itself. |
Plus the per-firing posts: shipped, blocked, partial. Quiet days are the goal. A morning post that says “yesterday: 6 PRs shipped, 0 blockers” is the steady state.
When you see a [BLOCKED] or a pause event in Slack, that is the only signal that needs same-day attention. Everything else can wait until you have time.
CLI recipes
Section titled “CLI recipes”The alfred CLI is your day-to-day verb set. Pure stdlib, no daemon. Common recipes:
alfred status # local fleet health, locks, pauses, approval waitsalfred agents # configured agents, schedule, enable state, host-unit statusalfred shipped # merged PRs, issues, LOC, and config changesalfred shipped --period weeklyalfred engine status # one line per codename, resolved Claude/Codex modealfred auth status # Claude + Codex auth surface checkalfred codex probe # one tiny Codex request end-to-endalfred claude status # show which Claude account scheduled firings usealfred enabled-agents # print the runner-gate listalfred pause <codename> # stop scheduled firings for one agentalfred pause all # stop the entire fleetalfred resume <codename> # reverse a pausealfred run <codename> # kick one firing now (one-shot)alfred run <codename> --force # ignore the pause marker for this kickThe two recipes you will use most:
- “Did anything ship today?”
alfred shippedand (if you want more detail)alfred shipped --period weekly. - “Why is Lucius quiet?”
alfred statusfirst. If it shows Lucius as paused,alfred resume lucius. If it shows a lock,alfred agentstells you whether the unit is loaded. If both look fine, the agent had noagent:implementissue to claim and exited silently.
Three places to look.
Per-agent host-scheduler logs. Every scheduled firing’s stdout and stderr land at /tmp/my.fleet.<codename>.stdout and /tmp/my.fleet.<codename>.stderr. macOS clears /tmp on reboot; copy anything you need to keep. To watch a firing live:
tail -f /tmp/my.fleet.lucius.stdout /tmp/my.fleet.lucius.stderrPer-firing transcripts. Under $ALFRED_HOME/state/transcripts/<codename>/<YYYY-MM>/<firing-id>.jsonl when transcripts are enabled, and $ALFRED_HOME/state/codex/<codename>/<YYYY-MM>/<firing-id>.{last.md,stdout.txt,stderr.txt} for Codex firings. The Codex artifacts are the most useful when a [BLOCKED] post does not explain itself; last.md is the engine’s own final message.
Per-firing event log. $ALFRED_HOME/state/<codename>/events/ carries a structured trail of what each firing tried. alfred status summarizes this; for raw inspection, cat the JSONL.
See State and memory for the full directory tree.
Reading the sentinels
Section titled “Reading the sentinels”Every firing prints exactly one sentinel string on its way out. The scheduler logs them; Slack posts the user-facing ones. Knowing them by sight is how you read the fleet at a glance.
| Sentinel | Meaning | What to do |
|---|---|---|
[OK] commit <sha> | The firing committed and opened a PR. | Nothing. The PR is in the queue. |
[ALREADY-IMPLEMENTED] | Work was already in the codebase. Issue closed. | Nothing. Often a sign Drake filed something Lucius had already solved. |
[PARTIAL] | Hit error_max_turns mid-work. Worktree left for the next firing. | Nothing. The next firing retries. If you see two in a row on the same issue, the issue may be too large; consider splitting it. |
[BLOCKED] | Engine could not resolve an error. Slack posts the reason at warn. | Read the Slack post. Common causes: missing dep, failing test the agent could not fix, repo convention the agent does not know. |
[<AGENT>-LOCKED] | A previous firing of this codename is still running. | Nothing, unless you see it for hours; then run alfred clear-lock <codename> --check. Cleanup preserves dirty or ahead worktrees and creates local recovery/* refs when commits need a handle. |
[<AGENT>-PREFLIGHT-FAILED] | A required CLI is missing, gh auth expired, or a watched repo is gone. | Run bash bin/doctor.sh. It names the missing piece. |
[<AGENT>-DOCTOR-OK] | ALFRED_DOCTOR=1 was set; the firing verified preflight and exited. | Nothing. This is how doctor.sh validates each agent. |
[<AGENT>-GLOBAL-BLOCKED] | A Claude provider limit is in effect; this firing exited silently. | Nothing for ~1 hour; the block clears automatically. Hybrid agents are unaffected and keep going via Codex. |
[<AGENT>-DEDUP-SKIP] | Another firing already claimed this issue, or the repo is paused. | Nothing. Cooperative behavior. |
[<AGENT>-NO-COMMIT] | The engine reported success but no commit landed. | Inspect the salvage draft PR (if one was opened) or check the firing log to learn why. |
[SILENT] | No matching issue. | Nothing. The non-event is the signal. |
Anything ending in -BLOCKED, -FAILED, or -NO-COMMIT needs your attention. Everything else is the fleet doing its job.
When to run alfred run <codename> --force
Section titled “When to run alfred run <codename> --force”alfred run <codename> kicks a one-shot firing now. Use it when:
- You filed an
agent:implementissue and want to see it picked up before the next 20-minute cycle. - You changed a prompt template and want to verify the change end-to-end.
- You resumed an agent after a pause and want to confirm it is alive.
--force overrides the pause marker. Use it when:
- You deliberately want one firing despite the pause (debugging, validating a fix).
- The pause marker was set automatically by a self-pause event and you want to retry once before resuming the schedule.
Do not use --force:
- During a global rate-limit block. The block is the fleet’s way of waiting out a wall. Forcing more firings burns turns to re-confirm the wall.
- When
alfred statusshows the agent is already locked. Two firings of the same codename at once is exactly whatwith_lockexists to prevent. - When the underlying issue is “the agent ran out of spend for the day”. Wait for midnight, or raise the cap deliberately.
”Fleet went quiet” troubleshooting
Section titled “”Fleet went quiet” troubleshooting”If you expected activity and Slack is silent, walk this list in order. Each step is cheap and most failures are caught in the first two.
1. Claude auth expired
Section titled “1. Claude auth expired”alfred auth statusIf Claude reports unauthenticated, run claude interactively once. The auth blob at ~/.claude/ expires rarely but does expire. The hybrid agents will have been failing over to Codex; pure-Claude agents will have been silent. See Engine routing.
2. Global block engaged
Section titled “2. Global block engaged”cat $ALFRED_HOME/state/global-blocked-until.jsonIf present and the until is in the future, a Claude-backed firing tripped a provider limit. Pure-Claude agents are silent until expiry. There is nothing to do except wait, or set hybrid on the codenames you want to keep working.
3. All repos paused
Section titled “3. All repos paused”cat $ALFRED_HOME/state/paused-repos.jsonalfred statusIf every repo is paused, every consumer’s pick_* helper returns nothing and every firing exits as [SILENT]. You may have run label-state repo pause <repo> and forgotten to resume.
4. Agent self-paused
Section titled “4. Agent self-paused”ls $ALFRED_HOME/state/_paused/A self-pause writes a marker file here. Common causes: spend cap exceeded, eight consecutive failures, an explicit Slack-posted reason. alfred status summarizes this. alfred resume <codename> reverses it.
5. Schedule conflict or wrong unit
Section titled “5. Schedule conflict or wrong unit”alfred agentslaunchctl list | grep my.fleet # macOSsystemctl --user list-timers # LinuxIf the unit is not loaded, the host scheduler will never fire it. Common after a deploy.sh that did not complete cleanly. Re-run bash deploy.sh && bash bin/doctor.sh.
6. launchd plist load failure (macOS)
Section titled “6. launchd plist load failure (macOS)”tail -n 200 /tmp/my.fleet.<codename>.stderrA plist that fails to load on launchctl bootstrap writes the reason to the unit’s stderr. Most common cause: a typo in agents.conf after a manual edit, or a PATH entry that no longer exists. Re-run bash launchd/render.sh && bash deploy.sh.
Weekly hygiene
Section titled “Weekly hygiene”Block thirty minutes once a week. The fleet does most of its own housekeeping via agent-cleanup (daily 03:00), but a human pass catches drift the daily cleanup will not.
alfred shipped --period weekly: read the digest. Anything missing from what you expected? Anything that landed and you did not notice?bash bin/doctor.sh: confirm preflight still passes for every configured agent.bash bin/scrub-check.sh: if you contribute to Alfred itself, run this before pushing.ls $ALFRED_HOME/state/claims/if it exists: stale claims should be empty after the daily sweep, but inspect anything older than 24h.- Rotate
/tmp/my.fleet.*logs if/tmpis filling up. macOS clears them on reboot, but a host that stays up for weeks accumulates a lot. - Prune
$ALFRED_HOME/state/transcripts/and$ALFRED_HOME/state/codex/older than a month if you do not need them for forensics.agent-cleanupalready handles spend files and worktrees.
See also
Section titled “See also”- Architecture: the design that makes this runbook short.
- How it works: one firing traced end to end.
- State and memory: every state file this page reads or writes.
- Engine routing: per-codename Claude / Codex / hybrid.
- Issue claim state machine: cooperative coordination via GitHub labels and comments.
- Alfred CLI reference: every
alfredsubcommand with one-line help.