docs(roadmap): add prompt echo glyph mismatch gap

2026-05-22 21:56:45 +00:00 · 2026-05-21 21:00:45 +00:00
parent bc55711600
commit 9fd61af086
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6667,3 +6667,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 555. **Workspace-test stale-branch preflight can compare against stale local `main` instead of `origin/main`, letting branches behind the remote base run full workspace tests as “fresh”** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@a036293` and binary built from source SHA `25d663d`. Live channel context had multiple open claw-code PRs whose head ref is `main`, and the watchdog target is specifically stale-branch confusion. Code inspection: `tools/src/lib.rs::workspace_test_branch_preflight` reads the current branch then calls `resolve_main_ref(&branch)` before `check_freshness`. `resolve_main_ref` at `tools/src/lib.rs:2020-2032` returns local `main` whenever it exists, except when the current branch itself is `main` and `origin/main` exists. In the common feature-branch case with both refs present, the stale-branch guard compares feature branch to local `main`, not `origin/main`. If local `main` has not been fetched/updated, a branch can be behind `origin/main` but equal to local `main`, so `check_freshness` returns `Fresh` and `cargo test --workspace` proceeds without the preflight block. **Required fix shape:** (a) prefer `origin/main` (or the configured protected/base remote ref) for non-main branches when present; (b) fetch or verify the remote ref freshness before using it, or emit a degraded `branch.remote_base_unknown`/`branch.base_ref_stale` lane event instead of silently falling back; (c) include `baseRefSource` and `baseRefCommit` in the blocked lane event payload so operators know whether freshness was checked against local or remote state; (d) add a regression with local `main` stale, `origin/main` ahead, and a feature branch equal to local `main`, proving workspace tests are blocked; (e) keep the current `branch == main -> origin/main` behavior but cover it separately. **Why this matters:** full workspace tests are used as green evidence. If the guard checks an outdated local main, agents can burn time and report green against a stale base while missing fixes already in the remote protected branch. Source: gaebal-gajae dogfood response to Clawhip message `1507110763301437440` on 2026-05-21.
 556. **Workspace-test stale-branch preflight only matches fixed argument order, so broad workspace test commands like `cargo test --all-targets --workspace` bypass the stale-base guard** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@6ef5457` and binary built from source SHA `25d663d`. Active tmux list was empty at probe time. Code inspection: `tools/src/lib.rs::is_workspace_test_command` normalizes whitespace/lowercase, then checks substring needles in this exact order: `cargo test --workspace`, `cargo test --all`, `cargo nextest run --workspace`, `cargo nextest run --all`. The existing regression `bash_workspace_tests_are_blocked_when_branch_is_behind_main` uses `cargo test --workspace --all-targets`, which matches the fixed-order needle. But semantically equivalent broad commands such as `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, `cargo test --all-features --workspace`, or `cargo nextest run --all-features --workspace` do not contain the exact substring `cargo test --workspace` / `cargo nextest run --workspace`, so `workspace_test_branch_preflight` returns `None` and the command executes even on a stale branch. Targeted-test skip coverage does not protect this because the bypassed commands are still workspace-wide tests. **Required fix shape:** (a) parse shell command tokens enough to identify `cargo test` and `cargo nextest run` invocations independent of flag order; (b) classify workspace-wide tests when any token is `--workspace` or `--all` for those subcommands, regardless of intervening flags; (c) add negative coverage for targeted package tests that include workspace-looking strings only in quoted args/comments; (d) add regressions proving stale-branch preflight blocks `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, and `cargo nextest run --all-features --workspace`; (e) include the normalized detected test scope in the structured branch-divergence event so operators can see why a command was blocked. **Why this matters:** agents often reorder cargo flags. A stale-branch safety guard that depends on one flag order gives false confidence and lets expensive full-suite green evidence be produced against stale code. Source: gaebal-gajae dogfood response to Clawhip message `1507118317528158370` on 2026-05-21.
 557. **Wrong-task prompt-misdelivery detection only recognizes `›` prompt echoes, so `>` / `❯` agent prompts can hide mismatched-task receipts until timeout** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@bc55711` and binary built from source SHA `25d663d`. Active tmux session at probe time: `gajae-issue-311-auto-merge-race-receipt`. Code inspection: worker readiness accepts multiple prompt glyphs (`>`, `›`, `❯`) in `detect_ready_for_prompt` at `runtime/src/worker_boot.rs:1079-1113`, but `detect_prompt_echo` at `worker_boot.rs:1210-1217` only strips a leading `›`. `detect_prompt_misdelivery` relies on `detect_prompt_echo` for the `mismatched_prompt_visible` path that catches wrong-task receipts when the screen shows a different task prompt. The existing regression `wrong_task_receipt_mismatch_is_detected_before_execution_continues` uses `› Explain this KakaoTalk screenshot...`, so it exercises only the single supported glyph. If the coding agent UI echoes `> Explain...` or `❯ Explain...`, `observed_prompt_preview` is `None`; when the expected prompt text is not also visible, the wrong-task mismatch is not detected and the worker stays `Running` until coarse startup timeout classification. **Required fix shape:** (a) make prompt-echo parsing share the same glyph set as `detect_ready_for_prompt` (`>`, `›`, `❯`, and boxed `│ >` variants if present); (b) add wrong-task receipt tests for `>`, `›`, and `❯` echoes; (c) include the raw echo line/glyph in `WorkerEventPayload::PromptDelivery` or event detail so operators can diagnose UI variant drift; (d) ensure shell prompt detection remains separate so real shell prompts are still classified as `Shell`, not wrong-task agent echoes; (e) add a timeout evidence regression proving observed prompt preview is populated for all supported glyphs. **Why this matters:** prompt-misdelivery protection is only as good as the UI echo parser. Supporting multiple ready glyphs but only one echo glyph creates event/log opacity: operators see a generic timeout instead of a precise wrong-task replay condition for common terminal themes or agent UIs. Source: gaebal-gajae dogfood response to Clawhip message `1507125863408341102` on 2026-05-21.