docs(roadmap): add tolerant tool permission prompt gap

2026-05-22 13:46:44 +00:00 · 2026-05-21 21:30:41 +00:00
parent 9fd61af086
commit 9ef521bb98
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6669,3 +6669,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 556. **Workspace-test stale-branch preflight only matches fixed argument order, so broad workspace test commands like `cargo test --all-targets --workspace` bypass the stale-base guard** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@6ef5457` and binary built from source SHA `25d663d`. Active tmux list was empty at probe time. Code inspection: `tools/src/lib.rs::is_workspace_test_command` normalizes whitespace/lowercase, then checks substring needles in this exact order: `cargo test --workspace`, `cargo test --all`, `cargo nextest run --workspace`, `cargo nextest run --all`. The existing regression `bash_workspace_tests_are_blocked_when_branch_is_behind_main` uses `cargo test --workspace --all-targets`, which matches the fixed-order needle. But semantically equivalent broad commands such as `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, `cargo test --all-features --workspace`, or `cargo nextest run --all-features --workspace` do not contain the exact substring `cargo test --workspace` / `cargo nextest run --workspace`, so `workspace_test_branch_preflight` returns `None` and the command executes even on a stale branch. Targeted-test skip coverage does not protect this because the bypassed commands are still workspace-wide tests. **Required fix shape:** (a) parse shell command tokens enough to identify `cargo test` and `cargo nextest run` invocations independent of flag order; (b) classify workspace-wide tests when any token is `--workspace` or `--all` for those subcommands, regardless of intervening flags; (c) add negative coverage for targeted package tests that include workspace-looking strings only in quoted args/comments; (d) add regressions proving stale-branch preflight blocks `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, and `cargo nextest run --all-features --workspace`; (e) include the normalized detected test scope in the structured branch-divergence event so operators can see why a command was blocked. **Why this matters:** agents often reorder cargo flags. A stale-branch safety guard that depends on one flag order gives false confidence and lets expensive full-suite green evidence be produced against stale code. Source: gaebal-gajae dogfood response to Clawhip message `1507118317528158370` on 2026-05-21.

 557. **Wrong-task prompt-misdelivery detection only recognizes `›` prompt echoes, so `>` / `❯` agent prompts can hide mismatched-task receipts until timeout** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@bc55711` and binary built from source SHA `25d663d`. Active tmux session at probe time: `gajae-issue-311-auto-merge-race-receipt`. Code inspection: worker readiness accepts multiple prompt glyphs (`>`, `›`, `❯`) in `detect_ready_for_prompt` at `runtime/src/worker_boot.rs:1079-1113`, but `detect_prompt_echo` at `worker_boot.rs:1210-1217` only strips a leading `›`. `detect_prompt_misdelivery` relies on `detect_prompt_echo` for the `mismatched_prompt_visible` path that catches wrong-task receipts when the screen shows a different task prompt. The existing regression `wrong_task_receipt_mismatch_is_detected_before_execution_continues` uses `› Explain this KakaoTalk screenshot...`, so it exercises only the single supported glyph. If the coding agent UI echoes `> Explain...` or `❯ Explain...`, `observed_prompt_preview` is `None`; when the expected prompt text is not also visible, the wrong-task mismatch is not detected and the worker stays `Running` until coarse startup timeout classification. **Required fix shape:** (a) make prompt-echo parsing share the same glyph set as `detect_ready_for_prompt` (`>`, `›`, `❯`, and boxed `│ >` variants if present); (b) add wrong-task receipt tests for `>`, `›`, and `❯` echoes; (c) include the raw echo line/glyph in `WorkerEventPayload::PromptDelivery` or event detail so operators can diagnose UI variant drift; (d) ensure shell prompt detection remains separate so real shell prompts are still classified as `Shell`, not wrong-task agent echoes; (e) add a timeout evidence regression proving observed prompt preview is populated for all supported glyphs. **Why this matters:** prompt-misdelivery protection is only as good as the UI echo parser. Supporting multiple ready glyphs but only one echo glyph creates event/log opacity: operators see a generic timeout instead of a precise wrong-task replay condition for common terminal themes or agent UIs. Source: gaebal-gajae dogfood response to Clawhip message `1507125863408341102` on 2026-05-21.
+
+558. **Tool-permission gate detection is hard-coded to one English MCP prompt shape, so alternate MCP approval wording can fall through as startup-no-evidence instead of `ToolPermissionRequired`** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9fd61af` and binary built from source SHA `25d663d`. Active tmux session at probe time: `omx-issue-2443-ralplan-consensus-resume`. Code inspection: `detect_tool_permission_prompt` in `runtime/src/worker_boot.rs:958-999` only enters when the full screen contains either `allow the` + `server` + `tool` + `run`, or `allow tool` + `run`. The only production-shaped tests at `worker_boot.rs:1387-1474` use exactly `Allow the omx_memory MCP server to run tool "..."?`. Common equivalent approval copy such as `Allow MCP server omx_memory to call tool "project_memory_read"?`, `Allow server omx_memory to execute tool ...`, `Approve tool project_memory_read from omx_memory?`, or localized/shorter plugin prompts do not contain the exact `allow the ... server ... run tool` / `allow tool ... run` token pattern. When those appear during boot, `observe` will not set `WorkerStatus::ToolPermissionRequired`, no structured `ToolPermissionPrompt` payload is emitted, and later timeout evidence can degrade to generic `startup_no_evidence` or worker-crashed classification even though the pane clearly showed an approval gate. **Required fix shape:** (a) replace phrase-order checks with a tolerant classifier over permission verbs (`allow`/`approve`/`permit`), execution verbs (`run`/`call`/`execute`), and MCP/tool tokens independent of order; (b) add fixture tests for at least three real-world prompt variants, including `call tool` and prompts where the tool name appears before the server; (c) preserve extracted `server_name`, `tool_name`, allow-scope, and raw `prompt_preview` even when fields are partial; (d) emit an `Unknown`-scope tool-permission event rather than falling through when the approval intent is clear but parsing is incomplete; (e) include classifier confidence/reason in startup timeout evidence so UI wording drift is visible. **Why this matters:** MCP permission prompts are exactly the kind of boot blocker operators need to resolve quickly. A brittle single-template detector converts an actionable “click allow” condition into opaque startup failure noise whenever plugin/UI copy drifts. Source: gaebal-gajae dogfood response to Clawhip message `1507133416884404254` on 2026-05-21.