docs(roadmap): add workspace test flag order preflight gap

2026-05-22 13:46:44 +00:00 · 2026-05-21 20:30:47 +00:00
parent 6ef54578e8
commit bc55711600
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6665,3 +6665,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 554. **Recovery ledger tests assert only `started_at.is_some()` / `finished_at.is_some()`, so fake tick-counter timestamps are explicitly accepted by the machine-readable ledger suite** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 19:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@1540e3a` and binary built from source SHA `25d663d`. Code inspection: `RecoveryContext::next_timestamp` in `runtime/src/recovery_recipes.rs:276-279` returns `recovery-ledger-tick-N`, and `attempt_recovery` stores those strings into public `RecoveryLedgerEntry.started_at` / `finished_at`. The test named `recovery_context_exposes_machine_readable_ledger` at `recovery_recipes.rs:688-720` only asserts `entry.started_at.is_some()` and `entry.finished_at.is_some()`; exhaustion/failure ledger tests likewise validate state/result/command details but not timestamp parseability. Therefore the test suite labels the ledger machine-readable while allowing non-date sentinel strings in fields named `started_at` and `finished_at`. This is the test-coverage sibling of Jobdori's #555 public API timestamp bug: even after production is fixed, nothing in the ledger tests prevents a regression back to tick strings or other unparseable data. **Required fix shape:** (a) add a recovery timestamp assertion helper that parses `started_at` and `finished_at` as RFC3339/ISO-8601 UTC; (b) update success, exhausted, and failed ledger tests to use it; (c) add a negative unit test proving `recovery-ledger-tick-1` is rejected by the helper/contract; (d) document whether recovery ledger timestamps are wall-clock instants or monotonic attempt IDs, and if both are needed, add separate `attempt_seq` instead of overloading timestamp fields; (e) align with the timestamp contract fixes in #548-#551. **Why this matters:** tests currently make the wrong semantic promise: "machine-readable" only means present. Recovery ledgers drive retries/escalation audit trails, so timestamp fields must be parseable dates or consumers cannot sort, correlate, or display recovery attempts reliably. Source: gaebal-gajae dogfood response to Clawhip message `1507103214665859264` on 2026-05-21.

 555. **Workspace-test stale-branch preflight can compare against stale local `main` instead of `origin/main`, letting branches behind the remote base run full workspace tests as “fresh”** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@a036293` and binary built from source SHA `25d663d`. Live channel context had multiple open claw-code PRs whose head ref is `main`, and the watchdog target is specifically stale-branch confusion. Code inspection: `tools/src/lib.rs::workspace_test_branch_preflight` reads the current branch then calls `resolve_main_ref(&branch)` before `check_freshness`. `resolve_main_ref` at `tools/src/lib.rs:2020-2032` returns local `main` whenever it exists, except when the current branch itself is `main` and `origin/main` exists. In the common feature-branch case with both refs present, the stale-branch guard compares feature branch to local `main`, not `origin/main`. If local `main` has not been fetched/updated, a branch can be behind `origin/main` but equal to local `main`, so `check_freshness` returns `Fresh` and `cargo test --workspace` proceeds without the preflight block. **Required fix shape:** (a) prefer `origin/main` (or the configured protected/base remote ref) for non-main branches when present; (b) fetch or verify the remote ref freshness before using it, or emit a degraded `branch.remote_base_unknown`/`branch.base_ref_stale` lane event instead of silently falling back; (c) include `baseRefSource` and `baseRefCommit` in the blocked lane event payload so operators know whether freshness was checked against local or remote state; (d) add a regression with local `main` stale, `origin/main` ahead, and a feature branch equal to local `main`, proving workspace tests are blocked; (e) keep the current `branch == main -> origin/main` behavior but cover it separately. **Why this matters:** full workspace tests are used as green evidence. If the guard checks an outdated local main, agents can burn time and report green against a stale base while missing fixes already in the remote protected branch. Source: gaebal-gajae dogfood response to Clawhip message `1507110763301437440` on 2026-05-21.
+
+556. **Workspace-test stale-branch preflight only matches fixed argument order, so broad workspace test commands like `cargo test --all-targets --workspace` bypass the stale-base guard** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@6ef5457` and binary built from source SHA `25d663d`. Active tmux list was empty at probe time. Code inspection: `tools/src/lib.rs::is_workspace_test_command` normalizes whitespace/lowercase, then checks substring needles in this exact order: `cargo test --workspace`, `cargo test --all`, `cargo nextest run --workspace`, `cargo nextest run --all`. The existing regression `bash_workspace_tests_are_blocked_when_branch_is_behind_main` uses `cargo test --workspace --all-targets`, which matches the fixed-order needle. But semantically equivalent broad commands such as `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, `cargo test --all-features --workspace`, or `cargo nextest run --all-features --workspace` do not contain the exact substring `cargo test --workspace` / `cargo nextest run --workspace`, so `workspace_test_branch_preflight` returns `None` and the command executes even on a stale branch. Targeted-test skip coverage does not protect this because the bypassed commands are still workspace-wide tests. **Required fix shape:** (a) parse shell command tokens enough to identify `cargo test` and `cargo nextest run` invocations independent of flag order; (b) classify workspace-wide tests when any token is `--workspace` or `--all` for those subcommands, regardless of intervening flags; (c) add negative coverage for targeted package tests that include workspace-looking strings only in quoted args/comments; (d) add regressions proving stale-branch preflight blocks `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, and `cargo nextest run --all-features --workspace`; (e) include the normalized detected test scope in the structured branch-divergence event so operators can see why a command was blocked. **Why this matters:** agents often reorder cargo flags. A stale-branch safety guard that depends on one flag order gives false confidence and lets expensive full-suite green evidence be produced against stale code. Source: gaebal-gajae dogfood response to Clawhip message `1507118317528158370` on 2026-05-21.