docs(roadmap): add worker restart timeout anchor gap

2026-05-22 13:46:44 +00:00 · 2026-05-21 19:00:52 +00:00
parent 51c05ac359
commit 1540e3a9ab
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6659,3 +6659,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 551. **Lane completion ignores timestamp validity and recency entirely, so stale or epoch-formatted AgentOutput manifests can be auto-marked completed if status/tests/push flags are green** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@2bc85c5` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lane_completion.rs::detect_lane_completion` decides completion only from `error.is_none()`, `status` being completed/finished, `current_blocker.is_none()`, `test_green`, and `has_pushed`. It never reads or validates `AgentOutput.createdAt`, `startedAt`, `completedAt`, or `laneEvents[].emittedAt`. The test fixture at `lane_completion.rs:103-120` uses RFC3339-looking timestamps, but there is no negative coverage for epoch strings or stale completedAt values. Therefore a manifest with `completedAt:"1748004000"` (from the broken `iso8601_now()` producer) or a very old completion timestamp can still generate a completed `LaneContext` as long as the status/tests/push booleans are favorable. This is distinct from #550's manifest test coverage gap: the completion policy itself treats temporal fields as irrelevant even though stale-session cleanup and lane closeout depend on trustworthy event time. **Required fix shape:** (a) parse and validate `completedAt` (and ideally `createdAt`/`startedAt`/terminal lane event `emittedAt`) before auto-completion; (b) reject or mark degraded manifests whose timestamp fields are missing, non-RFC3339, or implausibly stale relative to the current run; (c) thread a typed temporal blocker/degraded reason into `LaneContext` instead of silently completing; (d) add tests where status/tests/push are green but `completedAt` is `"1748004000"`, `"not-a-date"`, missing, or older than a configured freshness window, proving completion is withheld or degraded; (e) align this with #548-#550 so timestamp producer, validator, tests, and completion policy share one contract. **Why this matters:** completion is an action trigger, not just a display label. If stale or malformed manifests can auto-close lanes, claws may cleanup sessions, suppress reminders, or report success based on artifacts whose time identity is untrustworthy. Source: gaebal-gajae dogfood response to Clawhip message `1507080567731261655` on 2026-05-21.

 552. **Anthropic retry tests cover `/v1/messages` but not the preflight `/v1/messages/count_tokens` endpoint, so a transient 429 during token counting can be silently swallowed or skip refined context-window enforcement with no regression coverage** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@762ee65` and binary built from source SHA `25d663d`. Code/test inspection: `api/src/providers/anthropic.rs::preflight_message_request` calls `self.count_tokens(request).await`, but on any error it immediately `return Ok(())` and falls back to the coarse local byte guard. The `count_tokens` implementation sends one bare request to `/v1/messages/count_tokens` with no retry. Existing integration tests in `api/tests/client_integration.rs` cover retry behavior for `/v1/messages` (`retries_retryable_failures_before_succeeding`, `surfaces_retry_exhaustion_for_persistent_retryable_errors`, `retries_multiple_retryable_failures_with_exponential_backoff_and_jitter`) and local oversized-byte blocking, but there is no test where count_tokens returns 429/503 once then succeeds, nor a test asserting refined context-window rejection still happens after a retry. Therefore a count-token rate limit can disable the precise token guard and still let `send_message` proceed to `/v1/messages`, while the retry suite stays green. **Required fix shape:** (a) add integration tests with mock server sequence `429 count_tokens -> 200 count_tokens -> ...` proving the same retry policy applies before `/v1/messages`; (b) add a test where retried count_tokens returns a token count that exceeds the model window and assert no `/v1/messages` call is sent; (c) when count_tokens errors are intentionally best-effort, expose structured telemetry/warnings so skipped refined preflight is observable; (d) then implement retry or explicitly document degraded behavior; (e) keep request-count assertions path-aware so count_tokens and messages attempts are distinguishable. **Why this matters:** the retry suite currently gives false confidence because it exercises only the final message endpoint. Under rate pressure, the preflight endpoint is exactly where retries matter for safe compaction/context-window decisions. Source: gaebal-gajae dogfood response to Clawhip message `1507088113892331622` on 2026-05-21.
+
+553. **Worker restart reuses the original `created_at`, so startup-timeout elapsed time after restart includes the previous worker lifetime and stale pre-restart events** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 19:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@51c05ac` and binary built from source SHA `25d663d`. Code inspection: `WorkerRegistry::restart` at `runtime/src/worker_boot.rs:600-621` resets status, prompt fields, trust, and error state, then appends `WorkerEventKind::Restarted`, but it does not reset `worker.created_at` or clear/carry a new boot-start timestamp. `observe_startup_timeout` at `worker_boot.rs:713-735` computes `elapsed = now.saturating_sub(worker.created_at)` and reports `command_started_at: worker.created_at`. Therefore a worker restarted after a long previous lifetime can immediately report a huge startup timeout elapsed/command_started_at from the original boot, not the restart. Because events are also retained, the timeout evidence can mix pre-restart trust/tool-permission detections with the new boot attempt. This is distinct from Jobdori's #554 O(3N) scan/unbounded-events finding: even with O(1) caches, the temporal anchor for a restarted boot is wrong. **Required fix shape:** (a) add `boot_started_at` or `current_attempt_started_at` distinct from immutable worker creation time; (b) set that timestamp on create and restart, and use it for startup timeout elapsed/command_started_at; (c) either scope trust/tool-permission evidence to events since the current boot attempt or store per-attempt cached flags; (d) include `attempt_index`/`restart_count` in startup evidence and worker events so old/new attempts are separable; (e) add a regression where a worker is created, time advances/restart occurs, then `observe_startup_timeout` reports elapsed from restart rather than original creation and ignores pre-restart prompts. **Why this matters:** restart is supposed to create a fresh startup attempt. If timeout evidence is anchored to the first creation, operators see misleading "stalled for hours" reports and stale blocker classifications for a brand-new restart, which breaks recovery decisions. Source: gaebal-gajae dogfood response to Clawhip message `1507095667959402638` on 2026-05-21.