docs(roadmap): add count_tokens retry coverage gap

2026-05-22 13:46:44 +00:00 · 2026-05-21 18:30:53 +00:00
parent 762ee656f3
commit 51c05ac359
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6657,3 +6657,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 550. **Agent manifest timestamp tests only assert non-empty/presence, so epoch-seconds `createdAt`/`startedAt`/`completedAt` and lane-event timestamps are baked into tests as acceptable output** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 17:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@02b6855` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lib.rs:2734-2755` defines `AgentOutput.createdAt`, `startedAt`, `completedAt`, and `laneEvents`. Production fills them via `iso8601_now()` at `tools/src/lib.rs:3663`, `3693-3696`, `3912`, and terminal `LaneEvent::*` calls. But the main manifest regression at `tools/src/lib.rs:8147-8152` only asserts `!manifest.created_at.is_empty()`, `started_at.is_some()`, and `completed_at.is_none()`; later completed/failed manifest tests assert event names/status/details/data but never assert that `createdAt`, `startedAt`, `completedAt`, or `laneEvents[].emittedAt` parse as RFC3339. As a result, the current epoch-seconds strings from `iso8601_now()` satisfy the test suite, and a future producer fix could regress again without a red test. **Required fix shape:** (a) add a shared test helper that validates RFC3339/ISO-8601 UTC strings for AgentOutput top-level timestamps and every lane event `emittedAt`; (b) update creation, completion, failure, spawn-error, recovery, review, selection, and artifact manifest tests to call it; (c) add explicit negative/unit coverage proving epoch strings like `"1748004000"` fail the helper; (d) align test fixtures with the G004 conformance timestamp validation required by #549; (e) ensure tests check serialized JSON fields, not only in-memory struct presence. **Why this matters:** tests are currently encoding the broken contract as acceptable by checking only non-empty strings. Without semantic timestamp assertions, timestamp-format fixes in #548/#549 have no durable safety net and downstream parser breakage can reappear silently. Source: gaebal-gajae dogfood response to Clawhip message `1507073019296874608` on 2026-05-21.

 551. **Lane completion ignores timestamp validity and recency entirely, so stale or epoch-formatted AgentOutput manifests can be auto-marked completed if status/tests/push flags are green** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@2bc85c5` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lane_completion.rs::detect_lane_completion` decides completion only from `error.is_none()`, `status` being completed/finished, `current_blocker.is_none()`, `test_green`, and `has_pushed`. It never reads or validates `AgentOutput.createdAt`, `startedAt`, `completedAt`, or `laneEvents[].emittedAt`. The test fixture at `lane_completion.rs:103-120` uses RFC3339-looking timestamps, but there is no negative coverage for epoch strings or stale completedAt values. Therefore a manifest with `completedAt:"1748004000"` (from the broken `iso8601_now()` producer) or a very old completion timestamp can still generate a completed `LaneContext` as long as the status/tests/push booleans are favorable. This is distinct from #550's manifest test coverage gap: the completion policy itself treats temporal fields as irrelevant even though stale-session cleanup and lane closeout depend on trustworthy event time. **Required fix shape:** (a) parse and validate `completedAt` (and ideally `createdAt`/`startedAt`/terminal lane event `emittedAt`) before auto-completion; (b) reject or mark degraded manifests whose timestamp fields are missing, non-RFC3339, or implausibly stale relative to the current run; (c) thread a typed temporal blocker/degraded reason into `LaneContext` instead of silently completing; (d) add tests where status/tests/push are green but `completedAt` is `"1748004000"`, `"not-a-date"`, missing, or older than a configured freshness window, proving completion is withheld or degraded; (e) align this with #548-#550 so timestamp producer, validator, tests, and completion policy share one contract. **Why this matters:** completion is an action trigger, not just a display label. If stale or malformed manifests can auto-close lanes, claws may cleanup sessions, suppress reminders, or report success based on artifacts whose time identity is untrustworthy. Source: gaebal-gajae dogfood response to Clawhip message `1507080567731261655` on 2026-05-21.
+
+552. **Anthropic retry tests cover `/v1/messages` but not the preflight `/v1/messages/count_tokens` endpoint, so a transient 429 during token counting can be silently swallowed or skip refined context-window enforcement with no regression coverage** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@762ee65` and binary built from source SHA `25d663d`. Code/test inspection: `api/src/providers/anthropic.rs::preflight_message_request` calls `self.count_tokens(request).await`, but on any error it immediately `return Ok(())` and falls back to the coarse local byte guard. The `count_tokens` implementation sends one bare request to `/v1/messages/count_tokens` with no retry. Existing integration tests in `api/tests/client_integration.rs` cover retry behavior for `/v1/messages` (`retries_retryable_failures_before_succeeding`, `surfaces_retry_exhaustion_for_persistent_retryable_errors`, `retries_multiple_retryable_failures_with_exponential_backoff_and_jitter`) and local oversized-byte blocking, but there is no test where count_tokens returns 429/503 once then succeeds, nor a test asserting refined context-window rejection still happens after a retry. Therefore a count-token rate limit can disable the precise token guard and still let `send_message` proceed to `/v1/messages`, while the retry suite stays green. **Required fix shape:** (a) add integration tests with mock server sequence `429 count_tokens -> 200 count_tokens -> ...` proving the same retry policy applies before `/v1/messages`; (b) add a test where retried count_tokens returns a token count that exceeds the model window and assert no `/v1/messages` call is sent; (c) when count_tokens errors are intentionally best-effort, expose structured telemetry/warnings so skipped refined preflight is observable; (d) then implement retry or explicitly document degraded behavior; (e) keep request-count assertions path-aware so count_tokens and messages attempts are distinguishable. **Why this matters:** the retry suite currently gives false confidence because it exercises only the final message endpoint. Under rate pressure, the preflight endpoint is exactly where retries matter for safe compaction/context-window decisions. Source: gaebal-gajae dogfood response to Clawhip message `1507088113892331622` on 2026-05-21.