docs(roadmap): add auto compaction telemetry gap

2026-05-22 21:56:45 +00:00 · 2026-05-21 22:30:42 +00:00
parent a8c67b08e2
commit f45b651e18
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6673,3 +6673,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 558. **Tool-permission gate detection is hard-coded to one English MCP prompt shape, so alternate MCP approval wording can fall through as startup-no-evidence instead of `ToolPermissionRequired`** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9fd61af` and binary built from source SHA `25d663d`. Active tmux session at probe time: `omx-issue-2443-ralplan-consensus-resume`. Code inspection: `detect_tool_permission_prompt` in `runtime/src/worker_boot.rs:958-999` only enters when the full screen contains either `allow the` + `server` + `tool` + `run`, or `allow tool` + `run`. The only production-shaped tests at `worker_boot.rs:1387-1474` use exactly `Allow the omx_memory MCP server to run tool "..."?`. Common equivalent approval copy such as `Allow MCP server omx_memory to call tool "project_memory_read"?`, `Allow server omx_memory to execute tool ...`, `Approve tool project_memory_read from omx_memory?`, or localized/shorter plugin prompts do not contain the exact `allow the ... server ... run tool` / `allow tool ... run` token pattern. When those appear during boot, `observe` will not set `WorkerStatus::ToolPermissionRequired`, no structured `ToolPermissionPrompt` payload is emitted, and later timeout evidence can degrade to generic `startup_no_evidence` or worker-crashed classification even though the pane clearly showed an approval gate. **Required fix shape:** (a) replace phrase-order checks with a tolerant classifier over permission verbs (`allow`/`approve`/`permit`), execution verbs (`run`/`call`/`execute`), and MCP/tool tokens independent of order; (b) add fixture tests for at least three real-world prompt variants, including `call tool` and prompts where the tool name appears before the server; (c) preserve extracted `server_name`, `tool_name`, allow-scope, and raw `prompt_preview` even when fields are partial; (d) emit an `Unknown`-scope tool-permission event rather than falling through when the approval intent is clear but parsing is incomplete; (e) include classifier confidence/reason in startup timeout evidence so UI wording drift is visible. **Why this matters:** MCP permission prompts are exactly the kind of boot blocker operators need to resolve quickly. A brittle single-template detector converts an actionable “click allow” condition into opaque startup failure noise whenever plugin/UI copy drifts. Source: gaebal-gajae dogfood response to Clawhip message `1507133416884404254` on 2026-05-21.

 559. **Auto-compaction can refuse to compact very large short sessions because `compact_session` still enforces the default message-count gate even after real provider usage crosses the input-token threshold** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 22:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9ef521b` and binary built from source SHA `25d663d`. Active tmux sessions at probe time: `gajae-issue-313-omx-launch-resilience-receipt`, `omx-pr-2447-ralplan-consensus-final-review`. Code inspection: `ConversationRuntime::maybe_auto_compact` at `runtime/src/conversation.rs:559-572` triggers from actual cumulative provider `input_tokens`, then calls `compact_session` with only `max_estimated_tokens: 0` overridden. But `compact_session` first calls `should_compact`, and `should_compact` at `runtime/src/compact.rs:41-50` still requires `compactable.len() > config.preserve_recent_messages` before considering token budget. Because `CompactionConfig::default().preserve_recent_messages` is 4, a session with 1-4 extremely large messages and real provider usage above `auto_compaction_input_tokens_threshold` returns `removed_message_count == 0`; `maybe_auto_compact` then silently returns `None`. The existing auto-compaction regression at `conversation.rs:1520-1572` seeds enough turns so the message-count predicate passes, but it does not cover a short huge transcript that crosses the real usage threshold. **Required fix shape:** (a) distinguish manual estimated-token compaction from auto-compaction-after-real-usage; (b) when actual provider usage crosses the threshold, allow compaction even if message count is <= default preserved tail, while preserving at least the latest user/assistant boundary safely; (c) add a regression with one or two huge messages plus `AssistantEvent::Usage { input_tokens: 120_000 }` proving auto-compaction emits `AutoCompactionEvent`; (d) make the skip reason observable when auto-compaction threshold is crossed but no messages are removed (`too_few_messages`, `tool_boundary`, `empty_prefix`, etc.); (e) ensure tool-use/tool-result boundary protection still wins over unsafe compaction. **Why this matters:** provider-reported usage is the trusted signal that the context window is hot. If auto-compaction ignores that signal for short but huge sessions, users hit context exhaustion with no auto-compaction event and no explanation. Source: gaebal-gajae dogfood response to Clawhip message `1507140966774079521` on 2026-05-21.
+
+560. **Auto-compaction observability reports only removed message count, not token pressure or before/after token estimates, so operators cannot tell whether compaction actually relieved context risk** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 22:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@a8c67b0` and binary built from source SHA `25d663d`. Active tmux sessions at probe time: `gajae-issue-314-omx-launch-resilience-review`, `gajae-pr-314-final-review`, `omx-pr-2447-ralplan-consensus-review3`. Code inspection: `AutoCompactionEvent` at `runtime/src/conversation.rs:123-127` contains only `removed_message_count`; `maybe_auto_compact` at `conversation.rs:559-581` has access to cumulative provider usage and the pre/post session but drops all token-pressure context. The CLI notice at `rusty-claude-cli/src/main.rs:3560-3562` prints only `[auto-compacted: removed N messages]`, and JSON output at `main.rs:5111-5114` exposes only `removed_messages` plus that same notice. The parity harness assertion at `mock_parity_harness.rs:691-712` only checks that the `auto_compaction` key exists and usage input tokens are large; it does not require any before/after estimate, threshold, trigger usage, or reduction metadata. **Required fix shape:** (a) extend `AutoCompactionEvent` with `trigger_input_tokens`, `threshold_input_tokens`, `estimated_tokens_before`, `estimated_tokens_after`, `removed_message_count`, and a `reason/trigger` enum; (b) compute before/after estimates around `compact_session` and include them in CLI text and JSON; (c) add tests proving JSON contains these fields for auto-compaction and that the CLI notice includes enough context to debug risk relief; (d) include a skipped/degraded event shape when threshold crossed but no messages removed, tying into #559; (e) update parity harness to assert semantic values, not merely key presence. **Why this matters:** auto-compaction is a context-safety action. Without token-pressure and reduction telemetry, a user sees “removed 2 messages” but cannot tell if the session went from 120k to 5k tokens, 120k to 119k, or failed to relieve the risk at all. Source: gaebal-gajae dogfood response to Clawhip message `1507148512159203338` on 2026-05-21.