docs(roadmap): add auto compaction short huge session gap

2026-05-22 21:56:45 +00:00 · 2026-05-21 22:00:55 +00:00
parent 9ef521bb98
commit a8c67b08e2
1 changed file with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6671,3 +6671,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 557. **Wrong-task prompt-misdelivery detection only recognizes `›` prompt echoes, so `>` / `❯` agent prompts can hide mismatched-task receipts until timeout** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@bc55711` and binary built from source SHA `25d663d`. Active tmux session at probe time: `gajae-issue-311-auto-merge-race-receipt`. Code inspection: worker readiness accepts multiple prompt glyphs (`>`, `›`, `❯`) in `detect_ready_for_prompt` at `runtime/src/worker_boot.rs:1079-1113`, but `detect_prompt_echo` at `worker_boot.rs:1210-1217` only strips a leading `›`. `detect_prompt_misdelivery` relies on `detect_prompt_echo` for the `mismatched_prompt_visible` path that catches wrong-task receipts when the screen shows a different task prompt. The existing regression `wrong_task_receipt_mismatch_is_detected_before_execution_continues` uses `› Explain this KakaoTalk screenshot...`, so it exercises only the single supported glyph. If the coding agent UI echoes `> Explain...` or `❯ Explain...`, `observed_prompt_preview` is `None`; when the expected prompt text is not also visible, the wrong-task mismatch is not detected and the worker stays `Running` until coarse startup timeout classification. **Required fix shape:** (a) make prompt-echo parsing share the same glyph set as `detect_ready_for_prompt` (`>`, `›`, `❯`, and boxed `│ >` variants if present); (b) add wrong-task receipt tests for `>`, `›`, and `❯` echoes; (c) include the raw echo line/glyph in `WorkerEventPayload::PromptDelivery` or event detail so operators can diagnose UI variant drift; (d) ensure shell prompt detection remains separate so real shell prompts are still classified as `Shell`, not wrong-task agent echoes; (e) add a timeout evidence regression proving observed prompt preview is populated for all supported glyphs. 
**Why this matters:** prompt-misdelivery protection is only as good as the UI echo parser. Supporting multiple ready glyphs but only one echo glyph creates event/log opacity: operators see a generic timeout instead of a precise wrong-task replay condition for common terminal themes or agent UIs. Source: gaebal-gajae dogfood response to Clawhip message `1507125863408341102` on 2026-05-21.
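Fix shape (a) for item 557 can be illustrated with a minimal sketch of glyph-tolerant echo parsing. This is not the real `detect_prompt_echo`: the helper name `strip_prompt_glyph`, the constant `PROMPT_GLYPHS`, and the handling of the boxed `│ >` variant are assumptions made for illustration only.

```rust
/// Glyphs that item 557 says readiness detection already accepts.
const PROMPT_GLYPHS: [char; 3] = ['>', '›', '❯'];

/// Hypothetical helper (not the real `detect_prompt_echo`): returns the
/// leading glyph and the echoed prompt text, or `None` if the line does not
/// look like an agent prompt echo.
fn strip_prompt_glyph(line: &str) -> Option<(char, &str)> {
    let trimmed = line.trim_start();
    // Tolerate a boxed `│ >` variant by dropping one leading border character.
    let trimmed = trimmed
        .strip_prefix('│')
        .map(str::trim_start)
        .unwrap_or(trimmed);

    let mut chars = trimmed.chars();
    let glyph = chars.next()?;
    if !PROMPT_GLYPHS.contains(&glyph) {
        return None;
    }
    // Whatever follows the glyph is the echoed prompt text.
    let rest = chars.as_str().trim();
    if rest.is_empty() {
        None
    } else {
        Some((glyph, rest))
    }
}
```

Returning the glyph alongside the text is one way to carry the raw echo variant into event detail, as fix shape (c) asks. A line such as `$ ls` yields `None` here because `$` is not in the glyph set, which leaves shell prompt classification to a separate path.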

 558. **Tool-permission gate detection is hard-coded to one English MCP prompt shape, so alternate MCP approval wording can fall through as startup-no-evidence instead of `ToolPermissionRequired`** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9fd61af` and binary built from source SHA `25d663d`. Active tmux session at probe time: `omx-issue-2443-ralplan-consensus-resume`. Code inspection: `detect_tool_permission_prompt` in `runtime/src/worker_boot.rs:958-999` only enters when the full screen contains either `allow the` + `server` + `tool` + `run`, or `allow tool` + `run`. The only production-shaped tests at `worker_boot.rs:1387-1474` use exactly `Allow the omx_memory MCP server to run tool "..."?`. Common equivalent approval copy such as `Allow MCP server omx_memory to call tool "project_memory_read"?`, `Allow server omx_memory to execute tool ...`, `Approve tool project_memory_read from omx_memory?`, or localized/shorter plugin prompts do not contain the exact `allow the ... server ... run tool` / `allow tool ... run` token pattern. When those appear during boot, `observe` will not set `WorkerStatus::ToolPermissionRequired`, no structured `ToolPermissionPrompt` payload is emitted, and later timeout evidence can degrade to generic `startup_no_evidence` or worker-crashed classification even though the pane clearly showed an approval gate. 
**Required fix shape:** (a) replace phrase-order checks with a tolerant classifier over permission verbs (`allow`/`approve`/`permit`), execution verbs (`run`/`call`/`execute`), and MCP/tool tokens independent of order; (b) add fixture tests for at least three real-world prompt variants, including `call tool` and prompts where the tool name appears before the server; (c) preserve extracted `server_name`, `tool_name`, allow-scope, and raw `prompt_preview` even when fields are partial; (d) emit an `Unknown`-scope tool-permission event rather than falling through when the approval intent is clear but parsing is incomplete; (e) include classifier confidence/reason in startup timeout evidence so UI wording drift is visible. **Why this matters:** MCP permission prompts are exactly the kind of boot blocker operators need to resolve quickly. A brittle single-template detector converts an actionable “click allow” condition into opaque startup failure noise whenever plugin/UI copy drifts. Source: gaebal-gajae dogfood response to Clawhip message `1507133416884404254` on 2026-05-21.
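The order-independent classifier described in fix shape (a) for item 558 might look roughly like the sketch below. It is not the real `detect_tool_permission_prompt`: the names `GateGuess` and `classify_permission_prompt`, the word tokenization, and the three-way result are all assumptions chosen to illustrate the idea.

```rust
/// Hypothetical classification result; `Partial` stands in for the
/// `Unknown`-scope event suggested in fix shape (d).
#[derive(Debug, PartialEq)]
enum GateGuess {
    Confident,
    Partial,
    NotAGate,
}

/// True if any token in `words` appears in `set`.
fn has_any(words: &[&str], set: &[&str]) -> bool {
    words.iter().any(|w| set.contains(w))
}

/// Hypothetical classifier: looks for permission verbs, execution verbs,
/// and MCP/tool tokens anywhere on the screen, regardless of word order.
fn classify_permission_prompt(screen: &str) -> GateGuess {
    let lowered = screen.to_lowercase();
    // Split on anything that is not alphanumeric or `_`, so identifiers
    // such as `omx_memory` stay whole and `runtime` never matches `run`.
    let words: Vec<&str> = lowered
        .split(|c: char| !c.is_alphanumeric() && c != '_')
        .filter(|w| !w.is_empty())
        .collect();

    let permission = has_any(&words, &["allow", "approve", "permit"]);
    let execution = has_any(&words, &["run", "call", "execute"]);
    let subject = has_any(&words, &["tool", "mcp", "server"]);

    match (permission, subject, execution) {
        (true, true, true) => GateGuess::Confident,
        // Approval intent is clear, but no execution verb was found.
        (true, true, false) => GateGuess::Partial,
        _ => GateGuess::NotAGate,
    }
}
```

Under this sketch, `Allow MCP server omx_memory to call tool "project_memory_read"?` classifies as `Confident`, while `Approve tool project_memory_read from omx_memory?` classifies as `Partial` instead of falling through. Localized prompts would still need their own vocabulary, which this sketch does not attempt.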
+
+559. **Auto-compaction can refuse to compact very large short sessions because `compact_session` still enforces the default message-count gate even after real provider usage crosses the input-token threshold** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 22:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9ef521b` and binary built from source SHA `25d663d`. Active tmux sessions at probe time: `gajae-issue-313-omx-launch-resilience-receipt`, `omx-pr-2447-ralplan-consensus-final-review`. Code inspection: `ConversationRuntime::maybe_auto_compact` at `runtime/src/conversation.rs:559-572` triggers from actual cumulative provider `input_tokens`, then calls `compact_session` with only `max_estimated_tokens: 0` overridden. But `compact_session` first calls `should_compact`, and `should_compact` at `runtime/src/compact.rs:41-50` still requires `compactable.len() > config.preserve_recent_messages` before considering token budget. Because `CompactionConfig::default().preserve_recent_messages` is 4, a session with 1-4 extremely large messages and real provider usage above `auto_compaction_input_tokens_threshold` returns `removed_message_count == 0`; `maybe_auto_compact` then silently returns `None`. The existing auto-compaction regression at `conversation.rs:1520-1572` seeds enough turns so the message-count predicate passes, but it does not cover a short huge transcript that crosses the real usage threshold. 
**Required fix shape:** (a) distinguish manual estimated-token compaction from auto-compaction-after-real-usage; (b) when actual provider usage crosses the threshold, allow compaction even if message count is <= default preserved tail, while preserving at least the latest user/assistant boundary safely; (c) add a regression with one or two huge messages plus `AssistantEvent::Usage { input_tokens: 120_000 }` proving auto-compaction emits `AutoCompactionEvent`; (d) make the skip reason observable when auto-compaction threshold is crossed but no messages are removed (`too_few_messages`, `tool_boundary`, `empty_prefix`, etc.); (e) ensure tool-use/tool-result boundary protection still wins over unsafe compaction. **Why this matters:** provider-reported usage is the trusted signal that the context window is hot. If auto-compaction ignores that signal for short but huge sessions, users hit context exhaustion with no auto-compaction event and no explanation. Source: gaebal-gajae dogfood response to Clawhip message `1507140966774079521` on 2026-05-21.
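The gate interaction in item 559 can be modeled with a simplified sketch. It does not reproduce `should_compact` or `compact_session`: the types `Cfg`, `Trigger`, and `CompactDecision`, the function `decide`, and the rule that a provider-usage trigger needs at least two messages so the latest one can be preserved are all assumptions for illustration.

```rust
/// Hypothetical stand-in for the relevant `CompactionConfig` fields.
struct Cfg {
    preserve_recent_messages: usize,
    max_estimated_tokens: usize,
}

/// Hypothetical distinction requested in fix shape (a).
enum Trigger {
    /// Compaction driven by a local token estimate.
    ManualEstimate,
    /// Compaction driven by real provider-reported input tokens.
    ProviderUsage,
}

#[derive(Debug, PartialEq)]
enum CompactDecision {
    Compact,
    /// Carries an observable skip reason, as in fix shape (d).
    Skip(&'static str),
}

fn decide(
    message_count: usize,
    estimated_tokens: usize,
    cfg: &Cfg,
    trigger: Trigger,
) -> CompactDecision {
    match trigger {
        Trigger::ManualEstimate => {
            // Models the reported behavior: the message-count gate is
            // checked before the token budget is considered at all.
            if message_count <= cfg.preserve_recent_messages {
                CompactDecision::Skip("too_few_messages")
            } else if estimated_tokens < cfg.max_estimated_tokens {
                CompactDecision::Skip("under_token_budget")
            } else {
                CompactDecision::Compact
            }
        }
        Trigger::ProviderUsage => {
            // Assumption: the threshold was already crossed upstream, so
            // only require enough messages to keep the latest one intact.
            if message_count < 2 {
                CompactDecision::Skip("too_few_messages")
            } else {
                CompactDecision::Compact
            }
        }
    }
}
```

With `preserve_recent_messages` set to 4, a two-message session is skipped under `ManualEstimate` no matter how large its token count, which mirrors the reported gap. Under `ProviderUsage` the same session compacts. Tool-use/tool-result boundary protection from fix shape (e) is deliberately left out of this sketch.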