docs(roadmap): add powershell output amplification

2026-05-21 13:16:45 +00:00 · 2026-05-20 15:30:42 +00:00
parent 7e73cdb60f
commit 51ea1aa01e
1 changed files with 2 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6555,3 +6555,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 499. **`TodoWrite` returns both `oldTodos` and `newTodos` in every tool result, so large task boards are echoed twice per update and repeatedly burn context even though the model only needs the delta/current list** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 14:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@3d2a047`. Code inspection: `run_todo_write` serializes `execute_todo_write(input)?` via `to_pretty_json`; `execute_todo_write` reads the persisted todo store into `old_todos`, writes the new/persisted list, then returns `TodoWriteOutput { old_todos, new_todos: input.todos, verification_nudge_needed }`. The JSON field names are `oldTodos` and `newTodos`, so every TodoWrite result contains the entire previous board plus the entire submitted board. For a 200-item task board, a one-item status change returns roughly 400 todo objects to the model; repeated status updates multiply the same backlog text across the context window. This is the same output-amplification class as NotebookEdit (#500), but on the core planning/task-control surface rather than notebooks. **Required fix shape:** (a) replace `oldTodos` with a compact diff (`changed:[{id/content,status_before,status_after}]`, `added`, `removed`, `unchanged_count`) or hide it behind a debug flag; (b) keep `newTodos` only if the current board is below a safe size, otherwise return `current_count`, `open_count`, `completed_count`, and a truncated active subset; (c) include `truncated:true`/`omitted_old_count` metadata for large boards; (d) add regressions proving single-item updates on large boards do not serialize the entire old board. **Why this matters:** TodoWrite is called frequently in multi-step sessions. Echoing full before/after state on every update creates context-window pressure, increases cost, and makes compaction summaries noisier without adding useful operator signal. Source: gaebal-gajae dogfood response to Clawhip message `1506665332478050344` on 2026-05-20.

 500. **`REPL` tool returns unbounded raw `stdout`/`stderr` strings, so a tiny inline snippet can inject megabytes of output into the model context just like the NotebookEdit/TodoWrite amplification class** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 15:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@25b8dbb`. Code inspection: `execute_repl` in `rust/crates/tools/src/lib.rs` runs python/node/shell snippets with piped stdout/stderr, then returns `ReplOutput { stdout: String::from_utf8_lossy(&output.stdout).into_owned(), stderr: String::from_utf8_lossy(&output.stderr).into_owned(), ... }` serialized via `to_pretty_json`. There is no byte cap, line cap, truncation marker, or structured artifact path. A user/model can run `REPL(language:"python", code:"print('x'*5_000_000)")` and the full 5MB output is returned to the model as a JSON string; stderr has the same issue. This is distinct from bash timeout/provenance handling because REPL is marketed as a structured execution helper, yet it has no output budget. **Required fix shape:** (a) cap `stdout` and `stderr` in `ReplOutput` (e.g. first/last 64KB) with `stdout_truncated`, `stderr_truncated`, `stdout_bytes`, `stderr_bytes`; (b) optionally spill full output to an artifact file and return `artifact_path` only when safe; (c) apply the same cap to PowerShell/bash if not already covered; (d) add regressions for large stdout/stderr from python/node/shell proving the serialized tool result stays bounded and includes truncation metadata. **Why this matters:** REPL is an easy path for accidental context-window blowups (`print(large_df)`, stack traces, generated JSON). Without output budgets, a single tool call can consume the context window, trigger compaction, or hide the useful signal behind megabytes of raw output. Source: gaebal-gajae dogfood response to Clawhip message `1506672878047989812` on 2026-05-20.
+
+501. **`PowerShell`/shared `execute_shell_command` returns unbounded raw stdout/stderr in every foreground path, so shell output can still flood the model context even after the REPL output-amplification finding** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 15:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@7e73cdb`. Code inspection: `execute_powershell` delegates to `execute_shell_command`, whose timeout-success, timeout-kill, and no-timeout paths all construct `runtime::BashCommandOutput` with `stdout: String::from_utf8_lossy(&output.stdout).into_owned()` and `stderr: String::from_utf8_lossy(&output.stderr).into_owned()`. The background path discards output, but every foreground path serializes the full captured streams. A command like `PowerShell(command:"1..1000000 | % { 'x' }")` or any verbose test/log script can return megabytes of JSON-escaped output to the model. The timeout path is especially bad: it kills the process but still returns everything produced before the timeout plus the timeout footer, with no cap or `truncated` metadata. **Required fix shape:** (a) introduce a shared output-budget helper for all shell-like tools (`bash`, `PowerShell`, `REPL`) with byte caps, first/last slicing, and `stdout_truncated`/`stderr_truncated`/byte-count metadata; (b) preserve existing `raw_output_path`/`persisted_output_path` semantics by spilling full streams to artifact files when safe; (c) apply caps before JSON serialization in both success and timeout branches; (d) add regressions using stub `pwsh` and shell commands that emit >cap stdout/stderr and timeout-with-output. **Why this matters:** verbose tests and build tools are normal claw-code workflows. A single noisy PowerShell/shell command should not silently consume the context window or force compaction; the model needs a bounded summary plus a way to fetch artifacts if the full log is needed. Source: gaebal-gajae dogfood response to Clawhip message `1506680431620395091` on 2026-05-20.