docs(roadmap): add repl output amplification

This commit is contained in:
Yeachan-Heo
2026-05-20 15:01:44 +00:00
parent 25b8dbb313
commit 7e73cdb60f


@@ -6553,3 +6553,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
498. **`PermissionEnforcer::check_bash` allowlists `cargo` and `rustc` as read-only, bypassing the canonical package/build-state classifier and allowing state-mutating Rust toolchain commands in `read-only` mode** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 14:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@5a4a8eb`. Code inspection: `rust/crates/runtime/src/permission_enforcer.rs::is_read_only_command` includes `cargo` and `rustc` in the read-only allowlist and only rejects `-i`, `--in-place`, ` > `, or ` >> `. The canonical `bash_validation.rs` pipeline treats `cargo` as package/build-state management (`STATE_MODIFYING_COMMANDS` / `PACKAGE_COMMANDS`) and has a regression that `npm install` is blocked in read-only mode, but the runtime enforcer does not call that pipeline. As a result, commands like `cargo install cargo-edit`, `cargo add anyhow`, `cargo generate-lockfile`, `cargo update`, `cargo fix --allow-dirty`, and `cargo build` (writes `target/`) are classified as read-only by the actual enforcer. `rustc -o out main.rs` likewise writes an output binary without shell redirection and is allowed. **Required fix shape:** (a) remove `cargo` and `rustc` from the blanket read-only allowlist; (b) optionally add a conservative subcommand classifier where only clearly non-mutating forms (`cargo --version`, `cargo metadata --no-deps` if proven side-effect-free, `rustc --version`) are read-only; (c) route `check_bash` through the canonical `bash_validation` pipeline; (d) add regressions for `cargo install`, `cargo add`, `cargo update`, `cargo build`, `cargo test`, and `rustc -o` under `PermissionMode::ReadOnly`. **Why this matters:** build/package commands routinely modify the workspace, global cargo home, lockfiles, and build artifacts. A read-only exploratory lane should not be able to install packages or rewrite lock/build outputs just because the first token is `cargo`. 
Source: gaebal-gajae dogfood response to Clawhip message `1506657779883184250` on 2026-05-20.
499. **`TodoWrite` returns both `oldTodos` and `newTodos` in every tool result, so large task boards are echoed twice per update and repeatedly burn context even though the model only needs the delta/current list** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 14:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@3d2a047`. Code inspection: `run_todo_write` serializes `execute_todo_write(input)?` via `to_pretty_json`; `execute_todo_write` reads the persisted todo store into `old_todos`, writes the new/persisted list, then returns `TodoWriteOutput { old_todos, new_todos: input.todos, verification_nudge_needed }`. The JSON field names are `oldTodos` and `newTodos`, so every TodoWrite result contains the entire previous board plus the entire submitted board. For a 200-item task board, a one-item status change returns roughly 400 todo objects to the model; repeated status updates multiply the same backlog text across the context window. This is the same output-amplification class as NotebookEdit (#500), but on the core planning/task-control surface rather than notebooks. **Required fix shape:** (a) replace `oldTodos` with a compact diff (`changed:[{id/content,status_before,status_after}]`, `added`, `removed`, `unchanged_count`) or hide it behind a debug flag; (b) keep `newTodos` only if the current board is below a safe size, otherwise return `current_count`, `open_count`, `completed_count`, and a truncated active subset; (c) include `truncated:true`/`omitted_old_count` metadata for large boards; (d) add regressions proving single-item updates on large boards do not serialize the entire old board. **Why this matters:** TodoWrite is called frequently in multi-step sessions. Echoing full before/after state on every update creates context-window pressure, increases cost, and makes compaction summaries noisier without adding useful operator signal. 
Source: gaebal-gajae dogfood response to Clawhip message `1506665332478050344` on 2026-05-20.
500. **`REPL` tool returns unbounded raw `stdout`/`stderr` strings, so a tiny inline snippet can inject megabytes of output into the model context just like the NotebookEdit/TodoWrite amplification class** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 15:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@25b8dbb`. Code inspection: `execute_repl` in `rust/crates/tools/src/lib.rs` runs python/node/shell snippets with piped stdout/stderr, then returns `ReplOutput { stdout: String::from_utf8_lossy(&output.stdout).into_owned(), stderr: String::from_utf8_lossy(&output.stderr).into_owned(), ... }` serialized via `to_pretty_json`. There is no byte cap, line cap, truncation marker, or structured artifact path. A user/model can run `REPL(language:"python", code:"print('x'*5_000_000)")` and the full 5MB output is returned to the model as a JSON string; stderr has the same issue. This is distinct from bash timeout/provenance handling because REPL is marketed as a structured execution helper, yet it has no output budget. **Required fix shape:** (a) cap `stdout` and `stderr` in `ReplOutput` (e.g. first/last 64KB) with `stdout_truncated`, `stderr_truncated`, `stdout_bytes`, `stderr_bytes`; (b) optionally spill full output to an artifact file and return `artifact_path` only when safe; (c) apply the same cap to PowerShell/bash if not already covered; (d) add regressions for large stdout/stderr from python/node/shell proving the serialized tool result stays bounded and includes truncation metadata. **Why this matters:** REPL is an easy path for accidental context-window blowups (`print(large_df)`, stack traces, generated JSON). Without output budgets, a single tool call can consume the context window, trigger compaction, or hide the useful signal behind megabytes of raw output. Source: gaebal-gajae dogfood response to Clawhip message `1506672878047989812` on 2026-05-20.
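Review bot: the new entry is numbered 500, but the unchanged entry 499 just above it already points at "NotebookEdit (#500)" as the earlier filing of this amplification class, and the new text itself refers back to "the NotebookEdit/TodoWrite amplification class". Either the NotebookEdit filing lives under a different number and the cross-reference in 499 should be corrected in this same commit, or the REPL entry should take the next free number; as committed, the roadmap has two different things called #500. Two smaller points on the fix shape: (a) says to cap `stdout`/`stderr` "in `ReplOutput`", but the visible `ReplOutput { stdout: String::from_utf8_lossy(&output.stdout)... }` suggests the whole child output is collected before it is converted, so a cap applied at serialization time only bounds what reaches the model, not what the tool process buffers; the filing should say the budget is enforced while reading the pipes, with the child drained or killed once the budget is hit so it does not block on a full pipe, and that the cut is made on a char boundary after the lossy conversion so the JSON string is not split inside a multi-byte sequence. And (c), "apply the same cap to PowerShell/bash if not already covered", leaves the bash side undecided even though the entry asserts REPL is "distinct from bash timeout/provenance handling"; state whether the bash tool already has an output budget so (c) is actionable rather than conditional.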