docs(roadmap): add mcp stdio frame size cap gap

2026-05-22 21:56:45 +00:00 · 2026-05-22 19:31:27 +00:00
parent d996b65d64
commit 46f3bff7ef
1 changed files with 4 additions and 0 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6739,3 +6739,7 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 592. **Plugin degraded-mode reporting drops degraded-server errors because `PluginState::Degraded` only preserves failed server details, so a server marked degraded can appear fully healthy/available in the operator contract** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 18:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@1df4114`. Active tmux session at probe time: `gajae-issue-357-review-session-vanish-replacement`; no active claw-code implementation session. Focused validation: `cd rust && cargo test -p runtime plugin_lifecycle -- --nocapture` passed 6/6, confirming the current locked contract. Code inspection: `runtime/src/plugin_lifecycle.rs::PluginState::from_servers` treats `ServerStatus::Degraded` as usable by placing it in `healthy_servers`, but the `PluginState::Degraded` variant stores only `healthy_servers: Vec<String>` and `failed_servers: Vec<ServerHealth>`. `PluginHealthcheck::degraded_mode` then builds `unavailable_tools` only from `failed_servers` and reports a reason of `"N servers healthy, M servers failed"`. Existing test `degraded_server_status_keeps_server_usable` asserts a degraded `beta` server is included in `healthy_servers` and `failed_servers` is empty, but it does not assert that `beta`'s `last_error` (`high latency`) or degraded status is visible in `degraded_mode`. Result: a plugin with one healthy server and one degraded-but-usable server can emit `startup_degraded` while its degraded-mode payload says `2 servers healthy, 0 servers failed`, has no degraded server list, and can mark the degraded server's tools as simply available if discovery returns them. Operators and automation see the terminal state but lose the actual degraded reason/capability risk. **Required fix shape:** (a) split `PluginState::Degraded` into `healthy_servers`, `degraded_servers: Vec<ServerHealth>`, and `failed_servers`, or preserve all non-healthy server health records; (b) make `degraded_mode` include `degraded_tools`/`degraded_servers` with `last_error` separately from unavailable failed tools; (c) update the reason string/counts to distinguish healthy, degraded, and failed rather than folding degraded into healthy; (d) add tests proving a `ServerStatus::Degraded` server's status/error/capabilities appear in the degraded-mode JSON while remaining callable when appropriate; (e) align this with MCP degraded reports so partial plugin startup carries both availability and quality-of-service impact. **Why this matters:** partial startup needs impact opacity removed. A degraded server is not failed, but it also is not fully healthy; folding it into the healthy list makes `startup_degraded` hard to diagnose and can trick recovery logic into thinking no server has actionable degradation details. Source: gaebal-gajae dogfood response to Clawhip message `1507450501925441616` on 2026-05-22.

 593. **Plugin uninstall is a three-store non-atomic transaction: it deletes the plugin directory before persisting registry/settings cleanup, so an IO failure can leave registry and `enabledPlugins` pointing at a removed plugin** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 19:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@244bdb7`. Active tmux session at probe time: `gajae-pr-358-review-session-vanish-replacement-review`; no active claw-code implementation session. Focused validation: `cd rust && cargo test -p plugins installs_enables_updates_and_uninstalls_external_plugins -- --nocapture` passed 1/1, confirming the happy-path lifecycle test does not cover interrupted persistence. Code inspection: `plugins/src/lib.rs::PluginManager::uninstall` removes the plugin record from the in-memory registry, then `remove_dir_all(record.install_path)`, then `store_registry(&registry)`, then `write_enabled_state(plugin_id, None)`. If directory removal succeeds but `store_registry` fails, `installed.json` still records the old plugin while the on-disk install directory is gone. If `store_registry` succeeds but `write_enabled_state` fails, the registry and disk are removed but `.claw/settings.json` can still contain `enabledPlugins[plugin_id]=true`. The next `plugin_registry_report` may silently prune the stale registry entry, but the enabled state can remain as orphaned configuration noise and the original uninstall command has already destroyed the install before it can report a clean transactional result. Existing `installs_enables_updates_and_uninstalls_external_plugins` asserts only the success path; it does not simulate registry write failure, settings write failure, or crash after `remove_dir_all`. **Required fix shape:** (a) make plugin uninstall a transaction with a tombstone/backup phase: first persist intended disabled/removed state or acquire a lock, then move the install directory to a same-filesystem trash/backup path, then update registry/settings, then remove backup after all metadata commits succeed; (b) if metadata cleanup fails, restore or report a recoverable tombstoned state with `rollback_available:true`; (c) make stale enabled settings for missing plugins produce a structured warning and auto-clean only through a safe metadata path; (d) add tests injecting failures after directory move/removal, after `store_registry`, and after `write_enabled_state`; (e) reuse the atomic JSON/write-lock helper from #508 so registry/settings writes cannot be partially written. **Why this matters:** uninstall is the destructive plugin lifecycle verb. Deleting files before committing metadata means a transient settings/registry IO failure turns a local cleanup into stale-branch-style plugin confusion: inventory can say installed/enabled while the executable hook/tool files are already gone. Source: gaebal-gajae dogfood response to Clawhip message `1507458055812546630` on 2026-05-22.
+
+594. **MCP stdio frame reader trusts arbitrary `Content-Length` and allocates that many bytes before reading, so a buggy or hostile MCP server can OOM the runtime with one header** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 19:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@d996b65`. Active tmux session at probe time: `gajae-pr-358-review-session-vanish-replacement-review`; no active claw-code implementation session. Focused validation: `cd rust && cargo test -p runtime mcp_stdio -- --nocapture` passed 22/22, confirming current MCP stdio tests cover normal/mismatched/lowercase frames but not oversized frames. Code inspection: `runtime/src/mcp_stdio.rs::McpStdioProcess::read_frame` parses `Content-Length` into `usize`, then immediately does `let mut payload = vec![0_u8; content_length]; self.stdout.read_exact(&mut payload).await?;`. There is no maximum frame size, no per-server byte budget, no early rejection on huge lengths, and no streaming cap. A server can send `Content-Length: 10000000000
+
+` (or any value near available memory) and force allocation before any payload bytes arrive; the surrounding `run_process_request` timeout does not protect the allocation itself. This is distinct from HTTP body caps (#503) and SSE parser buffering (#506): it is the MCP JSON-RPC stdio framing layer. Existing tests assert lowercase `Content-Length`, missing/mismatched IDs, timeout, and retry/reset behavior, but none assert a maximum accepted frame length. **Required fix shape:** (a) add a conservative `MAX_MCP_STDIO_FRAME_BYTES` default and optional per-server override; (b) after parsing `Content-Length`, reject values above the cap with `io::ErrorKind::InvalidData` carrying `content_length` and `max_frame_bytes`; (c) read the body through a bounded buffer/helper so allocation is capped and timeout/error surfaces stay typed as MCP invalid response; (d) add regression scripts that emit huge `Content-Length` with no body and oversized body, proving no large allocation and a structured invalid-response error; (e) include frame-size metadata in MCP degraded/error reports so operators can distinguish protocol abuse from transport EOF. **Why this matters:** MCP servers are extension processes. The client must treat their stdio as untrusted protocol input; one oversized length header should not be able to OOM a prompt startup, tool discovery, or resource read before degraded-mode reporting can fire. Source: gaebal-gajae dogfood response to Clawhip message `1507465601499660349` on 2026-05-22.