docs(roadmap): add websearch base url boundary gap

This commit is contained in:
Yeachan-Heo
2026-05-22 11:30:40 +00:00
parent bff67c2b24
commit d8864ff151

View File

@@ -6707,3 +6707,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
576. **`glob_search` brace expansion has no fan-out cap, so a single pattern with large brace groups can explode into thousands of `WalkDir` traversals before any timeout/result limit applies** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 10:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@ab5754a`. Active tmux sessions at probe time: none. Code inspection: `rust/crates/runtime/src/file_ops.rs::expand_braces` recursively expands the first `{...}` group by splitting on every comma and calling itself on each alternative. `glob_search_impl` then iterates every expanded pattern and creates a separate `WalkDir` traversal for each one before applying the hard `take(100)` result cap. There is no maximum number of alternatives, no recursion-depth/expanded-pattern cap, and no early deduplication by shared walk root. A user/model-supplied pattern like `**/*.{a,b,c,...hundreds}` or multiple brace groups can produce hundreds/thousands of full repo walks even though only 100 filenames can be returned. Existing tests cover a tiny `*.{rs,toml}` happy path and unmatched braces, but not expansion fan-out or timeout behavior. **Required fix shape:** (a) impose a small maximum expanded-pattern count and return `InvalidInput`/typed error when exceeded; (b) deduplicate shared walk roots or compile multiple patterns per root so alternatives do not trigger repeated full-tree walks; (c) include `expanded_pattern_count` and truncation/fanout metadata in diagnostics; (d) add regressions for large single brace groups, multiple nested brace groups, and normal small brace patterns; (e) keep the final 100-result cap but do not rely on it as protection against pre-result traversal explosions. **Why this matters:** glob is a model-facing discovery tool. Unbounded brace fan-out turns a compact pattern into many expensive filesystem walks, creating startup friction and an easy self-inflicted DoS before the existing result limit can help. Source: gaebal-gajae dogfood response to Clawhip message `1507329710257078353` on 2026-05-22.
577. **`WebFetch` upgrades only the initial HTTP URL, but follows redirects without revalidating the final target, so HTTPS pages can redirect the tool into localhost/private HTTP endpoints** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 11:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@571b4c2`. Active tmux sessions at probe time: none. Code inspection: `rust/crates/tools/src/lib.rs::normalize_fetch_url` upgrades an initial `http://` URL to HTTPS unless the host is exactly `localhost`, `127.0.0.1`, or `::1`. But `build_http_client` enables `redirect(reqwest::redirect::Policy::limited(10))`, and `execute_web_fetch` sends the request once, then trusts `response.url()` as the final URL. There is no redirect policy that re-runs scheme/host/IP checks for each hop, no block for redirects from public HTTPS to `http://localhost`, `http://127.0.0.1`, RFC1918/link-local/metadata IPs, or non-HTTPS final URLs. As a result, a model-requested public URL can legally redirect the tool into local/private network resources even though direct non-local HTTP is upgraded and no SSRF boundary is documented. **Required fix shape:** (a) implement a custom redirect policy that validates every `Location` target before following it; (b) reject redirects to localhost, loopback, link-local, RFC1918/private ranges, cloud metadata hosts/IPs, and/or downgrade-to-HTTP unless explicitly allowed; (c) resolve hostnames before fetch or after redirect to prevent DNS-based private IP hops; (d) add tests with public HTTPS -> localhost/private/metadata redirects and safe HTTPS->HTTPS redirects; (e) expose final URL plus redirect-block reason in the `WebFetch` output/error without leaking response bodies. **Why this matters:** WebFetch is an external content tool. Redirects are part of the fetch boundary, not a postscript; without per-hop validation, a harmless-looking public URL can become an internal network probe or local service read. Source: gaebal-gajae dogfood response to Clawhip message `1507337256107642961` on 2026-05-22.
578. **`WebSearch` accepts `CLAWD_WEB_SEARCH_BASE_URL` with any scheme/host and follows the shared redirect policy, so a local environment variable can turn search into a private-network fetch without guardrails** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 11:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@bff67c2`. Active tmux session at probe time: `gajae-issue-339-backlog-zero-candidate-selection`; no active claw-code implementation session. Code inspection: `rust/crates/tools/src/lib.rs::build_search_url` checks `CLAWD_WEB_SEARCH_BASE_URL`, parses it with `reqwest::Url::parse`, appends `q`, and returns it unchanged. Unlike `WebFetch`, there is no `normalize_fetch_url`-style scheme upgrade or host validation at all. `execute_web_search` then uses the same `build_http_client` as WebFetch, which follows up to 10 redirects. A poisoned or stale local env var can point search at `http://localhost`, RFC1918 services, metadata endpoints, or a redirector to those targets; the tool will fetch and parse generic links before domain allow/block filters are applied to extracted result URLs, not to the search endpoint itself. Existing tests/logic focus on hit filtering, not search-provider endpoint validation. **Required fix shape:** (a) validate `CLAWD_WEB_SEARCH_BASE_URL` at parse time against allowed schemes/hosts or require an explicit unsafe/local-test opt-in; (b) apply the same per-redirect target validation required by #577 to WebSearch; (c) distinguish configured test search providers from production search with a typed config/source field in output; (d) add regressions for env base URLs pointing to localhost/private/metadata and safe HTTPS test endpoints; (e) document the env var as test-only or constrain it to trusted public search domains. **Why this matters:** search is often treated as a low-risk public-web tool. An unvalidated base URL makes it an environment-controlled internal fetch surface, and result-domain filters do not protect the initial request or redirects. Source: gaebal-gajae dogfood response to Clawhip message `1507344805167108257` on 2026-05-22.