From 9843ccfa281a60894f1430d60d5b58e3b5d9a65b Mon Sep 17 00:00:00 2001
From: Yeachan-Heo
Date: Fri, 22 May 2026 15:31:00 +0000
Subject: [PATCH] docs(roadmap): add plugin registry sync race gap

---
 ROADMAP.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ROADMAP.md b/ROADMAP.md
index ebccc731..de4b7a74 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -6723,3 +6723,5 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 584. **`WorkerRegistry::restart` clears prompt/trust state but leaves old events and the original `created_at`, so post-restart timeout evidence can mix prior-attempt blockers with the new boot attempt** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 14:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@ac09033`. Active tmux sessions at probe time: none. Channel context included Jobdori's worker-boot #592 finding, so this probe stayed in `worker_boot.rs` but checked restart lifecycle continuity. Code inspection: `rust/crates/runtime/src/worker_boot.rs::restart` resets `status`, trust flags, prompt fields, `last_error`, attempts, and `prompt_in_flight`, then appends `WorkerEventKind::Restarted`. It does not reset `worker.created_at`, does not create a per-attempt `boot_started_at`, does not increment a restart/attempt counter, and does not clear or partition previous events. Later `observe_startup_timeout` computes `elapsed = now.saturating_sub(worker.created_at)` and derives `trust_prompt_detected`, `tool_permission_detected`, and `ready_for_prompt_detected` by scanning the entire `worker.events` history. A worker that previously hit trust/tool/ready evidence, then restarts cleanly, can have the new timeout classified with old pre-restart evidence and an elapsed time anchored to the original creation.
Existing `restart_and_terminate_reset_or_finish_worker` only asserts prompt fields/attempts reset; it does not assert event scoping or elapsed-time reset. **Required fix shape:** (a) add `current_attempt_started_at`/`boot_started_at` and `attempt_index` fields; (b) set them on create and restart and use them for startup timeout elapsed/command_started_at; (c) scope timeout evidence scans to events since the current attempt or store per-attempt cached evidence; (d) add tests where trust/tool evidence exists before restart but not after, proving the post-restart timeout does not inherit stale blockers; (e) include attempt index in worker events so dashboards can separate old and new boot attempts. **Why this matters:** restart is supposed to produce a fresh boot attempt. If old events and timestamps remain authoritative, operators see misleading “stalled for hours / trust required” evidence for a brand-new restart and recovery automation can choose the wrong next action. Source: gaebal-gajae dogfood response to Clawhip message `1507390103956226228` on 2026-05-22.

 585. **Prompt-misdelivery auto-recovery arms replay without clearing `last_error`, so a worker can be `ReadyForPrompt` while still carrying a stale prompt-delivery failure** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 15:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@b0bca2e`. Active tmux session at probe time: `gajae-pr-346-session-gateway-continuity-digest-review`; no active claw-code implementation session. Code inspection: in `rust/crates/runtime/src/worker_boot.rs::observe`, prompt misdelivery sets `worker.last_error = Some(WorkerFailureKind::PromptDelivery)` and `prompt_in_flight = false`, then pushes a `PromptMisdelivery` event.
If `worker.auto_recover_prompt_misdelivery` is true, it sets `worker.replay_prompt = worker.last_prompt.clone()` and `worker.status = WorkerStatus::ReadyForPrompt`, then pushes `PromptReplayArmed`; however it never clears or demotes `last_error`. `await_ready` then returns `ready: true` and `last_error: Some(PromptDelivery)`, so callers/dashboards can see a ready replay state and a failure state simultaneously. Later `send_prompt` clears `last_error`, but until replay is actually sent the state snapshot is contradictory. Existing replay tests assert `status == ReadyForPrompt` and `replay_prompt` contents, but do not assert `last_error` semantics while replay is armed. **Required fix shape:** (a) when auto-recovery arms replay, either clear `last_error` or replace it with a non-fatal/degraded `replay_armed` status separate from failure; (b) include `recovery_armed:true` in the worker snapshot or ready result so callers can distinguish a recoverable ready state from a failed state; (c) add tests asserting `await_ready` after auto-recovery does not report contradictory ready+fatal error; (d) preserve the original misdelivery event for audit history while keeping current worker state coherent; (e) ensure manual/non-auto recovery still reports failure until explicitly resolved. **Why this matters:** recovery state is an operator contract. A worker that is ready to replay should not also advertise a current fatal prompt-delivery error, or automation may both retry and escalate the same incident. Source: gaebal-gajae dogfood response to Clawhip message `1507397657763778562` on 2026-05-22.
+
+586. **Plugin registry reads synchronously mutate bundled plugin installs without a lock/atomic swap, so startup/listing can race or fail while merely trying to aggregate hooks/tools** — dogfooded 2026-05-22 from the `#clawcode-building-in-public` 15:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8043090`.
Active tmux session at probe time: `gajae-pr-348-package-release-drift-review`; no active claw-code implementation session. Code inspection: every `PluginManager::plugin_registry_report` call begins with `self.sync_bundled_plugins()?`, and read-style paths (`plugin_registry`, `list_plugins`, `discover_plugins`, `aggregated_hooks`, `aggregated_tools`, startup `build_runtime_plugin_state_with_loader`) all flow through it. `sync_bundled_plugins` loads bundled manifests, then for each stale/outdated bundled plugin does `fs::remove_dir_all(&install_path)?; copy_dir_all(&source_root, &install_path)?;`, removes stale bundled IDs/directories, and finally writes `plugins/registry.json`. There is no process-wide file lock, no temp-dir + atomic rename, and no read-only/degraded mode. Two concurrent CLI startups or a startup plus `claw plugins list` can both decide a sync is needed; one can remove an install dir while the other is loading/copying it, yielding transient missing/partial plugin directories or a registry write race. Even a purely diagnostic/list/aggregate command can fail because bundled-plugin self-sync mutates disk before returning registry data. Existing tests cover sync happy paths and load-failure reporting, but not concurrent registry readers or a simulated remove/copy interruption. 
**Required fix shape:** (a) separate read-only registry discovery from bundled-plugin reconciliation, or gate reconciliation behind an explicit locked startup/update phase; (b) protect bundled install sync and registry writes with an interprocess lock; (c) copy bundled plugins into a temp dir and atomically rename/swap, never exposing partial installs; (d) if sync fails during a read-style command, return a degraded registry report with load failures instead of aborting all plugin aggregation where safe; (e) add concurrency/interruption tests with two managers racing `plugin_registry_report` and with `copy_dir_all` failure after removal, proving readers see either old or new complete plugin installs. **Why this matters:** plugin/MCP startup already has lifecycle friction. Registry reads should be safe and mostly observational; making them perform unlocked destructive replacement means diagnostics and startup can create the very plugin-load failures they are trying to observe. Source: gaebal-gajae dogfood response to Clawhip message `1507405207602987138` on 2026-05-22.
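
Fix-shape points (b) through (d) of item 586 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `SyncLock`, `SyncOutcome`, `replace_install`, `sync_bundled_plugin`, and the lock-file path are hypothetical names, and the `copy_dir_all` here is a stand-in for the helper of the same name mentioned above. The lock is a plain `create_new` lock file, which a crashed process can leave behind; an OS-level advisory lock would likely be more robust. The swap uses two renames, so there is still a brief window in which `install_path` is absent, but the live install is never left half-copied.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard: owns an exclusive lock file and removes it on drop.
pub struct SyncLock {
    path: PathBuf,
}

impl SyncLock {
    /// Returns `Ok(None)` if another holder already created the lock file.
    pub fn try_acquire(path: &Path) -> io::Result<Option<SyncLock>> {
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(_) => Ok(Some(SyncLock { path: path.to_path_buf() })),
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(None),
            Err(e) => Err(e),
        }
    }
}

impl Drop for SyncLock {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}

/// Recursive directory copy (stand-in for the real `copy_dir_all`).
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Builds a hidden sibling path such as `.demo.staged-<pid>` next to `path`.
fn sibling(path: &Path, tag: &str) -> PathBuf {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();
    path.with_file_name(format!(".{name}.{tag}-{}", std::process::id()))
}

/// Copies into a staging directory first, then swaps it in with renames.
/// A failed copy leaves the existing install untouched.
pub fn replace_install(source_root: &Path, install_path: &Path) -> io::Result<()> {
    let staged = sibling(install_path, "staged");
    let retired = sibling(install_path, "retired");
    let _ = fs::remove_dir_all(&staged); // clear leftovers from earlier runs
    let _ = fs::remove_dir_all(&retired);

    if let Err(e) = copy_dir_all(source_root, &staged) {
        let _ = fs::remove_dir_all(&staged);
        return Err(e);
    }
    let had_old = install_path.exists();
    if had_old {
        fs::rename(install_path, &retired)?;
    }
    if let Err(e) = fs::rename(&staged, install_path) {
        if had_old {
            let _ = fs::rename(&retired, install_path); // roll back
        }
        return Err(e);
    }
    if had_old {
        let _ = fs::remove_dir_all(&retired);
    }
    Ok(())
}

#[derive(Debug, PartialEq)]
pub enum SyncOutcome {
    Synced,
    /// Another process holds the lock; the caller reads the registry as-is.
    SkippedLockHeld,
}

/// Reconciles one bundled plugin only while holding the lock.
pub fn sync_bundled_plugin(
    lock_path: &Path,
    source_root: &Path,
    install_path: &Path,
) -> io::Result<SyncOutcome> {
    let Some(_guard) = SyncLock::try_acquire(lock_path)? else {
        return Ok(SyncOutcome::SkippedLockHeld);
    };
    replace_install(source_root, install_path)?;
    Ok(SyncOutcome::Synced)
}
```

In this sketch a read-style caller would treat `SkippedLockHeld` or an `Err` as a reason to report a degraded registry built from whatever is already installed, rather than aborting hook/tool aggregation.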