Files
relicario/docs/superpowers/specs/2026-05-04-security-polish-design.md
adlee-was-taken 4f7ab91f14 docs(specs): three architecture-review followup plans (security/CLI/extension)
Plan A (security & docs polish, S): SessionHandle impl Drop + JS .free()
audit + recovery_qr.rs documentation + relay launcher dev-c expansion.
Independent of B/C; ships first.

Plan B (CLI restructure, M-L): split cli/main.rs (2641 LOC) into commands/
folder + prompt.rs + parse.rs; helpers::git_run captures stderr; Vault::
after_manifest_change centralizes the groups-cache discipline; canonical
ParamsFile; batched purge; migrate parse_month_year/base32_decode_lenient/
guess_mime to relicario-core with WASM re-exports.

Plan C (extension restructure, L): typed StateHost (precondition); extract
service-worker/storage.ts; setup.ts SW migration via create_vault/
attach_vault messages + step-registry pattern; vault.ts split into
shell/sidebar/list/drawer/form-wrapper with vault_locked channel
unified through shared/state.ts; P2 cluster (timer reset, gitHost clear,
teardown helper, allSettled, MutationObserver debounce); get_vault_status
closes the relicario status parity gap.

Cross-boundary cites verified: Plan B Phase 8 WASM exports are the seam
Plan C consumes (deferred to a future plan); Plan A owns the .free() swallow
removal that Plan C respects without redoing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 20:02:40 -04:00

136 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Security & Docs Polish — Design
**Date:** 2026-05-04
**Status:** Proposed
**Source:** `docs/superpowers/reviews/2026-05-04-architecture-review.md` (P1.1, P1.7, P1.8)
**Effort estimate:** S (whole-PR; phase totals below)
## Summary
This is the one-day security and documentation PR that ships first in the v0.5.x architecture-followup train, independent of Plan B (CLI restructure) and Plan C (extension restructure). It closes the single defense-in-depth crypto gap the review flagged (`SessionHandle` has no `impl Drop`, so wasm-bindgen's `.free()` is a cleanup no-op while the master key sits in WASM linear memory), removes the JS-side error swallow that was masking that fact, brings the one undocumented security-critical core file (`recovery_qr.rs`) up to the documentation density of `crypto.rs` / `imgsecret.rs` / `backup.rs` / `tar_safe.rs`, and confirms the relay test/launcher polish items. Together these are mechanical, low-risk changes that fix the only finding in the review where "what the code looks like it does" diverges dangerously from "what it actually does" — which is exactly the wall a tinkerer learning Rust would hit hardest.
## Findings addressed
- **P1.1 — `SessionHandle` has no `impl Drop`; `.free()` is a cleanup no-op** — DEV-B (Rust headline) + DEV-C (JS partner finding). Files: `crates/relicario-wasm/src/lib.rs:15-23`, `crates/relicario-wasm/src/session.rs:1-58`, `extension/src/service-worker/session.ts:26`.
- **Partner JS swallow** — DEV-C P2 in service-worker. File: `extension/src/service-worker/session.ts:26` (the `try { current.free(); } catch { /* already freed */ }` block).
- **P1.7 — `recovery_qr.rs` is undocumented relative to the rest of core** — DEV-A. File: `crates/relicario-core/src/recovery_qr.rs:1-130`.
- **P1.8 — `tools/relay/queue.test.ts` assertion update** — DEV-C. File: `tools/relay/queue.test.ts:54`. Confirmed during planning: **already fixed in `061facd`** — line 54 now reads `assert.ok(isRole("dev-c"));` and the negative case `assert.ok(!isRole("dev-d"));` is on line 55. No code work remains; the plan tracks this as a verification step.
- **Launcher follow-up — `tools/relay/start.sh` still references three roles** — DEV-C P2. File: `tools/relay/start.sh:30-86`. Confirmed during planning: **not yet fixed.** Both `launch_tmux` and `launch_kitty` open only PM, Dev-A, Dev-B (no Dev-C); the manual-mode banner still says "Open 3 new terminals"; the post-launch summary lines (`:61, :81`) print "3 windows (PM, Dev-A, Dev-B)"; no `*-dev-c-prompt.md` is discovered or used. This phase is real.
## Approach
The PR is mechanical and uniform across four phases:
- **Adds** one `impl Drop for SessionHandle` block in `crates/relicario-wasm/src/lib.rs`, one Rust test that exercises construct → drop → registry-empty, one module-level `//!` doc-block in `recovery_qr.rs`, one ASCII layout diagram next to the magic constants there, four per-function doc-comments on the public API there, and one `DEV_C_PROMPT` discovery line + a fourth window in each launcher mode of `start.sh`.
- **Removes** the `try { current.free(); } catch { /* already freed */ }` swallow at `extension/src/service-worker/session.ts:26`.
- **Verifies** every `.free()` callsite under `extension/src/` (the audit deliverable for P1.1's JS side; today the audit itself is small — only one match — but the deliverable is the recorded grep + a check that every code path that holds a `SessionHandle` calls `wasm.lock(handle)` before `free()`).
- **Confirms in passing** that `tools/relay/queue.test.ts:54` already matches the new role union (it does — committed in `061facd`).
No public API changes on the Rust side beyond the `Drop` impl (which is observably equivalent to "calling `lock(handle)` was previously the only way to clean up"). No JS public API changes. No file moves. No new dependencies (`wasm-bindgen-test` is already in `[dev-dependencies]` for the wasm crate).
## Implementation phases
### Phase 1 — Rust-side `impl Drop for SessionHandle` + test
- **Goal:** make wasm-bindgen's auto-generated `.free()` actually clear the master key from WASM linear memory, regardless of whether JS also called `lock(handle)`.
- **Changes:**
- `crates/relicario-wasm/src/lib.rs` — add `impl Drop for SessionHandle { fn drop(&mut self) { let _ = session::remove(self.0); } }` immediately after the existing `#[wasm_bindgen] impl SessionHandle { ... value() ... }` block (around line 23). Update the doc-comment on `SessionHandle` (line 15) to document the contract: "Dropping (or `.free()` from JS) removes the entry from the session registry and zeroizes the wrapped key/image_secret. `lock(handle)` remains available as the explicit early-cleanup path; `Drop` is the safety net."
- `crates/relicario-wasm/src/session.rs` — no functional change required, but add a one-line module-level note above `pub fn remove(...)` explicitly stating that `Drop for SessionHandle` is the second caller of this function alongside `lock()`, so that a future reader removing `lock()` doesn't accidentally orphan the cleanup path.
- **Tests:**
- Add a `#[wasm_bindgen_test]` (the `wasm-bindgen-test` crate is already in `[dev-dependencies]`, line 25 of `Cargo.toml`) under `crates/relicario-wasm/tests/session_drop.rs`. The test constructs a `SessionHandle` (via the public `unlock` path with synthetic JPEG + tiny `KdfParams` for fast key derivation, mirroring the existing native `session_tests::manifest_round_trip_via_handle` shape), records the underlying handle id via `handle.value()`, drops the handle, then asserts `session::with(id, |_| ()).is_none()`. The `session::with` symbol must be re-exported from `tests/` scope; if it isn't, expose `pub fn with_for_tests(handle: u32) -> bool` on the `session` module guarded by `#[cfg(test)]` so the test does not need to reach into thread-locals.
- As a fallback / belt-and-suspenders, add a parallel native `#[test]` under the existing `mod session_tests` in `lib.rs` that constructs a `SessionHandle(session::insert(...))`, drops it explicitly with `drop(handle)`, then asserts `session::with(h, |_| ()).is_none()`. Native tests run on every `cargo test`; the wasm-bindgen-test runs only when someone invokes `wasm-pack test`, so the native variant is the one that catches regressions in CI.
- **Effort:** S
- **Depends on:** none
### Phase 2 — JS-side `.free()` audit + remove the swallow
- **Goal:** make the JS side stop hiding crypto-state-transition errors, and prove (via a recorded audit) that there are no other `.free()` callsites on `SessionHandle` (or on `EncryptedAttachment` / `TotpCode`) that bypass `wasm.lock(handle)` first.
- **Changes:**
- `extension/src/service-worker/session.ts:24-28` — replace the body of `clearCurrent()` so that `current.free()` is called unconditionally (no `try`/`catch`) and `current` is set to `null` afterwards. The doc-block at the top of the file (lines 1-9) gets one extra paragraph noting that with `impl Drop` on the Rust side (Phase 1), `free()` now removes the registry entry — `lock(handle)` becomes a redundant early-cleanup option, but the SW currently does not call it. Document whether to keep `clearCurrent()` minimal (just `free()`) or add an explicit `wasm.lock(current.value)` immediately before `free()` for symmetry with the CLI's session lifecycle. Recommended: just `free()` post-Phase-1 — calling `lock` first is now belt-and-suspenders that adds a redundant HashMap remove. Document the rationale in the comment.
- **Audit deliverable:** run `grep -rn "\.free\b" extension/src/` and record the result in the PR description. Today this returns exactly one match (`extension/src/service-worker/session.ts:26`), confirming the SW is the only surface that calls `.free()` on a wasm-bindgen handle. If a future surface (e.g. an offscreen document, a future setup-flow refactor) adds a second callsite, this PR's grep becomes the baseline that flags the new one.
- **Lifecycle audit (no code change required, but record findings in PR description):** confirm every code path that constructs a `SessionHandle` via `unlock(...)` ultimately reaches `clearCurrent()`. The two paths are (a) explicit user lock via the popup `lock` action → `service-worker/router/popup-only.ts` `lock` handler → `clearCurrent()`, and (b) inactivity timer expiry → `service-worker/index.ts:51-58` session-expired callback → `clearCurrent()`. Both are already wired; this audit just records that the wiring is complete.
- **Tests:**
- Vitest in `extension/`: add a small unit test in `extension/src/service-worker/__tests__/session.test.ts` (create if absent) that asserts `clearCurrent()` is a no-op when `current` is `null`, calls `setCurrent({ free: spy, value: 1 } as unknown as SessionHandle)` then `clearCurrent()`, and asserts `spy` was called exactly once and `getCurrent()` returns `null`. Use the existing `__stubs__/relicario_wasm.stub.ts` pattern for the handle shape if convenient.
- Optionally extend the lifecycle test: a second case where `free()` throws — assert the exception now propagates out of `clearCurrent()` (regression coverage for the swallow removal).
- **Effort:** S
- **Depends on:** Phase 1 (the swallow removal is only safe once the Rust side actually does the cleanup work that `.free()` is now supposed to do)
### Phase 3 — `recovery_qr.rs` documentation pass
- **Goal:** raise `crates/relicario-core/src/recovery_qr.rs` to the documentation density of `crypto.rs` / `imgsecret.rs` / `backup.rs` / `tar_safe.rs` so a reader landing in the file does not assume it is out-of-the-documented-zone scratch code.
- **Changes:** all in `crates/relicario-core/src/recovery_qr.rs`:
- **Module-level `//!` header** at the top of the file (above the `use` block on line 1). Three short paragraphs: (1) *what* the module produces (a 109-byte payload encoded as a QR v40 EcLevel::M SVG that, given the same passphrase, recovers the 32-byte image_secret); (2) *why the format is structured this way* — XChaCha20-Poly1305 wrapping over an Argon2id-derived "wrap key" derived from the recovery passphrase (`b"relicario-recovery-v1\0"` domain-separation prefix prevents the wrap key from ever colliding with a vault master key, even if the user reuses their vault passphrase here); (3) *parameter-pinning rationale*`production_params()` deliberately re-states `KdfParams::default()`'s values rather than referencing them, because changing the default at the type level must not silently invalidate every recovery QR a user has printed.
- **ASCII layout diagram** above the magic-constant block (lines 7-9):
```
// byte field length
// ------ -------------- ------
// 0..4 MAGIC = "RREC" 4
// 4..5 VERSION = 0x01 1
// 5..37 kdf_salt 32 (random per QR)
// 37..61 wrap_nonce 24 (random per QR)
// 61..109 ciphertext 48 (32 image_secret + 16 AEAD tag)
// ------------------------------
// total 109
```
Pair the diagram with a `const _: () = assert!(PAYLOAD_LEN == 4 + 1 + 32 + 24 + 48);` static assertion so a future edit that changes the layout cannot drift from the documented total without a compile error.
- **Doc-comments on the four public items:**
- `RecoveryQrPayload` (line 11): one-paragraph summary; note that `as_bytes()` is the only accessor and that the bytes are an opaque sealed package — they should travel together (printing them as a QR is the supported transport, but a hex string also works for paranoid sneakernet).
- `generate_recovery_qr` (line 45): documents inputs (passphrase, image_secret), output (the 109-byte payload), and the security properties (random kdf_salt + random wrap_nonce per call ⇒ two QRs from the same passphrase + image_secret are different bytes). Cross-reference `recovery_qr_to_svg` for rendering.
- `unwrap_recovery_qr` (line 81): documents inputs/output, return-type rationale (`Zeroizing<[u8; 32]>` so the recovered image_secret zeroes when dropped), and the error vocabulary (`RecoveryQr` for format errors, `Decrypt` for AEAD failure / wrong passphrase — the deliberate non-distinguishing rejection from audit M4).
- `recovery_qr_to_svg` (line 122): documents the rendering choice (QR v40, EcLevel::M, 140×140 minimum dimension) and the rationale (109 bytes fits comfortably in v40 at M, leaving recovery-error-correction headroom for a smudged print).
- **`production_params` (line 32):** add a doc-comment explaining the deliberate divergence-tolerance from `KdfParams::default()` — wording: "Pinned at QR-format authoring time. `KdfParams::default()` may evolve as we re-tune Argon2 cost; this function MUST NOT change values that have ever shipped, because doing so silently invalidates every recovery QR printed under a previous parameter set." Either keep the function-with-comment shape, or per the synthesis suggestion, replace it with a `const RECOVERY_PRODUCTION_PARAMS: KdfParams = KdfParams { argon2_m: 65536, argon2_t: 3, argon2_p: 4 };` if `KdfParams` is `const`-constructible (it is — checked via `crypto.rs`). Either is acceptable; the `const` form is preferred because it makes the "this is not derived from defaults" property visible at the use site.
- **Optional micro-cleanups while in the file** (cite explicitly so they aren't surprising in code review):
- Replace the hand-counted `payload_bytes[5..37]`, `[37..61]`, `[61..109]` slice indices with named constants (`KDF_SALT_RANGE`, `WRAP_NONCE_RANGE`, `CIPHERTEXT_RANGE`) derived from the layout offsets. Improves diff-readability if the layout ever extends.
- Add a one-line comment to `recovery_kdf_input` documenting that the length-prefix on `nfc_bytes` mirrors the same convention as `crypto::derive_master_key` (audit H1 in the design spec) — this is the connective tissue that explains "why is there a `(len as u64).to_be_bytes()` here too?"
- **Tests:**
- No new test logic needed — the existing `tests/recovery_qr.rs` and the `RecoveryQr`/`Decrypt` round-trips in `tests/integration.rs` already pin behaviour.
- The `const _: () = assert!(...)` static assertion is itself the contract test for the layout diagram.
- **Effort:** S
- **Depends on:** none
### Phase 4 — Relay launcher (queue.test.ts already done in `061facd`)
- **Goal:** confirm the test fix that already shipped, and bring the launcher script up to the four-role world.
- **Changes:**
- `tools/relay/queue.test.ts` — **no change.** Confirmed during planning: line 54 is already `assert.ok(isRole("dev-c"));` and line 55 is already `assert.ok(!isRole("dev-d"));` (committed in `061facd`). This phase records the verification step in the PR description so a reviewer can see the file was checked.
- `tools/relay/start.sh` — real changes:
- Add a `DEV_C_PROMPT` discovery line at lines 30-33 alongside the existing `PM_PROMPT` / `DEV_A_PROMPT` / `DEV_B_PROMPT` definitions: `DEV_C_PROMPT="$(ls -t "$COORD_DIR"/*-dev-c-prompt.md 2>/dev/null | head -1 || echo "(none found)")"`.
- Manual mode (`print_manual_instructions`, lines 35-51) — change "Open 3 new terminals" to "Open 4 new terminals" and add the `Terminal 4 (Dev C): cat '$DEV_C_PROMPT'` line.
- tmux mode (`launch_tmux`, lines 53-69) — add a fourth `tmux new-window -t "$SESSION:" -n "dev-c" "cd '$REPO_ROOT' && claude"` line; update the summary print on line 61 from "4 windows: relay, pm, dev-a, dev-b" to "5 windows: relay, pm, dev-a, dev-b, dev-c"; add `Dev C: $DEV_C_PROMPT` to the prompts list (currently lines 64-66).
- kitty mode (`launch_kitty`, lines 71-86) — add a fourth `kitty @ launch --type=tab --tab-title "Dev-C" --hold -- bash -l -i -c "cd '$REPO_ROOT' && claude"` line; update the summary print on line 81 from "3 windows (PM, Dev-A, Dev-B)" to "4 windows (PM, Dev-A, Dev-B, Dev-C)"; add `Dev C: $DEV_C_PROMPT` to the prompts list.
- **Tests:** none directly. The relay launcher has no automated tests; verification is "run `bash tools/relay/start.sh --manual` and confirm the banner mentions four terminals + the Dev-C prompt path; run `bash tools/relay/start.sh --kitty` on a kitty terminal and confirm a fourth tab opens." Both checks belong in the PR's manual-test plan.
- **Effort:** S
- **Depends on:** none
## Risks and mitigations
- **Adding `impl Drop` could change observable behaviour for any JS code that today holds two references to the same `SessionHandle` and frees one of them.** wasm-bindgen does not in fact let JS hold two refs to the same Rust struct without explicit cloning — the `SessionHandle` is moved into JS at construction and `.free()` consumes the handle pointer. The audit in Phase 2 confirmed only one `.free()` callsite under `extension/src/`. The CLI does not touch the WASM crate at all (Rust-side; the CLI uses `relicario-core` directly). So the realistic blast radius is exactly the one SW callsite this plan also touches. Mitigation: Phase 1's native test explicitly covers "drop the handle, registry is empty," and Phase 2's vitest covers "free is called exactly once."
- **Removing the `try { current.free() } catch { /* already freed */ }` swallow could expose latent bugs that were silently tolerated.** Specifically, if any code path today double-frees the handle (calls `clearCurrent()` twice without a `setCurrent()` in between), the catch is currently hiding it. Reading the SW lifecycle: `clearCurrent()` is called from (a) the `lock` message handler, (b) the inactivity-timer session-expired callback, and (c) potentially the unlock-failure cleanup path. The current `if (!current) return;` early-out at line 25 already guards the simplest double-free case (calling `clearCurrent()` twice in a row is safe because the second call no-ops). The only way the catch fires today is if something in WASM raises during `free()` — which post-Phase-1 should be impossible (Drop is infallible; `session::remove` returns a bool, not a Result). Mitigation: the swallow removal happens *after* Phase 1's Drop lands, so the catch should now have nothing to catch. If a regression test surfaces a real bug, the right fix is to find and fix the bug, not re-add the swallow.
- **The `recovery_qr.rs` static assertion (`const _: () = assert!(PAYLOAD_LEN == ...);`) requires Rust 1.79+** for `const` panic in test position. The workspace's `rust-toolchain.toml` (if present) or the `Cargo.toml` `rust-version` field needs to be checked; if the repo is on an older toolchain, fall back to a runtime `#[test] fn payload_len_matches_layout() { assert_eq!(PAYLOAD_LEN, 4 + 1 + 32 + 24 + 48); }` — same effect, slightly less elegant. Verify during implementation; not a blocker.
- **Phase 4's launcher change is untested by automation.** A typo in the kitty/tmux command lines won't surface until someone runs `start.sh --kitty` or `start.sh --tmux`. Mitigation: the PR description includes an explicit manual-test plan ("run all three modes and confirm 4 windows + Dev-C prompt resolution") and the change pattern is symmetric to the existing Dev-A/Dev-B lines, minimizing typo surface.
- **The `wasm-bindgen-test` variant in Phase 1 only runs under `wasm-pack test --node`, which is not part of the project's default `cargo test` cycle.** The native `#[test]` in `mod session_tests` is the one that protects against regressions in CI. The wasm-bindgen-test is included for symmetry and to exercise the actual WASM target's Drop behaviour — it should be invoked at least once during PR review and added to a follow-up CI workflow if the project's test infrastructure ever expands to wasm-pack.
## Out of scope
This plan deliberately does not address, and explicitly leaves to other plans or the P2/P3 tail:
- Plan B's CLI restructure work in its entirety (P1.2 main.rs split, P1.3 git-shell error UX, P1.10 parser migration to core, plus the CLI P2/P3 entries). These touch `crates/relicario-cli/` and are independent of this PR.
- Plan C's extension restructure work in its entirety (P1.4 setup.ts WASM extraction, P1.5 vault.ts split, P1.6 shared/state.ts typing, P1.9 SW router helper extraction, plus the extension P2/P3 entries). These touch `extension/src/` and are independent of this PR — except for the single SW file that Phase 2 of this plan touches, which is small and disjoint from Plan C's surfaces.
- The P2 WASM cleanups DEV-B flagged: redundant `need_key + with(...).unwrap()` double-lookup, `Vec<u8>` getter clones, the `wasm_*_recovery_qr` naming inconsistency, and the `device.rs` vs `session.rs` concurrency-primitive split. Each is independently correct and worth doing, but each is its own one-paragraph change and the PR is already covering its theme.
- DEV-C's other relay P2s: queue TTL/persistence/cap, the `tools/relay/call.py` and `call.ts` tracking decision, `server.ts` per-connection factory comment. None block this PR.
- The eight "Open architectural decisions" at the bottom of the synthesis (deliberate-Drop-omission, parser migration timing, bootstrap rule, `Lock` subcommand visibility, Task 12 status, call.py tracking, snake_case-vs-camelCase WASM naming, `.gitea_env_vars` gitignore). Most are user-judgement calls; this PR confirms #1 (the omission was not deliberate; the fix lands here) but takes no position on #2-#8.
- Anything touching the in-flight uncommitted v0.5.x work — manifests, glyphs, vault.css, relay tooling beyond `start.sh`. Those changes are on parallel branches and are explicitly hands-off for this PR.
## Done criteria
A reviewer can confirm this plan shipped by checking:
- [ ] `crates/relicario-wasm/src/lib.rs` contains `impl Drop for SessionHandle` whose body removes the registry entry.
- [ ] `crates/relicario-wasm/tests/session_drop.rs` (or equivalent native `#[test]` inside `mod session_tests`) asserts that dropping a `SessionHandle` removes its handle id from the session registry. `cargo test -p relicario-wasm` passes the new test.
- [ ] `extension/src/service-worker/session.ts` no longer contains `try { current.free(); } catch`. The body of `clearCurrent()` calls `current.free()` unconditionally, then sets `current = null`.
- [ ] `grep -rn "\.free\b" extension/src/` returns exactly one match (the SW callsite). PR description records this audit.
- [ ] `extension/src/service-worker/__tests__/session.test.ts` covers `clearCurrent()` with both the no-op (no current handle) and the propagating-throw cases. `npm test` from `extension/` passes.
- [ ] `crates/relicario-core/src/recovery_qr.rs` has a module-level `//!` header explaining format + domain-separation + parameter-pinning, an ASCII layout diagram next to the magic constants, doc-comments on `RecoveryQrPayload`, `generate_recovery_qr`, `unwrap_recovery_qr`, and `recovery_qr_to_svg`, plus either a `const RECOVERY_PRODUCTION_PARAMS` or a doc-commented `production_params()` explaining why it deliberately diverges from `KdfParams::default()`. `cargo test -p relicario-core` still passes.
- [ ] `tools/relay/queue.test.ts:54-55` verified to already match the new role union (no change required, recorded in PR description).
- [ ] `tools/relay/start.sh` discovers a `*-dev-c-prompt.md`, opens a fourth window/terminal in tmux and kitty modes, and the manual-mode banner mentions all four roles. `bash tools/relay/start.sh --manual` shows "Open 4 new terminals" and lists the Dev-C prompt path.