Files
relicario/docs/superpowers/specs/2026-05-04-security-polish-design.md
adlee-was-taken 4f7ab91f14 docs(specs): three architecture-review followup plans (security/CLI/extension)
Plan A (security & docs polish, S): SessionHandle impl Drop + JS .free()
audit + recovery_qr.rs documentation + relay launcher dev-c expansion.
Independent of B/C; ships first.

Plan B (CLI restructure, M-L): split cli/main.rs (2641 LOC) into commands/
folder + prompt.rs + parse.rs; helpers::git_run captures stderr; Vault::
after_manifest_change centralizes the groups-cache discipline; canonical
ParamsFile; batched purge; migrate parse_month_year/base32_decode_lenient/
guess_mime to relicario-core with WASM re-exports.

Plan C (extension restructure, L): typed StateHost (precondition); extract
service-worker/storage.ts; setup.ts SW migration via create_vault/
attach_vault messages + step-registry pattern; vault.ts split into
shell/sidebar/list/drawer/form-wrapper with vault_locked channel
unified through shared/state.ts; P2 cluster (timer reset, gitHost clear,
teardown helper, allSettled, MutationObserver debounce); get_vault_status
closes the relicario status parity gap.

Cross-boundary cites verified: Plan B Phase 8 WASM exports are the seam
Plan C consumes (deferred to a future plan); Plan A owns the .free() swallow
removal that Plan C respects without redoing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 20:02:40 -04:00

23 KiB
Raw Permalink Blame History

Security & Docs Polish — Design

Date: 2026-05-04 Status: Proposed Source: docs/superpowers/reviews/2026-05-04-architecture-review.md (P1.1, P1.7, P1.8) Effort estimate: S (whole-PR; phase totals below)

Summary

This is the one-day security and documentation PR that ships first in the v0.5.x architecture-followup train, independent of Plan B (CLI restructure) and Plan C (extension restructure). It closes the single defense-in-depth crypto gap the review flagged (SessionHandle has no impl Drop, so wasm-bindgen's .free() is a cleanup no-op while the master key sits in WASM linear memory), removes the JS-side error swallow that was masking that fact, brings the one undocumented security-critical core file (recovery_qr.rs) up to the documentation density of crypto.rs / imgsecret.rs / backup.rs / tar_safe.rs, and confirms the relay test/launcher polish items. Together these are mechanical, low-risk changes that fix the only finding in the review where "what the code looks like it does" diverges dangerously from "what it actually does" — which is exactly the wall a tinkerer learning Rust would hit hardest.

Findings addressed

  • P1.1 — SessionHandle has no impl Drop; .free() is a cleanup no-op — DEV-B (Rust headline) + DEV-C (JS partner finding). Files: crates/relicario-wasm/src/lib.rs:15-23, crates/relicario-wasm/src/session.rs:1-58, extension/src/service-worker/session.ts:26.
  • Partner JS swallow — DEV-C P2 in service-worker. File: extension/src/service-worker/session.ts:26 (the try { current.free(); } catch { /* already freed */ } block).
  • P1.7 — recovery_qr.rs is undocumented relative to the rest of core — DEV-A. File: crates/relicario-core/src/recovery_qr.rs:1-130.
  • P1.8 — tools/relay/queue.test.ts assertion update — DEV-C. File: tools/relay/queue.test.ts:54. Confirmed during planning: already fixed in 061facd — line 54 now reads assert.ok(isRole("dev-c")); and the negative case assert.ok(!isRole("dev-d")); is on line 55. No code work remains; the plan tracks this as a verification step.
  • Launcher follow-up — tools/relay/start.sh still references three roles — DEV-C P2. File: tools/relay/start.sh:30-86. Confirmed during planning: not yet fixed. Both launch_tmux and launch_kitty open only PM, Dev-A, Dev-B (no Dev-C); the manual-mode banner still says "Open 3 new terminals"; the post-launch summary lines (:61, :81) print "3 windows (PM, Dev-A, Dev-B)"; no *-dev-c-prompt.md is discovered or used. This phase is real.

Approach

The PR is mechanical and uniform across four phases:

  • Adds one impl Drop for SessionHandle block in crates/relicario-wasm/src/lib.rs, one Rust test that exercises construct → drop → registry-empty, one module-level //! doc-block in recovery_qr.rs, one ASCII layout diagram next to the magic constants there, four per-function doc-comments on the public API there, and one DEV_C_PROMPT discovery line + a fourth window in each launcher mode of start.sh.
  • Removes the try { current.free(); } catch { /* already freed */ } swallow at extension/src/service-worker/session.ts:26.
  • Verifies every .free() callsite under extension/src/ (the audit deliverable for P1.1's JS side; today the audit itself is small — only one match — but the deliverable is the recorded grep + a check that every code path that holds a SessionHandle calls wasm.lock(handle) before free()).
  • Confirms in passing that tools/relay/queue.test.ts:54 already matches the new role union (it does — committed in 061facd).

No public API changes on the Rust side beyond the Drop impl (which is observably equivalent to "calling lock(handle) was previously the only way to clean up"). No JS public API changes. No file moves. No new dependencies (wasm-bindgen-test is already in [dev-dependencies] for the wasm crate).

Implementation phases

Phase 1 — Rust-side impl Drop for SessionHandle + test

  • Goal: make wasm-bindgen's auto-generated .free() actually clear the master key from WASM linear memory, regardless of whether JS also called lock(handle).
  • Changes:
    • crates/relicario-wasm/src/lib.rs — add impl Drop for SessionHandle { fn drop(&mut self) { let _ = session::remove(self.0); } } immediately after the existing #[wasm_bindgen] impl SessionHandle { ... value() ... } block (around line 23). Update the doc-comment on SessionHandle (line 15) to document the contract: "Dropping (or .free() from JS) removes the entry from the session registry and zeroizes the wrapped key/image_secret. lock(handle) remains available as the explicit early-cleanup path; Drop is the safety net."
    • crates/relicario-wasm/src/session.rs — no functional change required, but add a one-line module-level note above pub fn remove(...) explicitly stating that Drop for SessionHandle is the second caller of this function alongside lock(), so that a future reader removing lock() doesn't accidentally orphan the cleanup path.
  • Tests:
    • Add a #[wasm_bindgen_test] (the wasm-bindgen-test crate is already in [dev-dependencies], line 25 of Cargo.toml) under crates/relicario-wasm/tests/session_drop.rs. The test constructs a SessionHandle (via the public unlock path with synthetic JPEG + tiny KdfParams for fast key derivation, mirroring the existing native session_tests::manifest_round_trip_via_handle shape), records the underlying handle id via handle.value(), drops the handle, then asserts session::with(id, |_| ()).is_none(). The session::with symbol must be re-exported from tests/ scope; if it isn't, expose pub fn with_for_tests(handle: u32) -> bool on the session module guarded by #[cfg(test)] so the test does not need to reach into thread-locals.
    • As a fallback / belt-and-suspenders, add a parallel native #[test] under the existing mod session_tests in lib.rs that constructs a SessionHandle(session::insert(...)), drops it explicitly with drop(handle), then asserts session::with(h, |_| ()).is_none(). Native tests run on every cargo test; the wasm-bindgen-test runs only when someone invokes wasm-pack test, so the native variant is the one that catches regressions in CI.
  • Effort: S
  • Depends on: none

Phase 2 — JS-side .free() audit + remove the swallow

  • Goal: make the JS side stop hiding crypto-state-transition errors, and prove (via a recorded audit) that there are no other .free() callsites on SessionHandle (or on EncryptedAttachment / TotpCode) that bypass wasm.lock(handle) first.
  • Changes:
    • extension/src/service-worker/session.ts:24-28 — replace the body of clearCurrent() so that current.free() is called unconditionally (no try/catch) and current is set to null afterwards. The doc-block at the top of the file (lines 1-9) gets one extra paragraph noting that with impl Drop on the Rust side (Phase 1), free() now removes the registry entry — lock(handle) becomes a redundant early-cleanup option, but the SW currently does not call it. Document whether to keep clearCurrent() minimal (just free()) or add an explicit wasm.lock(current.value) immediately before free() for symmetry with the CLI's session lifecycle. Recommended: just free() post-Phase-1 — calling lock first is now belt-and-suspenders that adds a redundant HashMap remove. Document the rationale in the comment.
    • Audit deliverable: run grep -rn "\.free\b" extension/src/ and record the result in the PR description. Today this returns exactly one match (extension/src/service-worker/session.ts:26), confirming the SW is the only surface that calls .free() on a wasm-bindgen handle. If a future surface (e.g. an offscreen document, a future setup-flow refactor) adds a second callsite, this PR's grep becomes the baseline that flags the new one.
    • Lifecycle audit (no code change required, but record findings in PR description): confirm every code path that constructs a SessionHandle via unlock(...) ultimately reaches clearCurrent(). The two paths are (a) explicit user lock via the popup lock action → service-worker/router/popup-only.ts lock handler → clearCurrent(), and (b) inactivity timer expiry → service-worker/index.ts:51-58 session-expired callback → clearCurrent(). Both are already wired; this audit just records that the wiring is complete.
  • Tests:
    • Vitest in extension/: add a small unit test in extension/src/service-worker/__tests__/session.test.ts (create if absent) that asserts clearCurrent() is a no-op when current is null, calls setCurrent({ free: spy, value: 1 } as unknown as SessionHandle) then clearCurrent(), and asserts spy was called exactly once and getCurrent() returns null. Use the existing __stubs__/relicario_wasm.stub.ts pattern for the handle shape if convenient.
    • Optionally extend the lifecycle test: a second case where free() throws — assert the exception now propagates out of clearCurrent() (regression coverage for the swallow removal).
  • Effort: S
  • Depends on: Phase 1 (the swallow removal is only safe once the Rust side actually does the cleanup work that .free() is now supposed to do)

Phase 3 — recovery_qr.rs documentation pass

  • Goal: raise crates/relicario-core/src/recovery_qr.rs to the documentation density of crypto.rs / imgsecret.rs / backup.rs / tar_safe.rs so a reader landing in the file does not assume it is out-of-the-documented-zone scratch code.
  • Changes: all in crates/relicario-core/src/recovery_qr.rs:
    • Module-level //! header at the top of the file (above the use block on line 1). Three short paragraphs: (1) what the module produces (a 109-byte payload encoded as a QR v40 EcLevel::M SVG that, given the same passphrase, recovers the 32-byte image_secret); (2) why the format is structured this way — XChaCha20-Poly1305 wrapping over an Argon2id-derived "wrap key" derived from the recovery passphrase (b"relicario-recovery-v1\0" domain-separation prefix prevents the wrap key from ever colliding with a vault master key, even if the user reuses their vault passphrase here); (3) parameter-pinning rationaleproduction_params() deliberately re-states KdfParams::default()'s values rather than referencing them, because changing the default at the type level must not silently invalidate every recovery QR a user has printed.
    • ASCII layout diagram above the magic-constant block (lines 7-9):
      //  byte    field           length
      //  ------  --------------  ------
      //  0..4    MAGIC = "RREC"  4
      //  4..5    VERSION = 0x01  1
      //  5..37   kdf_salt        32   (random per QR)
      //  37..61  wrap_nonce      24   (random per QR)
      //  61..109 ciphertext      48   (32 image_secret + 16 AEAD tag)
      //  ------------------------------
      //  total                   109
      
      Pair the diagram with a const _: () = assert!(PAYLOAD_LEN == 4 + 1 + 32 + 24 + 48); static assertion so a future edit that changes the layout cannot drift from the documented total without a compile error.
    • Doc-comments on the four public items:
      • RecoveryQrPayload (line 11): one-paragraph summary; note that as_bytes() is the only accessor and that the bytes are an opaque sealed package — they should travel together (printing them as a QR is the supported transport, but a hex string also works for paranoid sneakernet).
      • generate_recovery_qr (line 45): documents inputs (passphrase, image_secret), output (the 109-byte payload), and the security properties (random kdf_salt + random wrap_nonce per call ⇒ two QRs from the same passphrase + image_secret are different bytes). Cross-reference recovery_qr_to_svg for rendering.
      • unwrap_recovery_qr (line 81): documents inputs/output, return-type rationale (Zeroizing<[u8; 32]> so the recovered image_secret zeroes when dropped), and the error vocabulary (RecoveryQr for format errors, Decrypt for AEAD failure / wrong passphrase — the deliberate non-distinguishing rejection from audit M4).
      • recovery_qr_to_svg (line 122): documents the rendering choice (QR v40, EcLevel::M, 140×140 minimum dimension) and the rationale (109 bytes fits comfortably in v40 at M, leaving recovery-error-correction headroom for a smudged print).
    • production_params (line 32): add a doc-comment explaining the deliberate divergence-tolerance from KdfParams::default() — wording: "Pinned at QR-format authoring time. KdfParams::default() may evolve as we re-tune Argon2 cost; this function MUST NOT change values that have ever shipped, because doing so silently invalidates every recovery QR printed under a previous parameter set." Either keep the function-with-comment shape, or per the synthesis suggestion, replace it with a const RECOVERY_PRODUCTION_PARAMS: KdfParams = KdfParams { argon2_m: 65536, argon2_t: 3, argon2_p: 4 }; if KdfParams is const-constructible (it is — checked via crypto.rs). Either is acceptable; the const form is preferred because it makes the "this is not derived from defaults" property visible at the use site.
    • Optional micro-cleanups while in the file (cite explicitly so they aren't surprising in code review):
      • Replace the hand-counted payload_bytes[5..37], [37..61], [61..109] slice indices with named constants (KDF_SALT_RANGE, WRAP_NONCE_RANGE, CIPHERTEXT_RANGE) derived from the layout offsets. Improves diff-readability if the layout ever extends.
      • Add a one-line comment to recovery_kdf_input documenting that the length-prefix on nfc_bytes mirrors the same convention as crypto::derive_master_key (audit H1 in the design spec) — this is the connective tissue that explains "why is there a (len as u64).to_be_bytes() here too?"
  • Tests:
    • No new test logic needed — the existing tests/recovery_qr.rs and the RecoveryQr/Decrypt round-trips in tests/integration.rs already pin behaviour.
    • The const _: () = assert!(...) static assertion is itself the contract test for the layout diagram.
  • Effort: S
  • Depends on: none

Phase 4 — Relay launcher (queue.test.ts already done in 061facd)

  • Goal: confirm the test fix that already shipped, and bring the launcher script up to the four-role world.
  • Changes:
    • tools/relay/queue.test.tsno change. Confirmed during planning: line 54 is already assert.ok(isRole("dev-c")); and line 55 is already assert.ok(!isRole("dev-d")); (committed in 061facd). This phase records the verification step in the PR description so a reviewer can see the file was checked.
    • tools/relay/start.sh — real changes:
      • Add a DEV_C_PROMPT discovery line at lines 30-33 alongside the existing PM_PROMPT / DEV_A_PROMPT / DEV_B_PROMPT definitions: DEV_C_PROMPT="$(ls -t "$COORD_DIR"/*-dev-c-prompt.md 2>/dev/null | head -1 || echo "(none found)")".
      • Manual mode (print_manual_instructions, lines 35-51) — change "Open 3 new terminals" to "Open 4 new terminals" and add the Terminal 4 (Dev C): cat '$DEV_C_PROMPT' line.
      • tmux mode (launch_tmux, lines 53-69) — add a fourth tmux new-window -t "$SESSION:" -n "dev-c" "cd '$REPO_ROOT' && claude" line; update the summary print on line 61 from "4 windows: relay, pm, dev-a, dev-b" to "5 windows: relay, pm, dev-a, dev-b, dev-c"; add Dev C: $DEV_C_PROMPT to the prompts list (currently lines 64-66).
      • kitty mode (launch_kitty, lines 71-86) — add a fourth kitty @ launch --type=tab --tab-title "Dev-C" --hold -- bash -l -i -c "cd '$REPO_ROOT' && claude" line; update the summary print on line 81 from "3 windows (PM, Dev-A, Dev-B)" to "4 windows (PM, Dev-A, Dev-B, Dev-C)"; add Dev C: $DEV_C_PROMPT to the prompts list.
  • Tests: none directly. The relay launcher has no automated tests; verification is "run bash tools/relay/start.sh --manual and confirm the banner mentions four terminals + the Dev-C prompt path; run bash tools/relay/start.sh --kitty on a kitty terminal and confirm a fourth tab opens." Both checks belong in the PR's manual-test plan.
  • Effort: S
  • Depends on: none

Risks and mitigations

  • Adding impl Drop could change observable behaviour for any JS code that today holds two references to the same SessionHandle and frees one of them. wasm-bindgen does not in fact let JS hold two refs to the same Rust struct without explicit cloning — the SessionHandle is moved into JS at construction and .free() consumes the handle pointer. The audit in Phase 2 confirmed only one .free() callsite under extension/src/. The CLI does not touch the WASM crate at all (Rust-side; the CLI uses relicario-core directly). So the realistic blast radius is exactly the one SW callsite this plan also touches. Mitigation: Phase 1's native test explicitly covers "drop the handle, registry is empty," and Phase 2's vitest covers "free is called exactly once."
  • Removing the try { current.free() } catch { /* already freed */ } swallow could expose latent bugs that were silently tolerated. Specifically, if any code path today double-frees the handle (calls clearCurrent() twice without a setCurrent() in between), the catch is currently hiding it. Reading the SW lifecycle: clearCurrent() is called from (a) the lock message handler, (b) the inactivity-timer session-expired callback, and (c) potentially the unlock-failure cleanup path. The current if (!current) return; early-out at line 25 already guards the simplest double-free case (calling clearCurrent() twice in a row is safe because the second call no-ops). The only way the catch fires today is if something in WASM raises during free() — which post-Phase-1 should be impossible (Drop is infallible; session::remove returns a bool, not a Result). Mitigation: the swallow removal happens after Phase 1's Drop lands, so the catch should now have nothing to catch. If a regression test surfaces a real bug, the right fix is to find and fix the bug, not re-add the swallow.
  • The recovery_qr.rs static assertion (const _: () = assert!(PAYLOAD_LEN == ...);) requires Rust 1.79+ for const panic in test position. The workspace's rust-toolchain.toml (if present) or the Cargo.toml rust-version field needs to be checked; if the repo is on an older toolchain, fall back to a runtime #[test] fn payload_len_matches_layout() { assert_eq!(PAYLOAD_LEN, 4 + 1 + 32 + 24 + 48); } — same effect, slightly less elegant. Verify during implementation; not a blocker.
  • Phase 4's launcher change is untested by automation. A typo in the kitty/tmux command lines won't surface until someone runs start.sh --kitty or start.sh --tmux. Mitigation: the PR description includes an explicit manual-test plan ("run all three modes and confirm 4 windows + Dev-C prompt resolution") and the change pattern is symmetric to the existing Dev-A/Dev-B lines, minimizing typo surface.
  • The wasm-bindgen-test variant in Phase 1 only runs under wasm-pack test --node, which is not part of the project's default cargo test cycle. The native #[test] in mod session_tests is the one that protects against regressions in CI. The wasm-bindgen-test is included for symmetry and to exercise the actual WASM target's Drop behaviour — it should be invoked at least once during PR review and added to a follow-up CI workflow if the project's test infrastructure ever expands to wasm-pack.

Out of scope

This plan deliberately does not address, and explicitly leaves to other plans or the P2/P3 tail:

  • Plan B's CLI restructure work in its entirety (P1.2 main.rs split, P1.3 git-shell error UX, P1.10 parser migration to core, plus the CLI P2/P3 entries). These touch crates/relicario-cli/ and are independent of this PR.
  • Plan C's extension restructure work in its entirety (P1.4 setup.ts WASM extraction, P1.5 vault.ts split, P1.6 shared/state.ts typing, P1.9 SW router helper extraction, plus the extension P2/P3 entries). These touch extension/src/ and are independent of this PR — except for the single SW file that Phase 2 of this plan touches, which is small and disjoint from Plan C's surfaces.
  • The P2 WASM cleanups DEV-B flagged: redundant need_key + with(...).unwrap() double-lookup, Vec<u8> getter clones, the wasm_*_recovery_qr naming inconsistency, and the device.rs vs session.rs concurrency-primitive split. Each is independently correct and worth doing, but each is its own one-paragraph change and the PR is already covering its theme.
  • DEV-C's other relay P2s: queue TTL/persistence/cap, the tools/relay/call.py and call.ts tracking decision, server.ts per-connection factory comment. None block this PR.
  • The eight "Open architectural decisions" at the bottom of the synthesis (deliberate-Drop-omission, parser migration timing, bootstrap rule, Lock subcommand visibility, Task 12 status, call.py tracking, snake_case-vs-camelCase WASM naming, .gitea_env_vars gitignore). Most are user-judgement calls; this PR confirms #1 (the omission was not deliberate; the fix lands here) but takes no position on #2-#8.
  • Anything touching the in-flight uncommitted v0.5.x work — manifests, glyphs, vault.css, relay tooling beyond start.sh. Those changes are on parallel branches and are explicitly hands-off for this PR.

Done criteria

A reviewer can confirm this plan shipped by checking:

  • crates/relicario-wasm/src/lib.rs contains impl Drop for SessionHandle whose body removes the registry entry.
  • crates/relicario-wasm/tests/session_drop.rs (or equivalent native #[test] inside mod session_tests) asserts that dropping a SessionHandle removes its handle id from the session registry. cargo test -p relicario-wasm passes the new test.
  • extension/src/service-worker/session.ts no longer contains try { current.free(); } catch. The body of clearCurrent() calls current.free() unconditionally, then sets current = null.
  • grep -rn "\.free\b" extension/src/ returns exactly one match (the SW callsite). PR description records this audit.
  • extension/src/service-worker/__tests__/session.test.ts covers clearCurrent() with both the no-op (no current handle) and the propagating-throw cases. npm test from extension/ passes.
  • crates/relicario-core/src/recovery_qr.rs has a module-level //! header explaining format + domain-separation + parameter-pinning, an ASCII layout diagram next to the magic constants, doc-comments on RecoveryQrPayload, generate_recovery_qr, unwrap_recovery_qr, and recovery_qr_to_svg, plus either a const RECOVERY_PRODUCTION_PARAMS or a doc-commented production_params() explaining why it deliberately diverges from KdfParams::default(). cargo test -p relicario-core still passes.
  • tools/relay/queue.test.ts:54-55 verified to already match the new role union (no change required, recorded in PR description).
  • tools/relay/start.sh discovers a *-dev-c-prompt.md, opens a fourth window/terminal in tmux and kitty modes, and the manual-mode banner mentions all four roles. bash tools/relay/start.sh --manual shows "Open 4 new terminals" and lists the Dev-C prompt path.