docs: multiplayer soak & UX test harness design

Design for a standalone Playwright-based soak runner that drives 16 authenticated browser sessions across 4 concurrent rooms to populate staging scoreboards and hunt stability bugs. Architected as a pluggable scenario harness so future UX test scenarios (reconnect, invite flow, admin workflows, mobile) slot in cleanly. Also gitignores .superpowers/ (brainstorming session artifacts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:03:28 -04:00
parent 52d7118c33
commit 97036be319
2 changed files with 639 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -214,6 +214,7 @@ cython_debug/

 # Claude Code
 .claude/
+.superpowers/

 # Virtualenv in project root
 bin/
--- a/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md
+++ b/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md
@@ -0,0 +1,638 @@
+# Multiplayer Soak & UX Test Harness — Design
+
+**Date:** 2026-04-10
+**Status:** Design approved, pending implementation plan
+
+## Context
+
+Golf Card Game is a real-time multiplayer WebSocket application with event-sourced game state, a leaderboard system, and an aggressive animation pipeline. Current test coverage is:
+
+- `server/` — pytest unit/integration tests
+- `tests/e2e/specs/` — Playwright tests exercising single-context flows (full game, stress with rapid clicks, visual regression, v3 features)
+
+What's missing: a way to exercise the system with **many concurrent authenticated users playing real multiplayer games** for long durations. We can't currently:
+
+1. Populate staging scoreboards with realistic game history for demos and visual verification
+2. Hunt race conditions, WebSocket leaks, and room cleanup bugs under sustained concurrent load
+3. Validate multiplayer UX end-to-end across rooms without manual coordination
+4. Exercise authentication, room lifecycle, and stats aggregation as a cohesive system
+
+This spec defines a **multi-scenario soak and UX test harness**: a standalone Playwright-based runner that drives 16 authenticated browser sessions across 4 concurrent rooms playing many games against each other (plus optional CPU opponents). It starts as a soak tool with two scenarios (`populate`, `stress`) and grows into the project's general-purpose multi-user UX test platform.
+
+## Goals
+
+1. **Scoreboard population** — run long multi-round games against staging with varied CPU personalities to produce realistic scoreboard data
+2. **Stability stress** — run rapid short games with chaos injection to surface race conditions and cleanup bugs
+3. **Extensibility** — new scenarios (reconnect, invite flow, admin workflow, mobile) slot in without runner changes
+4. **Watchability** — a dashboard mode with click-to-watch live video of any player, usable for demos and debugging
+5. **Per-run isolation** — test account traffic must be cleanly separable from real user traffic in stats queries
+
+## Non-goals
+
+- Replacing the existing `tests/e2e/specs/` Playwright tests (they serve a different purpose — single-context edge cases)
+- Distributed runner across multiple machines
+- Concurrent scenario execution (one scenario per run for MVP)
+- Grafana/OTEL integration
+- Auto-promoting findings to regression tests
+
+## Constraints
+
+- **Staging auth gate** — staging runs `INVITE_ONLY=true`; seeding must go through the register endpoint with an invite code
+- **Invite code `5VC2MCCN`** — provisioned with 16 uses, used once per test account on first-ever run, cached afterward
+- **Per-IP rate limiting** — `DAILY_SIGNUPS_PER_IP=20` on prod, lower default elsewhere; seeding must stay within budget
+- **Room idle cleanup** — `ROOM_IDLE_TIMEOUT_SECONDS=300` means the scenario must keep rooms active or tolerate cleanup cascades
+- **Existing bot code** — `tests/e2e/bot/golf-bot.ts` already provides `createGame`, `joinGame`, `addCPU`, `playTurn`, `playGame`; the harness reuses it verbatim
+
+## Architecture
+
+### Module layout
+
+```
+runner.ts (entry)
+  ├─ SessionPool          owns 16 BrowserContexts, seeds/logs in, allocates
+  ├─ Scenario             pluggable interface, per-scenario file
+  ├─ RoomCoordinator      host→joiners room-code handoff via Deferred<string>
+  ├─ Dashboard (optional) HTTP + WS server, status grid + click-to-watch video
+  └─ GolfBot (reused)     tests/e2e/bot/golf-bot.ts, unchanged
+```
+
+Default: one browser, 16 contexts (lowest RAM, fastest startup). `WATCH=tiled` is the exception — it launches two browsers, one headed (hosts) and one headless (joiners), because Chromium's headed/headless flag is browser-scoped, not context-scoped. See the `tiled` implementation detail below.
+
+### Location
+
+New sibling directory `tests/soak/` — does not modify `tests/e2e/`. Shares `GolfBot` via direct import from `../e2e/bot/`.
+
+Rationale: Playwright Test is designed for short isolated tests. A single `test()` running 16 contexts for hours fights the test model (worker limits, all-or-nothing failure, single giant trace file). A standalone Node script gives first-class CLI flags, full control over the event loop, and a clean home for the dashboard server, and it reuses the `GolfBot` class unchanged. Existing `tests/e2e/specs/stress.spec.ts` stays as-is for single-context edge cases.
+
+## Components
+
+### SessionPool
+
+Owns the lifecycle of 16 authenticated `BrowserContext`s.
+
+**Responsibilities:**
+- On first run: register 16 accounts via `POST /api/auth/register` with invite code `5VC2MCCN`, cache credentials to `.env.stresstest`
+- On subsequent runs: read cached credentials, create contexts, inject auth into each (localStorage token, or re-login via cached password if token rejected)
+- Expose `acquire({ count }): Promise<Session[]>` — scenarios request N authenticated sessions without caring how they got there
+- On scenario completion: close all contexts cleanly
+
+**`Session` shape:**
+```typescript
+interface Session {
+  context: BrowserContext;
+  page: Page;
+  bot: GolfBot;
+  account: Account;  // { username, password, token }
+  key: string;       // stable identifier, e.g., "soak_07"
+}
+```
+
+**`.env.stresstest` format** (gitignored, local-only, plaintext — this is a test tool):
+```
+SOAK_ACCOUNT_00=soak_00_a7bx:Hunter2!xK9mQ:eyJhbGc...
+SOAK_ACCOUNT_01=soak_01_c3pz:Kc82!wQm4Rt:eyJhbGc...
+...
+SOAK_ACCOUNT_15=soak_15_m9fy:Px7!eR4sTn2:eyJhbGc...
+```
+
+Value format (after the `=`): `username:password:token`. The password is kept so the pool can recover from token expiry automatically.
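+
+A minimal sketch of how one cached line could be parsed. `parseAccountLine` is a hypothetical helper, not something the spec defines, and it assumes usernames and tokens never contain a colon (the generated password might):
+
+```typescript
+// Hypothetical helper: parses one `.env.stresstest` line into its parts.
+interface Account {
+  username: string;
+  password: string;
+  token: string;
+}
+
+function parseAccountLine(line: string): { key: string; account: Account } {
+  const eq = line.indexOf('=');
+  // Do not echo the line in errors: it contains credentials.
+  if (eq < 0) throw new Error("missing '=' in account line");
+  const key = line.slice(0, eq);
+  const value = line.slice(eq + 1);
+  // Split on the first and last colon so a ':' inside the password survives.
+  const first = value.indexOf(':');
+  const last = value.lastIndexOf(':');
+  if (first < 0 || first === last) {
+    throw new Error(`expected username:password:token for ${key}`);
+  }
+  return {
+    key,
+    account: {
+      username: value.slice(0, first),
+      password: value.slice(first + 1, last),
+      token: value.slice(last + 1),
+    },
+  };
+}
+```
+
+Splitting on the outermost colons is one way to avoid constraining the password alphabet; restricting generated passwords to colon-free characters would work as well.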
+
+### Scenario interface
+
+```typescript
+export interface ScenarioNeeds {
+  accounts: number;
+  rooms?: number;
+  cpusPerRoom?: number;
+}
+
+export interface ScenarioContext {
+  config: ScenarioConfig;        // CLI flags merged with scenario defaults
+  sessions: Session[];           // pre-authenticated, pre-navigated
+  coordinator: RoomCoordinator;
+  dashboard: DashboardReporter;  // no-op when watch mode doesn't use it
+  logger: Logger;
+  signal: AbortSignal;           // graceful shutdown
+  heartbeat(roomId: string): void;  // resets the per-room watchdog
+}
+
+export interface ScenarioResult {
+  gamesCompleted: number;
+  errors: ScenarioError[];
+  durationMs: number;
+  customMetrics?: Record<string, number>;
+}
+
+export interface Scenario {
+  name: string;
+  description: string;
+  defaultConfig: ScenarioConfig;
+  needs: ScenarioNeeds;
+  run(ctx: ScenarioContext): Promise<ScenarioResult>;
+}
+```
+
+Scenarios are plain objects exported as default from files in `tests/soak/scenarios/`. The runner discovers them via a registry (`scenarios/index.ts`) that maps name → module. No filesystem scanning, no magic.
+
+### RoomCoordinator
+
+~30 lines. Solves host→joiners room-code handoff:
+
+```typescript
+class RoomCoordinator {
+  private rooms = new Map<string, Deferred<string>>();
+
+  announce(roomId: string, code: string) { this.get(roomId).resolve(code); }
+  async await(roomId: string): Promise<string> { return this.get(roomId).promise; }
+  private get(roomId: string) {
+    if (!this.rooms.has(roomId)) this.rooms.set(roomId, deferred());
+    return this.rooms.get(roomId)!;
+  }
+}
+```
+
+Usage:
+```typescript
+// Host
+const code = await host.bot.createGame(host.account.username);
+coordinator.announce('room-1', code);
+
+// Joiners (concurrent)
+const code = await coordinator.await('room-1');
+await joiner.bot.joinGame(code, joiner.account.username);
+```
+
+No polling, no sleeps, no cross-page scraping.
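+
+The coordinator relies on a `Deferred<string>` type and a `deferred()` factory that this spec does not define. A minimal sketch of what they could look like, assuming nothing beyond standard Promises:
+
+```typescript
+// Assumed shape of the Deferred helper used by RoomCoordinator.
+interface Deferred<T> {
+  promise: Promise<T>;
+  resolve(value: T): void;
+  reject(reason: unknown): void;
+}
+
+function deferred<T>(): Deferred<T> {
+  // The Promise executor runs synchronously, so both callbacks are
+  // assigned before this function returns.
+  let resolve!: (value: T) => void;
+  let reject!: (reason: unknown) => void;
+  const promise = new Promise<T>((res, rej) => {
+    resolve = res;
+    reject = rej;
+  });
+  return { promise, resolve, reject };
+}
+```
+
+Because the coordinator creates the deferred lazily on first access, it does not matter whether the host announces before or after the joiners start waiting.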
+
+### Dashboard
+
+Optional — only instantiated when `WATCH=dashboard`.
+
+**Server side** (`dashboard/server.ts`): vanilla `http` + `ws` module. Serves a single static HTML page, accepts WebSocket connections, relays messages between scenarios and the browser.
+
+**Client side** (`dashboard/index.html` + `dashboard.js`): 2×2 room grid, per-player tiles with live status (current player, score, held card, phase, moves), progress bars per hole, activity log at the bottom. No framework, ~300 lines total.
+
+**Click-to-watch**: clicking a player tile sends `start_stream(sessionKey)` over WS. The runner attaches a CDP session to that player's page via `context.newCDPSession(page)`, calls `Page.startScreencast` with `{format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2}`, and forwards each `Page.screencastFrame` event to the dashboard as `{ sessionKey, jpeg_b64 }`. Each frame must be acknowledged with `Page.screencastFrameAck` (passing the frame's `sessionId`), otherwise Chromium stops sending frames. The dashboard renders the stream into an `<img>` that swaps `src` on each frame.
+
+Returning to the grid sends `stop_stream(sessionKey)`; the runner calls `Page.stopScreencast` and detaches the CDP session. On WS disconnect, all active screencasts stop. This keeps the screencast CPU cost at zero except while someone is actively watching.
+
+**`DashboardReporter` interface exposed to scenarios:**
+```typescript
+interface DashboardReporter {
+  update(roomId: string, state: Partial<RoomState>): void;
+  log(level: 'info'|'warn'|'error', msg: string, meta?: object): void;
+  incrementMetric(name: string, by?: number): void;
+}
+```
+
+When `WATCH` is not `dashboard`, all three methods are no-ops; structured logs still go to stdout.
+
+### Runner
+
+`runner.ts` is the CLI entry point. Parses flags, resolves config precedence, launches browser(s), instantiates `SessionPool` + `RoomCoordinator` + (optional) `Dashboard`, loads the requested scenario by name, executes it, reports results, cleans up.
+
+## Scenarios
+
+### Scenario 1: `populate`
+
+**Goal:** produce realistic scoreboard data for staging demos.
+
+**Config:**
+```typescript
+{
+  name: 'populate',
+  description: 'Long multi-round games to populate scoreboards',
+  needs: { accounts: 16, rooms: 4, cpusPerRoom: 1 },
+  defaultConfig: {
+    gamesPerRoom: 10,
+    holes: 9,
+    decks: 2,
+    cpuPersonalities: ['Sofia', 'Marcus', 'Kenji', 'Priya'],
+    thinkTimeMs: [800, 2200],
+    interGamePauseMs: 3000,
+  },
+}
+```
+
+**Shape:** 4 rooms × 4 accounts + 1 CPU each. Each room runs `gamesPerRoom` sequential games. Inside a room: host creates game → joiners join → host adds CPU → host starts game → all sessions loop on `isMyTurn()` + `playTurn()` with randomized human-like think time between turns. Between games, rooms pause briefly to mimic natural pacing.
+
+### Scenario 2: `stress`
+
+**Goal:** hunt race conditions and stability bugs.
+
+**Config:**
+```typescript
+{
+  name: 'stress',
+  description: 'Rapid short games for stability & race condition hunting',
+  needs: { accounts: 16, rooms: 4, cpusPerRoom: 2 },
+  defaultConfig: {
+    gamesPerRoom: 50,
+    holes: 1,
+    decks: 1,
+    thinkTimeMs: [50, 150],
+    interGamePauseMs: 200,
+    chaosChance: 0.05,
+  },
+}
+```
+
+**Shape:** same as `populate` but tight loops, 1-hole games, and a chaos injector that fires with 5% probability per turn. Chaos events:
+- Rapid concurrent clicks on multiple cards
+- Random tab-navigation away and back
+- Simultaneous click on card + discard button
+- Brief WebSocket drop via Playwright's `context.setOffline()` followed by reconnect
+
+Each chaos event is logged with enough context to reproduce (room, player, turn, event type).
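+
+The pacing and chaos knobs (`thinkTimeMs`, `chaosChance`) could be implemented with two small helpers. This is a sketch only; `thinkTime`, `shouldInjectChaos`, and the injectable `Rng` are hypothetical names introduced here so that runs can be made reproducible with a seeded generator:
+
+```typescript
+// Returns a float in [0, 1), like Math.random. Injectable for reproducibility.
+type Rng = () => number;
+
+// Picks a think time in milliseconds from a [min, max] range.
+function thinkTime([min, max]: [number, number], rng: Rng = Math.random): number {
+  return Math.round(min + rng() * (max - min));
+}
+
+// Decides, once per turn, whether a chaos event should fire.
+function shouldInjectChaos(chaosChance: number, rng: Rng = Math.random): boolean {
+  return rng() < chaosChance;
+}
+```
+
+Logging the generator's seed alongside each chaos event would be one way to meet the reproducibility requirement above.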
+
+### Future scenarios (not MVP, design anticipates them)
+
+- `reconnect` — 2 accounts, deliberate mid-game disconnect, verify recovery
+- `invite-flow` — 0 accounts (fresh signups), exercise invite request → approval → first-game pipeline
+- `admin-workflow` — 1 admin account, drive the admin panel
+- `mobile-populate` — reuses `populate` with `devices['iPhone 13']` context options
+- `replay-viewer` — watches completed games via the replay UI
+
+Each is a new file in `tests/soak/scenarios/`, zero runner changes.
+
+## Data flow
+
+### Cold start (first-ever run)
+
+1. Runner reads `.env.stresstest` → file missing
+2. `SessionPool.seedAccounts()`:
+   - For `i` in `0..15`: `POST /api/auth/register` with `{ username, password, email, invite_code: '5VC2MCCN' }`
+   - Receive `{ user, token, expires_at }`, write to `.env.stresstest`
+3. Server sets `is_test_account=true` automatically because the invite code has `marks_as_test=true` (see Server changes)
+4. Runner proceeds to normal startup
+
+### Warm start (subsequent runs)
+
+1. Runner reads `.env.stresstest` → 16 entries
+2. `SessionPool` creates 16 `BrowserContext`s
+3. For each context: inject token into localStorage using the key the client app reads on load (resolved during implementation by inspecting `client/app.js`; see Open Questions)
+4. Each session navigates to `/` and lands post-auth
+5. If any token is rejected (401), pool silently re-logs in via cached password and refreshes the token in `.env.stresstest`
+
+### Seeding: explicit script vs automatic fallback
+
+Two paths to the same result, for flexibility:
+
+- **Preferred: explicit `npm run seed`** — runs `scripts/seed-accounts.ts` once during bring-up. Gives clear feedback, fails loudly on rate limits or network issues, lets you verify the accounts exist before a real run.
+- **Fallback: auto-seed on cold start** — if `runner.ts` starts and `.env.stresstest` is missing, `SessionPool` invokes the same seeding logic transparently. Useful for CI or fresh clones where nobody ran the explicit step.
+
+Both paths share the same code in `core/session-pool.ts`; the script is a thin CLI wrapper around `SessionPool.seedAccounts()`. Documented in `tests/soak/README.md` with "run `npm run seed` first" as the happy path.
+
+### Room code handoff
+
+Host session calls `createGame` → receives room code → `coordinator.announce(roomId, code)`. Joiner sessions `await coordinator.await(roomId)` → receive code → call `joinGame`. All in-process, no polling.
+
+## Watch modes
+
+| Mode | Flag | Rendering | When to use |
+|---|---|---|---|
+| `none` | `WATCH=none` | Pure headless, JSONL stdout | CI, overnight unattended |
+| `dashboard` | `WATCH=dashboard` *(default)* | HTML status grid + click-to-watch live video | Interactive runs, demos, debugging |
+| `tiled` | `WATCH=tiled` | 4 native Chromium windows positioned 2×2 | Hands-on power-user debugging |
+
+### `tiled` implementation detail
+
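+Two browsers are launched: one headed (`headless: false, slowMo: 50`) for the 4 host contexts, one headless for the 12 joiner contexts. Host windows are positioned after load via `page.evaluate(([x, y]) => window.moveTo(x, y), [x, y])`; the coordinates are passed as an argument because the callback runs inside the page and cannot close over runner variables. Browsers may ignore `window.moveTo` for windows that were not opened by script, so the CDP `Browser.setWindowBounds` command is a likely fallback to evaluate during implementation. The grid is computed from the screen size, with a default of 1920×1080.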
+
+## Server-side changes
+
+All changes are additive and fit the existing inline migration pattern in `server/stores/user_store.py`.
+
+### 1. Schema
+
+Two new columns + one partial index:
+
+```sql
+ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS is_test_account BOOLEAN DEFAULT FALSE;
+CREATE INDEX IF NOT EXISTS idx_users_v2_is_test_account ON users_v2(is_test_account)
+  WHERE is_test_account = TRUE;
+ALTER TABLE invite_codes ADD COLUMN IF NOT EXISTS marks_as_test BOOLEAN DEFAULT FALSE;
+```
+
+Partial index because ~99% of rows will be `FALSE`; we only want to accelerate the "show test accounts" admin queries, not pay index-maintenance cost on every normal write.
+
+### 2. Register flow propagates the flag
+
+In `services/auth_service.py`, after resolving the invite code, read `marks_as_test` and pass through to `user_store.create_user`:
+
+```python
+invite = await admin_service.get_invite_code(invite_code)
+is_test = bool(invite and invite.marks_as_test)
+user = await user_store.create_user(
+    username=..., password_hash=..., email=...,
+    is_test_account=is_test,
+)
+```
+
+Users signing up without an invite or with a non-test invite are unaffected.
+
+### 3. One-time: flag `5VC2MCCN` as test-seed
+
+Executed once against staging (and any other environment the harness runs against):
+
+```sql
+UPDATE invite_codes SET marks_as_test = TRUE WHERE code = '5VC2MCCN';
+```
+
+Documented in the seeder script as a comment, and in `tests/soak/README.md` as a bring-up step. No admin UI for flagging invites as test-seed in MVP — add later if needed.
+
+### 4. Stats filtering
+
+Add `include_test: bool = False` parameter to stats queries in `services/stats_service.py`:
+
+```python
+async def get_leaderboard(self, limit=50, include_test=False):
+    query = """
+      SELECT ... FROM player_stats ps
+      JOIN users_v2 u ON u.id = ps.user_id
+      WHERE ($1 OR NOT u.is_test_account)
+      ORDER BY ps.total_points DESC
+      LIMIT $2
+    """
+    return await conn.fetch(query, include_test, limit)
+```
+
+Router in `routers/stats.py` exposes `include_test` as an optional query parameter. Default `False` — real users visiting the site never see soak traffic. Admin panel and debugging views pass `?include_test=true`.
+
+Same treatment for:
+- `get_player_stats(user_id, include_test)` — gates individual profile lookups
+- `get_recent_games(include_test)` — hides games where any participant is a test account by default
+
+### 5. Admin panel surfacing
+
+Small additions to `client/admin.html` + `client/admin.js`:
+- User list: "Test" badge column for `is_test_account=true` rows
+- Invite codes: "Test-seed" indicator next to `marks_as_test=true` codes
+- Leaderboard + user list: "Include test accounts" toggle → passes `?include_test=true`
+
+### Out of scope (server-side)
+
+- New admin endpoint for marking existing accounts as test
+- Admin UI for flagging invites as test-seed at creation time
+- Separate "test stats only" aggregation (admins invert their mental filter)
+- `test_only=true` query mode
+
+## Error handling
+
+### Failure taxonomy
+
+| Category | Example | Strategy |
+|---|---|---|
+| Recoverable game error | Animation flag stuck, click missed target | Log, continue, bot retries via existing `GolfBot` fallbacks |
+| Recoverable session error | WS disconnect for one player, token expires | Reconnect session, rejoin game if possible, abort that room only if unrecoverable |
+| Unrecoverable room error | Room stuck >60s, impossible state | Kill the room, capture artifacts, let other rooms continue |
+| Fatal runner error | Staging unreachable, invite code exhausted, OOM | Stop everything cleanly, dump summary, exit non-zero |
+
+**Core principle: per-room isolation.** A failure in room 3 never unwinds rooms 1, 2, 4. Each room runs in its own `Promise.allSettled` branch.
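+
+A minimal sketch of that isolation, assuming a hypothetical `runRoom` callback that plays all games for one room and rejects on unrecoverable failure:
+
+```typescript
+// Hypothetical result shape for one room's branch.
+type RoomOutcome = { roomId: string; ok: boolean; error?: string };
+
+async function runRooms(
+  roomIds: string[],
+  runRoom: (roomId: string) => Promise<void>,
+): Promise<RoomOutcome[]> {
+  // allSettled waits for every branch, so one rejection never cancels the rest.
+  // The async wrapper also converts synchronous throws into rejections.
+  const settled = await Promise.allSettled(roomIds.map(async (id) => runRoom(id)));
+  return settled.map((result, i) =>
+    result.status === 'fulfilled'
+      ? { roomId: roomIds[i], ok: true }
+      : { roomId: roomIds[i], ok: false, error: String(result.reason) },
+  );
+}
+```
+
+With `Promise.all` the first rejection would surface immediately while other rooms were still mid-game, which is the behavior this design is trying to avoid.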
+
+### Per-room watchdog
+
+Each room gets a watchdog that resets on every `ctx.heartbeat(roomId)` call. If a room has not sent a heartbeat in 60s, the watchdog captures artifacts and aborts that room only; the runner continues with the remaining rooms.
+
+Scenarios call `heartbeat` at each significant progress point (turn played, game started, game finished). The helper `DashboardReporter.update()` internally calls `heartbeat` as a convenience, so scenarios that use the dashboard reporter get watchdog resets for free. Scenarios that run with `WATCH=none` still need to call `heartbeat` explicitly at least once per 60s — a single call at the top of the per-turn loop is sufficient.
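+
+One possible shape for the watchdog bookkeeping, written as a sketch. `RoomWatchdog` and its methods are hypothetical; it only tracks timestamps and leaves polling (for example via `setInterval`) and artifact capture to the runner. Time is injectable so the logic can be unit tested without real delays:
+
+```typescript
+class RoomWatchdog {
+  private readonly lastBeat = new Map<string, number>();
+  private readonly timeoutMs: number;
+
+  constructor(timeoutMs = 60_000) {
+    this.timeoutMs = timeoutMs;
+  }
+
+  // Called via ctx.heartbeat(roomId); records the latest progress time.
+  heartbeat(roomId: string, now: number = Date.now()): void {
+    this.lastBeat.set(roomId, now);
+  }
+
+  // Stop tracking a room that finished or was aborted.
+  clear(roomId: string): void {
+    this.lastBeat.delete(roomId);
+  }
+
+  // Rooms whose last heartbeat is older than the timeout.
+  stalled(now: number = Date.now()): string[] {
+    return Array.from(this.lastBeat.entries())
+      .filter(([, at]) => now - at > this.timeoutMs)
+      .map(([roomId]) => roomId);
+  }
+}
+```
+
+The runner would call `stalled()` periodically and abort each returned room, then `clear()` it so the same room is not reported twice.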
+
+### Artifact capture on failure
+
+Captured per-room into `tests/soak/artifacts/<run-id>/<room-id>/`:
+- Screenshot of every context in the affected room
+- `page.content()` HTML snapshot per context
+- Last 200 console log messages per context (already captured by `GolfBot`)
+- Game state JSON from the state parser
+- Error stack trace
+- Scenario config snapshot
+
+Directory structure:
+```
+tests/soak/artifacts/
+  2026-04-10-populate-14.23.05/
+    run.log           # structured JSONL, full run
+    summary.json      # final stats
+    room-0/
+      screenshot-host.png
+      screenshot-joiner-1.png
+      page-host.html
+      console.txt
+      state.json
+      error.txt
+```
+
+The artifacts directory is gitignored. Runs older than 7 days are auto-pruned on startup.
+
+### Structured logging
+
+Single logger, JSON Lines to stdout, pretty mirror to the dashboard. Every log line carries `run_id`, `scenario`, `room` (when applicable), and `timestamp`. Grep-friendly and `jq`-friendly.
+
+### Graceful shutdown
+
+`SIGINT` / `SIGTERM` trigger shutdown via `AbortController`:
+1. Global `AbortSignal` flips to aborted
+2. Scenarios check `ctx.signal.aborted` in loops, finish current turn, exit cleanly
+3. Runner waits up to 10s for scenarios to unwind
+4. After 10s, force-closes all contexts + browser
+5. Writes final `summary.json` and prints results
+6. Exit codes: `0` = all rooms completed target games, `1` = any room failed, `2` = interrupted before completion
+
+Double Ctrl-C = immediate force exit.
+
+### Periodic health probes
+
+Every 30s during a run:
+- `GET /api/health` against the target server
+- Count of open browser contexts vs expected
+- Runner memory usage
+
+If `/api/health` fails 3 consecutive times, declare fatal error, capture artifacts, stop. This prevents staging outages from being misattributed to bot bugs.
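+
+The "3 consecutive failures" rule reduces to a small counter. A sketch, with `HealthTracker` as a hypothetical name; the actual `GET /api/health` call is left to the caller:
+
+```typescript
+class HealthTracker {
+  private consecutiveFailures = 0;
+  private readonly maxFailures: number;
+
+  constructor(maxFailures = 3) {
+    this.maxFailures = maxFailures;
+  }
+
+  // Record one probe result. Returns true when the run should be declared fatal.
+  record(ok: boolean): boolean {
+    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
+    return this.consecutiveFailures >= this.maxFailures;
+  }
+}
+```
+
+Resetting on any success means an intermittent blip never accumulates toward the fatal threshold.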
+
+### Retry policy
+
+Retry only at the session level, never at the scenario level.
+- WS drop → reconnect session, rejoin game if possible, 3 attempts max
+- Token rejected → re-login via cached password, 1 attempt
+- Click missed → existing `GolfBot` retry (already built in)
+
+Never retry: whole games, whole scenarios, fatal errors.
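+
+A bounded session-level retry could be sketched as below. `withRetries` is a hypothetical helper; the spec does not say whether there should be a delay between attempts, so none is assumed here:
+
+```typescript
+// Runs `action` up to `attempts` times and rethrows the last error if all fail.
+async function withRetries<T>(
+  attempts: number,
+  action: (attempt: number) => Promise<T>,
+): Promise<T> {
+  let lastError: unknown = new Error('attempts must be at least 1');
+  for (let attempt = 1; attempt <= attempts; attempt++) {
+    try {
+      return await action(attempt);
+    } catch (err) {
+      lastError = err;
+    }
+  }
+  throw lastError;
+}
+```
+
+A WS reconnect would use `withRetries(3, ...)` and a token re-login `withRetries(1, ...)`, matching the limits listed above.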
+
+### Cleanup guarantees
+
+Three cleanup points, all going through the same `cleanup()` function wrapped in top-level `try/finally`:
+1. **Success** — close contexts, close browsers, flush logs, write summary
+2. **Exception** — capture artifacts first, then close contexts, flush logs, write partial summary
+3. **Signal interrupt** — graceful shutdown as above, best-effort artifact capture
+
+## File layout
+
+```
+tests/soak/
+├── package.json              # standalone (separate from tests/e2e/)
+├── tsconfig.json
+├── README.md                 # quickstart + flag reference + bring-up steps
+├── .env.stresstest.example   # template (real file gitignored)
+│
+├── runner.ts                 # CLI entry — `npm run soak`
+├── config.ts                 # CLI parsing + defaults merging
+│
+├── core/
+│   ├── session-pool.ts
+│   ├── room-coordinator.ts
+│   ├── screencaster.ts       # CDP attach/detach on demand
+│   ├── watchdog.ts
+│   ├── artifacts.ts
+│   ├── logger.ts
+│   └── types.ts              # Scenario, Session, ScenarioContext interfaces
+│
+├── scenarios/
+│   ├── populate.ts
+│   ├── stress.ts
+│   └── index.ts              # name → module registry
+│
+├── dashboard/
+│   ├── server.ts             # http + ws
+│   ├── index.html
+│   ├── dashboard.css
+│   └── dashboard.js
+│
+├── scripts/
+│   ├── seed-accounts.ts      # one-shot seeding
+│   ├── reset-accounts.ts     # future: wipe test account stats
+│   └── smoke.sh              # bring-up validation
+│
+└── artifacts/                # gitignored, auto-pruned 7d
+    └── <run-id>/...
+```
+
+## Dependencies
+
+New `tests/soak/package.json`:
+
+```json
+{
+  "name": "golf-soak",
+  "private": true,
+  "scripts": {
+    "soak": "tsx runner.ts",
+    "soak:populate": "tsx runner.ts --scenario=populate",
+    "soak:stress": "tsx runner.ts --scenario=stress",
+    "seed": "tsx scripts/seed-accounts.ts",
+    "smoke": "scripts/smoke.sh"
+  },
+  "dependencies": {
+    "playwright-core": "^1.40.0",
+    "ws": "^8.16.0"
+  },
+  "devDependencies": {
+    "tsx": "^4.7.0",
+    "@types/ws": "^8.5.0",
+    "@types/node": "^20.10.0",
+    "typescript": "^5.3.0"
+  }
+}
+```
+
+Two runtime deps: `playwright-core` (already in `tests/e2e/`) and `ws` (WebSocket for the dashboard). `tsx` is dev-only and runs TypeScript directly. No HTTP framework, no bundler, no build step.
+
+## CLI flags
+
+```
+--scenario=populate|stress    required
+--accounts=<n>                total sessions (default: scenario.needs.accounts)
+--rooms=<n>                   default from scenario.needs
+--cpus-per-room=<n>           default from scenario.needs
+--games-per-room=<n>          default from scenario.defaultConfig
+--holes=<n>                   default from scenario.defaultConfig
+--watch=none|dashboard|tiled  default: dashboard
+--dashboard-port=<n>          default: 7777
+--target=<url>                default: TEST_URL env or http://localhost:8000
+--run-id=<string>             default: ISO timestamp
+--list                        print available scenarios and exit
+--dry-run                     validate config without running
+```
+
+Derived: `accounts-per-room = accounts / rooms`. Must divide evenly; runner errors out with a clear message if not.
+
+Config precedence: CLI flags → environment variables → scenario `defaultConfig` → runner defaults.
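+
+The precedence chain and the divisibility check could look roughly like this. It is a sketch under assumptions: `resolveConfig`, `accountsPerRoom`, and the flat `Config` record are hypothetical, and real flag parsing is omitted:
+
+```typescript
+type Config = Record<string, string | number | undefined>;
+
+function resolveConfig(
+  cli: Config,
+  env: Config,
+  scenarioDefaults: Config,
+  runnerDefaults: Config,
+): Config {
+  const merged: Config = {};
+  // Apply the lowest-precedence layer first so later layers override it.
+  for (const layer of [runnerDefaults, scenarioDefaults, env, cli]) {
+    for (const [key, value] of Object.entries(layer)) {
+      if (value !== undefined) merged[key] = value;
+    }
+  }
+  return merged;
+}
+
+function accountsPerRoom(accounts: number, rooms: number): number {
+  if (!Number.isInteger(accounts) || !Number.isInteger(rooms) || rooms <= 0) {
+    throw new Error('--accounts and --rooms must be integers, with rooms > 0');
+  }
+  if (accounts % rooms !== 0) {
+    throw new Error(`--accounts=${accounts} is not divisible by --rooms=${rooms}`);
+  }
+  return accounts / rooms;
+}
+```
+
+Skipping `undefined` values matters: an unset CLI flag must fall through to the next layer rather than erase a value from it.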
+
+## Meta-testing
+
+### Unit tests (Vitest, minimal)
+
+- `room-coordinator.ts` — announce/await correctness, timeout behavior
+- `watchdog.ts` — fires on timeout, resets on heartbeat, cancels cleanly
+- `config.ts` — CLI precedence, required field validation
+
+### Bring-up smoke test (`tests/soak/scripts/smoke.sh`)
+
+Runs against local dev server with minimum viable config:
+```bash
+TEST_URL=http://localhost:8000 \
+  npm run soak -- \
+  --scenario=populate \
+  --accounts=2 \
+  --rooms=1 \
+  --cpus-per-room=0 \
+  --games-per-room=1 \
+  --holes=1 \
+  --watch=none
+```
+
+Exit 0 = full harness works end-to-end. ~30 seconds. Run after any change.
+
+### Manual validation checklist
+
+Documented in `tests/soak/CHECKLIST.md`:
+- [ ] Seed 16 accounts against staging using the invite code
+- [ ] `--scenario=populate --rooms=1 --games-per-room=1` completes cleanly
+- [ ] `--scenario=populate --rooms=4 --games-per-room=1` — 4 rooms in parallel, no cross-contamination
+- [ ] `--watch=dashboard` opens browser, grid renders, progress updates
+- [ ] Click a player tile → live video appears, Esc → stops
+- [ ] `--watch=tiled` opens 4 browser windows in 2×2 grid
+- [ ] Ctrl-C during a run → graceful shutdown, summary printed, exit 2
+- [ ] Kill the target server mid-run → runner detects, captures artifacts, exits 1
+- [ ] Stats query `?include_test=false` hides soak accounts, `?include_test=true` shows them
+- [ ] Full stress run (`--scenario=stress --games-per-room=10`) — no console errors, all rooms complete
+
+## Implementation order
+
+Sequenced so each step produces something demonstrable before moving on. The writing-plans skill will break this into concrete tasks.
+
+1. **Server-side changes** — schema alters, register flow, stats filter, admin badge. Independent, ships first, unblocks local testing.
+2. **Scaffold `tests/soak/`** — package.json, tsconfig, core/types, logger. No behavior yet.
+3. **`SessionPool` + `scripts/seed-accounts.ts`** — end-to-end auth: seed, cache, load, validate login.
+4. **`RoomCoordinator` + minimal `populate` scenario body** — proves multi-room orchestration.
+5. **`runner.ts`** — CLI, config merging, scenario loading, top-level error handling.
+6. **`--watch=none` works** — runs against local dev, produces clean logs, exits 0. First end-to-end milestone.
+7. **`--watch=dashboard` status grid** — HTML + WS + tile updates (no video yet).
+8. **CDP screencast / click-to-watch** — the live video feature.
+9. **`--watch=tiled` mode** — native windows via `page.evaluate(window.moveTo)`.
+10. **`stress` scenario** — chaos injection, rapid games.
+11. **Failure handling** — watchdog, artifact capture, graceful shutdown.
+12. **Smoke test script + CHECKLIST.md** — validation.
+13. **Run against staging for real** — populate scoreboard, hunt bugs, report findings.
+
+If step 6 takes longer than planned, steps 1–5 are still useful standalone.
+
+## Out of scope for MVP
+
+- Mobile viewport scenarios (future `mobile-populate`)
+- Reconnect-storm scenarios
+- Admin workflow scenarios
+- Concurrent scenario execution
+- Distributed runner
+- Grafana / OTEL / custom metrics push
+- Test account stat reset tooling
+- Auto-promoting stress findings into Playwright regression tests
+- New admin endpoints for account marking
+- Admin UI for flagging invites as test-seed
+
+All of these are cheap to add later because the scenario interface and session pool don't presuppose them.
+
+## Open questions (to resolve during implementation)
+
+1. **localStorage auth key** — exact keys used by `client/app.js` to persist the JWT and user blob; verified by reading the file during step 3.
+2. **Chaos event set for `stress` scenario** — finalize which chaos events are in scope for MVP vs added incrementally (start with rapid clicks + tab nav + `setOffline`, add more as the server proves robust).
+3. **CDP screencast frame rate tuning** — start at `everyNthFrame: 2` (~15fps), adjust down if bandwidth/CPU is excessive on long runs.
+4. **Screen bounds detection for `tiled` mode** — default to 1920×1080, expose override via `--tiled-bounds=WxH`; auto-detect later if useful.