From 97036be319361c48e8db4ce019482f7d104a0d3d Mon Sep 17 00:00:00 2001 From: adlee-was-taken Date: Fri, 10 Apr 2026 23:03:28 -0400 Subject: [PATCH] docs: multiplayer soak & UX test harness design Design for a standalone Playwright-based soak runner that drives 16 authenticated browser sessions across 4 concurrent rooms to populate staging scoreboards and hunt stability bugs. Architected as a pluggable scenario harness so future UX test scenarios (reconnect, invite flow, admin workflows, mobile) slot in cleanly. Also gitignores .superpowers/ (brainstorming session artifacts). Co-Authored-By: Claude Opus 4.6 (1M context) --- .gitignore | 1 + ...2026-04-10-multiplayer-soak-test-design.md | 638 ++++++++++++++++++ 2 files changed, 639 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md diff --git a/.gitignore b/.gitignore index ec412a1..27d83d8 100644 --- a/.gitignore +++ b/.gitignore @@ -214,6 +214,7 @@ cython_debug/ # Claude Code .claude/ +.superpowers/ # Virtualenv in project root bin/ diff --git a/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md b/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md new file mode 100644 index 0000000..ca363e0 --- /dev/null +++ b/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md @@ -0,0 +1,638 @@ +# Multiplayer Soak & UX Test Harness — Design + +**Date:** 2026-04-10 +**Status:** Design approved, pending implementation plan + +## Context + +Golf Card Game is a real-time multiplayer WebSocket application with event-sourced game state, a leaderboard system, and an aggressive animation pipeline. 
Current test coverage is: + +- `server/` — pytest unit/integration tests +- `tests/e2e/specs/` — Playwright tests exercising single-context flows (full game, stress with rapid clicks, visual regression, v3 features) + +What's missing: a way to exercise the system with **many concurrent authenticated users playing real multiplayer games** for long durations. We can't currently: + +1. Populate staging scoreboards with realistic game history for demos and visual verification +2. Hunt race conditions, WebSocket leaks, and room cleanup bugs under sustained concurrent load +3. Validate multiplayer UX end-to-end across rooms without manual coordination +4. Exercise authentication, room lifecycle, and stats aggregation as a cohesive system + +This spec defines a **multi-scenario soak and UX test harness**: a standalone Playwright-based runner that drives 16 authenticated browser sessions across 4 concurrent rooms playing many games against each other (plus optional CPU opponents). It starts as a soak tool with two scenarios (`populate`, `stress`) and grows into the project's general-purpose multi-user UX test platform. + +## Goals + +1. **Scoreboard population** — run long multi-round games against staging with varied CPU personalities to produce realistic scoreboard data +2. **Stability stress** — run rapid short games with chaos injection to surface race conditions and cleanup bugs +3. **Extensibility** — new scenarios (reconnect, invite flow, admin workflow, mobile) slot in without runner changes +4. **Watchability** — a dashboard mode with click-to-watch live video of any player, usable for demos and debugging +5. 
**Per-run isolation** — test account traffic must be cleanly separable from real user traffic in stats queries + +## Non-goals + +- Replacing the existing `tests/e2e/specs/` Playwright tests (they serve a different purpose — single-context edge cases) +- Distributed runner across multiple machines +- Concurrent scenario execution (one scenario per run for MVP) +- Grafana/OTEL integration +- Auto-promoting findings to regression tests + +## Constraints + +- **Staging auth gate** — staging runs `INVITE_ONLY=true`; seeding must go through the register endpoint with an invite code +- **Invite code `5VC2MCCN`** — provisioned with 16 uses, used once per test account on first-ever run, cached afterward +- **Per-IP rate limiting** — `DAILY_SIGNUPS_PER_IP=20` on prod, lower default elsewhere; seeding must stay within budget +- **Room idle cleanup** — `ROOM_IDLE_TIMEOUT_SECONDS=300` means the scenario must keep rooms active or tolerate cleanup cascades +- **Existing bot code** — `tests/e2e/bot/golf-bot.ts` already provides `createGame`, `joinGame`, `addCPU`, `playTurn`, `playGame`; the harness reuses it verbatim + +## Architecture + +### Module layout + +``` +runner.ts (entry) + ├─ SessionPool owns 16 BrowserContexts, seeds/logs in, allocates + ├─ Scenario pluggable interface, per-scenario file + ├─ RoomCoordinator host→joiners room-code handoff via Deferred + ├─ Dashboard (optional) HTTP + WS server, status grid + click-to-watch video + └─ GolfBot (reused) tests/e2e/bot/golf-bot.ts, unchanged +``` + +Default: one browser, 16 contexts (lowest RAM, fastest startup). `WATCH=tiled` is the exception — it launches two browsers, one headed (hosts) and one headless (joiners), because Chromium's headed/headless flag is browser-scoped, not context-scoped. See the `tiled` implementation detail below. + +### Location + +New sibling directory `tests/soak/` — does not modify `tests/e2e/`. Shares `GolfBot` via direct import from `../e2e/bot/`. 
+ +Rationale: Playwright Test is designed for short isolated tests. A single `test()` running 16 contexts for hours fights the test model (worker limits, all-or-nothing failure, single giant trace file). A standalone Node script gives first-class CLI flags, full control over the event loop, a clean home for the dashboard server, and reuses the `GolfBot` class unchanged. Existing `tests/e2e/specs/stress.spec.ts` stays as-is for single-context edge cases. + +## Components + +### SessionPool + +Owns the lifecycle of 16 authenticated `BrowserContext`s. + +**Responsibilities:** +- On first run: register 16 accounts via `POST /api/auth/register` with invite code `5VC2MCCN`, cache credentials to `.env.stresstest` +- On subsequent runs: read cached credentials, create contexts, inject auth into each (localStorage token, or re-login via cached password if token rejected) +- Expose `acquire({ count }): Promise<Session[]>` — scenarios request N authenticated sessions without caring how they got there +- On scenario completion: close all contexts cleanly + +**`Session` shape:** +```typescript +interface Session { + context: BrowserContext; + page: Page; + bot: GolfBot; + account: Account; // { username, password, token } + key: string; // stable identifier, e.g., "soak_07" +} +``` + +**`.env.stresstest` format** (gitignored, local-only, plaintext — this is a test tool): +``` +SOAK_ACCOUNT_00=soak_00_a7bx:Hunter2!xK9mQ:eyJhbGc... +SOAK_ACCOUNT_01=soak_01_c3pz:Kc82!wQm4Rt:eyJhbGc... +... +SOAK_ACCOUNT_15=soak_15_m9fy:Px7!eR4sTn2:eyJhbGc... +``` + +Line format: `username:password:token`. The password is kept so the pool can recover from token expiry automatically.
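
The `username:password:token` line format above can be read without a dotenv library. The following is a minimal sketch, not the planned `SessionPool` code: `parseAccountLine` and `parseEnvFile` are hypothetical names, and it assumes usernames never contain `:` and tokens are JWT-style strings without `:`, so the password is whatever sits between the first and last colon.

```typescript
// Hypothetical helpers for illustration only; names are not from the spec.
interface Account {
  username: string;
  password: string;
  token: string;
}

interface ParsedAccount {
  key: string; // e.g. "soak_07", mirroring Session.key
  account: Account;
}

// Parses one `SOAK_ACCOUNT_NN=username:password:token` line.
// Returns null for blank lines, comments, and malformed entries.
function parseAccountLine(line: string): ParsedAccount | null {
  const match = /^SOAK_ACCOUNT_(\d{2})=(.+)$/.exec(line.trim());
  if (!match) return null;
  const index = match[1] ?? "";
  const value = match[2] ?? "";

  // Split on the first and last colon so a ":" inside the password survives.
  const first = value.indexOf(":");
  const last = value.lastIndexOf(":");
  const malformed = first <= 0 || last === first || last === value.length - 1;
  if (malformed) return null;

  return {
    key: `soak_${index}`,
    account: {
      username: value.slice(0, first),
      password: value.slice(first + 1, last),
      token: value.slice(last + 1),
    },
  };
}

// Parses the whole file body into a map keyed by session key.
function parseEnvFile(text: string): Map<string, Account> {
  const accounts = new Map<string, Account>();
  for (const line of text.split(/\r?\n/)) {
    const parsed = parseAccountLine(line);
    if (parsed) accounts.set(parsed.key, parsed.account);
  }
  return accounts;
}
```

Splitting on the outermost colons rather than calling `split(":")` keeps generated passwords free to contain `:`. If usernames or tokens could ever contain that character, a different delimiter or a JSON value per line would be the safer choice.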
+ +### Scenario interface + +```typescript +export interface ScenarioNeeds { + accounts: number; + rooms?: number; + cpusPerRoom?: number; +} + +export interface ScenarioContext { + config: ScenarioConfig; // CLI flags merged with scenario defaults + sessions: Session[]; // pre-authenticated, pre-navigated + coordinator: RoomCoordinator; + dashboard: DashboardReporter; // no-op when watch mode doesn't use it + logger: Logger; + signal: AbortSignal; // graceful shutdown + heartbeat(roomId: string): void; // resets the per-room watchdog +} + +export interface ScenarioResult { + gamesCompleted: number; + errors: ScenarioError[]; + durationMs: number; + customMetrics?: Record<string, number>; +} + +export interface Scenario { + name: string; + description: string; + defaultConfig: ScenarioConfig; + needs: ScenarioNeeds; + run(ctx: ScenarioContext): Promise<ScenarioResult>; +} +``` + +Scenarios are plain objects exported as default from files in `tests/soak/scenarios/`. The runner discovers them via a registry (`scenarios/index.ts`) that maps name → module. No filesystem scanning, no magic. + +### RoomCoordinator + +~30 lines. Solves host→joiners room-code handoff: + +```typescript +class RoomCoordinator { + private rooms = new Map<string, Deferred<string>>(); + + announce(roomId: string, code: string) { this.get(roomId).resolve(code); } + async await(roomId: string): Promise<string> { return this.get(roomId).promise; } + private get(roomId: string) { + if (!this.rooms.has(roomId)) this.rooms.set(roomId, deferred()); + return this.rooms.get(roomId)!; + } +} +``` + +Usage: +```typescript +// Host +const code = await host.bot.createGame(host.account.username); +coordinator.announce('room-1', code); + +// Joiners (concurrent) +const code = await coordinator.await('room-1'); +await joiner.bot.joinGame(code, joiner.account.username); +``` + +No polling, no sleeps, no cross-page scraping. + +### Dashboard + +Optional — only instantiated when `WATCH=dashboard`. + +**Server side** (`dashboard/server.ts`): vanilla `http` + `ws` module.
Serves a single static HTML page, accepts WebSocket connections, relays messages between scenarios and the browser. + +**Client side** (`dashboard/index.html` + `dashboard.js`): 2×2 room grid, per-player tiles with live status (current player, score, held card, phase, moves), progress bars per hole, activity log at the bottom. No framework, ~300 lines total. + +**Click-to-watch**: clicking a player tile sends `start_stream(sessionKey)` over WS. The runner attaches a CDP session to that player's page via `context.newCDPSession(page)`, calls `Page.startScreencast` with `{format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2}`, and forwards each `Page.screencastFrame` event to the dashboard as `{ sessionKey, jpeg_b64 }`, acknowledging each frame with `Page.screencastFrameAck` so Chromium keeps sending them. The dashboard renders it into an `<img>` that swaps `src` on each frame. + +Returning to the grid sends `stop_stream(sessionKey)` and the runner detaches the CDP session. On WS disconnect, all active screencasts stop. This keeps CPU cost zero except while someone is actively watching. + +**`DashboardReporter` interface exposed to scenarios:** +```typescript +interface DashboardReporter { + update(roomId: string, state: Partial): void; + log(level: 'info'|'warn'|'error', msg: string, meta?: object): void; + incrementMetric(name: string, by?: number): void; +} +``` + +When `WATCH` is not `dashboard`, all three methods are no-ops; structured logs still go to stdout. + +### Runner + +`runner.ts` is the CLI entry point. Parses flags, resolves config precedence, launches browser(s), instantiates `SessionPool` + `RoomCoordinator` + (optional) `Dashboard`, loads the requested scenario by name, executes it, reports results, cleans up. + +## Scenarios + +### Scenario 1: `populate` + +**Goal:** produce realistic scoreboard data for staging demos.
+ +**Config:** +```typescript +{ + name: 'populate', + description: 'Long multi-round games to populate scoreboards', + needs: { accounts: 16, rooms: 4, cpusPerRoom: 1 }, + defaultConfig: { + gamesPerRoom: 10, + holes: 9, + decks: 2, + cpuPersonalities: ['Sofia', 'Marcus', 'Kenji', 'Priya'], + thinkTimeMs: [800, 2200], + interGamePauseMs: 3000, + }, +} +``` + +**Shape:** 4 rooms × 4 accounts + 1 CPU each. Each room runs `gamesPerRoom` sequential games. Inside a room: host creates game → joiners join → host adds CPU → host starts game → all sessions loop on `isMyTurn()` + `playTurn()` with randomized human-like think time between turns. Between games, rooms pause briefly to mimic natural pacing. + +### Scenario 2: `stress` + +**Goal:** hunt race conditions and stability bugs. + +**Config:** +```typescript +{ + name: 'stress', + description: 'Rapid short games for stability & race condition hunting', + needs: { accounts: 16, rooms: 4, cpusPerRoom: 2 }, + defaultConfig: { + gamesPerRoom: 50, + holes: 1, + decks: 1, + thinkTimeMs: [50, 150], + interGamePauseMs: 200, + chaosChance: 0.05, + }, +} +``` + +**Shape:** same as `populate` but tight loops, 1-hole games, and a chaos injector that fires with 5% probability per turn. Chaos events: +- Rapid concurrent clicks on multiple cards +- Random tab-navigation away and back +- Simultaneous click on card + discard button +- Brief WebSocket drop via Playwright's `context.setOffline()` followed by reconnect + +Each chaos event is logged with enough context to reproduce (room, player, turn, event type). 
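
The per-turn chaos roll described above can be kept separate from the Playwright calls that carry it out, which makes it unit-testable and, with a seeded generator, reproducible from the log. The sketch below is an illustration under assumptions: `ChaosKind`, `rollChaos`, and `seededRandom` are hypothetical names, the four kinds simply mirror the bullet list, and the seeded generator is a small mulberry32-style PRNG rather than anything the spec prescribes.

```typescript
// Hypothetical names throughout; the spec only fixes the 5% chance and the event list.
type ChaosKind = "rapid-clicks" | "tab-away" | "card-and-discard" | "offline-blip";

const CHAOS_KINDS: readonly ChaosKind[] = [
  "rapid-clicks",
  "tab-away",
  "card-and-discard",
  "offline-blip",
];

// Everything needed to reproduce the event later: room, player, turn, event type.
interface ChaosEvent {
  room: string;
  player: string;
  turn: number;
  kind: ChaosKind;
}

// Small seeded PRNG (mulberry32-style) returning floats in [0, 1).
// Using a seed instead of Math.random() lets a run be replayed.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Decides whether this turn gets a chaos event, and which one.
// Returns null on the (1 - chaosChance) share of turns that play normally.
function rollChaos(
  rng: () => number,
  chaosChance: number,
  where: { room: string; player: string; turn: number },
): ChaosEvent | null {
  if (rng() >= chaosChance) return null;
  const kind = CHAOS_KINDS[Math.floor(rng() * CHAOS_KINDS.length)] ?? "rapid-clicks";
  return { ...where, kind };
}
```

A scenario would call `rollChaos(rng, config.chaosChance, { room, player, turn })` once per turn, log any non-null result, and then dispatch on `kind` to the matching Playwright action (for example `context.setOffline(true)` followed by `context.setOffline(false)` for `offline-blip`).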
+ +### Future scenarios (not MVP, design anticipates them) + +- `reconnect` — 2 accounts, deliberate mid-game disconnect, verify recovery +- `invite-flow` — 0 accounts (fresh signups), exercise invite request → approval → first-game pipeline +- `admin-workflow` — 1 admin account, drive the admin panel +- `mobile-populate` — reuses `populate` with `devices['iPhone 13']` context options +- `replay-viewer` — watches completed games via the replay UI + +Each is a new file in `tests/soak/scenarios/`, zero runner changes. + +## Data flow + +### Cold start (first-ever run) + +1. Runner reads `.env.stresstest` → file missing +2. `SessionPool.seedAccounts()`: + - For `i` in `0..15`: `POST /api/auth/register` with `{ username, password, email, invite_code: '5VC2MCCN' }` + - Receive `{ user, token, expires_at }`, write to `.env.stresstest` +3. Server sets `is_test_account=true` automatically because the invite code has `marks_as_test=true` (see Server changes) +4. Runner proceeds to normal startup + +### Warm start (subsequent runs) + +1. Runner reads `.env.stresstest` → 16 entries +2. `SessionPool` creates 16 `BrowserContext`s +3. For each context: inject token into localStorage using the key the client app reads on load (resolved during implementation by inspecting `client/app.js`; see Open Questions) +4. Each session navigates to `/` and lands post-auth +5. If any token is rejected (401), pool silently re-logs in via cached password and refreshes the token in `.env.stresstest` + +### Seeding: explicit script vs automatic fallback + +Two paths to the same result, for flexibility: + +- **Preferred: explicit `npm run seed`** — runs `scripts/seed-accounts.ts` once during bring-up. Gives clear feedback, fails loudly on rate limits or network issues, lets you verify the accounts exist before a real run. +- **Fallback: auto-seed on cold start** — if `runner.ts` starts and `.env.stresstest` is missing, `SessionPool` invokes the same seeding logic transparently. 
Useful for CI or fresh clones where nobody ran the explicit step. + +Both paths share the same code in `core/session-pool.ts`; the script is a thin CLI wrapper around `SessionPool.seedAccounts()`. Documented in `tests/soak/README.md` with "run `npm run seed` first" as the happy path. + +### Room code handoff + +Host session calls `createGame` → receives room code → `coordinator.announce(roomId, code)`. Joiner sessions `await coordinator.await(roomId)` → receive code → call `joinGame`. All in-process, no polling. + +## Watch modes + +| Mode | Flag | Rendering | When to use | +|---|---|---|---| +| `none` | `WATCH=none` | Pure headless, JSONL stdout | CI, overnight unattended | +| `dashboard` | `WATCH=dashboard` *(default)* | HTML status grid + click-to-watch live video | Interactive runs, demos, debugging | +| `tiled` | `WATCH=tiled` | 4 native Chromium windows positioned 2×2 | Hands-on power-user debugging | + +### `tiled` implementation detail + +Two browsers launched: one headed (`headless: false, slowMo: 50`) for the 4 host contexts, one headless for the 12 joiner contexts. Host windows positioned via `page.evaluate(() => window.moveTo(x, y))` after load. Grid computed from screen size with a default of 1920×1080. + +## Server-side changes + +All changes are additive and fit the existing inline migration pattern in `server/stores/user_store.py`. + +### 1. Schema + +Two new columns + one partial index: + +```sql +ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS is_test_account BOOLEAN DEFAULT FALSE; +CREATE INDEX IF NOT EXISTS idx_users_v2_is_test_account ON users_v2(is_test_account) + WHERE is_test_account = TRUE; +ALTER TABLE invite_codes ADD COLUMN IF NOT EXISTS marks_as_test BOOLEAN DEFAULT FALSE; +``` + +Partial index because ~99% of rows will be `FALSE`; we only want to accelerate the "show test accounts" admin queries, not pay index-maintenance cost on every normal write. + +### 2. 
Register flow propagates the flag + +In `services/auth_service.py`, after resolving the invite code, read `marks_as_test` and pass through to `user_store.create_user`: + +```python +invite = await admin_service.get_invite_code(invite_code) +is_test = bool(invite and invite.marks_as_test) +user = await user_store.create_user( + username=..., password_hash=..., email=..., + is_test_account=is_test, +) +``` + +Users signing up without an invite or with a non-test invite are unaffected. + +### 3. One-time: flag `5VC2MCCN` as test-seed + +Executed once against staging (and any other environment the harness runs against): + +```sql +UPDATE invite_codes SET marks_as_test = TRUE WHERE code = '5VC2MCCN'; +``` + +Documented in the seeder script as a comment, and in `tests/soak/README.md` as a bring-up step. No admin UI for flagging invites as test-seed in MVP — add later if needed. + +### 4. Stats filtering + +Add `include_test: bool = False` parameter to stats queries in `services/stats_service.py`: + +```python +async def get_leaderboard(self, limit=50, include_test=False): + query = """ + SELECT ... FROM player_stats ps + JOIN users_v2 u ON u.id = ps.user_id + WHERE ($1 OR NOT u.is_test_account) + ORDER BY ps.total_points DESC + LIMIT $2 + """ + return await conn.fetch(query, include_test, limit) +``` + +Router in `routers/stats.py` exposes `include_test` as an optional query parameter. Default `False` — real users visiting the site never see soak traffic. Admin panel and debugging views pass `?include_test=true`. + +Same treatment for: +- `get_player_stats(user_id, include_test)` — gates individual profile lookups +- `get_recent_games(include_test)` — hides games where any participant is a test account by default + +### 5. 
Admin panel surfacing + +Small additions to `client/admin.html` + `client/admin.js`: +- User list: "Test" badge column for `is_test_account=true` rows +- Invite codes: "Test-seed" indicator next to `marks_as_test=true` codes +- Leaderboard + user list: "Include test accounts" toggle → passes `?include_test=true` + +### Out of scope (server-side) + +- New admin endpoint for marking existing accounts as test +- Admin UI for flagging invites as test-seed at creation time +- Separate "test stats only" aggregation (admins invert their mental filter) +- `test_only=true` query mode + +## Error handling + +### Failure taxonomy + +| Category | Example | Strategy | +|---|---|---| +| Recoverable game error | Animation flag stuck, click missed target | Log, continue, bot retries via existing `GolfBot` fallbacks | +| Recoverable session error | WS disconnect for one player, token expires | Reconnect session, rejoin game if possible, abort that room only if unrecoverable | +| Unrecoverable room error | Room stuck >60s, impossible state | Kill the room, capture artifacts, let other rooms continue | +| Fatal runner error | Staging unreachable, invite code exhausted, OOM | Stop everything cleanly, dump summary, exit non-zero | + +**Core principle: per-room isolation.** A failure in room 3 never unwinds rooms 1, 2, 4. Each room runs in its own `Promise.allSettled` branch. + +### Per-room watchdog + +Each room gets a watchdog that resets on every `ctx.heartbeat(roomId)` call. If a room hasn't heartbeat'd in 60s, the watchdog captures artifacts, aborts that room only, and the runner continues with the remaining rooms. + +Scenarios call `heartbeat` at each significant progress point (turn played, game started, game finished). The helper `DashboardReporter.update()` internally calls `heartbeat` as a convenience, so scenarios that use the dashboard reporter get watchdog resets for free. 
Scenarios that run with `WATCH=none` still need to call `heartbeat` explicitly at least once per 60s — a single call at the top of the per-turn loop is sufficient. + +### Artifact capture on failure + +Captured per-room into `tests/soak/artifacts/<run-id>/<room>/`: +- Screenshot of every context in the affected room +- `page.content()` HTML snapshot per context +- Last 200 console log messages per context (already captured by `GolfBot`) +- Game state JSON from the state parser +- Error stack trace +- Scenario config snapshot + +Directory structure: +``` +tests/soak/artifacts/ + 2026-04-10-populate-14.23.05/ + run.log # structured JSONL, full run + summary.json # final stats + room-0/ + screenshot-host.png + screenshot-joiner-1.png + page-host.html + console.txt + state.json + error.txt +``` + +The artifacts directory is gitignored. Runs older than 7 days are auto-pruned on startup. + +### Structured logging + +Single logger, JSON Lines to stdout, pretty mirror to the dashboard. Every log line carries `run_id`, `scenario`, `room` (when applicable), and `timestamp`. Grep-friendly and `jq`-friendly. + +### Graceful shutdown + +`SIGINT` / `SIGTERM` trigger shutdown via `AbortController`: +1. Global `AbortSignal` flips to aborted +2. Scenarios check `ctx.signal.aborted` in loops, finish current turn, exit cleanly +3. Runner waits up to 10s for scenarios to unwind +4. After 10s, force-closes all contexts + browser +5. Writes final `summary.json` and prints results +6. Exit codes: `0` = all rooms completed target games, `1` = any room failed, `2` = interrupted before completion + +Double Ctrl-C = immediate force exit. + +### Periodic health probes + +Every 30s during a run: +- `GET /api/health` against the target server +- Count of open browser contexts vs expected +- Runner memory usage + +If `/api/health` fails 3 consecutive times, declare fatal error, capture artifacts, stop. This prevents staging outages from being misattributed to bot bugs.
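
The "3 consecutive failures" rule is easiest to get right when the counting is separated from the HTTP call. The following is a sketch under assumptions, not the harness implementation: `HealthTracker` and `probeHealth` are hypothetical names, and the 5-second request timeout is an illustrative value that the spec does not specify.

```typescript
// Hypothetical names; only the 30s interval and the 3-strike rule come from the spec.
type ProbeVerdict = "healthy" | "degraded" | "fatal";

// Counts consecutive probe failures; any success resets the streak.
class HealthTracker {
  private consecutiveFailures = 0;
  private readonly maxConsecutiveFailures: number;

  constructor(maxConsecutiveFailures = 3) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  record(ok: boolean): ProbeVerdict {
    if (ok) {
      this.consecutiveFailures = 0;
      return "healthy";
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.maxConsecutiveFailures ? "fatal" : "degraded";
  }
}

// One probe against the target server. Network errors and timeouts count as
// failures instead of throwing, so the caller only ever sees a boolean.
// The 5000 ms timeout is an assumed value.
async function probeHealth(baseUrl: string, timeoutMs = 5000): Promise<boolean> {
  try {
    const response = await fetch(new URL("/api/health", baseUrl), {
      signal: AbortSignal.timeout(timeoutMs),
    });
    return response.ok;
  } catch {
    return false;
  }
}
```

In the runner this could be driven by a 30-second `setInterval` that calls `tracker.record(await probeHealth(target))` and triggers the fatal path (capture artifacts, stop) on the first `"fatal"` verdict. Because a single success resets the streak, a brief staging hiccup degrades the run without killing it.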
+ +### Retry policy + +Retry only at the session level, never at the scenario level. +- WS drop → reconnect session, rejoin game if possible, 3 attempts max +- Token rejected → re-login via cached password, 1 attempt +- Click missed → existing `GolfBot` retry (already built in) + +Never retry: whole games, whole scenarios, fatal errors. + +### Cleanup guarantees + +Three cleanup points, all going through the same `cleanup()` function wrapped in top-level `try/finally`: +1. **Success** — close contexts, close browsers, flush logs, write summary +2. **Exception** — capture artifacts first, then close contexts, flush logs, write partial summary +3. **Signal interrupt** — graceful shutdown as above, best-effort artifact capture + +## File layout + +``` +tests/soak/ +├── package.json # standalone (separate from tests/e2e/) +├── tsconfig.json +├── README.md # quickstart + flag reference + bring-up steps +├── .env.stresstest.example # template (real file gitignored) +│ +├── runner.ts # CLI entry — `npm run soak` +├── config.ts # CLI parsing + defaults merging +│ +├── core/ +│ ├── session-pool.ts +│ ├── room-coordinator.ts +│ ├── screencaster.ts # CDP attach/detach on demand +│ ├── watchdog.ts +│ ├── artifacts.ts +│ ├── logger.ts +│ └── types.ts # Scenario, Session, ScenarioContext interfaces +│ +├── scenarios/ +│ ├── populate.ts +│ ├── stress.ts +│ └── index.ts # name → module registry +│ +├── dashboard/ +│ ├── server.ts # http + ws +│ ├── index.html +│ ├── dashboard.css +│ └── dashboard.js +│ +├── scripts/ +│ ├── seed-accounts.ts # one-shot seeding +│ ├── reset-accounts.ts # future: wipe test account stats +│ └── smoke.sh # bring-up validation +│ +└── artifacts/ # gitignored, auto-pruned 7d + └── <run-id>/...
+``` + +## Dependencies + +New `tests/soak/package.json`: + +```json +{ + "name": "golf-soak", + "private": true, + "scripts": { + "soak": "tsx runner.ts", + "soak:populate": "tsx runner.ts --scenario=populate", + "soak:stress": "tsx runner.ts --scenario=stress", + "seed": "tsx scripts/seed-accounts.ts", + "smoke": "scripts/smoke.sh" + }, + "dependencies": { + "playwright-core": "^1.40.0", + "ws": "^8.16.0" + }, + "devDependencies": { + "tsx": "^4.7.0", + "@types/ws": "^8.5.0", + "@types/node": "^20.10.0", + "typescript": "^5.3.0" + } +} +``` + +Two runtime deps: `playwright-core` (already in `tests/e2e/`) and `ws` (WebSocket for dashboard); `tsx` is dev-only and runs TypeScript directly. No HTTP framework, no bundler, no build step. + +## CLI flags + +``` +--scenario=populate|stress required +--accounts=<n> total sessions (default: scenario.needs.accounts) +--rooms=<n> default from scenario.needs +--cpus-per-room=<n> default from scenario.needs +--games-per-room=<n> default from scenario.defaultConfig +--holes=<n> default from scenario.defaultConfig +--watch=none|dashboard|tiled default: dashboard +--dashboard-port=<port> default: 7777 +--target=<url> default: TEST_URL env or http://localhost:8000 +--run-id=<id> default: ISO timestamp +--list print available scenarios and exit +--dry-run validate config without running +``` + +Derived: `accounts-per-room = accounts / rooms`. Must divide evenly; runner errors out with a clear message if not. + +Config precedence: CLI flags → environment variables → scenario `defaultConfig` → runner defaults.
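
The precedence chain and the divisibility check above are small enough to sketch directly. This is an illustrative sketch, not the planned `config.ts`: `SoakConfig`, `resolveConfig`, and `accountsPerRoom` are hypothetical names, and the field list is a subset chosen for the example.

```typescript
// Hypothetical shapes and names; only the precedence order and the
// "must divide evenly" rule come from the spec.
interface SoakConfig {
  accounts?: number;
  rooms?: number;
  gamesPerRoom?: number;
  holes?: number;
  watch?: "none" | "dashboard" | "tiled";
}

// Merges config layers given from HIGHEST to LOWEST precedence:
// CLI flags, environment variables, scenario defaultConfig, runner defaults.
// A key left undefined in a higher layer falls through to the next one.
function resolveConfig(...layersHighToLow: SoakConfig[]): SoakConfig {
  const resolved: Record<string, unknown> = {};
  // Apply lowest precedence first so later (higher) layers overwrite.
  for (const layer of [...layersHighToLow].reverse()) {
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) resolved[key] = value;
    }
  }
  return resolved as SoakConfig;
}

// Derives accounts-per-room and fails with a clear message when the
// split is uneven, as the runner is expected to do.
function accountsPerRoom(accounts: number, rooms: number): number {
  if (!Number.isInteger(accounts) || !Number.isInteger(rooms) || accounts <= 0 || rooms <= 0) {
    throw new Error(`--accounts and --rooms must be positive integers (got ${accounts}, ${rooms})`);
  }
  if (accounts % rooms !== 0) {
    throw new Error(`--accounts=${accounts} does not divide evenly across --rooms=${rooms}`);
  }
  return accounts / rooms;
}
```

For example, `resolveConfig({ rooms: 2 }, {}, { rooms: 4, holes: 9 }, { watch: "dashboard" })` yields `rooms: 2` from the CLI layer, `holes: 9` from the scenario defaults, and `watch: "dashboard"` from the runner defaults. Skipping `undefined` values matters because a flag parser typically emits every known key, set or not.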
+ +## Meta-testing + +### Unit tests (Vitest, minimal) + +- `room-coordinator.ts` — announce/await correctness, timeout behavior +- `watchdog.ts` — fires on timeout, resets on heartbeat, cancels cleanly +- `config.ts` — CLI precedence, required field validation + +### Bring-up smoke test (`tests/soak/scripts/smoke.sh`) + +Runs against local dev server with minimum viable config: +```bash +TEST_URL=http://localhost:8000 \ + npm run soak -- \ + --scenario=populate \ + --accounts=2 \ + --rooms=1 \ + --cpus-per-room=0 \ + --games-per-room=1 \ + --holes=1 \ + --watch=none +``` + +Exit 0 = full harness works end-to-end. ~30 seconds. Run after any change. + +### Manual validation checklist + +Documented in `tests/soak/CHECKLIST.md`: +- [ ] Seed 16 accounts against staging using the invite code +- [ ] `--scenario=populate --rooms=1 --games-per-room=1` completes cleanly +- [ ] `--scenario=populate --rooms=4 --games-per-room=1` — 4 rooms in parallel, no cross-contamination +- [ ] `--watch=dashboard` opens browser, grid renders, progress updates +- [ ] Click a player tile → live video appears, Esc → stops +- [ ] `--watch=tiled` opens 4 browser windows in 2×2 grid +- [ ] Ctrl-C during a run → graceful shutdown, summary printed, exit 2 +- [ ] Kill the target server mid-run → runner detects, captures artifacts, exits 1 +- [ ] Stats query `?include_test=false` hides soak accounts, `?include_test=true` shows them +- [ ] Full stress run (`--scenario=stress --games-per-room=10`) — no console errors, all rooms complete + +## Implementation order + +Sequenced so each step produces something demonstrable before moving on. The writing-plans skill will break this into concrete tasks. + +1. **Server-side changes** — schema alters, register flow, stats filter, admin badge. Independent, ships first, unblocks local testing. +2. **Scaffold `tests/soak/`** — package.json, tsconfig, core/types, logger. No behavior yet. +3. 
**`SessionPool` + `scripts/seed-accounts.ts`** — end-to-end auth: seed, cache, load, validate login. +4. **`RoomCoordinator` + minimal `populate` scenario body** — proves multi-room orchestration. +5. **`runner.ts`** — CLI, config merging, scenario loading, top-level error handling. +6. **`--watch=none` works** — runs against local dev, produces clean logs, exits 0. First end-to-end milestone. +7. **`--watch=dashboard` status grid** — HTML + WS + tile updates (no video yet). +8. **CDP screencast / click-to-watch** — the live video feature. +9. **`--watch=tiled` mode** — native windows via `page.evaluate(window.moveTo)`. +10. **`stress` scenario** — chaos injection, rapid games. +11. **Failure handling** — watchdog, artifact capture, graceful shutdown. +12. **Smoke test script + CHECKLIST.md** — validation. +13. **Run against staging for real** — populate scoreboard, hunt bugs, report findings. + +If step 6 takes longer than planned, steps 1–5 are still useful standalone. + +## Out of scope for MVP + +- Mobile viewport scenarios (future `mobile-populate`) +- Reconnect-storm scenarios +- Admin workflow scenarios +- Concurrent scenario execution +- Distributed runner +- Grafana / OTEL / custom metrics push +- Test account stat reset tooling +- Auto-promoting stress findings into Playwright regression tests +- New admin endpoints for account marking +- Admin UI for flagging invites as test-seed + +All of these are cheap to add later because the scenario interface and session pool don't presuppose them. + +## Open questions (to resolve during implementation) + +1. **localStorage auth key** — exact keys used by `client/app.js` to persist the JWT and user blob; verified by reading the file during step 3. +2. **Chaos event set for `stress` scenario** — finalize which chaos events are in scope for MVP vs added incrementally (start with rapid clicks + tab nav + `setOffline`, add more as the server proves robust). +3. 
**CDP screencast frame rate tuning** — start at `everyNthFrame: 2` (~15fps), adjust down if bandwidth/CPU is excessive on long runs. +4. **Screen bounds detection for `tiled` mode** — default to 1920×1080, expose override via `--tiled-bounds=WxH`; auto-detect later if useful.
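
For open question 4, the arithmetic behind the 2×2 `tiled` layout is independent of how the windows are eventually moved. The sketch below only parses a `WxH` value and computes window rectangles. `parseBounds` and `tileGrid` are hypothetical names, and the sketch assumes windows tile edge to edge from the screen origin with no allowance for OS taskbars or window chrome.

```typescript
// Hypothetical helpers; the 1920×1080 default and the WxH flag format come from the spec.
interface Bounds {
  width: number;
  height: number;
}

interface Rect extends Bounds {
  x: number;
  y: number;
}

// Parses a `--tiled-bounds=WxH` value such as "2560x1440".
// Falls back to the 1920×1080 default when the flag is absent.
function parseBounds(flag?: string): Bounds {
  if (flag === undefined) return { width: 1920, height: 1080 };
  const match = /^(\d+)x(\d+)$/i.exec(flag.trim());
  if (!match) throw new Error(`--tiled-bounds expects WxH, got "${flag}"`);
  const width = Number(match[1]);
  const height = Number(match[2]);
  if (width === 0 || height === 0) throw new Error("--tiled-bounds dimensions must be non-zero");
  return { width, height };
}

// Splits the screen into cols × rows equal tiles, row by row,
// left to right. Sizes are floored so tiles never overlap.
function tileGrid(bounds: Bounds, cols = 2, rows = 2): Rect[] {
  const width = Math.floor(bounds.width / cols);
  const height = Math.floor(bounds.height / rows);
  const tiles: Rect[] = [];
  for (let row = 0; row < rows; row++) {
    for (let col = 0; col < cols; col++) {
      tiles.push({ x: col * width, y: row * height, width, height });
    }
  }
  return tiles;
}
```

Each of the four host windows would take one rectangle from `tileGrid(parseBounds(flagValue))`. Keeping this as a pure function means the later auto-detect option only has to supply different `Bounds`.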