golfgame/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md
adlee-was-taken 97036be319 docs: multiplayer soak & UX test harness design
Design for a standalone Playwright-based soak runner that drives 16
authenticated browser sessions across 4 concurrent rooms to populate
staging scoreboards and hunt stability bugs. Architected as a
pluggable scenario harness so future UX test scenarios (reconnect,
invite flow, admin workflows, mobile) slot in cleanly.

Also gitignores .superpowers/ (brainstorming session artifacts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:03:28 -04:00


# Multiplayer Soak & UX Test Harness — Design
**Date:** 2026-04-10
**Status:** Design approved, pending implementation plan
## Context
Golf Card Game is a real-time multiplayer WebSocket application with event-sourced game state, a leaderboard system, and an aggressive animation pipeline. Current test coverage is:
- `server/` — pytest unit/integration tests
- `tests/e2e/specs/` — Playwright tests exercising single-context flows (full game, stress with rapid clicks, visual regression, v3 features)
What's missing: a way to exercise the system with **many concurrent authenticated users playing real multiplayer games** for long durations. We can't currently:
1. Populate staging scoreboards with realistic game history for demos and visual verification
2. Hunt race conditions, WebSocket leaks, and room cleanup bugs under sustained concurrent load
3. Validate multiplayer UX end-to-end across rooms without manual coordination
4. Exercise authentication, room lifecycle, and stats aggregation as a cohesive system
This spec defines a **multi-scenario soak and UX test harness**: a standalone Playwright-based runner that drives 16 authenticated browser sessions across 4 concurrent rooms playing many games against each other (plus optional CPU opponents). It starts as a soak tool with two scenarios (`populate`, `stress`) and grows into the project's general-purpose multi-user UX test platform.
## Goals
1. **Scoreboard population** — run long multi-round games against staging with varied CPU personalities to produce realistic scoreboard data
2. **Stability stress** — run rapid short games with chaos injection to surface race conditions and cleanup bugs
3. **Extensibility** — new scenarios (reconnect, invite flow, admin workflow, mobile) slot in without runner changes
4. **Watchability** — a dashboard mode with click-to-watch live video of any player, usable for demos and debugging
5. **Per-run isolation** — test account traffic must be cleanly separable from real user traffic in stats queries
## Non-goals
- Replacing the existing `tests/e2e/specs/` Playwright tests (they serve a different purpose — single-context edge cases)
- Distributed runner across multiple machines
- Concurrent scenario execution (one scenario per run for MVP)
- Grafana/OTEL integration
- Auto-promoting findings to regression tests
## Constraints
- **Staging auth gate** — staging runs `INVITE_ONLY=true`; seeding must go through the register endpoint with an invite code
- **Invite code `5VC2MCCN`** — provisioned with 16 uses, used once per test account on first-ever run, cached afterward
- **Per-IP rate limiting** — `DAILY_SIGNUPS_PER_IP=20` on prod, lower default elsewhere; seeding must stay within budget
- **Room idle cleanup** — `ROOM_IDLE_TIMEOUT_SECONDS=300` means the scenario must keep rooms active or tolerate cleanup cascades
- **Existing bot code** — `tests/e2e/bot/golf-bot.ts` already provides `createGame`, `joinGame`, `addCPU`, `playTurn`, `playGame`; the harness reuses it verbatim
## Architecture
### Module layout
```
runner.ts (entry)
├─ SessionPool        owns 16 BrowserContexts, seeds/logs in, allocates
├─ Scenario           pluggable interface, per-scenario file
├─ RoomCoordinator    host→joiners room-code handoff via Deferred<string>
├─ Dashboard          (optional) HTTP + WS server, status grid + click-to-watch video
└─ GolfBot (reused)   tests/e2e/bot/golf-bot.ts, unchanged
```
Default: one browser, 16 contexts (lowest RAM, fastest startup). `WATCH=tiled` is the exception — it launches two browsers, one headed (hosts) and one headless (joiners), because Chromium's headed/headless flag is browser-scoped, not context-scoped. See the `tiled` implementation detail below.
### Location
New sibling directory `tests/soak/` — does not modify `tests/e2e/`. Shares `GolfBot` via direct import from `../e2e/bot/`.
Rationale: Playwright Test is designed for short isolated tests. A single `test()` running 16 contexts for hours fights the test model (worker limits, all-or-nothing failure, single giant trace file). A standalone node script gives first-class CLI flags, full control over the event loop, clean home for the dashboard server, and reuses the `GolfBot` class unchanged. Existing `tests/e2e/specs/stress.spec.ts` stays as-is for single-context edge cases.
## Components
### SessionPool
Owns the lifecycle of 16 authenticated `BrowserContext`s.
**Responsibilities:**
- On first run: register 16 accounts via `POST /api/auth/register` with invite code `5VC2MCCN`, cache credentials to `.env.stresstest`
- On subsequent runs: read cached credentials, create contexts, inject auth into each (localStorage token, or re-login via cached password if token rejected)
- Expose `acquire({ count }): Promise<Session[]>` — scenarios request N authenticated sessions without caring how they got there
- On scenario completion: close all contexts cleanly
**`Session` shape:**
```typescript
interface Session {
  context: BrowserContext;
  page: Page;
  bot: GolfBot;
  account: Account; // { username, password, token }
  key: string;      // stable identifier, e.g., "soak_07"
}
```
**`.env.stresstest` format** (gitignored, local-only, plaintext — this is a test tool):
```
SOAK_ACCOUNT_00=soak_00_a7bx:Hunter2!xK9mQ:eyJhbGc...
SOAK_ACCOUNT_01=soak_01_c3pz:Kc82!wQm4Rt:eyJhbGc...
...
SOAK_ACCOUNT_15=soak_15_m9fy:Px7!eR4sTn2:eyJhbGc...
```
Line format: `username:password:token`. Password kept so the pool can recover from token expiry automatically.
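A sketch of how the pool might parse one cached line back into the `Session.account` shape (`parseAccountLine` is an illustrative name, not existing code). JWTs are base64url segments joined by dots and never contain `:`, but a generated password could, so the split anchors on the first and last separator rather than splitting naively:

```typescript
interface Account {
  username: string;
  password: string;
  token: string;
}

// Parse one "username:password:token" line from .env.stresstest.
// The username takes everything before the first ':' and the token
// everything after the last ':', so a ':' inside the password is safe.
function parseAccountLine(line: string): Account {
  const first = line.indexOf(':');
  const last = line.lastIndexOf(':');
  if (first === -1 || last === first) {
    throw new Error(`malformed account line: ${line}`);
  }
  return {
    username: line.slice(0, first),
    password: line.slice(first + 1, last),
    token: line.slice(last + 1),
  };
}
```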
### Scenario interface
```typescript
export interface ScenarioNeeds {
  accounts: number;
  rooms?: number;
  cpusPerRoom?: number;
}

export interface ScenarioContext {
  config: ScenarioConfig;          // CLI flags merged with scenario defaults
  sessions: Session[];             // pre-authenticated, pre-navigated
  coordinator: RoomCoordinator;
  dashboard: DashboardReporter;    // no-op when watch mode doesn't use it
  logger: Logger;
  signal: AbortSignal;             // graceful shutdown
  heartbeat(roomId: string): void; // resets the per-room watchdog
}

export interface ScenarioResult {
  gamesCompleted: number;
  errors: ScenarioError[];
  durationMs: number;
  customMetrics?: Record<string, number>;
}

export interface Scenario {
  name: string;
  description: string;
  defaultConfig: ScenarioConfig;
  needs: ScenarioNeeds;
  run(ctx: ScenarioContext): Promise<ScenarioResult>;
}
```
Scenarios are plain objects exported as default from files in `tests/soak/scenarios/`. The runner discovers them via a registry (`scenarios/index.ts`) that maps name → module. No filesystem scanning, no magic.
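The registry can be as small as an object literal. A minimal sketch of `scenarios/index.ts` (inline stubs stand in here for the real `import populate from './populate'` entries; descriptions are illustrative):

```typescript
// Stand-in for the Scenario interface; the real one lives in core/types.ts.
interface ScenarioStub {
  name: string;
  description: string;
}

// Explicit name → module map: adding a scenario means adding one entry.
const scenarios: Record<string, ScenarioStub> = {
  populate: { name: 'populate', description: 'Long multi-round games' },
  stress: { name: 'stress', description: 'Rapid short games' },
};

// Resolve a --scenario flag, failing loudly with the list of known names.
function loadScenario(name: string): ScenarioStub {
  const scenario = scenarios[name];
  if (!scenario) {
    const known = Object.keys(scenarios).join(', ');
    throw new Error(`unknown scenario "${name}" (available: ${known})`);
  }
  return scenario;
}
```

The same map also powers `--list` for free: print `Object.values(scenarios)` and exit.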
### RoomCoordinator
~30 lines. Solves host→joiners room-code handoff:
```typescript
class RoomCoordinator {
  private rooms = new Map<string, Deferred<string>>();

  announce(roomId: string, code: string) { this.get(roomId).resolve(code); }

  async await(roomId: string): Promise<string> { return this.get(roomId).promise; }

  private get(roomId: string) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, deferred());
    return this.rooms.get(roomId)!;
  }
}
```
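The `Deferred`/`deferred()` helper the coordinator leans on is not shown above; a minimal sketch — a promise whose settle handles are exposed so the host session can resolve it later:

```typescript
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (err: Error) => void;
}

// Capture the executor's resolve/reject so callers can settle externally.
function deferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (err: Error) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}
```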
Usage:
```typescript
// Host
const code = await host.bot.createGame(host.account.username);
coordinator.announce('room-1', code);

// Joiners (concurrent, in their own scope)
const code = await coordinator.await('room-1');
await joiner.bot.joinGame(code, joiner.account.username);
```
No polling, no sleeps, no cross-page scraping.
### Dashboard
Optional — only instantiated when `WATCH=dashboard`.
**Server side** (`dashboard/server.ts`): vanilla `http` + `ws` module. Serves a single static HTML page, accepts WebSocket connections, relays messages between scenarios and the browser.
**Client side** (`dashboard/index.html` + `dashboard.js`): 2×2 room grid, per-player tiles with live status (current player, score, held card, phase, moves), progress bars per hole, activity log at the bottom. No framework, ~300 lines total.
**Click-to-watch**: clicking a player tile sends `start_stream(sessionKey)` over WS. The runner attaches a CDP session to that player's page via `context.newCDPSession(page)`, calls `Page.startScreencast` with `{format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2}`, and forwards each `Page.screencastFrame` event to the dashboard as `{ sessionKey, jpeg_b64 }`. The dashboard renders it into an `<img>` that swaps `src` on each frame.
Returning to the grid sends `stop_stream(sessionKey)` and the runner detaches the CDP session. On WS disconnect, all active screencasts stop. This keeps CPU cost zero except while someone is actively watching.
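The relay can be sketched independently of Playwright by treating the CDP session as an injectable interface (`context.newCDPSession(page)` returns the real one). One detail the prose above omits: Chromium pauses the screencast until each frame is acknowledged with `Page.screencastFrameAck`, so the relay must ack every frame it forwards:

```typescript
// The slice of Playwright's CDPSession the relay actually needs.
interface CDPLike {
  send(method: string, params?: object): Promise<unknown>;
  on(event: string, handler: (params: any) => void): void;
}

interface Frame {
  sessionKey: string;
  jpeg_b64: string;
}

// Start streaming one player's page, forwarding each frame to the dashboard.
async function startScreencast(
  cdp: CDPLike,
  sessionKey: string,
  forward: (frame: Frame) => void,
): Promise<void> {
  cdp.on('Page.screencastFrame', (p: { data: string; sessionId: number }) => {
    forward({ sessionKey, jpeg_b64: p.data }); // data is already base64 JPEG
    // Without this ack, Chromium stops delivering further frames.
    void cdp.send('Page.screencastFrameAck', { sessionId: p.sessionId });
  });
  await cdp.send('Page.startScreencast', {
    format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2,
  });
}
```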
**`DashboardReporter` interface exposed to scenarios:**
```typescript
interface DashboardReporter {
  update(roomId: string, state: Partial<RoomState>): void;
  log(level: 'info' | 'warn' | 'error', msg: string, meta?: object): void;
  incrementMetric(name: string, by?: number): void;
}
```
When `WATCH` is not `dashboard`, all three methods are no-ops; structured logs still go to stdout.
### Runner
`runner.ts` is the CLI entry point. Parses flags, resolves config precedence, launches browser(s), instantiates `SessionPool` + `RoomCoordinator` + (optional) `Dashboard`, loads the requested scenario by name, executes it, reports results, cleans up.
## Scenarios
### Scenario 1: `populate`
**Goal:** produce realistic scoreboard data for staging demos.
**Config:**
```typescript
{
  name: 'populate',
  description: 'Long multi-round games to populate scoreboards',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 1 },
  defaultConfig: {
    gamesPerRoom: 10,
    holes: 9,
    decks: 2,
    cpuPersonalities: ['Sofia', 'Marcus', 'Kenji', 'Priya'],
    thinkTimeMs: [800, 2200],
    interGamePauseMs: 3000,
  },
}
```
**Shape:** 4 rooms × 4 accounts + 1 CPU each. Each room runs `gamesPerRoom` sequential games. Inside a room: host creates game → joiners join → host adds CPU → host starts game → all sessions loop on `isMyTurn()` + `playTurn()` with randomized human-like think time between turns. Between games, rooms pause briefly to mimic natural pacing.
### Scenario 2: `stress`
**Goal:** hunt race conditions and stability bugs.
**Config:**
```typescript
{
  name: 'stress',
  description: 'Rapid short games for stability & race condition hunting',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 2 },
  defaultConfig: {
    gamesPerRoom: 50,
    holes: 1,
    decks: 1,
    thinkTimeMs: [50, 150],
    interGamePauseMs: 200,
    chaosChance: 0.05,
  },
}
```
**Shape:** same as `populate` but tight loops, 1-hole games, and a chaos injector that fires with 5% probability per turn. Chaos events:
- Rapid concurrent clicks on multiple cards
- Random tab-navigation away and back
- Simultaneous click on card + discard button
- Brief WebSocket drop via Playwright's `context.setOffline()` followed by reconnect
Each chaos event is logged with enough context to reproduce (room, player, turn, event type).
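The injector's selection step is easy to keep deterministic under test by making the RNG injectable — a sketch (event names are shorthand for the list above; `pickChaos` is an illustrative name):

```typescript
type ChaosEvent = 'rapid_clicks' | 'tab_nav' | 'click_plus_discard' | 'ws_drop';

const CHAOS_EVENTS: ChaosEvent[] = [
  'rapid_clicks', 'tab_nav', 'click_plus_discard', 'ws_drop',
];

// Fire with probability `chaosChance` per turn; pick one event uniformly.
// `rng` defaults to Math.random but can be a seeded function in tests.
function pickChaos(
  chaosChance: number,
  rng: () => number = Math.random,
): ChaosEvent | null {
  if (rng() >= chaosChance) return null;
  return CHAOS_EVENTS[Math.floor(rng() * CHAOS_EVENTS.length)];
}
```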
### Future scenarios (not MVP, design anticipates them)
- `reconnect` — 2 accounts, deliberate mid-game disconnect, verify recovery
- `invite-flow` — 0 accounts (fresh signups), exercise invite request → approval → first-game pipeline
- `admin-workflow` — 1 admin account, drive the admin panel
- `mobile-populate` — reuses `populate` with `devices['iPhone 13']` context options
- `replay-viewer` — watches completed games via the replay UI
Each is a new file in `tests/soak/scenarios/`, zero runner changes.
## Data flow
### Cold start (first-ever run)
1. Runner reads `.env.stresstest` → file missing
2. `SessionPool.seedAccounts()`:
- For `i` in `0..15`: `POST /api/auth/register` with `{ username, password, email, invite_code: '5VC2MCCN' }`
- Receive `{ user, token, expires_at }`, write to `.env.stresstest`
3. Server sets `is_test_account=true` automatically because the invite code has `marks_as_test=true` (see Server changes)
4. Runner proceeds to normal startup
### Warm start (subsequent runs)
1. Runner reads `.env.stresstest` → 16 entries
2. `SessionPool` creates 16 `BrowserContext`s
3. For each context: inject token into localStorage using the key the client app reads on load (resolved during implementation by inspecting `client/app.js`; see Open Questions)
4. Each session navigates to `/` and lands post-auth
5. If any token is rejected (401), pool silently re-logs in via cached password and refreshes the token in `.env.stresstest`
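Step 3 can be sketched with Playwright's `addInitScript`, which runs in every new page before the app's own scripts. The storage key used below (`golf_auth_token`) is a **placeholder** — the real key is the Open Questions item resolved by reading `client/app.js`:

```typescript
// Narrow view of BrowserContext: the one method this step uses.
interface InitScriptHost {
  addInitScript<T>(fn: (arg: T) => void, arg: T): Promise<void>;
}

// Inject the cached JWT so the app boots straight into an authenticated
// session. `storageKey` must match whatever client/app.js reads on load.
async function injectAuth(
  ctx: InitScriptHost,
  storageKey: string,
  token: string,
): Promise<void> {
  await ctx.addInitScript(
    ({ key, value }) => {
      // Runs in the page, before any app script, on every navigation.
      localStorage.setItem(key, value);
    },
    { key: storageKey, value: token },
  );
}
```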
### Seeding: explicit script vs automatic fallback
Two paths to the same result, for flexibility:
- **Preferred: explicit `npm run seed`** — runs `scripts/seed-accounts.ts` once during bring-up. Gives clear feedback, fails loudly on rate limits or network issues, lets you verify the accounts exist before a real run.
- **Fallback: auto-seed on cold start** — if `runner.ts` starts and `.env.stresstest` is missing, `SessionPool` invokes the same seeding logic transparently. Useful for CI or fresh clones where nobody ran the explicit step.
Both paths share the same code in `core/session-pool.ts`; the script is a thin CLI wrapper around `SessionPool.seedAccounts()`. Documented in `tests/soak/README.md` with "run `npm run seed` first" as the happy path.
### Room code handoff
Host session calls `createGame` → receives room code → `coordinator.announce(roomId, code)`. Joiner sessions `await coordinator.await(roomId)` → receive code → call `joinGame`. All in-process, no polling.
## Watch modes
| Mode | Flag | Rendering | When to use |
|---|---|---|---|
| `none` | `WATCH=none` | Pure headless, JSONL stdout | CI, overnight unattended |
| `dashboard` | `WATCH=dashboard` *(default)* | HTML status grid + click-to-watch live video | Interactive runs, demos, debugging |
| `tiled` | `WATCH=tiled` | 4 native Chromium windows positioned 2×2 | Hands-on power-user debugging |
### `tiled` implementation detail
Two browsers launched: one headed (`headless: false, slowMo: 50`) for the 4 host contexts, one headless for the 12 joiner contexts. Host windows positioned via `page.evaluate(() => window.moveTo(x, y))` after load. Grid computed from screen size with a default of 1920×1080.
## Server-side changes
All changes are additive and fit the existing inline migration pattern in `server/stores/user_store.py`.
### 1. Schema
Two new columns + one partial index:
```sql
ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS is_test_account BOOLEAN DEFAULT FALSE;
CREATE INDEX IF NOT EXISTS idx_users_v2_is_test_account ON users_v2(is_test_account)
  WHERE is_test_account = TRUE;
ALTER TABLE invite_codes ADD COLUMN IF NOT EXISTS marks_as_test BOOLEAN DEFAULT FALSE;
```
Partial index because ~99% of rows will be `FALSE`; we only want to accelerate the "show test accounts" admin queries, not pay index-maintenance cost on every normal write.
### 2. Register flow propagates the flag
In `services/auth_service.py`, after resolving the invite code, read `marks_as_test` and pass through to `user_store.create_user`:
```python
invite = await admin_service.get_invite_code(invite_code)
is_test = bool(invite and invite.marks_as_test)
user = await user_store.create_user(
    username=..., password_hash=..., email=...,
    is_test_account=is_test,
)
```
Users signing up without an invite or with a non-test invite are unaffected.
### 3. One-time: flag `5VC2MCCN` as test-seed
Executed once against staging (and any other environment the harness runs against):
```sql
UPDATE invite_codes SET marks_as_test = TRUE WHERE code = '5VC2MCCN';
```
Documented in the seeder script as a comment, and in `tests/soak/README.md` as a bring-up step. No admin UI for flagging invites as test-seed in MVP — add later if needed.
### 4. Stats filtering
Add `include_test: bool = False` parameter to stats queries in `services/stats_service.py`:
```python
async def get_leaderboard(self, limit=50, include_test=False):
    query = """
        SELECT ... FROM player_stats ps
        JOIN users_v2 u ON u.id = ps.user_id
        WHERE ($1 OR NOT u.is_test_account)
        ORDER BY ps.total_points DESC
        LIMIT $2
    """
    return await conn.fetch(query, include_test, limit)
```
Router in `routers/stats.py` exposes `include_test` as an optional query parameter. Default `False` — real users visiting the site never see soak traffic. Admin panel and debugging views pass `?include_test=true`.
Same treatment for:
- `get_player_stats(user_id, include_test)` — gates individual profile lookups
- `get_recent_games(include_test)` — hides games where any participant is a test account by default
### 5. Admin panel surfacing
Small additions to `client/admin.html` + `client/admin.js`:
- User list: "Test" badge column for `is_test_account=true` rows
- Invite codes: "Test-seed" indicator next to `marks_as_test=true` codes
- Leaderboard + user list: "Include test accounts" toggle → passes `?include_test=true`
### Out of scope (server-side)
- New admin endpoint for marking existing accounts as test
- Admin UI for flagging invites as test-seed at creation time
- Separate "test stats only" aggregation (admins invert their mental filter)
- `test_only=true` query mode
## Error handling
### Failure taxonomy
| Category | Example | Strategy |
|---|---|---|
| Recoverable game error | Animation flag stuck, click missed target | Log, continue, bot retries via existing `GolfBot` fallbacks |
| Recoverable session error | WS disconnect for one player, token expires | Reconnect session, rejoin game if possible, abort that room only if unrecoverable |
| Unrecoverable room error | Room stuck >60s, impossible state | Kill the room, capture artifacts, let other rooms continue |
| Fatal runner error | Staging unreachable, invite code exhausted, OOM | Stop everything cleanly, dump summary, exit non-zero |
**Core principle: per-room isolation.** A failure in room 3 never unwinds rooms 1, 2, 4. Each room runs in its own `Promise.allSettled` branch.
### Per-room watchdog
Each room gets a watchdog that resets on every `ctx.heartbeat(roomId)` call. If a room goes 60s without a heartbeat, the watchdog captures artifacts, aborts that room only, and the runner continues with the remaining rooms.
Scenarios call `heartbeat` at each significant progress point (turn played, game started, game finished). The helper `DashboardReporter.update()` internally calls `heartbeat` as a convenience, so scenarios that use the dashboard reporter get watchdog resets for free. Scenarios that run with `WATCH=none` still need to call `heartbeat` explicitly at least once per 60s — a single call at the top of the per-turn loop is sufficient.
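A minimal watchdog sketch, assuming one timer per room (`RoomWatchdog` is an illustrative name for what `core/watchdog.ts` might export):

```typescript
class RoomWatchdog {
  private timers = new Map<string, ReturnType<typeof setTimeout>>();

  constructor(
    private timeoutMs: number,
    private onStall: (roomId: string) => void, // capture artifacts, abort room
  ) {}

  // Reset the room's timer; called from ctx.heartbeat(roomId).
  heartbeat(roomId: string): void {
    clearTimeout(this.timers.get(roomId));
    this.timers.set(
      roomId,
      setTimeout(() => this.onStall(roomId), this.timeoutMs),
    );
  }

  // Stop one room's timer, or all of them on shutdown.
  stop(roomId?: string): void {
    const ids = roomId ? [roomId] : [...this.timers.keys()];
    for (const id of ids) {
      clearTimeout(this.timers.get(id));
      this.timers.delete(id);
    }
  }
}
```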
### Artifact capture on failure
Captured per-room into `tests/soak/artifacts/<run-id>/<room-id>/`:
- Screenshot of every context in the affected room
- `page.content()` HTML snapshot per context
- Last 200 console log messages per context (already captured by `GolfBot`)
- Game state JSON from the state parser
- Error stack trace
- Scenario config snapshot
Directory structure:
```
tests/soak/artifacts/
  2026-04-10-populate-14.23.05/
    run.log          # structured JSONL, full run
    summary.json     # final stats
    room-0/
      screenshot-host.png
      screenshot-joiner-1.png
      page-host.html
      console.txt
      state.json
      error.txt
```
Artifacts directory is gitignored. Runs older than 7 days auto-pruned on startup.
### Structured logging
Single logger, JSON Lines to stdout, pretty mirror to the dashboard. Every log line carries `run_id`, `scenario`, `room` (when applicable), and `timestamp`. Grep-friendly and `jq`-friendly.
### Graceful shutdown
`SIGINT` / `SIGTERM` trigger shutdown via `AbortController`:
1. Global `AbortSignal` flips to aborted
2. Scenarios check `ctx.signal.aborted` in loops, finish current turn, exit cleanly
3. Runner waits up to 10s for scenarios to unwind
4. After 10s, force-closes all contexts + browser
5. Writes final `summary.json` and prints results
6. Exit codes: `0` = all rooms completed target games, `1` = any room failed, `2` = interrupted before completion
Double Ctrl-C = immediate force exit.
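The signal wiring might look like this (a sketch; `installShutdown` and `forceClose` are illustrative names, and exit code 2 follows the list above):

```typescript
// Wire SIGINT/SIGTERM to cooperative shutdown. The first signal aborts the
// shared AbortController; a second one force-exits immediately; if scenarios
// don't unwind within gracePeriodMs, forceClose() tears contexts down anyway.
function installShutdown(
  controller: AbortController,
  forceClose: () => void,
  gracePeriodMs = 10_000,
): void {
  let interrupted = false;
  const onSignal = () => {
    if (interrupted) process.exit(2); // double Ctrl-C: immediate force exit
    interrupted = true;
    controller.abort(); // scenarios see ctx.signal.aborted and finish the turn
    setTimeout(forceClose, gracePeriodMs).unref();
  };
  process.on('SIGINT', onSignal);
  process.on('SIGTERM', onSignal);
}
```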
### Periodic health probes
Every 30s during a run:
- `GET /api/health` against the target server
- Count of open browser contexts vs expected
- Runner memory usage
If `/api/health` fails 3 consecutive times, declare fatal error, capture artifacts, stop. This prevents staging outages from being misattributed to bot bugs.
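The "3 consecutive failures" rule is worth isolating as pure state so it can be unit-tested without a server — a sketch (`HealthTracker` is an illustrative name):

```typescript
// Track consecutive /api/health failures; any success resets the streak.
class HealthTracker {
  private failures = 0;

  constructor(private fatalAfter = 3) {}

  // Record one probe result; returns true when the run should be declared fatal.
  record(ok: boolean): boolean {
    this.failures = ok ? 0 : this.failures + 1;
    return this.failures >= this.fatalAfter;
  }
}
```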
### Retry policy
Retry only at the session level, never at the scenario level.
- WS drop → reconnect session, rejoin game if possible, 3 attempts max
- Token rejected → re-login via cached password, 1 attempt
- Click missed → existing `GolfBot` retry (already built in)
Never retry: whole games, whole scenarios, fatal errors.
### Cleanup guarantees
Three cleanup points, all going through the same `cleanup()` function wrapped in top-level `try/finally`:
1. **Success** — close contexts, close browsers, flush logs, write summary
2. **Exception** — capture artifacts first, then close contexts, flush logs, write partial summary
3. **Signal interrupt** — graceful shutdown as above, best-effort artifact capture
## File layout
```
tests/soak/
├── package.json # standalone (separate from tests/e2e/)
├── tsconfig.json
├── README.md # quickstart + flag reference + bring-up steps
├── .env.stresstest.example # template (real file gitignored)
├── runner.ts # CLI entry — `npm run soak`
├── config.ts # CLI parsing + defaults merging
├── core/
│ ├── session-pool.ts
│ ├── room-coordinator.ts
│ ├── screencaster.ts # CDP attach/detach on demand
│ ├── watchdog.ts
│ ├── artifacts.ts
│ ├── logger.ts
│ └── types.ts # Scenario, Session, ScenarioContext interfaces
├── scenarios/
│ ├── populate.ts
│ ├── stress.ts
│ └── index.ts # name → module registry
├── dashboard/
│ ├── server.ts # http + ws
│ ├── index.html
│ ├── dashboard.css
│ └── dashboard.js
├── scripts/
│ ├── seed-accounts.ts # one-shot seeding
│ ├── reset-accounts.ts # future: wipe test account stats
│ └── smoke.sh # bring-up validation
└── artifacts/ # gitignored, auto-pruned 7d
└── <run-id>/...
```
## Dependencies
New `tests/soak/package.json`:
```json
{
  "name": "golf-soak",
  "private": true,
  "scripts": {
    "soak": "tsx runner.ts",
    "soak:populate": "tsx runner.ts --scenario=populate",
    "soak:stress": "tsx runner.ts --scenario=stress",
    "seed": "tsx scripts/seed-accounts.ts",
    "smoke": "scripts/smoke.sh"
  },
  "dependencies": {
    "playwright-core": "^1.40.0",
    "ws": "^8.16.0"
  },
  "devDependencies": {
    "tsx": "^4.7.0",
    "@types/ws": "^8.5.0",
    "@types/node": "^20.10.0",
    "typescript": "^5.3.0"
  }
}
```
Two runtime deps: `playwright-core` (already in `tests/e2e/`) and `ws` (WebSocket server for the dashboard), plus `tsx` as a dev dependency to run TypeScript directly. No HTTP framework, no bundler, no build step.
## CLI flags
```
--scenario=populate|stress    required
--accounts=<n>                total sessions (default: scenario.needs.accounts)
--rooms=<n>                   default from scenario.needs
--cpus-per-room=<n>           default from scenario.needs
--games-per-room=<n>          default from scenario.defaultConfig
--holes=<n>                   default from scenario.defaultConfig
--watch=none|dashboard|tiled  default: dashboard
--dashboard-port=<n>          default: 7777
--target=<url>                default: TEST_URL env or http://localhost:8000
--run-id=<string>             default: ISO timestamp
--list                        print available scenarios and exit
--dry-run                     validate config without running
```
Derived: `accounts-per-room = accounts / rooms`. Must divide evenly; runner errors out with a clear message if not.
Config precedence: CLI flags → environment variables → scenario `defaultConfig` → runner defaults.
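The precedence chain reduces to a first-defined-wins merge with the highest-precedence layer listed first — a sketch (`resolveConfig` is an illustrative name for what `config.ts` might export):

```typescript
// Merge config layers: earlier layers win; later layers only fill keys the
// earlier ones left undefined. Call as:
//   resolveConfig(cliFlags, envVars, scenario.defaultConfig, runnerDefaults)
function resolveConfig<T extends object>(...layers: Array<Partial<T>>): T {
  const out: Record<string, unknown> = {};
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      if (out[key] === undefined && value !== undefined) out[key] = value;
    }
  }
  return out as T;
}
```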
## Meta-testing
### Unit tests (Vitest, minimal)
- `room-coordinator.ts` — announce/await correctness, timeout behavior
- `watchdog.ts` — fires on timeout, resets on heartbeat, cancels cleanly
- `config.ts` — CLI precedence, required field validation
### Bring-up smoke test (`tests/soak/scripts/smoke.sh`)
Runs against local dev server with minimum viable config:
```bash
TEST_URL=http://localhost:8000 \
npm run soak -- \
  --scenario=populate \
  --accounts=2 \
  --rooms=1 \
  --cpus-per-room=0 \
  --games-per-room=1 \
  --holes=1 \
  --watch=none
```
Exit 0 = full harness works end-to-end. ~30 seconds. Run after any change.
### Manual validation checklist
Documented in `tests/soak/CHECKLIST.md`:
- [ ] Seed 16 accounts against staging using the invite code
- [ ] `--scenario=populate --rooms=1 --games-per-room=1` completes cleanly
- [ ] `--scenario=populate --rooms=4 --games-per-room=1` — 4 rooms in parallel, no cross-contamination
- [ ] `--watch=dashboard` opens browser, grid renders, progress updates
- [ ] Click a player tile → live video appears, Esc → stops
- [ ] `--watch=tiled` opens 4 browser windows in 2×2 grid
- [ ] Ctrl-C during a run → graceful shutdown, summary printed, exit 2
- [ ] Kill the target server mid-run → runner detects, captures artifacts, exits 1
- [ ] Stats query `?include_test=false` hides soak accounts, `?include_test=true` shows them
- [ ] Full stress run (`--scenario=stress --games-per-room=10`) — no console errors, all rooms complete
## Implementation order
Sequenced so each step produces something demonstrable before moving on. The writing-plans skill will break this into concrete tasks.
1. **Server-side changes** — schema alters, register flow, stats filter, admin badge. Independent, ships first, unblocks local testing.
2. **Scaffold `tests/soak/`** — package.json, tsconfig, core/types, logger. No behavior yet.
3. **`SessionPool` + `scripts/seed-accounts.ts`** — end-to-end auth: seed, cache, load, validate login.
4. **`RoomCoordinator` + minimal `populate` scenario body** — proves multi-room orchestration.
5. **`runner.ts`** — CLI, config merging, scenario loading, top-level error handling.
6. **`--watch=none` works** — runs against local dev, produces clean logs, exits 0. First end-to-end milestone.
7. **`--watch=dashboard` status grid** — HTML + WS + tile updates (no video yet).
8. **CDP screencast / click-to-watch** — the live video feature.
9. **`--watch=tiled` mode** — native windows via `page.evaluate(window.moveTo)`.
10. **`stress` scenario** — chaos injection, rapid games.
11. **Failure handling** — watchdog, artifact capture, graceful shutdown.
12. **Smoke test script + CHECKLIST.md** — validation.
13. **Run against staging for real** — populate scoreboard, hunt bugs, report findings.
If step 6 takes longer than planned, steps 1–5 are still useful standalone.
## Out of scope for MVP
- Mobile viewport scenarios (future `mobile-populate`)
- Reconnect-storm scenarios
- Admin workflow scenarios
- Concurrent scenario execution
- Distributed runner
- Grafana / OTEL / custom metrics push
- Test account stat reset tooling
- Auto-promoting stress findings into Playwright regression tests
- New admin endpoints for account marking
- Admin UI for flagging invites as test-seed
All of these are cheap to add later because the scenario interface and session pool don't presuppose them.
## Open questions (to resolve during implementation)
1. **localStorage auth key** — exact keys used by `client/app.js` to persist the JWT and user blob; verified by reading the file during step 3.
2. **Chaos event set for `stress` scenario** — finalize which chaos events are in scope for MVP vs added incrementally (start with rapid clicks + tab nav + `setOffline`, add more as the server proves robust).
3. **CDP screencast frame rate tuning** — start at `everyNthFrame: 2` (~15fps), adjust down if bandwidth/CPU is excessive on long runs.
4. **Screen bounds detection for `tiled` mode** — default to 1920×1080, expose override via `--tiled-bounds=WxH`; auto-detect later if useful.