golfgame/docs/superpowers/specs/2026-04-10-multiplayer-soak-test-design.md
adlee-was-taken 97036be319 docs: multiplayer soak & UX test harness design
Design for a standalone Playwright-based soak runner that drives 16
authenticated browser sessions across 4 concurrent rooms to populate
staging scoreboards and hunt stability bugs. Architected as a
pluggable scenario harness so future UX test scenarios (reconnect,
invite flow, admin workflows, mobile) slot in cleanly.

Also gitignores .superpowers/ (brainstorming session artifacts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:03:28 -04:00


Multiplayer Soak & UX Test Harness — Design

Date: 2026-04-10
Status: Design approved, pending implementation plan

Context

Golf Card Game is a real-time multiplayer WebSocket application with event-sourced game state, a leaderboard system, and an aggressive animation pipeline. Current test coverage is:

  • server/ — pytest unit/integration tests
  • tests/e2e/specs/ — Playwright tests exercising single-context flows (full game, stress with rapid clicks, visual regression, v3 features)

What's missing: a way to exercise the system with many concurrent authenticated users playing real multiplayer games for long durations. We can't currently:

  1. Populate staging scoreboards with realistic game history for demos and visual verification
  2. Hunt race conditions, WebSocket leaks, and room cleanup bugs under sustained concurrent load
  3. Validate multiplayer UX end-to-end across rooms without manual coordination
  4. Exercise authentication, room lifecycle, and stats aggregation as a cohesive system

This spec defines a multi-scenario soak and UX test harness: a standalone Playwright-based runner that drives 16 authenticated browser sessions across 4 concurrent rooms playing many games against each other (plus optional CPU opponents). It starts as a soak tool with two scenarios (populate, stress) and grows into the project's general-purpose multi-user UX test platform.

Goals

  1. Scoreboard population — run long multi-round games against staging with varied CPU personalities to produce realistic scoreboard data
  2. Stability stress — run rapid short games with chaos injection to surface race conditions and cleanup bugs
  3. Extensibility — new scenarios (reconnect, invite flow, admin workflow, mobile) slot in without runner changes
  4. Watchability — a dashboard mode with click-to-watch live video of any player, usable for demos and debugging
  5. Per-run isolation — test account traffic must be cleanly separable from real user traffic in stats queries

Non-goals

  • Replacing the existing tests/e2e/specs/ Playwright tests (they serve a different purpose — single-context edge cases)
  • Distributed runner across multiple machines
  • Concurrent scenario execution (one scenario per run for MVP)
  • Grafana/OTEL integration
  • Auto-promoting findings to regression tests

Constraints

  • Staging auth gate — staging runs INVITE_ONLY=true; seeding must go through the register endpoint with an invite code
  • Invite code 5VC2MCCN — provisioned with 16 uses, used once per test account on first-ever run, cached afterward
  • Per-IP rate limiting — DAILY_SIGNUPS_PER_IP=20 on prod, lower default elsewhere; seeding must stay within budget
  • Room idle cleanup — ROOM_IDLE_TIMEOUT_SECONDS=300 means the scenario must keep rooms active or tolerate cleanup cascades
  • Existing bot code — tests/e2e/bot/golf-bot.ts already provides createGame, joinGame, addCPU, playTurn, playGame; the harness reuses it verbatim

Architecture

Module layout

runner.ts (entry)
  ├─ SessionPool          owns 16 BrowserContexts, seeds/logs in, allocates
  ├─ Scenario             pluggable interface, per-scenario file
  ├─ RoomCoordinator      host→joiners room-code handoff via Deferred<string>
  ├─ Dashboard (optional) HTTP + WS server, status grid + click-to-watch video
  └─ GolfBot (reused)     tests/e2e/bot/golf-bot.ts, unchanged

Default: one browser, 16 contexts (lowest RAM, fastest startup). WATCH=tiled is the exception — it launches two browsers, one headed (hosts) and one headless (joiners), because Chromium's headed/headless flag is browser-scoped, not context-scoped. See the tiled implementation detail below.

Location

New sibling directory tests/soak/ — does not modify tests/e2e/. Shares GolfBot via direct import from ../e2e/bot/.

Rationale: Playwright Test is designed for short isolated tests. A single test() running 16 contexts for hours fights the test model (worker limits, all-or-nothing failure, single giant trace file). A standalone node script gives first-class CLI flags, full control over the event loop, clean home for the dashboard server, and reuses the GolfBot class unchanged. Existing tests/e2e/specs/stress.spec.ts stays as-is for single-context edge cases.

Components

SessionPool

Owns the lifecycle of 16 authenticated BrowserContexts.

Responsibilities:

  • On first run: register 16 accounts via POST /api/auth/register with invite code 5VC2MCCN, cache credentials to .env.stresstest
  • On subsequent runs: read cached credentials, create contexts, inject auth into each (localStorage token, or re-login via cached password if token rejected)
  • Expose acquire({ count }): Promise<Session[]> — scenarios request N authenticated sessions without caring how they got there
  • On scenario completion: close all contexts cleanly

Session shape:

interface Session {
  context: BrowserContext;
  page: Page;
  bot: GolfBot;
  account: Account;  // { username, password, token }
  key: string;       // stable identifier, e.g., "soak_07"
}

.env.stresstest format (gitignored, local-only, plaintext — this is a test tool):

SOAK_ACCOUNT_00=soak_00_a7bx:Hunter2!xK9mQ:eyJhbGc...
SOAK_ACCOUNT_01=soak_01_c3pz:Kc82!wQm4Rt:eyJhbGc...
...
SOAK_ACCOUNT_15=soak_15_m9fy:Px7!eR4sTn2:eyJhbGc...

Line format: username:password:token. Password kept so the pool can recover from token expiry automatically.
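A minimal sketch of parsing one such line. The helper name parseAccountLine is hypothetical; the key assumption (stated above) is that the username and the JWT contain no colons, so splitting on the first and last colon lets the password keep any colons it happens to contain.

```typescript
// Hypothetical parser for one SOAK_ACCOUNT_* value in .env.stresstest.
// Assumes the username:password:token layout described above; usernames
// and JWTs contain no colons, so only the password may.
interface Account {
  username: string;
  password: string;
  token: string;
}

function parseAccountLine(line: string): Account {
  const first = line.indexOf(":");
  const last = line.lastIndexOf(":");
  if (first === -1 || last === first) {
    throw new Error(`malformed account line: ${line}`);
  }
  return {
    username: line.slice(0, first),
    password: line.slice(first + 1, last), // may itself contain colons
    token: line.slice(last + 1),
  };
}
```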

Scenario interface

export interface ScenarioNeeds {
  accounts: number;
  rooms?: number;
  cpusPerRoom?: number;
}

export interface ScenarioContext {
  config: ScenarioConfig;        // CLI flags merged with scenario defaults
  sessions: Session[];           // pre-authenticated, pre-navigated
  coordinator: RoomCoordinator;
  dashboard: DashboardReporter;  // no-op when watch mode doesn't use it
  logger: Logger;
  signal: AbortSignal;           // graceful shutdown
  heartbeat(roomId: string): void;  // resets the per-room watchdog
}

export interface ScenarioResult {
  gamesCompleted: number;
  errors: ScenarioError[];
  durationMs: number;
  customMetrics?: Record<string, number>;
}

export interface Scenario {
  name: string;
  description: string;
  defaultConfig: ScenarioConfig;
  needs: ScenarioNeeds;
  run(ctx: ScenarioContext): Promise<ScenarioResult>;
}

Scenarios are plain objects exported as default from files in tests/soak/scenarios/. The runner discovers them via a registry (scenarios/index.ts) that maps name → module. No filesystem scanning, no magic.
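The registry can be sketched as follows. The Scenario type and the two scenario stubs are inlined here so the example is self-contained; the real scenarios/index.ts would import them from ../core/types, ./populate, and ./stress.

```typescript
// Sketch of scenarios/index.ts: a plain name → module map, no filesystem
// scanning. Scenario is reduced to two fields here for brevity.
interface Scenario {
  name: string;
  description: string;
}

const populate: Scenario = { name: "populate", description: "Long multi-round games" };
const stress: Scenario = { name: "stress", description: "Rapid short games" };

const registry: Record<string, Scenario> = { populate, stress };

export function loadScenario(name: string): Scenario {
  const scenario = registry[name];
  if (!scenario) {
    const known = Object.keys(registry).join(", ");
    throw new Error(`Unknown scenario "${name}". Available: ${known}`);
  }
  return scenario;
}

// Backs the --list flag.
export function listScenarios(): string[] {
  return Object.keys(registry);
}
```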

RoomCoordinator

~30 lines. Solves host→joiners room-code handoff:

class RoomCoordinator {
  private rooms = new Map<string, Deferred<string>>();

  announce(roomId: string, code: string) { this.get(roomId).resolve(code); }
  async await(roomId: string): Promise<string> { return this.get(roomId).promise; }
  private get(roomId: string) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, deferred());
    return this.rooms.get(roomId)!;
  }
}

Usage:

// Host
const code = await host.bot.createGame(host.account.username);
coordinator.announce('room-1', code);

// Joiners (concurrent)
const code = await coordinator.await('room-1');
await joiner.bot.joinGame(code, joiner.account.username);

No polling, no sleeps, no cross-page scraping.
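The coordinator above leans on a small deferred() helper — a promise whose resolve function is exposed so one session can settle it for another. A minimal sketch:

```typescript
// Promise with externally exposed resolve/reject. Because the coordinator's
// get() lazily creates the entry, announce-before-await and
// await-before-announce both work.
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (err: Error) => void;
}

function deferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (err: Error) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}
```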

Dashboard

Optional — only instantiated when WATCH=dashboard.

Server side (dashboard/server.ts): vanilla http + ws module. Serves a single static HTML page, accepts WebSocket connections, relays messages between scenarios and the browser.

Client side (dashboard/index.html + dashboard.js): 2×2 room grid, per-player tiles with live status (current player, score, held card, phase, moves), progress bars per hole, activity log at the bottom. No framework, ~300 lines total.

Click-to-watch: clicking a player tile sends start_stream(sessionKey) over WS. The runner attaches a CDP session to that player's page via context.newCDPSession(page), calls Page.startScreencast with {format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2}, and forwards each Page.screencastFrame event to the dashboard as { sessionKey, jpeg_b64 }. The dashboard renders it into an <img> that swaps src on each frame.

Returning to the grid sends stop_stream(sessionKey) and the runner detaches the CDP session. On WS disconnect, all active screencasts stop. This keeps CPU cost zero except while someone is actively watching.

DashboardReporter interface exposed to scenarios:

interface DashboardReporter {
  update(roomId: string, state: Partial<RoomState>): void;
  log(level: 'info'|'warn'|'error', msg: string, meta?: object): void;
  incrementMetric(name: string, by?: number): void;
}

When WATCH is not dashboard, all three methods are no-ops; structured logs still go to stdout.
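One possible shape for that no-op reporter — keeping metric counts in memory for the final summary is an assumption on my part, not something the spec mandates:

```typescript
interface RoomState { currentPlayer?: string; score?: number; phase?: string; }

interface DashboardReporter {
  update(roomId: string, state: Partial<RoomState>): void;
  log(level: "info" | "warn" | "error", msg: string, meta?: object): void;
  incrementMetric(name: string, by?: number): void;
}

// Reporter used when WATCH !== dashboard: grid updates vanish, structured
// logs still reach stdout, and metrics accumulate for summary.json.
function createNoopReporter(): DashboardReporter & { metrics: Record<string, number> } {
  const metrics: Record<string, number> = {};
  return {
    metrics,
    update() {}, // no grid to update
    log(level, msg, meta) {
      console.log(JSON.stringify({ level, msg, ...meta }));
    },
    incrementMetric(name, by = 1) {
      metrics[name] = (metrics[name] ?? 0) + by;
    },
  };
}
```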

Runner

runner.ts is the CLI entry point. Parses flags, resolves config precedence, launches browser(s), instantiates SessionPool + RoomCoordinator + (optional) Dashboard, loads the requested scenario by name, executes it, reports results, cleans up.

Scenarios

Scenario 1: populate

Goal: produce realistic scoreboard data for staging demos.

Config:

{
  name: 'populate',
  description: 'Long multi-round games to populate scoreboards',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 1 },
  defaultConfig: {
    gamesPerRoom: 10,
    holes: 9,
    decks: 2,
    cpuPersonalities: ['Sofia', 'Marcus', 'Kenji', 'Priya'],
    thinkTimeMs: [800, 2200],
    interGamePauseMs: 3000,
  },
}

Shape: 4 rooms × 4 accounts + 1 CPU each. Each room runs gamesPerRoom sequential games. Inside a room: host creates game → joiners join → host adds CPU → host starts game → all sessions loop on isMyTurn() + playTurn() with randomized human-like think time between turns. Between games, rooms pause briefly to mimic natural pacing.
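The per-session turn loop described above can be sketched like this. TurnBot is a hypothetical slice of GolfBot — isMyTurn and playTurn exist per the Constraints section; gameOver is assumed here purely for illustration:

```typescript
interface TurnBot {
  isMyTurn(): Promise<boolean>;
  playTurn(): Promise<void>;
  gameOver(): Promise<boolean>; // assumed helper, not confirmed GolfBot API
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const randBetween = ([lo, hi]: [number, number]) => lo + Math.random() * (hi - lo);

// Loop one session through a game: wait for our turn, pause a human-like
// randomized interval, play, repeat until the game ends or shutdown begins.
async function playGameLoop(
  bot: TurnBot,
  thinkTimeMs: [number, number],
  signal: AbortSignal,
): Promise<number> {
  let turns = 0;
  while (!signal.aborted && !(await bot.gameOver())) {
    if (await bot.isMyTurn()) {
      await sleep(randBetween(thinkTimeMs)); // human-like think time
      await bot.playTurn();
      turns++;
    } else {
      await sleep(100); // cheap poll; the real GolfBot may expose an event instead
    }
  }
  return turns;
}
```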

Scenario 2: stress

Goal: hunt race conditions and stability bugs.

Config:

{
  name: 'stress',
  description: 'Rapid short games for stability & race condition hunting',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 2 },
  defaultConfig: {
    gamesPerRoom: 50,
    holes: 1,
    decks: 1,
    thinkTimeMs: [50, 150],
    interGamePauseMs: 200,
    chaosChance: 0.05,
  },
}

Shape: same as populate but tight loops, 1-hole games, and a chaos injector that fires with 5% probability per turn. Chaos events:

  • Rapid concurrent clicks on multiple cards
  • Random tab-navigation away and back
  • Simultaneous click on card + discard button
  • Brief WebSocket drop via Playwright's context.setOffline() followed by reconnect

Each chaos event is logged with enough context to reproduce (room, player, turn, event type).
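A sketch of the 5%-per-turn chaos gate. The rng parameter is injectable so the choice is deterministic under test; the event names are shorthand labels for the four events listed above, not identifiers from the codebase:

```typescript
type ChaosEvent = "rapid_clicks" | "tab_nav" | "click_race" | "ws_drop";

const CHAOS_EVENTS: ChaosEvent[] = ["rapid_clicks", "tab_nav", "click_race", "ws_drop"];

// Returns null most turns; with probability `chance`, picks one chaos event
// uniformly. The caller logs room/player/turn/event before executing it.
function maybePickChaos(
  chance: number,
  rng: () => number = Math.random,
): ChaosEvent | null {
  if (rng() >= chance) return null;
  const i = Math.floor(rng() * CHAOS_EVENTS.length);
  return CHAOS_EVENTS[i];
}
```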

Future scenarios (not MVP, design anticipates them)

  • reconnect — 2 accounts, deliberate mid-game disconnect, verify recovery
  • invite-flow — 0 accounts (fresh signups), exercise invite request → approval → first-game pipeline
  • admin-workflow — 1 admin account, drive the admin panel
  • mobile-populate — reuses populate with devices['iPhone 13'] context options
  • replay-viewer — watches completed games via the replay UI

Each is a new file in tests/soak/scenarios/, zero runner changes.

Data flow

Cold start (first-ever run)

  1. Runner reads .env.stresstest → file missing
  2. SessionPool.seedAccounts():
    • For i in 0..15: POST /api/auth/register with { username, password, email, invite_code: '5VC2MCCN' }
    • Receive { user, token, expires_at }, write to .env.stresstest
  3. Server sets is_test_account=true automatically because the invite code has marks_as_test=true (see Server changes)
  4. Runner proceeds to normal startup

Warm start (subsequent runs)

  1. Runner reads .env.stresstest → 16 entries
  2. SessionPool creates 16 BrowserContexts
  3. For each context: inject token into localStorage using the key the client app reads on load (resolved during implementation by inspecting client/app.js; see Open Questions)
  4. Each session navigates to / and lands post-auth
  5. If any token is rejected (401), pool silently re-logs in via cached password and refreshes the token in .env.stresstest

Seeding: explicit script vs automatic fallback

Two paths to the same result, for flexibility:

  • Preferred: explicit npm run seed — runs scripts/seed-accounts.ts once during bring-up. Gives clear feedback, fails loudly on rate limits or network issues, lets you verify the accounts exist before a real run.
  • Fallback: auto-seed on cold start — if runner.ts starts and .env.stresstest is missing, SessionPool invokes the same seeding logic transparently. Useful for CI or fresh clones where nobody ran the explicit step.

Both paths share the same code in core/session-pool.ts; the script is a thin CLI wrapper around SessionPool.seedAccounts(). Documented in tests/soak/README.md with "run npm run seed first" as the happy path.

Room code handoff

Host session calls createGame → receives room code → coordinator.announce(roomId, code). Joiner sessions await coordinator.await(roomId) → receive code → call joinGame. All in-process, no polling.

Watch modes

| Mode | Flag | Rendering | When to use |
| --- | --- | --- | --- |
| none | WATCH=none | Pure headless, JSONL stdout | CI, overnight unattended |
| dashboard | WATCH=dashboard (default) | HTML status grid + click-to-watch live video | Interactive runs, demos, debugging |
| tiled | WATCH=tiled | 4 native Chromium windows positioned 2×2 | Hands-on power-user debugging |

tiled implementation detail

Two browsers launched: one headed (headless: false, slowMo: 50) for the 4 host contexts, one headless for the 12 joiner contexts. Host windows positioned via page.evaluate(() => window.moveTo(x, y)) after load. Grid computed from screen size with a default of 1920×1080.

Server-side changes

All changes are additive and fit the existing inline migration pattern in server/stores/user_store.py.

1. Schema

Two new columns + one partial index:

ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS is_test_account BOOLEAN DEFAULT FALSE;
CREATE INDEX IF NOT EXISTS idx_users_v2_is_test_account ON users_v2(is_test_account)
  WHERE is_test_account = TRUE;
ALTER TABLE invite_codes ADD COLUMN IF NOT EXISTS marks_as_test BOOLEAN DEFAULT FALSE;

Partial index because ~99% of rows will be FALSE; we only want to accelerate the "show test accounts" admin queries, not pay index-maintenance cost on every normal write.

2. Register flow propagates the flag

In services/auth_service.py, after resolving the invite code, read marks_as_test and pass through to user_store.create_user:

invite = await admin_service.get_invite_code(invite_code)
is_test = bool(invite and invite.marks_as_test)
user = await user_store.create_user(
    username=..., password_hash=..., email=...,
    is_test_account=is_test,
)

Users signing up without an invite or with a non-test invite are unaffected.

3. One-time: flag 5VC2MCCN as test-seed

Executed once against staging (and any other environment the harness runs against):

UPDATE invite_codes SET marks_as_test = TRUE WHERE code = '5VC2MCCN';

Documented in the seeder script as a comment, and in tests/soak/README.md as a bring-up step. No admin UI for flagging invites as test-seed in MVP — add later if needed.

4. Stats filtering

Add include_test: bool = False parameter to stats queries in services/stats_service.py:

async def get_leaderboard(self, limit=50, include_test=False):
    query = """
      SELECT ... FROM player_stats ps
      JOIN users_v2 u ON u.id = ps.user_id
      WHERE ($1 OR NOT u.is_test_account)
      ORDER BY ps.total_points DESC
      LIMIT $2
    """
    return await conn.fetch(query, include_test, limit)

Router in routers/stats.py exposes include_test as an optional query parameter. Default False — real users visiting the site never see soak traffic. Admin panel and debugging views pass ?include_test=true.

Same treatment for:

  • get_player_stats(user_id, include_test) — gates individual profile lookups
  • get_recent_games(include_test) — hides games where any participant is a test account by default

5. Admin panel surfacing

Small additions to client/admin.html + client/admin.js:

  • User list: "Test" badge column for is_test_account=true rows
  • Invite codes: "Test-seed" indicator next to marks_as_test=true codes
  • Leaderboard + user list: "Include test accounts" toggle → passes ?include_test=true

Out of scope (server-side)

  • New admin endpoint for marking existing accounts as test
  • Admin UI for flagging invites as test-seed at creation time
  • Separate "test stats only" aggregation (admins invert their mental filter)
  • test_only=true query mode

Error handling

Failure taxonomy

| Category | Example | Strategy |
| --- | --- | --- |
| Recoverable game error | Animation flag stuck, click missed target | Log, continue, bot retries via existing GolfBot fallbacks |
| Recoverable session error | WS disconnect for one player, token expires | Reconnect session, rejoin game if possible, abort that room only if unrecoverable |
| Unrecoverable room error | Room stuck >60s, impossible state | Kill the room, capture artifacts, let other rooms continue |
| Fatal runner error | Staging unreachable, invite code exhausted, OOM | Stop everything cleanly, dump summary, exit non-zero |

Core principle: per-room isolation. A failure in room 3 never unwinds rooms 1, 2, 4. Each room runs in its own Promise.allSettled branch.

Per-room watchdog

Each room gets a watchdog that resets on every ctx.heartbeat(roomId) call. If a room goes 60s without a heartbeat, the watchdog captures artifacts, aborts that room only, and the runner continues with the remaining rooms.

Scenarios call heartbeat at each significant progress point (turn played, game started, game finished). The helper DashboardReporter.update() internally calls heartbeat as a convenience, so scenarios that use the dashboard reporter get watchdog resets for free. Scenarios that run with WATCH=none still need to call heartbeat explicitly at least once per 60s — a single call at the top of the per-turn loop is sufficient.
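One way the watchdog could look. This sketch is timestamp-based with an explicit check() — driven, say, from the periodic health-probe loop — rather than timer-based; that choice is mine, made to keep the logic trivially testable, and the real implementation may prefer setTimeout:

```typescript
// Per-room watchdog: heartbeat() records progress, check() fires onTimeout
// once if the room has been silent longer than timeoutMs. All times are
// injectable epoch-milliseconds so tests need no real clock.
class Watchdog {
  private lastBeat: number;
  private fired = false;

  constructor(
    private readonly timeoutMs: number,
    private readonly onTimeout: () => void,
    now: number = Date.now(),
  ) {
    this.lastBeat = now;
  }

  heartbeat(now: number = Date.now()): void {
    this.lastBeat = now;
  }

  check(now: number = Date.now()): boolean {
    if (!this.fired && now - this.lastBeat > this.timeoutMs) {
      this.fired = true; // fire once; the room is already being torn down
      this.onTimeout();
    }
    return this.fired;
  }
}
```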

Artifact capture on failure

Captured per-room into tests/soak/artifacts/<run-id>/<room-id>/:

  • Screenshot of every context in the affected room
  • page.content() HTML snapshot per context
  • Last 200 console log messages per context (already captured by GolfBot)
  • Game state JSON from the state parser
  • Error stack trace
  • Scenario config snapshot

Directory structure:

tests/soak/artifacts/
  2026-04-10-populate-14.23.05/
    run.log           # structured JSONL, full run
    summary.json      # final stats
    room-0/
      screenshot-host.png
      screenshot-joiner-1.png
      page-host.html
      console.txt
      state.json
      error.txt

Artifacts directory is gitignored. Runs older than 7 days auto-pruned on startup.

Structured logging

Single logger, JSON Lines to stdout, pretty mirror to the dashboard. Every log line carries run_id, scenario, room (when applicable), and timestamp. Grep-friendly and jq-friendly.
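A minimal sketch of such a logger. The injectable sink (stdout in production, an array under test) is my addition; the field names mirror the ones listed above:

```typescript
interface LogFields {
  run_id: string;
  scenario: string;
  room?: string;
}

// JSONL logger: one JSON object per line, always carrying run_id, scenario,
// optional room, and an ISO timestamp, merged with per-call metadata.
function createLogger(
  base: LogFields,
  sink: (line: string) => void = console.log,
) {
  return (level: "info" | "warn" | "error", msg: string, meta: object = {}) => {
    sink(JSON.stringify({ ts: new Date().toISOString(), level, msg, ...base, ...meta }));
  };
}
```

Lines like these pipe cleanly into jq, e.g. `jq 'select(.room == "room-0")' run.log`.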

Graceful shutdown

SIGINT / SIGTERM trigger shutdown via AbortController:

  1. Global AbortSignal flips to aborted
  2. Scenarios check ctx.signal.aborted in loops, finish current turn, exit cleanly
  3. Runner waits up to 10s for scenarios to unwind
  4. After 10s, force-closes all contexts + browser
  5. Writes final summary.json and prints results
  6. Exit codes: 0 = all rooms completed target games, 1 = any room failed, 2 = interrupted before completion

Double Ctrl-C = immediate force exit.
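The per-room isolation principle and the exit-code mapping above reduce to a few lines. The helper names are illustrative; Promise.allSettled is what guarantees one room's crash never unwinds the others:

```typescript
type RoomOutcome = PromiseSettledResult<void>;

// Run every room body independently; a rejection in one branch leaves the
// others running to completion.
async function runRoomsIsolated(rooms: Array<() => Promise<void>>): Promise<RoomOutcome[]> {
  return Promise.allSettled(rooms.map((room) => room()));
}

// 0 = all rooms completed, 1 = any room failed, 2 = interrupted first.
function exitCode(outcomes: RoomOutcome[], interrupted: boolean): number {
  if (interrupted) return 2;
  return outcomes.some((o) => o.status === "rejected") ? 1 : 0;
}
```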

Periodic health probes

Every 30s during a run:

  • GET /api/health against the target server
  • Count of open browser contexts vs expected
  • Runner memory usage

If /api/health fails 3 consecutive times, declare fatal error, capture artifacts, stop. This prevents staging outages from being misattributed to bot bugs.
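The "3 consecutive failures" rule is just a counter that resets on success — a sketch, with a hypothetical class name:

```typescript
// Tracks consecutive health-probe failures; record() returns true once the
// run should be declared fatally unhealthy. A success resets the streak.
class HealthTracker {
  private consecutiveFailures = 0;

  constructor(private readonly fatalAfter = 3) {}

  record(ok: boolean): boolean {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.fatalAfter;
  }
}
```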

Retry policy

Retry only at the session level, never at the scenario level.

  • WS drop → reconnect session, rejoin game if possible, 3 attempts max
  • Token rejected → re-login via cached password, 1 attempt
  • Click missed → existing GolfBot retry (already built in)

Never retry: whole games, whole scenarios, fatal errors.
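Both session-level retries can share one bounded helper — 3 attempts for WS reconnect, 1 for re-login. A sketch (the name withRetries is mine):

```typescript
// Run fn up to maxAttempts times, rethrowing the last error if all fail.
// Only ever wraps session-level recovery, never whole games or scenarios.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}
```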

Cleanup guarantees

Three cleanup points, all going through the same cleanup() function wrapped in top-level try/finally:

  1. Success — close contexts, close browsers, flush logs, write summary
  2. Exception — capture artifacts first, then close contexts, flush logs, write partial summary
  3. Signal interrupt — graceful shutdown as above, best-effort artifact capture

File layout

tests/soak/
├── package.json              # standalone (separate from tests/e2e/)
├── tsconfig.json
├── README.md                 # quickstart + flag reference + bring-up steps
├── .env.stresstest.example   # template (real file gitignored)
│
├── runner.ts                 # CLI entry — `npm run soak`
├── config.ts                 # CLI parsing + defaults merging
│
├── core/
│   ├── session-pool.ts
│   ├── room-coordinator.ts
│   ├── screencaster.ts       # CDP attach/detach on demand
│   ├── watchdog.ts
│   ├── artifacts.ts
│   ├── logger.ts
│   └── types.ts              # Scenario, Session, ScenarioContext interfaces
│
├── scenarios/
│   ├── populate.ts
│   ├── stress.ts
│   └── index.ts              # name → module registry
│
├── dashboard/
│   ├── server.ts             # http + ws
│   ├── index.html
│   ├── dashboard.css
│   └── dashboard.js
│
├── scripts/
│   ├── seed-accounts.ts      # one-shot seeding
│   ├── reset-accounts.ts     # future: wipe test account stats
│   └── smoke.sh              # bring-up validation
│
└── artifacts/                # gitignored, auto-pruned 7d
    └── <run-id>/...

Dependencies

New tests/soak/package.json:

{
  "name": "golf-soak",
  "private": true,
  "scripts": {
    "soak": "tsx runner.ts",
    "soak:populate": "tsx runner.ts --scenario=populate",
    "soak:stress": "tsx runner.ts --scenario=stress",
    "seed": "tsx scripts/seed-accounts.ts",
    "smoke": "scripts/smoke.sh"
  },
  "dependencies": {
    "playwright-core": "^1.40.0",
    "ws": "^8.16.0"
  },
  "devDependencies": {
    "tsx": "^4.7.0",
    "@types/ws": "^8.5.0",
    "@types/node": "^20.10.0",
    "typescript": "^5.3.0"
  }
}

Three runtime deps: playwright-core (already in tests/e2e/), ws (WebSocket for dashboard), tsx (dev-only, runs TypeScript directly). No HTTP framework, no bundler, no build step.

CLI flags

--scenario=populate|stress    required
--accounts=<n>                total sessions (default: scenario.needs.accounts)
--rooms=<n>                   default from scenario.needs
--cpus-per-room=<n>           default from scenario.needs
--games-per-room=<n>          default from scenario.defaultConfig
--holes=<n>                   default from scenario.defaultConfig
--watch=none|dashboard|tiled  default: dashboard
--dashboard-port=<n>          default: 7777
--target=<url>                default: TEST_URL env or http://localhost:8000
--run-id=<string>             default: ISO timestamp
--list                        print available scenarios and exit
--dry-run                     validate config without running

Derived: accounts-per-room = accounts / rooms. Must divide evenly; runner errors out with a clear message if not.

Config precedence: CLI flags → environment variables → scenario defaultConfig → runner defaults.
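The precedence chain and the divisibility check amount to a spread merge plus one guard. Function names here are illustrative, not from the codebase:

```typescript
type ConfigLayer = Record<string, unknown>;

// Object spread: later spreads override earlier ones, so layers are listed
// lowest-precedence first (runner defaults → scenario defaults → env → CLI).
function resolveConfig(
  cli: ConfigLayer,
  env: ConfigLayer,
  scenarioDefaults: ConfigLayer,
  runnerDefaults: ConfigLayer,
): ConfigLayer {
  return { ...runnerDefaults, ...scenarioDefaults, ...env, ...cli };
}

// Derived accounts-per-room; errors out clearly on uneven splits.
function accountsPerRoom(accounts: number, rooms: number): number {
  if (accounts % rooms !== 0) {
    throw new Error(`--accounts (${accounts}) must divide evenly by --rooms (${rooms})`);
  }
  return accounts / rooms;
}
```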

Meta-testing

Unit tests (Vitest, minimal)

  • room-coordinator.ts — announce/await correctness, timeout behavior
  • watchdog.ts — fires on timeout, resets on heartbeat, cancels cleanly
  • config.ts — CLI precedence, required field validation

Bring-up smoke test (tests/soak/scripts/smoke.sh)

Runs against local dev server with minimum viable config:

TEST_URL=http://localhost:8000 \
  npm run soak -- \
  --scenario=populate \
  --accounts=2 \
  --rooms=1 \
  --cpus-per-room=0 \
  --games-per-room=1 \
  --holes=1 \
  --watch=none

Exit 0 = full harness works end-to-end. ~30 seconds. Run after any change.

Manual validation checklist

Documented in tests/soak/CHECKLIST.md:

  • Seed 16 accounts against staging using the invite code
  • --scenario=populate --rooms=1 --games-per-room=1 completes cleanly
  • --scenario=populate --rooms=4 --games-per-room=1 — 4 rooms in parallel, no cross-contamination
  • --watch=dashboard opens browser, grid renders, progress updates
  • Click a player tile → live video appears, Esc → stops
  • --watch=tiled opens 4 browser windows in 2×2 grid
  • Ctrl-C during a run → graceful shutdown, summary printed, exit 2
  • Kill the target server mid-run → runner detects, captures artifacts, exits 1
  • Stats query ?include_test=false hides soak accounts, ?include_test=true shows them
  • Full stress run (--scenario=stress --games-per-room=10) — no console errors, all rooms complete

Implementation order

Sequenced so each step produces something demonstrable before moving on. The writing-plans skill will break this into concrete tasks.

  1. Server-side changes — schema alters, register flow, stats filter, admin badge. Independent, ships first, unblocks local testing.
  2. Scaffold tests/soak/ — package.json, tsconfig, core/types, logger. No behavior yet.
  3. SessionPool + scripts/seed-accounts.ts — end-to-end auth: seed, cache, load, validate login.
  4. RoomCoordinator + minimal populate scenario body — proves multi-room orchestration.
  5. runner.ts — CLI, config merging, scenario loading, top-level error handling.
  6. --watch=none works — runs against local dev, produces clean logs, exits 0. First end-to-end milestone.
  7. --watch=dashboard status grid — HTML + WS + tile updates (no video yet).
  8. CDP screencast / click-to-watch — the live video feature.
  9. --watch=tiled mode — native windows via page.evaluate(window.moveTo).
  10. stress scenario — chaos injection, rapid games.
  11. Failure handling — watchdog, artifact capture, graceful shutdown.
  12. Smoke test script + CHECKLIST.md — validation.
  13. Run against staging for real — populate scoreboard, hunt bugs, report findings.

If step 6 takes longer than planned, steps 1–5 are still useful standalone.

Out of scope for MVP

  • Mobile viewport scenarios (future mobile-populate)
  • Reconnect-storm scenarios
  • Admin workflow scenarios
  • Concurrent scenario execution
  • Distributed runner
  • Grafana / OTEL / custom metrics push
  • Test account stat reset tooling
  • Auto-promoting stress findings into Playwright regression tests
  • New admin endpoints for account marking
  • Admin UI for flagging invites as test-seed

All of these are cheap to add later because the scenario interface and session pool don't presuppose them.

Open questions (to resolve during implementation)

  1. localStorage auth key — exact keys used by client/app.js to persist the JWT and user blob; verified by reading the file during step 3.
  2. Chaos event set for stress scenario — finalize which chaos events are in scope for MVP vs added incrementally (start with rapid clicks + tab nav + setOffline, add more as the server proves robust).
  3. CDP screencast frame rate tuning — start at everyNthFrame: 2 (~15fps), adjust down if bandwidth/CPU is excessive on long runs.
  4. Screen bounds detection for tiled mode — default to 1920×1080, expose override via --tiled-bounds=WxH; auto-detect later if useful.