docs: multiplayer soak & UX test harness design
Design for a standalone Playwright-based soak runner that drives 16 authenticated browser sessions across 4 concurrent rooms to populate staging scoreboards and hunt stability bugs. Architected as a pluggable scenario harness so future UX test scenarios (reconnect, invite flow, admin workflows, mobile) slot in cleanly.

Also gitignores .superpowers/ (brainstorming session artifacts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1
.gitignore
vendored
@@ -214,6 +214,7 @@ cython_debug/

# Claude Code
.claude/
.superpowers/

# Virtualenv in project root
bin/

@@ -0,0 +1,638 @@
# Multiplayer Soak & UX Test Harness: Design

**Date:** 2026-04-10
**Status:** Design approved, pending implementation plan

## Context

Golf Card Game is a real-time multiplayer WebSocket application with event-sourced game state, a leaderboard system, and an aggressive animation pipeline. Current test coverage is:

- `server/`: pytest unit/integration tests
- `tests/e2e/specs/`: Playwright tests exercising single-context flows (full game, stress with rapid clicks, visual regression, v3 features)

What's missing: a way to exercise the system with **many concurrent authenticated users playing real multiplayer games** for long durations. We can't currently:

1. Populate staging scoreboards with realistic game history for demos and visual verification
2. Hunt race conditions, WebSocket leaks, and room cleanup bugs under sustained concurrent load
3. Validate multiplayer UX end-to-end across rooms without manual coordination
4. Exercise authentication, room lifecycle, and stats aggregation as a cohesive system

This spec defines a **multi-scenario soak and UX test harness**: a standalone Playwright-based runner that drives 16 authenticated browser sessions across 4 concurrent rooms playing many games against each other (plus optional CPU opponents). It starts as a soak tool with two scenarios (`populate`, `stress`) and grows into the project's general-purpose multi-user UX test platform.

## Goals

1. **Scoreboard population**: run long multi-round games against staging with varied CPU personalities to produce realistic scoreboard data
2. **Stability stress**: run rapid short games with chaos injection to surface race conditions and cleanup bugs
3. **Extensibility**: new scenarios (reconnect, invite flow, admin workflow, mobile) slot in without runner changes
4. **Watchability**: a dashboard mode with click-to-watch live video of any player, usable for demos and debugging
5. **Per-run isolation**: test account traffic must be cleanly separable from real user traffic in stats queries

## Non-goals

- Replacing the existing `tests/e2e/specs/` Playwright tests (they serve a different purpose: single-context edge cases)
- Distributed runner across multiple machines
- Concurrent scenario execution (one scenario per run for MVP)
- Grafana/OTEL integration
- Auto-promoting findings to regression tests

## Constraints

- **Staging auth gate**: staging runs `INVITE_ONLY=true`; seeding must go through the register endpoint with an invite code
- **Invite code `5VC2MCCN`**: provisioned with 16 uses, used once per test account on first-ever run, cached afterward
- **Per-IP rate limiting**: `DAILY_SIGNUPS_PER_IP=20` on prod, lower default elsewhere; seeding must stay within budget
- **Room idle cleanup**: `ROOM_IDLE_TIMEOUT_SECONDS=300` means the scenario must keep rooms active or tolerate cleanup cascades
- **Existing bot code**: `tests/e2e/bot/golf-bot.ts` already provides `createGame`, `joinGame`, `addCPU`, `playTurn`, `playGame`; the harness reuses it verbatim

## Architecture

### Module layout

```
runner.ts (entry)
├─ SessionPool            owns 16 BrowserContexts, seeds/logs in, allocates
├─ Scenario               pluggable interface, per-scenario file
├─ RoomCoordinator        host→joiners room-code handoff via Deferred<string>
├─ Dashboard (optional)   HTTP + WS server, status grid + click-to-watch video
└─ GolfBot (reused)       tests/e2e/bot/golf-bot.ts, unchanged
```

Default: one browser, 16 contexts (lowest RAM, fastest startup). `WATCH=tiled` is the exception: it launches two browsers, one headed (hosts) and one headless (joiners), because Chromium's headed/headless flag is browser-scoped, not context-scoped. See the `tiled` implementation detail below.

### Location

New sibling directory `tests/soak/`; it does not modify `tests/e2e/`. Shares `GolfBot` via direct import from `../e2e/bot/`.

Rationale: Playwright Test is designed for short isolated tests. A single `test()` running 16 contexts for hours fights the test model (worker limits, all-or-nothing failure, single giant trace file). A standalone Node script gives first-class CLI flags, full control over the event loop, and a clean home for the dashboard server, and it reuses the `GolfBot` class unchanged. Existing `tests/e2e/specs/stress.spec.ts` stays as-is for single-context edge cases.

## Components

### SessionPool

Owns the lifecycle of 16 authenticated `BrowserContext`s.

**Responsibilities:**
- On first run: register 16 accounts via `POST /api/auth/register` with invite code `5VC2MCCN`, cache credentials to `.env.stresstest`
- On subsequent runs: read cached credentials, create contexts, inject auth into each (localStorage token, or re-login via cached password if token rejected)
- Expose `acquire({ count }): Promise<Session[]>` so scenarios can request N authenticated sessions without caring how they got there
- On scenario completion: close all contexts cleanly

**`Session` shape:**
```typescript
interface Session {
  context: BrowserContext;
  page: Page;
  bot: GolfBot;
  account: Account; // { username, password, token }
  key: string;      // stable identifier, e.g., "soak_07"
}
```

**`.env.stresstest` format** (gitignored, local-only, plaintext; this is a test tool):
```
SOAK_ACCOUNT_00=soak_00_a7bx:Hunter2!xK9mQ:eyJhbGc...
SOAK_ACCOUNT_01=soak_01_c3pz:Kc82!wQm4Rt:eyJhbGc...
...
SOAK_ACCOUNT_15=soak_15_m9fy:Px7!eR4sTn2:eyJhbGc...
```

Each line is `SOAK_ACCOUNT_NN=username:password:token`. The password is kept so the pool can recover from token expiry automatically.

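As an illustration of the cache format above, a parser for one entry might look like the sketch below. `parseAccountLine` is a hypothetical name, and the split rule rests on an assumption the spec does not state: usernames and tokens never contain `:`, while generated passwords might.

```typescript
// Hypothetical helper, not part of the spec: parses one `.env.stresstest` entry.
interface Account {
  username: string;
  password: string;
  token: string;
}

function parseAccountLine(line: string): Account {
  // Drop the `SOAK_ACCOUNT_NN=` prefix if present.
  const value = line.trim().replace(/^SOAK_ACCOUNT_\d+=/, "");

  // Assumption: username and token contain no ':', the password may.
  // So split on the first and the last colon only.
  const first = value.indexOf(":");
  const last = value.lastIndexOf(":");
  if (first < 0 || last === first) {
    throw new Error("malformed account line: expected username:password:token");
  }
  return {
    username: value.slice(0, first),
    password: value.slice(first + 1, last),
    token: value.slice(last + 1),
  };
}
```

Splitting on the outermost colons keeps the parser tolerant of punctuation-heavy passwords without needing an escaping scheme.
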
### Scenario interface

```typescript
export interface ScenarioNeeds {
  accounts: number;
  rooms?: number;
  cpusPerRoom?: number;
}

export interface ScenarioContext {
  config: ScenarioConfig;          // CLI flags merged with scenario defaults
  sessions: Session[];             // pre-authenticated, pre-navigated
  coordinator: RoomCoordinator;
  dashboard: DashboardReporter;    // no-op when watch mode doesn't use it
  logger: Logger;
  signal: AbortSignal;             // graceful shutdown
  heartbeat(roomId: string): void; // resets the per-room watchdog
}

export interface ScenarioResult {
  gamesCompleted: number;
  errors: ScenarioError[];
  durationMs: number;
  customMetrics?: Record<string, number>;
}

export interface Scenario {
  name: string;
  description: string;
  defaultConfig: ScenarioConfig;
  needs: ScenarioNeeds;
  run(ctx: ScenarioContext): Promise<ScenarioResult>;
}
```

Scenarios are plain objects exported as default from files in `tests/soak/scenarios/`. The runner discovers them via a registry (`scenarios/index.ts`) that maps name → module. No filesystem scanning, no magic.

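A minimal sketch of what the name → module registry could look like follows. The `Scenario` type is trimmed to the fields the sketch needs, the two scenario objects are inline stubs standing in for the real default exports, and `getScenario` / `listScenarios` are hypothetical names.

```typescript
// Trimmed stand-in for the Scenario interface above, so the sketch is self-contained.
interface Scenario {
  name: string;
  description: string;
  run(ctx: unknown): Promise<unknown>;
}

// Stubs: the real registry would import these from ./populate and ./stress.
const populate: Scenario = {
  name: "populate",
  description: "Long multi-round games to populate scoreboards",
  run: async () => ({}),
};
const stress: Scenario = {
  name: "stress",
  description: "Rapid short games for stability & race condition hunting",
  run: async () => ({}),
};

// Explicit registration: adding a scenario means adding one entry here.
const registry = new Map<string, Scenario>();
for (const scenario of [populate, stress]) {
  registry.set(scenario.name, scenario);
}

function listScenarios(): string[] {
  return [...registry.keys()];
}

function getScenario(name: string): Scenario {
  const scenario = registry.get(name);
  if (!scenario) {
    throw new Error(`unknown scenario "${name}" (available: ${listScenarios().join(", ")})`);
  }
  return scenario;
}
```

An explicit map keeps `--list` trivial to implement and makes an unknown `--scenario` value fail with the available names in the message.
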
### RoomCoordinator

~30 lines. Solves host→joiners room-code handoff:

```typescript
class RoomCoordinator {
  private rooms = new Map<string, Deferred<string>>();

  announce(roomId: string, code: string) { this.get(roomId).resolve(code); }
  async await(roomId: string): Promise<string> { return this.get(roomId).promise; }
  private get(roomId: string) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, deferred());
    return this.rooms.get(roomId)!;
  }
}
```

Usage:
```typescript
// Host
const code = await host.bot.createGame(host.account.username);
coordinator.announce('room-1', code);

// Joiners (concurrent)
const code = await coordinator.await('room-1');
await joiner.bot.joinGame(code, joiner.account.username);
```

No polling, no sleeps, no cross-page scraping.

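The coordinator relies on a `Deferred<string>` type and a `deferred()` factory that the spec does not define. One plausible shape is sketched here; `withTimeout` is an additional hypothetical helper, included only because the unit-test plan mentions timeout behavior for the coordinator.

```typescript
// Assumed shape of the Deferred helper used by RoomCoordinator.
interface Deferred<T> {
  promise: Promise<T>;
  resolve(value: T): void;
  reject(reason?: unknown): void;
}

function deferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (reason?: unknown) => void;
  // The executor runs synchronously, so both callbacks are set before we return.
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical guard so a joiner does not wait forever if the host never announces.
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, rej) => {
    timer = setTimeout(() => rej(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}
```

Because a deferred can be created by whichever side arrives first, joiners may call `await(roomId)` before the host has announced and still receive the code.
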
### Dashboard

Optional: only instantiated when `WATCH=dashboard`.

**Server side** (`dashboard/server.ts`): vanilla `http` + `ws` module. Serves a single static HTML page, accepts WebSocket connections, relays messages between scenarios and the browser.

**Client side** (`dashboard/index.html` + `dashboard.js`): 2×2 room grid, per-player tiles with live status (current player, score, held card, phase, moves), progress bars per hole, activity log at the bottom. No framework, ~300 lines total.

**Click-to-watch**: clicking a player tile sends `start_stream(sessionKey)` over WS. The runner attaches a CDP session to that player's page via `context.newCDPSession(page)`, calls `Page.startScreencast` with `{format: 'jpeg', quality: 60, maxWidth: 640, maxHeight: 360, everyNthFrame: 2}`, and forwards each `Page.screencastFrame` event to the dashboard as `{ sessionKey, jpeg_b64 }`. The dashboard renders it into an `<img>` that swaps `src` on each frame.

Returning to the grid sends `stop_stream(sessionKey)` and the runner detaches the CDP session. On WS disconnect, all active screencasts stop. This keeps CPU cost zero except while someone is actively watching.

**`DashboardReporter` interface exposed to scenarios:**
```typescript
interface DashboardReporter {
  update(roomId: string, state: Partial<RoomState>): void;
  log(level: 'info'|'warn'|'error', msg: string, meta?: object): void;
  incrementMetric(name: string, by?: number): void;
}
```

When `WATCH` is not `dashboard`, all three methods are no-ops; structured logs still go to stdout.

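The click-to-watch flow described earlier in this section could be wired up roughly as below. This is a sketch under assumptions: `CDPSessionLike` is a local structural stand-in so the example runs without Playwright (real code would use the `CDPSession` returned by `context.newCDPSession(page)`), and `startStream` is a hypothetical name. The `Page.screencastFrameAck` call is not mentioned in the spec; it is included because CDP expects each frame to be acknowledged before further frames are delivered.

```typescript
// Local stand-in for Playwright's CDPSession (assumption: these three methods suffice).
interface CDPSessionLike {
  send(method: string, params?: Record<string, unknown>): Promise<unknown>;
  on(event: string, handler: (payload: any) => void): unknown;
  detach(): Promise<void>;
}

interface FrameMessage {
  sessionKey: string;
  jpeg_b64: string;
}

// Starts a screencast and returns a function that stops it and detaches.
async function startStream(
  cdp: CDPSessionLike,
  sessionKey: string,
  forward: (msg: FrameMessage) => void,
): Promise<() => Promise<void>> {
  cdp.on("Page.screencastFrame", (frame: { data: string; sessionId: number }) => {
    forward({ sessionKey, jpeg_b64: frame.data });
    // Acknowledge the frame so Chromium keeps sending new ones.
    void cdp.send("Page.screencastFrameAck", { sessionId: frame.sessionId }).catch(() => {});
  });

  await cdp.send("Page.startScreencast", {
    format: "jpeg",
    quality: 60,
    maxWidth: 640,
    maxHeight: 360,
    everyNthFrame: 2,
  });

  return async () => {
    await cdp.send("Page.stopScreencast");
    await cdp.detach();
  };
}
```
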
### Runner

`runner.ts` is the CLI entry point. Parses flags, resolves config precedence, launches browser(s), instantiates `SessionPool` + `RoomCoordinator` + (optional) `Dashboard`, loads the requested scenario by name, executes it, reports results, cleans up.

## Scenarios

### Scenario 1: `populate`

**Goal:** produce realistic scoreboard data for staging demos.

**Config:**
```typescript
{
  name: 'populate',
  description: 'Long multi-round games to populate scoreboards',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 1 },
  defaultConfig: {
    gamesPerRoom: 10,
    holes: 9,
    decks: 2,
    cpuPersonalities: ['Sofia', 'Marcus', 'Kenji', 'Priya'],
    thinkTimeMs: [800, 2200],
    interGamePauseMs: 3000,
  },
}
```

**Shape:** 4 rooms × 4 accounts + 1 CPU each. Each room runs `gamesPerRoom` sequential games. Inside a room: host creates game → joiners join → host adds CPU → host starts game → all sessions loop on `isMyTurn()` + `playTurn()` with randomized human-like think time between turns. Between games, rooms pause briefly to mimic natural pacing.

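The randomized think time can be sketched as a small sampling helper. The uniform distribution over the `thinkTimeMs` range is an assumption (the spec only says "randomized human-like"), and `pickThinkTime` / `sleep` are hypothetical names.

```typescript
// Samples a delay in [min, max] milliseconds. `rng` is injectable so tests
// can make the result deterministic; it defaults to Math.random.
function pickThinkTime(
  [min, max]: [number, number],
  rng: () => number = Math.random,
): number {
  return Math.round(min + rng() * (max - min));
}

// Promise-based sleep for use between turns and between games.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

In a per-turn loop this would be used as `await sleep(pickThinkTime(config.thinkTimeMs))` before calling `playTurn()`.
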
### Scenario 2: `stress`

**Goal:** hunt race conditions and stability bugs.

**Config:**
```typescript
{
  name: 'stress',
  description: 'Rapid short games for stability & race condition hunting',
  needs: { accounts: 16, rooms: 4, cpusPerRoom: 2 },
  defaultConfig: {
    gamesPerRoom: 50,
    holes: 1,
    decks: 1,
    thinkTimeMs: [50, 150],
    interGamePauseMs: 200,
    chaosChance: 0.05,
  },
}
```

**Shape:** same as `populate` but tight loops, 1-hole games, and a chaos injector that fires with 5% probability per turn. Chaos events:
- Rapid concurrent clicks on multiple cards
- Random tab-navigation away and back
- Simultaneous click on card + discard button
- Brief WebSocket drop via Playwright's `context.setOffline()` followed by reconnect

Each chaos event is logged with enough context to reproduce (room, player, turn, event type).

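The per-turn chaos roll might look like the following sketch. The event labels, `rollChaos`, and `chaosLogEntry` are hypothetical; only the 5% per-turn probability and the logged fields (room, player, turn, event type) come from the text above.

```typescript
// Hypothetical labels for the four chaos events listed above.
type ChaosEvent = "rapid-clicks" | "tab-away" | "card-plus-discard" | "ws-drop";

const CHAOS_EVENTS: ChaosEvent[] = [
  "rapid-clicks",
  "tab-away",
  "card-plus-discard",
  "ws-drop",
];

// Returns an event with probability `chaosChance`, otherwise null.
// The second rng draw picks uniformly among events (an assumption).
function rollChaos(
  chaosChance: number,
  rng: () => number = Math.random,
): ChaosEvent | null {
  if (rng() >= chaosChance) return null;
  return CHAOS_EVENTS[Math.floor(rng() * CHAOS_EVENTS.length)];
}

// Builds the reproducibility record that gets logged for each fired event.
function chaosLogEntry(room: string, player: string, turn: number, event: ChaosEvent) {
  return { kind: "chaos", room, player, turn, event };
}
```

Passing a seeded `rng` would additionally make a whole stress run replayable, though the spec does not require that.
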
### Future scenarios (not MVP, design anticipates them)

- `reconnect`: 2 accounts, deliberate mid-game disconnect, verify recovery
- `invite-flow`: 0 accounts (fresh signups), exercise invite request → approval → first-game pipeline
- `admin-workflow`: 1 admin account, drive the admin panel
- `mobile-populate`: reuses `populate` with `devices['iPhone 13']` context options
- `replay-viewer`: watches completed games via the replay UI

Each is a new file in `tests/soak/scenarios/`, zero runner changes.

## Data flow

### Cold start (first-ever run)

1. Runner reads `.env.stresstest` → file missing
2. `SessionPool.seedAccounts()`:
   - For `i` in `0..15`: `POST /api/auth/register` with `{ username, password, email, invite_code: '5VC2MCCN' }`
   - Receive `{ user, token, expires_at }`, write to `.env.stresstest`
3. Server sets `is_test_account=true` automatically because the invite code has `marks_as_test=true` (see Server changes)
4. Runner proceeds to normal startup

### Warm start (subsequent runs)

1. Runner reads `.env.stresstest` → 16 entries
2. `SessionPool` creates 16 `BrowserContext`s
3. For each context: inject token into localStorage using the key the client app reads on load (resolved during implementation by inspecting `client/app.js`; see Open Questions)
4. Each session navigates to `/` and lands post-auth
5. If any token is rejected (401), pool silently re-logs in via cached password and refreshes the token in `.env.stresstest`

### Seeding: explicit script vs automatic fallback

Two paths to the same result, for flexibility:

- **Preferred: explicit `npm run seed`.** Runs `scripts/seed-accounts.ts` once during bring-up. Gives clear feedback, fails loudly on rate limits or network issues, lets you verify the accounts exist before a real run.
- **Fallback: auto-seed on cold start.** If `runner.ts` starts and `.env.stresstest` is missing, `SessionPool` invokes the same seeding logic transparently. Useful for CI or fresh clones where nobody ran the explicit step.

Both paths share the same code in `core/session-pool.ts`; the script is a thin CLI wrapper around `SessionPool.seedAccounts()`. Documented in `tests/soak/README.md` with "run `npm run seed` first" as the happy path.

### Room code handoff

Host session calls `createGame` → receives room code → `coordinator.announce(roomId, code)`. Joiner sessions `await coordinator.await(roomId)` → receive code → call `joinGame`. All in-process, no polling.

## Watch modes

| Mode | Flag | Rendering | When to use |
|---|---|---|---|
| `none` | `WATCH=none` | Pure headless, JSONL stdout | CI, overnight unattended |
| `dashboard` | `WATCH=dashboard` *(default)* | HTML status grid + click-to-watch live video | Interactive runs, demos, debugging |
| `tiled` | `WATCH=tiled` | 4 native Chromium windows positioned 2×2 | Hands-on power-user debugging |

### `tiled` implementation detail

Two browsers launched: one headed (`headless: false, slowMo: 50`) for the 4 host contexts, one headless for the 12 joiner contexts. Host windows positioned via `page.evaluate(([x, y]) => window.moveTo(x, y), [x, y])` after load; the coordinates are passed as an argument because the callback runs in the page, not in Node. Grid computed from screen size with a default of 1920×1080.

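The 2×2 grid computation could be as simple as the sketch below. `tileGrid` and `Rect` are hypothetical names; the 1920×1080 default comes from the paragraph above.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Splits the screen into cols × rows equal tiles, listed row by row.
function tileGrid(screenW = 1920, screenH = 1080, cols = 2, rows = 2): Rect[] {
  const width = Math.floor(screenW / cols);
  const height = Math.floor(screenH / rows);
  const tiles: Rect[] = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      tiles.push({ x: c * width, y: r * height, width, height });
    }
  }
  return tiles;
}
```

Each host window would then take one tile, in room order. Whether Chromium honours `window.moveTo` for a top-level headed window is something to confirm during implementation.
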
## Server-side changes

All changes are additive and fit the existing inline migration pattern in `server/stores/user_store.py`.

### 1. Schema

Two new columns + one partial index:

```sql
ALTER TABLE users_v2 ADD COLUMN IF NOT EXISTS is_test_account BOOLEAN DEFAULT FALSE;
CREATE INDEX IF NOT EXISTS idx_users_v2_is_test_account ON users_v2(is_test_account)
  WHERE is_test_account = TRUE;
ALTER TABLE invite_codes ADD COLUMN IF NOT EXISTS marks_as_test BOOLEAN DEFAULT FALSE;
```

Partial index because ~99% of rows will be `FALSE`; we only want to accelerate the "show test accounts" admin queries, not pay index-maintenance cost on every normal write.

### 2. Register flow propagates the flag

In `services/auth_service.py`, after resolving the invite code, read `marks_as_test` and pass through to `user_store.create_user`:

```python
invite = await admin_service.get_invite_code(invite_code)
is_test = bool(invite and invite.marks_as_test)
user = await user_store.create_user(
    username=..., password_hash=..., email=...,
    is_test_account=is_test,
)
```

Users signing up without an invite or with a non-test invite are unaffected.

### 3. One-time: flag `5VC2MCCN` as test-seed

Executed once against staging (and any other environment the harness runs against):

```sql
UPDATE invite_codes SET marks_as_test = TRUE WHERE code = '5VC2MCCN';
```

Documented in the seeder script as a comment, and in `tests/soak/README.md` as a bring-up step. No admin UI for flagging invites as test-seed in MVP; add later if needed.

### 4. Stats filtering

Add `include_test: bool = False` parameter to stats queries in `services/stats_service.py`:

```python
async def get_leaderboard(self, limit=50, include_test=False):
    query = """
        SELECT ... FROM player_stats ps
        JOIN users_v2 u ON u.id = ps.user_id
        WHERE ($1 OR NOT u.is_test_account)
        ORDER BY ps.total_points DESC
        LIMIT $2
    """
    return await conn.fetch(query, include_test, limit)
```

Router in `routers/stats.py` exposes `include_test` as an optional query parameter. Default `False`: real users visiting the site never see soak traffic. Admin panel and debugging views pass `?include_test=true`.

Same treatment for:
- `get_player_stats(user_id, include_test)`: gates individual profile lookups
- `get_recent_games(include_test)`: hides games where any participant is a test account by default

### 5. Admin panel surfacing

Small additions to `client/admin.html` + `client/admin.js`:
- User list: "Test" badge column for `is_test_account=true` rows
- Invite codes: "Test-seed" indicator next to `marks_as_test=true` codes
- Leaderboard + user list: "Include test accounts" toggle → passes `?include_test=true`

### Out of scope (server-side)

- New admin endpoint for marking existing accounts as test
- Admin UI for flagging invites as test-seed at creation time
- Separate "test stats only" aggregation (admins invert their mental filter)
- `test_only=true` query mode

## Error handling

### Failure taxonomy

| Category | Example | Strategy |
|---|---|---|
| Recoverable game error | Animation flag stuck, click missed target | Log, continue, bot retries via existing `GolfBot` fallbacks |
| Recoverable session error | WS disconnect for one player, token expires | Reconnect session, rejoin game if possible, abort that room only if unrecoverable |
| Unrecoverable room error | Room stuck >60s, impossible state | Kill the room, capture artifacts, let other rooms continue |
| Fatal runner error | Staging unreachable, invite code exhausted, OOM | Stop everything cleanly, dump summary, exit non-zero |

**Core principle: per-room isolation.** A failure in room 3 never unwinds rooms 1, 2, 4. Each room runs in its own `Promise.allSettled` branch.

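The isolation principle above can be sketched with `Promise.allSettled`. `runRooms` and `RoomOutcome` are hypothetical names, and the room bodies are represented as plain async functions rather than real scenario code.

```typescript
interface RoomOutcome {
  roomId: string;
  ok: boolean;
  error?: string;
}

// Runs every room concurrently and reports each outcome independently.
async function runRooms(
  rooms: Record<string, () => Promise<void>>,
): Promise<RoomOutcome[]> {
  const ids = Object.keys(rooms);
  // allSettled never rejects, so one room throwing cannot unwind the others.
  // The async wrapper also turns a synchronous throw into a rejection.
  const settled = await Promise.allSettled(ids.map(async (id) => rooms[id]()));
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { roomId: ids[i], ok: true }
      : { roomId: ids[i], ok: false, error: String(result.reason) },
  );
}
```

With `Promise.all` the first rejection would surface immediately and the remaining rooms' results would be lost, which is exactly what the principle rules out.
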
### Per-room watchdog

Each room gets a watchdog that resets on every `ctx.heartbeat(roomId)` call. If a room has not sent a heartbeat for 60s, the watchdog captures artifacts and aborts that room only; the runner continues with the remaining rooms.

Scenarios call `heartbeat` at each significant progress point (turn played, game started, game finished). The helper `DashboardReporter.update()` internally calls `heartbeat` as a convenience, so scenarios that use the dashboard reporter get watchdog resets for free. Scenarios that run with `WATCH=none` still need to call `heartbeat` explicitly at least once per 60s; a single call at the top of the per-turn loop is sufficient.

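A minimal watchdog matching the description above might look like this. The class shape is an assumption; in the harness, `onTimeout` would capture artifacts and abort the room.

```typescript
// Fires `onTimeout` once if no heartbeat arrives within `timeoutMs`.
class Watchdog {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private done = false;
  private readonly timeoutMs: number;
  private readonly onTimeout: () => void;

  constructor(timeoutMs: number, onTimeout: () => void) {
    this.timeoutMs = timeoutMs;
    this.onTimeout = onTimeout;
    this.heartbeat(); // arm immediately so a room that never starts is still caught
  }

  // Resets the countdown; ignored after the watchdog fired or was cancelled.
  heartbeat(): void {
    if (this.done) return;
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.done = true;
      this.onTimeout();
    }, this.timeoutMs);
  }

  // Stops the watchdog, e.g. when the room finishes normally.
  cancel(): void {
    this.done = true;
    clearTimeout(this.timer);
  }
}
```

The runner would hold one instance per room, for example `new Watchdog(60_000, () => abortRoom(roomId))`, where `abortRoom` is a hypothetical callback.
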
### Artifact capture on failure

Captured per-room into `tests/soak/artifacts/<run-id>/<room-id>/`:
- Screenshot of every context in the affected room
- `page.content()` HTML snapshot per context
- Last 200 console log messages per context (already captured by `GolfBot`)
- Game state JSON from the state parser
- Error stack trace
- Scenario config snapshot

Directory structure:
```
tests/soak/artifacts/
  2026-04-10-populate-14.23.05/
    run.log                      # structured JSONL, full run
    summary.json                 # final stats
    room-0/
      screenshot-host.png
      screenshot-joiner-1.png
      page-host.html
      console.txt
      state.json
      error.txt
```

Artifacts directory is gitignored. Runs older than 7 days auto-pruned on startup.

### Structured logging

Single logger, JSON Lines to stdout, pretty mirror to the dashboard. Every log line carries `run_id`, `scenario`, `room` (when applicable), and `timestamp`. Grep-friendly and `jq`-friendly.

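One way to produce such lines is sketched below. `formatLogLine` is a hypothetical name, and the `level` and `msg` field names are assumptions; the spec only fixes `run_id`, `scenario`, `room`, and `timestamp`.

```typescript
type Level = "info" | "warn" | "error";

interface LogBase {
  run_id: string;
  scenario: string;
}

// Serializes one log record as a single JSON line (no trailing newline).
// Fixed fields are written after `meta` so callers cannot overwrite them.
function formatLogLine(
  base: LogBase,
  level: Level,
  msg: string,
  meta: { room?: string; [key: string]: unknown } = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...meta,
    timestamp: now.toISOString(),
    level,
    run_id: base.run_id,
    scenario: base.scenario,
    msg,
  });
}
```

Because every record is one self-contained JSON object, a filter such as `jq 'select(.room == "room-3")' run.log` works without any custom parsing.
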
### Graceful shutdown

`SIGINT` / `SIGTERM` trigger shutdown via `AbortController`:
1. Global `AbortSignal` flips to aborted
2. Scenarios check `ctx.signal.aborted` in loops, finish current turn, exit cleanly
3. Runner waits up to 10s for scenarios to unwind
4. After 10s, force-closes all contexts + browser
5. Writes final `summary.json` and prints results
6. Exit codes: `0` = all rooms completed target games, `1` = any room failed, `2` = interrupted before completion

Double Ctrl-C = immediate force exit.

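The signal handling and exit-code mapping above could be sketched as follows. `createShutdown` and `exitCodeFor` are hypothetical names. The spec does not say which code wins when a room has failed and the run was also interrupted; this sketch assumes a room failure takes precedence.

```typescript
// Returns the shared AbortSignal plus the handler to register for SIGINT/SIGTERM.
// First signal: request graceful shutdown. Second signal: force exit.
function createShutdown(forceExit: () => void) {
  const controller = new AbortController();
  const onSignal = (): void => {
    if (controller.signal.aborted) {
      forceExit();
      return;
    }
    controller.abort();
  };
  return { signal: controller.signal, onSignal };
}

// Maps run results to the exit codes listed above.
// Assumption: a failed room (1) outranks an interrupt (2).
function exitCodeFor(roomsOk: boolean[], interrupted: boolean): 0 | 1 | 2 {
  if (roomsOk.some((ok) => !ok)) return 1;
  if (interrupted) return 2;
  return 0;
}
```

In the runner this would be registered with `process.on("SIGINT", onSignal)` and `process.on("SIGTERM", onSignal)`, with `forceExit` calling `process.exit`.
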
### Periodic health probes

Every 30s during a run:
- `GET /api/health` against the target server
- Count of open browser contexts vs expected
- Runner memory usage

If `/api/health` fails 3 consecutive times, declare fatal error, capture artifacts, stop. This prevents staging outages from being misattributed to bot bugs.

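The "3 consecutive failures" rule is easy to get subtly wrong (counting total instead of consecutive failures), so a small state holder may help. `HealthTracker` and its state names are hypothetical.

```typescript
type HealthState = "healthy" | "degraded" | "fatal";

// Tracks consecutive probe failures; any success resets the streak.
class HealthTracker {
  private failures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures = 3) {
    this.maxFailures = maxFailures;
  }

  // Record one probe result and return the resulting state.
  record(ok: boolean): HealthState {
    this.failures = ok ? 0 : this.failures + 1;
    if (this.failures >= this.maxFailures) return "fatal";
    return this.failures === 0 ? "healthy" : "degraded";
  }
}
```

The 30s probe loop would call `record()` with the outcome of each `GET /api/health` and stop the run when it returns `"fatal"`.
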
### Retry policy

Retry only at the session level, never at the scenario level.
- WS drop → reconnect session, rejoin game if possible, 3 attempts max
- Token rejected → re-login via cached password, 1 attempt
- Click missed → existing `GolfBot` retry (already built in)

Never retry: whole games, whole scenarios, fatal errors.

### Cleanup guarantees

Three cleanup points, all going through the same `cleanup()` function wrapped in top-level `try/finally`:
1. **Success**: close contexts, close browsers, flush logs, write summary
2. **Exception**: capture artifacts first, then close contexts, flush logs, write partial summary
3. **Signal interrupt**: graceful shutdown as above, best-effort artifact capture

## File layout

```
tests/soak/
├── package.json                 # standalone (separate from tests/e2e/)
├── tsconfig.json
├── README.md                    # quickstart + flag reference + bring-up steps
├── .env.stresstest.example      # template (real file gitignored)
│
├── runner.ts                    # CLI entry: `npm run soak`
├── config.ts                    # CLI parsing + defaults merging
│
├── core/
│   ├── session-pool.ts
│   ├── room-coordinator.ts
│   ├── screencaster.ts          # CDP attach/detach on demand
│   ├── watchdog.ts
│   ├── artifacts.ts
│   ├── logger.ts
│   └── types.ts                 # Scenario, Session, ScenarioContext interfaces
│
├── scenarios/
│   ├── populate.ts
│   ├── stress.ts
│   └── index.ts                 # name → module registry
│
├── dashboard/
│   ├── server.ts                # http + ws
│   ├── index.html
│   ├── dashboard.css
│   └── dashboard.js
│
├── scripts/
│   ├── seed-accounts.ts         # one-shot seeding
│   ├── reset-accounts.ts        # future: wipe test account stats
│   └── smoke.sh                 # bring-up validation
│
└── artifacts/                   # gitignored, auto-pruned 7d
    └── <run-id>/...
```

## Dependencies

New `tests/soak/package.json`:

```json
{
  "name": "golf-soak",
  "private": true,
  "scripts": {
    "soak": "tsx runner.ts",
    "soak:populate": "tsx runner.ts --scenario=populate",
    "soak:stress": "tsx runner.ts --scenario=stress",
    "seed": "tsx scripts/seed-accounts.ts",
    "smoke": "scripts/smoke.sh"
  },
  "dependencies": {
    "playwright-core": "^1.40.0",
    "ws": "^8.16.0"
  },
  "devDependencies": {
    "tsx": "^4.7.0",
    "@types/ws": "^8.5.0",
    "@types/node": "^20.10.0",
    "typescript": "^5.3.0"
  }
}
```

Two runtime deps: `playwright-core` (already in `tests/e2e/`) and `ws` (WebSocket for dashboard). `tsx` is dev-only and runs TypeScript directly. No HTTP framework, no bundler, no build step.

## CLI flags

```
--scenario=populate|stress     required
--accounts=<n>                 total sessions (default: scenario.needs.accounts)
--rooms=<n>                    default from scenario.needs
--cpus-per-room=<n>            default from scenario.needs
--games-per-room=<n>           default from scenario.defaultConfig
--holes=<n>                    default from scenario.defaultConfig
--watch=none|dashboard|tiled   default: dashboard
--dashboard-port=<n>           default: 7777
--target=<url>                 default: TEST_URL env or http://localhost:8000
--run-id=<string>              default: ISO timestamp
--list                         print available scenarios and exit
--dry-run                      validate config without running
```

Derived: `accounts-per-room = accounts / rooms`. Must divide evenly; runner errors out with a clear message if not.

Config precedence: CLI flags → environment variables → scenario `defaultConfig` → runner defaults.

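The precedence chain and the even-division check could be implemented along these lines. `resolveConfig` and `accountsPerRoom` are hypothetical names, and the layers are modelled as plain key/value records for brevity.

```typescript
type ConfigLayer = Record<string, unknown>;

// Merges layers so the highest-precedence defined value wins:
// CLI flags > environment variables > scenario defaultConfig > runner defaults.
function resolveConfig(
  cli: ConfigLayer,
  env: ConfigLayer,
  scenarioDefaults: ConfigLayer,
  runnerDefaults: ConfigLayer,
): ConfigLayer {
  const merged: ConfigLayer = {};
  // Apply lowest precedence first so later layers overwrite earlier ones.
  for (const layer of [runnerDefaults, scenarioDefaults, env, cli]) {
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

// Derives accounts-per-room and fails loudly when the split is uneven.
function accountsPerRoom(accounts: number, rooms: number): number {
  if (!Number.isInteger(rooms) || rooms <= 0) {
    throw new Error(`--rooms must be a positive integer, got ${rooms}`);
  }
  if (!Number.isInteger(accounts) || accounts % rooms !== 0) {
    throw new Error(`--accounts=${accounts} does not divide evenly across --rooms=${rooms}`);
  }
  return accounts / rooms;
}
```

Skipping `undefined` values matters here: an unset CLI flag should fall through to the next layer rather than erase a default.
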
## Meta-testing

### Unit tests (Vitest, minimal)

- `room-coordinator.ts`: announce/await correctness, timeout behavior
- `watchdog.ts`: fires on timeout, resets on heartbeat, cancels cleanly
- `config.ts`: CLI precedence, required field validation

### Bring-up smoke test (`tests/soak/scripts/smoke.sh`)

Runs against local dev server with minimum viable config:
```bash
TEST_URL=http://localhost:8000 \
npm run soak -- \
  --scenario=populate \
  --accounts=2 \
  --rooms=1 \
  --cpus-per-room=0 \
  --games-per-room=1 \
  --holes=1 \
  --watch=none
```

Exit 0 = full harness works end-to-end. ~30 seconds. Run after any change.

### Manual validation checklist

Documented in `tests/soak/CHECKLIST.md`:
- [ ] Seed 16 accounts against staging using the invite code
- [ ] `--scenario=populate --rooms=1 --games-per-room=1` completes cleanly
- [ ] `--scenario=populate --rooms=4 --games-per-room=1`: 4 rooms in parallel, no cross-contamination
- [ ] `--watch=dashboard` opens browser, grid renders, progress updates
- [ ] Click a player tile → live video appears, Esc → stops
- [ ] `--watch=tiled` opens 4 browser windows in 2×2 grid
- [ ] Ctrl-C during a run → graceful shutdown, summary printed, exit 2
- [ ] Kill the target server mid-run → runner detects, captures artifacts, exits 1
- [ ] Stats query `?include_test=false` hides soak accounts, `?include_test=true` shows them
- [ ] Full stress run (`--scenario=stress --games-per-room=10`): no console errors, all rooms complete

## Implementation order

Sequenced so each step produces something demonstrable before moving on. The writing-plans skill will break this into concrete tasks.

1. **Server-side changes**: schema alters, register flow, stats filter, admin badge. Independent, ships first, unblocks local testing.
2. **Scaffold `tests/soak/`**: package.json, tsconfig, core/types, logger. No behavior yet.
3. **`SessionPool` + `scripts/seed-accounts.ts`**: end-to-end auth: seed, cache, load, validate login.
4. **`RoomCoordinator` + minimal `populate` scenario body**: proves multi-room orchestration.
5. **`runner.ts`**: CLI, config merging, scenario loading, top-level error handling.
6. **`--watch=none` works**: runs against local dev, produces clean logs, exits 0. First end-to-end milestone.
7. **`--watch=dashboard` status grid**: HTML + WS + tile updates (no video yet).
8. **CDP screencast / click-to-watch**: the live video feature.
9. **`--watch=tiled` mode**: native windows via `page.evaluate(window.moveTo)`.
10. **`stress` scenario**: chaos injection, rapid games.
11. **Failure handling**: watchdog, artifact capture, graceful shutdown.
12. **Smoke test script + CHECKLIST.md**: validation.
13. **Run against staging for real**: populate scoreboard, hunt bugs, report findings.

If step 6 takes longer than planned, steps 1–5 are still useful standalone.

## Out of scope for MVP

- Mobile viewport scenarios (future `mobile-populate`)
- Reconnect-storm scenarios
- Admin workflow scenarios
- Concurrent scenario execution
- Distributed runner
- Grafana / OTEL / custom metrics push
- Test account stat reset tooling
- Auto-promoting stress findings into Playwright regression tests
- New admin endpoints for account marking
- Admin UI for flagging invites as test-seed

All of these are cheap to add later because the scenario interface and session pool don't presuppose them.

## Open questions (to resolve during implementation)

1. **localStorage auth key**: exact keys used by `client/app.js` to persist the JWT and user blob; verified by reading the file during step 3.
2. **Chaos event set for `stress` scenario**: finalize which chaos events are in scope for MVP vs added incrementally (start with rapid clicks + tab nav + `setOffline`, add more as the server proves robust).
3. **CDP screencast frame rate tuning**: start at `everyNthFrame: 2` (~15fps), adjust down if bandwidth/CPU is excessive on long runs.
4. **Screen bounds detection for `tiled` mode**: default to 1920×1080, expose override via `--tiled-bounds=WxH`; auto-detect later if useful.