Batched remaining harness tasks (27-30, 33):
Task 27 — Artifact capture on failure: screenshots, HTML snapshots,
game state JSON, and console error tails are captured into
tests/soak/artifacts/<run-id>/ when a scenario throws. Successful
runs get a summary.json. Old runs (>7d) are pruned on startup.
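
The >7d pruning rule could be sketched roughly as below. This is an
illustration only: RunDir and selectRunsToPrune are hypothetical names,
and it assumes a run's age is judged by its directory modification
time, which the commit does not state.

```typescript
// Hypothetical shape: one entry per directory under tests/soak/artifacts/.
interface RunDir {
  runId: string;
  mtimeMs: number; // last-modified time in epoch milliseconds (assumed age source)
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Return the run IDs whose directories are strictly older than maxAgeMs.
function selectRunsToPrune(
  runs: RunDir[],
  nowMs: number,
  maxAgeMs: number = SEVEN_DAYS_MS,
): string[] {
  return runs
    .filter((run) => nowMs - run.mtimeMs > maxAgeMs)
    .map((run) => run.runId);
}
```

Keeping the selection pure (no filesystem calls) makes the cutoff easy
to unit-test; the caller would stat the directories and delete the
returned IDs.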
Task 28 — Graceful shutdown: the first SIGINT/SIGTERM flips the abort
signal (scenarios finish the current turn, then unwind). If cleanup
still hangs 10s later, a hard kill fires. Double Ctrl-C = immediate
exit. Exit codes: 0 success, 1 errors, 2 interrupted.
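
One way to picture the two-stage shutdown is the sketch below. It is
not the harness's code: createShutdownHandler and ShutdownDeps are
hypothetical, dependencies are injected so the logic can be tested
without real signals, and it assumes both the hard kill and the second
Ctrl-C exit with code 2 (the commit only says 2 means interrupted).

```typescript
type ExitCode = 0 | 1 | 2;

// Hypothetical injected dependencies; in Node these would map to
// AbortController.abort, process.exit, and setTimeout.
interface ShutdownDeps {
  abort: () => void;
  exit: (code: ExitCode) => void;
  schedule: (fn: () => void, ms: number) => void;
}

// Returns a handler to register for both SIGINT and SIGTERM.
function createShutdownHandler(
  deps: ShutdownDeps,
  hardKillMs: number = 10_000,
): () => void {
  let interrupted = false;
  return () => {
    if (interrupted) {
      // Second signal (e.g. double Ctrl-C): exit immediately.
      deps.exit(2); // assumed exit code
      return;
    }
    interrupted = true;
    // First signal: ask scenarios to unwind after their current turn.
    deps.abort();
    // Safety net in case cleanup hangs.
    deps.schedule(() => deps.exit(2), hardKillMs); // assumed exit code
  };
}
```

With real dependencies, the hard-kill timer would normally be unref'd
so that a clean, fast shutdown is not held open for the full 10s.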
Task 29 — Periodic health probes: every 30s GET /health against the
target server. Three consecutive failures abort the run with
health_fatal, preventing staging outages from being misattributed
to harness bugs. Corrected endpoint from /api/health to /health
per server/routers/health.py.
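
The "three consecutive failures" rule amounts to a small counter that
resets on any success. A minimal sketch, with HealthTracker as a
hypothetical name rather than the harness's actual class:

```typescript
// Tracks consecutive probe failures; a single success resets the streak.
class HealthTracker {
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures: number = 3) {
    this.maxFailures = maxFailures;
  }

  // Record one probe result. Returns true once the run should abort
  // (the point at which the harness would report health_fatal).
  record(ok: boolean): boolean {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.maxFailures;
  }
}
```

A caller would run this from a 30s interval, feeding it whether the
GET /health request succeeded, and trigger the abort signal when
record() returns true.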
Task 30 — Smoke test script: tests/soak/scripts/smoke.sh, a 60s
end-to-end canary that health-probes the target, seeds if needed,
and runs one minimal populate game.
Task 33 — Version bump to v3.3.4: both index.html footers (were
v3.1.6), a new footer in admin.html (had none), and pyproject.toml.
Also includes fixes for bugs discovered during stress testing:
- SessionPool sets baseURL on all contexts so relative goto('/')
resolves correctly between games (was "invalid URL" error)
- RoomCoordinator key is now unique per game-start (Date.now()
suffix) so Deferred promises don't carry stale room codes over
from previous games
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Watchdog class with 4 Vitest tests (27 total now), wired into
ctx.heartbeat in the runner. One watchdog per room with a 60s
timeout; firing logs an error, marks the room's dashboard tile
as errored, and triggers the abort signal so the scenario unwinds.
Watchdogs are explicitly stopped in the runner's finally block
so pending timers don't keep the node process alive on exit.
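
For illustration, a heartbeat-reset watchdog along these lines could
look like the sketch below. The method names (start, heartbeat, stop,
isRunning) are assumptions and may not match the real Watchdog class.

```typescript
type TimerHandle = ReturnType<typeof setTimeout>;

// Fires onTimeout if no heartbeat arrives within timeoutMs.
class Watchdog {
  private timer: TimerHandle | undefined = undefined;
  private stopped = false;
  private readonly timeoutMs: number;
  private readonly onTimeout: () => void;

  constructor(timeoutMs: number, onTimeout: () => void) {
    this.timeoutMs = timeoutMs;
    this.onTimeout = onTimeout;
  }

  start(): void {
    this.stopped = false;
    this.arm();
  }

  // Each heartbeat restarts the countdown; ignored once stopped or fired.
  heartbeat(): void {
    if (!this.stopped) this.arm();
  }

  // Clears the pending timer so it cannot keep the process alive.
  stop(): void {
    this.stopped = true;
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }

  isRunning(): boolean {
    return this.timer !== undefined;
  }

  private arm(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.stopped = true;
      this.onTimeout();
    }, this.timeoutMs);
  }
}
```

Calling stop() from a finally block, as the runner does, guarantees
the timer is cleared whether the scenario finished, threw, or was
aborted.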
Also fixes a multi-game bug discovered during stress scenario
verification: after a game ends sessions stay parked on the
game_over screen, which hides the lobby and makes a subsequent
#create-room-btn click time out. runOneMultiplayerGame now
navigates every session to / before each game; localStorage auth
persists, so no session has to log in again.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rapid short games with a parallel chaos loop that has a 5% per-turn
chance of firing one of:
- rapid_clicks: 5 quick clicks at the player's own cards
- tab_blur: window blur/focus event pair
- brief_offline: 300ms network outage via context.setOffline
Chaos counts roll up into ScenarioResult.customMetrics.chaos_fired.
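
The per-turn roll can be sketched as follows. maybePickChaos is a
hypothetical helper, the random source is injected for testability,
and a uniform choice among the three events is an assumption (the
commit only gives the overall 5% chance).

```typescript
type ChaosEvent = "rapid_clicks" | "tab_blur" | "brief_offline";

const CHAOS_EVENTS: readonly ChaosEvent[] = [
  "rapid_clicks",
  "tab_blur",
  "brief_offline",
];

// rng must return a number in [0, 1), like Math.random.
// Returns undefined on most turns, otherwise one event to fire.
function maybePickChaos(
  rng: () => number,
  chance: number = 0.05,
): ChaosEvent | undefined {
  if (rng() >= chance) return undefined;
  // Assumed: each event type is equally likely.
  const index = Math.floor(rng() * CHAOS_EVENTS.length);
  return CHAOS_EVENTS[index];
}
```

In the scenario this would be called once per turn with Math.random,
incrementing a per-event counter whenever it returns an event.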
Important detail: the chaos loop has a 3-second initial delay so room
creation, joiners, and game start can complete without interference.
Chaos during lobby setup (especially brief_offline) was making
#create-room-btn unstable.
Verified: stress smoke with --games-per-room=3, 4 accounts + 1 CPU,
first game completed with 37 turns and chaos events fired across all
three event types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Partitions sessions into N rooms, runs gamesPerRoom games per room
in parallel via Promise.allSettled so a failure in one room never
unwinds the others. Errors roll up into ScenarioResult.errors.
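
A rough sketch of the isolation pattern, assuming round-robin
partitioning and string error messages; partition, runRooms, and the
playGame callback are hypothetical names, not the scenario's real
signatures.

```typescript
// Split items into roomCount groups, round-robin (assumed strategy).
function partition<T>(items: T[], roomCount: number): T[][] {
  if (roomCount < 1) throw new Error("roomCount must be >= 1");
  const rooms: T[][] = Array.from({ length: roomCount }, (): T[] => []);
  items.forEach((item, i) => rooms[i % roomCount].push(item));
  return rooms;
}

// Run gamesPerRoom games in every room. Rooms run in parallel; games
// within a room run in sequence. A rejection in one room is collected
// as an error and does not cancel the other rooms.
async function runRooms<T>(
  rooms: T[][],
  gamesPerRoom: number,
  playGame: (room: T[], gameIndex: number) => Promise<void>,
): Promise<string[]> {
  const settled = await Promise.allSettled(
    rooms.map(async (room) => {
      for (let g = 0; g < gamesPerRoom; g++) {
        await playGame(room, g);
      }
    }),
  );
  const errors: string[] = [];
  settled.forEach((result, roomIndex) => {
    if (result.status === "rejected") {
      errors.push(`room ${roomIndex}: ${String(result.reason)}`);
    }
  });
  return errors;
}
```

Promise.all would reject as soon as the first room failed;
Promise.allSettled waits for every room, which is what lets the
remaining rooms finish their games.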
Verified via tsx: listScenarios() returns [populate], getScenario()
resolves by name and returns undefined for unknown names.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Encapsulates the host-creates/joiners-join/loop-until-done flow so
the populate and stress scenarios don't duplicate it. Honors the
abort signal and a max-duration timeout, and heartbeats on every turn.
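
The loop-until-done part might look roughly like this sketch.
loopUntilDone, LoopOptions, and the outcome names are all
hypothetical, and the clock is injected only to make the timeout
testable.

```typescript
type LoopOutcome = "finished" | "aborted" | "timed_out";

interface LoopOptions {
  signal: AbortSignal;
  maxDurationMs: number;
  playTurn: () => Promise<boolean>; // resolves true once the game is over
  heartbeat: () => void;            // called after every completed turn
  now?: () => number;               // defaults to Date.now
}

async function loopUntilDone(opts: LoopOptions): Promise<LoopOutcome> {
  const now = opts.now ?? Date.now;
  const deadline = now() + opts.maxDurationMs;
  while (true) {
    // Checked between turns, so an in-flight turn always completes.
    if (opts.signal.aborted) return "aborted";
    if (now() >= deadline) return "timed_out";
    const done = await opts.playTurn();
    opts.heartbeat();
    if (done) return "finished";
  }
}
```

Checking the signal only between turns matches the "finish current
turn then unwind" behavior described for graceful shutdown.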
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>