docs(CLAUDE.md): staging deploy verification checklist

Encodes the lessons from the v3.3.5 → v3.3.5.1 hotfix cascade: CI green is necessary but not sufficient. Walk the chain: clean worktree → tag sha matches → container up recently → new code introspectable → env vars present → DB state correct → end-to-end smoke. Each step calls out a specific failure mode we just hit, so future-me doesn't assume the next deploy will 'just work' when the primitives underneath (git fetch tag-cache, compose env wiring, image reuse) can silently skip changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(ci): force-update tags on deploy fetch
2026-04-18 14:06:37 -04:00 · 2026-04-18 14:02:06 -04:00
3 changed files with 60 additions and 4 deletions
--- a/.gitea/workflows/deploy-prod.yml
+++ b/.gitea/workflows/deploy-prod.yml
@@ -29,8 +29,10 @@ jobs:
            docker pull "$IMAGE:$TAG"
            docker tag "$IMAGE:$TAG" golfgame-app:latest
-            # Update code for compose/env changes
-            git fetch origin
+            # Update code for compose/env changes. `--tags --force` so a
+            # moved tag (hotfix on top of existing version) updates locally
+            # instead of silently checking out the stale cached position.
+            git fetch origin --tags --force
            git checkout "$TAG"
            # Restart app
--- a/.gitea/workflows/deploy-staging.yml
+++ b/.gitea/workflows/deploy-staging.yml
@@ -21,8 +21,11 @@ jobs:
            cd /opt/golfgame
-            # Pull latest code and checkout the release tag
-            git fetch origin
+            # Pull latest code and checkout the release tag. `--tags --force`
+            # so that a tag moved on origin (e.g. hotfix on top of an existing
+            # version) actually updates locally instead of silently reusing a
+            # stale cached tag position.
+            git fetch origin --tags --force
            git checkout "$TAG"
            # Build the image
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -166,6 +166,57 @@ python server/simulate.py 100 --compare
 - **Middleware** (`server/middleware/`): Security headers, request ID tracking, rate limiting
 - **Handlers** (`server/handlers.py`): WebSocket message dispatch (extracted from main.py)
 ## Staging Deploy Verification Checklist
 After any release that triggers a staging deploy (via `.gitea/workflows/deploy-staging.yml`), do NOT trust "CI went green": walk the full chain end-to-end. A green CI run does not prove the deploy did what you intended. A plain `git fetch` won't move a tag that already exists locally, the compose yaml and `.env` can drift out of sync, and a container can keep running a previously built image with no visible signal. The v3.3.5 → v3.3.5.1 saga cost us two releases because each of these bit in turn.
 **Run through every step before declaring a staging deploy successful:**
 1. **Worktree is clean on staging BEFORE cutting the release.**
   ```bash
   ssh root@staging.golfcards.club 'cd /opt/golfgame && git status --short'
   ```
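   Must be empty. Modified tracked files, or untracked files that collide with paths in the new tag, will abort `git checkout $TAG` mid-pipeline. If you ever scp files to staging for hot-patching, either land those changes on main and commit before the release, or run `git reset --hard HEAD && git clean -fd` on staging first (this discards every uncommitted change and untracked file there).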
 2. **Staging is actually at the new tag, not a stale cached position.**
   ```bash
   ssh root@staging.golfcards.club 'cd /opt/golfgame && git rev-parse HEAD && git log --oneline -1'
   ```
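   Compare the sha to `git rev-parse "v3.x.y^{commit}"` locally (the `^{commit}` suffix resolves an annotated tag to the commit it points at, which is what `HEAD` reports on staging). If they differ, the most likely cause is a tag that was moved on origin while staging kept its stale local copy. The workflows now run `git fetch origin --tags --force`, but verify anyway.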
 3. **Container is running the new image (not a recycled old one).**
   ```bash
   ssh root@staging.golfcards.club 'docker ps --format "{{.Names}} {{.Status}}"; docker inspect golfgame-app-1 --format "{{.Created}}"'
   ```
   Expect `Up X seconds/minutes` with a recent `.Created` time, not `Up 13 hours`. A compose restart picks up the *current* `:latest` image, so confirm that image was built by this release, not a prior one.
 4. **New code is actually in the container.** Introspect a signature/attribute that changed in this release:
   ```bash
   ssh root@staging.golfcards.club 'docker exec golfgame-app-1 python -c "
   import sys, inspect; sys.path.insert(0, \"/app/server\")
   from services.game_logger import GameLogger
   print(inspect.signature(GameLogger.log_game_start_async).parameters.keys())
   "'
   ```
 5. **Container env has every var the code reads.** Any config added to `server/config.py` this release needs TWO edits to flow through: `.env` on the host AND `- FOO=${FOO:-default}` in the compose yaml's `environment:` block. Setting only `.env` silently does nothing.
   ```bash
   ssh root@staging.golfcards.club 'docker exec golfgame-app-1 printenv | grep -iE "YOUR_NEW_VAR"'
   ```
 6. **DB schema + invariants hold.** Sample the tables this release touches and confirm the new columns/values look right:
   ```bash
   ssh root@staging.golfcards.club 'docker exec golfgame-postgres-1 psql -U golf -d golf -c "SELECT status, COUNT(*) FROM games_v2 GROUP BY status;"'
   ```
 7. **End-to-end smoke.** For a feature visible through the API, curl it and verify the response shape and content match expectations:
   ```bash
   curl -s 'https://staging.golfcards.club/api/stats/leaderboard?metric=wins' | python3 -m json.tool
   ```
   For features that only fire on specific game events (GAME_OVER, abandonment, etc.), run a soak game or manual repro and re-check the DB — don't assume "code is deployed" = "code has executed."
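Step 2's comparison can be scripted so that an empty result (for example, a failed `ssh`) is never mistaken for a match. This is a minimal sketch: `same_commit` is a hypothetical helper name, not something that exists in this repo, and `TAG` stands for whatever tag you just released.

```bash
# Hypothetical helper: succeeds only when both arguments are the same
# non-empty sha. The non-empty check matters because two failed lookups
# would otherwise compare equal as empty strings.
same_commit() {
  local expected="$1" actual="$2"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage sketch (host and path are the ones used in the checklist above):
#   expected=$(git rev-parse "${TAG}^{commit}")
#   actual=$(ssh root@staging.golfcards.club 'cd /opt/golfgame && git rev-parse HEAD')
#   same_commit "$expected" "$actual" || echo "staging is not at $TAG" >&2
```

The helper only compares strings; fetching the two shas stays in the caller so each lookup can fail visibly on its own.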
 **If any step fails, stop and diagnose before running the next release.** Cascading hotfixes amplify the problem — each force-moved tag is another chance for the runner's cache to lie to you.
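The two-edit rule from step 5 can also be checked on the host before cutting a release. This is a rough sketch that assumes the compose file uses the unquoted list form shown in step 5 (`- FOO=${FOO:-default}`). `var_is_wired` and the file paths in the usage comment are hypothetical. The check is deliberately coarse: it does not confirm that the entry sits under the right service's `environment:` block.

```bash
# Hypothetical helper: succeeds only if VAR is defined in the .env file AND
# forwarded by a list entry in the compose file.
var_is_wired() {
  local var="$1" env_file="$2" compose_file="$3"

  # .env must define the variable on a line that starts with VAR=
  grep -qE "^${var}=" "$env_file" || {
    echo "$var is missing from $env_file" >&2
    return 1
  }

  # compose must forward it, either as "- VAR=..." or as a bare "- VAR"
  grep -qE "^[[:space:]]*-[[:space:]]*${var}(=|[[:space:]]*\$)" "$compose_file" || {
    echo "$var is not forwarded in $compose_file" >&2
    return 1
  }
}

# Usage sketch (paths are placeholders, adjust to the real files):
#   var_is_wired YOUR_NEW_VAR path/to/.env path/to/compose.yml
```

A passing check only shows that the wiring exists on disk. The `printenv` command in step 5 is still what proves the running container received the value.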
 ## Common Development Tasks
 ### Adjusting Animation Speed