golfgame/docs/v2/V2_07_PRODUCTION.md
Aaron D. Lee bea85e6b28 Huge v2 uplift, now deployable with real user management and tooling!
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 11:32:15 -05:00

# V2_07: Production Deployment & Operations
> **Scope**: Docker, deployment, health checks, monitoring, security, rate limiting
> **Dependencies**: All other V2 documents
> **Complexity**: High (DevOps/Infrastructure)
---
## Overview
Production readiness requires:
- **Containerization**: Docker images for consistent deployment
- **Health Checks**: Liveness and readiness probes
- **Monitoring**: Metrics, logging, error tracking
- **Security**: HTTPS, headers, secrets management
- **Rate Limiting**: API protection from abuse (Phase 1 priority)
- **Graceful Operations**: Zero-downtime deploys, proper shutdown
---
## 1. Docker Configuration
### Application Dockerfile
```dockerfile
# Dockerfile
FROM python:3.11-slim AS base

# Set environment
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /app

# Install system dependencies (curl is used by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY server/ ./server/
COPY client/ ./client/

# Create non-root user
RUN useradd --create-home --shell /bin/bash appuser \
    && chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Production Docker Compose
```yaml
# docker-compose.prod.yml
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - DATABASE_URL=postgresql://golf:${DB_PASSWORD}@postgres:5432/golfgame
      - REDIS_URL=redis://redis:6379/0
      - SECRET_KEY=${SECRET_KEY}
      - RESEND_API_KEY=${RESEND_API_KEY}
      - SENTRY_DSN=${SENTRY_DSN}
      - ENVIRONMENT=production
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 3
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
    networks:
      - internal
      - web
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.golf.rule=Host(`golf.example.com`)"
      - "traefik.http.routers.golf.tls=true"
      - "traefik.http.routers.golf.tls.certresolver=letsencrypt"

  worker:
    build:
      context: .
      dockerfile: Dockerfile
    command: python -m arq server.worker.WorkerSettings
    environment:
      - DATABASE_URL=postgresql://golf:${DB_PASSWORD}@postgres:5432/golfgame
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - postgres
      - redis
    # The image's HEALTHCHECK probes the HTTP server, which the worker does not run
    healthcheck:
      disable: true
    deploy:
      replicas: 1
      resources:
        limits:
          memory: 256M
    # The worker must share a network with postgres and redis to reach them
    networks:
      - internal

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: golfgame
      POSTGRES_USER: golf
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U golf -d golfgame"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - internal

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - internal

  traefik:
    image: traefik:v2.10
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
      # The HTTP-01 challenge needs an entry point to answer on
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL}"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt
    networks:
      - web

volumes:
  postgres_data:
  redis_data:
  letsencrypt:

networks:
  internal:
  web:
    external: true
```
---
## 2. Health Checks & Readiness
### Health Endpoint Implementation
```python
# server/health.py
from datetime import datetime, timezone

import asyncpg
import redis.asyncio as redis
from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse

# get_db_pool and get_redis are the project's dependency providers,
# defined elsewhere in the server package.

router = APIRouter(tags=["health"])


def utc_timestamp() -> str:
    """Current UTC time as a timezone-aware ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@router.get("/health")
async def health_check():
    """Basic liveness check - is the app running?"""
    return {"status": "ok", "timestamp": utc_timestamp()}


@router.get("/ready")
async def readiness_check(
    db: asyncpg.Pool = Depends(get_db_pool),
    redis_client: redis.Redis = Depends(get_redis),
):
    """Readiness check - can the app handle requests?"""
    checks = {}
    overall_healthy = True

    # Check database
    try:
        async with db.acquire() as conn:
            await conn.fetchval("SELECT 1")
        checks["database"] = {"status": "ok"}
    except Exception as e:
        checks["database"] = {"status": "error", "message": str(e)}
        overall_healthy = False

    # Check Redis
    try:
        await redis_client.ping()
        checks["redis"] = {"status": "ok"}
    except Exception as e:
        checks["redis"] = {"status": "error", "message": str(e)}
        overall_healthy = False

    # 503 tells the orchestrator to stop routing traffic to this instance
    return JSONResponse(
        content={
            "status": "ok" if overall_healthy else "degraded",
            "checks": checks,
            "timestamp": utc_timestamp(),
        },
        status_code=200 if overall_healthy else 503,
    )


@router.get("/metrics")
async def metrics(
    db: asyncpg.Pool = Depends(get_db_pool),
    redis_client: redis.Redis = Depends(get_redis),
):
    """Expose application metrics for monitoring."""
    async with db.acquire() as conn:
        active_games = await conn.fetchval(
            "SELECT COUNT(*) FROM games WHERE completed_at IS NULL"
        )
        total_users = await conn.fetchval("SELECT COUNT(*) FROM users")
        games_today = await conn.fetchval(
            "SELECT COUNT(*) FROM games WHERE created_at > NOW() - INTERVAL '1 day'"
        )
    connected_players = await redis_client.scard("connected_players")
    return {
        "active_games": active_games,
        "total_users": total_users,
        "games_today": games_today,
        "connected_players": connected_players,
        "timestamp": utc_timestamp(),
    }
```
---
## 3. Rate Limiting (Phase 1 Priority)
Rate limiting is a Phase 1 priority for security. Implement early to prevent abuse.
### Rate Limiter Implementation
```python
# server/ratelimit.py
import hashlib
import time
from typing import Optional

import redis.asyncio as redis
from fastapi import Request


class RateLimiter:
    """Fixed-window rate limiter backed by Redis counters."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def is_allowed(
        self,
        key: str,
        limit: int,
        window_seconds: int,
    ) -> tuple[bool, dict]:
        """Check if request is allowed under rate limit.

        Returns (allowed, info) where info contains:
        - remaining: requests remaining in window
        - reset: seconds until window resets
        - limit: the limit that was applied
        """
        now = int(time.time())
        # One counter per key per window; the window index is part of the key
        window_key = f"ratelimit:{key}:{now // window_seconds}"
        async with self.redis.pipeline(transaction=True) as pipe:
            pipe.incr(window_key)
            pipe.expire(window_key, window_seconds)
            results = await pipe.execute()
        current_count = results[0]
        remaining = max(0, limit - current_count)
        reset = window_seconds - (now % window_seconds)
        info = {
            "remaining": remaining,
            "reset": reset,
            "limit": limit,
        }
        return current_count <= limit, info

    def get_client_key(self, request: Request, user_id: Optional[str] = None) -> str:
        """Generate rate limit key for client."""
        if user_id:
            return f"user:{user_id}"
        # For anonymous users, use IP hash
        client_ip = request.client.host if request.client else "unknown"
        # Trust X-Forwarded-For only behind a trusted reverse proxy
        # (Traefik in this deployment); otherwise clients can spoof it
        forwarded = request.headers.get("X-Forwarded-For")
        if forwarded:
            client_ip = forwarded.split(",")[0].strip()
        # Hash IP for privacy
        return f"ip:{hashlib.sha256(client_ip.encode()).hexdigest()[:16]}"


# Rate limit configurations per endpoint type: (max requests, window in seconds)
RATE_LIMITS = {
    "api_general": (100, 60),        # 100 requests per minute
    "api_auth": (10, 60),            # 10 auth attempts per minute
    "api_create_room": (5, 60),      # 5 room creations per minute
    "websocket_connect": (10, 60),   # 10 WS connections per minute
    "email_send": (3, 300),          # 3 emails per 5 minutes
}
```
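The Redis limiter above counts requests per fixed window: the key embeds `now // window_seconds`, so every counter resets at the same wall-clock boundary. The sketch below mirrors that arithmetic in memory, which may be handy for unit tests that should not depend on Redis. `InMemoryFixedWindowLimiter` and its `clock` parameter are hypothetical names, not part of the codebase.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class InMemoryFixedWindowLimiter:
    """In-memory stand-in that follows the same fixed-window arithmetic."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so tests can control time
        self._counts: Dict[str, int] = defaultdict(int)

    def is_allowed(
        self, key: str, limit: int, window_seconds: int
    ) -> Tuple[bool, dict]:
        now = int(self._clock())
        # Same key layout as the Redis version: one counter per window index
        window_key = f"ratelimit:{key}:{now // window_seconds}"
        self._counts[window_key] += 1
        current = self._counts[window_key]
        info = {
            "remaining": max(0, limit - current),
            "reset": window_seconds - (now % window_seconds),
            "limit": limit,
        }
        return current <= limit, info
```

Unlike the Redis counters, old windows are never expired here, so this is only suitable for short-lived tests. Note also that a fixed window permits a burst of up to twice the limit across a window boundary; a sliding window or token bucket would smooth that out if it matters.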
### Rate Limit Middleware
```python
# server/middleware.py
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

from server.ratelimit import RATE_LIMITS, RateLimiter


class RateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, rate_limiter: RateLimiter):
        super().__init__(app)
        self.limiter = rate_limiter

    async def dispatch(self, request: Request, call_next):
        # Determine rate limit tier based on path
        path = request.url.path
        if path.startswith("/api/auth"):
            limit, window = RATE_LIMITS["api_auth"]
        elif path == "/api/rooms" and request.method == "POST":
            # Only room creation gets the strict limit, not reads of the same path
            limit, window = RATE_LIMITS["api_create_room"]
        elif path.startswith("/api"):
            limit, window = RATE_LIMITS["api_general"]
        else:
            # No rate limiting for static files
            return await call_next(request)

        # Get user ID if authenticated (set by an earlier auth layer, if any)
        user_id = getattr(request.state, "user_id", None)
        client_key = self.limiter.get_client_key(request, user_id)
        allowed, info = await self.limiter.is_allowed(
            f"{path}:{client_key}", limit, window
        )

        if allowed:
            response = await call_next(request)
        else:
            response = JSONResponse(
                status_code=429,
                content={
                    "error": "Rate limit exceeded",
                    "retry_after": info["reset"],
                },
            )
            response.headers["Retry-After"] = str(info["reset"])

        # Add rate limit headers to every rate-limited response
        response.headers["X-RateLimit-Limit"] = str(info["limit"])
        response.headers["X-RateLimit-Remaining"] = str(info["remaining"])
        response.headers["X-RateLimit-Reset"] = str(info["reset"])
        return response
```
### WebSocket Rate Limiting
```python
# In server/main.py
from fastapi import WebSocket, WebSocketDisconnect


async def websocket_endpoint(websocket: WebSocket):
    # WebSocket exposes .client and .headers like Request, so the same key logic applies
    client_key = rate_limiter.get_client_key(websocket)
    allowed, info = await rate_limiter.is_allowed(
        f"ws_connect:{client_key}",
        *RATE_LIMITS["websocket_connect"]
    )
    if not allowed:
        # Closing before accept() rejects the handshake
        await websocket.close(code=1008, reason="Rate limit exceeded")
        return

    # Also rate limit messages within the connection
    message_limiter = ConnectionMessageLimiter(
        max_messages=30,
        window_seconds=10
    )
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_text()
            if not message_limiter.check():
                await websocket.send_json({
                    "type": "error",
                    "message": "Slow down! Too many messages."
                })
                continue
            await handle_message(websocket, data)
    except WebSocketDisconnect:
        pass
```
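The handler above uses `ConnectionMessageLimiter` without defining it. A minimal sketch of what such a class could look like, assuming a per-connection, in-process sliding window is sufficient. The constructor arguments mirror the call above; the `clock` parameter is a hypothetical addition to make the class testable.

```python
import time
from collections import deque
from typing import Callable, Deque


class ConnectionMessageLimiter:
    """Per-connection sliding-window message limiter (in-process, no Redis)."""

    def __init__(
        self,
        max_messages: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._timestamps: Deque[float] = deque()

    def check(self) -> bool:
        """Record one message; return False if it would exceed the limit."""
        now = self._clock()
        # Drop timestamps that have aged out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_messages:
            return False
        self._timestamps.append(now)
        return True
```

Rejected messages are not recorded in this sketch, so a client that keeps sending while throttled is allowed again as soon as its older messages leave the window. Because the state lives on the connection object, it needs no cleanup beyond the connection closing.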
---
## 4. Security Headers & HTTPS
### Security Middleware
```python
# server/security.py
from starlette.middleware.base import BaseHTTPMiddleware


class SecurityHeadersMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        response = await call_next(request)

        # Security headers
        response.headers["X-Content-Type-Options"] = "nosniff"
        response.headers["X-Frame-Options"] = "DENY"
        # Legacy header that modern browsers ignore; the CSP below is the real protection
        response.headers["X-XSS-Protection"] = "1; mode=block"
        response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
        response.headers["Permissions-Policy"] = "geolocation=(), microphone=(), camera=()"

        # Content Security Policy
        csp = "; ".join([
            "default-src 'self'",
            "script-src 'self'",
            "style-src 'self' 'unsafe-inline'",  # For inline styles
            "img-src 'self' data:",
            "font-src 'self'",
            "connect-src 'self' wss://*.example.com",
            "frame-ancestors 'none'",
            "base-uri 'self'",
            "form-action 'self'",
        ])
        response.headers["Content-Security-Policy"] = csp

        # HSTS (HTTPS only). Behind Traefik the app itself sees plain HTTP,
        # so also honor the X-Forwarded-Proto header set by the proxy.
        forwarded_proto = request.headers.get("X-Forwarded-Proto", "")
        if request.url.scheme == "https" or forwarded_proto == "https":
            response.headers["Strict-Transport-Security"] = (
                "max-age=31536000; includeSubDomains; preload"
            )
        return response
```
### CORS Configuration
```python
# server/main.py
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://golf.example.com",
        "https://www.golf.example.com",
    ],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
)
```
---
## 5. Error Tracking with Sentry
### Sentry Integration
```python
# server/main.py
import os

import sentry_sdk
from sentry_sdk.integrations.asyncpg import AsyncPGIntegration
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.redis import RedisIntegration

SENSITIVE_HEADERS = {"authorization", "cookie"}


def filter_sensitive_data(event, hint):
    """Remove sensitive data before sending to Sentry."""
    request = event.get("request")
    if request and isinstance(request.get("headers"), dict):
        # Header names can arrive in any case, so compare case-insensitively
        request["headers"] = {
            name: value
            for name, value in request["headers"].items()
            if name.lower() not in SENSITIVE_HEADERS
        }
    return event


# Runs after filter_sensitive_data is defined so the name exists at init time
if os.getenv("SENTRY_DSN"):
    sentry_sdk.init(
        dsn=os.getenv("SENTRY_DSN"),
        environment=os.getenv("ENVIRONMENT", "development"),
        traces_sample_rate=0.1,  # 10% of transactions for performance
        profiles_sample_rate=0.1,
        integrations=[
            FastApiIntegration(transaction_style="endpoint"),
            RedisIntegration(),
            AsyncPGIntegration(),
        ],
        # Filter out sensitive data
        before_send=filter_sensitive_data,
    )
```
### Custom Error Handler
```python
# server/errors.py
import logging

import sentry_sdk
from fastapi import Request
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)


async def global_exception_handler(request: Request, exc: Exception):
    """Handle all unhandled exceptions."""
    # Log to Sentry
    sentry_sdk.capture_exception(exc)
    # Log locally, including the traceback of the exception being handled
    logger.error("Unhandled exception: %s", exc, exc_info=exc)
    # Return generic error to client
    return JSONResponse(
        status_code=500,
        content={
            "error": "Internal server error",
            # Set by RequestIDMiddleware; absent if the failure happened before it ran
            "request_id": getattr(request.state, "request_id", None),
        },
    )


# Register handler (in server/main.py, where `app` is defined)
app.add_exception_handler(Exception, global_exception_handler)
```
---
## 6. Structured Logging
### Logging Configuration
```python
# server/logging_config.py
import json
import logging
import os
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Format logs as JSON for aggregation."""

    def format(self, record):
        log_data = {
            # Use the record's own creation time, not the time of formatting
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Add extra fields when the caller or a filter attached them
        for field in ("request_id", "user_id", "game_id"):
            if hasattr(record, field):
                log_data[field] = getattr(record, field)
        # Add exception info
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_data)


def setup_logging():
    """Configure application logging."""
    handler = logging.StreamHandler()
    if os.getenv("ENVIRONMENT") == "production":
        handler.setFormatter(JSONFormatter())
    else:
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        ))
    logging.root.handlers = [handler]
    logging.root.setLevel(logging.INFO)
    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("websockets").setLevel(logging.WARNING)
```
### Request ID Middleware
```python
# server/middleware.py
import uuid

from starlette.middleware.base import BaseHTTPMiddleware


class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse the caller's ID when present so traces span services
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        request.state.request_id = request_id
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
```
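The JSON formatter in this section only emits `request_id` when the log record carries that attribute, and the middleware above stores the ID on `request.state` rather than on log records. One way to connect the two, sketched here under the assumption that a context variable is acceptable for this codebase, is a logging filter that copies the current request ID onto every record. `request_id_var` and `RequestIDFilter` are hypothetical names.

```python
import contextvars
import logging

# Holds the ID of the request currently being handled; empty outside a request
request_id_var = contextvars.ContextVar("request_id", default="")


class RequestIDFilter(logging.Filter):
    """Attach the current request ID (if any) to each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        request_id = request_id_var.get()
        if request_id:
            record.request_id = request_id
        return True  # never drop the record, only annotate it
```

To use it, the middleware would call `request_id_var.set(request_id)` before `call_next` and the logging setup would call `handler.addFilter(RequestIDFilter())`. Whether the value is visible in every downstream task depends on how the framework propagates context, so this is worth verifying against the actual middleware stack.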
---
## 7. Graceful Shutdown
### Shutdown Handler
```python
# server/main.py
import asyncio

shutdown_event = asyncio.Event()

# No custom signal handlers are registered here: uvicorn already handles
# SIGTERM and SIGINT and then runs the shutdown hook below. Overriding them
# with loop.add_signal_handler() would replace uvicorn's handlers and keep
# the server from exiting.


@app.on_event("shutdown")
async def shutdown():
    logger.info("Shutdown initiated...")
    # Stop accepting new connections
    shutdown_event.set()
    # Save all active games to Redis
    await save_all_active_games()
    # Close WebSocket connections gracefully
    for ws in list(active_connections):
        try:
            await ws.close(code=1001, reason="Server shutting down")
        except Exception:
            # The socket may already be closed; nothing more to do
            pass
    # Give in-flight requests a short grace period (5 seconds)
    await asyncio.sleep(5)
    # Close database pool
    await db_pool.close()
    # Close Redis connections
    await redis_client.close()
    logger.info("Shutdown complete")


async def save_all_active_games():
    """Persist all active games before shutdown."""
    # Copy the items so games ending mid-shutdown do not break iteration
    for game_id, game in list(active_games.items()):
        try:
            await state_cache.save_game(game)
            logger.info(f"Saved game {game_id}")
        except Exception as e:
            logger.error(f"Failed to save game {game_id}: {e}")
```
---
## 8. Secrets Management
### Environment Configuration
```python
# server/config.py
# Pydantic v1 import. On Pydantic v2, BaseSettings lives in the separate
# pydantic-settings package: `from pydantic_settings import BaseSettings`.
from pydantic import BaseSettings, PostgresDsn, RedisDsn


class Settings(BaseSettings):
    # Database
    database_url: PostgresDsn

    # Redis
    redis_url: RedisDsn

    # Security
    secret_key: str
    jwt_algorithm: str = "HS256"
    jwt_expiry_hours: int = 24

    # Email
    resend_api_key: str
    email_from: str = "Golf Game <noreply@golf.example.com>"

    # Monitoring
    sentry_dsn: str = ""
    environment: str = "development"

    # Rate limiting
    rate_limit_enabled: bool = True

    class Config:
        env_file = ".env"
        case_sensitive = False


# Fails fast at import time if a required variable is missing
settings = Settings()
```
### Production Secrets (Example for Docker Swarm)
```yaml
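# docker-compose.prod.yml
secrets:
  db_password:
    external: true
  secret_key:
    external: true
  resend_api_key:
    external: true

services:
  app:
    secrets:
      - db_password
      - secret_key
      - resend_api_key
    environment:
      # Secrets are mounted as files under /run/secrets/. Note that
      # password_file is not a standard PostgreSQL URL parameter, so the
      # application has to read the file and supply the password itself.
      - DATABASE_URL=postgresql://golf@postgres:5432/golfgame?password_file=/run/secrets/db_password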
```
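Swarm secrets arrive as files under `/run/secrets/`, while the `Settings` class in this section reads environment variables, so something has to bridge the two. A minimal sketch of one option, assuming the secret names above and the connection details from the Compose file; `read_secret` and `build_database_url` are hypothetical helpers, not existing project code.

```python
import os
from pathlib import Path
from urllib.parse import quote


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Return a secret from <secrets_dir>/<name>, else from the NAME env var."""
    secret_file = Path(secrets_dir) / name
    if secret_file.is_file():
        # Secret files often end with a newline; strip it
        return secret_file.read_text(encoding="utf-8").strip()
    value = os.environ.get(name.upper())
    if value is None:
        raise RuntimeError(f"Secret {name!r} not found in {secrets_dir} or environment")
    return value


def build_database_url(
    password: str,
    user: str = "golf",
    host: str = "postgres",
    port: int = 5432,
    database: str = "golfgame",
) -> str:
    """Assemble a PostgreSQL URL, percent-encoding the password."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{database}"
```

The environment fallback keeps local development working with a plain `.env` file. Percent-encoding matters because generated passwords frequently contain characters such as `@` or `/` that would otherwise corrupt the URL.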
---
## 9. Database Migrations
### Alembic Configuration
```ini
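# alembic.ini
[alembic]
script_location = migrations
# Placeholder: Alembic does not expand env:// values. Set the real URL in
# migrations/env.py from the DATABASE_URL environment variable, e.g. with
# config.set_main_option("sqlalchemy.url", os.environ["DATABASE_URL"]).
sqlalchemy.url = env://DATABASE_URL

[logging]
level = INFO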
```
### Migration Script Template
```python
# migrations/versions/001_initial.py
"""Initial schema

Revision ID: 001
Create Date: 2024-01-01
"""
from alembic import op
import sqlalchemy as sa

revision = '001'
down_revision = None


def upgrade():
    # Users table
    op.create_table(
        'users',
        sa.Column('id', sa.UUID(), primary_key=True),
        sa.Column('username', sa.String(50), unique=True, nullable=False),
        sa.Column('email', sa.String(255), unique=True, nullable=False),
        sa.Column('password_hash', sa.String(255), nullable=False),
        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.func.now()),
        # server_default puts the default into the table definition itself;
        # a plain Python-side default would not be part of the schema
        sa.Column('is_admin', sa.Boolean(), server_default=sa.false(), nullable=False),
    )

    # Games table
    op.create_table(
        'games',
        sa.Column('id', sa.UUID(), primary_key=True),
        sa.Column('room_code', sa.String(10), nullable=False),
        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.func.now()),
        sa.Column('completed_at', sa.DateTime(timezone=True)),
    )

    # Events table
    op.create_table(
        'events',
        sa.Column('id', sa.BigInteger(), primary_key=True, autoincrement=True),
        sa.Column('game_id', sa.UUID(), sa.ForeignKey('games.id'), nullable=False),
        sa.Column('event_type', sa.String(50), nullable=False),
        sa.Column('data', sa.JSON(), nullable=False),
        sa.Column('timestamp', sa.DateTime(timezone=True), server_default=sa.func.now()),
    )

    # Indexes
    op.create_index('idx_events_game_id', 'events', ['game_id'])
    op.create_index('idx_users_email', 'users', ['email'])
    op.create_index('idx_users_username', 'users', ['username'])


def downgrade():
    # Drop in reverse dependency order: events references games
    op.drop_table('events')
    op.drop_table('games')
    op.drop_table('users')
```
### Migration Commands
```bash
# Create new migration
alembic revision --autogenerate -m "Add user sessions"
# Run migrations
alembic upgrade head
# Rollback one version
alembic downgrade -1
# Show current version
alembic current
```
---
## 10. Deployment Checklist
### Pre-deployment
- [ ] All environment variables set
- [ ] Database migrations applied
- [ ] Secrets configured in secret manager
- [ ] SSL certificates provisioned
- [ ] Rate limiting configured and tested
- [ ] Error tracking (Sentry) configured
- [ ] Logging aggregation set up
- [ ] Health check endpoints verified
- [ ] Backup strategy implemented
### Deployment
- [ ] Run database migrations
- [ ] Deploy new containers with rolling update
- [ ] Verify health checks pass
- [ ] Monitor error rates in Sentry
- [ ] Check application logs
- [ ] Verify WebSocket connections work
- [ ] Test critical user flows
### Post-deployment
- [ ] Monitor performance metrics
- [ ] Check database connection pool usage
- [ ] Verify Redis memory usage
- [ ] Review error logs
- [ ] Test graceful shutdown/restart
---
## 11. Monitoring Dashboard (Grafana)
### Key Metrics to Track
```yaml
# Example Prometheus metrics
metrics:
  # Application
  - http_requests_total
  - http_request_duration_seconds
  - websocket_connections_active
  - games_active
  - games_completed_total

  # Infrastructure
  - container_cpu_usage_seconds_total
  - container_memory_usage_bytes
  - pg_stat_activity_count
  - redis_connected_clients
  - redis_used_memory_bytes

  # Business
  - users_registered_total
  - games_played_today
  - average_game_duration_seconds
```
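Prometheus scrapes a plain-text exposition format, whereas the `/metrics` endpoint in section 2 returns JSON. As a rough sketch of the gap, the function below renders a dictionary of numeric values as gauge lines in that text format. `render_prometheus` and its `prefix` parameter are hypothetical; a real deployment would more likely use an existing client library such as `prometheus_client` than hand-rolled output.

```python
from typing import List, Mapping, Union

Number = Union[int, float]


def render_prometheus(gauges: Mapping[str, Number], prefix: str = "") -> str:
    """Render name/value pairs as gauges in the Prometheus text format."""
    lines: List[str] = []
    for name, value in sorted(gauges.items()):
        metric = f"{prefix}{name}"
        # Each metric gets a TYPE line followed by its sample line
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    # The exposition format expects a trailing newline
    return "\n".join(lines) + "\n"
```

Only point-in-time values such as `games_active` fit this gauge-only sketch. Counters and histograms such as `http_requests_total` and `http_request_duration_seconds` need state that accumulates across requests, which is where a client library earns its keep.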
### Alert Rules
```yaml
# Prometheus alerting rules (notifications are routed by Alertmanager)
groups:
  - name: golf-alerts
    rules:
      - alert: HighErrorRate
        # Fires when any series sustains more than 0.1 5xx responses per second
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: DatabaseConnectionExhausted
        expr: pg_stat_activity_count > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database connections near limit"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage above 90%"
```
---
## 12. Backup Strategy
### Database Backups
```bash
#!/bin/bash
# backup.sh - Daily database backup
# Abort on any error, including a failed pg_dump inside the pipeline,
# so a broken dump is never uploaded.
set -euo pipefail

BACKUP_DIR=/backups
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/golfgame_${DATE}.sql.gz"

# Backup with pg_dump (credentials come from PGPASSWORD or ~/.pgpass)
pg_dump -h postgres -U golf golfgame | gzip > "$BACKUP_FILE"

# Upload to S3/B2/etc
aws s3 cp "$BACKUP_FILE" s3://golf-backups/

# Cleanup old local backups (keep 7 days)
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete

# Cleanup old S3 backups (keep 30 days) via lifecycle policy
```
### Redis Persistence
```conf
# redis.conf
appendonly yes
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```
---
## Summary
This document covers all production deployment concerns:
1. **Docker**: Non-root image, health checks, resource limits
2. **Rate Limiting**: Fixed-window counters in Redis, per-endpoint limits (Phase 1 priority)
3. **Security**: Headers, CORS, CSP, HSTS
4. **Monitoring**: Sentry, structured logging, Prometheus metrics
5. **Operations**: Graceful shutdown, migrations, backups
6. **Deployment**: Checklist, rolling updates, health verification
Rate limiting is implemented in Phase 1 as a security priority to protect against abuse before public launch.