AWS ECS Deploy and Rollback Runbook
Purpose
This document defines the practical release workflow for ECS once the staging account and task definitions exist.
It covers:
- build and publish steps
- staging deployment
- smoke tests
- production rollout
- rollback handling
The goal is to make deployments repeatable and boring.
Deployment Principles
- Every deploy should be traceable to a Git commit.
- Every deploy should produce immutable container images.
- Staging should be deployed before production.
- Rollback should use the previously known-good task definition and image digest.
- Database migrations should be backward compatible where possible.
Build and Publish
Inputs
- Git commit SHA
- application test results
- container build context
- target environment
Output Artifacts
- container image digest
- ECR image tag
- ECS task definition revision
- deployment metadata
Recommended Tagging
Use both:
- a commit-based tag, such as sha-<shortsha>
- a release tag, such as staging-YYYYMMDD or prod-YYYYMMDD
That gives you both exact reproducibility and a human-readable release marker.
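As a minimal sketch of that scheme, the helper below derives both tags from a commit SHA and a release date. The function name build_tags and the seven-character short SHA are assumptions for illustration; this runbook does not prescribe either.

```python
from datetime import date


def build_tags(commit_sha: str, environment: str, release_date: date) -> list[str]:
    """Return [commit tag, release tag] for one image."""
    if environment not in ("staging", "prod"):
        raise ValueError(f"unknown environment: {environment}")
    short_sha = commit_sha[:7]  # assumed short-SHA length
    return [f"sha-{short_sha}", f"{environment}-{release_date:%Y%m%d}"]
```

For example, build_tags("0123456789abcdef", "staging", date(2024, 1, 15)) yields sha-0123456 and staging-20240115.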
Staging Deployment Flow
1. Validate the Build
Before touching ECS:
- run linting
- run unit tests
- build the container image
- verify the image starts locally or in a test job
2. Push to ECR
Push the release image to the staging ECR repository.
Record:
- repository name
- image digest
- image tag
- commit SHA
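The digest is the value worth recording most carefully, because a tag can be moved to a different image while a digest cannot. A small sketch of building a digest-pinned image reference follows. The helper name image_ref_by_digest is hypothetical, and the registry hostname in the usage note is a placeholder.

```python
import re

# A sha256 digest is the literal prefix followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def image_ref_by_digest(registry: str, repository: str, digest: str) -> str:
    """Build an image reference pinned to a digest rather than a mutable tag."""
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{registry}/{repository}@{digest}"
```

With a registry such as 123456789012.dkr.ecr.us-east-1.amazonaws.com and a repository named api, the result has the form registry/api@sha256:<64 hex characters>, which can be used as the image field of a container definition.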
3. Register ECS Task Definitions
Register a new task definition revision for each changed ECS service.
Update:
- image references
- environment variable values
- secrets references
- logging configuration
- health check parameters
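One common way to produce the next revision is to copy the current task definition, swap the image, and register the result. The sketch below assumes the input dict has the shape of the taskDefinition object returned by the boto3 ECS describe_task_definition call. The list of read-only fields is an assumption worth checking against your boto3 version, and the function name is hypothetical.

```python
import copy

# Fields present in describe_task_definition output that
# register_task_definition does not accept as input (assumed list).
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def next_revision_payload(current: dict, container_name: str, image: str) -> dict:
    """Copy a task definition, drop read-only fields, and set one container's image."""
    payload = {
        key: value
        for key, value in copy.deepcopy(current).items()
        if key not in READ_ONLY_FIELDS
    }
    matched = False
    for container in payload["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            matched = True
    if not matched:
        raise KeyError(f"no container named {container_name!r}")
    return payload
```

The returned dict is intended to be passed as keyword arguments, for example ecs.register_task_definition(**payload) with ecs = boto3.client("ecs"). Environment values, secrets references, logging, and health check settings can be adjusted on the same payload before registering.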
4. Deploy to Staging ECS
Update the staging ECS service to use the new task definition revision.
Wait for:
- task startup to complete
- health checks to pass
- target groups to become healthy
- logs to appear in CloudWatch
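The update-and-wait step might look like the sketch below, which takes a boto3 ECS client as an argument. The function name deploy_service is hypothetical. The services_stable waiter only confirms that the deployment has settled and the running count matches the desired count, so target group health and log output still need their own checks.

```python
def deploy_service(ecs, cluster: str, service: str, task_definition: str) -> None:
    """Point a service at a task definition revision and block until it is stable."""
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,
    )
    # Polls describe_services until the service settles; boto3 raises
    # botocore.exceptions.WaiterError if it does not settle in time.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
```

A call such as deploy_service(boto3.client("ecs"), "staging", "api", "api:42") uses placeholder cluster, service, and revision names.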
5. Run Smoke Tests
Smoke tests should confirm:
- API responds
- login works
- database connectivity works
- Redis connectivity works
- worker services start cleanly
- background jobs process as expected
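A smoke test run can be modeled as a set of named checks that each return pass or fail. The runner below is a sketch only: the check names and the checks themselves are supplied by the caller, and nothing here is specific to a particular API, database, or Redis client.

```python
from typing import Callable


def run_smoke_tests(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run every check and return name -> passed; an exception counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (timeout, connection refused) has failed.
            results[name] = False
    return results
```

Running every check instead of stopping at the first failure gives a complete picture in one pass. The deployment passes only if all(results.values()) is true.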
6. Record the Result
Store the result of the staging deployment:
- pass or fail
- task definition revision
- image digest
- timestamp
- any observed warnings
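One possible shape for that record is a small dataclass serialized as a single JSON line, which is easy to append to a log or store as a build artifact. The class and field names are assumptions that mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class DeploymentRecord:
    passed: bool
    task_definition_revision: str
    image_digest: str
    # Defaults to the current UTC time in ISO 8601 form.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    warnings: list[str] = field(default_factory=list)

    def to_json_line(self) -> str:
        """Serialize the record as one JSON object on one line."""
        return json.dumps(asdict(self), sort_keys=True)
```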
Production Rollout Flow
1. Confirm Staging Is Green
Do not promote unless staging has:
- passed smoke tests
- logged clean ECS service events
- demonstrated a successful rollback at least once
2. Promote the Same Artifact
Use the exact same image digest that was validated in staging.
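Do not rebuild the image for production: a rebuilt image has a different digest and has not been validated in staging. The only exception is when the build process itself is the change being promoted.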
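A pre-promotion check can enforce this by comparing digests instead of tags. The sketch below assumes both image references are digest-pinned (repository@sha256:...). The helper names are hypothetical. The repository part may differ between staging and production; only the digest has to match.

```python
def digest_of(image_ref: str) -> str:
    """Return the sha256 digest from a digest-pinned image reference."""
    _, separator, digest = image_ref.partition("@")
    if not separator or not digest.startswith("sha256:"):
        raise ValueError(f"image is not pinned by digest: {image_ref}")
    return digest


def same_artifact(staging_image: str, production_image: str) -> bool:
    """True when both references point at the same image digest."""
    return digest_of(staging_image) == digest_of(production_image)
```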
3. Update Production ECS
Register production task definition revisions from the same release metadata.
Then update the production ECS services in a controlled order:
- backend API
- worker services
- frontend/admin if they are deployed as ECS services
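The controlled order can be expressed as a loop that deploys one service at a time and stops at the first failure, so later services keep their previous revision. This is a sketch: deploy_one stands for whatever per-service deploy function you use, and the service names in the usage note are placeholders.

```python
from typing import Callable


def rollout_in_order(
    ordered_services: list[str],
    deploy_one: Callable[[str], bool],
) -> list[str]:
    """Deploy services in the given order; stop at the first failure.

    Returns the services that were deployed successfully.
    """
    deployed = []
    for service in ordered_services:
        if not deploy_one(service):
            break
        deployed.append(service)
    return deployed
```

For example, rollout_in_order(["backend-api", "worker", "frontend"], deploy_one) never touches the frontend if the worker deployment fails.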
4. Observe the Deployment
Watch for:
- ECS task restarts
- ALB target health
- 5xx errors
- database connectivity errors
- Redis connectivity errors
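For ALB target health specifically, the check can be reduced to a list of targets that are not healthy. The function below is a sketch that operates on a response shaped like the one returned by the boto3 elbv2 describe_target_health call. The function name is hypothetical.

```python
def unhealthy_targets(response: dict) -> list[str]:
    """List target IDs whose state is anything other than 'healthy'."""
    return [
        description["Target"]["Id"]
        for description in response["TargetHealthDescriptions"]
        if description["TargetHealth"]["State"] != "healthy"
    ]
```

Targets in the initial or draining state are also reported. That is expected briefly while tasks are being replaced, so treat a non-empty result as a problem only if it persists.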
Rollback Runbook
Rollback Triggers
Rollback if any of the following happen:
- health checks fail repeatedly
- API smoke tests fail
- workers cannot connect to database or Redis
- elevated 5xx rates appear
- tasks crash-loop on startup
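Writing the triggers down as an explicit function removes judgment calls during an incident. The sketch below returns the list of triggers that fired. Every threshold is a hypothetical default that should be tuned per service, and the DeploymentSignals fields are assumptions about what your monitoring can supply.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSignals:
    failed_health_checks: int
    smoke_tests_passed: bool
    workers_connected: bool
    error_rate_5xx: float  # fraction of requests, 0.0 to 1.0
    task_restarts: int


def rollback_reasons(
    signals: DeploymentSignals,
    max_failed_health_checks: int = 3,  # hypothetical threshold
    max_5xx_rate: float = 0.01,         # hypothetical threshold
    max_task_restarts: int = 2,         # hypothetical threshold
) -> list[str]:
    """Return the rollback triggers that fired; an empty list means keep going."""
    reasons = []
    if signals.failed_health_checks >= max_failed_health_checks:
        reasons.append("health checks failing repeatedly")
    if not signals.smoke_tests_passed:
        reasons.append("API smoke tests failed")
    if not signals.workers_connected:
        reasons.append("workers cannot reach database or Redis")
    if signals.error_rate_5xx > max_5xx_rate:
        reasons.append("elevated 5xx rate")
    if signals.task_restarts > max_task_restarts:
        reasons.append("tasks crash-looping on startup")
    return reasons
```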
Rollback Steps
- Identify the last known-good task definition revision.
- Repoint the ECS service to that revision.
- Confirm tasks start successfully.
- Verify target health returns to green.
- Re-run smoke tests.
- Record the rollback event.
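The first two steps can be sketched as follows. The example assumes deployment records are dicts with passed, task_definition_revision, and timestamp keys, where timestamps are ISO 8601 strings in one timezone so they sort correctly as text. Those key names and both function names are hypothetical; ecs is a boto3 ECS client.

```python
def last_known_good(records: list[dict], current_revision: str) -> str:
    """Pick the most recent passing revision other than the one being rolled back."""
    candidates = [
        record
        for record in records
        if record["passed"] and record["task_definition_revision"] != current_revision
    ]
    if not candidates:
        raise LookupError("no known-good revision recorded")
    newest = max(candidates, key=lambda record: record["timestamp"])
    return newest["task_definition_revision"]


def roll_back(ecs, cluster: str, service: str,
              records: list[dict], current_revision: str) -> str:
    """Repoint the service at the last known-good revision and wait for stability."""
    target = last_known_good(records, current_revision)
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    return target
```

Only the task definition changes in this call, which keeps the rollback to a single variable. Target health, smoke tests, and recording the event still follow as separate steps.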
Rollback Rules
- Do not change multiple variables during rollback.
- Revert one release at a time.
- If rollback fails, pause and debug the environment rather than piling on more changes.
Database Migration Handling
Sequence database migrations within the deployment flow in this order:
- run the Alembic migration task against the target database
- deploy the new application services
- verify runtime behavior
- remove old code paths only after stability is confirmed
If a migration is not backward compatible, it should be treated as a separate change with a specific rollback plan.
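If the migration runs as a one-off ECS task, the first step could look like the sketch below. This is an assumption, not something this runbook specifies: it presumes a dedicated task definition whose container command runs the migration (for Alembic, typically alembic upgrade head) and that the first container in the task is the migration container. Extra run_task arguments such as launchType and networkConfiguration are passed through unchanged.

```python
def run_migration_task(ecs, cluster: str, task_definition: str, **run_task_kwargs) -> int:
    """Run a one-off migration task, wait for it to stop, and return its exit code."""
    started = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        count=1,
        **run_task_kwargs,
    )
    if started.get("failures") or not started.get("tasks"):
        raise RuntimeError(f"migration task did not start: {started.get('failures')}")
    task_arn = started["tasks"][0]["taskArn"]

    # Block until the task reaches the STOPPED state.
    ecs.get_waiter("tasks_stopped").wait(cluster=cluster, tasks=[task_arn])

    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    exit_code = task["containers"][0].get("exitCode")
    if exit_code is None:
        raise RuntimeError("migration container stopped without an exit code")
    return exit_code
```

Deploy the new application services only when the returned exit code is 0. A non-zero code means the migration did not complete and the rollout should stop before any service is updated.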
Smoke Test Checklist
- API health endpoint returns healthy
- Login endpoint accepts known credentials in staging
- Database query path works
- Redis-backed behavior works
- Background worker starts and processes a sample job
- CloudWatch logs are visible
- ALB target group is healthy
Operational Notes
Approval Gate
Production rollout should require a human approval gate after staging succeeds.
Release Identity
Every deployed service should be traceable to:
- Git commit SHA
- ECR image digest
- ECS task definition revision
Safety Default
If there is any doubt, keep the previous production revision live until the new revision proves stable.