AWS ECS Deploy and Rollback Runbook

Purpose

This document defines the practical release workflow for ECS once the staging account and task definitions exist.

It covers:

build and publish steps
staging deployment
smoke tests
production rollout
rollback handling

The goal is to make deployments repeatable and boring.

Deployment Principles

Every deploy should be traceable to a Git commit.
Every deploy should use immutable container images, referenced by digest rather than by a mutable tag.
Staging should be deployed before production.
Rollback should use the previously known-good task definition and image digest.
Database migrations should be backward compatible where possible.

Build and Publish

Inputs

Git commit SHA
application test results
container build context
target environment

Output Artifacts

container image digest
ECR image tag
ECS task definition revision
deployment metadata

Recommended Tagging

Use both:

a commit-based tag, such as sha-<shortsha>
a release tag, such as staging-YYYYMMDD or prod-YYYYMMDD

That gives you both exact reproducibility and a human-readable release marker.
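
As a minimal sketch of the tagging scheme above, the helper below derives both tags from a commit SHA and a release date. The function name, the 7-character short SHA, and the restriction to the two environment prefixes are assumptions made for illustration.

```python
import re
from datetime import date


def release_tags(commit_sha: str, environment: str, on: date) -> list[str]:
    """Return the commit-based tag and the dated release tag for one image."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"not a hex Git SHA: {commit_sha!r}")
    if environment not in ("staging", "prod"):
        raise ValueError(f"unknown environment: {environment!r}")
    short_sha = commit_sha[:7]  # assumption: 7-character short SHA
    return [f"sha-{short_sha}", f"{environment}-{on:%Y%m%d}"]
```

Passing the date in explicitly, rather than reading the clock inside the function, keeps the tag reproducible when a release job is re-run.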

Staging Deployment Flow

1. Validate the Build

Before touching ECS:

run linting
run unit tests
build the container image
verify the image starts locally or in a test job

2. Push to ECR

Push the release image to the staging ECR repository.

Record:

repository name
image digest
image tag
commit SHA
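
One possible shape for that record is sketched below. The class name, field names, and the repository URI in the usage line are hypothetical; the only fixed convention is the `repository@sha256:...` form for a digest-pinned image reference.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PushRecord:
    repository: str    # ECR repository URI
    image_digest: str  # e.g. "sha256:<64 hex characters>"
    image_tag: str
    commit_sha: str

    def __post_init__(self) -> None:
        # Reject anything that is not a full sha256 digest.
        if not re.fullmatch(r"sha256:[0-9a-f]{64}", self.image_digest):
            raise ValueError(f"not a sha256 digest: {self.image_digest!r}")

    @property
    def pinned_image(self) -> str:
        """Image reference pinned by digest, suitable for a task definition."""
        return f"{self.repository}@{self.image_digest}"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

For example, `PushRecord("123456789012.dkr.ecr.us-east-1.amazonaws.com/api", digest, "sha-0123456", "0123456").pinned_image` yields a reference that cannot silently move to a different image the way a tag can.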

3. Register ECS Task Definitions

Register a new task definition revision for each changed ECS service.

Update:

image references
environment variable values
secrets references
logging configuration
health check parameters
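
A common way to produce the new revision is to start from the current one and change only what the release changes. The sketch below assumes `current` is the `taskDefinition` dictionary returned by boto3's `describe_task_definition`; the function name is hypothetical, and only the image reference is updated here.

```python
import copy

# Fields returned by describe_task_definition that register_task_definition
# does not accept as input.
READ_ONLY_FIELDS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}


def next_revision_payload(current: dict, container_name: str, image: str) -> dict:
    """Copy a task definition and point one container at a new image."""
    payload = {
        key: copy.deepcopy(value)
        for key, value in current.items()
        if key not in READ_ONLY_FIELDS
    }
    matches = [c for c in payload["containerDefinitions"] if c["name"] == container_name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one container named {container_name!r}")
    matches[0]["image"] = image
    return payload
```

With a boto3 ECS client, the result could then be registered with `ecs.register_task_definition(**payload)`. Environment variables, secrets, logging, and health check settings would be edited on the same copied container definition before registering.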

4. Deploy to Staging ECS

Update the staging ECS service to use the new task definition revision.

Wait for:

task startup to complete
health checks to pass
target groups to become healthy
logs to appear in CloudWatch
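
The update-and-wait step can be sketched as follows, assuming `ecs` is a boto3 ECS client (for example `boto3.client("ecs")`). The function name and the cluster and service names in the usage note are hypothetical.

```python
def deploy_and_wait(ecs, cluster: str, service: str, task_definition: str) -> None:
    """Point a service at a task definition and block until it is stable."""
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,  # "family:revision" or a full ARN
    )
    # The services_stable waiter polls describe_services until the
    # deployment settles, and raises if it does not settle in time.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
```

A call such as `deploy_and_wait(ecs, "staging", "backend-api", "backend-api:42")` covers task startup and service stability. It does not confirm that logs are arriving in CloudWatch or that target groups are healthy, so those still need their own checks.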

5. Run Smoke Tests

Smoke tests should confirm:

API responds
login works
database connectivity works
Redis connectivity works
worker services start cleanly
background jobs process as expected
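
One way to keep these checks uniform is a small runner that executes every check and reports all results rather than stopping at the first failure. This is a sketch: the runner name is hypothetical, and each check is assumed to be a zero-argument callable that raises on failure.

```python
from typing import Callable


def run_smoke_tests(checks: dict[str, Callable[[], None]]) -> dict[str, str]:
    """Run every check and map its name to "pass" or "fail: <reason>"."""
    results: dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "pass"
        except Exception as exc:  # a failing check must not hide later ones
            results[name] = f"fail: {exc}"
    return results
```

Running all checks gives a fuller picture of a bad deploy: a Redis failure and a login failure at the same time point to a different cause than either alone.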

6. Record the Result

Store the result of the staging deployment:

pass or fail
task definition revision
image digest
timestamp
any observed warnings
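
A minimal sketch of such a record, serialized as one JSON line so it can be appended to a log or attached to a release, is shown below. The function and key names are assumptions, not an established schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def staging_result(passed: bool, task_definition_revision: int, image_digest: str,
                   warnings: list[str], at: Optional[datetime] = None) -> str:
    """Serialize the outcome of one staging deployment as a JSON string."""
    at = at or datetime.now(timezone.utc)  # default to the current UTC time
    return json.dumps({
        "result": "pass" if passed else "fail",
        "task_definition_revision": task_definition_revision,
        "image_digest": image_digest,
        "timestamp": at.isoformat(),
        "warnings": warnings,
    }, sort_keys=True)
```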

Production Rollout Flow

1. Confirm Staging Is Green

Do not promote unless staging has:

passed smoke tests
shown no errors in its ECS service events
completed at least one successful rollback rehearsal
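
The gate above can be expressed as a function that returns the reasons blocking promotion, where an empty list means staging is green. This is a sketch: the function name is hypothetical, and the phrase list used to flag suspicious service events is a rough heuristic, not an official classification of ECS event messages.

```python
# Assumption: these substrings are a heuristic for problem events.
SUSPECT_PHRASES = ("unable to", "unhealthy", "failed")


def promotion_blockers(smoke_results: dict[str, bool], service_events: list[str],
                       rollback_rehearsed: bool) -> list[str]:
    """Return every reason staging is not ready to promote."""
    blockers = [f"smoke test failed: {name}"
                for name, ok in smoke_results.items() if not ok]
    blockers += [f"suspect service event: {message}"
                 for message in service_events
                 if any(phrase in message.lower() for phrase in SUSPECT_PHRASES)]
    if not rollback_rehearsed:
        blockers.append("rollback has not been rehearsed in staging")
    return blockers
```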

2. Promote the Same Artifact

Use the exact same image digest that was validated in staging.

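Do not rebuild the image for production: a rebuild can produce a different digest, which staging never validated. The only exception is when the build process itself is the change being promoted.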
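
A quick guard for this rule is to compare the digest portion of the two image references before rollout. The sketch below assumes both references are pinned in the `repository@sha256:...` form; the function names are hypothetical.

```python
def digest_of(image: str) -> str:
    """Extract the digest from a digest-pinned image reference."""
    _, separator, digest = image.partition("@")
    if not separator or not digest.startswith("sha256:"):
        raise ValueError(f"image is not pinned by digest: {image!r}")
    return digest


def is_same_artifact(staging_image: str, production_image: str) -> bool:
    """True when both references point at the same image content."""
    return digest_of(staging_image) == digest_of(production_image)
```

Only the digest is compared, so the check still holds when staging and production pull from different repositories or accounts.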

3. Update Production ECS

Register production task definition revisions from the same release metadata.

Then update the production ECS services in a controlled order:

backend API
worker services
frontend/admin if they are deployed as ECS services
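
The ordering can be made explicit in code so that it does not depend on whoever runs the release. In this sketch the service names and the function name are hypothetical; the order mirrors the list above.

```python
# Hypothetical service names, listed in the rollout order described above.
ROLLOUT_ORDER = ("backend-api", "worker", "frontend", "admin")


def rollout_plan(changed: dict[str, str],
                 order: tuple[str, ...] = ROLLOUT_ORDER) -> list[tuple[str, str]]:
    """Sort changed services (name -> task definition) into rollout order."""
    unknown = sorted(set(changed) - set(order))
    if unknown:
        raise ValueError(f"services missing from rollout order: {unknown}")
    return [(name, changed[name]) for name in order if name in changed]
```

Each step in the returned plan would be deployed and confirmed stable before the next one starts, and the rollout would stop at the first failure.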

4. Observe the Deployment

Watch for:

ECS task restarts
ALB target health
5xx errors
database connectivity errors
Redis connectivity errors
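
For the ALB part of that watch list, the sketch below filters the response of the boto3 `elbv2` client's `describe_target_health` call down to targets that need attention. Treating `draining` as acceptable is an assumption, made because old tasks drain during a normal rolling deploy; the function name is hypothetical.

```python
# Assumption: "draining" targets are old tasks leaving during a normal deploy.
ACCEPTABLE_STATES = ("healthy", "draining")


def unhealthy_targets(response: dict) -> list[tuple[str, str]]:
    """Return (target id, state) for each target that is not acceptable."""
    return [
        (description["Target"]["Id"], description["TargetHealth"]["State"])
        for description in response["TargetHealthDescriptions"]
        if description["TargetHealth"]["State"] not in ACCEPTABLE_STATES
    ]
```

The response would come from something like `elbv2.describe_target_health(TargetGroupArn=...)`, polled for the duration of the rollout.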

Rollback Runbook

Rollback Triggers

Rollback if any of the following happen:

health checks fail repeatedly
API smoke tests fail
workers cannot connect to database or Redis
elevated 5xx rates appear
tasks crash-loop on startup
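
Writing the triggers down as code forces the thresholds to be explicit. In this sketch the signal names and every threshold value are hypothetical placeholders to be tuned per service.

```python
from dataclasses import dataclass


@dataclass
class DeploySignals:
    health_check_failures: int  # consecutive failed health checks
    smoke_tests_passed: bool
    workers_connected: bool     # workers reach both database and Redis
    rate_5xx: float             # fraction of responses that are 5xx
    task_restarts: int          # restarts since the deploy began


def rollback_reasons(signals: DeploySignals, *, max_health_failures: int = 3,
                     max_5xx_rate: float = 0.01, max_restarts: int = 3) -> list[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons = []
    if signals.health_check_failures >= max_health_failures:
        reasons.append("health checks fail repeatedly")
    if not signals.smoke_tests_passed:
        reasons.append("API smoke tests fail")
    if not signals.workers_connected:
        reasons.append("workers cannot connect to database or Redis")
    if signals.rate_5xx > max_5xx_rate:
        reasons.append("elevated 5xx rate")
    if signals.task_restarts >= max_restarts:
        reasons.append("tasks crash-loop on startup")
    return reasons
```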

Rollback Steps

Identify the last known-good task definition revision.
Repoint the ECS service to that revision.
Confirm tasks start successfully.
Verify target health returns to green.
Re-run smoke tests.
Record the rollback event.
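
The first three steps can be sketched as below. It assumes `ecs` is a boto3 ECS client and that `history` is a list of deployment records, oldest first, each with a `revision` number and a `passed` flag; that record shape and both function names are hypothetical.

```python
def last_known_good(history: list[dict], current_revision: int) -> int:
    """Return the most recent passing revision other than the current one."""
    for record in reversed(history):  # newest record first
        if record["passed"] and record["revision"] != current_revision:
            return record["revision"]
    raise LookupError("no known-good revision to roll back to")


def roll_back(ecs, cluster: str, service: str, family: str,
              history: list[dict], current_revision: int) -> str:
    """Repoint the service at the last known-good revision and wait."""
    target = f"{family}:{last_known_good(history, current_revision)}"
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    return target
```

The rollback changes only the task definition, in line with the rule about not changing multiple variables at once. Target health, smoke tests, and recording the event remain separate steps.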

Rollback Rules

Do not change multiple variables during rollback.
Revert one release at a time.
If rollback fails, pause and debug the environment rather than piling on more changes.

Database Migration Handling

Sequence database migrations around the deployment in this order:

run the Alembic migration task against the target database
deploy the new application services
verify runtime behavior
remove old code paths only after stability is confirmed

If a migration is not backward compatible, treat it as a separate change with its own rollback plan, because repointing a service to the previous task definition does not undo a schema change.
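
The first step of the sequence, running Alembic as a one-off ECS task, might look like the sketch below. It builds the arguments for boto3's `run_task` and interprets a `describe_tasks` response; both function names are hypothetical, and a Fargate service would also need `launchType` and `networkConfiguration`, which are omitted here.

```python
def migration_run_task_args(cluster: str, task_definition: str,
                            container_name: str, revision: str = "head") -> dict:
    """Build run_task arguments that override the command with Alembic."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                {"name": container_name,
                 "command": ["alembic", "upgrade", revision]},
            ],
        },
    }


def migration_succeeded(describe_tasks_response: dict) -> bool:
    """True only if every container in every task exited with code 0."""
    containers = [container
                  for task in describe_tasks_response["tasks"]
                  for container in task["containers"]]
    return bool(containers) and all(c.get("exitCode") == 0 for c in containers)
```

With a boto3 client this would be used as `ecs.run_task(**migration_run_task_args(...))`, followed by the `tasks_stopped` waiter and a `describe_tasks` call. The application services are deployed only if `migration_succeeded` returns true.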

Smoke Test Checklist

API health endpoint returns healthy
Login endpoint accepts known credentials in staging
Database query path works
Redis-backed behavior works
Background worker starts and processes a sample job
CloudWatch logs are visible
ALB target group is healthy

Operational Notes

Approval Gate

Production rollout should require a human approval gate after staging succeeds.

Release Identity

Every deployed service should be traceable to:

Git commit SHA
ECR image digest
ECS task definition revision

Safety Default

If there is any doubt, keep the previous production revision live until the new revision proves stable.