AWS ECS Deploy and Rollback Runbook

Purpose

This document defines the practical release workflow for ECS once the staging account and task definitions exist.

It covers:

  • build and publish steps
  • staging deployment
  • smoke tests
  • production rollout
  • rollback handling

The goal is to make deployments repeatable and boring.

Deployment Principles

  • Every deploy should be traceable to a Git commit.
  • Every deploy should produce immutable container images.
  • Staging should be deployed before production.
  • Rollback should use the previously known-good task definition and image digest.
  • Database migrations should be backward compatible where possible.

Build and Publish

Inputs

  • Git commit SHA
  • application test results
  • container build context
  • target environment

Output Artifacts

  • container image digest
  • ECR image tag
  • ECS task definition revision
  • deployment metadata

Tag each image with both:

  • a commit-based tag, such as sha-<shortsha>
  • a release tag, such as staging-YYYYMMDD or prod-YYYYMMDD

That gives you both exact reproducibility and a human-readable release marker.
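
The two-tag scheme above can be generated mechanically. This is a minimal sketch, not a prescribed implementation: the function name `image_tags`, the 7-character short SHA, and the accepted environment names are assumptions made for illustration.

```python
import re
from datetime import date


def image_tags(commit_sha: str, environment: str, release_date: date) -> list[str]:
    """Return the commit-based tag and the dated release tag for one image."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"not a hex Git SHA: {commit_sha!r}")
    if environment not in ("staging", "prod"):
        raise ValueError(f"unknown environment: {environment!r}")
    short_sha = commit_sha[:7]  # assumption: 7 characters is the short-SHA length
    return [f"sha-{short_sha}", f"{environment}-{release_date:%Y%m%d}"]
```

Calling `image_tags("9f2c1ab7d3e4", "staging", date(2024, 5, 17))` yields `sha-9f2c1ab` and `staging-20240517`. Only the commit-based tag identifies the build; the dated tag is a convenience label and can collide if two releases ship on the same day.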

Staging Deployment Flow

1. Validate the Build

Before touching ECS:

  • run linting
  • run unit tests
  • build the container image
  • verify the image starts locally or in a test job

2. Push to ECR

Push the release image to the staging ECR repository.

Record:

  • repository name
  • image digest
  • image tag
  • commit SHA
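
One way to capture the four recorded values is a small immutable record that serializes to JSON. The class name `PushRecord`, its field names, and the `image_uri` helper are hypothetical; the sketch only assumes that the digest has the usual `sha256:<64 hex characters>` form.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PushRecord:
    """What was pushed to ECR, in a form that can be stored with the release."""
    repository: str
    image_digest: str
    image_tag: str
    commit_sha: str

    def __post_init__(self) -> None:
        # Reject anything that is not a full SHA-256 digest.
        if not re.fullmatch(r"sha256:[0-9a-f]{64}", self.image_digest):
            raise ValueError(f"unexpected digest format: {self.image_digest!r}")

    def image_uri(self, registry: str) -> str:
        """Digest-pinned image reference, suitable for a task definition."""
        return f"{registry}/{self.repository}@{self.image_digest}"


def to_json(record: PushRecord) -> str:
    """Serialize the record with stable key order so diffs stay readable."""
    return json.dumps(asdict(record), sort_keys=True)
```

Referencing the image by digest rather than by tag means a later re-tag cannot change what a task definition runs.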

3. Register ECS Task Definitions

Register a new task definition revision for each changed ECS service.

Update:

  • image references
  • environment variable values
  • secrets references
  • logging configuration
  • health check parameters
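
A common way to produce the next revision is to copy the current task definition, swap the image, and register the result. The sketch below assumes the input dict is the `taskDefinition` object returned by ECS `DescribeTaskDefinition`. The helper name `next_revision_params` is hypothetical, and the read-only field list should be checked against the current API reference before relying on it.

```python
import copy

# Fields present in a described task definition that, to the best of our
# knowledge, RegisterTaskDefinition does not accept as input.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def next_revision_params(current: dict, container_name: str, image_uri: str) -> dict:
    """Build registration parameters with one container's image replaced."""
    params = {
        key: copy.deepcopy(value)  # deep copy so the caller's dict is untouched
        for key, value in current.items()
        if key not in READ_ONLY_FIELDS
    }
    matched = False
    for container in params["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image_uri
            matched = True
    if not matched:
        raise KeyError(f"no container named {container_name!r} in task definition")
    return params
```

With boto3, the result would typically be passed as `ecs.register_task_definition(**params)`. Environment values, secrets references, logging configuration, and health check parameters can be updated on the copied container entry in the same way.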

4. Deploy to Staging ECS

Update the staging ECS service to use the new task definition revision.

Wait for:

  • task startup to complete
  • health checks to pass
  • target groups to become healthy
  • logs to appear in CloudWatch
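
The update-and-wait step can be sketched as follows, assuming `ecs` is a boto3 ECS client (`boto3.client("ecs")`). The function name and default polling values are illustrative. The `services_stable` waiter covers the service reaching a steady state; it does not confirm that logs are arriving in CloudWatch, which needs a separate check.

```python
def deploy_and_wait(
    ecs,
    cluster: str,
    service: str,
    task_definition: str,
    delay_seconds: int = 15,
    max_attempts: int = 40,
) -> None:
    """Point the service at a task definition and block until it stabilizes."""
    # task_definition may be "family:revision" or a full task definition ARN.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,
    )
    # The waiter polls the service until the deployment settles; boto3 raises
    # botocore.exceptions.WaiterError if that does not happen in time.
    ecs.get_waiter("services_stable").wait(
        cluster=cluster,
        services=[service],
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
```

A timeout from the waiter should be treated as a failed deployment, not retried blindly.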

5. Run Smoke Tests

Smoke tests should confirm:

  • API responds
  • login works
  • database connectivity works
  • Redis connectivity works
  • worker services start cleanly
  • background jobs process as expected
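
A smoke test run can be modelled as a set of named checks that each either return or raise. The runner below is a sketch with hypothetical names (`run_smoke_tests`, `http_ok`). It runs every check even after a failure so that the recorded result shows everything that is broken, not only the first problem.

```python
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> Callable[[], None]:
    """Build a check that passes only if the URL answers with HTTP 200."""
    def check() -> None:
        # urlopen raises on connection errors and on 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                raise RuntimeError(f"{url} returned HTTP {response.status}")
    return check


def run_smoke_tests(checks: dict[str, Callable[[], None]]) -> dict[str, str]:
    """Run every check and map its name to 'pass' or 'fail: <reason>'."""
    results: dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "pass"
        except Exception as exc:  # a smoke test reports failures, it does not crash
            results[name] = f"fail: {exc}"
    return results
```

Database, Redis, and worker checks would be supplied as additional callables; how they connect depends on the application and is not specified here.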

6. Record the Result

Store the result of the staging deployment:

  • pass or fail
  • task definition revision
  • image digest
  • timestamp
  • any observed warnings

Production Rollout Flow

1. Confirm Staging Is Green

Do not promote unless staging has:

  • passed smoke tests
  • logged clean ECS service events
  • completed at least one successful rollback in staging

2. Promote the Same Artifact

Use the exact same image digest that was validated in staging.

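Do not rebuild for production. A rebuild produces a new digest that staging never validated. The only exception is a release where the build process itself is the change being promoted.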
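
Steps 1 and 2 can be enforced as a single promotion gate. The sketch below uses a hypothetical `StagingResult` record and returns the list of reasons a promotion should be blocked. An empty list means the gate is open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingResult:
    """Outcome of the staging deployment for one release (illustrative fields)."""
    smoke_tests_passed: bool
    service_events_clean: bool
    rollback_proven: bool
    image_digest: str


def promotion_blockers(staging: StagingResult, production_digest: str) -> list[str]:
    """List every reason this release must not be promoted to production."""
    blockers: list[str] = []
    if not staging.smoke_tests_passed:
        blockers.append("staging smoke tests did not pass")
    if not staging.service_events_clean:
        blockers.append("staging ECS service events were not clean")
    if not staging.rollback_proven:
        blockers.append("rollback has not been exercised in staging")
    if staging.image_digest != production_digest:
        blockers.append("production digest differs from the digest validated in staging")
    return blockers
```

Comparing digests instead of tags catches the case where a tag was moved or the image was rebuilt between staging and production.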

3. Update Production ECS

Register production task definition revisions from the same release metadata.

Then update the production ECS services in a controlled order:

  1. backend API
  2. worker services
  3. frontend/admin if they are deployed as ECS services
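
The controlled order can be expressed as a loop that stops at the first failure, so later services keep running their previous revision. The service names and the `deploy` callable are placeholders; `deploy` is assumed to raise when a service does not become healthy.

```python
from typing import Callable, Optional

# Hypothetical service names, listed in the rollout order described above.
ROLLOUT_ORDER = ["backend-api", "worker", "frontend"]


def roll_out(
    services: list[str],
    deploy: Callable[[str], None],
) -> tuple[list[str], Optional[str]]:
    """Deploy services in order and return (completed, error).

    error is None when every service deployed. Otherwise it names the
    service that failed, and no later service is attempted.
    """
    completed: list[str] = []
    for name in services:
        try:
            deploy(name)
        except Exception as exc:
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None
```

The `completed` list tells the operator exactly which services would need to be reverted if the rollout is abandoned.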

4. Observe the Deployment

Watch for:

  • ECS task restarts
  • ALB target health
  • 5xx errors
  • database connectivity errors
  • Redis connectivity errors

Rollback Runbook

Rollback Triggers

Rollback if any of the following happen:

  • health checks fail repeatedly
  • API smoke tests fail
  • workers cannot connect to database or Redis
  • elevated 5xx rates appear
  • tasks crash-loop on startup
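
The triggers above become easier to apply consistently when they are evaluated against explicit thresholds. The field names and threshold values in this sketch are assumptions chosen for illustration; the runbook does not define numeric limits, so tune them to the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentObservation:
    """Signals gathered while watching a rollout (illustrative fields)."""
    consecutive_health_check_failures: int
    smoke_tests_passed: bool
    workers_reach_database: bool
    workers_reach_redis: bool
    rate_5xx: float  # fraction of requests, 0.0 to 1.0
    startup_crashes: int


def rollback_reasons(
    obs: DeploymentObservation,
    *,
    max_health_failures: int = 3,   # assumed threshold
    max_rate_5xx: float = 0.01,     # assumed threshold
    max_startup_crashes: int = 2,   # assumed threshold
) -> list[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons: list[str] = []
    if obs.consecutive_health_check_failures >= max_health_failures:
        reasons.append("health checks failing repeatedly")
    if not obs.smoke_tests_passed:
        reasons.append("API smoke tests failed")
    if not (obs.workers_reach_database and obs.workers_reach_redis):
        reasons.append("workers cannot connect to database or Redis")
    if obs.rate_5xx > max_rate_5xx:
        reasons.append("elevated 5xx rate")
    if obs.startup_crashes >= max_startup_crashes:
        reasons.append("tasks crash-looping on startup")
    return reasons
```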

Rollback Steps

  1. Identify the last known-good task definition revision.
  2. Repoint the ECS service to that revision.
  3. Confirm tasks start successfully.
  4. Verify target health returns to green.
  5. Re-run smoke tests.
  6. Record the rollback event.
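
Steps 1 to 3 can be sketched as below, assuming deployment results are stored as a list of records and `ecs` is a boto3 ECS client. The record shape (`revision`, `task_definition`, `status`) and the function names are hypothetical.

```python
def last_known_good(history: list[dict], current_revision: int) -> dict:
    """Pick the newest passing revision that is older than the current one."""
    candidates = [
        entry for entry in history
        if entry["status"] == "pass" and entry["revision"] < current_revision
    ]
    if not candidates:
        raise LookupError("no earlier known-good revision recorded")
    return max(candidates, key=lambda entry: entry["revision"])


def roll_back(ecs, cluster: str, service: str,
              history: list[dict], current_revision: int) -> dict:
    """Repoint the service at the last known-good revision and wait for it."""
    target = last_known_good(history, current_revision)
    # Only the task definition changes; nothing else is touched during rollback.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=target["task_definition"],
    )
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    return target
```

The returned record can be written to the rollback log for step 6. Target health and smoke tests (steps 4 and 5) still need to be verified separately.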

Rollback Rules

  • Do not change multiple variables during rollback.
  • Revert one release at a time.
  • If rollback fails, pause and debug the environment rather than piling on more changes.

Database Migration Handling

Sequence database migrations around the deployment as follows:

  1. run the Alembic migration task against the target database
  2. deploy the new application services
  3. verify runtime behavior
  4. remove old code paths only after stability is confirmed

If a migration is not backward compatible, it should be treated as a separate change with a specific rollback plan.
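
Step 1 is often run as a one-off ECS task that overrides the container command with `alembic upgrade head`. The sketch below assumes `ecs` is a boto3 ECS client and that a migration task definition already exists. The function name is hypothetical, and any launch type or network settings the cluster needs are passed through `run_task_kwargs`.

```python
def run_migration(ecs, cluster: str, task_definition: str,
                  container: str, **run_task_kwargs) -> None:
    """Run `alembic upgrade head` as a one-off task; raise unless it exits 0."""
    response = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        overrides={"containerOverrides": [
            {"name": container, "command": ["alembic", "upgrade", "head"]},
        ]},
        **run_task_kwargs,
    )
    if response.get("failures") or not response.get("tasks"):
        raise RuntimeError(f"migration task did not start: {response.get('failures')}")
    task_arn = response["tasks"][0]["taskArn"]

    # Block until the task stops, then inspect the container's exit code.
    ecs.get_waiter("tasks_stopped").wait(cluster=cluster, tasks=[task_arn])
    described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    for entry in described["tasks"][0]["containers"]:
        # exitCode can be missing if the container never started.
        if entry["name"] == container and entry.get("exitCode") == 0:
            return
    raise RuntimeError("migration did not exit cleanly; do not deploy the new services")
```

Raising on any non-zero or missing exit code keeps step 2 from running against a database in an unknown state.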

Smoke Test Checklist

  • API health endpoint returns healthy
  • Login endpoint accepts known credentials in staging
  • Database query path works
  • Redis-backed behavior works
  • Background worker starts and processes a sample job
  • CloudWatch logs are visible
  • ALB target group is healthy

Operational Notes

Approval Gate

Production rollout should require a human approval gate after staging succeeds.

Release Identity

Every deployed service should be traceable to:

  • Git commit SHA
  • ECR image digest
  • ECS task definition revision

Safety Default

If there is any doubt, keep the previous production revision live until the new revision proves stable.