AWS ECS Migration and CI/CD Plan

Executive Summary

The application is currently designed around Docker Compose on a single EC2 host. In production, the target is a rebuild of that runtime onto ECS, not a greenfield platform replacement. Moving to AWS ECS is a sensible next step if the goal is to reduce host management, improve deployment consistency, and enable a cleaner CI/CD pipeline.

This document proposes a staged migration that keeps the application architecture familiar while changing only the orchestration layer first. The recommended path is:

  1. Package the existing services as ECS tasks.
  2. Push images to Amazon ECR from CI.
  3. Deploy into a staging AWS account first.
  4. Validate the release process, health checks, secrets handling, networking, and rollback.
  5. Promote the same pattern to production after staging is stable, while coexisting with the existing production EC2 path during transition.

The safest approach is to avoid redesigning the application at the same time. Keep the first move operationally small: Docker images, ECS task definitions, load balancer wiring, and environment management.

Goals

  • Remove dependence on a manually managed EC2 Docker host.
  • Introduce repeatable deployments through CI/CD.
  • Use the staging AWS account as the validation environment before production cutover.
  • Preserve the current application behavior as much as possible.
  • Keep rollback simple while the platform is being migrated.

Non-Goals

  • Rewriting application code for Kubernetes-style orchestration.
  • Introducing multiple cloud providers.
  • Re-architecting the database layer during the first migration phase.
  • Changing the domain model, API contract, or frontend behavior as part of the platform move.

Target Architecture

The first ECS target should mirror the current Docker Compose service boundaries:

  • ECS on Fargate for stateless application containers.
  • Application Load Balancer in front of the API and web applications.
  • Amazon ECR for container images.
  • AWS Secrets Manager or SSM Parameter Store for secrets and environment-specific configuration.
  • Amazon RDS for PostgreSQL when the required extension set supports it, otherwise a self-managed PostgreSQL host on EC2 for TimescaleDB/PostGIS workloads.
  • Amazon ElastiCache Redis or an EC2-hosted Valkey service for cache and queue usage, depending on the environment.
  • Anisette as a private internal ECS service with persistent file storage when the FindMy workflows need it.
  • CloudWatch Logs for container logs and alerting.
  • IAM task roles for service-specific AWS permissions.
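As one concrete sketch of how these pieces meet, the container definition fragment below wires a hypothetical api task to Secrets Manager, CloudWatch Logs, and an IAM task role. All ARNs, names, regions, and ports are placeholders, not values taken from this repository:

```json
{
  "family": "api",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "networkMode": "awsvpc",
  "taskRoleArn": "arn:aws:iam::123456789012:role/api-task-role",
  "executionRoleArn": "arn:aws:iam::123456789012:role/api-execution-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:sha-abc1234",
      "portMappings": [{ "containerPort": 8000, "protocol": "tcp" }],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:staging/database-url"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "api"
        }
      }
    }
  ]
}
```

Injecting secrets through the secrets block keeps credentials out of images and task-definition environment sections, which is the pattern assumed throughout this plan.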

If the application currently runs multiple containers on one host, start by identifying which ones should remain in ECS and which dependencies should move to managed AWS services. Stateless services are the easiest first candidates.

Suggested Service Split

Phase 1 Candidate Services

  • API service
  • Frontend service
  • Admin panel service
  • Background worker services that do not require special host access

Managed Infrastructure

  • PostgreSQL database
  • Redis-compatible cache
  • Container registry
  • Centralized logging

Services To Review Carefully

  • Any service that depends on local filesystem state
  • Any service that uses host networking assumptions
  • Any service that performs one-off migrations or admin tasks
  • Any service with custom startup ordering requirements
  • Any private support service such as Anisette that needs service discovery and persistent file storage

Migration Approach

Phase 0: Discovery and Inventory

Before changing infrastructure, document the current runtime assumptions.

Capture:

  • All containers currently started by Docker Compose.
  • Required environment variables and secrets.
  • Ports exposed to users and to internal services.
  • Startup dependencies between services.
  • Persistent volumes and files written at runtime.
  • Health endpoints and readiness behavior.
  • Database and cache connection strings.
  • Any host-level dependencies, such as cron jobs or filesystem mounts.

This phase should produce an inventory that can be translated into ECS task definitions without guesswork.

Phase 1: Build and Image Strategy

Standardize builds so CI produces immutable images.

Recommended pattern:

  • Build one image per deployable service.
  • Tag images with both the Git SHA and a human-readable release tag.
  • Push images to ECR.
  • Avoid building images inside ECS or on the target host.

Database schema migrations should run as a dedicated one-off ECS task that uses the API image and executes alembic upgrade head before the long-running services roll forward.

For the first pass, keep the existing Dockerfiles or Compose build context unless they are clearly unfit for CI.

Concrete implementation for this repository:

  • Use a dedicated docker-bake.hcl file for the image matrix.
  • Build api, frontend, admin, and services from the same source tree but with separate contexts and Dockerfiles.
  • Tag each image as sha-<git-sha> for deployment and also as staging for a mutable convenience tag.
  • Generate infra/envs/staging/image-tags.auto.tfvars.json during the build so Terraform receives the exact image tags without manual editing.
  • Keep compose.yml for local development and move runtime assets into the image so ECS receives the same artifact model as CI.
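The tfvars generation step above can be sketched in Python. The service list and the <service>_image variable convention are assumptions about the Terraform interface, not confirmed values from this repository:

```python
import json
import subprocess
from pathlib import Path

# Assumed bake matrix; adjust to the real docker-bake.hcl targets.
SERVICES = ["api", "frontend", "admin", "services"]


def image_tags(git_sha: str, registry: str) -> dict:
    """Map each assumed Terraform image variable to its immutable sha- tag."""
    return {f"{svc}_image": f"{registry}/{svc}:sha-{git_sha}" for svc in SERVICES}


def write_tfvars(path: str, git_sha: str, registry: str) -> None:
    """Write the auto tfvars file so Terraform picks up exact tags without manual editing."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(image_tags(git_sha, registry), indent=2) + "\n")


def current_git_sha() -> str:
    """Resolve the short commit SHA in CI."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

# In CI, something like:
# write_tfvars("infra/envs/staging/image-tags.auto.tfvars.json",
#              current_git_sha(), "123456789012.dkr.ecr.eu-west-1.amazonaws.com")
```

Because the file is generated per build, Terraform and CI agree on the deployed tag by construction rather than by convention.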

The API image should include the runtime assets it depends on, especially static/ and the pre-generated openapi.json, so ECS does not rely on host bind mounts.

Phase 2: Staging Account Foundation

Use the staging AWS account to validate the entire delivery chain.

Create:

  • ECR repositories for each deployable image.
  • ECS cluster and task definitions.
  • Application Load Balancer and target groups.
  • Security groups and VPC subnets.
  • Managed database and cache resources.
  • Secrets and parameters for staging configuration.
  • CloudWatch log groups and alarms.

The staging account should be treated as a real deployment target, not just a smoke-test environment.

Phase 3: CI/CD Pipeline

The pipeline should do the following:

  1. Run linting and tests.
  2. Build container images.
  3. Push images to ECR.
  4. Register a new ECS task definition revision.
  5. Update the ECS service.
  6. Wait for deployment health to stabilize.
  7. Fail fast and stop if health checks do not pass.

If the repository is hosted on GitHub, GitHub Actions is a pragmatic default. If your team already standardizes on AWS-native tooling, CodeBuild and CodePipeline are also valid.
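If GitHub Actions is chosen, the pipeline steps above map onto AWS's published actions roughly as follows. The role ARN, region, file paths, and service names are placeholders; only the action names and inputs are real:

```yaml
name: deploy-staging
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC federation into AWS, no long-lived keys
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy
          aws-region: eu-west-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - name: Build and push
        run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/api:sha-${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/api:sha-${{ github.sha }}
      - uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: taskdef
        with:
          task-definition: taskdef/api.json
          container-name: api
          image: ${{ steps.ecr.outputs.registry }}/api:sha-${{ github.sha }}
      - uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.taskdef.outputs.task-definition }}
          service: api
          cluster: staging
          wait-for-service-stability: true
```

The wait-for-service-stability flag covers steps 6 and 7: the job fails if the new tasks never reach a steady state.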

Release Versioning TODO

Before implementing the pipeline, decide how ECS will be told which image to run:

  • commit SHA tag
  • image digest
  • release tag or build number

The recommended direction is immutable ECR plus commit SHA tags, with ECS task definitions updated by CI/CD rather than Terraform.


Phase 4: Cutover Strategy

Do not switch production traffic until staging has proven:

  • Health checks remain green under normal load.
  • The application starts reliably from a clean deploy.
  • Database migrations can be applied safely.
  • Logs and metrics are sufficient for troubleshooting.
  • Rollback can be executed in minutes.

Use one of these rollout patterns:

  • Blue/green if you want the cleanest rollback story.
  • Rolling update if you want the simplest implementation.
  • Staged traffic shift if you need a conservative production migration.
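ECS's native deployment circuit breaker is a useful middle ground: a rolling update that aborts and rolls back automatically when new tasks fail health checks. A sketch, assuming Terraform manages the service and that resource names like aws_ecs_cluster.main are placeholders:

```hcl
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  # Keep full capacity while new tasks roll in.
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  # Abort a bad deploy and return to the last steady state automatically.
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
```

Since this plan has CI/CD register new task definition revisions, the Terraform resource would typically also add lifecycle { ignore_changes = [task_definition] } so the two mechanisms do not fight over the deployed revision.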

Phase 5: Production Migration

Once staging is stable, move production in the same order:

  1. Mirror the staging ECS layout in the production account.
  2. Validate secrets, load balancer rules, and log retention.
  3. Run a controlled migration window.
  4. Keep the original EC2 deployment available until confidence is high.
  5. Remove the EC2 path only after a successful observation period.

CI/CD Design

Source
  -> Lint
  -> Test
  -> Build
  -> Scan
  -> Push to ECR
  -> Deploy to Staging ECS
  -> Smoke Test
  -> Manual Approval
  -> Deploy to Production ECS

Required Checks

  • Unit tests with coverage.
  • Linting and formatting.
  • Container build validation.
  • Health endpoint verification.
  • Basic API smoke tests after deploy.
  • Database migration check, if applicable.
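A post-deploy smoke test can stay very small. The sketch below assumes a /health endpoint returning a JSON body with a status field set to "ok"; both the path and the body shape are assumptions about the application, not documented behavior:

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the assumed /health endpoint; return status code and parsed body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return {"code": resp.status, "body": json.loads(resp.read())}


def is_healthy(result: dict) -> bool:
    """Pass only on HTTP 200 with an explicit ok marker in the body."""
    return result["code"] == 200 and result["body"].get("status") == "ok"

# In the pipeline: fail the deploy job if is_healthy(fetch_health(url)) is False.
```

Checking the body, not just the status code, catches the case where a proxy answers 200 while the application behind it is down.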

Release Artifacts

Store the following as part of the release process:

  • Git commit SHA
  • Image digest
  • Task definition revision
  • Environment name
  • Deployment timestamp

This makes rollback and audit trails much easier.
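The artifact list above amounts to a small record written once per deploy. A minimal sketch; the field names and the example digest are illustrative, not a defined schema:

```python
import json
from datetime import datetime, timezone


def release_record(git_sha: str, image_digest: str,
                   task_def_revision: int, environment: str) -> dict:
    """Assemble the audit record captured for every deploy."""
    return {
        "git_sha": git_sha,
        "image_digest": image_digest,
        "task_definition_revision": task_def_revision,
        "environment": environment,
        "deployed_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }

# Example: serialize alongside the pipeline run for later rollback or audit.
record = release_record("abc1234", "sha256:9f86d081...", 42, "staging")
print(json.dumps(record, indent=2))
```

Storing the image digest as well as the tag matters: tags can be repointed, digests cannot, so the digest is the value a rollback or audit should trust.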

Staging Account Plan

The staging account should be configured to resemble production closely enough to catch platform problems early.

Staging Must Include

  • Separate VPC and security groups.
  • Separate ECR repositories or namespaces.
  • Separate secrets and environment parameters.
  • Production-like ECS task sizing.
  • A realistic database and cache configuration.
  • Application logs and alarms.

Staging Validation Checklist

  • Deploy a known release from CI.
  • Confirm the app comes up cleanly.
  • Confirm database connectivity.
  • Confirm cache connectivity.
  • Confirm login and core API paths.
  • Confirm background workers process jobs.
  • Confirm logs appear in CloudWatch.
  • Confirm rollback to the previous task revision works.
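The rollback check in this model is re-pointing the service at the previous task definition revision. A runbook sketch, assuming a cluster named staging, a service named api, and a known-good revision number (api:41 is a placeholder):

```shell
# Identify the revision the service is currently running.
aws ecs describe-services --cluster staging --services api \
  --query 'services[0].taskDefinition'

# Roll the service back to a known-good revision.
aws ecs update-service --cluster staging --service api \
  --task-definition api:41

# Block until the rollback has converged.
aws ecs wait services-stable --cluster staging --services api
```

This only works if the previous image still exists in ECR, which is why the plan keeps immutable sha- tags rather than overwriting a single tag.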

Operational Concerns

Networking

ECS introduces stricter network planning than a single EC2 host. Confirm:

  • Public endpoints only expose the load balancer.
  • Internal services stay in private subnets.
  • Database and cache are not publicly reachable.
  • Security groups allow only required traffic.

Secrets and Configuration

Do not bake secrets into images. Prefer:

  • Secrets Manager for credentials and API keys.
  • Parameter Store for non-sensitive runtime settings.
  • ECS environment variables only for non-secret values.

Database Migrations

Migrations should be handled as a separate deploy step or an explicit pre-deploy job.

Recommended rule:

  • Apply backward-compatible migrations first.
  • Deploy the new application version second.
  • Remove old code paths only after the new version is confirmed stable.
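The pre-deploy migration job described above can be driven as a one-off Fargate task, as proposed in Phase 1. A runbook sketch, assuming a task definition family named api-migrate; the subnet and security group IDs are placeholders filled in from the environment:

```shell
# Launch the migration task; it runs alembic upgrade head and exits.
TASK_ARN=$(aws ecs run-task --cluster staging \
  --task-definition api-migrate --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaa],securityGroups=[sg-bbb],assignPublicIp=DISABLED}' \
  --query 'tasks[0].taskArn' --output text)

# Wait for it to stop, then verify the container exited 0 before deploying.
aws ecs wait tasks-stopped --cluster staging --tasks "$TASK_ARN"
aws ecs describe-tasks --cluster staging --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].exitCode'
```

Gating the application deploy on that exit code keeps a failed migration from ever meeting new application code.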

Logging and Monitoring

At minimum, capture:

  • Container stdout and stderr in CloudWatch.
  • ALB access logs if needed for troubleshooting.
  • ECS service events.
  • Database and cache health metrics.

Add alarms for:

  • Failed task restarts.
  • Unhealthy target groups.
  • Elevated 5xx rates.
  • Database connection failures.

Risks and Mitigations

  • Hidden host dependency. Impact: deployment fails in ECS. Mitigation: inventory runtime assumptions before migration.
  • Missing secrets wiring. Impact: services fail on startup. Mitigation: use staging to validate all secrets and parameters.
  • Database mismatch. Impact: data issues or downtime. Mitigation: keep migrations backward compatible and test in staging.
  • Weak health checks. Impact: ECS replaces healthy tasks too aggressively. Mitigation: add explicit readiness and smoke tests.
  • Slow rollback. Impact: extended outage risk. Mitigation: keep the previous task definition and image available.
  • Cost surprises. Impact: higher-than-expected AWS spend. Mitigation: set budgets and review ECS/RDS sizing early.

Proposed Work Breakdown

Workstream 1: Architecture Discovery

  • Document current Compose services.
  • Identify deployable ECS units.
  • Map config, secrets, and ports.
  • Record persistent state dependencies.

Workstream 2: Staging Infrastructure

  • Create AWS staging resources.
  • Provision ECR, ECS, ALB, the database host or managed database service, cache service, and logging.
  • Set up least-privilege IAM roles.
  • Configure DNS and TLS for staging.

Workstream 3: CI/CD Pipeline

  • Add image build and push automation.
  • Add ECS deployment automation.
  • Add post-deploy smoke tests.
  • Add rollback support.

Workstream 4: Application Validation

  • Test startup on ECS.
  • Validate auth flows and core APIs.
  • Validate background jobs.
  • Validate migrations and persistence.

Workstream 5: Production Rollout

  • Rehearse production deployment in staging.
  • Define go/no-go criteria.
  • Execute cutover.
  • Monitor and stabilize.

Decision Points

Before implementation, decide:

  1. Whether the first ECS deployment uses Fargate or EC2-backed ECS.
  2. Whether the database stays on EC2 or uses a managed PostgreSQL service for each environment.
  3. Whether CI/CD is implemented with GitHub Actions or AWS-native tooling.
  4. Whether services are deployed as one stack or split into multiple ECS services.
  5. Whether rollout uses blue/green or rolling updates.
  6. How release version selection is represented in the pipeline.

If you want the shortest path to a working ECS deployment, start with this version:

  • ECS Fargate
  • One ECS service per deployable container
  • ALB in front of the public web and API surfaces
  • EC2 PostgreSQL host for TimescaleDB/PostGIS if required
  • Cache service appropriate to the environment
  • GitHub Actions for CI/CD
  • Staging account first, production second
  • Immutable ECR with commit SHA tags and ECS task-definition updates from CI/CD

That approach keeps the migration simple enough to be delivered in phases without forcing a full redesign.

Next Steps

  1. Produce a service inventory from the current Docker Compose setup. See 02 AWS ECS Runtime Inventory.
  2. Confirm which dependencies move to managed AWS services.
  3. Create the staging AWS baseline. See 05 AWS Staging Infrastructure Plan.
  4. Document the deploy and rollback procedure in 06 AWS ECS Deploy and Rollback Runbook.
  5. Build the first image-push and ECS deploy pipeline.
  6. Validate one service end to end before migrating the rest.