AWS ECS Staging Blockers and Functional Cluster Path

This note captures the remaining problems in the staging ECS rollout, the intent of the current infrastructure path, and the next fixes needed to reach a functional cluster without hand-built runtime steps.

Current Aim

The current path is trying to achieve these outcomes:

  • Build and push immutable ECS images to ECR.
  • Run infrastructure in Terraform, not in ad hoc console changes.
  • Keep ECS workloads in private subnets with controlled outbound access.
  • Use VPC endpoints for AWS APIs where practical.
  • Run database migrations as a one-off ECS task from the API image.
  • Start the long-running app and worker services only after the database is ready.
  • Keep the worker image shared across worker roles, with command overrides per service.
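As a reference point for the private-subnet and immutable-image goals, a minimal Terraform sketch is below. The resource and variable names (`staging`, `staging_private`, `ecs_tasks`, `app`) are placeholders, not the real module layout:

```hcl
# Sketch only: names are hypothetical; the real module layout may differ.
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.staging.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.staging_private[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false # pulls go via NAT or VPC endpoints, never public IPs
  }
}
```

Deploys then become task definition revisions pointing at new immutable ECR tags, rather than console changes.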

What Is Working

These pieces are in place and currently usable:

  • VPC, ALB, ECR repositories, Secrets Manager, KMS, logs, and the staging ECS cluster.
  • The database host provisions the tracker database, the tracker role, and the required extensions.
  • The api, notification-service, unified-geofence, and materialized-view-service tasks are running.

What Is Still Not Working

anisette

  • The service is still failing during task initialization.
  • The latest error is an EFS mount timeout from mount.nfs4.
  • That means the container never reaches app startup.

tracker-fetcher-2

  • The service is still not stable.
  • It has been hitting the same EFS mount timeout during task initialization.
  • This blocks the fetcher from talking to anisette and from reading its shared account store.
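For context on where the mount timeout sits, this is roughly how the shared worker storage would be wired into the task definition. The filesystem, access point, and container names are placeholders; only the /data/account.json path comes from this note:

```hcl
# Sketch only: filesystem ID, access point, and image tag are hypothetical.
resource "aws_ecs_task_definition" "tracker_fetcher" {
  family                   = "tracker-fetcher-2"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  volume {
    name = "worker-data"
    efs_volume_configuration {
      file_system_id     = aws_efs_file_system.worker.id
      transit_encryption = "ENABLED"
      authorization_config {
        access_point_id = aws_efs_access_point.worker.id
        iam             = "ENABLED" # needs the task role's EFS client permissions
      }
    }
  }

  container_definitions = jsonencode([{
    name      = "tracker-fetcher"
    image     = "${aws_ecr_repository.worker.repository_url}:staging"
    essential = true
    mountPoints = [{
      sourceVolume  = "worker-data"
      containerPath = "/data" # shared account store lives at /data/account.json
    }]
  }])
}
```

A mount.nfs4 timeout here points at networking (security groups, mount target subnets) rather than this definition itself, since ECS attempts the mount before the container command runs.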

frontend and admin

  • They have been draining and replacing tasks during rollout.
  • They are not currently in a clean, steady running state.
  • Even though they are not the primary blocker, they still need to be confirmed healthy behind the ALB.
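Confirming them healthy behind the ALB comes down to the target group health check. A hedged sketch, with placeholder port, path, and thresholds (the real frontend may expose a different health endpoint):

```hcl
# Sketch only: port, path, and thresholds are assumptions, not confirmed values.
resource "aws_lb_target_group" "frontend" {
  name        = "staging-frontend"
  port        = 3000
  protocol    = "HTTP"
  vpc_id      = aws_vpc.staging.id
  target_type = "ip" # awsvpc tasks register by ENI IP

  health_check {
    path                = "/"
    matcher             = "200-399"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```

Repeated task draining during rollout is often a failing health check rather than a crashing container, so checking target health state is the quickest way to tell the two apart.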

Why These Failures Matter

  • tracker-fetcher-2 needs the shared worker image and the shared /data/account.json storage path.
  • anisette needs its own EFS-backed storage and service discovery path.
  • The app is not functionally complete until the web frontends are healthy and the worker dependencies are stable.
  • Without these pieces, migration and deploy automation remain incomplete even if individual containers can start.

Root Causes Seen So Far

  • ECS needed HTTPS egress to reach ECR and AWS APIs. That rule had drifted and had to be recreated.
  • The ECS security group also needed the database, cache, and EFS egress rules restored.
  • The ECS task role initially only covered one EFS path, but the cluster now uses separate worker and anisette filesystems.
  • The anisette service is the last remaining task still showing an EFS mount timeout rather than an app-level failure.
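The task role gap above can be closed by naming both filesystems explicitly. A sketch with hypothetical policy and filesystem names; the point is that the policy must cover the worker and anisette filesystems, not just the original one:

```hcl
# Sketch only: resource names are placeholders for the real worker and
# anisette EFS filesystems and the ECS task role.
data "aws_iam_policy_document" "ecs_task_efs" {
  statement {
    actions = [
      "elasticfilesystem:ClientMount",
      "elasticfilesystem:ClientWrite",
    ]
    resources = [
      aws_efs_file_system.worker.arn,
      aws_efs_file_system.anisette.arn,
    ]
  }
}

resource "aws_iam_role_policy" "ecs_task_efs" {
  name   = "ecs-task-efs"
  role   = aws_iam_role.ecs_task.id
  policy = data.aws_iam_policy_document.ecs_task_efs.json
}
```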

Suggested Fixes

  • Keep ECS tasks in the private subnets with assign_public_ip = false.
  • Keep the ECS security group HTTPS egress rule in place so image pulls and AWS API calls work.
  • Keep the ECS-to-EFS, ECS-to-database, and ECS-to-cache egress rules in place.
  • Keep the EFS security group ingress rule from ECS on port 2049.
  • Keep the ECS task role permissions broad enough to cover both the worker EFS and the anisette EFS.
  • If anisette continues to fail, inspect the task ENI, mount target health, and the exact storage definition for that service.
  • Force a fresh ECS deployment after security group or IAM changes so the tasks do not keep retrying against stale conditions.
  • Treat Alembic as a one-off ECS job from the API image before the long-running app services are considered ready.
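Keeping those security group rules in Terraform rather than the console is what prevents the drift seen earlier. A sketch of the HTTPS and NFS rules; the security group names are placeholders, and the database and cache egress rules would follow the same pattern on their own ports:

```hcl
# Sketch only: security group references are hypothetical.
resource "aws_security_group_rule" "ecs_https_egress" {
  type              = "egress"
  security_group_id = aws_security_group.ecs_tasks.id
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"] # ECR image pulls and AWS API calls
}

resource "aws_security_group_rule" "ecs_nfs_egress" {
  type                     = "egress"
  security_group_id        = aws_security_group.ecs_tasks.id
  from_port                = 2049
  to_port                  = 2049
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.efs.id
}

resource "aws_security_group_rule" "efs_nfs_ingress" {
  type                     = "ingress"
  security_group_id        = aws_security_group.efs.id
  from_port                = 2049
  to_port                  = 2049
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.ecs_tasks.id
}
```

For the forced redeploy after rule or IAM changes, the Terraform AWS provider's `aws_ecs_service` resource accepts `force_new_deployment = true`, which replaces tasks on the next apply instead of waiting for them to cycle out on their own.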

Suggested Order To Finish The Cluster

  1. Stabilise anisette.
  2. Stabilise tracker-fetcher-2.
  3. Confirm frontend and admin are healthy behind the ALB.
  4. Run the migration task from the API image and confirm the schema is present.
  5. Verify the long-running services stay up after deployment.
  6. Only then treat image tag updates as the normal deploy mechanism.
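Step 4 can be expressed as a dedicated migration task definition that reuses the API image and is started on demand (for example via `aws ecs run-task`) rather than as a service. Names, sizes, and the region are placeholders:

```hcl
# Sketch only: family, roles, log group, and region are hypothetical.
resource "aws_ecs_task_definition" "migrate" {
  family                   = "staging-migrate"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name      = "migrate"
    image     = "${aws_ecr_repository.api.repository_url}:staging"
    essential = true
    command   = ["alembic", "upgrade", "head"] # one-off schema migration
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/staging-migrate"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "migrate"
      }
    }
  }])
}
```

Because the task exits after Alembic finishes, its exit code and log stream are the confirmation that the schema is present before the long-running services are declared ready.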

Practical Boundary For The Next Pass

  • Do not add more infrastructure features until the remaining services are actually stable.
  • Do not hand-run app startup steps if Terraform and ECS can express them.
  • Keep the deployment path focused on reproducible infrastructure, reproducible images, and a predictable migration step.