
ECS Staging Service Status and Fix Plan

Living tracker. Check items off as each fix lands and is verified. Log data captured 2026-04-22 ~08:00–09:20 UTC. Updated ~14:15 UTC.


Current Service State (as of ~14:15 UTC)

Service                   | ECS    | Image              | App status
api                       | 1/1 ✅ | sha-b04415821ec3-4 | Healthy — ALB /health 200, serving real API traffic
frontend                  | 1/1 ✅ | sha-bootstrap      | Running; API proxy works if CNAME resolves
admin                     | 1/1 ✅ | sha-bootstrap      | Running; API proxy works if CNAME resolves
anisette                  | 1/1 ✅ | sha-bootstrap      | Running; EFS mounted; service discovery registered
tracker-fetcher-2         | 1/1 ✅ | sha-bootstrap      | Running; Apple FindMy fetches failing (application-level)
notification-service      | 1/1 ✅ | sha-bootstrap      | Healthy — DB listener reconnecting normally every 5 min
unified-geofence          | 1/1 ✅ | sha-bootstrap      | Running
materialized-view-service | 1/1 ✅ | sha-bootstrap      | Views exist; "No data" warnings expected (no location reports yet)

Note on images: All services still run sha-bootstrap. The api was updated separately to sha-b04415821ec3-4. terraform.tfvars currently shows sha-bootstrap for all — needs syncing if any service is force-deployed via Terraform (see drift note below).


Fix Checklist

Fix 1 — Restore ecs_https SG rule: terraform apply

  • Run terraform apply in infra/envs/staging
  • Confirm plan includes adding ecs_https egress (443 → 0.0.0.0/0) to the ECS SG
  • Confirm plan also includes the ALB API target group health check path update (/api/v1/health/health)
  • Confirm EFS VPC endpoint resource is handled cleanly (already exists in AWS)
  • Apply succeeds with no errors
  • Root cause fixed: added lifecycle { ignore_changes = [egress, ingress] } to all SGs to prevent recurring rule drift on in-place updates

Why: ECR image pulls require two hops — manifest via the ecr.dkr Interface endpoint (covered by the VPCe SG rule), and layer blobs from S3 via the S3 Gateway endpoint. Gateway endpoint traffic is routed via the route table but the ECS SG still evaluates the destination as S3 public IPs. The ecs_https rule (443 → 0.0.0.0/0) is missing from the ECS SG — confirmed live. Without it, all four services time out at dial tcp 52.95.x.x:443.

Evidence:

CannotPullContainerError: dial tcp 52.95.191.34:443: i/o timeout   # frontend
CannotPullContainerError: dial tcp 3.5.244.104:443: i/o timeout    # admin
CannotPullContainerError: dial tcp 52.95.144.34:443: i/o timeout   # anisette
CannotPullContainerError: dial tcp 3.5.245.207:443: i/o timeout    # tracker-fetcher-2

Fix 2 — Force redeploy of image-pull-blocked services

After Fix 1 apply:

  • Force new deployment: frontend
  • Force new deployment: admin
  • Force new deployment: anisette
  • Force new deployment: tracker-fetcher-2
  • Confirm each service reaches running state (at least pulls image successfully)
for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Fix 3 — Run Alembic migrations job

  • Identify a private subnet ID (e.g., from Terraform outputs or console)
  • Run the migrations ECS task
  • Check /aws/ecs/tracker-restapi-staging/migrations log group — confirm INFO [alembic.runtime.migration] Running upgrade
  • Confirm log ends with no error and task exits 0
  • Root cause fixed: alembic/env.py and app/core/config.py changed from quote_plus URL string to URL.create() — psycopg3 receives credentials as kwargs, bypassing URL @-parsing bug
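The failure mode behind that last item can be shown with a toy parser (illustrative only — this is not the psycopg/libpq code, and the credentials are made up):

```python
# Toy illustration of the @-in-password DSN ambiguity. A parser that splits
# userinfo from hostinfo at the FIRST '@' picks the wrong host.
password = "p@ssw0rd"  # assumed example credential
dsn = f"postgresql://app:{password}@db.internal:5432/tracker"

userinfo, _, hostinfo = dsn.removeprefix("postgresql://").partition("@")
print(userinfo)  # app:p
print(hostinfo)  # ssw0rd@db.internal:5432/tracker  <- garbled host

# URL.create() / kwargs avoid the problem entirely: each component travels
# separately and is never round-tripped through a single string.
components = {"username": "app", "password": password,
              "host": "db.internal", "port": 5432, "database": "tracker"}
```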

Why: All DB-connected services fail with missing schema. notification-service logs UndefinedTable: processed_notifications continuously. materialized-view-service can connect to the DB but finds no views (location_history, location_history_hourly, location_history_daily). The API APScheduler jobstore also times out trying to connect to a schema that doesn't exist.

Evidence:

notification-service:   UndefinedTable: relation "processed_notifications" does not exist
materialized-view-service: Continuous aggregate location_history_hourly does not exist
materialized-view-service: Materialized view location_history does not exist or is inaccessible

Command:

./scripts/run_staging_migrations.sh

The script resolves the staging Terraform outputs for ecs_cluster_arn, migration_task_definition_arn, private_subnet_ids, ecs_security_group_id, and the migrations CloudWatch log group. Use AWS_PROFILE, AWS_REGION, or TERRAFORM_DIR to override defaults if needed.


Fix 4 — Investigate Redis (Valkey) unreachable from API

  • SSH/SSM into the DB host and confirm systemctl status valkey-server (or redis-server) is active
  • Confirm Valkey is listening: ss -tlnp | grep 6379
  • If not running, check /var/log/tracker-db-bootstrap.log for install failures
  • If running, test from ECS task: connect to REDIS_HOST:6379 from within the container

Status: Valkey confirmed active and listening on 0.0.0.0:6379. API is serving real traffic — Redis timeout at startup is non-fatal and not blocking core functionality. Connection from ECS task not yet explicitly verified but SG rules are correct.
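To close the "not yet explicitly verified" item, a minimal reachability probe that can be run inside the API container (REDIS_HOST/REDIS_PORT are assumed environment variable names; adjust to match the actual task definition):

```python
import os
import socket

# Minimal TCP reachability probe for Valkey/Redis from an ECS task.
host = os.environ.get("REDIS_HOST", "127.0.0.1")
port = int(os.environ.get("REDIS_PORT", "6379"))
try:
    with socket.create_connection((host, port), timeout=3):
        print(f"TCP connect to {host}:{port} OK")
except OSError as exc:
    print(f"TCP connect to {host}:{port} FAILED: {exc}")
```

A refused or timed-out connect here points at the SG path; an OK with app-level Redis errors points back at the client config.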


Fix 5 — Force redeploy all running services after migrations

  • Force redeploy notification-service
  • Force redeploy materialized-view-service
  • Force redeploy unified-geofence
  • Force redeploy api (also picks up ALB health check fix from Fix 1)

Fix 6 — Rebuild and push frontend/admin images

  • Code fixes applied (tracker-frontend/nginx.conf, tracker-frontend/Dockerfile, tracker-admin/nginx.conf)
  • Frontend and admin running 1/1
  • CNAME tracker.staging.glimpse.technology set up — nginx can resolve upstream at startup

Note: Current running images are sha-bootstrap (reverted from -r2 rebuild). CNAME resolution makes the bootstrap nginx work. When next rebuilt, the variable-form proxy_pass with VPC resolver is already in the code.
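The variable-form pattern mentioned above, sketched (the resolver IP and location block are placeholders, not the committed config; the tracker-frontend/nginx.conf in the repo is authoritative):

```nginx
# Sketch only -- values are assumptions, not the committed config.
resolver 169.254.169.253 valid=10s;   # AmazonProvidedDNS link-local address

location /api/ {
    # Using a variable forces per-request DNS resolution via the resolver,
    # so nginx can start even if the upstream CNAME does not resolve yet.
    set $api_upstream "__API_URL__";  # placeholder substituted at container start
    proxy_pass $api_upstream;
}
```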


Fix 7 — Confirm anisette and tracker-fetcher-2 reach app startup

  • anisette — 1/1, EFS mounted, registered as anisette-v3.anisette-v3.local:6969
  • tracker-fetcher-2 — 1/1, EFS mounted, Apple account initialized, batch processing running
  • Investigate tracker-fetcher-2 Apple FindMy fetch failures (application-level — see Fix 9)

Fix 8 — Confirm all services stable

  • api — running 1/1, ALB health check passing at /health, serving real traffic
  • frontend — running 1/1, serving static files
  • admin — running 1/1, serving correctly
  • anisette — running 1/1, EFS mounted, service discovery registered
  • tracker-fetcher-2 — running 1/1, EFS mounted, batch cycles running (fetch failures are app-level)
  • notification-service — running 1/1, DB listener reconnecting normally (no schema errors)
  • unified-geofence — running 1/1
  • materialized-view-service — running 1/1, views created, No data warnings are expected on empty DB

Fix 9 — Create TimescaleDB views on staging DB

  • scripts/create_timescaledb_views.sql run against staging DB
  • location_history_hourly and location_history_daily continuous aggregates created
  • location_history materialized view created (old table dropped cleanly by the script's guard block)
  • materialized-view-service sees the views — no more View inaccessible or does not exist errors

Current state: Service reports No data in view and No continuous aggregate policy found — both expected on a fresh environment:

  • "No data": no location reports have been ingested yet; views will populate as tracker-fetcher-2 stores reports
  • "No policy": create_timescaledb_views.sql does not add add_continuous_aggregate_policy() calls; TimescaleDB won't background-refresh, but materialized-view-service does manual refreshes on its own schedule
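If background refresh were ever wanted, the TimescaleDB call the script omits looks like this (the interval values are illustrative assumptions, not values from the repo):

```sql
-- Illustrative only: create_timescaledb_views.sql deliberately omits this,
-- since materialized-view-service refreshes on its own schedule.
SELECT add_continuous_aggregate_policy('location_history_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```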

Note: scripts/create_timescaledb_views.sql is now part of the deployment migrations bootstrap, so future environments get the views automatically. The script also has an idempotent guard block that drops any pre-existing location_history table/view before recreating it.


Fix 10 — Sync image tags (resolved)

  • Image tags managed via infra/envs/staging/image-tags.auto.tfvars.json (separate from terraform.tfvars)
  • Current file: api=sha-b04415821ec3-4, frontend=sha-b04415821ec3-2, admin=sha-b04415821ec3-r2, services=sha-b04415821ec3-r1, anisette=sha-bootstrap

Note: image-tags.auto.tfvars.json is the authoritative source for image tags and takes precedence over terraform.tfvars. Do not set image_tags in terraform.tfvars.


Fix 11 — Restore PSYCOPG_DATABASE_URI before deploying services -r1 image

  • PSYCOPG_DATABASE_URI restored to services/shared/config.py
  • Uses psycopg.conninfo.make_conninfo(host=..., port=..., dbname=..., user=..., password=...) — key=value format, no URL percent-encoding, no @-in-password parsing bug; accepted by both psycopg3 and asyncpg
  • Rebuild services image with a new tag (e.g. sha-b04415821ec3-r2)
  • Update image-tags.auto.tfvars.json services tag to the new value
  • terraform apply (or force-deploy) to update all service containers

Why: PSYCOPG_DATABASE_URI was removed during the URL.create() refactor. image-tags.auto.tfvars.json already points services to sha-b04415821ec3-r1 which lacks the property. If Terraform is applied with that tag, notification-service will crash immediately with 'ServiceSettings' object has no attribute 'PSYCOPG_DATABASE_URI'. The code fix is in place — just needs a rebuild before applying.
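A pure-Python sketch of the libpq key=value conninfo shape that make_conninfo emits (to_conninfo is a hypothetical helper for illustration, not the psycopg implementation — use psycopg.conninfo.make_conninfo in real code):

```python
# Hypothetical helper mirroring the libpq key=value conninfo format.
def to_conninfo(**params):
    parts = []
    for key, value in params.items():
        value = str(value)
        # libpq quoting: empty values or values containing spaces, quotes,
        # or backslashes are single-quoted, with \ and ' backslash-escaped.
        if value == "" or any(c in value for c in " '\\"):
            value = "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
        parts.append(f"{key}={value}")
    return " ".join(parts)

print(to_conninfo(host="db.internal", port=5432, dbname="tracker",
                  user="app", password="p@ss word"))
# host=db.internal port=5432 dbname=tracker user=app password='p@ss word'
```

Because each value is its own token, an @ in the password needs no percent-encoding and there is nothing to mis-parse.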


Execution Order

The fixes are not independent; run them in this sequence.

Critical path

Step 1 → Step 3 → Step 4 → Step 6

Everything else (Redis investigation, frontend rebuild, anisette/fetcher verification) can run in parallel alongside those steps.


Step 1: terraform plan, review, then terraform apply

Run plan first and read the output carefully before applying. Three things to verify:

  • Expect + aws_vpc_security_group_egress_rule for ecs_https (443 → 0.0.0.0/0) — the critical fix.
  • Expect an in-place update to the ALB API target group health check path.
  • aws_vpc_endpoint.efs will attempt a create — but the endpoint already exists in AWS outside of Terraform state. If Terraform plans to create a duplicate, import the existing one first:
VPCE_ID=$(aws ec2 describe-vpc-endpoints \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=service-name,Values=com.amazonaws.eu-west-2.elasticfilesystem" \
  --query 'VpcEndpoints[0].VpcEndpointId' --output text)

cd infra/envs/staging
terraform import -var-file=terraform.tfvars 'aws_vpc_endpoint.efs[0]' "$VPCE_ID"

Step 2: Investigate Redis (in parallel with Step 3)

SSM into the database host while other work continues:

INSTANCE_ID=$(aws ec2 describe-instances \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-db" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)

aws ssm start-session --target $INSTANCE_ID \
  --profile glimpse-staging --region eu-west-2

Then on the host:

systemctl status valkey-server || systemctl status redis-server
ss -tlnp | grep 6379
# If not running:
tail -n 50 /var/log/tracker-db-bootstrap.log

Step 3: Force redeploy the four pull-blocked services

Run immediately after terraform apply completes — before migrations, so image pulls happen while you're working on Redis:

for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Step 4: Run migrations

Once the migration task can pull its image (same ecs_https fix unblocks it):

SUBNET=$(aws ec2 describe-subnets \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-private-*" \
  --query 'Subnets[0].SubnetId' --output text)

aws ecs run-task \
  --cluster tracker-restapi-staging-cluster \
  --task-definition tracker-restapi-staging-migrations \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[sg-097f9e121ce42c166],assignPublicIp=DISABLED}" \
  --profile glimpse-staging --region eu-west-2

Watch the log group /aws/ecs/tracker-restapi-staging/migrations — confirm Running upgrade lines and a clean exit.


Step 5: Rebuild and push the frontend image (in parallel with Steps 2–4)

The code fix is already committed. Build, tag with something meaningful (not sha-bootstrap), push to ECR, then either update the frontend tag in image-tags.auto.tfvars.json and re-apply, or force a deploy pointing at the new tag. Per Fix 10, image-tags.auto.tfvars.json is the authoritative source; do not set image_tags in terraform.tfvars.


Step 6: Force redeploy the currently-running services

After migrations complete, these services need to restart to pick up the new schema:

for svc in api notification-service materialized-view-service unified-geofence; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Step 7: Verify each service (Fix 8 checklist)

Work through the Fix 8 checklist. The two most uncertain outcomes at this point will be anisette (unknown app behaviour once it actually starts past the EFS mount) and tracker-fetcher-2 (needs anisette reachable via Cloud Map service discovery).


Changes Applied

File / Component                   | Change                                                                      | Status
infra/modules/security/main.tf     | Added lifecycle { ignore_changes = [egress, ingress] } to all SGs           | ✅ Applied
infra/modules/alb/main.tf          | API health check path /api/v1/health/health                                 | ✅ Applied
infra/envs/staging/main.tf         | Added aws_vpc_endpoint.efs (conditioned on workers/anisette)                | ✅ Applied
alembic/env.py                     | quote_plus f-string → URL.create() for psycopg3 @-in-password fix           | ✅ In api image
app/core/config.py                 | Same URL.create() fix                                                       | ✅ In api image
app/core/database.py               | URL.create() + .set(drivername=...) for async engine                        | ✅ In api image
services/shared/config.py          | URL.create() fix; PSYCOPG_DATABASE_URI removed (needs restoring — Fix 11)   | ⚠️ Not deployed
services/shared/database.py        | URL.create() fix                                                            | ⚠️ Not deployed
tracker-frontend/nginx.conf        | VPC resolver + variable proxy_pass (__API_URL__)                            | ⚠️ Not deployed (sha-bootstrap running)
tracker-frontend/Dockerfile        | --chown=nginx:nginx --chmod=644 so nginx user can sed the conf              | ⚠️ Not deployed
tracker-admin/nginx.conf           | VPC resolver + variable proxy_pass (__API_URL__)                            | ⚠️ Not deployed
infra/envs/staging/terraform.tfvars | Image tags reverted to sha-bootstrap; admin_hostname added                 | ✅ Current

Drift: API ECS task definition runs sha-b04415821ec3-4 but terraform.tfvars says sha-bootstrap — see Fix 10.


Log Locations

/aws/ecs/tracker-restapi-staging/api
/aws/ecs/tracker-restapi-staging/frontend
/aws/ecs/tracker-restapi-staging/admin
/aws/ecs/tracker-restapi-staging/anisette
/aws/ecs/tracker-restapi-staging/tracker-fetcher-2
/aws/ecs/tracker-restapi-staging/unified-geofence
/aws/ecs/tracker-restapi-staging/notification-service
/aws/ecs/tracker-restapi-staging/materialized-view-service
/aws/ecs/tracker-restapi-staging/migrations