
ECS Staging Service Status and Fix Plan

Living tracker. Check items off as each fix lands and is verified. Log data captured 2026-04-22 ~08:00–09:20 UTC. Updated ~14:15 UTC.


Current Service State (as of ~14:15 UTC)

Service                   | ECS    | Image              | App status
api                       | 1/1 ✅ | sha-b04415821ec3-4 | Healthy — ALB /health 200, serving real API traffic
frontend                  | 1/1 ✅ | sha-bootstrap      | Running; API proxy works if CNAME resolves
admin                     | 1/1 ✅ | sha-bootstrap      | Running; API proxy works if CNAME resolves
anisette                  | 1/1 ✅ | sha-bootstrap      | Running; EFS mounted; service discovery registered
tracker-fetcher-2         | 1/1 ✅ | sha-bootstrap      | Running; Apple FindMy fetches failing (application-level)
notification-service      | 1/1 ✅ | sha-bootstrap      | Healthy — DB listener reconnecting normally every 5 min
unified-geofence          | 1/1 ✅ | sha-bootstrap      | Running
materialized-view-service | 1/1 ✅ | sha-bootstrap      | Views exist; "No data" warnings expected (no location reports yet)

Note on images: All services still run sha-bootstrap. The api was updated separately to sha-b04415821ec3-4. terraform.tfvars currently shows sha-bootstrap for all — needs syncing if any service is force-deployed via Terraform (see drift note below).


Fix Checklist

Fix 1 — Restore ecs_https SG rule: terraform apply

  • Run terraform apply in infra/envs/staging
  • Confirm plan includes adding ecs_https egress (443 → 0.0.0.0/0) to the ECS SG
  • Confirm plan also includes the ALB API target group health check path update (/api/v1/health/health)
  • Confirm EFS VPC endpoint resource is handled cleanly (already exists in AWS)
  • Apply succeeds with no errors
  • Root cause fixed: added lifecycle { ignore_changes = [egress, ingress] } to all SGs to prevent recurring rule drift on in-place updates

Why: ECR image pulls require two hops — manifest via the ecr.dkr Interface endpoint (covered by the VPCe SG rule), and layer blobs from S3 via the S3 Gateway endpoint. Gateway endpoint traffic is routed via the route table but the ECS SG still evaluates the destination as S3 public IPs. The ecs_https rule (443 → 0.0.0.0/0) is missing from the ECS SG — confirmed live. Without it, all four services time out at dial tcp 52.95.x.x:443.

Evidence:

CannotPullContainerError: dial tcp 52.95.191.34:443: i/o timeout   # frontend
CannotPullContainerError: dial tcp 3.5.244.104:443: i/o timeout    # admin
CannotPullContainerError: dial tcp 52.95.144.34:443: i/o timeout   # anisette
CannotPullContainerError: dial tcp 3.5.245.207:443: i/o timeout    # tracker-fetcher-2

Fix 2 — Force redeploy of image-pull-blocked services

After Fix 1 apply:

  • Force new deployment: frontend
  • Force new deployment: admin
  • Force new deployment: anisette
  • Force new deployment: tracker-fetcher-2
  • Confirm each service reaches running state (at least pulls image successfully)
for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Fix 3 — Run Alembic migrations job

  • Identify a private subnet ID (e.g., from Terraform outputs or console)
  • Run the migrations ECS task
  • Check /aws/ecs/tracker-restapi-staging/migrations log group — confirm INFO [alembic.runtime.migration] Running upgrade
  • Confirm log ends with no error and task exits 0
  • Root cause fixed: alembic/env.py and app/core/config.py changed from quote_plus URL string to URL.create() — psycopg3 receives credentials as kwargs, bypassing URL @-parsing bug
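The failure mode behind that last item can be shown with a toy parser (illustrative only — this is not the psycopg/libpq code, and the credentials are made up):

```python
# Toy illustration of the @-in-password DSN ambiguity. A parser that splits
# userinfo from hostinfo at the FIRST '@' picks the wrong host.
password = "p@ssw0rd"  # assumed example credential
dsn = f"postgresql://app:{password}@db.internal:5432/tracker"

userinfo, _, hostinfo = dsn.removeprefix("postgresql://").partition("@")
print(userinfo)  # app:p
print(hostinfo)  # ssw0rd@db.internal:5432/tracker  <- garbled host

# URL.create() / kwargs avoid the problem entirely: each component travels
# separately and is never round-tripped through a single string.
components = {"username": "app", "password": password,
              "host": "db.internal", "port": 5432, "database": "tracker"}
```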

Why: All DB-connected services fail with missing schema. notification-service logs UndefinedTable: processed_notifications continuously. materialized-view-service can connect to the DB but finds no views (location_history, location_history_hourly, location_history_daily). The API APScheduler jobstore also times out trying to connect to a schema that doesn't exist.

Evidence:

notification-service:   UndefinedTable: relation "processed_notifications" does not exist
materialized-view-service: Continuous aggregate location_history_hourly does not exist
materialized-view-service: Materialized view location_history does not exist or is inaccessible

Command:

./scripts/run_staging_migrations.sh

The script resolves the staging Terraform outputs for ecs_cluster_arn, migration_task_definition_arn, private_subnet_ids, ecs_security_group_id, and the migrations CloudWatch log group. Use AWS_PROFILE, AWS_REGION, or TERRAFORM_DIR to override defaults if needed.


Fix 4 — Investigate Redis (Valkey) unreachable from API

  • SSH/SSM into the DB host and confirm systemctl status valkey-server (or redis-server) is active
  • Confirm Valkey is listening: ss -tlnp | grep 6379
  • If not running, check /var/log/tracker-db-bootstrap.log for install failures
  • If running, test from ECS task: connect to REDIS_HOST:6379 from within the container

Status: Valkey confirmed active and listening on 0.0.0.0:6379. API is serving real traffic — Redis timeout at startup is non-fatal and not blocking core functionality. Connection from ECS task not yet explicitly verified but SG rules are correct.
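To close the "not yet explicitly verified" item, a minimal reachability probe that can be run inside the API container (REDIS_HOST/REDIS_PORT are assumed environment variable names; adjust to match the actual task definition):

```python
import os
import socket

# Minimal TCP reachability probe for Valkey/Redis from an ECS task.
host = os.environ.get("REDIS_HOST", "127.0.0.1")
port = int(os.environ.get("REDIS_PORT", "6379"))
try:
    with socket.create_connection((host, port), timeout=3):
        print(f"TCP connect to {host}:{port} OK")
except OSError as exc:
    print(f"TCP connect to {host}:{port} FAILED: {exc}")
```

A refused or timed-out connect here points at the SG path; an OK with app-level Redis errors points back at the client config.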


Fix 5 — Force redeploy all running services after migrations

  • Force redeploy notification-service
  • Force redeploy materialized-view-service
  • Force redeploy unified-geofence
  • Force redeploy api (also picks up ALB health check fix from Fix 1)

Fix 6 — Rebuild and push frontend/admin images

  • Code fixes applied (tracker-frontend/nginx.conf, tracker-frontend/Dockerfile, tracker-admin/nginx.conf)
  • Frontend and admin running 1/1
  • CNAME tracker.staging.glimpse.technology set up — nginx can resolve upstream at startup

Note: Current running images are sha-bootstrap (reverted from -r2 rebuild). CNAME resolution makes the bootstrap nginx work. When next rebuilt, the variable-form proxy_pass with VPC resolver is already in the code.
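The variable-form pattern mentioned above, sketched (the resolver IP and location block are placeholders, not the committed config; the tracker-frontend/nginx.conf in the repo is authoritative):

```nginx
# Sketch only -- values are assumptions, not the committed config.
resolver 169.254.169.253 valid=10s;   # AmazonProvidedDNS link-local address

location /api/ {
    # Using a variable forces per-request DNS resolution via the resolver,
    # so nginx can start even if the upstream CNAME does not resolve yet.
    set $api_upstream "__API_URL__";  # placeholder substituted at container start
    proxy_pass $api_upstream;
}
```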


Fix 7 — Confirm anisette and tracker-fetcher-2 reach app startup

  • anisette — 1/1, EFS mounted, registered as anisette-v3.anisette-v3.local:6969
  • tracker-fetcher-2 — 1/1, EFS mounted, Apple account initialized, batch processing running
  • Investigate tracker-fetcher-2 Apple FindMy fetch failures (application-level — see Fix 9)

Fix 8 — Confirm all services stable

  • api — running 1/1, ALB health check passing at /health, serving real traffic
  • frontend — running 1/1, serving static files
  • admin — running 1/1, serving correctly
  • anisette — running 1/1, EFS mounted, service discovery registered
  • tracker-fetcher-2 — running 1/1, EFS mounted, batch cycles running (fetch failures are app-level)
  • notification-service — running 1/1, DB listener reconnecting normally (no schema errors)
  • unified-geofence — running 1/1
  • materialized-view-service — running 1/1, views created, No data warnings are expected on empty DB

Fix 9 — Create TimescaleDB views on staging DB

  • scripts/create_timescaledb_views.sql run against staging DB
  • location_history_hourly and location_history_daily continuous aggregates created
  • location_history materialized view created (old table dropped cleanly by the script's guard block)
  • materialized-view-service sees the views — no more View inaccessible or does not exist errors

Current state: Service reports No data in view and No continuous aggregate policy found — both expected on a fresh environment:

  • "No data": no location reports have been ingested yet; views will populate as tracker-fetcher-2 stores reports
  • "No policy": create_timescaledb_views.sql does not add add_continuous_aggregate_policy() calls; TimescaleDB won't background-refresh, but materialized-view-service does manual refreshes on its own schedule
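If background refresh were ever wanted, the TimescaleDB call the script omits looks like this (the interval values are illustrative assumptions, not values from the repo):

```sql
-- Illustrative only: create_timescaledb_views.sql deliberately omits this,
-- since materialized-view-service refreshes on its own schedule.
SELECT add_continuous_aggregate_policy('location_history_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```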

Note: scripts/create_timescaledb_views.sql is now part of the deployment migrations bootstrap, so future environments get the views automatically. The script also has an idempotent guard block that drops any pre-existing location_history table/view before recreating it.


Fix 10 — Sync image tags (resolved)

  • Image tags managed via infra/envs/staging/image-tags.auto.tfvars.json (separate from terraform.tfvars)
  • Current file: api=sha-b04415821ec3-4, frontend=sha-b04415821ec3-2, admin=sha-b04415821ec3-r2, services=sha-b04415821ec3-r1, anisette=sha-bootstrap

Note: image-tags.auto.tfvars.json is the authoritative source for image tags and takes precedence over terraform.tfvars. Do not set image_tags in terraform.tfvars.


Fix 11 — Restore PSYCOPG_DATABASE_URI before deploying services -r1 image

  • PSYCOPG_DATABASE_URI restored to services/shared/config.py
  • Uses psycopg.conninfo.make_conninfo(host=..., port=..., dbname=..., user=..., password=...) — key=value format, no URL percent-encoding, no @-in-password parsing bug; accepted by both psycopg3 and asyncpg
  • Rebuild services image with a new tag (e.g. sha-b04415821ec3-r2)
  • Update image-tags.auto.tfvars.json services tag to the new value
  • terraform apply (or force-deploy) to update all service containers

Why: PSYCOPG_DATABASE_URI was removed during the URL.create() refactor. image-tags.auto.tfvars.json already points services to sha-b04415821ec3-r1 which lacks the property. If Terraform is applied with that tag, notification-service will crash immediately with 'ServiceSettings' object has no attribute 'PSYCOPG_DATABASE_URI'. The code fix is in place — just needs a rebuild before applying.
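A pure-Python sketch of the libpq key=value conninfo shape that make_conninfo emits (to_conninfo is a hypothetical helper for illustration, not the psycopg implementation — use psycopg.conninfo.make_conninfo in real code):

```python
# Hypothetical helper mirroring the libpq key=value conninfo format.
def to_conninfo(**params):
    parts = []
    for key, value in params.items():
        value = str(value)
        # libpq quoting: empty values or values containing spaces, quotes,
        # or backslashes are single-quoted, with \ and ' backslash-escaped.
        if value == "" or any(c in value for c in " '\\"):
            value = "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
        parts.append(f"{key}={value}")
    return " ".join(parts)

print(to_conninfo(host="db.internal", port=5432, dbname="tracker",
                  user="app", password="p@ss word"))
# host=db.internal port=5432 dbname=tracker user=app password='p@ss word'
```

Because each value is its own token, an @ in the password needs no percent-encoding and there is nothing to mis-parse.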


Execution Order

The fixes are not independent; run them in this sequence.

Critical path

Step 1 → Step 3 → Step 4 → Step 6

Everything else (Redis investigation, frontend rebuild, anisette/fetcher verification) can run in parallel alongside those steps.


Step 1: terraform plan, review, then terraform apply

Run plan first and read the output carefully before applying. Three things to verify:

  • Expect + aws_vpc_security_group_egress_rule for ecs_https (443 → 0.0.0.0/0) — the critical fix.
  • Expect an in-place update to the ALB API target group health check path.
  • aws_vpc_endpoint.efs will attempt a create — but the endpoint already exists in AWS outside of Terraform state. If Terraform plans to create a duplicate, import the existing one first:
VPCE_ID=$(aws ec2 describe-vpc-endpoints \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=service-name,Values=com.amazonaws.eu-west-2.elasticfilesystem" \
  --query 'VpcEndpoints[0].VpcEndpointId' --output text)

cd infra/envs/staging
terraform import -var-file=terraform.tfvars 'aws_vpc_endpoint.efs[0]' "$VPCE_ID"

Step 2: Investigate Redis (in parallel with Step 3)

SSM into the database host while other work continues:

INSTANCE_ID=$(aws ec2 describe-instances \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-db" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)

aws ssm start-session --target $INSTANCE_ID \
  --profile glimpse-staging --region eu-west-2

Then on the host:

systemctl status valkey-server || systemctl status redis-server
ss -tlnp | grep 6379
# If not running:
tail -n 50 /var/log/tracker-db-bootstrap.log

Step 3: Force redeploy the four pull-blocked services

Run immediately after terraform apply completes — before migrations, so image pulls happen while you're working on Redis:

for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Step 4: Run migrations

Once the migration task can pull its image (same ecs_https fix unblocks it):

SUBNET=$(aws ec2 describe-subnets \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-private-*" \
  --query 'Subnets[0].SubnetId' --output text)

aws ecs run-task \
  --cluster tracker-restapi-staging-cluster \
  --task-definition tracker-restapi-staging-migrations \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[sg-097f9e121ce42c166],assignPublicIp=DISABLED}" \
  --profile glimpse-staging --region eu-west-2

Watch the log group /aws/ecs/tracker-restapi-staging/migrations — confirm Running upgrade lines and a clean exit.


Step 5: Rebuild and push the frontend image (in parallel with Steps 2–4)

The code fix is already committed. Build, tag with something meaningful (not sha-bootstrap), push to ECR, then either update the frontend tag in image-tags.auto.tfvars.json and re-apply, or force a deploy pointing at the new tag. Per Fix 10, image-tags.auto.tfvars.json is the authoritative source; do not set image_tags in terraform.tfvars.


Step 6: Force redeploy the currently-running services

After migrations complete, these services need to restart to pick up the new schema:

for svc in api notification-service materialized-view-service unified-geofence; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service tracker-restapi-staging-$svc \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done

Step 7: Verify each service (Fix 8 checklist)

Work through the Fix 8 checklist. The two most uncertain outcomes at this point will be anisette (unknown app behaviour once it actually starts past the EFS mount) and tracker-fetcher-2 (needs anisette reachable via Cloud Map service discovery).


Changes Applied

File / Component                   | Change                                                                      | Status
infra/modules/security/main.tf     | Added lifecycle { ignore_changes = [egress, ingress] } to all SGs           | ✅ Applied
infra/modules/alb/main.tf          | API health check path /api/v1/health/health                                 | ✅ Applied
infra/envs/staging/main.tf         | Added aws_vpc_endpoint.efs (conditioned on workers/anisette)                | ✅ Applied
alembic/env.py                     | quote_plus f-string → URL.create() for psycopg3 @-in-password fix           | ✅ In api image
app/core/config.py                 | Same URL.create() fix                                                       | ✅ In api image
app/core/database.py               | URL.create() + .set(drivername=...) for async engine                        | ✅ In api image
services/shared/config.py          | URL.create() fix; PSYCOPG_DATABASE_URI removed (needs restoring — Fix 11)   | ⚠️ Not deployed
services/shared/database.py        | URL.create() fix                                                            | ⚠️ Not deployed
tracker-frontend/nginx.conf        | VPC resolver + variable proxy_pass (__API_URL__)                            | ⚠️ Not deployed (sha-bootstrap running)
tracker-frontend/Dockerfile        | --chown=nginx:nginx --chmod=644 so nginx user can sed the conf              | ⚠️ Not deployed
tracker-admin/nginx.conf           | VPC resolver + variable proxy_pass (__API_URL__)                            | ⚠️ Not deployed
infra/envs/staging/terraform.tfvars | Image tags reverted to sha-bootstrap; admin_hostname added                 | ✅ Current

Drift: API ECS task definition runs sha-b04415821ec3-4 but terraform.tfvars says sha-bootstrap — see Fix 10.


Log Locations

/aws/ecs/tracker-restapi-staging/api
/aws/ecs/tracker-restapi-staging/frontend
/aws/ecs/tracker-restapi-staging/admin
/aws/ecs/tracker-restapi-staging/anisette
/aws/ecs/tracker-restapi-staging/tracker-fetcher-2
/aws/ecs/tracker-restapi-staging/unified-geofence
/aws/ecs/tracker-restapi-staging/notification-service
/aws/ecs/tracker-restapi-staging/materialized-view-service
/aws/ecs/tracker-restapi-staging/migrations