ECS Staging Service Status and Fix Plan
Living tracker. Check items off as each fix lands and is verified. Log data captured 2026-04-22 ~08:00–09:20 UTC. Updated ~14:15 UTC.
Current Service State (as of ~14:15 UTC)
| Service | ECS | Image | App |
|---|---|---|---|
| `api` | 1/1 ✅ | `sha-b04415821ec3-4` | Healthy — ALB `/health` 200, serving real API traffic |
| `frontend` | 1/1 ✅ | `sha-bootstrap` | Running; API proxy works if CNAME resolves |
| `admin` | 1/1 ✅ | `sha-bootstrap` | Running; API proxy works if CNAME resolves |
| `anisette` | 1/1 ✅ | `sha-bootstrap` | Running; EFS mounted; service discovery registered |
| `tracker-fetcher-2` | 1/1 ✅ | `sha-bootstrap` | Running; Apple FindMy fetches failing (application-level) |
| `notification-service` | 1/1 ✅ | `sha-bootstrap` | Healthy — DB listener reconnecting normally every 5 min |
| `unified-geofence` | 1/1 ✅ | `sha-bootstrap` | Running |
| `materialized-view-service` | 1/1 ✅ | `sha-bootstrap` | Views exist; "No data" warnings expected (no location reports yet) |
Note on images: All services except `api` still run `sha-bootstrap`; `api` was updated separately to `sha-b04415821ec3-4`. `terraform.tfvars` currently shows `sha-bootstrap` for all — it needs syncing if any service is force-deployed via Terraform (see drift note below).
Fix Checklist
Fix 1 — Restore `ecs_https` SG rule: `terraform apply`
- Run `terraform apply` in `infra/envs/staging`
- Confirm plan includes adding `ecs_https` egress (443 → 0.0.0.0/0) to the ECS SG
- Confirm plan also updates the ALB API health check path from `/api/v1/health` to `/health`
- Confirm the EFS VPC endpoint resource is handled cleanly (already exists in AWS)
- Apply succeeds with no errors
- Root cause fixed: added `lifecycle { ignore_changes = [egress, ingress] }` to all SGs to prevent recurring rule drift on in-place updates
Why: ECR image pulls require two hops — the manifest via the `ecr.dkr` Interface endpoint (covered by the VPCe SG rule), and layer blobs from S3 via the S3 Gateway endpoint. Gateway endpoint traffic is routed via the route table, but the ECS SG still evaluates the destination as S3 public IPs. The `ecs_https` rule (443 → 0.0.0.0/0) is missing from the ECS SG — confirmed live. Without it, all four services time out at `dial tcp 52.95.x.x:443`.
Evidence:
```
CannotPullContainerError: dial tcp 52.95.191.34:443: i/o timeout   # frontend
CannotPullContainerError: dial tcp 3.5.244.104:443: i/o timeout    # admin
CannotPullContainerError: dial tcp 52.95.144.34:443: i/o timeout   # anisette
CannotPullContainerError: dial tcp 3.5.245.207:443: i/o timeout    # tracker-fetcher-2
```
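In HCL, the missing rule plus the lifecycle guard from the checklist above look roughly like this — a sketch only; resource and SG names here are illustrative, the real ones live in `infra/modules/security/main.tf`:

```hcl
# Illustrative names, not the module's actual identifiers.
resource "aws_vpc_security_group_egress_rule" "ecs_https" {
  security_group_id = aws_security_group.ecs.id
  description       = "HTTPS out for ECR manifests (ecr.dkr VPCe) and S3 layer blobs"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv4         = "0.0.0.0/0"
}

resource "aws_security_group" "ecs" {
  name   = "ecs"
  vpc_id = var.vpc_id

  # Guard against the recurring drift that dropped the rule in the first place.
  lifecycle {
    ignore_changes = [egress, ingress]
  }
}
```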
Fix 2 — Force redeploy of image-pull-blocked services
After Fix 1 apply:
- Force new deployment: `frontend`
- Force new deployment: `admin`
- Force new deployment: `anisette`
- Force new deployment: `tracker-fetcher-2`
- Confirm each service reaches running state (at least pulls its image successfully)

```bash
for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service "tracker-restapi-staging-$svc" \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done
```
Fix 3 — Run Alembic migrations job
- Identify a private subnet ID (e.g., from Terraform outputs or console)
- Run the migrations ECS task
- Check the `/aws/ecs/tracker-restapi-staging/migrations` log group — confirm `INFO [alembic.runtime.migration] Running upgrade`
- Confirm the log ends with no error and the task exits 0
- Root cause fixed: `alembic/env.py` and `app/core/config.py` changed from the `quote_plus` URL string to `URL.create()` — psycopg3 receives credentials as kwargs, bypassing the URL `@`-parsing bug
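As a stdlib-only illustration of the class of bug the `URL.create()` refactor removes (hostnames and the password here are invented): interpolating an unencoded password into a DSN string changes how the URL parses, whereas passing credentials as separate components sidesteps encoding entirely.

```python
from urllib.parse import quote_plus, urlsplit

# A password with reserved characters ('@', '/') corrupts a naively
# interpolated DSN: the first '/' in the password terminates the
# authority section, so the parsed hostname comes out wrong.
password = "p@ss/word"
unsafe = f"postgresql://app:{password}@db.internal:5432/tracker"
safe = f"postgresql://app:{quote_plus(password)}@db.internal:5432/tracker"

print(urlsplit(unsafe).hostname)  # 'ss' — not the real host
print(urlsplit(safe).hostname)    # 'db.internal'
```

`URL.create()` goes one step further than `quote_plus`: since components are never joined into a string that must be re-parsed, no encoding is needed at all.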
Why: All DB-connected services fail with missing schema. notification-service logs `UndefinedTable: processed_notifications` continuously. materialized-view-service can connect to the DB but finds no views (`location_history`, `location_history_hourly`, `location_history_daily`). The API APScheduler jobstore also times out trying to connect to a schema that doesn't exist.
Evidence:
```
notification-service:      UndefinedTable: relation "processed_notifications" does not exist
materialized-view-service: Continuous aggregate location_history_hourly does not exist
materialized-view-service: Materialized view location_history does not exist or is inaccessible
```
Command:
```bash
./scripts/run_staging_migrations.sh
```
The script resolves the staging Terraform outputs for `ecs_cluster_arn`, `migration_task_definition_arn`, `private_subnet_ids`, `ecs_security_group_id`, and the migrations CloudWatch log group. Use `AWS_PROFILE`, `AWS_REGION`, or `TERRAFORM_DIR` to override the defaults if needed.
Fix 4 — Investigate Redis (Valkey) unreachable from API
- SSH/SSM into the DB host and confirm `systemctl status valkey-server` (or `redis-server`) is active
- Confirm Valkey is listening: `ss -tlnp | grep 6379`
- If not running, check `/var/log/tracker-db-bootstrap.log` for install failures
- If running, test from an ECS task: connect to `REDIS_HOST:6379` from within the container
Status: Valkey confirmed active and listening on 0.0.0.0:6379. API is serving real traffic — Redis timeout at startup is non-fatal and not blocking core functionality. Connection from ECS task not yet explicitly verified but SG rules are correct.
Fix 5 — Force redeploy all running services after migrations
- Force redeploy `notification-service`
- Force redeploy `materialized-view-service`
- Force redeploy `unified-geofence`
- Force redeploy `api` (also picks up the ALB health check fix from Fix 1)
Fix 6 — Rebuild and push frontend/admin images
- Code fixes applied (`tracker-frontend/nginx.conf`, `tracker-frontend/Dockerfile`, `tracker-admin/nginx.conf`)
- Frontend and admin running 1/1
- CNAME `tracker.staging.glimpse.technology` set up — nginx can resolve the upstream at startup

Note: Current running images are `sha-bootstrap` (reverted from the `-r2` rebuild). CNAME resolution makes the bootstrap nginx work. When next rebuilt, the variable-form `proxy_pass` with the VPC resolver is already in the code.
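For reference, the variable-form `proxy_pass` pattern looks roughly like this — a sketch; the actual directives and the `__API_URL__` substitution mechanism live in `tracker-frontend/nginx.conf`:

```nginx
# Sketch: putting the upstream in a variable forces nginx to resolve it
# per-request via the VPC resolver (169.254.169.253 is the
# AmazonProvidedDNS link-local address), instead of hard-failing at
# startup when the CNAME doesn't resolve yet.
resolver 169.254.169.253 valid=10s;

location /api/ {
    set $api_upstream __API_URL__;   # substituted at container start
    proxy_pass $api_upstream;
}
```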
Fix 7 — Confirm anisette and tracker-fetcher-2 reach app startup
- `anisette` — 1/1, EFS mounted, registered as `anisette-v3.anisette-v3.local:6969`
- `tracker-fetcher-2` — 1/1, EFS mounted, Apple account initialized, batch processing running
- Investigate `tracker-fetcher-2` Apple FindMy fetch failures (application-level — see Fix 9)
Fix 8 — Confirm all services stable
- `api` — running 1/1, ALB health check passing at `/health`, serving real traffic
- `frontend` — running 1/1, serving static files
- `admin` — running 1/1, serving correctly
- `anisette` — running 1/1, EFS mounted, service discovery registered
- `tracker-fetcher-2` — running 1/1, EFS mounted, batch cycles running (fetch failures are app-level)
- `notification-service` — running 1/1, DB listener reconnecting normally (no schema errors)
- `unified-geofence` — running 1/1
- `materialized-view-service` — running 1/1, views created, "No data" warnings are expected on an empty DB
Fix 9 — Create TimescaleDB views on staging DB
- `scripts/create_timescaledb_views.sql` run against the staging DB
- `location_history_hourly` and `location_history_daily` continuous aggregates created
- `location_history` materialized view created (old table dropped cleanly by the script's guard block)
- `materialized-view-service` sees the views — no more "View inaccessible or does not exist" errors

Current state: the service reports "No data in view" and "No continuous aggregate policy found" — both expected on a fresh environment:
- "No data": no location reports have been ingested yet; views will populate as `tracker-fetcher-2` stores reports
- "No policy": `create_timescaledb_views.sql` does not add `add_continuous_aggregate_policy()` calls; TimescaleDB won't background-refresh, but `materialized-view-service` does manual refreshes on its own schedule

Note: `scripts/create_timescaledb_views.sql` is now part of the deployment migrations bootstrap, so future environments get the views automatically. The script also has an idempotent guard block that drops any pre-existing `location_history` table/view before recreating it.
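If background refresh is ever wanted instead of the service's manual refreshes, a TimescaleDB policy could be added along these lines (a sketch — the offsets and interval are illustrative, not tuned values):

```sql
-- Illustrative: let TimescaleDB refresh the hourly aggregate itself.
SELECT add_continuous_aggregate_policy('location_history_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```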
Fix 10 — Sync image tags (resolved)
- Image tags managed via `infra/envs/staging/image-tags.auto.tfvars.json` (separate from `terraform.tfvars`)
- Current file: `api=sha-b04415821ec3-4`, `frontend=sha-b04415821ec3-2`, `admin=sha-b04415821ec3-r2`, `services=sha-b04415821ec3-r1`, `anisette=sha-bootstrap`

Note: `image-tags.auto.tfvars.json` is the authoritative source for image tags and takes precedence over `terraform.tfvars`. Do not set `image_tags` in `terraform.tfvars`.
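The precedence is standard Terraform behaviour: `*.auto.tfvars.json` files are loaded after `terraform.tfvars`, and later definitions win. Assuming the variable is a map named `image_tags`, the file's shape is roughly:

```json
{
  "image_tags": {
    "api": "sha-b04415821ec3-4",
    "frontend": "sha-b04415821ec3-2",
    "admin": "sha-b04415821ec3-r2",
    "services": "sha-b04415821ec3-r1",
    "anisette": "sha-bootstrap"
  }
}
```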
Fix 11 — Restore `PSYCOPG_DATABASE_URI` before deploying the services `-r1` image
- `PSYCOPG_DATABASE_URI` restored to `services/shared/config.py`
- Uses `psycopg.conninfo.make_conninfo(host=..., port=..., dbname=..., user=..., password=...)` — key=value format, no URL percent-encoding, no `@`-in-password parsing bug; accepted by both psycopg3 and asyncpg
- Rebuild the services image with a new tag (e.g. `sha-b04415821ec3-r2`)
- Update the `services` tag in `image-tags.auto.tfvars.json` to the new value
- `terraform apply` (or force-deploy) to update all service containers

Why: `PSYCOPG_DATABASE_URI` was removed during the `URL.create()` refactor. `image-tags.auto.tfvars.json` already points services to `sha-b04415821ec3-r1`, which lacks the property. If Terraform is applied with that tag, notification-service will crash immediately with `'ServiceSettings' object has no attribute 'PSYCOPG_DATABASE_URI'`. The code fix is in place — it just needs a rebuild before applying.
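The key=value conninfo idea can be sketched in stdlib Python — this is a hypothetical re-implementation for illustration only (the services use `psycopg.conninfo.make_conninfo` itself). Values are escaped per libpq's conninfo quoting rules rather than URL-encoded, so characters like `@` need no special handling at all.

```python
# Hypothetical sketch of what a key=value conninfo builder produces.
def make_conninfo(**params) -> str:
    def quote(value: str) -> str:
        # libpq conninfo rules: single-quote values containing spaces,
        # quotes, or backslashes; escape ' and \ with a backslash.
        if value and not any(c in value for c in " '\\"):
            return value
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return " ".join(f"{k}={quote(str(v))}" for k, v in params.items())

dsn = make_conninfo(host="db.internal", port=5432, dbname="tracker",
                    user="app", password="p@ss")
print(dsn)  # host=db.internal port=5432 dbname=tracker user=app password=p@ss
```

Note the `@` passes through untouched — there is no URL authority section to confuse.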
Recommended Execution Order
The fixes are not independent — run them in this sequence.
Critical path
Step 1 → Step 3 → Step 4 → Step 6
Everything else (Redis investigation, frontend rebuild, anisette/fetcher verification) can run in parallel alongside those steps.
Step 1: terraform plan, review, then terraform apply
Run plan first and read the output carefully before applying. Three things to verify:
- Expect `+ aws_vpc_security_group_egress_rule` for `ecs_https` (443 → 0.0.0.0/0) — the critical fix.
- Expect an in-place update to the ALB API target group health check path.
- `aws_vpc_endpoint.efs` will attempt a create — but the endpoint already exists in AWS outside of Terraform state. If Terraform plans to create a duplicate, import the existing one first:

```bash
VPCE_ID=$(aws ec2 describe-vpc-endpoints \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=service-name,Values=com.amazonaws.eu-west-2.elasticfilesystem" \
  --query 'VpcEndpoints[0].VpcEndpointId' --output text)
cd infra/envs/staging
terraform import -var-file=terraform.tfvars 'aws_vpc_endpoint.efs[0]' "$VPCE_ID"
```
Step 2: Investigate Redis (in parallel with Step 3)
SSM into the database host while other work continues:
```bash
INSTANCE_ID=$(aws ec2 describe-instances \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-db" \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)
aws ssm start-session --target "$INSTANCE_ID" \
  --profile glimpse-staging --region eu-west-2
```
Then on the host:
```bash
systemctl status valkey-server || systemctl status redis-server
ss -tlnp | grep 6379
# If not running:
tail -50 /var/log/tracker-db-bootstrap.log
```
Step 3: Force redeploy the four pull-blocked services
Run immediately after terraform apply completes — before migrations, so image pulls happen while you're working on Redis:
```bash
for svc in frontend admin anisette tracker-fetcher-2; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service "tracker-restapi-staging-$svc" \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done
```
Step 4: Run migrations
Once the migration task can pull its image (same ecs_https fix unblocks it):
```bash
SUBNET=$(aws ec2 describe-subnets \
  --profile glimpse-staging --region eu-west-2 \
  --filters "Name=tag:Name,Values=tracker-restapi-staging-private-*" \
  --query 'Subnets[0].SubnetId' --output text)
aws ecs run-task \
  --cluster tracker-restapi-staging-cluster \
  --task-definition tracker-restapi-staging-migrations \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[sg-097f9e121ce42c166],assignPublicIp=DISABLED}" \
  --profile glimpse-staging --region eu-west-2
```
Watch the log group `/aws/ecs/tracker-restapi-staging/migrations` — confirm `Running upgrade` lines and a clean exit.
Step 5: Rebuild and push the frontend image (in parallel with Steps 2–4)
The code fix is already committed. Build, tag with something meaningful (not `sha-bootstrap`), push to ECR, then either update the `frontend` tag in `image-tags.auto.tfvars.json` and re-apply, or force a deploy pointing at the new tag.
Step 6: Force redeploy the currently-running services
After migrations complete, these services need to restart to pick up the new schema:
```bash
for svc in api notification-service materialized-view-service unified-geofence; do
  aws ecs update-service \
    --cluster tracker-restapi-staging-cluster \
    --service "tracker-restapi-staging-$svc" \
    --force-new-deployment \
    --profile glimpse-staging --region eu-west-2
done
```
Step 7: Verify each service (Fix 8 checklist)
Work through the Fix 8 checklist. The two most uncertain outcomes at this point will be `anisette` (unknown app behaviour once it actually starts past the EFS mount) and `tracker-fetcher-2` (needs anisette reachable via Cloud Map service discovery).
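The per-service counts can be checked programmatically from `aws ecs describe-services` output. A sketch — the sample payload below is invented for illustration; in practice, load the real JSON emitted by the CLI:

```python
import json

# Invented sample of `aws ecs describe-services` output.
sample = json.loads("""
{"services": [
  {"serviceName": "tracker-restapi-staging-api", "desiredCount": 1, "runningCount": 1},
  {"serviceName": "tracker-restapi-staging-admin", "desiredCount": 1, "runningCount": 0}
]}
""")

def unstable(payload):
    """Return names of services whose runningCount lags desiredCount."""
    return [s["serviceName"] for s in payload["services"]
            if s["runningCount"] < s["desiredCount"]]

print(unstable(sample))  # ['tracker-restapi-staging-admin']
```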
Changes Applied
| File / Component | Change | Status |
|---|---|---|
| `infra/modules/security/main.tf` | Added `lifecycle { ignore_changes = [egress, ingress] }` to all SGs | ✅ Applied |
| `infra/modules/alb/main.tf` | API health check path `/api/v1/health` → `/health` | ✅ Applied |
| `infra/envs/staging/main.tf` | Added `aws_vpc_endpoint.efs` (conditioned on workers/anisette) | ✅ Applied |
| `alembic/env.py` | `quote_plus` f-string → `URL.create()` for psycopg3 `@`-in-password fix | ✅ In api image |
| `app/core/config.py` | Same `URL.create()` fix | ✅ In api image |
| `app/core/database.py` | `URL.create()` + `.set(drivername=...)` for async engine | ✅ In api image |
| `services/shared/config.py` | `URL.create()` fix; `PSYCOPG_DATABASE_URI` removed (needs restoring — see Fix 11) | ⚠️ Not deployed |
| `services/shared/database.py` | `URL.create()` fix | ⚠️ Not deployed |
| `tracker-frontend/nginx.conf` | VPC resolver + variable `proxy_pass` (`__API_URL__`) | ⚠️ Not deployed (`sha-bootstrap` running) |
| `tracker-frontend/Dockerfile` | `--chown=nginx:nginx --chmod=644` so the nginx user can sed the conf | ⚠️ Not deployed |
| `tracker-admin/nginx.conf` | VPC resolver + variable `proxy_pass` (`__API_URL__`) | ⚠️ Not deployed |
| `infra/envs/staging/terraform.tfvars` | Image tags reverted to `sha-bootstrap`; `admin_hostname` added | ✅ Current |
Drift: the API ECS task definition runs `sha-b04415821ec3-4` but `terraform.tfvars` says `sha-bootstrap` — see Fix 10.
Log Locations
```
/aws/ecs/tracker-restapi-staging/api
/aws/ecs/tracker-restapi-staging/frontend
/aws/ecs/tracker-restapi-staging/admin
/aws/ecs/tracker-restapi-staging/anisette
/aws/ecs/tracker-restapi-staging/tracker-fetcher-2
/aws/ecs/tracker-restapi-staging/unified-geofence
/aws/ecs/tracker-restapi-staging/notification-service
/aws/ecs/tracker-restapi-staging/materialized-view-service
/aws/ecs/tracker-restapi-staging/migrations
```