Connection Resilience Analysis - Tracker Fetcher Service

Problem Statement

The tracker-fetcher service experiences connection failures to Redis (port 6379) and PostgreSQL (port 5432) at 192.168.100.1, demonstrating exactly why the Task Management System Replacement is critical. Current issues include:

  • Complete service stalls during connection outages
  • No automatic retry mechanisms during failures
  • Loss of task processing capability
  • Manual intervention required for recovery
  • These failures are the "unreliable behaviour of Celery" cited in the task management replacement document

Alignment with Task Management Replacement

This analysis supports the planned migration from Celery to TaskiQ with resilient architecture:

  • Current Celery Issues: Connection failures cause complete service stalls
  • TaskiQ Solution: Circuit breakers and automatic recovery mechanisms
  • Database-Driven Queues: Eliminate Redis queue state that becomes stale during outages
  • Distributed Locking: Prevent duplicate processing during connection recovery
  • Multi-Database Redis: Isolate task queues from cache operations

Current Architecture Issues

1. Connection Handling Gaps

Redis Connection Issues:

  • TaskiQRedisManager creates connections but lacks retry logic during failures
  • Health checks fail silently without triggering reconnection attempts
  • No circuit breaker pattern for degraded connections
  • Connection timeouts are too short (5 seconds) for network instability

Database Connection Issues:

  • get_db_session() has retry logic, but only for the initial connection
  • No retry mechanism for mid-operation failures
  • Connection pool doesn't handle network partitions gracefully
  • Pool pre-ping helps but doesn't cover all failure scenarios (see the retry sketch below)
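
A minimal sketch of the missing mid-operation retry, assuming SQLAlchemy with pool_pre_ping enabled; the connection URL and the run_with_retry helper are illustrative assumptions, not the actual get_db_session() implementation:

```python
import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

# Illustrative URL; the real service points at PostgreSQL on 192.168.100.1:5432.
DATABASE_URL = "postgresql+psycopg2://user:pass@192.168.100.1:5432/trackers"

# pool_pre_ping validates a connection before it is handed out, but it does not
# retry a statement that fails mid-operation -- that is what the wrapper adds.
engine = create_engine(DATABASE_URL, pool_pre_ping=True, pool_recycle=1800)

def run_with_retry(statement: str, attempts: int = 3, base_delay: float = 0.5):
    """Run a statement, retrying on transient connection errors with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            with engine.begin() as conn:
                return conn.execute(text(statement)).fetchall()
        except OperationalError:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```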

2. Health Monitoring Limitations

SimpleHealthMonitor Issues:

  • Reports failures but doesn't trigger recovery actions
  • The 5-minute heartbeat interval is too long for critical failures
  • No escalation mechanism for persistent failures
  • Health checks are passive, not proactive

3. Task Queue Resilience

Current Behavior:

  • Tasks stop processing when connections fail
  • No automatic task retry with exponential backoff
  • Queue state lost during Redis outages
  • No graceful degradation modes

Resilience Gaps Identified

Critical Issues

  1. No Connection Recovery Loop: Services don't attempt to reconnect automatically
  2. No Circuit Breaker Pattern: Failed connections aren't isolated and retried intelligently
  3. No Graceful Degradation: Services fail completely instead of operating in reduced capacity
  4. No Task Persistence: Queue state lost during Redis failures
  5. No Health-Based Recovery: Health monitoring is informational only

Major Issues

  1. Short Connection Timeouts: 5-second timeouts are too aggressive for unstable networks
  2. No Exponential Backoff: Retry attempts don't use intelligent spacing
  3. No Connection Pooling Resilience: Pool doesn't handle network partitions
  4. No Service Mesh Integration: No external load balancing or failover

Recommended Resilience Improvements

1. Implement TaskiQ Resilient Connection Manager

Create a connection manager following the task management replacement architecture (a minimal sketch follows the list below):

  • Circuit breaker pattern for failed connections (as specified in replacement doc)
  • Exponential backoff with jitter for retry logic
  • Multi-database Redis support (DB 0: Cache, DB 1: Tasks, DB 2: Health, DB 3: Notifications)
  • Cluster compatibility for AWS Valkey clusters
  • Connection health monitoring with automatic recovery
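
A minimal sketch of the circuit breaker plus exponential backoff with jitter, using redis.asyncio; the class name, thresholds, and timeouts are illustrative assumptions, not the existing TaskiQRedisManager:

```python
import asyncio
import random
import time

import redis.asyncio as redis

class ResilientRedisConnection:
    """Illustrative circuit breaker plus exponential backoff with jitter."""

    def __init__(self, url: str, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self._client = redis.Redis.from_url(url, socket_connect_timeout=10)
        self._failures = 0
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._opened_at = None  # circuit-open timestamp, None when closed

    def _circuit_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After reset_timeout the circuit goes half-open and one probe is allowed.
        return time.monotonic() - self._opened_at < self._reset_timeout

    async def healthy(self, attempts: int = 5) -> bool:
        """PING with retries; open the circuit after repeated failures."""
        if self._circuit_open():
            return False
        for attempt in range(attempts):
            try:
                await self._client.ping()
                self._failures = 0
                self._opened_at = None
                return True
            except (redis.ConnectionError, redis.TimeoutError):
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = time.monotonic()
                    return False
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(min(2 ** attempt, 30) + random.uniform(0, 1))
        return False
```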

2. Database-Driven Queue Management

Replace the complex Redis queues with a database-driven approach (see the sketch after this list):

  • Single source of truth: Database contains all tracker eligibility
  • Distributed locking: Redis SET NX EX for container coordination
  • Fair processing: Order by last_processed_at ASC (oldest first)
  • Automatic sync: Queue reflects database changes immediately
  • No queue state: Eliminate Redis queue state that can become stale
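
A minimal sketch of the distributed lock (Redis SET NX EX) and the fair-ordering query; the key names, table, columns, and TTL are illustrative assumptions:

```python
import redis.asyncio as redis

# DB 1 is reserved for task coordination in the proposed multi-database layout.
r = redis.Redis.from_url("redis://192.168.100.1:6379/1")

LOCK_TTL_SECONDS = 120  # illustrative lock lifetime

async def try_claim_tracker(tracker_id: str, worker_id: str) -> bool:
    """SET NX EX: only one container can hold the lock for a tracker at a time."""
    return bool(await r.set(f"lock:tracker:{tracker_id}", worker_id,
                            nx=True, ex=LOCK_TTL_SECONDS))

async def release_tracker(tracker_id: str, worker_id: str) -> None:
    """Release only if we still own the lock (a Lua script would make this atomic)."""
    key = f"lock:tracker:{tracker_id}"
    if (await r.get(key)) == worker_id.encode():
        await r.delete(key)

# Fair processing: the database, not Redis, decides what runs next.
ELIGIBLE_TRACKERS_SQL = """
SELECT id
FROM trackers
WHERE enabled = true
ORDER BY last_processed_at ASC NULLS FIRST
LIMIT 10;
"""
```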

3. Enhanced Health Monitoring (30-second intervals)

Upgrade health monitoring per the replacement document requirements (see the sketch after this list):

  • 30-second heartbeat intervals (reduced from 5 minutes)
  • Active recovery triggers instead of passive monitoring
  • Automatic failure detection and recovery initiation
  • Health-based service restart capabilities
  • Integration with container orchestration
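
A minimal sketch of an active 30-second health loop that initiates recovery instead of only reporting; check_redis, check_database, and recover are hypothetical callables standing in for the real checks and recovery actions:

```python
import asyncio
import logging

logger = logging.getLogger("health")

HEARTBEAT_SECONDS = 30          # down from the current 5-minute interval
FAILURES_BEFORE_RECOVERY = 2    # escalate after consecutive failures

async def health_loop(check_redis, check_database, recover):
    """Active monitoring: consecutive failures trigger recovery, not just a log line."""
    consecutive_failures = 0
    while True:
        redis_ok = await check_redis()
        db_ok = await check_database()
        if redis_ok and db_ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            logger.warning("health check failed (redis=%s, db=%s)", redis_ok, db_ok)
            if consecutive_failures >= FAILURES_BEFORE_RECOVERY:
                # Recovery action: reconnect, restart workers, or exit so the
                # container orchestrator restarts the service.
                await recover()
                consecutive_failures = 0
        await asyncio.sleep(HEARTBEAT_SECONDS)
```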

4. Persistent Task Storage

Implement TaskiQ persistent queues (a dead-letter handling sketch follows the list below):

  • Tasks survive Redis restarts without loss
  • State tracking across service restarts
  • Recovery on startup from last known state
  • Dead letter queues for failed tasks
  • Task checkpointing for long-running operations
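
A minimal sketch of dead-letter handling so failed work is preserved for replay rather than lost; the queue key, payload shape, and retry limit are illustrative assumptions:

```python
import json

import redis.asyncio as redis

# DB 1 holds task state in the proposed layout; the dead-letter key is illustrative.
r = redis.Redis.from_url("redis://192.168.100.1:6379/1")
DEAD_LETTER_KEY = "tasks:dead_letter"
MAX_RETRIES = 3

async def run_task(task_name: str, payload: dict, handler) -> None:
    """Execute a task; after MAX_RETRIES failures, park it in the dead-letter queue."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            await handler(payload)
            return
        except Exception as exc:  # broad catch for illustration only
            if attempt == MAX_RETRIES:
                # The payload survives in Redis (or the database) for later replay.
                await r.rpush(DEAD_LETTER_KEY, json.dumps({
                    "task": task_name,
                    "payload": payload,
                    "error": str(exc),
                }))
```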

Implementation Priority (Aligned with TaskiQ Migration)

Phase 1: TaskiQ Infrastructure Setup (Immediate)

  • Implement multi-database Redis configuration (DB 0-3 separation; see the configuration sketch after this list)
  • Create TaskiQ resilient connection manager with circuit breakers
  • Add exponential backoff retry logic with jitter
  • Implement distributed locking mechanism (Redis SET NX EX)
  • Upgrade health monitoring to 30-second intervals
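
A minimal configuration sketch for the DB 0-3 separation; the host and variable names are illustrative:

```python
import redis.asyncio as redis

REDIS_HOST = "192.168.100.1"  # illustrative; matches the failing endpoint above

# One logical database per concern, so task queues, health data, and
# notifications are isolated from cache operations.
REDIS_URLS = {
    "cache":         f"redis://{REDIS_HOST}:6379/0",
    "tasks":         f"redis://{REDIS_HOST}:6379/1",
    "health":        f"redis://{REDIS_HOST}:6379/2",
    "notifications": f"redis://{REDIS_HOST}:6379/3",
}

clients = {name: redis.Redis.from_url(url) for name, url in REDIS_URLS.items()}
```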

Phase 2: Database-Driven Queue Implementation (Short-term)

  • Replace Redis queue state with database-driven approach
  • Implement fair processing algorithm (last_processed_at ordering)
  • Add automatic CRUD operation sync (no manual queue refresh)
  • Create persistent task storage for Redis outages
  • Implement graceful degradation modes

Phase 3: Service Migration to TaskiQ (Medium-term)

  • Migrate tracker fetcher service to TaskiQ
  • Implement APScheduler with a database-backed job store (see the sketch after this list)
  • Add comprehensive failure recovery patterns
  • Create advanced monitoring dashboards
  • Complete Celery removal and cleanup
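
A minimal sketch of APScheduler 3.x with a SQLAlchemy job store so schedules persist in PostgreSQL across restarts; the job body and connection URL are illustrative assumptions:

```python
import asyncio

from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler

# Jobs persist in PostgreSQL, so schedules survive service restarts.
JOBSTORES = {
    "default": SQLAlchemyJobStore(
        url="postgresql+psycopg2://user:pass@192.168.100.1:5432/trackers"
    ),
}

async def enqueue_tracker_fetch() -> None:
    """Illustrative job body: kick off the TaskiQ fetch task for due trackers."""
    ...

async def main() -> None:
    scheduler = AsyncIOScheduler(jobstores=JOBSTORES)
    # replace_existing avoids duplicate jobs when a container restarts.
    scheduler.add_job(enqueue_tracker_fetch, "interval", seconds=60,
                      id="tracker-fetch", replace_existing=True)
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive

if __name__ == "__main__":
    asyncio.run(main())
```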

Success Metrics (Per Task Management Replacement Requirements)

Reliability Metrics

  • Zero Data Loss: 100% of Apple location reports reach location_history
  • Queue Persistence: 100% of tasks survive service restarts
  • Recovery Time: Services resume within 30 seconds of restart
  • Failure Recovery: 95% of failures recover automatically without intervention

Performance Metrics

  • Tracker Fetching: 99.9% success rate for location report fetching
  • Connection Recovery: Automatic recovery within 30 seconds of connection restoration
  • Uptime: 99.9% service availability during network issues
  • Task Loss: Zero task loss during connection failures

Operational Metrics

  • Manual Interventions: Reduce to <1 per month (eliminate manual service restarts)
  • System Downtime: <0.1% unplanned downtime
  • Alert Fatigue: <5 false positive alerts per week
  • Recovery Speed: 95% of issues resolve within 5 minutes

Next Steps

This connection resilience analysis directly supports the Task Management System Replacement initiative. The immediate priority should be:

  1. Accelerate TaskiQ Migration: The connection issues demonstrate why Celery replacement is critical
  2. Implement Resilient Connection Manager: Create the foundation for reliable connections
  3. Database-Driven Queues: Eliminate Redis queue state that fails during outages
  4. Multi-Database Redis: Isolate task operations from cache operations

The current connection failures are a symptom of the broader architectural issues that the TaskiQ migration will resolve.