Task Management System Replacement

Executive Summary

Replace the unreliable Celery-based task management system with a resilient, cluster-compatible architecture that guarantees zero data loss and automatic recovery. This document outlines the requirements, proposed solution, and reliability guarantees for a mission-critical tracker data pipeline.

Problem Statement

Current Celery Issues

  • Frequent task failures in tracker fetching, geocoding, and geofencing
  • Queue stalls requiring manual intervention
  • Redis cluster incompatibility forcing standalone Redis/Dragonfly usage
  • No automatic recovery from failures
  • Cache pollution - redis-cli FLUSHALL affects all systems
  • Poor observability - difficult to diagnose failures

Smart Queue System Failures

  • Queue stops processing - Fetcher completely stops fetching reports from Apple
  • Unfair device processing - Some trackers never get processed while others are over-processed
  • Database sync issues - Queue doesn't update when trackers are added/updated/deleted
  • CRUD operation blindness - Queue ignores database changes from imports and updates
  • Production run changes ignored - Queue doesn't reflect tracker eligibility changes
  • Multi-tier queue unreliability - Complex queue logic fails unpredictably
  • No horizontal scaling - Cannot run multiple fetcher containers safely

Horizontal Scaling Requirements

  • 10,000+ tracker capacity - System must handle large-scale deployments
  • 5-minute fetch interval - Each container issues at most one Apple API batch request every 5 minutes to respect Apple's rate limits
  • 12-hour processing cycle - Every tracker processed at least once per 12 hours
  • 64-tracker batch size - Optimal Apple API batch size for efficiency
  • Container independence - Multiple containers must work without conflicts
  • Distributed locking - Prevent duplicate processing across containers
  • Fair processing guarantee - Every tracker gets equal processing opportunity

Business Impact

  • Risk of data loss anywhere along the pipeline from Apple location reports to the frontend
  • Manual intervention required for system recovery
  • Production instability with AWS Valkey clusters
  • Operational overhead from constant troubleshooting

Reliability Requirements

Core Guarantees

  1. Zero Data Loss: Apple location reports → location_history → frontend pipeline must be 100% reliable
  2. Queue Persistence: Tasks survive service restarts without loss
  3. Automatic Recovery: System self-heals from failures without manual intervention
  4. Cluster Compatibility: Native support for AWS Valkey clusters
  5. Service Isolation: Cache flushes don't affect task queues or health monitoring
  6. Inter-Service Resilience: Service-to-service communication survives failures

Operational Requirements

  • Pause/Resume: Services can be stopped and restarted without queue loss
  • Failure Detection: Automatic identification of stalled or failed processes
  • Recovery Automation: Failed tasks automatically retry with exponential backoff
  • Data Integrity: Materialized views automatically recreated on corruption
  • No Backfill Needed: New data flows reliably; backfill is required only for legacy data

Proposed Architecture

Technology Stack

  • TaskiQ: Redis cluster-native task queue (replaces Celery)
  • APScheduler: Persistent scheduling with database backend
  • Multi-Database Redis: Separate databases for cache, tasks, health, notifications
  • Circuit Breakers: Automatic failure isolation for external APIs
  • Health Monitoring: Continuous system health with automatic alerts
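
For illustration, a minimal wiring of this stack is sketched below. Class names follow taskiq-redis and APScheduler 3.x and should be verified against the installed versions; connection URLs and the task body are placeholders, not the final implementation.

```python
# Sketch only: class names are from taskiq-redis and APScheduler 3.x and should be
# verified against the installed versions; URLs and the task body are placeholders.
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from taskiq_redis import ListQueueBroker

# Task queue lives in Redis DB 1 (see "Redis Database Separation" below).
broker = ListQueueBroker("redis://redis-host:6379/1")

@broker.task
async def fetch_tracker_batch(tracker_ids: list[int]) -> None:
    """Fetch one 64-tracker batch of location reports from Apple (placeholder body)."""
    ...

# Schedules persist in the application database, so they survive service restarts.
scheduler = AsyncIOScheduler(
    jobstores={"default": SQLAlchemyJobStore(url="postgresql://app-db/scheduler")}
)
```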

Smart Queue Replacement Strategy

Database-Driven Queue Management

Replace the complex multi-tier Redis queues with a simple database-driven approach:

  • Single Source of Truth: Database contains all tracker eligibility and timing
  • Simple Round-Robin: Fair processing based on last_processed_at timestamp
  • Automatic Sync: Queue reflects database changes immediately
  • No Queue State: Eliminate Redis queue state that can become stale
  • Distributed Locking: Redis-based locks prevent duplicate processing across containers

Fair Processing Algorithm

  1. Query the database for eligible trackers (within production run dates, not currently locked)
  2. Order by last_processed_at ASC (oldest first)
  3. Apply a distributed lock using Redis SET NX EX
  4. Process a batch of 64 trackers
  5. Update the last_processed_at timestamp
  6. Release the locks (the full loop is sketched below)
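
A minimal sketch of this loop, assuming a PostgreSQL-style trackers table, redis-py for the locks, and a hypothetical process_batch() helper; table and column names are illustrative, not the final schema.

```python
# Illustrative sketch of the fair-processing loop. Table/column names, connection
# strings, and process_batch() are assumptions, not the final implementation.
import time
import uuid

import psycopg2
import redis

BATCH_SIZE = 64
LOCK_TTL = 1800  # 30 minutes
CONTAINER_ID = f"container-{uuid.uuid4().hex[:8]}"

r = redis.Redis(host="redis-host", port=6379, db=1)
db = psycopg2.connect("dbname=trackers")

def run_cycle() -> None:
    # Steps 1-2: eligible trackers, oldest first; NULL (never processed) sorts first.
    with db.cursor() as cur:
        cur.execute(
            """
            SELECT id FROM trackers
            WHERE production_run_start <= now() AND production_run_end >= now()
            ORDER BY last_processed_at ASC NULLS FIRST
            LIMIT %s
            """,
            (BATCH_SIZE * 4,),  # over-fetch so trackers locked by other containers can be skipped
        )
        candidates = [row[0] for row in cur.fetchall()]

    # Step 3: distributed lock via SET NX EX; a busy lock means skip, never block.
    batch = []
    for tracker_id in candidates:
        if r.set(f"tracker_lock:{tracker_id}", f"{CONTAINER_ID}:{int(time.time())}",
                 nx=True, ex=LOCK_TTL):
            batch.append(tracker_id)
        if len(batch) == BATCH_SIZE:
            break

    try:
        process_batch(batch)  # Step 4: hypothetical Apple fetch + downstream publishing
        # Step 5: mark as processed so every container deprioritizes these trackers.
        with db.cursor() as cur:
            cur.execute(
                "UPDATE trackers SET last_processed_at = now() WHERE id = ANY(%s)",
                (batch,),
            )
        db.commit()
    finally:
        # Step 6: release locks; the TTL covers the crash case.
        for tracker_id in batch:
            r.delete(f"tracker_lock:{tracker_id}")
```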

Horizontal Scaling Architecture

Multiple independent containers process trackers fairly:

  • Container Independence: Each container queries database independently
  • Distributed Locking: Redis locks prevent duplicate processing
  • Lock Timeout: A 30-minute TTL prevents stuck locks
  • Fair Distribution: Oldest-first processing ensures fairness
  • No Coordination: Containers don't need to communicate with each other

Database Change Integration

Automatic queue updates for all CRUD operations:

  • Tracker Creation: New trackers immediately eligible (last_processed_at = NULL)
  • Tracker Updates: Changes reflected in next database query
  • Bulk Imports: All imported trackers immediately available
  • Production Run Changes: Eligibility changes reflected immediately
  • Tracker Deletion: Removed from processing automatically

Distributed Locking Mechanism

Redis-based distributed locks prevent duplicate processing:

Lock Key Pattern: "tracker_lock:{tracker_id}"
Lock Value: "{container_id}:{timestamp}"
Lock TTL: 1800 seconds (30 minutes)
Lock Command: SET tracker_lock:123 container-1:1641234567 NX EX 1800

Lock Lifecycle:

  1. Acquire: Container attempts to acquire lock before processing
  2. Process: Only lock holder can process the tracker
  3. Update: Update last_processed_at timestamp in database
  4. Release: Explicitly delete lock key on completion
  5. Timeout: Lock auto-expires after 30 minutes if container fails

Lock Failure Handling:

  • Lock Busy: Skip tracker, try next one (no blocking)
  • Lock Timeout: Automatic cleanup prevents stuck locks
  • Container Crash: Lock expires, tracker becomes available
  • Network Partition: Lock TTL ensures eventual consistency
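
A lock helper in this spirit can be sketched with redis-py as below; the Lua compare-and-delete on release is an assumption of this sketch, added so a container never deletes a lock it no longer owns (for example after its TTL expired mid-processing).

```python
# Sketch: ownership-checked release, so an expired and reacquired lock is never
# deleted by its previous holder. redis-py client; host is a placeholder.
import redis

r = redis.Redis(host="redis-host", port=6379, db=1)

RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
else
    return 0
end
"""

def acquire(tracker_id: int, owner: str, ttl: int = 1800) -> bool:
    # Equivalent to: SET tracker_lock:<id> <owner> NX EX <ttl>
    return bool(r.set(f"tracker_lock:{tracker_id}", owner, nx=True, ex=ttl))

def release(tracker_id: int, owner: str) -> bool:
    # Delete the key only if this container still owns it.
    return bool(r.eval(RELEASE_SCRIPT, 1, f"tracker_lock:{tracker_id}", owner))
```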

Elimination of Queue Invalidation

The current Redis pub/sub invalidation system becomes unnecessary:

  • No Queue State: Database is the only source of truth
  • No Invalidation Messages: CRUD operations don't need to notify services
  • No Race Conditions: Database changes are immediately visible
  • No Stale State: Locks prevent processing of deleted/changed trackers
  • No Manual Refresh: System self-corrects on every query

Simplified CRUD Operations:

  • Create tracker → Immediately available for processing
  • Update tracker → Changes visible on next database query
  • Delete tracker → Automatically excluded from processing
  • Bulk import → All trackers immediately available
  • Production run changes → Eligibility updated automatically

Capacity Planning

Processing 10,000 trackers with 2 containers (the arithmetic is sketched after this list):

  • Processing Rate: 64 trackers per 5 minutes per container = 768 trackers/hour per container
  • Total Capacity: 1,536 trackers/hour with 2 containers
  • 12-Hour Cycle: 18,432 tracker processing slots per 12 hours
  • Overhead Factor: 1.8x capacity provides buffer for retries and failures
  • Scalability: Add more containers linearly to increase capacity
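
The figures above follow directly from the batch size and fetch interval; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the capacity figures above.
batch_size = 64       # trackers per Apple API call
interval_min = 5      # minutes between calls per container
containers = 2
cycle_hours = 12
fleet_size = 10_000

per_container_per_hour = batch_size * (60 // interval_min)  # 768
total_per_hour = per_container_per_hour * containers        # 1,536
slots_per_cycle = total_per_hour * cycle_hours              # 18,432
overhead = slots_per_cycle / fleet_size                     # ~1.84x buffer

print(per_container_per_hour, total_per_hour, slots_per_cycle, round(overhead, 2))
```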

Service Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Tracker       │     │   Geocoding     │     │   Unified       │
│   Fetcher       │────▶│   Service       │────▶│   Geofence      │
│                 │     │                 │     │   Service       │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                    TaskiQ Task Manager                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐│
│  │   DB 0      │ │   DB 1      │ │   DB 2      │ │   DB 3       ││
│  │   Cache     │ │   Tasks     │ │   Health    │ │ Notifications││
│  └─────────────┘ └─────────────┘ └─────────────┘ └──────────────┘│
└──────────────────────────────────────────────────────────────────┘

Redis Database Separation

  • DB 0: Application cache (isolated from task operations)
  • DB 1: Task queues and job state (persistent, cluster-compatible)
  • DB 2: Health monitoring and heartbeats (separate from cache)
  • DB 3: Real-time notifications and WebSocket data
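
As a sketch, this separation can be expressed as one client per logical database. Hostnames are placeholders, and the numbered databases assume a standalone or non-cluster endpoint where they are selected via the URL path.

```python
# Sketch: one redis-py client per logical database; host and port are placeholders.
import redis

REDIS_URL = "redis://redis-host:6379"

cache = redis.Redis.from_url(f"{REDIS_URL}/0")          # application cache
tasks = redis.Redis.from_url(f"{REDIS_URL}/1")          # task queues and job state
health = redis.Redis.from_url(f"{REDIS_URL}/2")         # heartbeats and health monitoring
notifications = redis.Redis.from_url(f"{REDIS_URL}/3")  # real-time notifications

# A FLUSHDB issued against the cache client leaves tasks, health, and notifications untouched.
```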

Resilience Mechanisms

Queue Persistence

  • Persistent Task Storage: Tasks stored in Redis with TTL management
  • State Tracking: Each task state persisted across service restarts
  • Recovery on Startup: Services resume processing from last known state
  • Dead Letter Queues: Failed tasks moved to separate queue for analysis

Failure Detection & Recovery

  • Health Checks: Continuous monitoring of all service components
  • Heartbeat System: Services report status every 30 seconds
  • Stall Detection: Automatic identification of hung processes
  • Circuit Breakers: External API failures don't cascade to entire system
  • Exponential Backoff: Failed tasks retry with increasing delays
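
A generic retry wrapper along these lines (plain Python, not tied to any particular TaskiQ middleware) illustrates the exponential-backoff behaviour; the attempt count and delay bounds are illustrative.

```python
# Sketch: retry with exponential backoff and jitter; the limits are illustrative
# and not tied to any specific TaskiQ middleware.
import asyncio
import random

async def retry_with_backoff(task, *, attempts: int = 5,
                             base_delay: float = 1.0, max_delay: float = 300.0):
    """Run `task` (an async callable), retrying on failure with growing delays."""
    for attempt in range(attempts):
        try:
            return await task()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted: let the caller dead-letter the task
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```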

Inter-Service Communication

  • Message Durability: Service-to-service messages persist until acknowledged
  • Retry Logic: Failed communications automatically retry
  • Timeout Handling: Services don't wait indefinitely for responses
  • Graceful Degradation: Services continue operating when dependencies fail

Data Flow Integrity

Critical Data Pipeline

Apple FindMy API → Tracker Fetcher → Location Reports → Geocoding Service
                                                               ↓
Frontend ← Location History ← Materialized Views ← Unified Geofence Service

Integrity Guarantees

  1. Atomic Operations: Location report creation and geofence detection in single transaction
  2. Idempotent Processing: Duplicate messages don't create duplicate data
  3. Ordered Processing: Location reports processed in chronological order
  4. Consistency Checks: Materialized views validated and recreated if corrupted
  5. Audit Trail: All data transformations logged for troubleshooting
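
As an example of guarantees 1 and 2 together, the location insert and the geofence check can share one transaction, with a uniqueness constraint making duplicate Apple reports a no-op; the table layout, the constraint, and the detect_geofence_events() procedure are assumptions for illustration.

```python
# Sketch: location insert + geofence detection in one transaction; the unique
# constraint on (tracker_id, reported_at) and detect_geofence_events() are assumptions.
import psycopg2

def store_report(conn, tracker_id: int, report: dict) -> None:
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO location_history (tracker_id, reported_at, latitude, longitude)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (tracker_id, reported_at) DO NOTHING
                """,
                (tracker_id, report["timestamp"], report["lat"], report["lon"]),
            )
            if cur.rowcount:  # only evaluate geofences for genuinely new reports
                cur.execute(
                    "SELECT detect_geofence_events(%s, %s, %s)",  # hypothetical procedure
                    (tracker_id, report["lat"], report["lon"]),
                )
```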

Materialized View Management

  • Automatic Refresh: Hourly refresh with failure detection
  • Corruption Detection: Data integrity checks before and after refresh
  • Automatic Recreation: Corrupted views automatically rebuilt from source data
  • Rollback Capability: Failed refreshes don't leave system in broken state
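
A refresh routine along these lines shows the fallback path; the view name, the empty-view check, and the rebuild_latest_locations_sql() helper are assumptions of the sketch.

```python
# Sketch: refresh with a basic integrity check, falling back to recreation.
# The view name, the empty-view check, and rebuild_latest_locations_sql() are assumptions.
import psycopg2

def refresh_view(conn, view: str = "tracker_latest_locations") -> None:
    try:
        with conn, conn.cursor() as cur:  # transaction; rolled back if anything raises
            cur.execute(f"REFRESH MATERIALIZED VIEW {view}")
            cur.execute(f"SELECT count(*) FROM {view}")
            if cur.fetchone()[0] == 0:
                raise RuntimeError(f"{view} is empty after refresh")
    except Exception:
        with conn, conn.cursor() as cur:  # recreate from source data in a fresh transaction
            cur.execute(f"DROP MATERIALIZED VIEW IF EXISTS {view}")
            cur.execute(rebuild_latest_locations_sql())  # hypothetical CREATE ... AS SELECT ...
```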

Failure Scenarios & Recovery

Service Restart Scenarios

Scenario                     Current Behavior                     New Behavior
Tracker Fetcher Restart      Queue lost, manual refresh needed    Queue persists, automatic resume
Geocoding Service Restart    In-progress batches lost             Batches resume from checkpoint
Redis Restart                All queues lost                      Tasks persist, automatic reconnection
Full System Restart          Manual queue rebuild required        Automatic recovery from persistent state

Failure Recovery Patterns

  • Graceful Shutdown: Services complete current tasks before stopping
  • State Persistence: All task state saved before shutdown
  • Automatic Resume: Services pick up where they left off on restart
  • Dependency Recovery: Services wait for dependencies to become available
  • Health Reporting: System reports recovery status to monitoring
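
A graceful-shutdown handler in this spirit (asyncio-based; process_one_batch() and save_checkpoint() are hypothetical units of work) lets a worker finish its in-flight batch before exiting:

```python
# Sketch: finish the in-flight batch on SIGTERM instead of dropping it.
# process_one_batch() and save_checkpoint() are hypothetical units of work.
import asyncio
import signal

shutdown = asyncio.Event()

async def worker_loop() -> None:
    while not shutdown.is_set():
        await process_one_batch()  # completes before the shutdown flag is re-checked
    await save_checkpoint()        # persist state so the next start can resume automatically

async def main() -> None:
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, shutdown.set)
    await worker_loop()

if __name__ == "__main__":
    asyncio.run(main())
```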

Service Communication Patterns

Reliable Messaging

  • Message Acknowledgment: Recipients confirm message processing
  • Retry Mechanisms: Failed messages automatically retried
  • Dead Letter Handling: Permanently failed messages moved to error queue
  • Ordering Guarantees: Related messages processed in correct order

Event-Driven Architecture

  • Location Report Events: Tracker fetcher publishes new location events
  • Geocoding Events: Geocoding service publishes completion events
  • Geofence Events: Unified geofence service publishes status changes
  • Health Events: All services publish health status changes

Queue Persistence Strategy

Task State Management

  • Persistent Queues: All queues survive Redis restarts
  • Task Checkpointing: Long-running tasks save progress periodically
  • Recovery Points: Services can resume from last checkpoint
  • State Validation: Task state verified on service startup
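
Checkpointing can be as simple as periodically writing progress into the task database (DB 1) under a stable key; the key layout and payload below are assumptions:

```python
# Sketch: periodic checkpoint for a long-running geocoding batch; key and payload are assumptions.
import json
import redis

r = redis.Redis(host="redis-host", port=6379, db=1)

def save_checkpoint(batch_id: str, last_index: int) -> None:
    r.set(f"checkpoint:geocode:{batch_id}",
          json.dumps({"last_index": last_index}),
          ex=24 * 3600)  # expires once well past any realistic resume window

def load_checkpoint(batch_id: str) -> int:
    raw = r.get(f"checkpoint:geocode:{batch_id}")
    return json.loads(raw)["last_index"] if raw else 0
```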

Redis Cluster Compatibility

  • Hash Tags: All related keys use consistent hash tags so they map to the same cluster slot (see the sketch after this list)
  • Cluster-Aware Operations: No cross-slot operations that fail in clusters
  • Connection Pooling: Efficient connection management for cluster nodes
  • Failover Handling: Automatic reconnection when cluster nodes fail
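
For example, a {...} hash tag confines a tracker's related keys to one slot, so multi-key operations stay valid on a cluster; the key layout below is an illustrative assumption.

```python
# Sketch: the {...} hash tag forces all of a tracker's keys into the same cluster slot,
# because only the text inside the braces is hashed.
def tracker_keys(tracker_id: int) -> dict[str, str]:
    tag = f"{{tracker:{tracker_id}}}"
    return {
        "lock": f"{tag}:lock",
        "checkpoint": f"{tag}:checkpoint",
        "last_error": f"{tag}:last_error",
    }

# All three keys hash to one slot, so a MULTI/EXEC or Lua script touching them together
# does not fail with CROSSSLOT errors on a Valkey/Redis cluster.
print(tracker_keys(123))
```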

Monitoring & Health Checks

System Health Monitoring

  • Service Health: Each service reports detailed health status
  • Queue Health: Queue depth, processing rates, error rates monitored
  • Data Pipeline Health: End-to-end data flow monitoring
  • External Dependencies: Apple FindMy API, database, Redis cluster status
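
A heartbeat along these lines (using the 30-second period from the failure-detection section; the TTL choice and key names are assumptions) makes stall detection a matter of checking for a missing key:

```python
# Sketch: heartbeat written to the health database (DB 2); a missing key means the service stalled.
import asyncio
import time

import redis.asyncio as aioredis

health = aioredis.Redis(host="redis-host", port=6379, db=2)

async def heartbeat(service: str, period: int = 30) -> None:
    while True:
        # TTL of three periods: the key disappears if two consecutive heartbeats are missed.
        await health.set(f"heartbeat:{service}", int(time.time()), ex=period * 3)
        await asyncio.sleep(period)

async def is_stalled(service: str) -> bool:
    return await health.exists(f"heartbeat:{service}") == 0
```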

Alerting & Response

  • Automatic Alerts: Critical failures trigger immediate notifications
  • Escalation Procedures: Unresolved issues escalate to operations team
  • Recovery Actions: Automated recovery attempts before human intervention
  • Performance Metrics: Response times, throughput, error rates tracked

Success Criteria

Reliability Metrics

  • Zero Data Loss: 100% of Apple location reports reach location_history
  • Queue Persistence: 100% of tasks survive service restarts
  • Recovery Time: Services resume within 30 seconds of restart
  • Failure Recovery: 95% of failures recover automatically without intervention

Performance Metrics

  • Tracker Fetching: 99.9% success rate for location report fetching
  • Geocoding: 99% success rate for location geocoding
  • Geofencing: 99.9% success rate for geofence event creation
  • View Refresh: 100% success rate for materialized view updates

Operational Metrics

  • Manual Interventions: Reduce to <1 per month
  • System Downtime: <0.1% unplanned downtime
  • Alert Fatigue: <5 false positive alerts per week
  • Recovery Speed: 95% of issues resolve within 5 minutes

Implementation Phases

Phase 1: Infrastructure Setup

  • Multi-database Redis configuration
  • TaskiQ task manager implementation
  • APScheduler persistent scheduling
  • Health monitoring framework

Phase 2: Service Migration

  • Tracker fetcher service conversion
  • Geocoding service conversion
  • Unified geofence service conversion
  • Notification service conversion

Phase 3: Legacy Cleanup

  • Remove all Celery dependencies
  • Delete unused geofence_service
  • Clean up configuration files
  • Update documentation

Phase 4: Validation & Monitoring

  • End-to-end testing of data pipeline
  • Load testing with production data volumes
  • Monitoring dashboard deployment
  • Operational runbook creation

Risk Mitigation

Technical Risks

  • Redis Cluster Compatibility: Extensive testing with AWS Valkey clusters
  • Task Migration: Parallel running of old and new systems during transition
  • Data Consistency: Comprehensive testing of data pipeline integrity
  • Performance Impact: Load testing to ensure no performance degradation

Operational Risks

  • Team Training: Comprehensive training on new system operations
  • Rollback Plan: Ability to quickly revert to Celery if critical issues arise
  • Monitoring Gaps: Comprehensive monitoring to catch issues early
  • Documentation: Complete operational documentation for troubleshooting

Conclusion

This architectural replacement addresses all current reliability issues with Celery while providing the resilience guarantees required for a mission-critical tracker data pipeline. The proposed solution ensures zero data loss, automatic recovery, and cluster compatibility while eliminating the operational overhead of constant manual intervention.

The new system will provide the reliability foundation needed to scale the tracker system while maintaining data integrity from Apple location reports through to the frontend user experience.