Task Management System Replacement
Executive Summary
Replace the unreliable Celery-based task management system with a resilient, cluster-compatible architecture that guarantees zero data loss and automatic recovery. This document outlines the requirements, proposed solution, and reliability guarantees for a mission-critical tracker data pipeline.
Problem Statement
Current Celery Issues
- Frequent task failures in tracker fetching, geocoding, and geofencing
- Queue stalls requiring manual intervention
- Redis cluster incompatibility forcing standalone Redis/Dragonfly usage
- No automatic recovery from failures
- Cache pollution - `redis-cli FLUSHALL` affects all systems
- Poor observability - difficult to diagnose failures
Smart Queue System Failures
- Queue stops processing - Fetcher completely stops fetching reports from Apple
- Unfair device processing - Some trackers never get processed while others are over-processed
- Database sync issues - Queue doesn't update when trackers are added/updated/deleted
- CRUD operation blindness - Queue ignores database changes from imports and updates
- Production run changes ignored - Queue doesn't reflect tracker eligibility changes
- Multi-tier queue unreliability - Complex queue logic fails unpredictably
- No horizontal scaling - Cannot run multiple fetcher containers safely
Horizontal Scaling Requirements
- 10,000+ tracker capacity - System must handle large-scale deployments
- 5-minute API frequency - Each container must respect Apple's rate limits
- 12-hour processing cycle - Every tracker processed at least once per 12 hours
- 64-tracker batch size - Optimal Apple API batch size for efficiency
- Container independence - Multiple containers must work without conflicts
- Distributed locking - Prevent duplicate processing across containers
- Fair processing guarantee - Every tracker gets equal processing opportunity
Business Impact
- Data loss risk from Apple location reports to frontend
- Manual intervention required for system recovery
- Production instability with AWS Valkey clusters
- Operational overhead from constant troubleshooting
Reliability Requirements
Core Guarantees
- Zero Data Loss: Apple location reports → location_history → frontend pipeline must be 100% reliable
- Queue Persistence: Tasks survive service restarts without loss
- Automatic Recovery: System self-heals from failures without manual intervention
- Cluster Compatibility: Native support for AWS Valkey clusters
- Service Isolation: Cache flushes don't affect task queues or health monitoring
- Inter-Service Resilience: Service-to-service communication survives failures
Operational Requirements
- Pause/Resume: Services can be stopped and restarted without queue loss
- Failure Detection: Automatic identification of stalled or failed processes
- Recovery Automation: Failed tasks automatically retry with exponential backoff
- Data Integrity: Materialized views automatically recreated on corruption
- No Backfill Needed: New data flows reliably; backfill is required only for legacy data
Proposed Architecture
Technology Stack
- TaskiQ: Redis cluster-native task queue (replaces Celery)
- APScheduler: Persistent scheduling with database backend
- Multi-Database Redis: Separate databases for cache, tasks, health, notifications
- Circuit Breakers: Automatic failure isolation for external APIs
- Health Monitoring: Continuous system health with automatic alerts
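A minimal wiring sketch for the broker and scheduler is shown below, assuming the taskiq / taskiq-redis and APScheduler 3.x APIs; the class names, connection URLs, and 5-minute interval are illustrative and should be checked against the pinned library versions.

```python
# Wiring sketch; class names, URLs, and the interval are illustrative.
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from taskiq_redis import ListQueueBroker

# DB 1 is reserved for task queues and job state (see "Redis Database Separation").
broker = ListQueueBroker(url="redis://valkey.internal:6379/1")


@broker.task
async def fetch_tracker_batch() -> None:
    """Fetch and process the next eligible batch of trackers (body omitted)."""


async def enqueue_fetch() -> None:
    await fetch_tracker_batch.kiq()  # taskiq's enqueue call


# Persistent scheduling: jobs live in the database, so they survive restarts.
scheduler = AsyncIOScheduler(
    jobstores={"default": SQLAlchemyJobStore(url="postgresql://app:app@db/tracker")}
)
scheduler.add_job(enqueue_fetch, "interval", minutes=5,
                  id="fetch_batch", replace_existing=True)
```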
Smart Queue Replacement Strategy
Database-Driven Queue Management
Replace the complex multi-tier Redis queues with a simple database-driven approach:
- Single Source of Truth: Database contains all tracker eligibility and timing
- Simple Round-Robin: Fair processing based on `last_processed_at` timestamp
- Automatic Sync: Queue reflects database changes immediately
- No Queue State: Eliminate Redis queue state that can become stale
- Distributed Locking: Redis-based locks prevent duplicate processing across containers
Fair Processing Algorithm
1. Query database for eligible trackers (production run dates, not currently locked)
2. Order by last_processed_at ASC (oldest first)
3. Apply distributed lock using Redis SET NX EX
4. Process batch of 64 trackers
5. Update last_processed_at timestamp
6. Release locks
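A minimal sketch of the claim step (steps 1-4) is shown below. The table and column names (`trackers`, `last_processed_at`, production run columns), the async `db` handle, and the over-fetch factor are assumptions; the lock uses the `SET NX EX` command detailed under Distributed Locking Mechanism.

```python
# Sketch of the fair-processing claim step; schema and db handle are illustrative.
import time

import redis.asyncio as redis

BATCH_SIZE = 64
LOCK_TTL = 1800  # seconds (30 minutes)

ELIGIBLE_TRACKERS_SQL = """
    SELECT id FROM trackers
    WHERE production_run_start <= now()
      AND (production_run_end IS NULL OR production_run_end >= now())
    ORDER BY last_processed_at ASC NULLS FIRST  -- never-processed and oldest first
    LIMIT :limit
"""


async def claim_batch(db, r: redis.Redis, container_id: str) -> list[int]:
    """Select the oldest eligible trackers and lock up to BATCH_SIZE of them."""
    # Over-fetch so trackers already locked by other containers can simply be skipped.
    rows = await db.fetch_all(ELIGIBLE_TRACKERS_SQL, {"limit": BATCH_SIZE * 3})
    claimed: list[int] = []
    for row in rows:
        tracker_id = row["id"]
        # SET NX EX: only one container can hold the lock for this tracker.
        acquired = await r.set(
            f"tracker_lock:{tracker_id}",
            f"{container_id}:{int(time.time())}",
            nx=True,
            ex=LOCK_TTL,
        )
        if acquired:
            claimed.append(tracker_id)
        if len(claimed) == BATCH_SIZE:
            break
    return claimed
```

After processing, the caller updates `last_processed_at` and releases the locks (steps 5 and 6).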
Horizontal Scaling Architecture
Multiple independent containers process trackers fairly:
- Container Independence: Each container queries database independently
- Distributed Locking: Redis locks prevent duplicate processing
- Lock Timeout: 30-minute lock timeout prevents stuck locks
- Fair Distribution: Oldest-first processing ensures fairness
- No Coordination: Containers don't need to communicate with each other
Database Change Integration
Automatic queue updates for all CRUD operations:
- Tracker Creation: New trackers immediately eligible (last_processed_at = NULL)
- Tracker Updates: Changes reflected in next database query
- Bulk Imports: All imported trackers immediately available
- Production Run Changes: Eligibility changes reflected immediately
- Tracker Deletion: Removed from processing automatically
Distributed Locking Mechanism
Redis-based distributed locks prevent duplicate processing:
- Lock Key Pattern: `tracker_lock:{tracker_id}`
- Lock Value: `{container_id}:{timestamp}`
- Lock TTL: 1800 seconds (30 minutes)
- Lock Command: `SET tracker_lock:123 container-1:1641234567 NX EX 1800`
Lock Lifecycle:
- Acquire: Container attempts to acquire lock before processing
- Process: Only lock holder can process the tracker
- Update: Update `last_processed_at` timestamp in database
- Release: Explicitly delete lock key on completion
- Timeout: Lock auto-expires after 30 minutes if container fails
Lock Failure Handling:
- Lock Busy: Skip tracker, try next one (no blocking)
- Lock Timeout: Automatic cleanup prevents stuck locks
- Container Crash: Lock expires, tracker becomes available
- Network Partition: Lock TTL ensures eventual consistency
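The release step deserves care: if processing overruns the 30-minute TTL, another container may already hold the lock, so the holder must compare before deleting. A sketch using the standard Lua compare-and-delete pattern (an implementation choice, not mandated above):

```python
# Release the lock only if this container still owns it; the Lua script keeps the
# compare-and-delete atomic (standard Redis locking pattern).
import redis.asyncio as redis

RELEASE_LOCK_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
else
    return 0
end
"""


async def release_lock(r: redis.Redis, tracker_id: int, lock_value: str) -> bool:
    """Delete tracker_lock:{tracker_id} only if we still hold it.

    If the lock already expired (processing took longer than the 30-minute TTL),
    another container may own it now, so we must not delete blindly.
    """
    released = await r.eval(RELEASE_LOCK_LUA, 1, f"tracker_lock:{tracker_id}", lock_value)
    return bool(released)
```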
Elimination of Queue Invalidation
The current Redis pub/sub invalidation system becomes unnecessary:
- No Queue State: Database is the only source of truth
- No Invalidation Messages: CRUD operations don't need to notify services
- No Race Conditions: Database changes are immediately visible
- No Stale State: Locks prevent processing of deleted/changed trackers
- No Manual Refresh: System self-corrects on every query
Simplified CRUD Operations:
- Create tracker → Immediately available for processing
- Update tracker → Changes visible on next database query
- Delete tracker → Automatically excluded from processing
- Bulk import → All trackers immediately available
- Production run changes → Eligibility updated automatically
Capacity Planning
Processing 10,000 trackers with 2 containers:
- Processing Rate: 64 trackers per 5 minutes per container = 768 trackers/hour per container
- Total Capacity: 1,536 trackers/hour with 2 containers
- 12-Hour Cycle: 18,432 tracker processing slots per 12 hours
- Overhead Factor: 1.8x capacity provides buffer for retries and failures
- Scalability: Add more containers linearly to increase capacity
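The figures above follow directly from the batch size and polling interval:

```python
# Reproducing the capacity-planning figures above.
batch_size = 64          # trackers per Apple API batch
interval_minutes = 5     # one batch per container every 5 minutes
containers = 2
cycle_hours = 12
fleet_size = 10_000

per_container_per_hour = batch_size * (60 // interval_minutes)  # 768
total_per_hour = per_container_per_hour * containers            # 1,536
slots_per_cycle = total_per_hour * cycle_hours                  # 18,432
overhead_factor = slots_per_cycle / fleet_size                  # ~1.8x
```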
Service Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Tracker │ │ Geocoding │ │ Unified │
│ Fetcher │───▶│ Service │───▶│ Geofence │
│ │ │ │ │ Service │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ TaskiQ Task Manager │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐│
│ │ DB 0 │ │ DB 1 │ │ DB 2 │ │ DB 3 ││
│ │ Cache │ │ Tasks │ │ Health │ │ Notifications││
│ └─────────────┘ └─────────────┘ └─────────────┘ └──────────────┘│
└──────────────────────────────────────────────────────────────────┘
Redis Database Separation
- DB 0: Application cache (isolated from task operations)
- DB 1: Task queues and job state (persistent, cluster-compatible)
- DB 2: Health monitoring and heartbeats (separate from cache)
- DB 3: Real-time notifications and WebSocket data
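A sketch of per-concern clients following this table; the hostname is illustrative. Note that cluster-mode Redis/Valkey exposes only database 0, so a clustered deployment would typically achieve the same isolation with key prefixes rather than numbered databases.

```python
# Per-concern Redis clients (sketch); hostname and DB numbers follow the table above.
# In cluster mode only DB 0 exists, so key prefixes would provide the same separation.
import redis.asyncio as redis

REDIS_HOST = "valkey.internal"  # illustrative hostname

cache = redis.Redis(host=REDIS_HOST, port=6379, db=0)          # application cache
tasks = redis.Redis(host=REDIS_HOST, port=6379, db=1)          # task queues / job state
health = redis.Redis(host=REDIS_HOST, port=6379, db=2)         # heartbeats, health keys
notifications = redis.Redis(host=REDIS_HOST, port=6379, db=3)  # real-time / WebSocket data

# Flushing the cache no longer touches queues or heartbeats:
# await cache.flushdb()   # affects DB 0 only
```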
Resilience Mechanisms
Queue Persistence
- Persistent Task Storage: Tasks stored in Redis with TTL management
- State Tracking: Each task state persisted across service restarts
- Recovery on Startup: Services resume processing from last known state
- Dead Letter Queues: Failed tasks moved to separate queue for analysis
Failure Detection & Recovery
- Health Checks: Continuous monitoring of all service components
- Heartbeat System: Services report status every 30 seconds
- Stall Detection: Automatic identification of hung processes
- Circuit Breakers: External API failures don't cascade to entire system
- Exponential Backoff: Failed tasks retry with increasing delays
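A sketch of the exponential backoff retry, with jitter to avoid synchronized retries; the delays and attempt limit are illustrative.

```python
# Retry a failing task with exponentially increasing, jittered delays (sketch).
import asyncio
import random


async def retry_with_backoff(task, *args, max_attempts: int = 5,
                             base_delay: float = 2.0, max_delay: float = 300.0):
    """Run `task`, retrying on failure with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await task(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # hand off to the dead-letter queue / alerting
            # 2s, 4s, 8s, ... capped at max_delay, plus jitter to avoid thundering herds
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            await asyncio.sleep(delay + random.uniform(0, 1))
```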
Inter-Service Communication
- Message Durability: Service-to-service messages persist until acknowledged
- Retry Logic: Failed communications automatically retry
- Timeout Handling: Services don't wait indefinitely for responses
- Graceful Degradation: Services continue operating when dependencies fail
Data Flow Integrity
Critical Data Pipeline
Apple FindMy API → Tracker Fetcher → Location Reports → Geocoding Service → Unified Geofence Service → Materialized Views → Location History → Frontend
Integrity Guarantees
- Atomic Operations: Location report creation and geofence detection in single transaction
- Idempotent Processing: Duplicate messages don't create duplicate data
- Ordered Processing: Location reports processed in chronological order
- Consistency Checks: Materialized views validated and recreated if corrupted
- Audit Trail: All data transformations logged for troubleshooting
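A sketch of atomic, idempotent ingestion, assuming PostgreSQL and an async `db` handle; the table names, the unique constraint on `(tracker_id, reported_at)`, and the `detect_geofence_events` helper are illustrative.

```python
# Idempotent insert plus geofence detection in a single transaction (sketch).
# Assumes a unique constraint on (tracker_id, reported_at) so duplicates are no-ops.
INSERT_REPORT_SQL = """
    INSERT INTO location_reports (tracker_id, latitude, longitude, reported_at)
    VALUES (:tracker_id, :latitude, :longitude, :reported_at)
    ON CONFLICT (tracker_id, reported_at) DO NOTHING
    RETURNING id
"""


async def ingest_report(db, report: dict) -> None:
    """Insert one location report and its geofence events atomically."""
    async with db.transaction():
        inserted = await db.fetch_one(INSERT_REPORT_SQL, report)
        if inserted is None:
            return  # duplicate delivery: already processed, nothing to do
        await detect_geofence_events(db, inserted["id"])  # hypothetical helper
```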
Materialized View Management
- Automatic Refresh: Hourly refresh with failure detection
- Corruption Detection: Data integrity checks before and after refresh
- Automatic Recreation: Corrupted views automatically rebuilt from source data
- Rollback Capability: Failed refreshes don't leave system in broken state
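A sketch of the refresh-with-fallback behaviour, assuming PostgreSQL; the view name and source query are illustrative.

```python
# Hourly refresh with fallback recreation (sketch, assuming PostgreSQL).
VIEW = "location_history_mv"                     # illustrative view name
SOURCE_QUERY = "SELECT * FROM location_history"  # illustrative source definition


async def refresh_view(db) -> None:
    """Refresh the materialized view; rebuild it from source data if the refresh fails."""
    try:
        # CONCURRENTLY keeps the view readable during refresh (requires a unique index).
        await db.execute(f"REFRESH MATERIALIZED VIEW CONCURRENTLY {VIEW}")
    except Exception:
        # Rebuild inside one transaction so readers never observe a missing view.
        async with db.transaction():
            await db.execute(f"DROP MATERIALIZED VIEW IF EXISTS {VIEW}")
            await db.execute(f"CREATE MATERIALIZED VIEW {VIEW} AS {SOURCE_QUERY}")
```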
Failure Scenarios & Recovery
Service Restart Scenarios
| Scenario | Current Behavior | New Behavior |
|---|---|---|
| Tracker Fetcher Restart | Queue lost, manual refresh needed | Queue persists, automatic resume |
| Geocoding Service Restart | In-progress batches lost | Batches resume from checkpoint |
| Redis Restart | All queues lost | Tasks persist, automatic reconnection |
| Full System Restart | Manual queue rebuild required | Automatic recovery from persistent state |
Failure Recovery Patterns
- Graceful Shutdown: Services complete current tasks before stopping
- State Persistence: All task state saved before shutdown
- Automatic Resume: Services pick up where they left off on restart
- Dependency Recovery: Services wait for dependencies to become available
- Health Reporting: System reports recovery status to monitoring
Service Communication Patterns
Reliable Messaging
- Message Acknowledgment: Recipients confirm message processing
- Retry Mechanisms: Failed messages automatically retried
- Dead Letter Handling: Permanently failed messages moved to error queue
- Ordering Guarantees: Related messages processed in correct order
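One way to get acknowledgment and dead-letter handling on top of Redis lists is the standard reliable-queue pattern sketched below; the queue names and handler are illustrative, and `BLMOVE` requires Redis/Valkey 6.2+.

```python
# Reliable-queue pattern with acknowledgment and a dead-letter queue (sketch).
import redis.asyncio as redis

QUEUE = "geocoding:pending"
PROCESSING = "geocoding:processing"
DEAD_LETTER = "geocoding:dead"


async def consume_one(r: redis.Redis) -> None:
    # Atomically move the message to a processing list so it is never lost mid-flight.
    message = await r.blmove(QUEUE, PROCESSING, timeout=5)
    if message is None:
        return  # queue empty
    try:
        await handle_message(message)           # hypothetical handler
        await r.lrem(PROCESSING, 1, message)    # acknowledge: remove from processing list
    except Exception:
        # Failed messages move to the dead-letter queue for analysis or retry.
        await r.lrem(PROCESSING, 1, message)
        await r.rpush(DEAD_LETTER, message)
```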
Event-Driven Architecture
- Location Report Events: Tracker fetcher publishes new location events
- Geocoding Events: Geocoding service publishes completion events
- Geofence Events: Unified geofence service publishes status changes
- Health Events: All services publish health status changes
Queue Persistence Strategy
Task State Management
- Persistent Queues: All queues survive Redis restarts
- Task Checkpointing: Long-running tasks save progress periodically
- Recovery Points: Services can resume from last checkpoint
- State Validation: Task state verified on service startup
Redis Cluster Compatibility
- Hash Tags: All related keys use consistent hash tags for same slot
- Cluster-Aware Operations: No cross-slot operations that fail in clusters
- Connection Pooling: Efficient connection management for cluster nodes
- Failover Handling: Automatic reconnection when cluster nodes fail
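A sketch of hash-tagged key naming: Redis Cluster hashes only the substring inside `{...}`, so all keys for one tracker map to the same slot and multi-key operations avoid cross-slot errors. The key names are illustrative.

```python
# Hash-tagged keys keep all of a tracker's data on one cluster slot (sketch).
def tracker_keys(tracker_id: int) -> dict[str, str]:
    tag = f"{{tracker:{tracker_id}}}"  # only this part is hashed by the cluster
    return {
        "lock": f"{tag}:lock",
        "state": f"{tag}:state",
        "last_report": f"{tag}:last_report",
    }

# tracker_keys(123) -> {'lock': '{tracker:123}:lock', 'state': '{tracker:123}:state', ...}
```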
Monitoring & Health Checks
System Health Monitoring
- Service Health: Each service reports detailed health status
- Queue Health: Queue depth, processing rates, error rates monitored
- Data Pipeline Health: End-to-end data flow monitoring
- External Dependencies: Apple FindMy API, database, Redis cluster status
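A sketch of the 30-second heartbeat mechanism; the key names and the 90-second expiry (three missed heartbeats) are illustrative.

```python
# Heartbeats in the health database (DB 2): a missing key means a stalled service (sketch).
import asyncio
import time

import redis.asyncio as redis


async def heartbeat_loop(r: redis.Redis, service_name: str) -> None:
    """Refresh this service's heartbeat key every 30 seconds."""
    while True:
        await r.set(f"heartbeat:{service_name}", int(time.time()), ex=90)
        await asyncio.sleep(30)


async def stalled_services(r: redis.Redis, expected: list[str]) -> list[str]:
    """Return the services whose heartbeat key has expired."""
    return [name for name in expected if not await r.exists(f"heartbeat:{name}")]
```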
Alerting & Response
- Automatic Alerts: Critical failures trigger immediate notifications
- Escalation Procedures: Unresolved issues escalate to operations team
- Recovery Actions: Automated recovery attempts before human intervention
- Performance Metrics: Response times, throughput, error rates tracked
Success Criteria
Reliability Metrics
- Zero Data Loss: 100% of Apple location reports reach location_history
- Queue Persistence: 100% of tasks survive service restarts
- Recovery Time: Services resume within 30 seconds of restart
- Failure Recovery: 95% of failures recover automatically without intervention
Performance Metrics
- Tracker Fetching: 99.9% success rate for location report fetching
- Geocoding: 99% success rate for location geocoding
- Geofencing: 99.9% success rate for geofence event creation
- View Refresh: 100% success rate for materialized view updates
Operational Metrics
- Manual Interventions: Reduce to <1 per month
- System Downtime: <0.1% unplanned downtime
- Alert Fatigue: <5 false positive alerts per week
- Recovery Speed: 95% of issues resolve within 5 minutes
Implementation Phases
Phase 1: Infrastructure Setup
- Multi-database Redis configuration
- TaskiQ task manager implementation
- APScheduler persistent scheduling
- Health monitoring framework
Phase 2: Service Migration
- Tracker fetcher service conversion
- Geocoding service conversion
- Unified geofence service conversion
- Notification service conversion
Phase 3: Legacy Cleanup
- Remove all Celery dependencies
- Delete unused geofence_service
- Clean up configuration files
- Update documentation
Phase 4: Validation & Monitoring
- End-to-end testing of data pipeline
- Load testing with production data volumes
- Monitoring dashboard deployment
- Operational runbook creation
Risk Mitigation
Technical Risks
- Redis Cluster Compatibility: Extensive testing with AWS Valkey clusters
- Task Migration: Parallel running of old and new systems during transition
- Data Consistency: Comprehensive testing of data pipeline integrity
- Performance Impact: Load testing to ensure no performance degradation
Operational Risks
- Team Training: Comprehensive training on new system operations
- Rollback Plan: Ability to quickly revert to Celery if critical issues arise
- Monitoring Gaps: Comprehensive monitoring to catch issues early
- Documentation: Complete operational documentation for troubleshooting
Conclusion
This architectural replacement addresses all current reliability issues with Celery while providing the resilience guarantees required for a mission-critical tracker data pipeline. The proposed solution ensures zero data loss, automatic recovery, and cluster compatibility while eliminating the operational overhead of constant manual intervention.
The new system will provide the reliability foundation needed to scale the tracker system while maintaining data integrity from Apple location reports through to the frontend user experience.