Phase 3: Service Integration Implementation
This document describes the implementation of Phase 3 of the distributed health monitoring system, which integrates health monitoring into existing services.
Overview
Phase 3 builds on the foundation established in Phase 1 (Health Redis Infrastructure) and Phase 2 (Health Publisher Framework) to integrate health monitoring into our microservices, starting with the tracker fetcher service.
Implementation Status
✅ Completed Components
1. Tracker Fetcher Service Integration
File: services/tracker_fetcher/health_monitor.py
- Purpose: Comprehensive health monitoring for the tracker fetcher service
- Features:
  - Apple API connectivity monitoring
  - Queue health metrics (immediate, hot, warm, cold, retry queues)
  - Fetch performance tracking
  - Geofence event generation monitoring
  - Anisette server connectivity checks
  - Batch processing performance metrics
Key Metrics Tracked:
- fetch_rate_per_hour: Trackers processed per hour
- success_rate_percent: Percentage of successful fetch attempts
- apple_account_authenticated: Apple account authentication status
- anisette_server_reachable: Anisette server connectivity
- total_queue_size: Combined size of all queues
- apple_api_response_time_ms: Apple API response time
- reports_found_per_tracker: Average reports found per tracker
- geofence_events_generated: Total geofence events created
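One way to derive rate-style metrics such as fetch_rate_per_hour, success_rate_percent, and reports_found_per_tracker is a rolling window over recorded fetch attempts. The sketch below is illustrative only: the FetchWindow class, its method names, and the one-hour window are assumptions, not the actual code in health_monitor.py.

```python
import time
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class FetchWindow:
    """Hypothetical rolling window of fetch attempts (not the project's class)."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        # Each entry is (timestamp, success, reports_found).
        self._attempts: Deque[Tuple[float, bool, int]] = deque()

    def record(self, success: bool, reports_found: int = 0,
               now: Optional[float] = None) -> None:
        """Store one fetch attempt and drop entries older than the window."""
        now = time.time() if now is None else now
        self._attempts.append((now, success, reports_found))
        self._evict(now)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._attempts and self._attempts[0][0] < cutoff:
            self._attempts.popleft()

    def snapshot(self, now: Optional[float] = None) -> Dict[str, float]:
        """Compute the metrics over the attempts still inside the window."""
        now = time.time() if now is None else now
        self._evict(now)
        total = len(self._attempts)
        if total == 0:
            # Empty window: report zeros. A real monitor might prefer to
            # treat this case as "unknown" instead.
            return {
                "fetch_rate_per_hour": 0.0,
                "success_rate_percent": 0.0,
                "reports_found_per_tracker": 0.0,
            }
        successes = sum(1 for _, ok, _ in self._attempts if ok)
        reports = sum(n for _, _, n in self._attempts)
        return {
            # Scale the in-window count to an hourly rate.
            "fetch_rate_per_hour": total * 3600.0 / self.window_seconds,
            "success_rate_percent": 100.0 * successes / total,
            "reports_found_per_tracker": reports / total,
        }
```

Passing `now` explicitly keeps the window deterministic in tests; in production the default wall-clock time would be used.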
Alert Conditions:
- CRITICAL: Apple account not authenticated
- CRITICAL: Anisette server unreachable
- WARNING: Large queue backlog (>10,000 trackers)
- WARNING/CRITICAL: Low fetch success rate (WARNING below 80%, CRITICAL below 50%)
- WARNING: Low fetch rate (<10 trackers/hour)
- WARNING: Slow Apple API response (>10 seconds)
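The alert conditions above can be read as a simple rule table over the tracked metrics. The following is a minimal sketch under that reading; the function name evaluate_alerts, the (severity, message) tuple shape, and the assumption that all metrics arrive in one mapping are illustrative choices, not the monitor's actual interface.

```python
from typing import Any, List, Mapping, Tuple

Alert = Tuple[str, str]  # (severity, message); a hypothetical shape


def evaluate_alerts(metrics: Mapping[str, Any]) -> List[Alert]:
    """Apply the documented alert conditions to one metrics snapshot."""
    alerts: List[Alert] = []

    if not metrics["apple_account_authenticated"]:
        alerts.append(("CRITICAL", "Apple account not authenticated"))
    if not metrics["anisette_server_reachable"]:
        alerts.append(("CRITICAL", "Anisette server unreachable"))

    if metrics["total_queue_size"] > 10_000:
        alerts.append(("WARNING", "Large queue backlog (>10,000 trackers)"))

    # The success rate has two tiers: below 50% is critical,
    # below 80% is a warning.
    rate = metrics["success_rate_percent"]
    if rate < 50.0:
        alerts.append(("CRITICAL", f"Low fetch success rate ({rate:.1f}%)"))
    elif rate < 80.0:
        alerts.append(("WARNING", f"Low fetch success rate ({rate:.1f}%)"))

    if metrics["fetch_rate_per_hour"] < 10.0:
        alerts.append(("WARNING", "Low fetch rate (<10 trackers/hour)"))
    if metrics["apple_api_response_time_ms"] > 10_000:
        alerts.append(("WARNING", "Slow Apple API response (>10 seconds)"))

    return alerts
```

A snapshot with every metric in range yields an empty list, so callers can treat "no alerts" as the healthy case.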
2. Service Integration Points
File: services/tracker_fetcher/service.py
- Health Monitor Initialization: Automatic health monitor creation and startup
- Metrics Recording: Integration points for recording fetch attempts and performance
- Graceful Shutdown: Proper health monitor cleanup on service stop
Integration Features:
- Automatic health monitoring startup with service
- Real-time metrics recording during batch processing
- Performance tracking for Apple API calls
- Queue health monitoring integration
- Graceful shutdown handling
3. Testing Infrastructure
File: scripts/test_tracker_fetcher_health_integration.py
- Purpose: Comprehensive testing of health monitoring integration
- Test Coverage:
  - Health monitor creation and initialization
  - Health indicators collection
  - Metrics recording functionality
  - Alert generation and thresholds
  - Complete health data collection
Architecture
Health Monitor Hierarchy
BaseServiceHealthMonitor (services/shared/service_health_monitor.py)
├── Database connectivity monitoring
├── Redis connectivity monitoring
├── Basic metrics collection
└── Health status determination
TrackerFetcherHealthMonitor (services/tracker_fetcher/health_monitor.py)
├── Extends BaseServiceHealthMonitor
├── Apple API specific monitoring
├── Queue health metrics
├── Fetch performance tracking
└── Service-specific alert generation
Integration Flow
- Service Startup:
    # In TrackerFetcherService.start()
    self.health_monitor = TrackerFetcherHealthMonitor(self)
    await self.health_monitor.start_monitoring()
- Metrics Recording:
    # During batch processing
    self.health_monitor.record_batch_processing_time(duration * 1000)
    self.health_monitor.record_fetch_attempt(success, reports_found, geofence_events)
- Health Publishing:
  - Automatic periodic health status publishing
  - Real-time metrics publishing
  - Alert generation and publishing
- Service Shutdown:
    # In TrackerFetcherService.stop()
    if self.health_monitor:
        await self.health_monitor.stop_monitoring()
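The periodic publishing and graceful shutdown steps in the flow above could be built around a background asyncio task. The sketch below shows one such arrangement. PeriodicPublisher and its publish callback are hypothetical names introduced for illustration; only the start_monitoring and stop_monitoring method names mirror the calls shown in this document, and their real implementation may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class PeriodicPublisher:
    """Hypothetical helper: awaits `publish` every `interval` seconds."""

    def __init__(self, publish: Callable[[], Awaitable[None]],
                 interval: float) -> None:
        self._publish = publish
        self._interval = interval
        self._stop: Optional[asyncio.Event] = None
        self._task: Optional[asyncio.Task] = None

    async def start_monitoring(self) -> None:
        # Create the event inside the running loop, then start the task.
        self._stop = asyncio.Event()
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        assert self._stop is not None
        while not self._stop.is_set():
            await self._publish()
            try:
                # Wait one interval, but wake immediately when stopped.
                await asyncio.wait_for(self._stop.wait(),
                                       timeout=self._interval)
            except asyncio.TimeoutError:
                pass  # Interval elapsed without a stop request.

    async def stop_monitoring(self) -> None:
        if self._task is None or self._stop is None:
            return  # Never started, or already stopped.
        self._stop.set()
        await self._task  # Let any in-flight publish finish.
        self._task = None
```

Waiting on an event with a timeout, rather than calling asyncio.sleep, means shutdown does not have to wait out the rest of a 30-second publishing interval.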
Configuration
Health Monitoring Settings
The health monitoring system uses the following configuration from services/shared/config.py:
# Health monitoring intervals
HEALTH_PUBLISHING_INTERVAL: int = 30 # Status publishing interval
HEALTH_METRICS_INTERVAL: int = 300 # Detailed metrics interval
HEALTH_RETENTION_HOURS: int = 24 # Data retention period
# Health Redis configuration (separate from main Redis)
HEALTH_REDIS_HOST: str = "dragonfly"
HEALTH_REDIS_PORT: int = 6379
HEALTH_REDIS_CLUSTER_MODE: bool = False
Service-Specific Thresholds
Tracker Fetcher specific thresholds:
# Queue health thresholds
LARGE_QUEUE_THRESHOLD = 10000 # Warning threshold
VERY_LARGE_QUEUE_THRESHOLD = 50000 # Unhealthy threshold
# Performance thresholds
MIN_SUCCESS_RATE_HEALTHY = 80.0 # Below this = degraded
MIN_SUCCESS_RATE_DEGRADED = 50.0 # Below this = unhealthy
MIN_FETCH_RATE = 10.0 # Trackers per hour
# API response time thresholds
MAX_API_RESPONSE_TIME = 10000 # 10 seconds in milliseconds
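To make the threshold comments concrete, here is a rough sketch of how they might map to an overall status. The HealthStatus enum, the classify function, and the decision to treat the queue "warning" level as degraded are all assumptions made for illustration; the actual status logic lives in the health monitor classes and may combine more inputs.

```python
from enum import IntEnum

# Thresholds copied from the configuration listed above.
LARGE_QUEUE_THRESHOLD = 10000
VERY_LARGE_QUEUE_THRESHOLD = 50000
MIN_SUCCESS_RATE_HEALTHY = 80.0
MIN_SUCCESS_RATE_DEGRADED = 50.0


class HealthStatus(IntEnum):
    """Hypothetical status enum, ordered so that max() picks the worst."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def classify(queue_size: int, success_rate_percent: float) -> HealthStatus:
    """Return the worst status implied by queue size and success rate."""
    if queue_size > VERY_LARGE_QUEUE_THRESHOLD:
        queue_status = HealthStatus.UNHEALTHY
    elif queue_size > LARGE_QUEUE_THRESHOLD:
        queue_status = HealthStatus.DEGRADED  # assumed: warning => degraded
    else:
        queue_status = HealthStatus.HEALTHY

    if success_rate_percent < MIN_SUCCESS_RATE_DEGRADED:
        rate_status = HealthStatus.UNHEALTHY
    elif success_rate_percent < MIN_SUCCESS_RATE_HEALTHY:
        rate_status = HealthStatus.DEGRADED
    else:
        rate_status = HealthStatus.HEALTHY

    # The overall status is the worst of the individual checks.
    return max(queue_status, rate_status)
```

Ordering the enum by severity keeps the combination step to a single max() call, which scales naturally if further checks are added.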
Usage
Running Health Integration Tests
# Test the tracker fetcher health integration
docker compose exec dev ./scripts/test_tracker_fetcher_health_integration.py
Monitoring Health Data
The health monitoring system publishes data to Redis channels:
# Subscribe to tracker fetcher health status
HEALTH_CHANNEL_STATUS = "health:service:tracker_fetcher:status"
# Subscribe to detailed metrics
HEALTH_CHANNEL_METRICS = "health:service:tracker_fetcher:metrics"
# Subscribe to system alerts
HEALTH_CHANNEL_ALERTS = "health:system:alerts"
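A consumer could listen on these channels with the redis-py pub/sub client. The sketch below is an assumption-heavy illustration: it presumes the Health Redis endpoint speaks the Redis pub/sub protocol at the host and port from the configuration, and that payloads are JSON-encoded, which this document does not specify. The function names are hypothetical.

```python
import json
from typing import Any, Dict

HEALTH_CHANNEL_STATUS = "health:service:tracker_fetcher:status"
HEALTH_CHANNEL_ALERTS = "health:system:alerts"


def decode_health_message(message: Dict[str, Any]) -> Dict[str, Any]:
    """Decode one pub/sub message dict; assumes a JSON payload."""
    channel = message["channel"]
    data = message["data"]
    # redis-py returns bytes unless decode_responses=True is set.
    if isinstance(channel, bytes):
        channel = channel.decode("utf-8")
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return {"channel": channel, "payload": json.loads(data)}


def watch_health(host: str = "dragonfly", port: int = 6379) -> None:
    """Print status and alert messages until interrupted."""
    import redis  # redis-py; imported here so the decoder has no dependency

    client = redis.Redis(host=host, port=port)
    pubsub = client.pubsub()
    pubsub.subscribe(HEALTH_CHANNEL_STATUS, HEALTH_CHANNEL_ALERTS)
    try:
        for message in pubsub.listen():
            if message["type"] != "message":
                continue  # Skip subscribe confirmations.
            print(decode_health_message(message))
    finally:
        pubsub.close()
        client.close()
```

Calling `watch_health()` from inside the same Docker network as the health Redis instance would block and print each decoded message as it arrives.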
Accessing Health Data Programmatically
import asyncio

from services.tracker_fetcher.service import TrackerFetcherService
from services.tracker_fetcher.health_monitor import TrackerFetcherHealthMonitor

async def main() -> None:
    # Create service and health monitor
    service = TrackerFetcherService()
    health_monitor = TrackerFetcherHealthMonitor(service)
    # Collect health data (a coroutine, so it must run inside an event loop)
    health_data = await health_monitor.collect_health_data()
    print(f"Service status: {health_data['status'].value}")
    print(f"Metrics: {health_data['metrics']}")
    print(f"Alerts: {len(health_data['alerts'])}")

asyncio.run(main())
Next Steps
Phase 4: Additional Service Integration
The framework is now ready to be extended to other services:
- Geocoding Service: Monitor geocoding API performance and cache hit rates
- Realtime Geofence Service: Monitor geofence detection performance
- Location Aggregator: Monitor aggregation performance and data processing
- Notification Service: Monitor notification delivery rates
Phase 5: Admin Panel Integration
Integration with the admin panel for health monitoring dashboard:
- Health Dashboard: Real-time service health visualization
- Metrics Charts: Historical performance and health trends
- Alert Management: Alert acknowledgment and resolution tracking
- Service Control: Start/stop services from admin panel
Benefits
Operational Visibility
- Real-time Monitoring: Continuous health status updates
- Performance Tracking: Detailed metrics on service performance
- Proactive Alerting: Early warning of potential issues
- Historical Analysis: Trend analysis and capacity planning
Reliability Improvements
- Early Problem Detection: Issues identified before they impact users
- Automated Recovery: Health-based service restart capabilities
- Performance Optimization: Data-driven performance improvements
- Capacity Planning: Usage patterns and scaling insights
Development Benefits
- Debugging Support: Rich metrics for troubleshooting
- Performance Profiling: Detailed timing and performance data
- Integration Testing: Comprehensive health check capabilities
- Service Dependencies: Clear visibility into service relationships
Conclusion
Phase 3 successfully integrates comprehensive health monitoring into the tracker fetcher service, providing:
- Complete Health Visibility: All aspects of service health are monitored
- Proactive Alerting: Issues are detected and reported immediately
- Performance Insights: Detailed metrics enable optimization
- Operational Excellence: Foundation for reliable service operations
The implementation provides a template for integrating health monitoring into other services, establishing a consistent approach to service observability across the entire system.