
Phase 3: Service Integration Implementation

This document describes the implementation of Phase 3 of the distributed health monitoring system, which integrates health monitoring into existing services.

Overview

Phase 3 builds upon the foundation established in Phase 1 (Health Redis Infrastructure) and Phase 2 (Health Publisher Framework) to integrate comprehensive health monitoring into our microservices.

Implementation Status

✅ Completed Components

1. Tracker Fetcher Service Integration

File: services/tracker_fetcher/health_monitor.py

  • Purpose: Comprehensive health monitoring for the tracker fetcher service
  • Features:
      • Apple API connectivity monitoring
      • Queue health metrics (immediate, hot, warm, cold, retry queues)
      • Fetch performance tracking
      • Geofence event generation monitoring
      • Anisette server connectivity checks
      • Batch processing performance metrics

Key Metrics Tracked:

  • fetch_rate_per_hour: Trackers processed per hour
  • success_rate_percent: Percentage of successful fetch attempts
  • apple_account_authenticated: Apple account authentication status
  • anisette_server_reachable: Anisette server connectivity
  • total_queue_size: Combined size of all queues
  • apple_api_response_time_ms: Apple API response time
  • reports_found_per_tracker: Average reports found per tracker
  • geofence_events_generated: Total geofence events created
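Several of these metrics are ratios derived from raw counters. The sketch below shows one way to accumulate fetch attempts and derive fetch_rate_per_hour, success_rate_percent, and reports_found_per_tracker from them. The FetchMetrics class and its snapshot() method are hypothetical; only the metric names and the record_fetch_attempt(success, reports_found, geofence_events) call shape come from this document. It also assumes one fetch attempt per tracker.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FetchMetrics:
    """Hypothetical accumulator for the tracker fetcher metrics listed above."""

    # Monotonic start time; used to turn the attempt count into a per-hour rate.
    started_at: float = field(default_factory=time.monotonic)
    attempts: int = 0
    successes: int = 0
    reports_found: int = 0
    geofence_events_generated: int = 0

    def record_fetch_attempt(self, success: bool, reports_found: int = 0,
                             geofence_events: int = 0) -> None:
        # Raw counters only; ratios are computed lazily in snapshot().
        self.attempts += 1
        if success:
            self.successes += 1
        self.reports_found += reports_found
        self.geofence_events_generated += geofence_events

    def snapshot(self, now: Optional[float] = None) -> dict:
        now = time.monotonic() if now is None else now
        # Guard against division by zero right after startup.
        elapsed_hours = max(now - self.started_at, 1e-9) / 3600
        return {
            # Assumption: one attempt corresponds to one tracker.
            "fetch_rate_per_hour": self.attempts / elapsed_hours,
            # With no attempts yet, report 0.0 rather than dividing by zero.
            "success_rate_percent": (100.0 * self.successes / self.attempts
                                     if self.attempts else 0.0),
            "reports_found_per_tracker": (self.reports_found / self.attempts
                                          if self.attempts else 0.0),
            "geofence_events_generated": self.geofence_events_generated,
        }
```

Keeping raw counters and deriving ratios on demand means the published values always reflect the full observation window, at the cost of not showing short-term fluctuations.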

Alert Conditions:

  • CRITICAL: Apple account not authenticated
  • CRITICAL: Anisette server unreachable
  • WARNING: Large queue backlog (>10,000 trackers)
  • WARNING: Low fetch success rate (<80%)
  • CRITICAL: Very low fetch success rate (<50%)
  • WARNING: Low fetch rate (<10 trackers/hour)
  • WARNING: Slow Apple API response (>10 seconds)
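The alert conditions above can be expressed as a single pure function over a metrics dictionary. The following is a minimal sketch, not the service's actual code: the function name evaluate_alerts and the (severity, message) tuple format are assumptions, while the metric keys and threshold values are taken from this document.

```python
from typing import Any, Dict, List, Tuple

# Threshold values from the alert conditions listed above.
LARGE_QUEUE_THRESHOLD = 10_000
MIN_SUCCESS_RATE_HEALTHY = 80.0
MIN_SUCCESS_RATE_DEGRADED = 50.0
MIN_FETCH_RATE = 10.0
MAX_API_RESPONSE_TIME_MS = 10_000


def evaluate_alerts(metrics: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Hypothetical helper: map tracker fetcher metrics to (severity, message) alerts."""
    alerts: List[Tuple[str, str]] = []
    # Connectivity checks: a missing value is treated as a failure.
    if not metrics.get("apple_account_authenticated", False):
        alerts.append(("CRITICAL", "Apple account not authenticated"))
    if not metrics.get("anisette_server_reachable", False):
        alerts.append(("CRITICAL", "Anisette server unreachable"))
    if metrics.get("total_queue_size", 0) > LARGE_QUEUE_THRESHOLD:
        alerts.append(("WARNING", "Large queue backlog"))
    # Performance checks: skipped when the metric has not been measured yet.
    rate = metrics.get("success_rate_percent")
    if rate is not None:
        if rate < MIN_SUCCESS_RATE_DEGRADED:
            alerts.append(("CRITICAL", "Very low fetch success rate"))
        elif rate < MIN_SUCCESS_RATE_HEALTHY:
            alerts.append(("WARNING", "Low fetch success rate"))
    fetch_rate = metrics.get("fetch_rate_per_hour")
    if fetch_rate is not None and fetch_rate < MIN_FETCH_RATE:
        alerts.append(("WARNING", "Low fetch rate"))
    response_ms = metrics.get("apple_api_response_time_ms")
    if response_ms is not None and response_ms > MAX_API_RESPONSE_TIME_MS:
        alerts.append(("WARNING", "Slow Apple API response"))
    return alerts
```

Keeping the rules side-effect free makes them easy to unit-test against fixed metric dictionaries.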

2. Service Integration Points

File: services/tracker_fetcher/service.py

  • Health Monitor Initialization: Automatic health monitor creation and startup
  • Metrics Recording: Integration points for recording fetch attempts and performance
  • Graceful Shutdown: Proper health monitor cleanup on service stop

Integration Features:

  • Automatic health monitoring startup with service
  • Real-time metrics recording during batch processing
  • Performance tracking for Apple API calls
  • Queue health monitoring integration
  • Graceful shutdown handling
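Real-time batch timing can be wrapped in an async context manager so every batch is measured the same way. This is a sketch under assumptions: timed_batch, RecordingMonitor, and process_batch are hypothetical names, and only record_batch_processing_time(duration * 1000) mirrors the call shown in this document.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from typing import List


class RecordingMonitor:
    """Hypothetical stand-in for the health monitor that just stores batch timings."""

    def __init__(self) -> None:
        self.batch_times_ms: List[float] = []

    def record_batch_processing_time(self, duration_ms: float) -> None:
        self.batch_times_ms.append(duration_ms)


@asynccontextmanager
async def timed_batch(monitor):
    # perf_counter is monotonic and high-resolution, suitable for durations.
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record even if the batch raised, so slow failing batches stay visible.
        duration = time.perf_counter() - start
        monitor.record_batch_processing_time(duration * 1000)


async def process_batch(monitor, trackers) -> None:
    async with timed_batch(monitor):
        for _tracker in trackers:
            await asyncio.sleep(0.001)  # placeholder for the real per-tracker fetch
```

Recording in the finally clause is a design choice: it keeps failed batches in the timing data, so a batch that times out after a long wait still shows up as slow.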

3. Testing Infrastructure

File: scripts/test_tracker_fetcher_health_integration.py

  • Purpose: Comprehensive testing of health monitoring integration
  • Test Coverage:
      • Health monitor creation and initialization
      • Health indicators collection
      • Metrics recording functionality
      • Alert generation and thresholds
      • Complete health data collection

Architecture

Health Monitor Hierarchy

BaseServiceHealthMonitor (services/shared/service_health_monitor.py)
├── Database connectivity monitoring
├── Redis connectivity monitoring
├── Basic metrics collection
└── Health status determination

TrackerFetcherHealthMonitor (services/tracker_fetcher/health_monitor.py)
├── Extends BaseServiceHealthMonitor
├── Apple API specific monitoring
├── Queue health metrics
├── Fetch performance tracking
└── Service-specific alert generation
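The hierarchy above follows a template-method pattern: the base class runs the shared database and Redis checks and derives an overall status, while the subclass contributes service-specific metrics and alerts. The sketch below is a simplified, hypothetical version; the hook names collect_service_metrics and service_alerts, the service method get_total_queue_size, and the status rules are assumptions. Only the class names, collect_health_data(), and the status/metrics/alerts keys appear in this document.

```python
import asyncio
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, Dict, List, Tuple


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class BaseServiceHealthMonitor(ABC):
    def __init__(self, service_name: str) -> None:
        self.service_name = service_name

    async def check_database(self) -> bool:
        return True  # placeholder: a real check might run "SELECT 1"

    async def check_redis(self) -> bool:
        return True  # placeholder: a real check might send PING

    @abstractmethod
    async def collect_service_metrics(self) -> Dict[str, Any]: ...

    @abstractmethod
    def service_alerts(self, metrics: Dict[str, Any]) -> List[Tuple[str, str]]: ...

    async def collect_health_data(self) -> Dict[str, Any]:
        # Shared checks run concurrently.
        db_ok, redis_ok = await asyncio.gather(self.check_database(), self.check_redis())
        metrics: Dict[str, Any] = {"database_connected": db_ok, "redis_connected": redis_ok}
        metrics.update(await self.collect_service_metrics())
        alerts = self.service_alerts(metrics)
        # Assumed rules: lost connectivity or any CRITICAL alert -> unhealthy.
        if not (db_ok and redis_ok) or any(sev == "CRITICAL" for sev, _ in alerts):
            status = HealthStatus.UNHEALTHY
        elif alerts:
            status = HealthStatus.DEGRADED
        else:
            status = HealthStatus.HEALTHY
        return {"service": self.service_name, "status": status,
                "metrics": metrics, "alerts": alerts}


class TrackerFetcherHealthMonitor(BaseServiceHealthMonitor):
    """Simplified sketch; the real class tracks many more indicators."""

    def __init__(self, service) -> None:
        super().__init__("tracker_fetcher")
        self.service = service

    async def collect_service_metrics(self) -> Dict[str, Any]:
        # get_total_queue_size is a hypothetical service method.
        return {"total_queue_size": await self.service.get_total_queue_size()}

    def service_alerts(self, metrics: Dict[str, Any]) -> List[Tuple[str, str]]:
        if metrics["total_queue_size"] > 10_000:
            return [("WARNING", "Large queue backlog")]
        return []
```

Because the status is an Enum, callers can read health_data["status"].value, matching the programmatic access example in this document.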

Integration Flow

  1. Service Startup:

    # In TrackerFetcherService.start()
    self.health_monitor = TrackerFetcherHealthMonitor(self)
    await self.health_monitor.start_monitoring()

  2. Metrics Recording:

    # During batch processing (duration in seconds, recorded in milliseconds)
    self.health_monitor.record_batch_processing_time(duration * 1000)
    self.health_monitor.record_fetch_attempt(success, reports_found, geofence_events)

  3. Health Publishing:
      • Automatic periodic health status publishing
      • Real-time metrics publishing
      • Alert generation and publishing

  4. Service Shutdown:

    # In TrackerFetcherService.stop()
    if self.health_monitor:
        await self.health_monitor.stop_monitoring()
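The start_monitoring() and stop_monitoring() pair in this flow maps naturally onto a background asyncio task that is cancelled on shutdown. The class below is a hypothetical sketch of that pattern, not the real monitor: PeriodicHealthPublisher and the publish callback are assumptions, and interval_s treats HEALTH_PUBLISHING_INTERVAL as seconds, which this document does not state explicitly.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable, Optional


class PeriodicHealthPublisher:
    """Hypothetical periodic publisher with start/stop semantics."""

    def __init__(self, publish: Callable[[], Awaitable[None]], interval_s: float = 30.0) -> None:
        self._publish = publish
        self._interval_s = interval_s
        self._task: Optional[asyncio.Task] = None

    async def start_monitoring(self) -> None:
        # Idempotent: a second call does not start a second loop.
        if self._task is None:
            self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        while True:
            try:
                await self._publish()
            except Exception:
                # A failed publish must not kill the loop; real code would log it.
                # CancelledError is a BaseException, so cancellation still propagates.
                pass
            await asyncio.sleep(self._interval_s)

    async def stop_monitoring(self) -> None:
        if self._task is not None:
            self._task.cancel()
            # Wait for the task to finish so shutdown is actually complete.
            with contextlib.suppress(asyncio.CancelledError):
                await self._task
            self._task = None
```

Awaiting the cancelled task in stop_monitoring() is what makes the shutdown graceful: the service does not proceed to close its connections while a publish is still in flight.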

Configuration

Health Monitoring Settings

The health monitoring system uses the following configuration from services/shared/config.py:

# Health monitoring intervals
HEALTH_PUBLISHING_INTERVAL: int = 30  # Status publishing interval
HEALTH_METRICS_INTERVAL: int = 300    # Detailed metrics interval
HEALTH_RETENTION_HOURS: int = 24      # Data retention period

# Health Redis configuration (separate from main Redis)
HEALTH_REDIS_HOST: str = "dragonfly"
HEALTH_REDIS_PORT: int = 6379
HEALTH_REDIS_CLUSTER_MODE: bool = False
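The type-annotated defaults above suggest a settings object that can be overridden per deployment. As a stdlib-only sketch, assuming overrides come from environment variables with the same names (the actual services/shared/config.py may use a different mechanism), the values could be loaded like this; HealthSettings and from_env are hypothetical names.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class HealthSettings:
    """Hypothetical settings holder using the defaults listed above."""

    HEALTH_PUBLISHING_INTERVAL: int = 30
    HEALTH_METRICS_INTERVAL: int = 300
    HEALTH_RETENTION_HOURS: int = 24
    HEALTH_REDIS_HOST: str = "dragonfly"
    HEALTH_REDIS_PORT: int = 6379
    HEALTH_REDIS_CLUSTER_MODE: bool = False

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "HealthSettings":
        env = os.environ if env is None else env
        values = {}
        for f in fields(cls):
            raw = env.get(f.name)
            if raw is None:
                continue  # keep the default
            if f.type is bool:
                # bool("false") is True, so parse booleans explicitly.
                values[f.name] = raw.strip().lower() in _TRUTHY
            else:
                values[f.name] = f.type(raw)  # int(...) or str(...)
        return cls(**values)
```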

Service-Specific Thresholds

Tracker Fetcher specific thresholds:

# Queue health thresholds
LARGE_QUEUE_THRESHOLD = 10000      # Warning threshold
VERY_LARGE_QUEUE_THRESHOLD = 50000 # Unhealthy threshold

# Performance thresholds
MIN_SUCCESS_RATE_HEALTHY = 80.0    # Below this = degraded
MIN_SUCCESS_RATE_DEGRADED = 50.0   # Below this = unhealthy
MIN_FETCH_RATE = 10.0              # Trackers per hour

# API response time thresholds
MAX_API_RESPONSE_TIME = 10000      # 10 seconds in milliseconds
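These thresholds define three tiers per indicator. One plausible way to combine them, sketched below under assumptions, is to classify each indicator separately and take the worst result. HealthStatus, queue_status, success_rate_status, and overall_status are hypothetical names, and mapping the queue "warning threshold" to DEGRADED is an interpretation rather than something stated here.

```python
from enum import IntEnum

# Thresholds copied from the configuration above.
LARGE_QUEUE_THRESHOLD = 10000
VERY_LARGE_QUEUE_THRESHOLD = 50000
MIN_SUCCESS_RATE_HEALTHY = 80.0
MIN_SUCCESS_RATE_DEGRADED = 50.0


class HealthStatus(IntEnum):
    # Ordered so that max() picks the worst status.
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def queue_status(total_queue_size: int) -> HealthStatus:
    if total_queue_size > VERY_LARGE_QUEUE_THRESHOLD:
        return HealthStatus.UNHEALTHY
    if total_queue_size > LARGE_QUEUE_THRESHOLD:
        return HealthStatus.DEGRADED  # assumption: "warning" maps to degraded
    return HealthStatus.HEALTHY


def success_rate_status(success_rate_percent: float) -> HealthStatus:
    if success_rate_percent < MIN_SUCCESS_RATE_DEGRADED:
        return HealthStatus.UNHEALTHY
    if success_rate_percent < MIN_SUCCESS_RATE_HEALTHY:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY


def overall_status(total_queue_size: int, success_rate_percent: float) -> HealthStatus:
    # The service is only as healthy as its worst indicator.
    return max(queue_status(total_queue_size), success_rate_status(success_rate_percent))
```

Using an IntEnum keeps the worst-of rule a one-liner and makes adding further indicators a matter of adding another argument to max().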

Usage

Running Health Integration Tests

# Test the tracker fetcher health integration
docker compose exec dev ./scripts/test_tracker_fetcher_health_integration.py

Monitoring Health Data

The health monitoring system publishes data to Redis channels:

# Subscribe to tracker fetcher health status
HEALTH_CHANNEL_STATUS = "health:service:tracker_fetcher:status"

# Subscribe to detailed metrics
HEALTH_CHANNEL_METRICS = "health:service:tracker_fetcher:metrics"

# Subscribe to system alerts
HEALTH_CHANNEL_ALERTS = "health:system:alerts"
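A subscriber to these channels can be written with redis-py's asyncio client, which also works against Dragonfly since it speaks the Redis protocol. This is a sketch with two assumptions: payloads are JSON-encoded (the message format is not specified in this document), and decode_message and watch_health are hypothetical helper names. redis-py is imported inside watch_health so the decoding logic can be used and tested without it installed.

```python
import json
from typing import Any, Dict, Optional

CHANNELS = (
    "health:service:tracker_fetcher:status",
    "health:service:tracker_fetcher:metrics",
    "health:system:alerts",
)


def decode_message(message: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return channel and parsed payload for a data message; None for control messages."""
    if message.get("type") != "message":
        return None  # e.g. "subscribe" confirmations
    channel = message["channel"]
    data = message["data"]
    if isinstance(channel, bytes):
        channel = channel.decode()
    if isinstance(data, bytes):
        data = data.decode()
    return {"channel": channel, "payload": json.loads(data)}  # assumes JSON payloads


async def watch_health(host: str = "dragonfly", port: int = 6379) -> None:
    import redis.asyncio as redis  # redis-py >= 5.0.1 for aclose()

    client = redis.Redis(host=host, port=port)
    pubsub = client.pubsub()
    await pubsub.subscribe(*CHANNELS)
    try:
        async for message in pubsub.listen():
            event = decode_message(message)
            if event is not None:
                print(event["channel"], event["payload"])
    finally:
        await pubsub.unsubscribe(*CHANNELS)
        await client.aclose()
```

Run it with asyncio.run(watch_health()) from a container that can reach the health Redis host.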

Accessing Health Data Programmatically

import asyncio

from services.tracker_fetcher.service import TrackerFetcherService
from services.tracker_fetcher.health_monitor import TrackerFetcherHealthMonitor


async def main() -> None:
    # Create service and health monitor
    service = TrackerFetcherService()
    health_monitor = TrackerFetcherHealthMonitor(service)

    # collect_health_data() is a coroutine, so it must be awaited inside an event loop
    health_data = await health_monitor.collect_health_data()
    print(f"Service status: {health_data['status'].value}")
    print(f"Metrics: {health_data['metrics']}")
    print(f"Alerts: {len(health_data['alerts'])}")


asyncio.run(main())

Next Steps

Phase 4: Additional Service Integration

The framework is now ready to be extended to other services:

  1. Geocoding Service: Monitor geocoding API performance and cache hit rates
  2. Realtime Geofence Service: Monitor geofence detection performance
  3. Location Aggregator: Monitor aggregation performance and data processing
  4. Notification Service: Monitor notification delivery rates

Phase 5: Admin Panel Integration

Integration with the admin panel for health monitoring dashboard:

  1. Health Dashboard: Real-time service health visualization
  2. Metrics Charts: Historical performance and health trends
  3. Alert Management: Alert acknowledgment and resolution tracking
  4. Service Control: Start/stop services from admin panel

Benefits

Operational Visibility

  • Real-time Monitoring: Continuous health status updates
  • Performance Tracking: Detailed metrics on service performance
  • Proactive Alerting: Early warning of potential issues
  • Historical Analysis: Trend analysis and capacity planning

Reliability Improvements

  • Early Problem Detection: Issues identified before they impact users
  • Automated Recovery: Health-based service restart capabilities
  • Performance Optimization: Data-driven performance improvements
  • Capacity Planning: Usage patterns and scaling insights

Development Benefits

  • Debugging Support: Rich metrics for troubleshooting
  • Performance Profiling: Detailed timing and performance data
  • Integration Testing: Comprehensive health check capabilities
  • Service Dependencies: Clear visibility into service relationships

Conclusion

Phase 3 successfully integrates comprehensive health monitoring into the tracker fetcher service, providing:

  • Broad Health Visibility: Connectivity, queue, fetch performance, and API latency are all monitored
  • Proactive Alerting: Issues are detected and reported on every health publishing cycle
  • Performance Insights: Detailed metrics enable optimization
  • Operational Excellence: Foundation for reliable service operations

The implementation provides a template for integrating health monitoring into other services, establishing a consistent approach to service observability across the entire system.