
Final Root Cause Analysis - The Real Problem Behind Connection Failures

Executive Summary

After deep analysis of the codebase, deployment configuration, and error logs, I have identified the real root cause of the "magically stops" behavior. This is NOT just a connection issue - it's a fundamental architectural mismatch between the service design and deployment reality.

The Real Problem

What's Actually Happening

The error logs show:

tracker-fetcher-api-dev-1   | 2025-08-17 12:20:10 - Redis NOTIFICATIONS health check failed: Error 111 connecting to 192.168.100.1:6379. Connection refused.
tracker-fetcher-api-dev-1   | 2025-08-17 12:20:10 - Database error: connection to server at "192.168.100.1", port 5432 failed: Connection refused

But the critical insight is in the container name: tracker-fetcher-api-dev-1

The Deployment Reality

Looking at services/docker/compose.yml, I found:

  1. tracker-fetcher-api-dev (profile: dev-api) - Runs as API server on port 8081
  2. tracker-fetcher-worker-dev (profile: dev) - Runs as background worker
  3. tracker-fetcher-dev (profile: dev) - Runs as API server on port 8080

The error is coming from tracker-fetcher-api-dev-1, which means:

  • This is running the API mode (python -m services.tracker_fetcher.taskiq_main api)
  • But it's trying to do background processing work
  • The API mode is NOT designed to run continuous background loops
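
For orientation, an entrypoint like this usually dispatches on a CLI argument. The sketch below is illustrative only - the module path, app object, and port are assumptions, not the actual contents of taskiq_main.py:

# Illustrative sketch of a dual-mode entrypoint; module path, app object, and
# port are assumptions, not the actual contents of taskiq_main.py.
import asyncio
import sys

import uvicorn


async def run_worker() -> None:
    """Continuous batch-processing loop (excerpted in the next section)."""


def main() -> None:
    mode = sys.argv[1] if len(sys.argv) > 1 else "worker"
    if mode == "api":
        # "api": serve HTTP endpoints only; nothing runs between requests
        uvicorn.run("services.tracker_fetcher.taskiq_main:app", host="0.0.0.0", port=8080)
    else:
        # "worker": drive the background processing loop until shutdown
        asyncio.run(run_worker())


if __name__ == "__main__":
    main()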

The Architectural Mismatch

What the Code Does

In services/tracker_fetcher/taskiq_main.py:

async def run_worker():
    """Run the service as a worker (without FastAPI)."""
    # Main processing loop
    while running:
        try:
            # Process a batch
            result = await service.process_batch()
            # Wait before next batch
            await asyncio.wait_for(shutdown_event.wait(), timeout=settings.FETCH_INTERVAL)
        except Exception as e:
            logger.error("Error in processing loop", error=str(e))

This is the worker mode - designed for continuous background processing.

But the API mode just runs FastAPI endpoints - no background processing loop.
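
In sketch form (the route name is illustrative, not an actual endpoint), API mode amounts to handlers that execute only when a request arrives:

# Illustrative sketch of API mode: request/response handlers only, no loop.
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
async def health() -> dict:
    # Runs only when an HTTP request arrives; between requests nothing executes,
    # so no batches are fetched and no retries happen.
    return {"status": "ok"}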

What's Actually Running

The error logs show tracker-fetcher-api-dev-1 is running, which means:

  • Container is running in API mode (not worker mode)
  • API mode has no background processing loop
  • API mode only responds to HTTP requests
  • There is no continuous fetching happening

The "Magically Stops" Explanation

The service doesn't "magically stop" - it was never designed to run continuously in API mode.

Here's what's happening:

  1. API Mode Starts: Container starts in API mode, initializes connections
  2. Health Checks Run: SimpleHealthMonitor tries to check Redis/DB connections
  3. Connection Failures: Network issues cause connection failures
  4. No Recovery Loop: API mode has no background processing, so no retry attempts
  5. Service Appears Dead: Health checks fail, no processing happens, looks like it "stopped"

But the truth is: API mode was never supposed to do background processing.
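
For illustration, a passive health monitor of this kind typically looks like the sketch below - it observes and logs, but nothing in it restores connections or restarts processing. The class and its interface are hypothetical; the actual SimpleHealthMonitor may differ.

# Hypothetical sketch of a passive health monitor: it observes and logs,
# but takes no recovery action when a dependency is unreachable.
import asyncio
import logging

logger = logging.getLogger(__name__)


class PassiveHealthMonitorSketch:
    def __init__(self, checks: dict, interval: float = 30.0) -> None:
        self._checks = checks      # name -> zero-arg callable returning bool
        self._interval = interval

    async def run(self) -> None:
        while True:
            for name, check in self._checks.items():
                try:
                    healthy = check()
                except Exception as exc:
                    healthy = False
                    logger.error("%s health check failed: %s", name, exc)
                if not healthy:
                    # Nothing reconnects and nothing retries the work itself:
                    # the failure is only reported.
                    logger.warning("%s is unhealthy", name)
            await asyncio.sleep(self._interval)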

The Real Root Causes

1. Wrong Service Mode Running

Problem: The logs show tracker-fetcher-api-dev-1 running, but background processing should use tracker-fetcher-worker-dev.

Evidence:

  • Error logs from tracker-fetcher-api-dev-1
  • API mode has no processing loop
  • Worker mode has the continuous processing logic

2. Deployment Configuration Mismatch

Problem: The wrong Docker Compose profile is being used.

Current Reality:

  • dev-api profile runs API servers (ports 8081, 8003)
  • dev profile runs background workers
  • Error logs suggest dev-api profile is running when dev should be

3. No Resilience in API Mode

Problem: API mode has no connection resilience because it's not designed for continuous operation.

Evidence from Code:

# services/tracker_fetcher/taskiq_service.py
def health_check(self) -> dict:
    try:
        with get_db_context() as db:
            db.execute(text("SELECT 1"))
        db_healthy = True
    except Exception:
        db_healthy = False  # FAILS AND STOPS - NO RETRY
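
A minimal sketch of a more resilient probe - bounded retries with exponential backoff - is shown below. It reuses get_db_context from the excerpt above, and the retry parameters are illustrative:

# Sketch only: retry the probe a few times with backoff before declaring failure.
import time

from sqlalchemy import text


def check_db_with_retry(attempts: int = 3, base_delay: float = 0.5) -> bool:
    for attempt in range(attempts):
        try:
            with get_db_context() as db:  # assumed context manager from the service code
                db.execute(text("SELECT 1"))
            return True
        except Exception:
            if attempt < attempts - 1:
                # Exponential backoff: 0.5s, 1.0s, ... before the next attempt
                time.sleep(base_delay * (2 ** attempt))
    return False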

4. Missing Background Processing

Problem: No continuous background processing is happening because the wrong mode is running.

What Should Happen:

  • Worker mode runs a continuous loop every 5 minutes (FETCH_INTERVAL=300)
  • Worker mode has retry logic and error handling
  • Worker mode processes batches automatically

What's Actually Happening:

  • API mode waits for HTTP requests
  • No automatic batch processing
  • No continuous operation

The Complete Solution

Immediate Fix (Deployment)

  1. Stop Running API Mode for Background Processing
      • Use tracker-fetcher-worker-dev for continuous processing
      • Use tracker-fetcher-api-dev only for HTTP API access (optional)

  2. Correct Docker Compose Usage

# For background processing (what you actually want)
docker compose --profile dev up tracker-fetcher-worker-dev

# NOT this (API mode - no background processing)
docker compose --profile dev-api up tracker-fetcher-api-dev

Architectural Fix (Code)

  1. Add Connection Resilience to Both Modes
      • API mode needs resilient connections for HTTP requests
      • Worker mode needs resilient connections for continuous processing

  2. Implement Proper Health Monitoring
      • Active recovery in worker mode
      • Proper error reporting in API mode

  3. Add Circuit Breaker Pattern (a sketch follows this list)
      • Prevent cascade failures
      • Allow graceful degradation
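
A minimal circuit-breaker sketch is shown below; the thresholds, names, and wrapper are illustrative, not an existing implementation:

# Minimal circuit-breaker sketch (thresholds and naming are illustrative).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """Return True if a call may proceed (circuit closed or cooling-off elapsed)."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            # Half-open: permit one trial call after the cooling-off period
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


# Usage sketch: skip Redis/DB calls while the circuit is open instead of
# hammering an unreachable host and cascading the failure.
breaker = CircuitBreaker()


def guarded_call(fn):
    if not breaker.allow():
        raise RuntimeError("circuit open - dependency temporarily skipped")
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result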

Long-term Fix (TaskiQ Migration)

This analysis confirms the Task Management Replacement document is correct:

  • Current architecture is fundamentally flawed
  • Need database-driven queues (not Redis state)
  • Need proper distributed locking
  • Need resilient connection management
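
To make the database-driven queue and locking points concrete: one common PostgreSQL approach is row-level claiming with FOR UPDATE SKIP LOCKED. The sketch below assumes a hypothetical fetch_tasks table and is not the actual schema or migration plan:

# Hypothetical sketch of a database-driven queue claim with row-level locking
# (table and column names are illustrative, not the actual schema).
from sqlalchemy import text

CLAIM_NEXT_TASK = text(
    """
    UPDATE fetch_tasks
    SET status = 'in_progress', locked_by = :worker_id, locked_at = now()
    WHERE id = (
        SELECT id FROM fetch_tasks
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload
    """
)


def claim_next_task(db, worker_id: str):
    # SKIP LOCKED lets multiple workers poll the same table without blocking
    # each other or double-claiming a row (PostgreSQL semantics).
    row = db.execute(CLAIM_NEXT_TASK, {"worker_id": worker_id}).first()
    db.commit()
    return row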

Why This Keeps Happening

The Deployment Confusion

  1. Multiple Service Modes: API vs Worker modes are confusing
  2. Profile Complexity: Multiple Docker Compose profiles
  3. No Clear Documentation: Which mode to use for what purpose
  4. Wrong Assumptions: Assuming API mode does background processing

The Architectural Debt

  1. Legacy Celery Patterns: Code still follows Celery patterns
  2. No Resilience Design: System assumes perfect connections
  3. Mode Confusion: API and Worker modes have different purposes
  4. No Proper Orchestration: No clear service startup/dependency management

Success Criteria for Fix

Immediate (Deployment Fix)

  • Correct service mode running (worker, not API)
  • Continuous background processing every 5 minutes
  • Proper error handling and retry logic
  • Clear documentation on which mode to use

Short-term (Resilience Fix)

  • Connection resilience in both API and worker modes
  • Circuit breaker pattern implementation
  • Active health monitoring with recovery
  • Proper logging and error reporting

Long-term (Architecture Fix)

  • Complete TaskiQ migration per replacement document
  • Database-driven queue management
  • Distributed locking for horizontal scaling
  • Elimination of Redis queue state

Conclusion

The "unreliable behaviour of celery" mentioned in your original message is actually unreliable behaviour of running the wrong service mode.

The connection failures are real, but they're not the root cause. The root cause is:

  1. Wrong service mode running (API instead of Worker)
  2. No background processing happening (API mode doesn't process batches)
  3. No connection resilience (neither mode handles failures properly)
  4. Deployment configuration confusion (wrong Docker Compose profiles)

The fix is not to patch connection handling - it's to run the correct service mode and implement proper resilience patterns.

This analysis validates the Task Management Replacement strategy: the current system is architecturally flawed and needs the planned TaskiQ migration with database-driven queues and proper resilience patterns.