Final Root Cause Analysis - The Real Problem Behind Connection Failures
Executive Summary
After deep analysis of the codebase, deployment configuration, and error logs, I have identified the real root cause of the "magically stops" behavior. This is NOT just a connection issue - it's a fundamental architectural mismatch between the service design and deployment reality.
The Real Problem
What's Actually Happening
The error logs show:
tracker-fetcher-api-dev-1 | 2025-08-17 12:20:10 - Redis NOTIFICATIONS health check failed: Error 111 connecting to 192.168.100.1:6379. Connection refused.
tracker-fetcher-api-dev-1 | 2025-08-17 12:20:10 - Database error: connection to server at "192.168.100.1", port 5432 failed: Connection refused
But the critical insight is in the container name: tracker-fetcher-api-dev-1
The Deployment Reality
Looking at services/docker/compose.yml, I found:
- tracker-fetcher-api-dev (profile: dev-api) - Runs as API server on port 8081
- tracker-fetcher-worker-dev (profile: dev) - Runs as background worker
- tracker-fetcher-dev (profile: dev) - Runs as API server on port 8080
The error is coming from tracker-fetcher-api-dev-1, which means:
- This is running in API mode (python -m services.tracker_fetcher.taskiq_main api)
- But it is being relied on to do the background processing work
- The API mode is NOT designed to run continuous background loops
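To make the split concrete, here is a minimal sketch of how an entry point like taskiq_main could dispatch between the two modes. The argument handling, the uvicorn app path, and the port are assumptions for illustration only; run_worker is the function quoted in the next section.

# Hypothetical sketch of the mode dispatch in taskiq_main; not the actual file.
import asyncio
import sys

import uvicorn

def main() -> None:
    mode = sys.argv[1] if len(sys.argv) > 1 else "api"
    if mode == "api":
        # API mode: serve HTTP endpoints only; no continuous fetch loop is started.
        uvicorn.run("services.tracker_fetcher.api:app", host="0.0.0.0", port=8081)
    elif mode == "worker":
        # Worker mode: run the continuous batch-processing loop (run_worker, quoted below).
        asyncio.run(run_worker())
    else:
        raise SystemExit(f"Unknown mode: {mode}")

if __name__ == "__main__":
    main()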
The Architectural Mismatch
What the Code Does
In services/tracker_fetcher/taskiq_main.py:
async def run_worker():
    """Run the service as a worker (without FastAPI)."""
    # Main processing loop
    while running:
        try:
            # Process a batch
            result = await service.process_batch()
            # Wait before next batch
            await asyncio.wait_for(shutdown_event.wait(), timeout=settings.FETCH_INTERVAL)
        except Exception as e:
            logger.error("Error in processing loop", error=str(e))
This is the worker mode - designed for continuous background processing.
But the API mode just runs FastAPI endpoints - no background processing loop.
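By contrast, the API-mode surface is request-driven: nothing executes unless a client calls it. A minimal sketch, assuming a module-level service instance and these route names (both are illustrative, not the actual code):

# Hypothetical sketch of the API-mode surface; routes and the imported
# `service` singleton are assumptions for illustration.
from fastapi import FastAPI

from services.tracker_fetcher.taskiq_service import service  # assumed singleton

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # Reports status only when a client asks; nothing re-checks or recovers in the background.
    return service.health_check()

@app.post("/process")
async def process_once() -> dict:
    # A batch runs only when an HTTP client explicitly requests it.
    return await service.process_batch()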
What's Actually Running
The error logs show tracker-fetcher-api-dev-1 is running, which means:
- Container is running in API mode (not worker mode)
- API mode has no background processing loop
- API mode only responds to HTTP requests
- There is no continuous fetching happening
The "Magically Stops" Explanation
The service doesn't "magically stop" - it was never designed to run continuously in API mode.
Here's what's happening:
- API Mode Starts: Container starts in API mode, initializes connections
- Health Checks Run: SimpleHealthMonitor tries to check Redis/DB connections
- Connection Failures: Network issues cause connection failures
- No Recovery Loop: API mode has no background processing, so no retry attempts
- Service Appears Dead: Health checks fail, no processing happens, looks like it "stopped"
But the truth is: API mode was never supposed to do background processing.
The Real Root Causes
1. Wrong Service Mode Running
Problem: The logs show tracker-fetcher-api-dev-1 running, but background processing should use tracker-fetcher-worker-dev.
Evidence:
- Error logs come from tracker-fetcher-api-dev-1
- API mode has no processing loop
- Worker mode has the continuous processing logic
2. Deployment Configuration Mismatch
Problem: The wrong Docker Compose profile is being used.
Current Reality:
- The dev-api profile runs API servers (ports 8081, 8003)
- The dev profile runs background workers
- Error logs suggest the dev-api profile is running when dev should be (see the compose sketch below)
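A minimal sketch of how these profiles are presumably wired in services/docker/compose.yml. The service names come from the list above; the commands and ports are assumptions, so verify them against the actual file:

services:
  tracker-fetcher-worker-dev:
    profiles: ["dev"]
    command: python -m services.tracker_fetcher.taskiq_main worker  # continuous batch processing
  tracker-fetcher-api-dev:
    profiles: ["dev-api"]
    command: python -m services.tracker_fetcher.taskiq_main api     # HTTP endpoints only, no fetch loop
    ports:
      - "8081:8081"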
3. No Resilience in API Mode
Problem: API mode has no connection resilience because it's not designed for continuous operation.
Evidence from Code:
# services/tracker_fetcher/taskiq_service.py
def health_check(self) -> dict:
    try:
        with get_db_context() as db:
            db.execute(text("SELECT 1"))
        db_healthy = True
    except Exception:
        db_healthy = False  # FAILS AND STOPS - NO RETRY
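For comparison, even a small bounded retry with backoff would let the check survive brief network blips instead of reporting failure on the first refused connection. A minimal sketch; the attempt count and delay are arbitrary assumptions, and get_db_context/text are the same helpers used in the excerpt above:

# Sketch only: not the current implementation.
import time
from sqlalchemy import text

def health_check_with_retry(self, attempts: int = 3, delay: float = 1.0) -> dict:
    # Retry the probe a few times with linear backoff instead of
    # marking the database unhealthy on the first connection error.
    db_healthy = False
    for attempt in range(attempts):
        try:
            with get_db_context() as db:
                db.execute(text("SELECT 1"))
            db_healthy = True
            break
        except Exception:
            time.sleep(delay * (attempt + 1))
    return {"database": db_healthy}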
4. Missing Background Processing
Problem: No continuous background processing is happening because the wrong mode is running.
What Should Happen:
- Worker mode runs continuous loop every 5 minutes (FETCH_INTERVAL=300)
- Worker mode has retry logic and error handling
- Worker mode processes batches automatically
What's Actually Happening:
- API mode waits for HTTP requests
- No automatic batch processing
- No continuous operation
The Complete Solution
Immediate Fix (Deployment)
- Stop running API mode for background processing
- Use tracker-fetcher-worker-dev for continuous processing
- Use tracker-fetcher-api-dev only for HTTP API access (optional)
- Correct Docker Compose usage:
# For background processing (what you actually want)
docker compose --profile dev up tracker-fetcher-worker-dev
# NOT this (API mode - no background processing)
docker compose --profile dev-api up tracker-fetcher-api-dev
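To confirm which mode is actually running before and after the switch, the standard Docker/Compose commands are enough:

# Show which services each profile starts
docker compose --profile dev ps
docker compose --profile dev-api ps

# Inspect the command each running container was started with
docker ps --format '{{.Names}}\t{{.Command}}'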
Architectural Fix (Code)
- Add connection resilience to both modes
  - API mode needs resilient connections for HTTP requests
  - Worker mode needs resilient connections for continuous processing
- Implement proper health monitoring
  - Active recovery in worker mode
  - Proper error reporting in API mode
- Add a circuit breaker pattern (see the sketch after this list)
  - Prevent cascade failures
  - Allow graceful degradation
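A minimal circuit breaker sketch to illustrate the pattern; the threshold and cooldown values are arbitrary assumptions, not tuned recommendations:

import time

class CircuitBreaker:
    """Open the circuit after repeated failures; allow a probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        # Closed circuit: calls are allowed.
        if self.opened_at is None:
            return True
        # Open circuit: allow a single probe once the cooldown has elapsed.
        return (time.monotonic() - self.opened_at) >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

Wrapped around process_batch() and the health checks, a breaker like this turns a dead Redis/Postgres endpoint into periodic probes and logged degradation instead of a tight failure loop.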
Long-term Fix (TaskiQ Migration)
This analysis confirms the Task Management Replacement document is correct:
- Current architecture is fundamentally flawed
- Need database-driven queues (not Redis state)
- Need proper distributed locking
- Need resilient connection management
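As an illustration of what "database-driven queues" means in practice, a claim query along these lines removes queue state from Redis and lets multiple workers poll safely. The table and column names here are assumptions for illustration only:

# Sketch of claiming one pending task from a database-backed queue.
from sqlalchemy import text

CLAIM_ONE = text("""
    UPDATE fetch_tasks
       SET status = 'running', started_at = now()
     WHERE id = (
           SELECT id FROM fetch_tasks
            WHERE status = 'pending'
            ORDER BY scheduled_at
            LIMIT 1
              FOR UPDATE SKIP LOCKED
     )
    RETURNING id
""")

def claim_next_task(db) -> int | None:
    # FOR UPDATE SKIP LOCKED prevents two workers from claiming the same row,
    # which replaces Redis-held queue state and ad-hoc locking.
    row = db.execute(CLAIM_ONE).fetchone()
    return row[0] if row else None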
Why This Keeps Happening
The Deployment Confusion
- Multiple Service Modes: API vs Worker modes are confusing
- Profile Complexity: Multiple Docker Compose profiles
- No Clear Documentation: Which mode to use for what purpose
- Wrong Assumptions: Assuming API mode does background processing
The Architectural Debt
- Legacy Celery Patterns: Code still follows Celery patterns
- No Resilience Design: System assumes perfect connections
- Mode Confusion: API and Worker modes have different purposes
- No Proper Orchestration: No clear service startup/dependency management
Success Criteria for Fix
Immediate (Deployment Fix)
- Correct service mode running (worker, not API)
- Continuous background processing every 5 minutes
- Proper error handling and retry logic
- Clear documentation on which mode to use
Short-term (Resilience Fix)
- Connection resilience in both API and worker modes
- Circuit breaker pattern implementation
- Active health monitoring with recovery
- Proper logging and error reporting
Long-term (Architecture Fix)
- Complete TaskiQ migration per replacement document
- Database-driven queue management
- Distributed locking for horizontal scaling
- Elimination of Redis queue state
Conclusion
The "unreliable behaviour of celery" mentioned in your original message is actually unreliable behaviour of running the wrong service mode.
The connection failures are real, but they're not the root cause. The root cause is:
- Wrong service mode running (API instead of Worker)
- No background processing happening (API mode doesn't process batches)
- No connection resilience (neither mode handles failures properly)
- Deployment configuration confusion (wrong Docker Compose profiles)
The fix is not to patch connection handling - it's to run the correct service mode and implement proper resilience patterns.
This analysis validates the Task Management Replacement strategy: the current system is architecturally flawed and needs the planned TaskiQ migration with database-driven queues and proper resilience patterns.