Final Root Cause Analysis - The Real Problem Behind Connection Failures
Executive Summary
After deep analysis of the codebase, deployment configuration, and error logs, I have identified the real root cause of the "magically stops" behavior. This is NOT just a connection issue - it's a fundamental architectural mismatch between the service design and deployment reality.
The Real Problem
What's Actually Happening
The error logs show:
tracker-fetcher-api-dev-1 | 2025-08-17 12:20:10 - Redis NOTIFICATIONS health check failed: Error 111 connecting to 192.168.100.1:6379. Connection refused.
tracker-fetcher-api-dev-1 | 2025-08-17 12:20:10 - Database error: connection to server at "192.168.100.1", port 5432 failed: Connection refused
But the critical insight is in the container name: tracker-fetcher-api-dev-1
The Deployment Reality
Looking at services/docker/compose.yml, I found:
- tracker-fetcher-api-dev (profile: dev-api) - Runs as API server on port 8081
- tracker-fetcher-worker-dev (profile: dev) - Runs as background worker
- tracker-fetcher-dev (profile: dev) - Runs as API server on port 8080
The error is coming from tracker-fetcher-api-dev-1, which means:
- This is running in API mode (python -m services.tracker_fetcher.taskiq_main api)
- But it is being relied on to do the background processing work
- The API mode is NOT designed to run continuous background loops
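To make the split concrete, here is a minimal sketch of how an entry point like taskiq_main could dispatch between the two modes. The argument handling, the uvicorn app path, and the port are assumptions for illustration only; run_worker is the function quoted in the next section.

# Hypothetical sketch of the mode dispatch in taskiq_main; not the actual file.
import asyncio
import sys

import uvicorn

def main() -> None:
    mode = sys.argv[1] if len(sys.argv) > 1 else "api"
    if mode == "api":
        # API mode: serve HTTP endpoints only; no continuous fetch loop is started.
        uvicorn.run("services.tracker_fetcher.api:app", host="0.0.0.0", port=8081)
    elif mode == "worker":
        # Worker mode: run the continuous batch-processing loop (run_worker, quoted below).
        asyncio.run(run_worker())
    else:
        raise SystemExit(f"Unknown mode: {mode}")

if __name__ == "__main__":
    main()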
The Architectural Mismatch
What the Code Does
In services/tracker_fetcher/taskiq_main.py:
async def run_worker():
    """Run the service as a worker (without FastAPI)."""
    # Main processing loop
    while running:
        try:
            # Process a batch
            result = await service.process_batch()
            # Wait before next batch
            await asyncio.wait_for(shutdown_event.wait(), timeout=settings.FETCH_INTERVAL)
        except Exception as e:
            logger.error("Error in processing loop", error=str(e))
This is the worker mode - designed for continuous background processing.
But the API mode just runs FastAPI endpoints - no background processing loop.
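By contrast, the API-mode surface is request-driven: nothing executes unless a client calls it. A minimal sketch, assuming a module-level service instance and these route names (both are illustrative, not the actual code):

# Hypothetical sketch of the API-mode surface; routes and the imported
# `service` singleton are assumptions for illustration.
from fastapi import FastAPI

from services.tracker_fetcher.taskiq_service import service  # assumed singleton

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # Reports status only when a client asks; nothing re-checks or recovers in the background.
    return service.health_check()

@app.post("/process")
async def process_once() -> dict:
    # A batch runs only when an HTTP client explicitly requests it.
    return await service.process_batch()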
What's Actually Running
The error logs show tracker-fetcher-api-dev-1 is running, which means:
- Container is running in API mode (not worker mode)
- API mode has no background processing loop
- API mode only responds to HTTP requests
- There is no continuous fetching happening
The "Magically Stops" Explanation
The service doesn't "magically stop" - it was never designed to run continuously in API mode.
Here's what's happening:
- API Mode Starts: Container starts in API mode, initializes connections
- Health Checks Run: SimpleHealthMonitor tries to check Redis/DB connections
- Connection Failures: Network issues cause connection failures
- No Recovery Loop: API mode has no background processing, so no retry attempts
- Service Appears Dead: Health checks fail, no processing happens, looks like it "stopped"
But the truth is: API mode was never supposed to do background processing.
The Real Root Causes
1. Wrong Service Mode Running
Problem: The logs show tracker-fetcher-api-dev-1 running, but background processing should use tracker-fetcher-worker-dev.
Evidence:
- Error logs come from tracker-fetcher-api-dev-1
- API mode has no processing loop
- Worker mode has the continuous processing logic
2. Deployment Configuration Mismatch
Problem: The wrong Docker Compose profile is being used.
Current Reality:
- The dev-api profile runs API servers (ports 8081, 8003)
- The dev profile runs background workers
- Error logs suggest the dev-api profile is running when dev should be (see the compose sketch below)
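A minimal sketch of how these profiles are presumably wired in services/docker/compose.yml. The service names come from the list above; the commands and ports are assumptions, so verify them against the actual file:

services:
  tracker-fetcher-worker-dev:
    profiles: ["dev"]
    command: python -m services.tracker_fetcher.taskiq_main worker  # continuous batch processing
  tracker-fetcher-api-dev:
    profiles: ["dev-api"]
    command: python -m services.tracker_fetcher.taskiq_main api     # HTTP endpoints only, no fetch loop
    ports:
      - "8081:8081"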
3. No Resilience in API Mode
Problem: API mode has no connection resilience because it's not designed for continuous operation.
Evidence from Code:
# services/tracker_fetcher/taskiq_service.py
def health_check(self) -> dict:
    try:
        with get_db_context() as db:
            db.execute(text("SELECT 1"))
        db_healthy = True
    except Exception:
        db_healthy = False  # FAILS AND STOPS - NO RETRY
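For comparison, even a small bounded retry with backoff would let the check survive brief network blips instead of reporting failure on the first refused connection. A minimal sketch; the attempt count and delay are arbitrary assumptions, and get_db_context/text are the same helpers used in the excerpt above:

# Sketch only: not the current implementation.
import time
from sqlalchemy import text

def health_check_with_retry(self, attempts: int = 3, delay: float = 1.0) -> dict:
    # Retry the probe a few times with linear backoff instead of
    # marking the database unhealthy on the first connection error.
    db_healthy = False
    for attempt in range(attempts):
        try:
            with get_db_context() as db:
                db.execute(text("SELECT 1"))
            db_healthy = True
            break
        except Exception:
            time.sleep(delay * (attempt + 1))
    return {"database": db_healthy}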
4. Missing Background Processing
Problem: No continuous background processing is happening because the wrong mode is running.
What Should Happen:
- Worker mode runs continuous loop every 5 minutes (FETCH_INTERVAL=300)
- Worker mode has retry logic and error handling
- Worker mode processes batches automatically
What's Actually Happening:
- API mode waits for HTTP requests
- No automatic batch processing
- No continuous operation
The Complete Solution
Immediate Fix (Deployment)
- Stop running API mode for background processing
- Use tracker-fetcher-worker-dev for continuous processing
- Use tracker-fetcher-api-dev only for HTTP API access (optional)
- Correct Docker Compose usage:
# For background processing (what you actually want)
docker compose --profile dev up tracker-fetcher-worker-dev
# NOT this (API mode - no background processing)
docker compose --profile dev-api up tracker-fetcher-api-dev
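To confirm which mode is actually running before and after the switch, the standard Docker/Compose commands are enough:

# Show which services each profile starts
docker compose --profile dev ps
docker compose --profile dev-api ps

# Inspect the command each running container was started with
docker ps --format '{{.Names}}\t{{.Command}}'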
Architectural Fix (Code)
- Add connection resilience to both modes
  - API mode needs resilient connections for HTTP requests
  - Worker mode needs resilient connections for continuous processing
- Implement proper health monitoring
  - Active recovery in worker mode
  - Proper error reporting in API mode
- Add a circuit breaker pattern (see the sketch after this list)
  - Prevent cascade failures
  - Allow graceful degradation
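A minimal circuit breaker sketch to illustrate the pattern; the threshold and cooldown values are arbitrary assumptions, not tuned recommendations:

import time

class CircuitBreaker:
    """Open the circuit after repeated failures; allow a probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        # Closed circuit: calls are allowed.
        if self.opened_at is None:
            return True
        # Open circuit: allow a single probe once the cooldown has elapsed.
        return (time.monotonic() - self.opened_at) >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

Wrapped around process_batch() and the health checks, a breaker like this turns a dead Redis/Postgres endpoint into periodic probes and logged degradation instead of a tight failure loop.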
Long-term Fix (TaskiQ Migration)
This analysis confirms the Task Management Replacement document is correct:
- Current architecture is fundamentally flawed
- Need database-driven queues (not Redis state)
- Need proper distributed locking
- Need resilient connection management
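As an illustration of what "database-driven queues" means in practice, a claim query along these lines removes queue state from Redis and lets multiple workers poll safely. The table and column names here are assumptions for illustration only:

# Sketch of claiming one pending task from a database-backed queue.
from sqlalchemy import text

CLAIM_ONE = text("""
    UPDATE fetch_tasks
       SET status = 'running', started_at = now()
     WHERE id = (
           SELECT id FROM fetch_tasks
            WHERE status = 'pending'
            ORDER BY scheduled_at
            LIMIT 1
              FOR UPDATE SKIP LOCKED
     )
    RETURNING id
""")

def claim_next_task(db) -> int | None:
    # FOR UPDATE SKIP LOCKED prevents two workers from claiming the same row,
    # which replaces Redis-held queue state and ad-hoc locking.
    row = db.execute(CLAIM_ONE).fetchone()
    return row[0] if row else None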
Why This Keeps Happening
The Deployment Confusion
- Multiple Service Modes: API vs Worker modes are confusing
- Profile Complexity: Multiple Docker Compose profiles
- No Clear Documentation: Which mode to use for what purpose
- Wrong Assumptions: Assuming API mode does background processing
The Architectural Debt
- Legacy Celery Patterns: Code still follows Celery patterns
- No Resilience Design: System assumes perfect connections
- Mode Confusion: API and Worker modes have different purposes
- No Proper Orchestration: No clear service startup/dependency management
Success Criteria for Fix
Immediate (Deployment Fix)
- Correct service mode running (worker, not API)
- Continuous background processing every 5 minutes
- Proper error handling and retry logic
- Clear documentation on which mode to use
Short-term (Resilience Fix)
- Connection resilience in both API and worker modes
- Circuit breaker pattern implementation
- Active health monitoring with recovery
- Proper logging and error reporting
Long-term (Architecture Fix)
- Complete TaskiQ migration per replacement document
- Database-driven queue management
- Distributed locking for horizontal scaling
- Elimination of Redis queue state
Conclusion
The "unreliable behaviour of celery" mentioned in your original message is actually unreliable behaviour of running the wrong service mode.
The connection failures are real, but they're not the root cause. The root cause is:
- Wrong service mode running (API instead of Worker)
- No background processing happening (API mode doesn't process batches)
- No connection resilience (neither mode handles failures properly)
- Deployment configuration confusion (wrong Docker Compose profiles)
The fix is not to patch connection handling - it's to run the correct service mode and implement proper resilience patterns.
This analysis validates the Task Management Replacement strategy: the current system is architecturally flawed and needs the planned TaskiQ migration with database-driven queues and proper resilience patterns.