When an event fails after all retries, we just log it: no alerts, no visibility. That makes it hard to:

* Spot recurring failure patterns
* Catch critical issues quickly
* Tell recoverable errors from unrecoverable ones

**Current behavior:**

* 20 retries, 30-second backoff (`saveEventProcessingJob.ts:35-36`)
* Logs look like: `❌ [${job}] failed: ${err.message}` (`queue.ts:35`)
* No error categorization
* No alerting for critical failures

We need to find a way to be notified about (critical) errors.
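
One possible direction, as a rough sketch rather than a proposal for the final design: classify each failure, then notify once retries are exhausted, or on the first attempt when the error looks critical. Every name here (`classifyError`, `onJobFailed`, `FailureAlert`, `MAX_ATTEMPTS`) is hypothetical and not taken from the codebase, the regex patterns are placeholder heuristics, and `notify` stands in for whatever channel we choose (Slack, email, an error tracker).

```typescript
// Hypothetical sketch: none of these names exist in the codebase today.
type Severity = "critical" | "unrecoverable" | "recoverable";

interface FailureAlert {
  job: string;
  severity: Severity;
  message: string;
  attempts: number;
}

// Assumed to mirror the current limit of 20 retries.
const MAX_ATTEMPTS = 20;

// Placeholder heuristics: replace the patterns with ones based on
// the errors that actually show up in the logs.
function classifyError(err: Error): Severity {
  const msg = err.message.toLowerCase();
  if (/(database|connection refused|out of memory)/.test(msg)) return "critical";
  if (/(validation|invalid|not found)/.test(msg)) return "unrecoverable";
  return "recoverable";
}

// Logs every failure with its category and sends an alert when retries
// are exhausted, or on the first attempt of a critical error.
// Returns the alert that was sent, or null if none was sent.
function onJobFailed(
  job: string,
  err: Error,
  attemptsMade: number,
  notify: (alert: FailureAlert) => void,
): FailureAlert | null {
  const severity = classifyError(err);
  console.error(`❌ [${job}] failed (${severity}): ${err.message}`);

  const exhausted = attemptsMade >= MAX_ATTEMPTS;
  const firstCritical = severity === "critical" && attemptsMade === 1;
  if (!exhausted && !firstCritical) return null;

  const alert: FailureAlert = { job, severity, message: err.message, attempts: attemptsMade };
  notify(alert);
  return alert;
}
```

Alerting on a critical error only at its first attempt keeps a single failing job from producing 20 notifications. Whether that is the right trade-off depends on the alert channel we pick.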