ADR-0042: Notification fan-out is fail-soft per channel
When ODD Platform fans an alert out to its notification channels, a failure in one channel is logged and the rest still receive it — one broken channel never blocks the others.
Status
Accepted. Reconstructed from the codebase on 2026-05-30; the decision is live in the source today.
Context
A single alert can fan out to several notification channels (Slack, webhook, email). Any one of them can fail transiently — a webhook endpoint is down, an SMTP server times out. The platform has to decide what happens to the other channels, and to the next alert, when one send fails: stop the whole fan-out, or carry on.
Decision
Fan-out is fail-soft per channel: a send failure is caught, logged at ERROR naming the channel, and the loop continues to the next channel. AlertNotificationMessageProcessor.process iterates the configured senders and calls each inside a try/catch (NotificationSenderException); on exception it logs the failing channel's receiverId() and proceeds to the next sender. The exception does not propagate, so the next sender still runs and the next WAL message is still processed.
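The loop shape described above can be sketched as follows. This is a minimal illustration, not the platform's source: the types here are hypothetical stand-ins that only borrow the names used in this ADR, NotificationSenderException is assumed to be unchecked, logging goes to System.err instead of a logger, and the returned list of failed receiver ids and the StubSender helper are additions that make the behaviour observable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; assumed unchecked for this sketch.
class NotificationSenderException extends RuntimeException {
    NotificationSenderException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a channel sender (Slack, webhook, email).
interface NotificationSender<T> {
    void send(T message);

    String receiverId();
}

class FailSoftFanOut {
    // Calls every sender in order. A NotificationSenderException from one
    // sender is caught and logged, and the loop moves on to the next sender.
    // Returns the receiver ids that failed (an addition for illustration).
    static <T> List<String> process(List<NotificationSender<T>> senders, T message) {
        List<String> failed = new ArrayList<>();
        for (NotificationSender<T> sender : senders) {
            try {
                sender.send(message);
            } catch (NotificationSenderException e) {
                // No rethrow: the failure stays local to this channel.
                System.err.printf("Error occurred while sending notification via %s: %s%n",
                        sender.receiverId(), e.getMessage());
                failed.add(sender.receiverId());
            }
        }
        return failed;
    }
}

// Illustrative sender: records its id on success, throws when told to fail.
class StubSender implements NotificationSender<String> {
    private final String id;
    private final boolean failing;
    private final List<String> delivered;

    StubSender(String id, boolean failing, List<String> delivered) {
        this.id = id;
        this.failing = failing;
        this.delivered = delivered;
    }

    @Override
    public void send(String message) {
        if (failing) {
            throw new NotificationSenderException(id + " unavailable");
        }
        delivered.add(id);
    }

    @Override
    public String receiverId() {
        return id;
    }
}
```

With three stub senders where only the middle one fails, process still delivers to the first and third and reports the middle one as failed. Note that the catch is deliberately narrow in this sketch: any exception other than NotificationSenderException would still propagate out of the loop.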
The decision encodes "one bad channel does not block the others" as the operational stance. The alternative — let the first failure abort the fan-out — would couple every channel's delivery to the least reliable one, and would stall WAL progress behind a single bad endpoint.
Consequences
A misconfigured or down channel does not stop the others: Slack still gets the alert if the webhook is failing, and the WAL keeps advancing rather than wedging behind a failed send.
📌 Partial failure is operator-visible only in logs. Because the failure is caught and logged rather than surfaced, the platform keeps no delivery-status record, counter, or alert for a channel that keeps failing; an operator learns of a dead channel only by inspecting ERROR logs. Closing that blind spot (with a delivery audit trail or a failure metric) would be an additive change that does not alter the fail-soft stance.
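One possible shape for the failure metric mentioned above is a per-channel counter incremented from the catch block. The sketch below is hypothetical: ChannelFailureCounter and its methods do not exist in the platform, and a real implementation would more likely publish to whatever metrics registry the deployment already uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory failure counter keyed by a channel's receiver id.
class ChannelFailureCounter {
    private final Map<String, LongAdder> failuresByChannel = new ConcurrentHashMap<>();

    // Intended to be called from the catch block, next to the existing log line.
    void recordFailure(String receiverId) {
        failuresByChannel.computeIfAbsent(receiverId, id -> new LongAdder()).increment();
    }

    // Number of failures recorded for the channel so far; 0 if none.
    long failures(String receiverId) {
        LongAdder adder = failuresByChannel.get(receiverId);
        return adder == null ? 0L : adder.sum();
    }
}
```

Because the counter only observes failures and never rethrows, adding it would leave the continue-on-failure behaviour unchanged.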
The stance is consistent with the platform's broader "best-effort across a list of independent operations" convention (the same continue-on-failure shape used by the partition-management orchestrator).
Evidence
odd-platform-api/.../notification/processor/AlertNotificationMessageProcessor.java:26-35 (the fan-out loop): for (… notificationSender : notificationSenders) { try { notificationSender.send(notificationMessage); } catch (NotificationSenderException e) { log.error(…"Error occurred while sending notification via %s"…, notificationSender.receiverId()…); } } Caught, logged, loop continues; no rethrow.
odd-platform-api/.../notification/processor/AlertNotificationMessageProcessor.java:19 (the fan-out target): private final List<NotificationSender<AlertNotificationMessage>> notificationSenders; This is the list of activated channel senders (per ADR-0041).
See also
ADR-0041 — Notification channels activate by the presence of their keys — what populates the list of senders this fan-out iterates.
ADR-0043 — Notification WAL consumer is a leader-elected singleton — the WAL loop whose progress fail-soft fan-out protects.