ADR-0042: Notification fan-out is fail-soft per channel

When ODD Platform fans an alert out to its notification channels, a failure in one channel is logged and the rest still receive it — one broken channel never blocks the others.

Status

Accepted. Reconstructed from the codebase on 2026-05-30; the decision is live in the source today.

Context

A single alert can fan out to several notification channels (Slack, webhook, email). Any one of them can fail transiently — a webhook endpoint is down, an SMTP server times out. The platform has to decide what happens to the other channels, and to the next alert, when one send fails: stop the whole fan-out, or carry on.

Decision

Fan-out is fail-soft per channel: a send failure is caught, logged at ERROR naming the channel, and the loop continues to the next channel. AlertNotificationMessageProcessor.process iterates the configured senders and calls each inside a try/catch (NotificationSenderException); on exception it logs the failing channel's receiverId() and proceeds to the next sender. The exception does not propagate, so the next sender still runs and the next WAL message is still processed.

The operational stance is "one bad channel does not block the others". The alternative, letting the first failure abort the fan-out, would couple every channel's delivery to the least reliable one and would stall WAL progress behind a single bad endpoint.
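The loop described in this section can be sketched as follows. This is a minimal, self-contained illustration and not the platform's source: the NotificationSender interface and NotificationSenderException class are simplified stand-ins declared locally, the sender(...) helper and the returned list of failed receiver IDs are hypothetical additions for demonstration, and System.err stands in for the real logger.

```java
import java.util.ArrayList;
import java.util.List;

public class FailSoftFanOutSketch {

    // Hypothetical stand-in for the platform's NotificationSenderException.
    static class NotificationSenderException extends Exception {
        NotificationSenderException(String message) {
            super(message);
        }
    }

    // Simplified stand-in for NotificationSender<AlertNotificationMessage>.
    interface NotificationSender<T> {
        void send(T message) throws NotificationSenderException;

        String receiverId();
    }

    // Sends the message to every sender. A failure is caught and logged,
    // and the loop moves on to the next sender; nothing is rethrown.
    // Returning the failed receiver IDs is for illustration only.
    static <T> List<String> fanOut(List<NotificationSender<T>> senders, T message) {
        List<String> failed = new ArrayList<>();
        for (NotificationSender<T> sender : senders) {
            try {
                sender.send(message);
            } catch (NotificationSenderException e) {
                System.err.printf("Error occurred while sending notification via %s: %s%n",
                        sender.receiverId(), e.getMessage());
                failed.add(sender.receiverId());
            }
        }
        return failed;
    }

    // Hypothetical test double: records its ID on success, throws when 'fails' is true.
    static NotificationSender<String> sender(String id, boolean fails, List<String> delivered) {
        return new NotificationSender<String>() {
            @Override
            public void send(String message) throws NotificationSenderException {
                if (fails) {
                    throw new NotificationSenderException(id + " is unreachable");
                }
                delivered.add(id);
            }

            @Override
            public String receiverId() {
                return id;
            }
        };
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        List<NotificationSender<String>> senders = List.of(
                sender("slack", false, delivered),
                sender("webhook", true, delivered),
                sender("email", false, delivered));

        List<String> failed = fanOut(senders, "alert");
        System.out.println("delivered=" + delivered + " failed=" + failed);
    }
}
```

With the webhook sender failing in the middle of the list, the sketch still delivers to slack and email and prints delivered=[slack, email] failed=[webhook].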

Consequences

  • A misconfigured or down channel does not stop the others: Slack still gets the alert if the webhook is failing, and the WAL keeps advancing rather than wedging behind a failed send.

  • 📌 Partial failure is visible to operators only in logs. Because the failure is caught and logged rather than surfaced, the platform keeps no delivery-status record, counter, or alert for a channel that keeps failing; an operator learns of a dead channel only by inspecting ERROR logs. Closing that blind spot (with a delivery audit trail or a failure metric) would be an additive change that does not alter the fail-soft stance.

  • The stance is consistent with the platform's broader "best-effort across a list of independent operations" convention (the same continue-on-failure shape used by the partition-management orchestrator).
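To illustrate why a failure metric would be additive, the sketch below shows one possible shape for a per-channel failure counter. Nothing like it exists in the platform today according to this ADR; the class name ChannelFailureCounterSketch and its methods are hypothetical. The assumption is that recordFailure would be called from the existing catch block next to the log.error call, leaving the continue-on-failure loop unchanged.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-channel failure counter; not part of the platform.
public class ChannelFailureCounterSketch {

    // One counter per receiverId; safe for concurrent increments.
    private final Map<String, LongAdder> failuresByReceiver = new ConcurrentHashMap<>();

    // Intended to be called from the catch block, alongside the ERROR log line.
    void recordFailure(String receiverId) {
        failuresByReceiver.computeIfAbsent(receiverId, id -> new LongAdder()).increment();
    }

    // Returns the number of failures recorded so far, or 0 for an unknown channel.
    long failures(String receiverId) {
        LongAdder counter = failuresByReceiver.get(receiverId);
        return counter == null ? 0L : counter.sum();
    }

    public static void main(String[] args) {
        ChannelFailureCounterSketch counter = new ChannelFailureCounterSketch();
        counter.recordFailure("webhook");
        counter.recordFailure("webhook");
        System.out.println("webhook failures: " + counter.failures("webhook"));
        System.out.println("slack failures: " + counter.failures("slack"));
    }
}
```

Because the counter only observes failures that are already caught, wiring it in would not change which senders run or whether the WAL advances.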

Evidence

  • odd-platform-api/.../notification/processor/AlertNotificationMessageProcessor.java:26-35 — the fan-out loop: for (… notificationSender : notificationSenders) { try { notificationSender.send(notificationMessage); } catch (NotificationSenderException e) { log.error(…"Error occurred while sending notification via %s"…, notificationSender.receiverId()…); } } — caught, logged, loop continues; no rethrow.

  • odd-platform-api/.../notification/processor/AlertNotificationMessageProcessor.java:19 (private final List<NotificationSender<AlertNotificationMessage>> notificationSenders;) — the fan-out target is the list of activated channel senders (per ADR-0041).

See also

Last updated