ADR-0043: The notification WAL consumer is a leader-elected singleton
ODD Platform consumes the Postgres WAL for notifications from one thread on one replica, elected by a Postgres advisory lock — so a multi-replica deployment emits each alert once, with no broker.
Status
Accepted. Reconstructed from the codebase on 2026-05-30; the decision is live in the source today.
Context
Notifications are driven by the Postgres write-ahead log: when an alert row is written, a logical-replication consumer reads it and fans it out. In a multi-replica deployment, if every replica consumed the WAL, every alert would be delivered N times. The platform needs exactly-once-cluster-wide consumption — and, in keeping with its Postgres-as-only-runtime-dependency posture, without introducing an external coordinator (ZooKeeper/Consul/etcd) for leader election.
Decision
The WAL consumer runs on a single thread that holds a Postgres advisory lock; only the lock-holding replica consumes. At application startup (ApplicationReadyEvent), NotificationSubscriberStarter submits the subscriber to a single-thread executor whose thread is named notification-subscriber-thread. The subscriber's first action is leaderElectionManager.acquire(walProperties.getAdvisoryLockId(), true) — the blocking form, so a replica that is not the leader blocks here and never reads the WAL. Only the replica that holds the advisory lock opens the logical-replication stream and processes messages.
The advisory lock id is operator-tunable (notifications.wal.advisory-lock-id, default 100), drawn from the same disjoint per-subsystem namespace as the Data Collaboration sender (ADR-0020). On the leader, consumption is single-threaded by construction (one executor thread), so WAL messages are processed in order. If the leader dies, its Postgres session ends and the session-level advisory lock is released; a waiting replica's blocked acquire then returns and that replica takes over.
This is the same single-leader-via-Postgres-advisory-lock mechanism as ADR-0020's outbound Slack sender; the two are instances of one cluster-coordination convention, each keyed by a distinct lock id.
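The election and hand-over described in this section can be illustrated with a small, self-contained Java simulation. It is only a sketch under stated assumptions: FakeAdvisoryLock, SingletonConsumerSketch, Result and simulate are hypothetical names, and a java.util.concurrent.Semaphore stands in for the Postgres advisory lock so the example runs without a database. Only the single-thread executor and the thread name are taken from this ADR; the platform's real subscriber is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class SingletonConsumerSketch {

    /** Hypothetical in-memory stand-in for one Postgres advisory lock id. */
    static final class FakeAdvisoryLock {
        private final Semaphore permit = new Semaphore(1);

        // Blocks until the lock is free, like the blocking acquire in the ADR.
        void acquireBlocking() throws InterruptedException { permit.acquire(); }

        // Models Postgres releasing a session-level lock when the session ends.
        void releaseOnSessionEnd() { permit.release(); }
    }

    /** Simple result holder (hypothetical). */
    static final class Result {
        final int leadersSeen;
        final int maxConcurrentLeaders;

        Result(int leadersSeen, int maxConcurrentLeaders) {
            this.leadersSeen = leadersSeen;
            this.maxConcurrentLeaders = maxConcurrentLeaders;
        }
    }

    /** Runs the given number of simulated replicas that all compete for one lock. */
    static Result simulate(int replicas) throws InterruptedException {
        FakeAdvisoryLock lock = new FakeAdvisoryLock();
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        AtomicInteger leadersSeen = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(replicas);
        List<ExecutorService> executors = new ArrayList<>();

        for (int i = 0; i < replicas; i++) {
            // One consumer thread per replica, named as in the ADR.
            ExecutorService executor = Executors.newSingleThreadExecutor(
                    r -> new Thread(r, "notification-subscriber-thread"));
            executors.add(executor);
            executor.submit(() -> {
                try {
                    lock.acquireBlocking(); // standbys wait here and read nothing
                    try {
                        int now = active.incrementAndGet();
                        maxActive.accumulateAndGet(now, Math::max);
                        leadersSeen.incrementAndGet();
                        Thread.sleep(20); // stand-in for consuming the WAL stream
                        active.decrementAndGet();
                    } finally {
                        lock.releaseOnSessionEnd(); // simulated leader death
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();
        executors.forEach(ExecutorService::shutdown);
        return new Result(leadersSeen.get(), maxActive.get());
    }

    public static void main(String[] args) throws InterruptedException {
        Result result = simulate(3);
        System.out.println("leaders seen: " + result.leadersSeen
                + ", max concurrent leaders: " + result.maxConcurrentLeaders);
    }
}
```

In this simulation every replica eventually leads, but never more than one at a time, which is the property the advisory lock is relied on to provide. Unlike the simulation, a real leader holds the lock for as long as its session stays open.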
Consequences
Each alert is consumed once cluster-wide: non-leader replicas block on the lock and never double-deliver, with no external coordinator — Postgres is the only dependency.
Consumption does not scale horizontally: adding replicas adds standbys, not parallel consumers, and throughput is bounded by the single consumer thread. This is intentional: ordering and exactly-once consumption are prioritized over throughput.
Failover is automatic via advisory-lock release semantics: when the leader's session ends, Postgres frees the lock, one standby's blocked acquire call returns, and that replica takes over.
Because the single thread both reads the WAL and drives fan-out, a sender that blocks the thread would stall consumption — which is exactly why fan-out is fail-soft (ADR-0042), so one slow/broken channel cannot wedge the WAL.
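The fail-soft idea referenced above can be sketched as a per-channel try/catch around each send. This is a minimal illustration, not the platform's fan-out code: ChannelSender, fanOut and FailSoftFanOutSketch are hypothetical names, and the details of ADR-0042 are not reproduced here. Note that catching exceptions only covers senders that throw; a sender that hangs would additionally need a timeout, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FailSoftFanOutSketch {

    /** Hypothetical channel abstraction; the real sender interface is not shown in this ADR. */
    interface ChannelSender {
        void send(String alert) throws Exception;
    }

    /** Delivers one alert to every channel and returns the names of the channels that failed. */
    static List<String> fanOut(String alert, Map<String, ChannelSender> channels) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, ChannelSender> channel : channels.entrySet()) {
            try {
                channel.getValue().send(alert);
            } catch (Exception e) {
                // Fail soft: record the failure and continue with the next channel,
                // so the consumer thread can move on to the next WAL message.
                failed.add(channel.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        Map<String, ChannelSender> channels = new LinkedHashMap<>();
        channels.put("slack", alert -> { throw new IllegalStateException("webhook down"); });
        channels.put("email", delivered::add);

        List<String> failed = fanOut("alert-1", channels);
        System.out.println("failed=" + failed + " delivered=" + delivered);
        // prints: failed=[slack] delivered=[alert-1]
    }
}
```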
Evidence
odd-platform-api/.../notification/NotificationSubscriberStarter.java:21-23: Executors.newSingleThreadExecutor(r -> new Thread(r, "notification-subscriber-thread")); Lines 30-35: @EventListener(ApplicationReadyEvent.class) submits the subscriber at startup.
odd-platform-api/.../notification/NotificationSubscriber.java:47: leaderElectionManager.acquire(walProperties.getAdvisoryLockId(), true) (blocking acquire) wraps the replication-stream loop; non-leaders block here.
odd-platform-api/.../leaderelection/PostgreSQLLeaderElectionManagerImpl.java:21-23: acquire(...) prepares and execute()s SELECT pg_advisory_lock(<id>), the blocking Postgres lock function (not the try_ variant), so the call returns only once the lock is held; the connection is then returned and kept open to hold the lock for the session.
odd-platform-api/src/main/resources/application.yml:177: advisory-lock-id: 100 under notifications.wal, the operator-tunable lock id.
See also
ADR-0020 — Outbound Slack delivery is decoupled via a Postgres queue — the same single-leader Postgres-advisory-lock mechanism, applied to outbound delivery (distinct lock id).
ADR-0042 — Notification fan-out is fail-soft per channel — keeps a bad channel from stalling this single consumer thread.
ADR-0044 — Postgres replication artefacts are lazy-created, never dropped — the slot and publication this consumer relies on.
Last updated