Alerting
ODD Platform watches each catalogued entity for failed jobs, failed data-quality tests, backwards-incompatible schema changes, and externally-injected distribution anomalies — and tracks every alert t

Whenever an issue arises with a catalogued entity — a failed job, a failed data quality test, a backwards-incompatible schema change, or an externally-injected distribution anomaly — the platform raises an alert visible in the navigation pane's Alerts section and on each affected entity's own Alerts tab.

Each alert carries the affected entity, the alert type, the triggering timestamp, status history, and a Resolve action; resolving an alert updates the same record in place.

The platform uses PostgreSQL logical replication to deliver alerts even when the alerting pipeline is briefly partitioned from the primary database — see the PostgreSQL Configuration operator guide for the database-side prerequisites.
Alert types
The platform tracks four alert types:
Failed job — a transformer entity's most recent run reported failure.
Failed data quality test — a quality-test entity's most recent run reported failure.
Backwards incompatible schema change — a producer dropped something a downstream consumer was relying on (see Backwards-incompatible schema change for the detection rules).
Distribution anomaly — anomalous distributions detected externally and pushed in via Prometheus AlertManager (see the inbound webhook description in Notifications).
Alerts originate from two sources: the platform's own ingestion / evaluation pipeline (for the first three types and any other internal triggers), and optionally from an external Prometheus AlertManager via the POST /ingestion/alert/alertmanager inbound webhook. AlertManager-routed alerts surface as Distribution Anomaly alerts using the entity_oddrn label to attribute them to the affected entity. The setup steps for the AlertManager integration live on Notifications → Prometheus AlertManager inbound webhook.
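To make the entity_oddrn attribution concrete, here is a minimal sketch of the kind of body Prometheus AlertManager sends to a webhook receiver. The field layout follows AlertManager's standard webhook format; the ODDRN value, the alert name, and the helper name build_alertmanager_payload are hypothetical, and which fields besides the entity_oddrn label the platform actually reads is an assumption.

```python
import json
from datetime import datetime, timezone


def build_alertmanager_payload(entity_oddrn: str, alert_name: str, summary: str) -> dict:
    """Build a minimal AlertManager-style webhook body for one firing alert."""
    labels = {
        "alertname": alert_name,
        # The label the platform uses to attribute the alert to an entity.
        "entity_oddrn": entity_oddrn,
    }
    return {
        "version": "4",
        "status": "firing",
        "commonLabels": dict(labels),
        "alerts": [
            {
                "status": "firing",
                "labels": dict(labels),
                "annotations": {"summary": summary},
                "startsAt": datetime.now(timezone.utc).isoformat(),
            }
        ],
    }


payload = build_alertmanager_payload(
    entity_oddrn="//postgresql/host/db.example/databases/shop/tables/orders",  # hypothetical ODDRN
    alert_name="OrdersRowCountDrift",  # hypothetical alert rule name
    summary="Row count distribution deviates from baseline",
)
print(json.dumps(payload, indent=2))
```

In a real deployment AlertManager produces this body itself; the sketch is only useful for checking that an alert rule attaches the entity_oddrn label before pointing a receiver at POST /ingestion/alert/alertmanager.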
How an alert leaves the platform — Slack, email, generic webhook, plus the AlertManager-driven inbound path — is its own subsystem; see Notifications.
Alert views — All, My Objects, Dependents
The Alerts section in the navigation pane has three tabs that scope the alert list to different slices of the platform. Pick the tab that matches how you're working the queue — on large deployments the default All view can run into the hundreds of open alerts, and the My Objects / Dependents tabs let individual owners cut the list down to what they're responsible for.
All
Shows: every open and resolved alert across the whole platform.
Use it for: platform-wide triage; stewards and admins watching the full alert surface.
My Objects
Shows: alerts raised on data entities where the signed-in user is a registered owner.
Use it for: the per-owner view, "what fires on the things I own". Requires user ↔ owner association.
Dependents
Shows: alerts raised on data entities that are downstream of entities the signed-in user owns (via lineage).
Use it for: the impact view, "what's breaking in systems that consume my data". Surfaces ripple effects before the downstream team pings you. Requires user ↔ owner association.

The endpoints behind the three tabs (getAllAlerts, getAssociatedUserAlerts, getDependentEntitiesAlerts) plus the badge-counter call (getAlertTotals) are documented at API Reference → Alerts → Global alert listings.
The My Objects and Dependents tabs are hidden unless the signed-in user is linked to an Owner — without the association, the platform cannot evaluate "mine" or "downstream of mine". An operator who sees only the All tab almost always has a missing user-owner link. This is a different My Objects from the Recommended → My Objects wizard on the main page — the wizard surfaces recently-ingested owned entities; this tab filters open alerts.
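The scoping behind the three tabs can be pictured with a small in-memory model. This is a sketch of the idea only, not the platform's query logic: the alert dictionaries, the lineage mapping, and the function names are hypothetical, and excluding the user's own entities from the Dependents result is an assumption.

```python
from collections import deque


def downstream_of(owned: set, lineage: dict) -> set:
    """Return entities reachable downstream of the owned set (owned ones excluded)."""
    seen = set()
    queue = deque(owned)
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen and child not in owned:
                seen.add(child)
                queue.append(child)
    return seen


def scope_alerts(alerts: list, tab: str, owned: set, lineage: dict) -> list:
    """Filter alerts the way the All / My Objects / Dependents tabs are described."""
    if tab == "all":
        return list(alerts)
    if tab == "my_objects":
        return [a for a in alerts if a["entity"] in owned]
    if tab == "dependents":
        downstream = downstream_of(owned, lineage)
        return [a for a in alerts if a["entity"] in downstream]
    raise ValueError(f"unknown tab: {tab}")


# Hypothetical lineage: raw -> clean -> report; "unrelated" has no link to the user.
lineage = {"raw": ["clean"], "clean": ["report"]}
owned = {"raw"}
alerts = [
    {"id": 1, "entity": "raw"},
    {"id": 2, "entity": "clean"},
    {"id": 3, "entity": "report"},
    {"id": 4, "entity": "unrelated"},
]

print([a["id"] for a in scope_alerts(alerts, "my_objects", owned, lineage)])  # [1]
print([a["id"] for a in scope_alerts(alerts, "dependents", owned, lineage)])  # [2, 3]
```

With an empty owned set both scoped tabs return nothing, which mirrors why the platform hides them when no user-owner link exists.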
Alert lifecycle: statuses, resolution, cleanup
Every alert is always in one of three statuses:
OPEN
Meaning: the alert is active. It shows on the entity's Alerts tab and counts toward the All / My / Dependents badges.
Set by: the platform, when the alert is created.
RESOLVED
Meaning: an operator marked the alert resolved by hand, typically after fixing the underlying issue or judging it a false positive.
Set by: an operator, via the Resolve action on the alert.
RESOLVED_AUTOMATICALLY
Meaning: the platform itself resolved the alert because the condition that fired it has cleared (see below). No operator acted.
Set by: the platform, on the next ingest that observes the cleared condition.
The two resolved statuses behave the same way in the listings — both clear the alert from open counts and stop notifications — but the distinction is preserved on the alert record and in the Activity Feed, so you can tell whether an alert was worked on or simply cleared itself. Auto-resolution events are recorded as system events on the feed; manual resolutions carry the operator's identity.
Auto-resolution triggers
Auto-resolution applies to Failed job and Failed data quality test alerts only. The other two alert types — Backwards incompatible schema change and Distribution anomaly — never auto-resolve and stay OPEN until an operator resolves them by hand.
The trigger is the next ingest that reports a successful run for the same task on the same entity:
If a job that previously failed succeeds on its next run, the open Failed job alert for that entity flips to RESOLVED_AUTOMATICALLY.
If a data-quality test that previously failed passes on its next run, the open Failed data quality test alert for that test flips to RESOLVED_AUTOMATICALLY.
A subsequent failure opens a new alert; auto-resolved alerts are not reopened.
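The auto-resolution rules can be summarised as a small state model. This is an illustrative sketch, not the platform's code: the Alert class, the alert-type strings, and on_run_result are hypothetical names, and the choice not to open a second alert while one of the same type is already open is an assumption.

```python
from dataclasses import dataclass

# Only these two types auto-resolve (type strings are hypothetical).
AUTO_RESOLVABLE = {"FAILED_JOB", "FAILED_DQ_TEST"}


@dataclass
class Alert:
    entity: str
    alert_type: str
    status: str = "OPEN"


def on_run_result(alerts: list, entity: str, alert_type: str, succeeded: bool) -> None:
    """Apply one ingested run result to the alert list."""
    open_alert = next(
        (a for a in alerts
         if a.entity == entity and a.alert_type == alert_type and a.status == "OPEN"),
        None,
    )
    if succeeded:
        # A successful run clears the open alert, but only for auto-resolvable types.
        if open_alert is not None and alert_type in AUTO_RESOLVABLE:
            open_alert.status = "RESOLVED_AUTOMATICALLY"
    elif open_alert is None:
        # A failure with no open alert creates a new one; old alerts are never reopened.
        alerts.append(Alert(entity, alert_type))


history = []
on_run_result(history, "nightly_load", "FAILED_JOB", succeeded=False)
on_run_result(history, "nightly_load", "FAILED_JOB", succeeded=True)
on_run_result(history, "nightly_load", "FAILED_JOB", succeeded=False)
print([a.status for a in history])  # ['RESOLVED_AUTOMATICALLY', 'OPEN']
```

The fail, succeed, fail sequence leaves two records: the first stays RESOLVED_AUTOMATICALLY and the second failure is tracked as a separate OPEN alert.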
Manual reopen has a guard. An operator can reopen a RESOLVED or RESOLVED_AUTOMATICALLY alert by sending its status back to OPEN (PUT /api/alerts/{alert_id}/status) — but only if there is no other open alert of the same type on the same data entity. The platform refuses the reopen with Cannot reopen alert since the system already has an open alert of the same type. Resolve or work the newer alert first, or leave the old one closed.
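The reopen guard amounts to a uniqueness check on (entity, alert type) among open alerts. A minimal sketch, assuming an in-memory list of alert dictionaries; ReopenRefused, reopen, and the field names are hypothetical, and only the error message text comes from the platform.

```python
class ReopenRefused(Exception):
    """Raised when another open alert of the same type exists on the entity."""


def reopen(alerts: list, alert_id: int) -> None:
    """Flip a resolved alert back to OPEN unless a same-type open alert exists."""
    target = next(a for a in alerts if a["id"] == alert_id)
    clash = any(
        a["id"] != alert_id
        and a["status"] == "OPEN"
        and a["type"] == target["type"]
        and a["entity"] == target["entity"]
        for a in alerts
    )
    if clash:
        raise ReopenRefused(
            "Cannot reopen alert since the system already has an open alert of the same type"
        )
    target["status"] = "OPEN"


alerts_db = [
    {"id": 1, "entity": "orders", "type": "FAILED_JOB", "status": "RESOLVED_AUTOMATICALLY"},
    {"id": 2, "entity": "orders", "type": "FAILED_JOB", "status": "OPEN"},
    {"id": 3, "entity": "customers", "type": "FAILED_JOB", "status": "RESOLVED"},
]

reopen(alerts_db, 3)  # allowed: no other open FAILED_JOB alert on "customers"
try:
    reopen(alerts_db, 1)  # refused: alert 2 is already open on "orders"
except ReopenRefused as err:
    print(err)
```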
Auto-cleanup of resolved alerts
Resolved alerts do not accumulate forever. The platform's housekeeping job permanently deletes auto-resolved (RESOLVED_AUTOMATICALLY) alerts whose status-update timestamp is older than housekeeping.ttl.resolved_alerts_days (default 30 days). The chunk records attached to each alert are deleted along with it. This is a hard delete, not a soft one, so once the window passes the alert is gone from the database. The retention window is intended to apply to manually resolved (RESOLVED) alerts as well, but a known platform bug currently deletes them on the next housekeeping run without applying the retention check (see the warning below).
To change the retention window, see Housekeeping Settings Configuration. Raise the value before auto-resolved alerts age out if you need a longer audit trail; once deleted, alerts cannot be recovered.
Manually resolved alerts are deleted on the next housekeeping run regardless of the retention window. Due to a SQL operator-precedence bug in the cleanup query — the WHERE clause is written as status = RESOLVED OR status = RESOLVED_AUTOMATICALLY AND status_updated_at <= cutoff and is parsed by Postgres as (status = RESOLVED) OR (status = RESOLVED_AUTOMATICALLY AND status_updated_at <= cutoff) — every RESOLVED row is selected for deletion regardless of how long ago it was resolved. The next housekeeping run after a manual resolve hard-deletes the alert and its chunks within minutes, well before housekeeping.ttl.resolved_alerts_days would expire. Raising the TTL does not help — manual RESOLVED rows bypass the retention check entirely, so the value is meaningless for them until the bug is fixed.
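The precedence problem is easy to reproduce in isolation. The sketch below uses SQLite from Python's standard library rather than Postgres; both give AND higher precedence than OR, so the grouping is the same. The table name, columns, and dates are hypothetical, and the parenthesised query shows the intended behaviour described above, not the platform's actual patch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alert (id INTEGER, status TEXT, status_updated_at TEXT)")
conn.executemany(
    "INSERT INTO alert VALUES (?, ?, ?)",
    [
        (1, "RESOLVED", "2024-06-29"),                # manual, still inside the window
        (2, "RESOLVED", "2024-01-01"),                # manual, past the window
        (3, "RESOLVED_AUTOMATICALLY", "2024-06-29"),  # auto, still inside the window
        (4, "RESOLVED_AUTOMATICALLY", "2024-01-01"),  # auto, past the window
        (5, "OPEN", "2024-01-01"),                    # never eligible
    ],
)
cutoff = "2024-06-01"

# As written: AND binds tighter, so every RESOLVED row matches regardless of age.
buggy = (
    "SELECT id FROM alert WHERE status = 'RESOLVED' "
    "OR status = 'RESOLVED_AUTOMATICALLY' AND status_updated_at <= ? ORDER BY id"
)
# Intended grouping: both resolved statuses are subject to the cutoff.
fixed = (
    "SELECT id FROM alert WHERE (status = 'RESOLVED' "
    "OR status = 'RESOLVED_AUTOMATICALLY') AND status_updated_at <= ? ORDER BY id"
)

buggy_ids = [row[0] for row in conn.execute(buggy, (cutoff,))]
fixed_ids = [row[0] for row in conn.execute(fixed, (cutoff,))]
print(buggy_ids)  # [1, 2, 4]
print(fixed_ids)  # [2, 4]
```

Row 1 is the casualty: it was resolved manually only recently, yet the unparenthesised query selects it for deletion.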
The only operator-side workaround until the platform fix lands is to export the alert audit data before manually resolving — GET /api/dataentities/{data_entity_id}/alerts returns the open and recently-resolved set including chunks and status history. Persist the response somewhere durable (object store, log pipeline, ticketing system) if the audit trail matters for compliance or postmortems; once the next housekeeping run fires, the row is gone from the platform database with no recovery path.
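A small export script is one way to apply the workaround. The sketch below is hedged throughout: only the endpoint path comes from this page; the base URL, output layout, and function names are hypothetical, and authentication and any paging parameters the endpoint may require are deliberately left out because they are not described here.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def alerts_url(base_url: str, data_entity_id: int) -> str:
    """Build the per-entity alert listing URL."""
    return f"{base_url.rstrip('/')}/api/dataentities/{data_entity_id}/alerts"


def _http_get(url: str) -> str:
    # Plain unauthenticated GET; add headers or cookies as your deployment requires.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def export_alerts(
    base_url: str,
    data_entity_id: int,
    out_dir: str,
    fetch: Optional[Callable[[str], str]] = None,
) -> Path:
    """Fetch an entity's alerts and persist the raw JSON before resolving them."""
    body = (fetch or _http_get)(alerts_url(base_url, data_entity_id))
    out_path = Path(out_dir) / f"alerts-{data_entity_id}.json"
    out_path.write_text(json.dumps(json.loads(body), indent=2), encoding="utf-8")
    return out_path


# Usage (hypothetical host and entity id):
# export_alerts("http://localhost:8080", 42, "/var/backups/odd-alerts")
```

Run the export first and resolve second; the order matters because the next housekeeping run removes the row.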
Once the platform fix lands, both states will respect the retention window symmetrically.
Backwards-incompatible schema change — what triggers it
Backwards incompatible schema change is the alert type that fires when a producer drops something a downstream consumer was relying on. The platform compares the previously-ingested version of an entity to the latest ingest and raises this alert whenever any of three classes of removal is detected. Adding fields, sources, targets, or inputs is not a backwards-incompatible change and does not trigger an alert. For the user-facing diff surface (where every column add / remove / type change is rendered on the dataset's Structure tab, alert or not), see Schema diff.
The three detection paths:
Dataset
Trigger: a field that existed in the previous version is no longer present in the latest version, or a field's data type has changed.
Detail: fields are compared by (oddrn, type). Removing a column, renaming it (the new name produces a different ODDRN), or changing its type all surface the alert with the message Missing field: {name}.
Transformer
Trigger: a source or target ODDRN that the transformer previously listed is no longer in its current source/target list.
Detail: reported as Missing source: {oddrn} or Missing target: {oddrn}.
Consumer
Trigger: an input ODDRN that the consumer previously listed is no longer in its current input list.
Detail: reported as Missing input: {oddrn}.
The first ingest of an entity never fires this alert. Detection requires a previous (penultimate) version to compare against — there is nothing to "remove" relative to a non-existent prior state. Operators wiring up a new pipeline will see this alert begin to fire only from the second ingest onward.
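The comparison rules can be sketched as set differences between the previous and latest versions. This is an illustration under stated assumptions, not the platform's implementation: the field dictionaries and function names are hypothetical, while the (oddrn, type) key and the message formats follow the descriptions above.

```python
from typing import Optional


def missing_fields(previous: Optional[list], latest: list) -> list:
    """Dataset path: report previous fields whose (oddrn, type) pair disappeared."""
    if previous is None:
        return []  # first ingest: nothing to compare against
    latest_keys = {(f["oddrn"], f["type"]) for f in latest}
    return [
        f"Missing field: {f['name']}"
        for f in previous
        if (f["oddrn"], f["type"]) not in latest_keys
    ]


def missing_refs(previous: Optional[list], latest: list, kind: str) -> list:
    """Transformer/consumer path: kind is 'source', 'target', or 'input'."""
    if previous is None:
        return []
    current = set(latest)
    return [f"Missing {kind}: {oddrn}" for oddrn in previous if oddrn not in current]


prev_fields = [
    {"oddrn": "//t/orders/id", "name": "id", "type": "INTEGER"},
    {"oddrn": "//t/orders/email", "name": "email", "type": "STRING"},
    {"oddrn": "//t/orders/amount", "name": "amount", "type": "INTEGER"},
]
new_fields = [
    {"oddrn": "//t/orders/id", "name": "id", "type": "INTEGER"},
    {"oddrn": "//t/orders/amount", "name": "amount", "type": "NUMBER"},  # type changed
    {"oddrn": "//t/orders/total", "name": "total", "type": "INTEGER"},   # added: ignored
]

print(missing_fields(prev_fields, new_fields))
# ['Missing field: email', 'Missing field: amount']
print(missing_refs(["//t/a", "//t/b"], ["//t/b"], "source"))
# ['Missing source: //t/a']
print(missing_fields(None, new_fields))
# []
```

Because the key includes the type, a type change looks like a removal of the old field plus an addition of a new one, which is why it is reported as a missing field.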
Backwards incompatible schema change alerts do not auto-resolve — once raised they stay OPEN until an operator resolves them by hand. See Alert lifecycle for the full lifecycle.
Halt notifications per entity
Alert traffic on a single noisy data entity — a flaky job, an unstable test, a frequently re-shaping dataset — can drown out the rest of the queue. ODD lets owners halt notifications on one entity at a time, scoped per alert type, for a fixed duration. The halt is a temporary mute; the underlying detection pipeline keeps running, and notifications resume automatically when the timer expires.
Halts are configured from the entity's Notification Settings button. Each of the four alert types is toggled independently, so you can mute Failed data quality test for a dataset that's actively being repaired while keeping Backwards incompatible schema change alerts firing for the same entity. During the halt window, each toggle suppresses the following:
Backwards incompatible schema change: new schema-drift alerts during the halt window.
Failed data quality test: new alerts from quality-test failures.
Failed job: new alerts from job-run failures.
Distribution anomaly: currently unenforced (see known limitation below). The toggle is exposed on the UI and persisted by the API, but the AlertManager webhook bypasses halt enforcement.
For each toggle, pick one of five durations:
Half an hour (30 minutes)
Hour (60 minutes)
3 hours
1 day
Week (7 days)
The platform stores the halt as a future timestamp per alert type; once that timestamp passes, the toggle re-enables on its own without operator action.

Halts suppress new alerts only — they do not silence auto-resolution. If an open Failed job alert is already firing on an entity and you halt that alert type, a subsequent successful run still flips the existing alert to RESOLVED_AUTOMATICALLY (see Alert lifecycle above). Halts stop the next new alert of that type from being created; they don't freeze alerts already in flight.
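The halt mechanics reduce to one future timestamp per alert type, consulted only when a new alert is about to be created. A minimal sketch, assuming hypothetical duration keys, alert-type strings, and function names (the real API field names are in the API reference); treating the halt as expired at exactly the stored instant is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# The five durations offered in the UI (keys are hypothetical).
HALT_DURATIONS = {
    "half_hour": timedelta(minutes=30),
    "hour": timedelta(hours=1),
    "3_hours": timedelta(hours=3),
    "day": timedelta(days=1),
    "week": timedelta(days=7),
}


def halt_until(choice: str, now: datetime) -> datetime:
    """Turn a duration choice into the future timestamp that is stored."""
    return now + HALT_DURATIONS[choice]


def should_create_alert(alert_type: str, halts: dict, now: datetime) -> bool:
    """New alerts are suppressed only while a halt timestamp lies in the future."""
    until = halts.get(alert_type)
    return until is None or now >= until


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
halts = {"FAILED_JOB": halt_until("3_hours", now)}

print(halts["FAILED_JOB"].isoformat())  # 2024-06-01T15:00:00+00:00
print(should_create_alert("FAILED_JOB", halts, now + timedelta(hours=1)))      # False
print(should_create_alert("FAILED_JOB", halts, now + timedelta(hours=4)))      # True
print(should_create_alert("FAILED_DQ_TEST", halts, now + timedelta(hours=1)))  # True
```

Nothing in this check touches existing alerts, which is why an open alert can still flip to RESOLVED_AUTOMATICALLY during a halt.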
The halt configuration is also exposed over the API — the getAlertConfig / updateAlertConfig endpoints, the four halt-timestamp field names, the ISO-8601 format requirement, the null-clears semantics, and the ALERT_HALT_CONFIG_UPDATED activity-feed event emission are all documented at API Reference → Alerts → Per-entity halt-notification configuration.
Distribution anomaly halt is currently unenforced
The Distribution anomaly halt toggle has no effect on alert creation. The toggle is exposed on the entity's Notification Settings UI and persisted by PUT /api/dataentities/{data_entity_id}/alert_config, but the AlertManager-driven path that creates Distribution Anomaly alerts (POST /ingestion/alert/alertmanager → the platform's external-alert handler) does not consult the halt config — new alerts continue to fire on a "halted" entity until the halt timer expires.
Until the platform fix lands, mute Distribution Anomaly noise at the Prometheus Alertmanager layer instead of relying on the per-entity halt — use a silences entry or a route matcher on the entity_oddrn label, or an inhibit_rules block in the Alertmanager configuration to suppress alerts while a parent condition is active.
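For the silence route, the request body for Alertmanager's v2 silences API (POST /api/v2/silences) can be built as below. The body shape follows Alertmanager's API; the ODDRN, author, and comment values are hypothetical, and sending the request (URL, authentication) is left out because it depends on your Alertmanager deployment.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_silence(
    entity_oddrn: str,
    hours: int,
    created_by: str,
    comment: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build an Alertmanager silence that matches one entity_oddrn label value."""
    start = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {
                "name": "entity_oddrn",
                "value": entity_oddrn,
                "isRegex": False,
                "isEqual": True,
            }
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    entity_oddrn="//postgresql/host/db.example/databases/shop/tables/orders",  # hypothetical
    hours=3,
    created_by="data-platform-oncall",  # hypothetical
    comment="Muting distribution anomalies during backfill",
    now=datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

A silence stops the alert from being forwarded by Alertmanager at all, so the platform never receives it and no Distribution Anomaly alert is created for the silenced entity.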
The other three halt types (Failed job, Failed data quality test, Backwards incompatible schema change) are unaffected — their halts are enforced on the ingestion-driven alert-creation path.
API surface
The platform's HTTP surface for alerts — the three list endpoints behind the All / My Objects / Dependents tabs, the getAlertTotals badge call, the per-entity alert listing, the manual status-flip endpoint, and the halt-configuration endpoints — is documented at API Reference → Alerts.
Where to next
For how alerts get out of the platform — Slack, email, generic webhook, the AlertManager inbound webhook → Notifications.
For the activity-feed events that record every alert state transition (OPEN_ALERT_RECEIVED, RESOLVED_ALERT_RECEIVED, ALERT_STATUS_UPDATED, ALERT_HALT_CONFIG_UPDATED) → Activity Feed.
For the operator-side configuration of the alert-creation pipeline (notifications.enabled, the PostgreSQL replication prerequisite, AlertManager setup) → Configure ODD Platform → Enable Alert Notifications.