# Alerting

![](/files/gzN28Ofa8Sxm8D1zMdw8)

Whenever an issue arises with a catalogued entity — a failed job, a failed data quality test, a backwards-incompatible schema change, or an externally-injected distribution anomaly — the platform raises an **alert** visible in the navigation pane's `Alerts` section and on each affected entity's own Alerts tab.

<figure><img src="/files/TxTuNLZlofwfl0Scz0om" alt="" height="320" width="700"><figcaption><p>Alerts section on the navigation pane</p></figcaption></figure>

Each alert carries the affected entity, the alert type, the triggering timestamp, status history, and a Resolve action; resolving an alert updates the same record in place.

<figure><img src="/files/bHvp4f7p8jFxRmS1LTjF" alt="" height="260" width="700"><figcaption><p>Alert notification details</p></figcaption></figure>

The platform uses PostgreSQL logical replication to deliver alerts even when the alerting pipeline is briefly partitioned from the primary database — see the [PostgreSQL Configuration](/configuration-and-deployment/odd-platform.md#postgresql-configuration) operator guide for the database-side prerequisites.

## Alert types

The platform tracks four alert types:

* **Failed job** — a transformer entity's most recent run reported failure.
* **Failed data quality test** — a quality-test entity's most recent run reported failure.
* **Backwards incompatible schema change** — a producer dropped something a downstream consumer was relying on (see [Backwards-incompatible schema change](#backwards-incompatible-schema-change-what-triggers-it) for the detection rules).
* **Distribution anomaly** — anomalous distributions detected externally and pushed in via Prometheus AlertManager (see the inbound webhook description in [Notifications](/features/active-platform-features/notifications.md)).

Alerts originate from two sources: the platform's own ingestion / evaluation pipeline (for the first three types and any other internal triggers), and optionally from an external [Prometheus AlertManager](https://prometheus.io/docs/alerting/latest/alertmanager/) via the `POST /ingestion/alert/alertmanager` inbound webhook. AlertManager-routed alerts surface as **Distribution Anomaly** alerts using the `entity_oddrn` label to attribute them to the affected entity. The setup steps for the AlertManager integration live on [Notifications → Prometheus AlertManager inbound webhook](/features/active-platform-features/notifications.md#prometheus-alertmanager-inbound-webhook).
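As a rough illustration of the inbound path, the sketch below builds a request body in the general shape of Alertmanager's webhook format (a subset of its fields), carrying the `entity_oddrn` label. The ODDRN value, the receiver name, and the alert name are invented for the example. Which payload fields the platform reads beyond `entity_oddrn` is not stated on this page, so treat the rest as assumptions.

```python
import json
from datetime import datetime, timezone


def build_alertmanager_payload(entity_oddrn: str, alert_name: str, summary: str) -> dict:
    """Build a body shaped like an Alertmanager webhook notification (subset of fields)."""
    labels = {"alertname": alert_name, "entity_oddrn": entity_oddrn}
    annotations = {"summary": summary}
    return {
        "version": "4",
        "status": "firing",
        "receiver": "odd-platform",  # hypothetical receiver name
        "groupLabels": {"alertname": alert_name},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": annotations,
                "startsAt": datetime.now(timezone.utc).isoformat(),
            }
        ],
    }


payload = build_alertmanager_payload(
    # Hypothetical ODDRN, for illustration only.
    entity_oddrn="//postgresql/host/example/databases/shop/tables/orders",
    alert_name="OrdersDistributionDrift",  # hypothetical alert name
    summary="Row-count distribution drifted outside the expected band",
)
print(json.dumps(payload, indent=2))
```

In a real deployment Alertmanager produces this body itself; the sketch only shows where the `entity_oddrn` label sits so that the alert rule can be written to emit it.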

How an alert leaves the platform — Slack, email, generic webhook, plus the AlertManager-driven inbound path — is its own subsystem; see [Notifications](/features/active-platform-features/notifications.md).

## Alert views — All, My Objects, Dependents

The `Alerts` section in the navigation pane has three tabs that scope the alert list to different slices of the platform. Pick the tab that matches how you're working the queue — on large deployments the default `All` view can run into the hundreds of open alerts, and the `My Objects` / `Dependents` tabs let individual owners cut the list down to what they're responsible for.

| Tab            | Scope                                                                                                 | When to use                                                                                                                                                       |
| -------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **All**        | Every open and resolved alert across the whole platform.                                              | Platform-wide triage; stewards and admins watching the full alert surface.                                                                                        |
| **My Objects** | Alerts raised on data entities where the signed-in user is a registered owner.                        | Per-owner view — "what fires on the things I own". Requires user ↔ owner association.                                                                             |
| **Dependents** | Alerts raised on data entities that are downstream of entities the signed-in user owns (via lineage). | Impact view — "what's breaking in systems that consume my data". Surfaces ripple effects before the downstream team pings you. Requires user ↔ owner association. |

![Alerts → All view on a populated deployment — 230 open alerts, each row carrying the affected entity name (dq\_test\_for\_\*), the entity-class chip (QT for Quality Test), the failure category ("Failed DQ test"), the timestamp + Show history link, and per-row Open / Resolve actions. The tab strip at the top scopes the list to All / My Objects / Dependents.](/files/pCDgRwqBh2qXfCLy9TYH)

The endpoints behind the three tabs (`getAllAlerts`, `getAssociatedUserAlerts`, `getDependentEntitiesAlerts`) plus the badge-counter call (`getAlertTotals`) are documented at [API Reference → Alerts → Global alert listings](/developer-guides/api-reference/alerts.md).

{% hint style="info" %}
The `My Objects` and `Dependents` tabs are hidden unless the signed-in user is linked to an [Owner](/configuration-and-deployment/enable-security/authorization/user-owner-association.md) — without the association, the platform cannot evaluate "mine" or "downstream of mine". An operator who sees only the `All` tab almost always has a missing user-owner link. This is a different `My Objects` from the **Recommended → My Objects** wizard on the main page — the wizard surfaces recently-ingested owned entities; this tab filters open alerts.
{% endhint %}
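The three scopes can be modelled with a small in-memory sketch. This is an illustration of the idea, not the platform's query logic: the `Alert` class, the entity names, and the lineage map are hypothetical, and two behaviours are assumed rather than documented here (downstream is followed transitively, and entities the user owns are excluded from the Dependents result).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: int
    entity: str
    alert_type: str


def my_objects(alerts, owned):
    """Alerts raised on entities the user owns."""
    return [a for a in alerts if a.entity in owned]


def downstream_of(owned, lineage):
    """Every entity reachable downstream of the owned set (assumed transitive)."""
    seen = set()
    stack = list(owned)
    while stack:
        node = stack.pop()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def dependents(alerts, owned, lineage):
    """Alerts raised on entities downstream of the owned set (owned ones excluded, by assumption)."""
    reachable = downstream_of(owned, lineage) - set(owned)
    return [a for a in alerts if a.entity in reachable]


# Hypothetical lineage: upstream entity -> entities that consume it.
lineage = {"orders": ["orders_daily"], "orders_daily": ["revenue_dashboard"]}
owned = {"orders"}
alerts = [
    Alert(1, "orders", "Failed job"),
    Alert(2, "revenue_dashboard", "Failed data quality test"),
    Alert(3, "customers", "Failed job"),
]

print([a.alert_id for a in alerts])                              # All
print([a.alert_id for a in my_objects(alerts, owned)])           # My Objects
print([a.alert_id for a in dependents(alerts, owned, lineage)])  # Dependents
```

With an empty `owned` set both scoped lists come back empty, which mirrors why the two tabs depend on the user-owner association.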

## Alert lifecycle: statuses, resolution, cleanup

An alert is always in one of three statuses:

| Status                       | What it means                                                                                                               | Who sets it                                                           |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **`OPEN`**                   | The alert is active. It shows on the entity's Alerts tab and counts toward the **All / My Objects / Dependents** badges.    | The platform, when the alert is created.                              |
| **`RESOLVED`**               | An operator marked the alert resolved by hand — typically after fixing the underlying issue or judging it a false positive. | An operator, via the `Resolve` action on the alert.                   |
| **`RESOLVED_AUTOMATICALLY`** | The platform itself resolved the alert because the condition that fired it has cleared (see below). No operator acted.      | The platform, on the next ingest that observes the cleared condition. |

The two resolved statuses behave the same way in the listings — both clear the alert from open counts and stop notifications — but the distinction is preserved on the alert record and in the [Activity Feed](/features/active-platform-features/activity-feed.md), so you can tell whether an alert was worked on or simply cleared itself. Auto-resolution events are recorded as system events on the feed; manual resolutions carry the operator's identity.

### Auto-resolution triggers

Auto-resolution applies to **`Failed job`** and **`Failed data quality test`** alerts only. The other two alert types — `Backwards incompatible schema change` and `Distribution anomaly` — never auto-resolve and stay `OPEN` until an operator resolves them by hand.

The trigger is the next ingest that reports a **successful** run for the same task on the same entity:

* If a job that previously failed succeeds on its next run, the open `Failed job` alert for that entity flips to `RESOLVED_AUTOMATICALLY`.
* If a data-quality test that previously failed passes on its next run, the open `Failed data quality test` alert for that test flips to `RESOLVED_AUTOMATICALLY`.

A subsequent failure opens a **new** alert; auto-resolved alerts are not reopened.
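The rules above can be sketched as a tiny state model. The class and method names are hypothetical, and one behaviour is an assumption: a repeated failure while an alert of the same type is already `OPEN` does not create a second open alert.

```python
from dataclasses import dataclass, field

AUTO_RESOLVABLE = {"Failed job", "Failed data quality test"}


@dataclass
class Alert:
    entity: str
    alert_type: str
    status: str = "OPEN"


@dataclass
class AlertStore:
    alerts: list = field(default_factory=list)

    def open_alert(self, entity, alert_type):
        """Return the OPEN alert of this type on this entity, if any."""
        for alert in self.alerts:
            if (alert.entity, alert.alert_type, alert.status) == (entity, alert_type, "OPEN"):
                return alert
        return None

    def on_run(self, entity, alert_type, succeeded):
        """Apply one ingested run result for an auto-resolvable alert type."""
        if alert_type not in AUTO_RESOLVABLE:
            raise ValueError(f"{alert_type} never auto-resolves")
        current = self.open_alert(entity, alert_type)
        if succeeded:
            if current is not None:
                current.status = "RESOLVED_AUTOMATICALLY"
        elif current is None:
            # A failure after resolution opens a NEW alert; the old one is not reopened.
            self.alerts.append(Alert(entity, alert_type))


store = AlertStore()
store.on_run("nightly_load", "Failed job", succeeded=False)  # opens alert 1
store.on_run("nightly_load", "Failed job", succeeded=True)   # auto-resolves alert 1
store.on_run("nightly_load", "Failed job", succeeded=False)  # opens alert 2
print([a.status for a in store.alerts])
```

The final list holds two records, the first `RESOLVED_AUTOMATICALLY` and the second `OPEN`, which is the "new alert, not a reopen" behaviour described above.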

{% hint style="info" %}
**Manual reopen has a guard.** An operator can reopen a `RESOLVED` or `RESOLVED_AUTOMATICALLY` alert by sending its status back to `OPEN` (`PUT /api/alerts/{alert_id}/status`), but only if there is **no other open alert of the same type** on the same data entity. Otherwise the platform refuses the reopen with `Cannot reopen alert since the system already has an open alert of the same type`. Resolve or work the newer alert first, or leave the old one closed.
{% endhint %}
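A minimal sketch of the reopen guard, assuming alerts are plain dictionaries with hypothetical `entity`, `type`, and `status` keys (the platform's real record layout is not described on this page):

```python
def can_reopen(target, alerts):
    """True when `target` is resolved and no other OPEN alert shares its entity and type."""
    if target["status"] not in ("RESOLVED", "RESOLVED_AUTOMATICALLY"):
        return False
    return not any(
        other is not target
        and other["entity"] == target["entity"]
        and other["type"] == target["type"]
        and other["status"] == "OPEN"
        for other in alerts
    )


old = {"entity": "nightly_load", "type": "Failed job", "status": "RESOLVED"}
newer = {"entity": "nightly_load", "type": "Failed job", "status": "OPEN"}

print(can_reopen(old, [old, newer]))  # False: a newer open alert of the same type exists
print(can_reopen(old, [old]))         # True: nothing else is open for this entity and type
```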

### Auto-cleanup of resolved alerts

Resolved alerts do not accumulate forever. The platform's housekeeping job **permanently deletes** auto-resolved (`RESOLVED_AUTOMATICALLY`) alerts whose status-update timestamp is older than `housekeeping.ttl.resolved_alerts_days` (default `30` days). The chunk records attached to each alert are deleted along with it. This is a hard delete, not a soft one, so once the window passes the alert is gone from the database. The retention window is intended to apply to **manually resolved (`RESOLVED`) alerts** as well, but a known platform bug currently deletes them on the next housekeeping run without applying the retention check (see the warning below).

To change the retention window, see [Housekeeping Settings Configuration](/configuration-and-deployment/odd-platform.md#housekeeping-settings-configuration). Raise the value before auto-resolved alerts age out if you need a longer audit trail; once deleted, alerts cannot be recovered.

{% hint style="warning" %}
**Manually resolved alerts are deleted on the next housekeeping run regardless of the retention window.** Due to a SQL operator-precedence bug in the cleanup query — the WHERE clause is written as `status = RESOLVED OR status = RESOLVED_AUTOMATICALLY AND status_updated_at <= cutoff` and is parsed by Postgres as `(status = RESOLVED) OR (status = RESOLVED_AUTOMATICALLY AND status_updated_at <= cutoff)` — every `RESOLVED` row is selected for deletion regardless of how long ago it was resolved. The next housekeeping run after a manual resolve hard-deletes the alert and its chunks within minutes, well before `housekeeping.ttl.resolved_alerts_days` would expire. Raising the TTL does **not** help — manual `RESOLVED` rows bypass the retention check entirely, so the value is meaningless for them until the bug is fixed.

The only operator-side workaround until the platform fix lands is to **export the alert audit data before manually resolving** — `GET /api/dataentities/{data_entity_id}/alerts` returns the open and recently-resolved set including chunks and status history. Persist the response somewhere durable (object store, log pipeline, ticketing system) if the audit trail matters for compliance or postmortems; once the next housekeeping run fires, the row is gone from the platform database with no recovery path.

Once the platform fix lands, both states will respect the retention window symmetrically.
{% endhint %}
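The precedence problem can be reproduced outside the database. Python's `and` binds tighter than `or`, the same relationship SQL has between `AND` and `OR`, so the sketch below evaluates both forms of the predicate over a handful of invented rows. The row data and the fixed "now" are hypothetical, and the parenthesised form is only an assumption about what the intended query looks like.

```python
from datetime import datetime, timedelta, timezone

now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed "now" for a reproducible example
cutoff = now - timedelta(days=30)                # default retention window

rows = [
    {"id": 1, "status": "RESOLVED", "status_updated_at": now - timedelta(minutes=5)},
    {"id": 2, "status": "RESOLVED", "status_updated_at": now - timedelta(days=45)},
    {"id": 3, "status": "RESOLVED_AUTOMATICALLY", "status_updated_at": now - timedelta(days=5)},
    {"id": 4, "status": "RESOLVED_AUTOMATICALLY", "status_updated_at": now - timedelta(days=45)},
    {"id": 5, "status": "OPEN", "status_updated_at": now - timedelta(days=90)},
]


def as_written(row):
    # Parsed as: RESOLVED or (RESOLVED_AUTOMATICALLY and older than cutoff)
    return (
        row["status"] == "RESOLVED"
        or row["status"] == "RESOLVED_AUTOMATICALLY"
        and row["status_updated_at"] <= cutoff
    )


def as_intended(row):
    # Assumed fix: (RESOLVED or RESOLVED_AUTOMATICALLY) and older than cutoff
    return (
        row["status"] in ("RESOLVED", "RESOLVED_AUTOMATICALLY")
        and row["status_updated_at"] <= cutoff
    )


buggy = [r["id"] for r in rows if as_written(r)]
intended = [r["id"] for r in rows if as_intended(r)]
print(buggy)     # [1, 2, 4]
print(intended)  # [2, 4]
```

Row 1, resolved by hand five minutes earlier, is selected only by the form as written, which is the early deletion the warning describes.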

## Backwards-incompatible schema change — what triggers it

`Backwards incompatible schema change` is the alert type that fires when a producer drops something a downstream consumer was relying on. The platform compares the **previously-ingested** version of an entity to the **latest** ingest and raises this alert whenever any of three classes of removal is detected. **Adding** fields, sources, targets, or inputs is not a backwards-incompatible change and does not trigger an alert. For the user-facing diff surface (where every column add / remove / type change is rendered on the dataset's Structure tab, alert or not), see [Schema diff](/features/data-discovery/schema-diff.md).

The three detection paths:

| Entity class    | What triggers an alert                                                                                                           | Detail                                                                                                                                                                                                 |
| --------------- | -------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Dataset**     | A field that existed in the previous version is no longer present in the latest version, **or** a field's data type has changed. | Fields are compared by `(oddrn, type)`. Removing a column, renaming it (the new name produces a different ODDRN), or changing its type all surface the alert with the message `Missing field: {name}`. |
| **Transformer** | A source or target ODDRN that the transformer previously listed is no longer in its current source/target list.                  | Reported as `Missing source: {oddrn}` or `Missing target: {oddrn}`.                                                                                                                                    |
| **Consumer**    | An input ODDRN that the consumer previously listed is no longer in its current input list.                                       | Reported as `Missing input: {oddrn}`.                                                                                                                                                                  |

{% hint style="info" %}
**The first ingest of an entity never fires this alert.** Detection requires a previous (penultimate) version to compare against — there is nothing to "remove" relative to a non-existent prior state. Operators wiring up a new pipeline will see this alert begin to fire only from the second ingest onward.
{% endhint %}
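The detection rules amount to a set difference between the previous and the latest version. The sketch below is one way to express that, not the platform's code: the function names, the tuple layout, and the sample ODDRNs are hypothetical.

```python
def missing_dataset_fields(previous, latest):
    """Compare dataset fields by (oddrn, type); inputs are (oddrn, type, name) tuples.

    `previous` is None on the first ingest, in which case nothing is reported.
    """
    if previous is None:
        return []
    latest_keys = {(oddrn, ftype) for oddrn, ftype, _ in latest}
    return [
        f"Missing field: {name}"
        for oddrn, ftype, name in previous
        if (oddrn, ftype) not in latest_keys
    ]


def missing_oddrns(label, previous, latest):
    """Report ODDRNs listed previously but absent now; label is 'source', 'target' or 'input'."""
    if previous is None:
        return []
    latest_set = set(latest)
    return [f"Missing {label}: {oddrn}" for oddrn in previous if oddrn not in latest_set]


previous_fields = [
    ("//demo/orders/id", "INTEGER", "id"),
    ("//demo/orders/total", "NUMBER", "total"),
    ("//demo/orders/note", "STRING", "note"),
]
latest_fields = [
    ("//demo/orders/id", "INTEGER", "id"),
    ("//demo/orders/total", "STRING", "total"),      # type changed
    ("//demo/orders/comment", "STRING", "comment"),  # "note" renamed, so a new ODDRN
]

print(missing_dataset_fields(previous_fields, latest_fields))
# ['Missing field: total', 'Missing field: note']
print(missing_dataset_fields(None, latest_fields))  # first ingest: []
print(missing_oddrns("source", ["//demo/a", "//demo/b"], ["//demo/a", "//demo/c"]))
# ['Missing source: //demo/b']
```

The added `comment` field and the added `//demo/c` source produce no message, matching the rule that additions are not backwards-incompatible.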

`Backwards incompatible schema change` alerts do **not** auto-resolve — once raised they stay `OPEN` until an operator resolves them by hand. See [Alert lifecycle](#alert-lifecycle-statuses-resolution-cleanup) for the full lifecycle.

## Halt notifications per entity

Alert traffic on a single noisy data entity — a flaky job, an unstable test, a frequently re-shaping dataset — can drown out the rest of the queue. ODD lets owners **halt** notifications on one entity at a time, scoped per alert type, for a fixed duration. The halt is a temporary mute; the underlying detection pipeline keeps running, and notifications resume automatically when the timer expires.

Halts are configured from the entity's `Notification Settings` button. Each of the four alert types is toggled independently, so you can mute "Failed data quality test" for a dataset that's actively being repaired while keeping "Backwards incompatible schema change" alerts firing for the same entity:

| Alert type toggle                        | What it suppresses                                                                                                                                                                                                               |
| ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Backwards incompatible schema change** | New schema-drift alerts during the halt window.                                                                                                                                                                                  |
| **Failed data quality test**             | New alerts from quality-test failures.                                                                                                                                                                                           |
| **Failed job**                           | New alerts from job-run failures.                                                                                                                                                                                                |
| **Distribution anomaly**                 | *Currently unenforced — see* [*known limitation*](#distribution-anomaly-halt-is-currently-unenforced) *below. The toggle is exposed on the UI and persisted by the API, but the AlertManager webhook bypasses halt enforcement.* |

For each toggle, pick one of five durations:

* **Half an hour** (30 minutes)
* **Hour** (60 minutes)
* **3 hours**
* **1 day**
* **Week** (7 days)

The platform stores the halt as a future timestamp per alert type; once that timestamp passes, the toggle re-enables on its own without operator action.
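A small sketch of the "future timestamp" model, assuming UTC timestamps and using hypothetical helper names (the real field names are listed in the API reference, not here):

```python
from datetime import datetime, timedelta, timezone

# The five durations offered in the UI.
HALT_DURATIONS = {
    "Half an hour": timedelta(minutes=30),
    "Hour": timedelta(hours=1),
    "3 hours": timedelta(hours=3),
    "1 day": timedelta(days=1),
    "Week": timedelta(days=7),
}


def halt_until(choice, now):
    """Timestamp at which a halt chosen at `now` expires."""
    return now + HALT_DURATIONS[choice]


def is_halted(until, now):
    """A halt is active only while its stored timestamp is still in the future."""
    return until is not None and now < until


start = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
until = halt_until("3 hours", start)

print(until.isoformat())                             # 2024-06-01T15:00:00+00:00
print(is_halted(until, start + timedelta(hours=2)))  # True: still inside the window
print(is_halted(until, start + timedelta(hours=4)))  # False: expired with no operator action
print(is_halted(None, start))                        # False: no halt configured
```

Because expiry is just a comparison against the stored timestamp, nothing has to run at the moment the halt ends.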

<figure><img src="/files/13KzgYnwDUaoNB6ApvbZ" alt="" height="372" width="700"><figcaption><p>Turning specific notifications off</p></figcaption></figure>

{% hint style="info" %}
**Halts suppress new alerts only — they do not silence auto-resolution.** If an open `Failed job` alert is already firing on an entity and you halt that alert type, a subsequent successful run still flips the existing alert to `RESOLVED_AUTOMATICALLY` (see [Alert lifecycle](#alert-lifecycle-statuses-resolution-cleanup) above). Halts stop the **next new alert** of that type from being created; they don't freeze alerts already in flight.
{% endhint %}

The halt configuration is also exposed over the API — the `getAlertConfig` / `updateAlertConfig` endpoints, the four halt-timestamp field names, the ISO-8601 format requirement, the `null`-clears semantics, and the `ALERT_HALT_CONFIG_UPDATED` activity-feed event emission are all documented at [API Reference → Alerts → Per-entity halt-notification configuration](/developer-guides/api-reference/alerts.md).

### Distribution anomaly halt is currently unenforced

{% hint style="warning" %}
**The Distribution anomaly halt toggle has no effect on alert creation.** The toggle is exposed on the entity's `Notification Settings` UI and persisted by `PUT /api/dataentities/{data_entity_id}/alert_config`, but the AlertManager-driven path that creates Distribution Anomaly alerts (`POST /ingestion/alert/alertmanager` → the platform's external-alert handler) does not consult the halt config, so new alerts continue to fire on a "halted" entity for the whole halt window.

Until the platform fix lands, mute Distribution Anomaly noise at the **Prometheus Alertmanager** layer instead of relying on the per-entity halt. Create a silence that matches the `entity_oddrn` label (through the Alertmanager UI, its API, or `amtool`), add a `route` matcher on that label in the Alertmanager configuration, or use an `inhibit_rules` block there to suppress alerts while a parent condition is active.

The other three halt types (`Failed job`, `Failed data quality test`, `Backwards incompatible schema change`) are unaffected — their halts are enforced on the ingestion-driven alert-creation path.
{% endhint %}
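For the silence option, the sketch below builds a request body of the kind Alertmanager's v2 API accepts on `POST /api/v2/silences`. It only constructs the payload and does not send it. The ODDRN, author, and comment are invented for the example, and the exact matcher that suits a given deployment is an assumption.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(entity_oddrn, hours, created_by, comment, now=None):
    """Build a silence body that mutes alerts carrying the given entity_oddrn label."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {
                "name": "entity_oddrn",
                "value": entity_oddrn,
                "isRegex": False,  # exact match on the label value
                "isEqual": True,
            }
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    entity_oddrn="//postgresql/host/example/databases/shop/tables/orders",  # hypothetical ODDRN
    hours=3,
    created_by="data-owner@example.com",  # hypothetical author
    comment="Muting distribution anomalies while the backfill runs",
    now=datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

A silence expires at `endsAt` on its own, which gives roughly the same timed-mute behaviour the per-entity halt is meant to provide.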

## API surface

The platform's HTTP surface for alerts — the three list endpoints behind the **All / My Objects / Dependents** tabs, the `getAlertTotals` badge call, the per-entity alert listing, the manual status-flip endpoint, and the halt-configuration endpoints — is documented at [API Reference → Alerts](/developer-guides/api-reference/alerts.md).

## Where to next

* For how alerts get out of the platform — Slack, email, generic webhook, the AlertManager inbound webhook → [Notifications](/features/active-platform-features/notifications.md).
* For the activity-feed events that record every alert state transition (`OPEN_ALERT_RECEIVED`, `RESOLVED_ALERT_RECEIVED`, `ALERT_STATUS_UPDATED`, `ALERT_HALT_CONFIG_UPDATED`) → [Activity Feed](/features/active-platform-features/activity-feed.md).
* For the operator-side configuration of the alert-creation pipeline — `notifications.enabled`, the PostgreSQL replication prerequisite, AlertManager setup → [Configure ODD Platform → Enable Alert Notifications](/configuration-and-deployment/odd-platform.md#enable-alert-notifications).


