# Dataset schema diff

When metadata is re-ingested for a dataset, the platform compares the new revision to the previous one and surfaces every change (added, removed, or renamed columns, and type changes) on the dataset's **Structure** page. Operators get a visual diff for every revision, and the platform raises a [Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it) alert whenever a column is removed or its type changes, since either can break downstream consumers.

This page covers the user-facing diff surface. The underlying alert mechanism (what triggers it, its lifecycle, and the per-entity halt configuration) is documented on the [Alerting](/features/active-platform-features/alerting.md) page.

## Where to find it

Open any dataset (`Table`, `File`, `View`, `Vector Store`, ...) and navigate to the **Structure** tab on the entity's detail page.

<figure><img src="/files/TQ7yJ3HBM1cKGXpSgS0o" alt=""><figcaption><p>Dataset structure in the ODD UI</p></figcaption></figure>

The Structure tab carries:

* **Fields** — every column the dataset exposes.
* **Data types** per field.
* **Statistics** per field (when ingested by an adapter that emits them — see [Test Results Import](/features/data-quality/test-results-import.md) for the per-adapter coverage).

## Revision history

Every re-ingest of a dataset that **changes the structure** creates a new **revision**. Adding a column, deleting a column, renaming a column, or changing a column's data type all bump the revision counter. Same-structure re-ingests do not create a new revision (the platform compares structure by `(field_oddrn, type)` to decide).
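The revision rule above can be illustrated with a minimal Python sketch. It assumes a structure can be reduced to a set of `(field_oddrn, type)` pairs and that field order does not matter; the `Field` record, the function names, and the ODDRN strings are hypothetical and are not taken from the platform's code.

```python
from typing import Iterable, NamedTuple


class Field(NamedTuple):
    # Hypothetical, simplified field record; real fields carry more metadata.
    oddrn: str
    type: str


def structure_signature(fields: Iterable[Field]) -> frozenset:
    """Reduce a structure to its set of (field_oddrn, type) pairs."""
    return frozenset((f.oddrn, f.type) for f in fields)


def needs_new_revision(previous: Iterable[Field], incoming: Iterable[Field]) -> bool:
    """Return True when the incoming structure differs from the previous one."""
    return structure_signature(previous) != structure_signature(incoming)


# Illustrative ODDRNs only; real ODDRNs follow the adapter's own scheme.
v1 = [Field("//orders/id", "INTEGER"), Field("//orders/total", "NUMBER")]
same = [Field("//orders/total", "NUMBER"), Field("//orders/id", "INTEGER")]
retyped = [Field("//orders/id", "STRING"), Field("//orders/total", "NUMBER")]

print(needs_new_revision(v1, same))     # False: same pairs, different order
print(needs_new_revision(v1, retyped))  # True: one field changed type
```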

The revision history is browsable per dataset: pick any two revisions to see exactly what changed between them.

<figure><img src="/files/jerumYZg9ynGN0QgJfUZ" alt=""><figcaption><p>Dataset revisions in the ODD UI</p></figcaption></figure>

Two illustrative diffs the platform surfaces:

<figure><img src="/files/oYjeYJ5Pv06xfkF84WaH" alt=""><figcaption><p>Column was added</p></figcaption></figure>

<figure><img src="/files/IfAwQr2ZSDKmWpkDHnzz" alt=""><figcaption><p>Column was removed</p></figcaption></figure>

The diff also captures **data-type changes** and **column renames** (detected as a type change and an ODDRN change between revisions, respectively).
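As a rough illustration of how two revisions might be compared, the sketch below classifies fields as added, removed, or type-changed. It assumes each revision can be represented as a mapping from field ODDRN to type; `StructureDiff`, `diff_structures`, and the ODDRN strings are hypothetical names. In this simplified model a rename shows up as one removed ODDRN plus one added ODDRN.

```python
from dataclasses import dataclass, field


@dataclass
class StructureDiff:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    type_changed: list = field(default_factory=list)  # (oddrn, old_type, new_type)


def diff_structures(old: dict, new: dict) -> StructureDiff:
    """Compare two {field_oddrn: type} mappings, one per revision."""
    diff = StructureDiff()
    for oddrn in sorted(old.keys() | new.keys()):
        if oddrn not in old:
            diff.added.append(oddrn)
        elif oddrn not in new:
            diff.removed.append(oddrn)
        elif old[oddrn] != new[oddrn]:
            diff.type_changed.append((oddrn, old[oddrn], new[oddrn]))
    return diff


# Illustrative revisions: "name" is renamed to "full_name", "total" is retyped.
rev1 = {"//orders/id": "INTEGER", "//orders/name": "STRING", "//orders/total": "INTEGER"}
rev2 = {"//orders/id": "INTEGER", "//orders/full_name": "STRING", "//orders/total": "NUMBER"}

result = diff_structures(rev1, rev2)
print(result.added)         # ['//orders/full_name']
print(result.removed)       # ['//orders/name']
print(result.type_changed)  # [('//orders/total', 'INTEGER', 'NUMBER')]
```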

## Backwards-incompatible alerts

When the comparison surfaces a **removal** of a previously-present column, or a **type change** on an existing column, the platform additionally raises a [Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it) alert against the dataset. The alert lands on the entity's Alerts tab and the platform-wide [Alerts](/features/active-platform-features/alerting.md) section, and is **not** auto-resolved: an operator must resolve it manually.

<figure><img src="/files/EJKXqLhasjgS1j29Im1V" alt=""><figcaption><p>Schema-change alerts in the ODD UI</p></figcaption></figure>

This separation is intentional:

* **The diff** is a discovery surface — every operator looking at the dataset sees what changed across revisions.
* **The alert** is the action surface — the platform proactively flags the case where the change breaks a downstream consumer.

For the full alert rule (what counts as backwards-incompatible per entity class — Datasets, Transformers, Consumers — and the first-ingest-no-alert exception), see [Alerting → Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it).
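For datasets, the rule described in this section (removal or type change triggers the alert, additions do not, and a first ingest never alerts) can be sketched as a small predicate. This is an illustrative model only: the function name and the `{field_oddrn: type}` representation are assumptions, and the authoritative rule is the one on the Alerting page.

```python
from typing import Mapping, Optional


def is_backwards_incompatible(
    previous: Optional[Mapping[str, str]],
    current: Mapping[str, str],
) -> bool:
    """Return True when a field was removed or changed type between revisions.

    `previous` is None on the first ingest, where there is nothing to compare.
    """
    if previous is None:
        return False  # first-ingest exception: no alert
    for oddrn, old_type in previous.items():
        if oddrn not in current:
            return True  # a previously-present field was removed
        if current[oddrn] != old_type:
            return True  # an existing field changed type
    return False  # only additions, or no change at all


base = {"//orders/id": "INTEGER", "//orders/total": "NUMBER"}

print(is_backwards_incompatible(None, base))                                  # False
print(is_backwards_incompatible(base, {**base, "//orders/note": "STRING"}))   # False
print(is_backwards_incompatible(base, {"//orders/id": "INTEGER"}))            # True
print(is_backwards_incompatible(base, {**base, "//orders/total": "STRING"}))  # True
```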

## Known limitations and operator caveats

The dataset-version read and diff endpoints behind this page carry a few behaviours that are non-obvious from the UI alone. Each item below states what an operator might assume, what the platform actually does, and what to do today.

{% hint style="danger" %}
**Dataset-version reads (including the diff endpoint) are not scoped to their dataset: any authenticated caller who can guess a `version_id` can read any dataset's schema.** The `GET /api/datasets/{data_entity_id}/structure/{version_id}` endpoint, the diff endpoint at `/api/datasets/{data_entity_id}/structure/diff`, and the "latest version" read all accept the path component `{data_entity_id}` in the controller signature, but the underlying repository query filters **only** on `dataset_version.id`; no `data_entity_id` predicate is added. Calling `GET /api/datasets/9999/structure/{any-real-version-id}` returns the schema of whichever dataset owns that `version_id`, regardless of which dataset id you put in the path.

The UI surface amplifies the leak. The catalog's compare viewer (`/dataentities/{id}/structure/compare?firstVersionId=…&secondVersionId=…`) reads the two `version_id`s straight from the URL query string and passes them into the diff fetch **without** checking they belong to the dataset whose detail page is open. A pasted URL whose `version_id`s come from a different dataset returns 200 with a structurally-valid but semantically-nonsense diff; the compare panel renders it the same as a legitimate diff.

**Operator-visible consequences.**

* In a multi-tenant deployment, any authenticated user enumerating `version_id` integers reads every dataset's full schema — column names, types, descriptions, tags, terms, attached lookup tables, enum values — independent of dataset-level RBAC.
* Under `auth.type=DISABLED`, the same reads are reachable anonymously.
* The compare viewer is a *trusted* audit surface — operators use it to reason about "what changed between V1 and V2." A forged URL (shared in chat, in a bookmark, in another document) can display a deliberately misleading diff that looks identical to a real one. There is no "this isn't your dataset" warning.

**Mitigations until the platform-side fix lands.** Treat any view of the structure / diff page as catalog-read-collaborative — every authenticated user can effectively read every dataset's structure, regardless of dataset RBAC. If your deployment requires per-dataset isolation, enforce it at the network perimeter (reverse-proxy rules on the `/api/datasets/*/structure*` paths) rather than relying on platform RBAC. The upstream fix tightens the repository predicate to also filter on `data_entity_id`, adds a typed `NotFoundException` when the version-id does not belong to the dataset, and adds a client-side check in the compare viewer.
{% endhint %}
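The difference between the unscoped lookup and the scoped predicate described in the hint above can be reproduced with an in-memory SQLite table. This is a sketch under stated assumptions: the table is a simplified stand-in, the `data_entity_id` column name is assumed from the path parameter, and the function names are hypothetical. It is not the platform's repository code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table; the column linking a version to its dataset is an assumption.
conn.execute(
    "CREATE TABLE dataset_version ("
    "id INTEGER PRIMARY KEY, data_entity_id INTEGER, version INTEGER)"
)
conn.executemany(
    "INSERT INTO dataset_version VALUES (?, ?, ?)",
    [(101, 1, 1), (102, 1, 2), (201, 2, 1)],
)


def find_version_unscoped(version_id: int):
    """Filter on the version id only; the dataset id from the path is ignored."""
    return conn.execute(
        "SELECT id, data_entity_id FROM dataset_version WHERE id = ?",
        (version_id,),
    ).fetchone()


def find_version_scoped(data_entity_id: int, version_id: int):
    """Filter on both ids, so a version from another dataset is not found."""
    return conn.execute(
        "SELECT id, data_entity_id FROM dataset_version "
        "WHERE id = ? AND data_entity_id = ?",
        (version_id, data_entity_id),
    ).fetchone()


print(find_version_unscoped(201))   # (201, 2): returned whatever dataset id was in the path
print(find_version_scoped(1, 201))  # None: version 201 does not belong to dataset 1
print(find_version_scoped(2, 201))  # (201, 2)
```

In the scoped variant, a `None` result is the point at which a service layer could raise a not-found error instead of returning another dataset's schema.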

{% hint style="warning" %}
**The diff endpoint returns 500 for missing-version-id and 400 for identical-version-id — operators cannot distinguish "wrong input" from "platform broken."** When one or both version-ids in `GET /api/datasets/{id}/structure/diff?first_version_id=A&second_version_id=B` do not exist (typo, deleted version, cross-dataset id), the service raises a bare `RuntimeException("Query returned %s rows for diff request")` that the controller-advice maps to **HTTP 500**. When the two version-ids are identical, the service raises a typed `BadUserRequestException` that maps to **HTTP 400**. The compare-viewer UI catches both into the same generic error page; the operator sees the same "something went wrong" regardless of whether they mistyped a version-id or whether the platform is actually broken.

This breaks the common debugging path: an SRE seeing a 500 on the diff endpoint starts looking at platform health (logs, JVM heap, database) when the actual cause was a wrong `version_id` in the URL. The upstream fix replaces the bare `RuntimeException` with a typed `NotFoundException` so the controller-advice maps "missing version-id" to HTTP 404; once that ships, the compare UI can differentiate 4xx (user input) from 5xx (platform error) and present a meaningful message.

Until the fix lands, treat any 500 from the structure / diff endpoints as **probably a wrong version-id** before treating it as a platform incident — verify both ids exist on the dataset (the revision history dropdown on the Structure tab is authoritative) before opening a ticket.
{% endhint %}
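The triage order suggested in the hint above can be written down as a small client-side helper. It is a hedged sketch: `triage_diff_response` and its return strings are hypothetical, and it encodes only the status-code behaviour described on this page (400 for identical ids, 500 for missing ids until the upstream fix ships).

```python
def triage_diff_response(status: int, first_version_id: int, second_version_id: int) -> str:
    """Suggest a first debugging step for a response from the diff endpoint."""
    if 200 <= status < 300:
        return "ok"
    if status == 400 and first_version_id == second_version_id:
        return "user-input: the two version ids are identical"
    if status == 500:
        # Until missing ids map to 404, a 500 is most often a wrong version id.
        return "check-input-first: confirm both version ids exist on this dataset"
    return "unclassified: inspect the response body and platform logs"


print(triage_diff_response(400, 7, 7))
print(triage_diff_response(500, 7, 9999))
```

Only after both ids have been confirmed against the revision history on the Structure tab would a remaining 500 be worth treating as a platform incident.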

{% hint style="info" %}
**"Latest version" means the highest `version` integer, not the most recently created.** `GET /api/datasets/{id}/structure` (no `version_id`) returns the structure of the dataset version with the highest `version` column value — computed by the repository as `max(DATASET_VERSION.VERSION)`. The `CREATED_AT` timestamp is **not** considered. Under normal ingestion (where the version integer increases monotonically with each re-ingest) the two are equivalent. They diverge in two scenarios:

* **Operator-driven re-ingest of an older version** (collector replay of historical metadata, manual SQL fix-up). The replayed row carries the older `version` value but a newer `created_at`, so "latest" by version still returns the highest-numbered row rather than the just-replayed one.
* **Out-of-order or manually-edited version numbers** in the database (rare; usually only from migrations or recovery).

If you observed a recent re-ingest and the "latest" structure looks unexpected, query an explicit `version_id` from the revision history rather than relying on the no-version-id read.
{% endhint %}
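The divergence described in the hint above is easy to see with a few rows. The data below is invented for illustration (an out-of-order `version` value on the most recently created row), and the dictionary keys simply mirror the `VERSION` and `CREATED_AT` columns named on this page.

```python
from datetime import datetime

# Invented rows: the most recently created row does not carry the highest version.
rows = [
    {"id": 1, "version": 1, "created_at": datetime(2024, 1, 1)},
    {"id": 2, "version": 3, "created_at": datetime(2024, 2, 1)},
    {"id": 3, "version": 2, "created_at": datetime(2024, 3, 1)},
]

latest_by_version = max(rows, key=lambda r: r["version"])
latest_by_created = max(rows, key=lambda r: r["created_at"])

print(latest_by_version["id"])  # 2: what the no-version-id read selects
print(latest_by_created["id"])  # 3: what an operator may expect after a recent re-ingest
```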

{% hint style="info" %}
**Renaming one parent field in a nested struct shows every descendant field as removed-and-re-added.** On datasets with nested structures (a struct/record column whose sub-fields are themselves diffable), a field's identity includes its parent's ODDRN. When you rename a parent field, every field hierarchically beneath it gets a new ODDRN too — so the diff treats each descendant as a different field, emitting it once as **removed** (under the old name) and once as **added** (under the new name). A single rename near the top of a deep struct therefore renders as a large "everything changed" diff, even though only one name actually changed.

This is expected behaviour, not a diff bug: if you renamed a parent struct field, read the wall of removed/added descendant rows as the consequence of that one rename rather than as independent column changes.
{% endhint %}
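A short sketch shows why one parent rename fans out across the whole subtree. It assumes, purely for illustration, that a nested field's ODDRN is its parent's ODDRN plus a path segment; the ODDRN strings and the `rename_prefix` helper are hypothetical.

```python
def rename_prefix(oddrns: set, old_prefix: str, new_prefix: str) -> set:
    """Rename a parent field and, with it, every ODDRN nested beneath it."""
    renamed = set()
    for oddrn in oddrns:
        if oddrn == old_prefix or oddrn.startswith(old_prefix + "/"):
            renamed.add(new_prefix + oddrn[len(old_prefix):])
        else:
            renamed.add(oddrn)
    return renamed


before = {
    "//orders/id",
    "//orders/address",
    "//orders/address/city",
    "//orders/address/zip",
}
after = rename_prefix(before, "//orders/address", "//orders/location")

removed = sorted(before - after)
added = sorted(after - before)

print(removed)  # ['//orders/address', '//orders/address/city', '//orders/address/zip']
print(added)    # ['//orders/location', '//orders/location/city', '//orders/location/zip']
```

One rename produces three removed and three added rows here; the count grows with the number of descendants under the renamed parent.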

## Activity-feed surfacing

Schema changes also surface as events on the [Activity Feed](/features/active-platform-features/activity-feed.md) — an operator walking the feed sees every metadata change across the catalog including schema edits, alongside ownership / tag / term / status changes.

## Where to next

* [Alerting → Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it): the alert rule, lifecycle, and operator workflow.
* [Activity Feed](/features/active-platform-features/activity-feed.md): the audit trail of every schema (and other metadata) change.
* [Test Results Import](/features/data-quality/test-results-import.md): per-adapter coverage of field statistics surfaced on the Structure tab.
* [Data Discovery overview](/features/data-discovery.md): the landing page for the section this page belongs to.