
# Dataset schema diff

When metadata is re-ingested for a dataset, the platform compares the new revision to the previous one and surfaces every change (added, removed, and renamed columns, plus type changes) on the dataset's **Structure** page. Operators get a visual diff for every revision, and the platform raises a [Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it) alert whenever a column removal or type change could break downstream consumers.

This page covers the user-facing diff surface. The underlying alert mechanism (what triggers it, its lifecycle, and the per-entity halt configuration) lives on the [Alerting](/features/active-platform-features/alerting.md) page.

## Where to find it

Open any dataset (`Table`, `File`, `View`, `Vector Store`, ...) and navigate to the **Structure** tab on the entity's detail page.

<figure><img src="/files/TQ7yJ3HBM1cKGXpSgS0o" alt=""><figcaption><p>Dataset structure in the ODD UI</p></figcaption></figure>

The Structure tab carries:

* **Fields** — every column the dataset exposes.
* **Data types** per field.
* **Statistics** per field (when ingested by an adapter that emits them — see [Test Results Import](/features/data-quality/test-results-import.md) for the per-adapter coverage).

## Revision history

Every re-ingest of a dataset that **changes the structure** creates a new **revision**. Adding a column, deleting a column, renaming a column, or changing a column's data type all bump the revision counter. Same-structure re-ingests do not create a new revision (the platform compares structure by `(field_oddrn, type)` to decide).
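
The revision rule above can be sketched as a set comparison. This is a minimal illustration rather than the platform's implementation: `Field`, `needs_new_revision`, and the example ODDRN strings are hypothetical, and the sketch assumes field order does not matter. The only detail taken from the text is that structure is compared by `(field_oddrn, type)` pairs.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one dataset field."""
    oddrn: str
    type: str


def needs_new_revision(previous: list[Field], incoming: list[Field]) -> bool:
    """Return True when the set of (oddrn, type) pairs differs."""
    return set(previous) != set(incoming)


# Made-up ODDRN strings, for illustration only.
v1 = [Field("//demo/users/id", "INTEGER"), Field("//demo/users/email", "STRING")]
same = list(reversed(v1))                        # same fields, different order
retyped = [Field("//demo/users/id", "STRING"), v1[1]]

print(needs_new_revision(v1, same))     # False
print(needs_new_revision(v1, retyped))  # True
```

A re-ingest that only changes descriptions or statistics would leave the pair set unchanged in this model, which matches the "same-structure re-ingests do not create a new revision" rule.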

The revision history is browsable per dataset: pick any two revisions to see exactly what changed between them.

<figure><img src="/files/jerumYZg9ynGN0QgJfUZ" alt=""><figcaption><p>Dataset revisions in the ODD UI</p></figcaption></figure>

Two illustrative diffs the platform surfaces:

<figure><img src="/files/oYjeYJ5Pv06xfkF84WaH" alt=""><figcaption><p>Column was added</p></figcaption></figure>

<figure><img src="/files/IfAwQr2ZSDKmWpkDHnzz" alt=""><figcaption><p>Column was removed</p></figcaption></figure>

The diff also captures **data-type changes** and **column renames** (detected as a type change and an ODDRN change between revisions, respectively).
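
As a rough sketch of how such a diff could be computed, the function below classifies changes between two revisions represented as `{oddrn: type}` mappings. The function name, the mapping shape, and the ODDRN strings are assumptions made for illustration; the platform's actual diff code is not shown on this page.

```python
def diff_fields(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify changes between two {oddrn: type} mappings."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "type_changed": sorted(
            oddrn for oddrn in old.keys() & new.keys() if old[oddrn] != new[oddrn]
        ),
    }


# Made-up ODDRN strings: "id" changes type, "note" is renamed to "comment".
old = {
    "//demo/orders/id": "INTEGER",
    "//demo/orders/total": "FLOAT",
    "//demo/orders/note": "STRING",
}
new = {
    "//demo/orders/id": "STRING",
    "//demo/orders/total": "FLOAT",
    "//demo/orders/comment": "STRING",
}

result = diff_fields(old, new)
print(result)
```

In this model the rename shows up as one removed ODDRN plus one added ODDRN, because the field's identity changed even though its type did not.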

## Backwards-incompatible alerts

When the comparison surfaces a **removal** of a previously-present column, or a **type change** on an existing column, the platform additionally raises a [Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it) alert against the dataset. The alert lands on the entity's Alerts tab and the platform-wide [Alerts](/features/active-platform-features/alerting.md) section, and is **not** auto-resolved — an operator must work it by hand.

<figure><img src="/files/EJKXqLhasjgS1j29Im1V" alt=""><figcaption><p>Schema-change alerts in the ODD UI</p></figcaption></figure>
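
The trigger condition described in this section (a removed column or a changed type) can be expressed as a small predicate. This is a hedged sketch under stated assumptions: `is_backwards_incompatible` and the `{oddrn: type}` mapping are hypothetical, a missing previous revision stands in for the first-ingest case, and renames are not modelled separately.

```python
from typing import Optional


def is_backwards_incompatible(
    old: Optional[dict[str, str]], new: dict[str, str]
) -> bool:
    """True when a field was removed or an existing field changed type."""
    if old is None:
        # First ingest: there is no previous revision to break.
        return False
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return bool(removed or retyped)


base = {"id": "INTEGER", "email": "STRING"}

print(is_backwards_incompatible(base, {**base, "age": "INTEGER"}))           # False
print(is_backwards_incompatible(base, {"id": "INTEGER"}))                    # True
print(is_backwards_incompatible(base, {"id": "STRING", "email": "STRING"}))  # True
print(is_backwards_incompatible(None, base))                                 # False
```

Adding a column changes the diff but does not satisfy the predicate, which is the gap between the discovery surface and the action surface.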

This separation is intentional:

* **The diff** is a discovery surface — every operator looking at the dataset sees what changed across revisions.
* **The alert** is the action surface — the platform proactively flags the case where the change breaks a downstream consumer.

For the full alert rule (what counts as backwards-incompatible per entity class — Datasets, Transformers, Consumers — and the first-ingest-no-alert exception), see [Alerting → Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it).

## Known limitations and operator caveats

The dataset-version read and diff endpoints behind this page carry a few behaviours that are non-obvious from the UI alone. Each item below states what an operator might assume, what the platform actually does, and what to do today.

{% hint style="danger" %}
**Dataset-version reads (including the diff endpoint) are not scoped to their dataset: any authenticated caller who can guess a `version_id` can read any dataset's schema.** The `GET /api/datasets/{data_entity_id}/structure/{version_id}` endpoint, the diff endpoint at `/api/datasets/{data_entity_id}/structure/diff`, and the "latest version" read all accept the path component `{data_entity_id}` in the controller signature, but the underlying repository query filters **only** on `dataset_version.id`; no `data_entity_id` predicate is added. Calling `GET /api/datasets/9999/structure/{any-real-version-id}` returns the schema of whichever dataset owns that `version_id`, regardless of which dataset id you put in the path.

The UI surface amplifies the leak. The catalog's compare viewer (`/dataentities/{id}/structure/compare?firstVersionId=…&secondVersionId=…`) reads the two `version_id`s straight from the URL query string and passes them into the diff fetch **without** checking they belong to the dataset whose detail page is open. A pasted URL whose `version_id`s come from a different dataset returns 200 with a structurally-valid but semantically-nonsense diff; the compare panel renders it the same as a legitimate diff.

**Operator-visible consequences.**

* In a multi-tenant deployment, any authenticated user enumerating `version_id` integers reads every dataset's full schema — column names, types, descriptions, tags, terms, attached lookup tables, enum values — independent of dataset-level RBAC.
* Under `auth.type=DISABLED`, the same reads are reachable anonymously.
* The compare viewer is a *trusted* audit surface — operators use it to reason about "what changed between V1 and V2." A forged URL (shared in chat, in a bookmark, in another document) can display a deliberately misleading diff that looks identical to a real one. There is no "this isn't your dataset" warning.

**Mitigations until the platform-side fix lands.** Treat any view of the structure / diff page as catalog-read-collaborative — every authenticated user can effectively read every dataset's structure, regardless of dataset RBAC. If your deployment requires per-dataset isolation, enforce it at the network perimeter (reverse-proxy rules on the `/api/datasets/*/structure*` paths) rather than relying on platform RBAC. The upstream fix tightens the repository predicate to also filter on `data_entity_id`, adds a typed `NotFoundException` when the version-id does not belong to the dataset, and adds a client-side check in the compare viewer.
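
To make the missing predicate concrete, here is a small in-memory model of the two lookup behaviours. It is not the platform's repository code: `DatasetVersion`, `VERSIONS`, both function names, and the use of `LookupError` are hypothetical stand-ins for the query described above and for the tightened predicate the fix is said to add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    id: int
    data_entity_id: int
    fields: tuple[str, ...]


# Two versions that belong to two different datasets.
VERSIONS = {
    101: DatasetVersion(101, data_entity_id=1, fields=("id", "email")),
    202: DatasetVersion(202, data_entity_id=2, fields=("card_number",)),
}


def get_version_unscoped(data_entity_id: int, version_id: int) -> DatasetVersion:
    """Behaviour described above: the dataset id from the path is ignored."""
    return VERSIONS[version_id]


def get_version_scoped(data_entity_id: int, version_id: int) -> DatasetVersion:
    """Described fix: the version must also belong to the requested dataset."""
    version = VERSIONS.get(version_id)
    if version is None or version.data_entity_id != data_entity_id:
        raise LookupError(
            f"version {version_id} not found for dataset {data_entity_id}"
        )
    return version


# Any dataset id in the path still returns dataset 2's fields.
print(get_version_unscoped(9999, 202).fields)
```

Calling `get_version_scoped(9999, 202)` raises instead of returning another dataset's fields, which is the behaviour a per-dataset check would give.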
{% endhint %}

{% hint style="warning" %}
**The diff endpoint returns 500 for missing-version-id and 400 for identical-version-id — operators cannot distinguish "wrong input" from "platform broken."** When one or both version-ids in `GET /api/datasets/{id}/structure/diff?first_version_id=A&second_version_id=B` do not exist (typo, deleted version, cross-dataset id), the service raises a bare `RuntimeException("Query returned %s rows for diff request")` that the controller-advice maps to **HTTP 500**. When the two version-ids are identical, the service raises a typed `BadUserRequestException` that maps to **HTTP 400**. The compare-viewer UI catches both into the same generic error page; the operator sees the same "something went wrong" regardless of whether they mistyped a version-id or whether the platform is actually broken.

This breaks the common debugging path: an SRE seeing a 500 on the diff endpoint starts looking at platform health (logs, JVM heap, database) when the actual cause was a wrong `version_id` in the URL. The upstream fix replaces the bare `RuntimeException` with a typed `NotFoundException` so the controller-advice maps "missing version-id" to HTTP 404; once that ships, the compare UI can differentiate 4xx (user input) from 5xx (platform error) and present a meaningful message.

Until the fix lands, treat any 500 from the structure / diff endpoints as **probably a wrong version-id** before treating it as a platform incident — verify both ids exist on the dataset (the revision history dropdown on the Structure tab is authoritative) before opening a ticket.
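
One way to apply that advice in scripts that call the diff endpoint is to validate the ids before sending the request. The helper below is a hypothetical client-side sketch; `precheck_diff_request` and `known_version_ids` (the ids you copied from the dataset's revision history) are assumptions, not part of the platform API.

```python
def precheck_diff_request(
    first_version_id: int,
    second_version_id: int,
    known_version_ids: set[int],
) -> list[str]:
    """Return the input problems that would make the diff call fail."""
    problems = []
    if first_version_id == second_version_id:
        problems.append("identical version ids (the endpoint answers 400)")
    # Report each unknown id once, in a stable order.
    for vid in sorted({first_version_id, second_version_id} - known_version_ids):
        problems.append(
            f"version id {vid} is not in this dataset's revision history "
            "(the endpoint answers 500)"
        )
    return problems


known = {41, 42, 43}
print(precheck_diff_request(41, 43, known))  # []
print(precheck_diff_request(41, 41, known))
print(precheck_diff_request(41, 99, known))
```

An empty list means a 500 from the subsequent call is more likely to be a genuine platform problem than a typo.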
{% endhint %}

{% hint style="info" %}
**"Latest version" means the highest `version` integer, not the most recently created.** `GET /api/datasets/{id}/structure` (no `version_id`) returns the structure of the dataset version with the highest `version` column value — computed by the repository as `max(DATASET_VERSION.VERSION)`. The `CREATED_AT` timestamp is **not** considered. Under normal ingestion (where the version integer increases monotonically with each re-ingest) the two are equivalent. They diverge in two scenarios:

* **Operator-driven re-ingest of an older version** (collector replay of historical metadata, manual SQL fix-up). The replayed row carries the older `version` value but a newer `created_at`, so "latest" by version still returns the row with the highest `version`, not the just-replayed row.
* **Out-of-order or manually-edited version numbers** in the database (rare; usually only from migrations or recovery).

If you observed a recent re-ingest and the "latest" structure looks unexpected, query an explicit `version_id` from the revision history rather than relying on the no-version-id read.
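
The difference between the two notions of "latest" is easy to see on a toy table. The rows below are invented for illustration (the `VersionRow` tuple and its values are assumptions); only the selection rule, highest `version` rather than newest `created_at`, comes from the text.

```python
from datetime import datetime
from typing import NamedTuple


class VersionRow(NamedTuple):
    id: int
    version: int
    created_at: datetime


rows = [
    VersionRow(id=10, version=1, created_at=datetime(2024, 1, 1)),
    VersionRow(id=11, version=3, created_at=datetime(2024, 2, 1)),
    # Hypothetical out-of-order row: lower version number, newer timestamp.
    VersionRow(id=12, version=2, created_at=datetime(2024, 3, 1)),
]

latest_by_version = max(rows, key=lambda r: r.version)
latest_by_created = max(rows, key=lambda r: r.created_at)

print(latest_by_version.id)  # 11: what the no-version-id read would pick
print(latest_by_created.id)  # 12: the most recently written row
```

When the two ids differ, as here, requesting an explicit `version_id` removes the ambiguity.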
{% endhint %}

{% hint style="info" %}
**Renaming one parent field in a nested struct shows every descendant field as removed-and-re-added.** On datasets with nested structures (a struct/record column whose sub-fields are themselves diffable), a field's identity includes its parent's ODDRN. When you rename a parent field, every field hierarchically beneath it gets a new ODDRN too — so the diff treats each descendant as a different field, emitting it once as **removed** (under the old name) and once as **added** (under the new name). A single rename near the top of a deep struct therefore renders as a large "everything changed" diff, even though only one name actually changed.

This is expected behaviour, not a diff bug: if you renamed a parent struct field, read the wall of removed/added descendant rows as the consequence of that one rename rather than as independent column changes.
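
The cascade can be reproduced with a few lines that build a path per field from its ancestors. The slash-joined paths here are a simplified, hypothetical stand-in for ODDRNs, and `flatten` is not platform code; the sketch only assumes, as stated above, that a field's identity embeds its parent's identity.

```python
def flatten(struct: dict, parent: str = "") -> set[str]:
    """Return one path per field; a child's path embeds its parent's path."""
    paths = set()
    for name, children in struct.items():
        path = f"{parent}/{name}"
        paths.add(path)
        paths |= flatten(children, path)
    return paths


# Only the top-level field is renamed: address -> location.
before = {"address": {"city": {}, "geo": {"lat": {}, "lon": {}}}}
after = {"location": {"city": {}, "geo": {"lat": {}, "lon": {}}}}

removed = sorted(flatten(before) - flatten(after))
added = sorted(flatten(after) - flatten(before))

print(len(removed), len(added))  # 5 5
print(removed)
```

One rename produces five removed and five added rows in this model: the parent itself plus its four descendants.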
{% endhint %}

## Activity-feed surfacing

Schema changes also surface as events on the [Activity Feed](/features/active-platform-features/activity-feed.md) — an operator walking the feed sees every metadata change across the catalog including schema edits, alongside ownership / tag / term / status changes.

## Where to next

* [Alerting → Backwards-incompatible schema change](/features/active-platform-features/alerting.md#backwards-incompatible-schema-change-what-triggers-it) — the alert rule, lifecycle, and operator workflow.
* [Activity Feed](/features/active-platform-features/activity-feed.md) — the audit trail of every schema (and other metadata) change.
* [Test Results Import](/features/data-quality/test-results-import.md) — per-adapter coverage of field statistics surfaced on the Structure tab.
* [Data Discovery overview](/features/data-discovery.md) — the landing page for the section this page sits under.

