Dataset schema diff
A visual side-by-side comparison of dataset schema revisions, with per-revision history and automatic alerts for backwards-incompatible schema changes.
When metadata is re-ingested for a dataset, the platform compares the new revision to the previous one and surfaces every change — added columns, removed columns, renamed columns, type changes — on the dataset's Structure page. Operators get a visual diff for every revision, and the platform raises a Backwards-incompatible schema change alert whenever a removal or type change breaks downstream consumers.
This page covers the user-facing diff surface. The underlying alert mechanism (what triggers it, its lifecycle, and the per-entity halt configuration) is documented on the Alerting page.
Where to find it
Open any dataset (Table, File, View, Vector Store, ...) and navigate to the Structure tab on the entity's detail page.

The Structure tab carries:
Fields — every column the dataset exposes.
Data types per field.
Statistics per field (when ingested by an adapter that emits them — see Test Results Import for the per-adapter coverage).
Revision history
Every re-ingest of a dataset that changes the structure creates a new revision. Adding a column, deleting a column, renaming a column, or changing a column's data type all bump the revision counter. Same-structure re-ingests do not create a new revision (the platform compares structure by (field_oddrn, type) to decide).
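The revision rule above can be sketched as a set comparison. This is a minimal illustration, not the platform's implementation: the `Field` type, the function names, and the ODDRN strings are hypothetical, and only the comparison key (field_oddrn, type) comes from this page.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple


class Field(NamedTuple):
    oddrn: str  # field identity (illustrative value, not a real ODDRN)
    type: str   # data type as reported at ingest


def structure_key(fields: Iterable[Field]) -> FrozenSet[Tuple[str, str]]:
    # Order of fields is irrelevant; only the set of (oddrn, type) pairs matters.
    return frozenset((f.oddrn, f.type) for f in fields)


def creates_new_revision(previous: Iterable[Field], incoming: Iterable[Field]) -> bool:
    # A re-ingest bumps the revision only when the structure key differs.
    return structure_key(previous) != structure_key(incoming)


v1 = [Field("users/id", "integer"), Field("users/email", "string")]
same = list(reversed(v1))                                      # same structure, different order
retyped = [Field("users/id", "bigint"), Field("users/email", "string")]

print(creates_new_revision(v1, same))     # False
print(creates_new_revision(v1, retyped))  # True
```

Under this reading, a re-ingest that only changes descriptions, tags, or statistics leaves the key unchanged and does not create a revision.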
The revision history is browsable per dataset: pick any two revisions to see exactly what changed between them.

[Figure: two illustrative diffs the platform surfaces.]
The diff also captures data-type changes and column renames: a data-type change is detected as a type change between revisions, and a rename as an ODDRN change.
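The change categories described above can be sketched as set operations over two revisions keyed by field ODDRN. The `SchemaDiff` type, the function name, and the sample ODDRN strings are assumptions made for illustration; how the UI pairs a removed and an added field into a displayed rename is not specified on this page, so the sketch does not attempt it.

```python
from typing import Dict, List, NamedTuple


class SchemaDiff(NamedTuple):
    added: List[str]
    removed: List[str]
    type_changed: List[str]


def diff_revisions(old: Dict[str, str], new: Dict[str, str]) -> SchemaDiff:
    """Compare two revisions given as {field_oddrn: data_type} mappings."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Fields present in both revisions whose type differs.
    type_changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return SchemaDiff(added, removed, type_changed)


old_rev = {"users/id": "integer", "users/name": "string", "users/age": "integer"}
new_rev = {"users/id": "bigint", "users/full_name": "string", "users/age": "integer"}

result = diff_revisions(old_rev, new_rev)
print(result.added)         # ['users/full_name']
print(result.removed)       # ['users/name']
print(result.type_changed)  # ['users/id']
```

Because a rename changes the ODDRN, it shows up here as one removed identity and one added identity.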
Backwards-incompatible alerts
When the comparison surfaces a removal of a previously-present column, or a type change on an existing column, the platform additionally raises a Backwards-incompatible schema change alert against the dataset. The alert lands on the entity's Alerts tab and the platform-wide Alerts section, and is not auto-resolved — an operator must work it by hand.

This separation is intentional:
The diff is a discovery surface — every operator looking at the dataset sees what changed across revisions.
The alert is the action surface — the platform proactively flags the case where the change breaks a downstream consumer.
For the full alert rule (what counts as backwards-incompatible per entity class — Datasets, Transformers, Consumers — and the first-ingest-no-alert exception), see Alerting → Backwards-incompatible schema change.
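As a rough sketch of the dataset case only, the rule stated in this section (removal or type change, with no alert on first ingest) could be expressed as below. The function name and the {field_oddrn: type} representation are hypothetical; the authoritative rule, including Transformers and Consumers, is the one on the Alerting page.

```python
from typing import Dict, Optional


def is_backwards_incompatible(
    previous: Optional[Dict[str, str]],
    incoming: Dict[str, str],
) -> bool:
    """Return True when a removal or type change would raise the alert.

    Both arguments map field ODDRN to data type; previous is None on first ingest.
    """
    if previous is None:
        # First ingest: there is no earlier revision to break.
        return False
    for oddrn, old_type in previous.items():
        if oddrn not in incoming:
            return True  # a previously-present column was removed
        if incoming[oddrn] != old_type:
            return True  # an existing column changed type
    return False  # additions alone are compatible


base = {"orders/id": "integer", "orders/total": "decimal"}

print(is_backwards_incompatible(None, base))                                   # False
print(is_backwards_incompatible(base, {**base, "orders/note": "string"}))      # False
print(is_backwards_incompatible(base, {"orders/id": "integer"}))               # True
print(is_backwards_incompatible(base, {**base, "orders/total": "string"}))     # True
```

In this sketch a rename also reads as a removal, because the old ODDRN disappears; whether the platform alerts on renames is defined by the Alerting page, not by this example.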
Known limitations and operator caveats
The dataset-version read and diff endpoints behind this page carry a few behaviours that are non-obvious from the UI alone. Each item below states what an operator might assume, what the platform actually does, and what to do today.
Dataset-version reads (including the diff endpoint) are not scoped to their dataset: any authenticated caller who can guess a version_id can read any dataset's schema. The GET /api/datasets/{data_entity_id}/structure/{version_id} endpoint, the diff endpoint at /api/datasets/{data_entity_id}/structure/diff, and the "latest version" read all accept the path component {data_entity_id} in the controller signature, but the underlying repository query filters only on dataset_version.id; no data_entity_id predicate is added. Calling GET /api/datasets/9999/structure/{any-real-version-id} returns the schema of whichever dataset owns that version_id, regardless of the dataset id in the path.
The UI surface amplifies the leak. The catalog's compare viewer (/dataentities/{id}/structure/compare?firstVersionId=…&secondVersionId=…) reads the two version_ids straight from the URL query string and passes them into the diff fetch without checking they belong to the dataset whose detail page is open. A pasted URL whose version_ids come from a different dataset returns 200 with a structurally-valid but semantically-nonsense diff; the compare panel renders it the same as a legitimate diff.
Operator-visible consequences.
In a multi-tenant deployment, any authenticated user enumerating version_id integers reads every dataset's full schema (column names, types, descriptions, tags, terms, attached lookup tables, enum values) independent of dataset-level RBAC.
Under auth.type=DISABLED, the same reads are reachable anonymously.
The compare viewer is a trusted audit surface: operators use it to reason about "what changed between V1 and V2." A forged URL (shared in chat, in a bookmark, in another document) can display a deliberately misleading diff that looks identical to a real one. There is no "this isn't your dataset" warning.
Mitigations until the platform-side fix lands. Treat any view of the structure / diff page as catalog-read-collaborative — every authenticated user can effectively read every dataset's structure, regardless of dataset RBAC. If your deployment requires per-dataset isolation, enforce it at the network perimeter (reverse-proxy rules on the /api/datasets/*/structure* paths) rather than relying on platform RBAC. The upstream fix tightens the repository predicate to also filter on data_entity_id, adds a typed NotFoundException when the version-id does not belong to the dataset, and adds a client-side check in the compare viewer.
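The difference between the current lookup and the tightened predicate can be illustrated with an in-memory SQLite table. This is a sketch under stated assumptions: the table is a simplified stand-in for dataset_version, and the function names and the `VersionNotFound` exception are hypothetical, not the platform's repository code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_version ("
    " id INTEGER PRIMARY KEY,"
    " data_entity_id INTEGER NOT NULL,"
    " version INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO dataset_version (id, data_entity_id, version) VALUES (?, ?, ?)",
    [(101, 1, 1), (102, 1, 2), (201, 2, 1)],
)


class VersionNotFound(LookupError):
    """Hypothetical stand-in for a typed not-found error."""


def find_version_unscoped(db: sqlite3.Connection, version_id: int):
    # Filters on the version id only: the dataset id from the path is ignored.
    return db.execute(
        "SELECT id, data_entity_id FROM dataset_version WHERE id = ?",
        (version_id,),
    ).fetchone()


def find_version_scoped(db: sqlite3.Connection, data_entity_id: int, version_id: int):
    # Adds the data_entity_id predicate so a version is only visible via its own dataset.
    row = db.execute(
        "SELECT id, data_entity_id FROM dataset_version"
        " WHERE id = ? AND data_entity_id = ?",
        (version_id, data_entity_id),
    ).fetchone()
    if row is None:
        raise VersionNotFound(f"version {version_id} not found on dataset {data_entity_id}")
    return row


print(find_version_unscoped(conn, 201))   # (201, 2) whatever dataset id was in the path
print(find_version_scoped(conn, 2, 201))  # (201, 2)
```

With the scoped variant, a request for version 201 under dataset 9999 raises the not-found error instead of returning dataset 2's row.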
The diff endpoint returns 500 for missing-version-id and 400 for identical-version-id — operators cannot distinguish "wrong input" from "platform broken." When one or both version-ids in GET /api/datasets/{id}/structure/diff?first_version_id=A&second_version_id=B do not exist (typo, deleted version, cross-dataset id), the service raises a bare RuntimeException("Query returned %s rows for diff request") that the controller-advice maps to HTTP 500. When the two version-ids are identical, the service raises a typed BadUserRequestException that maps to HTTP 400. The compare-viewer UI catches both into the same generic error page; the operator sees the same "something went wrong" regardless of whether they mistyped a version-id or whether the platform is actually broken.
This breaks the common debugging path: an SRE seeing a 500 on the diff endpoint starts looking at platform health (logs, JVM heap, database) when the actual cause was a wrong version_id in the URL. The upstream fix replaces the bare RuntimeException with a typed NotFoundException so the controller-advice maps "missing version-id" to HTTP 404; once that ships, the compare UI can differentiate 4xx (user input) from 5xx (platform error) and present a meaningful message.
Until the fix lands, treat any 500 from the structure / diff endpoints as probably a wrong version-id before treating it as a platform incident — verify both ids exist on the dataset (the revision history dropdown on the Structure tab is authoritative) before opening a ticket.
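The triage order above can be written down as a small helper for runbooks. It is only a sketch: the function name and return strings are hypothetical, and `known_version_ids` is assumed to be copied by the operator from the revision history dropdown on the Structure tab.

```python
from typing import Set


def triage_diff_response(
    status: int,
    first_id: int,
    second_id: int,
    known_version_ids: Set[int],
) -> str:
    """Suggest the most likely cause of a diff-endpoint response."""
    if 200 <= status < 300:
        return "ok"
    if status == 400 and first_id == second_id:
        return "identical version ids"
    if status == 500:
        # Check the inputs before suspecting the platform.
        missing = sorted({first_id, second_id} - known_version_ids)
        if missing:
            return f"unknown version ids for this dataset: {missing}"
        return "ids are valid; investigate platform health"
    return "unexpected status"


known = {101, 102}
print(triage_diff_response(500, 101, 999, known))  # unknown version ids for this dataset: [999]
print(triage_diff_response(400, 101, 101, known))  # identical version ids
print(triage_diff_response(500, 101, 102, known))  # ids are valid; investigate platform health
```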
"Latest version" means the highest version integer, not the most recently created. GET /api/datasets/{id}/structure (no version_id) returns the structure of the dataset version with the highest version column value — computed by the repository as max(DATASET_VERSION.VERSION). The CREATED_AT timestamp is not considered. Under normal ingestion (where the version integer increases monotonically with each re-ingest) the two are equivalent. They diverge in two scenarios:
Operator-driven re-ingest of an older version (collector replay of historical metadata, manual SQL fix-up). The replay carries the older version value but a newer created_at, so "latest" by version returns the highest-numbered row's structure rather than the just-replayed row's.
Out-of-order or manually-edited version numbers in the database (rare; usually only from migrations or recovery).
If you observed a recent re-ingest and the "latest" structure looks unexpected, query an explicit version_id from the revision history rather than relying on the no-version-id read.
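The gap between the two notions of "latest" can be shown with a few in-memory rows. The `DatasetVersion` type, the helper names, and the sample data are made up for illustration; only the max-of-version selection rule comes from this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    id: int
    version: int
    created_at: datetime


def latest_by_version(rows: List[DatasetVersion]) -> DatasetVersion:
    # What the no-version-id read uses: the highest version integer.
    return max(rows, key=lambda r: r.version)


def latest_by_created_at(rows: List[DatasetVersion]) -> DatasetVersion:
    # What an operator may intuitively expect: the most recently written row.
    return max(rows, key=lambda r: r.created_at)


rows = [
    DatasetVersion(id=1, version=1, created_at=datetime(2024, 1, 1)),
    DatasetVersion(id=2, version=2, created_at=datetime(2024, 1, 2)),
    DatasetVersion(id=3, version=3, created_at=datetime(2024, 1, 3)),
    # A replayed row: lower version value, but written most recently.
    DatasetVersion(id=4, version=2, created_at=datetime(2024, 1, 10)),
]

print(latest_by_version(rows).id)     # 3
print(latest_by_created_at(rows).id)  # 4
```

When the two ids differ, as here, reading an explicit version_id is the only way to be sure which structure is being shown.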
Renaming one parent field in a nested struct shows every descendant field as removed-and-re-added. On datasets with nested structures (a struct/record column whose sub-fields are themselves diffable), a field's identity includes its parent's ODDRN. When you rename a parent field, every field hierarchically beneath it gets a new ODDRN too — so the diff treats each descendant as a different field, emitting it once as removed (under the old name) and once as added (under the new name). A single rename near the top of a deep struct therefore renders as a large "everything changed" diff, even though only one name actually changed.
This is expected behaviour, not a diff bug: if you renamed a parent struct field, read the wall of removed/added descendant rows as the consequence of that one rename rather than as independent column changes.
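A small sketch makes the cascade concrete. It assumes, purely for illustration, that a field's identity is a slash-joined path that includes its parent's identity; real ODDRNs have their own format, and the schema and function name here are hypothetical.

```python
from typing import Dict, Set


def field_identities(schema: Dict[str, dict], parent: str = "") -> Set[str]:
    """Flatten a nested schema into path-like identities that embed the parent path."""
    identities: Set[str] = set()
    for name, children in schema.items():
        path = f"{parent}/{name}"
        identities.add(path)
        identities |= field_identities(children, path)
    return identities


before = {"address": {"street": {}, "geo": {"lat": {}, "lon": {}}}}
# Only the top-level struct is renamed; every sub-field keeps its own name.
after = {"location": {"street": {}, "geo": {"lat": {}, "lon": {}}}}

removed = sorted(field_identities(before) - field_identities(after))
added = sorted(field_identities(after) - field_identities(before))

print(len(removed), len(added))  # 5 5
print(removed[0], "->", added[0])  # /address -> /location
```

One rename at the top of a five-field struct produces five removed and five added rows, which is the pattern to recognise in the compare view.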
Activity-feed surfacing
Schema changes also surface as events on the Activity Feed — an operator walking the feed sees every metadata change across the catalog including schema edits, alongside ownership / tag / term / status changes.
Where to next
Alerting → Backwards-incompatible schema change — the alert rule, lifecycle, and operator workflow.
Activity Feed — the audit trail of every schema (and other metadata) change.
Test Results Import — per-adapter coverage of field statistics surfaced on the Structure tab.
Data Discovery overview — the bucket landing this page sits under.