Dataset schema diff

Dataset schema diff — visual side-by-side comparison of dataset schema revisions, with per-revision history and automatic alert raising for backwards-incompatible schema changes.

When metadata is re-ingested for a dataset, the platform compares the new revision to the previous one and surfaces every change — added columns, removed columns, renamed columns, type changes — on the dataset's Structure page. Operators get a visual diff for every revision, and the platform raises a Backwards-incompatible schema change alert whenever a removal or type change breaks downstream consumers.

This page covers the user-facing diff surface. The underlying alert mechanism (what triggers it, its lifecycle, and the per-entity halt configuration) is documented on the Alerting page.

Where to find it

Open any dataset (Table, File, View, Vector Store, ...) and navigate to the Structure tab on the entity's detail page.

Dataset structure in the ODD UI

The Structure tab carries:

  • Fields — every column the dataset exposes.

  • Data types per field.

  • Statistics per field (when ingested by an adapter that emits them — see Test Results Import for the per-adapter coverage).

Revision history

Every re-ingest of a dataset that changes the structure creates a new revision. Adding a column, deleting a column, renaming a column, or changing a column's data type all bump the revision counter. Same-structure re-ingests do not create a new revision (the platform compares structure by (field_oddrn, type) to decide).
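The revision rule above can be read as a set comparison over (field_oddrn, type) pairs. The following is a minimal sketch under that reading, not the platform's code: the `Field` type, the `creates_new_revision` helper, and the sample ODDRN strings are hypothetical, and it assumes that field order does not matter.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one dataset field."""
    oddrn: str
    type: str


def creates_new_revision(previous: list[Field], incoming: list[Field]) -> bool:
    """Return True when the sets of (oddrn, type) pairs differ between two ingests."""
    return set(previous) != set(incoming)


# Illustrative ODDRN strings only; real ODDRNs have a different shape.
v1 = [Field("//users/columns/id", "INTEGER"), Field("//users/columns/email", "STRING")]
same = list(reversed(v1))  # same fields, different order
retyped = [Field("//users/columns/id", "STRING"), Field("//users/columns/email", "STRING")]

print(creates_new_revision(v1, same))     # False
print(creates_new_revision(v1, retyped))  # True
```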

The revision history is browsable per dataset: pick any two revisions to see exactly what changed between them.

Dataset revisions in the ODD UI

Two illustrative diffs the platform surfaces:

Column was added
Column was removed

The diff also captures data-type changes and column renames: a data-type change is detected as a type change between revisions, and a rename is detected as an ODDRN change between revisions.
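One way to picture how these change kinds fall out of a comparison is to diff two mappings of field ODDRN to type. This is an illustrative sketch only: `diff_fields` and the sample ODDRN strings are hypothetical, and how the UI presents a rename is not modelled here.

```python
def diff_fields(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify changes between two {field_oddrn: type} mappings."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "type_changed": sorted(k for k in shared if old[k] != new[k]),
    }


old = {
    "//orders/columns/id": "INTEGER",
    "//orders/columns/amount": "INTEGER",
    "//orders/columns/note": "STRING",
}
new = {
    "//orders/columns/id": "INTEGER",
    "//orders/columns/amount": "DECIMAL",   # type change
    "//orders/columns/comment": "STRING",   # "note" renamed, so its ODDRN changed
}

result = diff_fields(old, new)
print(result["type_changed"])  # ['//orders/columns/amount']
print(result["removed"])       # ['//orders/columns/note']
print(result["added"])         # ['//orders/columns/comment']
```

In this sketch a rename shows up as one ODDRN disappearing and another appearing, which is what "an ODDRN change between revisions" amounts to at the data level.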

Backwards-incompatible alerts

When the comparison surfaces a removal of a previously present column, or a type change on an existing column, the platform additionally raises a Backwards-incompatible schema change alert against the dataset. The alert appears on the entity's Alerts tab and in the platform-wide Alerts section. It is not auto-resolved: an operator must resolve it manually.

Schema-change alerts in the ODD UI

This separation is intentional:

  • The diff is a discovery surface — every operator looking at the dataset sees what changed across revisions.

  • The alert is the action surface — the platform proactively flags the case where the change breaks a downstream consumer.

For the full alert rule (what counts as backwards-incompatible per entity class — Datasets, Transformers, Consumers — and the first-ingest-no-alert exception), see Alerting → Backwards-incompatible schema change.
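The dataset-level condition described in this section (a removed column or a type change on an existing column, with no alert on first ingest) can be sketched as a predicate. This is a hedged illustration, not the platform's rule engine: `is_backwards_incompatible` and the sample field ODDRNs are hypothetical, and the rules for Transformers and Consumers are not covered.

```python
from typing import Optional


def is_backwards_incompatible(
    old: Optional[dict[str, str]], new: dict[str, str]
) -> bool:
    """Return True if `new` removes a field or changes the type of an existing one.

    Both arguments map a field ODDRN to its type. `old` is None on first ingest,
    in which case there is nothing to compare against and no alert is warranted.
    """
    if old is None:
        return False
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return bool(removed or retyped)


base = {"//t/columns/id": "INTEGER", "//t/columns/name": "STRING"}

print(is_backwards_incompatible(None, base))                                    # False
print(is_backwards_incompatible(base, {**base, "//t/columns/age": "INTEGER"}))  # False
print(is_backwards_incompatible(base, {"//t/columns/id": "INTEGER"}))           # True
print(is_backwards_incompatible(base, {**base, "//t/columns/id": "STRING"}))    # True
```

Adding a column alone leaves the predicate false, which matches the split between the diff (shows every change) and the alert (flags only breaking ones).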

Known limitations and operator caveats

The dataset-version read and diff endpoints behind this page carry a few behaviours that are non-obvious from the UI alone. Each item below states what an operator might assume, what the platform actually does, and what to do today.

"Latest version" means the highest version integer, not the most recently created. GET /api/datasets/{id}/structure (no version_id) returns the structure of the dataset version with the highest version column value — computed by the repository as max(DATASET_VERSION.VERSION). The CREATED_AT timestamp is not considered. Under normal ingestion (where the version integer increases monotonically with each re-ingest) the two are equivalent. They diverge in two scenarios:

  • Operator-driven re-ingest of an older version (collector replay of historical metadata, manual SQL fix-up). The replay carries the older version value but a newer created_at, so "latest" by version still returns the row with the highest version integer rather than the just-replayed row.

  • Out-of-order or manually-edited version numbers in the database (rare; usually only from migrations or recovery).

If you observed a recent re-ingest and the "latest" structure looks unexpected, query an explicit version_id from the revision history rather than relying on the no-version-id read.
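The gap between "highest version integer" and "most recently created" can be shown with a few in-memory rows. This is a sketch under assumed data, not the repository query: the `DatasetVersion` type, the two helper functions, and all row values are hypothetical. The example models the second scenario, a manually edited version number.

```python
from datetime import datetime
from typing import NamedTuple


class DatasetVersion(NamedTuple):
    """Hypothetical, simplified view of one dataset version row."""
    id: int
    version: int
    created_at: datetime


def latest_by_version(rows: list[DatasetVersion]) -> DatasetVersion:
    """Mirror the documented behaviour: pick the row with the highest version."""
    return max(rows, key=lambda r: r.version)


def latest_by_created_at(rows: list[DatasetVersion]) -> DatasetVersion:
    """What an operator might assume instead: pick the newest row by timestamp."""
    return max(rows, key=lambda r: r.created_at)


rows = [
    DatasetVersion(id=1, version=1, created_at=datetime(2024, 1, 1)),
    # Version number edited by hand during a recovery, so it is out of order.
    DatasetVersion(id=2, version=5, created_at=datetime(2024, 2, 1)),
    DatasetVersion(id=3, version=3, created_at=datetime(2024, 3, 1)),
]

print(latest_by_version(rows).id)     # 2
print(latest_by_created_at(rows).id)  # 3
```

When the two results differ, as here, reading by an explicit version_id from the revision history removes the ambiguity.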

Renaming one parent field in a nested struct shows every descendant field as removed-and-re-added. On datasets with nested structures (a struct/record column whose sub-fields are themselves diffable), a field's identity includes its parent's ODDRN. When you rename a parent field, every field hierarchically beneath it gets a new ODDRN too — so the diff treats each descendant as a different field, emitting it once as removed (under the old name) and once as added (under the new name). A single rename near the top of a deep struct therefore renders as a large "everything changed" diff, even though only one name actually changed.

This is expected behaviour, not a diff bug: if you renamed a parent struct field, read the wall of removed/added descendant rows as the consequence of that one rename rather than as independent column changes.
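The fan-out from a single parent rename can be reproduced with a toy model in which each field's identity is a path that embeds its parent's path. This is a simplified sketch: the `flatten` helper and the slash-separated paths are hypothetical stand-ins for real ODDRNs, which have a different format.

```python
def flatten(schema: dict, parent: str = "") -> set[str]:
    """Flatten a nested {name: children} schema into path-like field identities."""
    paths: set[str] = set()
    for name, children in schema.items():
        path = f"{parent}/{name}"  # the child's identity embeds the parent's
        paths.add(path)
        paths |= flatten(children, path)
    return paths


before = {"address": {"street": {}, "geo": {"lat": {}, "lon": {}}}}
# Only the top-level struct is renamed: address -> location.
after = {"location": {"street": {}, "geo": {"lat": {}, "lon": {}}}}

removed = sorted(flatten(before) - flatten(after))
added = sorted(flatten(after) - flatten(before))

print(len(removed), len(added))  # 5 5
print(removed[0], "->", added[0])  # /address -> /location
```

One rename produces five removed and five added identities here, because the parent and all four descendants change path together.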

Activity-feed surfacing

Schema changes also surface as events on the Activity Feed — an operator walking the feed sees every metadata change across the catalog including schema edits, alongside ownership / tag / term / status changes.
