
# Data Lineage

The **Data Lineage** section of ODD Platform is the home for upstream and downstream traceability across the catalog. Its scope is stable: anything that documents *how entities are connected* belongs here, whether that is which dataset was read by which job, which job produced which model, or which microservice made which traced call.

ODD covers Data Lineage across two complementary surfaces: **data-object lineage** (catalog entities and the edges between them) and **microservices lineage** (OpenTelemetry-traced microservice calls rendered alongside the data graph). See the [Data Governance map](/introduction/main-concepts.md#data-governance-map) for the position of Data Lineage among the other governance pillars.

Open lineage from the **Lineage tab** on any data-entity detail page (per-entity view) or from the **Group lineage** entry point on a [Data Entity Group](/features/data-discovery/groups-domains.md) detail page. The microservices view is reached from any catalogued microservice entity ingested through [`odd-tracing-gateway`](/integrations/integrations/odd-tracing-gateway.md).

Both canvases expose a Compact / Full view-mode toggle. The label is the same, but the behaviour differs between the two subsystems; the [data-objects sub-page](/features/data-lineage/data-objects.md#view-mode-toggle-compact--full) describes the asymmetry and the dense-graph caveat. The same sub-page also covers the **UI-vs-API depth contract**, which every direct-API caller should read before scripting lineage queries: the canvas's 1-20 depth dropdown is a UI presentation choice, while the URL and the API accept any positive integer with no upper bound.

## Subsections

* [**Data Objects Lineage**](/features/data-lineage/data-objects.md) — per-entity upstream / downstream graphs across the full ODD entity model: datasets, transformers, transformer runs, quality tests + their runs, consumers, data inputs, data entity groups (including ML experiments), and entity relationships. Backed by the split per-entity endpoints `GET /api/dataentities/{data_entity_id}/lineage/upstream` and `GET /api/dataentities/{data_entity_id}/lineage/downstream`, plus the dedicated group-lineage endpoint `GET /api/dataentitygroups/{data_entity_group_id}/lineage`.
* [**Microservices Lineage**](/features/data-lineage/microservices.md) — microservice call lineage rendered alongside data-object lineage. Sourced from OpenTelemetry traces ingested via `odd-tracing-gateway` (the platform's only [standalone gateway](/introduction/main-concepts.md#the-architecture-chain) push adapter today).
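
For callers scripting against the endpoints listed above, the following is a minimal sketch of how the request URLs could be assembled. `BASE_URL` is a hypothetical deployment address, and the `lineage_depth` query parameter name is an assumption that should be confirmed against the API reference; only the three paths come from this page.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # hypothetical deployment address


def lineage_url(entity_id: int, direction: str, depth: int = 1) -> str:
    """Build a per-entity lineage URL for the upstream or downstream endpoint."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # `lineage_depth` is an assumed parameter name, not confirmed by this page.
    query = urlencode({"lineage_depth": depth})
    return f"{BASE_URL}/api/dataentities/{entity_id}/lineage/{direction}?{query}"


def group_lineage_url(group_id: int) -> str:
    """Build the URL for the dedicated group-lineage endpoint."""
    return f"{BASE_URL}/api/dataentitygroups/{group_id}/lineage"
```

The builder deliberately rejects only non-positive depths and does not cap at 20, which mirrors the depth contract described earlier on this page: the 1-20 range is a UI choice, not an API limit.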

## Why this is a separate pillar

For how Data Lineage relates to the other governance pillars (Data Discovery, Data Modelling, Master Data Management, Data Glossary, Data Quality), see [Main Concepts → Data Governance map → Pillar differentiation](/introduction/main-concepts.md#pillar-differentiation) — the canonical home for the six-pillar framing. Lineage is its own pillar because the connection graph cuts across every other pillar; a dataset has a structure, a meaning, a location, a quality signal, *and* a lineage, and the lineage itself is the cross-pillar record.

## My-objects triplet — composition + anchor architecture

Three lineage-adjacent endpoints answer the operator question *"what do I own, what flows into it, what flows out of it"* as a unified triplet:

| Endpoint                              | What it returns                                                                                 |
| ------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `GET /api/dataentities/my`            | The entities the signed-in user owns (the **anchor set**).                                      |
| `GET /api/dataentities/my/upstream`   | The set of entities that the user's owned entities **depend on** but the user does **not** own. |
| `GET /api/dataentities/my/downstream` | The set of entities that **depend on** the user's owned entities but the user does **not** own. |

The `/my/upstream` and `/my/downstream` endpoints do not return "my owned entities plus their upstream". They return **the non-owned set adjacent to the user's owned entities** (lineage neighbours minus the anchor). The UI labels reflect this: the surfaces are called *"Upstream dependents"* / *"Downstream dependents"*, meaning the entities around the user's owned set, not the owned set itself.
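
The "neighbours minus the anchor" semantic can be illustrated with a small in-memory model. This is a sketch, not the platform's implementation: the `(parent, child)` edge representation and the function names are hypothetical, and it assumes the neighbourhood is one lineage hop wide.

```python
from typing import Iterable

# Each edge is (parent_oddrn, child_oddrn): the child depends on the parent.
Edge = tuple[str, str]


def my_upstream(owned: set[str], edges: Iterable[Edge]) -> set[str]:
    """Non-owned entities that the owned entities depend on."""
    parents = {parent for parent, child in edges if child in owned}
    return parents - owned  # the anchor set is explicitly excluded


def my_downstream(owned: set[str], edges: Iterable[Edge]) -> set[str]:
    """Non-owned entities that depend on the owned entities."""
    children = {child for parent, child in edges if parent in owned}
    return children - owned  # the anchor set is explicitly excluded


# Chain a -> b -> c -> d, where the caller owns b and c.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
owned = {"b", "c"}
print(my_upstream(owned, edges))    # {'a'}
print(my_downstream(owned, edges))  # {'d'}
```

Note that `b` and `c` never appear in either result even though each is a lineage neighbour of the other: owned entities are removed from the response.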

{% hint style="danger" %}
**The OpenAPI summary on the upstream / downstream operations describes the wrong shape.** The spec text for `getMyObjectsWithUpstream` and `getMyObjectsWithDownstream` currently reads *"Returns list of data entities owned by current user with upstream dependencies"* — implying the response is your owned set with extra context. The actual response is the **NON-owned set adjacent to your owned entities** (the lineage neighbours, with the owned anchor explicitly excluded).

Third-party API consumers compiling SDKs from `openapi.yaml` will get the wrong mental model. SDK code that treats the response as *"my entities"* will silently mis-attribute lineage neighbours to the caller. Until the spec is corrected, **follow the UI label semantic** (*"Upstream dependents"* / *"Downstream dependents"*) when integrating these endpoints — they are dependency graphs around the caller's owned set, not the caller's owned set itself.
{% endhint %}

### Anchor architecture and operator caveats

A handful of architectural details on the triplet matter when reasoning about exposure, performance, and debugging:

{% hint style="warning" %}
**Owner-scoping is enforced at exactly one site; the lineage projection that follows has no defence-in-depth.** The triplet's owner filter runs at the anchor-fetch step (the platform resolves the signed-in user to their bound Owner and looks up the entities that Owner owns). From that point onward the lineage CTE has **no ownership join**, and the final projection (`listByOddrns`) is a pure `WHERE oddrn IN (...)` scan against the anchor set with no per-owner predicate. The base `/api/dataentities/my` endpoint **does** join the ownership table; the triplet's upstream / downstream endpoints do **not**.

Today's code is correct: the anchor set IS the operator's owned set, so the lineage walk around that set is genuinely the caller's neighbourhood. The architectural caveat is that **a regression at the anchor-fetch step has catastrophic blast radius**: a misordered web filter dropping the security context, a typo in the user-owner-mapping resolver, or a fallback that defaults to an unintended owner under `auth.type=DISABLED` would silently return a different owner's lineage neighbourhood. The repository tier does not catch the mistake, because there is no JOIN-side check that says "the anchor must match the caller." Combined with the [cross-mode user-name collision](/configuration-and-deployment/enable-security/authentication/login-form.md) on `USER_OWNER_MAPPING.OIDC_USERNAME`, a multi-mode deployment with a name collision is one regression away from a cross-owner lineage neighbourhood leak. Audit attention belongs at the anchor-fetch site (`fetchAssociatedOwner`), which is the single load-bearing line.
{% endhint %}

{% hint style="warning" %}
**The endpoint fetches the full owned set before applying pagination — admin / CI-bot owners trigger O(anchor) DB cost on every call regardless of `size`.** The triplet builds the upstream / downstream query by first calling `listByOwner(ownerId)` (which returns **all** entities the owner owns, no pagination) and then constructing a CTE with `WHERE child_oddrn IN (oddrn1, oddrn2, ...)` over the full anchor set. Memory and database CPU scale with the **size of the owned set**, not with the requested page size. PostgreSQL's planner cost is non-linear above \~1000 IN-clause elements; the jOOQ query does not paginate the IN clause.

**Operator-visible consequence.** An admin owner who owns thousands of catalogued entities (a CI-bot account that gets default-owner-assigned on every ingestion, an admin who became the owner of everything during initial setup) triggers a heavy query on every `/my/upstream` or `/my/downstream` call — even when the UI requests `size=5`. The Recommended panel on the Catalog Overview home page fires these endpoints on every SPA mount, so the cost is per-user-pageload, not per-explicit-API-call. **Operationally bound the owned set** — avoid making admin accounts the owner of all entities; use a service-account pattern for ingestion-time auto-owner assignment; consider a small dedicated owner per team rather than a single platform-wide steward.
{% endhint %}
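
The fetch-all-then-paginate order described in the warning above can be modelled in a few lines. This is an illustrative in-memory sketch, not the platform's jOOQ query: the function name, the dictionary-based storage, and the 1-indexed `page` are all assumptions made for the example.

```python
def my_upstream_page(owned_by_owner, edges, owner_id, page, size):
    """Return (page_of_neighbours, anchor_size_touched) for one owner.

    Models the order of operations: the full owned set is loaded first,
    the neighbourhood is computed over all of it, and pagination is
    applied only at the very end.
    """
    anchor = owned_by_owner.get(owner_id, [])  # full owned set, no pagination
    anchor_set = set(anchor)
    # Stand-in for the IN (...) filter: one membership test per anchor element.
    neighbours = sorted(
        {parent for parent, child in edges if child in anchor_set} - anchor_set
    )
    start = (page - 1) * size  # pagination happens after the heavy work
    return neighbours[start:start + size], len(anchor)


# A hypothetical CI-bot owner with 5000 owned entities, each with one upstream.
owned_by_owner = {"ci-bot": [f"e{i}" for i in range(5000)]}
edges = [(f"up{i}", f"e{i}") for i in range(5000)]

rows, anchor_size = my_upstream_page(owned_by_owner, edges, "ci-bot", page=1, size=5)
print(len(rows), anchor_size)  # 5 5000
```

Requesting five rows still touches all 5000 anchor elements, which is why bounding the owned set is the effective lever rather than lowering `size`.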

{% hint style="info" %}
**An empty response (`HTTP 200` with `[]`) on the triplet is indistinguishable across four root causes.** The triplet does not signal which condition produced the empty response:

* No `USER_OWNER_MAPPING` row for the caller — the platform cannot resolve an Owner; the anchor set is empty.
* A bound Owner exists but owns zero entities — the anchor set is genuinely empty.
* Owned entities exist but they have no upstream / downstream lineage edges — the neighbourhood is empty.
* No security context (anonymous call under `auth.type=DISABLED`) — the anchor-fetch resolves to no Owner.

All four return the same `200 OK` body. When troubleshooting an empty triplet response, cross-check `/api/identity/whoami` (auth state — distinguishes the no-security-context case) and `/api/dataentities/my` (owned set — distinguishes the empty-anchor cases from the empty-neighbourhood case) before concluding "no lineage exists."
{% endhint %}
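
The cross-check above can be expressed as a small decision function. This is a sketch under stated assumptions: the caller has already queried `/api/identity/whoami` and `/api/dataentities/my` and reduced the results to a boolean and a count. The response shapes of those endpoints are not described on this page, so the parsing step is left out, and the enum and function names are hypothetical.

```python
from enum import Enum


class EmptyCause(Enum):
    NO_SECURITY_CONTEXT = "no security context (anonymous call)"
    EMPTY_ANCHOR = "no Owner mapping, or the bound Owner owns no entities"
    EMPTY_NEIGHBOURHOOD = "owned entities have no lineage edges in this direction"


def diagnose_empty_triplet(authenticated: bool, owned_count: int) -> EmptyCause:
    """Narrow down why /my/upstream or /my/downstream returned [].

    authenticated: whether whoami reported a signed-in identity.
    owned_count:   number of entities returned by /api/dataentities/my.
    """
    if not authenticated:
        return EmptyCause.NO_SECURITY_CONTEXT
    if owned_count == 0:
        # These two inputs alone cannot separate a missing mapping
        # from an Owner that genuinely owns nothing.
        return EmptyCause.EMPTY_ANCHOR
    return EmptyCause.EMPTY_NEIGHBOURHOOD


print(diagnose_empty_triplet(authenticated=True, owned_count=12).value)
```

With only these two inputs, the two empty-anchor causes collapse into one outcome; separating them would require checking whether the caller has a `USER_OWNER_MAPPING` row at all.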

## Read posture across the catalog

Lineage on every catalogued entity — datasets, transformers, consumers, microservices, Data Entity Groups — uses the platform's read-collaborative posture. Any authenticated user with read access to the catalog can request the upstream / downstream graph of any catalogued entity, regardless of which team owns the underlying object. The lineage repository does not apply an ownership-side filter on the read path; the group-lineage endpoint exposes the full child-set under its parent group; the microservices lineage surface exposes the full call graph between catalogued services.

This matters most for **multi-team deployments** that expect per-team isolation on lineage reads — they don't get it from the platform's RBAC today. The mitigations are platform-wide and live on the [Authorization](/configuration-and-deployment/enable-security/authorization.md) subtree: scope the catalog deployment per team, or restrict who has authenticated access to it. Microservice lineage is the highest-sensitivity surface in this class because operational call patterns are more topology-revealing than schema-lineage edges (see [Microservices Lineage → Access model](/features/data-lineage/microservices.md#access-model)).

## Where to next

* If you want to trace upstream / downstream from a specific catalogued entity → [Data Objects Lineage](/features/data-lineage/data-objects.md).
* If you ingest microservices through OpenTelemetry traces and want to see them alongside your data graph → [Microservices Lineage](/features/data-lineage/microservices.md).
* For the Lineage HTTP API (per-entity, group-level, and microservices) → [API Reference → Lineage](/developer-guides/api-reference/lineage.md).
* For the broader catalog vocabulary (Data Entity, ODDRN, Plugin, Push adapter) → [Main Concepts](/introduction/main-concepts.md).
* For where Data Lineage sits among the other governance pillars → [Main Concepts → Data Governance map](/introduction/main-concepts.md#data-governance-map).