
# Data Lineage

The **Data Lineage** section of ODD Platform is the home for upstream and downstream traceability across the catalog. Anything that documents *how entities are connected* belongs here: which dataset was read by which job, which job produced which model, which microservice traced which call.

ODD covers Data Lineage across two complementary surfaces: **data-object lineage** (catalog entities and the edges between them) and **microservices lineage** (OpenTelemetry-traced microservice calls rendered alongside the data graph). See the [Data Governance map](/introduction/main-concepts.md#data-governance-map) for the position of Data Lineage among the other governance pillars.

Open lineage from the **Lineage tab** on any data-entity detail page (per-entity view) or from the **Group lineage** entry point on a [Data Entity Group](/features/data-discovery/groups-domains.md) detail page. The microservices view is reached from any catalogued microservice entity ingested through [`odd-tracing-gateway`](/integrations/integrations/odd-tracing-gateway.md).

Both canvases expose a Compact / Full view-mode toggle. The label is the same, but the two subsystems behave differently; the [data-objects sub-page](/features/data-lineage/data-objects.md#view-mode-toggle-compact--full) describes the asymmetry and the dense-graph caveat. The same sub-page also covers the **UI-vs-API depth contract**, which every direct-API caller should read before scripting lineage queries: the canvas's 1–20 depth dropdown is a UI presentation choice, while the URL and the API accept any positive integer with no upper bound.
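
Because the API imposes no ceiling on depth, a script that calls the lineage endpoints directly may want its own guard. The following is a minimal Python sketch, not platform code: `checked_depth` is a hypothetical helper, and `max_depth` is a client-side cap chosen by the caller rather than a platform setting.

```python
def checked_depth(depth: int, max_depth: int = 20) -> int:
    """Validate a lineage depth before sending it to the API.

    `max_depth` is a hypothetical client-side cap; the platform itself
    accepts any positive integer.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(depth, bool) or not isinstance(depth, int):
        raise TypeError("depth must be an integer")
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # Clamp to the caller's own ceiling to avoid accidentally huge walks.
    return min(depth, max_depth)
```

The default of 20 simply mirrors the UI dropdown; a caller who genuinely needs deeper graphs can raise `max_depth` explicitly.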

## Subsections

* [**Data Objects Lineage**](/features/data-lineage/data-objects.md) — per-entity upstream / downstream graphs across the full ODD entity model: datasets, transformers, transformer runs, quality tests + their runs, consumers, data inputs, data entity groups (including ML experiments), and entity relationships. Backed by the split per-entity endpoints `GET /api/dataentities/{data_entity_id}/lineage/upstream` and `GET /api/dataentities/{data_entity_id}/lineage/downstream`, plus the dedicated group-lineage endpoint `GET /api/dataentitygroups/{data_entity_group_id}/lineage`.
* [**Microservices Lineage**](/features/data-lineage/microservices.md) — microservice call lineage rendered alongside data-object lineage. Sourced from OpenTelemetry traces ingested via `odd-tracing-gateway` (the platform's only [standalone gateway](/introduction/main-concepts.md#the-architecture-chain) push adapter today).
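
For scripted access, the three endpoint paths named above can be assembled with a few small helpers. This is a sketch only: the function names and the `base_url` value are assumptions for illustration, and query parameters are left out because this page does not specify them.

```python
def lineage_paths(data_entity_id: int) -> dict[str, str]:
    """Return the per-entity upstream and downstream lineage paths."""
    return {
        "upstream": f"/api/dataentities/{data_entity_id}/lineage/upstream",
        "downstream": f"/api/dataentities/{data_entity_id}/lineage/downstream",
    }


def group_lineage_path(data_entity_group_id: int) -> str:
    """Return the dedicated group-lineage path."""
    return f"/api/dataentitygroups/{data_entity_group_id}/lineage"


def absolute_url(base_url: str, path: str) -> str:
    """Join a deployment base URL (assumed, e.g. a local instance) and a path."""
    return base_url.rstrip("/") + path
```

For example, `absolute_url("http://localhost:8080/", group_lineage_path(7))` yields the group-lineage URL for group 7 on a hypothetical local deployment.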

## Why this is a separate pillar

For how Data Lineage relates to the other governance pillars (Data Discovery, Data Modelling, Master Data Management, Data Glossary, Data Quality), see [Main Concepts → Data Governance map → Pillar differentiation](/introduction/main-concepts.md#pillar-differentiation) — the canonical home for the six-pillar framing. Lineage is its own pillar because the connection graph cuts across every other pillar; a dataset has a structure, a meaning, a location, a quality signal, *and* a lineage, and the lineage itself is the cross-pillar record.

## My-objects triplet — composition + anchor architecture

Three lineage-adjacent endpoints answer the operator question *"what do I own, what flows into it, what flows out of it"* as a unified triplet:

| Endpoint                              | What it returns                                                                                 |
| ------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `GET /api/dataentities/my`            | The entities the signed-in user owns (the **anchor set**).                                      |
| `GET /api/dataentities/my/upstream`   | The set of entities that the user's owned entities **depend on** but the user does **not** own. |
| `GET /api/dataentities/my/downstream` | The set of entities that **depend on** the user's owned entities but the user does **not** own. |

The two `*upstream` / `*downstream` endpoints do not return "my owned entities + their upstream". They return **the non-owned set adjacent to the user's owned entities** (lineage neighbours minus the anchor). The UI labels the surfaces accurately as *"Upstream dependents"* / *"Downstream dependents"*: the entities adjacent to what the user owns, not the owned entities themselves.

{% hint style="danger" %}
**The OpenAPI summary on the upstream / downstream operations describes the wrong shape.** The spec text for `getMyObjectsWithUpstream` and `getMyObjectsWithDownstream` currently reads *"Returns list of data entities owned by current user with upstream dependencies"* — implying the response is your owned set with extra context. The actual response is the **NON-owned set adjacent to your owned entities** (the lineage neighbours, with the owned anchor explicitly excluded).

Third-party API consumers compiling SDKs from `openapi.yaml` will get the wrong mental model. SDK code that treats the response as *"my entities"* will silently mis-attribute lineage neighbours to the caller. Until the spec is corrected, **follow the UI label semantic** (*"Upstream dependents"* / *"Downstream dependents"*) when integrating these endpoints — they are dependency graphs around the caller's owned set, not the caller's owned set itself.
{% endhint %}
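
The "neighbours minus the anchor" semantic can be made concrete with a toy model. The sketch below is an illustration in Python, not the platform's implementation: it walks a single hop over an in-memory edge set, whereas the real endpoints run a lineage query in the database. The function names and the `(parent, child)` edge representation are assumptions.

```python
def upstream_neighbours(owned: set[str], edges: set[tuple[str, str]]) -> set[str]:
    """Entities that owned entities depend on, excluding the owned set.

    Each edge is a (parent, child) pair meaning "child depends on parent".
    """
    parents = {parent for parent, child in edges if child in owned}
    return parents - owned  # the anchor is explicitly excluded


def downstream_neighbours(owned: set[str], edges: set[tuple[str, str]]) -> set[str]:
    """Entities that depend on owned entities, excluding the owned set."""
    children = {child for parent, child in edges if parent in owned}
    return children - owned  # the anchor is explicitly excluded
```

With a chain `a → b → c → d` where the caller owns `b` and `c`, the upstream result is `{"a"}` and the downstream result is `{"d"}`; neither contains `b` or `c`, which is exactly what client code must not assume otherwise.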

### Anchor architecture and operator caveats

A handful of architectural details on the triplet matter when reasoning about exposure, performance, and debugging:

{% hint style="warning" %}
**Owner-scoping is enforced at exactly one site — the lineage projection downstream has no defence-in-depth.** The triplet's owner filter runs at the anchor-fetch step (the platform resolves the signed-in user to their bound Owner and looks up the entities that Owner owns). From that point onward the lineage CTE has **no ownership join** and the final projection (`listByOddrns`) is a pure `WHERE oddrn IN (...)` scan against the anchor set with no per-owner predicate. The base `/api/dataentities/my` endpoint **does** join the ownership table; the triplet's upstream / downstream endpoints do **not**.

Today's code is correct: the anchor set is the operator's owned set, so the lineage walk around that set is genuinely the caller's neighbourhood. The architectural caveat is that **a regression at the anchor-fetch step has a catastrophic blast radius**: a misordered web filter dropping the security context, a typo in the user-owner-mapping resolver, or a fallback that defaults to an unintended owner under `auth.type=DISABLED` would silently return a different owner's lineage neighbourhood. The repository tier does not catch the mistake, because there is no JOIN-side check that says "the anchor must match the caller." Combined with the [cross-mode user-name collision](/configuration-and-deployment/enable-security/authentication/login-form.md) on `USER_OWNER_MAPPING.OIDC_USERNAME`, a multi-mode deployment with a name collision is one regression away from a cross-owner lineage-neighbourhood leak. Audit attention belongs at the anchor-fetch site (`fetchAssociatedOwner`), which is the single load-bearing line.
{% endhint %}

{% hint style="warning" %}
**The endpoint fetches the full owned set before applying pagination — admin / CI-bot owners trigger O(anchor) DB cost on every call regardless of `size`.** The triplet builds the upstream / downstream query by first calling `listByOwner(ownerId)` (which returns **all** entities the owner owns, no pagination) and then constructing a CTE with `WHERE child_oddrn IN (oddrn1, oddrn2, ...)` over the full anchor set. Memory and database CPU scale with the **size of the owned set**, not with the requested page size. PostgreSQL's planner cost is non-linear above \~1000 IN-clause elements; the jOOQ query does not paginate the IN clause.

**Operator-visible consequence.** An admin owner who owns thousands of catalogued entities (a CI-bot account that gets default-owner-assigned on every ingestion, an admin who became the owner of everything during initial setup) triggers a heavy query on every `/my/upstream` or `/my/downstream` call — even when the UI requests `size=5`. The Recommended panel on the Catalog Overview home page fires these endpoints on every SPA mount, so the cost is per-user-pageload, not per-explicit-API-call. **Operationally bound the owned set** — avoid making admin accounts the owner of all entities; use a service-account pattern for ingestion-time auto-owner assignment; consider a small dedicated owner per team rather than a single platform-wide steward.
{% endhint %}
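
One way to act on the "bound the owned set" advice is to audit ownership counts periodically. The sketch below assumes the operator has already exported `(owner, entity)` pairs by some means (the export itself is out of scope here); the function name is hypothetical, and the default threshold of 1000 simply echoes the approximate IN-clause figure mentioned above.

```python
from collections import Counter
from collections.abc import Iterable


def oversized_owners(
    ownerships: Iterable[tuple[str, str]], threshold: int = 1000
) -> dict[str, int]:
    """Return owners whose owned-entity count exceeds `threshold`.

    `ownerships` is an iterable of (owner_name, entity_oddrn) pairs,
    assumed to come from an operator-side export.
    """
    counts = Counter(owner for owner, _entity in ownerships)
    return {owner: n for owner, n in counts.items() if n > threshold}
```

Owners flagged by such a check are the ones most likely to make every `/my/upstream` or `/my/downstream` call expensive.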

{% hint style="info" %}
**An empty response (`HTTP 200` with `[]`) on the triplet is indistinguishable across four root causes.** The triplet does not signal which condition produced the empty response:

* No `USER_OWNER_MAPPING` row for the caller — the platform cannot resolve an Owner; the anchor set is empty.
* A bound Owner exists but owns zero entities — the anchor set is genuinely empty.
* Owned entities exist but they have no upstream / downstream lineage edges — the neighbourhood is empty.
* No security context (anonymous call under `auth.type=DISABLED`) — the anchor-fetch resolves to no Owner.

All four return the same `200 OK` body. When troubleshooting an empty triplet response, cross-check `/api/identity/whoami` (auth state — distinguishes the no-security-context case) and `/api/dataentities/my` (owned set — distinguishes the empty-anchor cases from the empty-neighbourhood case) before concluding "no lineage exists."
{% endhint %}
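
The cross-check described above can be written down as a small decision procedure. This Python sketch takes already-fetched results as plain values; how `authenticated` is derived from the `/api/identity/whoami` response is an assumption left to the caller, and the function name is hypothetical. Note that these two cross-checks alone do not separate "no `USER_OWNER_MAPPING` row" from "Owner owns zero entities", so the sketch reports them together.

```python
def diagnose_empty_triplet(authenticated: bool, owned: list, neighbours: list) -> str:
    """Classify why /my/upstream or /my/downstream returned an empty list.

    authenticated: whether whoami indicates a real security context (assumed
                   to be derived by the caller).
    owned:         the body of GET /api/dataentities/my.
    neighbours:    the body of the upstream or downstream call being checked.
    """
    if neighbours:
        return "not empty"
    if not authenticated:
        return "no security context"
    if not owned:
        # Either no owner mapping exists, or the bound Owner owns nothing.
        return "empty anchor"
    return "empty neighbourhood"
```

Only the last outcome supports the conclusion that no lineage exists around the caller's entities in the requested direction.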

## Read posture across the catalog

Lineage on every catalogued entity — datasets, transformers, consumers, microservices, Data Entity Groups — uses the platform's read-collaborative posture. Any authenticated user with read access to the catalog can request the upstream / downstream graph of any catalogued entity, regardless of which team owns the underlying object. The lineage repository does not apply an ownership-side filter on the read path; the group-lineage endpoint exposes the full child-set under its parent group; the microservices lineage surface exposes the full call graph between catalogued services.

This matters most for **multi-team deployments** that expect per-team isolation on lineage reads — they don't get it from the platform's RBAC today. The mitigations are platform-wide and live on the [Authorization](/configuration-and-deployment/enable-security/authorization.md) subtree: scope the catalog deployment per team, or restrict who has authenticated access to it. Microservice lineage is the highest-sensitivity surface in this class because operational call patterns are more topology-revealing than schema-lineage edges (see [Microservices Lineage → Access model](/features/data-lineage/microservices.md#access-model)).

## Where to next

* If you want to trace upstream / downstream from a specific catalogued entity → [Data Objects Lineage](/features/data-lineage/data-objects.md).
* If you ingest microservices through OpenTelemetry traces and want to see them alongside your data graph → [Microservices Lineage](/features/data-lineage/microservices.md).
* For the Lineage HTTP API (per-entity, group-level, and microservices) → [API Reference → Lineage](/developer-guides/api-reference/lineage.md).
* For the broader catalog vocabulary (Data Entity, ODDRN, Plugin, Push adapter) → [Main Concepts](/introduction/main-concepts.md).
* For where Data Lineage sits among the other governance pillars → [Main Concepts → Data Governance map](/introduction/main-concepts.md#data-governance-map).

