Architecture

This page gives the structural mental model of an Open Data Discovery deployment: what runs where, how metadata flows from a source system to a user's screen, and which architectural concerns cross every component. Read it before Features, Integrations, and Configure ODD Platform. For the producer-side vocabulary used here (Adapter, Plugin, Collector, Push adapter), see Main Concepts → The architecture chain; this page instead uses client-server topology terms (Push-client, Collector, Platform, Server).

Data flow

Metadata moves through five stages between a source system and a catalog user:

  1. Produce. A source system has metadata that needs to surface in the catalog — a database schema, a job graph, a dbt manifest, a Spark lineage event, a Lookup-Table row.

  2. Ingest. A producer (Collector or Push-client) sends metadata to the platform's Ingestion API. Pull producers (Collectors) poll on a schedule; push producers (in-process plugins, gateways, SDK callers) emit on the source's own cadence. Both speak the ODD Specification — the wire contract.

  3. Store. The platform writes the metadata to PostgreSQL keyed by ODDRN. The same ODDRN means the same entity across ingests, across producers, and over time; that is what makes cross-system lineage possible.

  4. Query. UI calls and external scripts hit the Platform API (/api/...). Reads serve the catalog (search, lineage, alerts, glossary, query examples, relationships); writes mutate the catalog (ownership, tags, alert status, halt configuration, lookup-table rows). See the API Reference hub for the full surface.

  5. Render. The platform UI (served from the same process) renders the catalog: search, entity pages, lineage graphs, alert tabs, the Directory drill-down, the Catalog Overview home page.

The Push-client / Collector split is only at stage 2 — every later stage is identical regardless of which producer family fed the catalog.
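The Store stage's "same ODDRN, same entity" rule can be illustrated with a minimal sketch. It assumes a toy in-memory dictionary in place of PostgreSQL; the `Catalog` class, its `ingest` method, and the identifier string are hypothetical and are not the platform's actual API or a real ODDRN.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy in-memory stand-in for the platform's store (hypothetical)."""

    entities: dict[str, dict] = field(default_factory=dict)

    def ingest(self, oddrn: str, metadata: dict) -> None:
        # Upsert keyed by identifier: a repeated identifier updates the
        # existing entity instead of creating a second one.
        self.entities.setdefault(oddrn, {}).update(metadata)


catalog = Catalog()
# Illustrative identifier only; the real format is defined on the ODDRN page.
table = "//example/warehouse/tables/orders"

catalog.ingest(table, {"columns": ["id", "total"]})          # pull snapshot
catalog.ingest(table, {"last_run": "2024-01-01T00:00:00Z"})  # push event
catalog.ingest(table, {"columns": ["id", "total"]})          # repeated snapshot

print(len(catalog.entities))  # 1
```

Three ingests from two producer families leave a single entity, which is the property that lets later stages ignore where the metadata came from.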

Deployment topology

| Component | What it is | What an operator deploys | Configuration home |
| --- | --- | --- | --- |
| Platform (Server) | The Spring-Boot application: Ingestion API, Platform API, UI, scheduled jobs (housekeeping, alerting, data-collaboration sender). | One Platform process plus PostgreSQL. | |
| Collector | Container of pull adapters plus the runtime around them (adapter launcher, logger, Platform-API client, scheduler). The canonical implementation is odd-collector; cloud-specific siblings are odd-collector-aws, odd-collector-azure, odd-collector-gcp, and odd-collector-profiler. | One Collector container per cloud / source-family group, each holding many configured plugins (one per source instance). | |
| Push-client (in-process plugin) | A push-strategy adapter that runs inside the source system's runtime: a dbt plugin, an Airflow plugin, a Great Expectations checkpoint action, a Spark listener, an odd-cli invocation. | Installed alongside the source application; emits metadata on the source's own cadence. | Per-tool repos under opendatadiscovery on GitHub; see Integrations. |
| Push-client (standalone gateway) | A push-strategy adapter that runs as its own service. Source systems push over an externally-defined wire protocol (today: OpenTelemetry/OTLP for odd-tracing-gateway); the gateway processes the input and exposes the inferred entities for the Platform / a collector to pull through the standard adapter-contract entities API. | One gateway process per network perimeter that needs aggregated push ingress. | |
| UI | Single-page React application served from the Platform process at /. | Same process as the Platform; operators do not deploy the UI separately. | (Configured indirectly through odd.platform-base-url; see Configure ODD Platform.) |

Centralised: the Platform (one server, one PostgreSQL) and the UI (served from the same process). Distributed: every Collector and every Push-client lives in or beside its source system. The reason ODD scales to many sources is that the producer side is horizontally distributable while the catalog stays a single coherent surface.

Cross-cutting concerns

Each concern below gets a landing-level summary; the linked canonical page carries the full operator detail.

Pull vs Push — when to choose which

Both topologies feed the same catalog through the same Ingestion API. The choice is operational:

  • Pull (Collector) when the source is a passive data store (database, warehouse, BI tool, ML registry, message broker) and you want point-in-time snapshots on a cadence. The Collector drives; the source has no awareness of the catalog. Most data-source integrations work this way.

  • Push (Push-client) when the source is an already-running application that you can instrument — Airflow DAGs, dbt runs, Spark jobs, Great Expectations validations, your own services calling odd-cli — and you want per-run lineage and results reported as they happen. The source drives; latency from event to catalog is bounded by the producer's own emit cadence.

  • Both at once is normal: a pull Collector indexes the warehouse catalog while an Airflow Push-client reports per-run lineage on top of it.
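The difference in who drives can be sketched as follows. Everything here is hypothetical: `PullCollector`, `PushClient`, the `ingest` callable, and the identifier strings are illustrative stand-ins, not the odd-collector runtime, a real plugin, or the Ingestion API.

```python
from typing import Callable

Ingest = Callable[[str, dict], None]


class PullCollector:
    """Hypothetical pull producer: a scheduler drives, the source is passive."""

    def __init__(self, read_source: Callable[[], dict[str, dict]], ingest: Ingest):
        self.read_source = read_source
        self.ingest = ingest

    def run_once(self) -> None:
        # One scheduled tick: snapshot everything the source currently holds.
        for oddrn, metadata in self.read_source().items():
            self.ingest(oddrn, metadata)


class PushClient:
    """Hypothetical push producer: the instrumented application drives."""

    def __init__(self, ingest: Ingest):
        self.ingest = ingest

    def on_run_finished(self, oddrn: str, status: str) -> None:
        # Called from inside the source application when a run completes.
        self.ingest(oddrn, {"last_run_status": status})


received: list[tuple[str, dict]] = []


def ingest(oddrn: str, metadata: dict) -> None:
    # Stand-in for the shared ingestion endpoint; it only records the calls.
    received.append((oddrn, metadata))


warehouse = {"//example/warehouse/tables/orders": {"columns": ["id", "total"]}}
collector = PullCollector(lambda: warehouse, ingest)
push = PushClient(ingest)

collector.run_once()  # the collector decides when to read
push.on_run_finished("//example/airflow/dags/load_orders", "SUCCESS")  # the source decides
print(len(received))  # 2
```

Both producers call the same `ingest` function, mirroring the point that the topologies differ only in what triggers the emit.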

For the specification's own view of push-strategy producers, see the push model section of the ODD Specification.

ODDRN

ODDRN (Open Data Discovery Resource Name) is the stable string that identifies every entity in the system — a dataset, a column, a data source, a pipeline run. Producers generate an ODDRN for each entity they emit so the platform can recognise the same entity across ingests, across producers, and over time. ODDRN is what makes cross-system lineage possible, what makes idempotent ingests possible, and what gives the AlertManager webhook its entity_oddrn routing key.

Operators rarely interact with ODDRNs directly — they become relevant when authoring a custom adapter. See ODDRN for the format, examples, and the generator libraries for Python and Java; see Build a custom collector for the end-to-end Python pattern.
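The property a custom adapter has to preserve is determinism: the same source coordinates must always yield the same identifier. The sketch below illustrates that idea only. The `build_resource_name` helper, the `HIERARCHY` tuple, and the resulting path layout are assumptions made for illustration; they are not the real ODDRN format and not the API of the Python or Java generator libraries.

```python
# Assumed, illustrative segment order; the real layout is defined on the ODDRN page.
HIERARCHY = ("host", "database", "schema", "table")


def build_resource_name(source_type: str, **parts: str) -> str:
    """Hypothetical helper: build a stable identifier from fixed-order segments."""
    missing = [key for key in HIERARCHY if key not in parts]
    if missing:
        raise ValueError(f"missing segments: {missing}")
    segments = [source_type]
    for key in HIERARCHY:  # fixed order, independent of keyword-argument order
        segments += [key, parts[key]]
    return "//" + "/".join(segments)


first = build_resource_name(
    "exampledb", host="db.internal", database="shop", schema="public", table="orders"
)
second = build_resource_name(
    "exampledb", table="orders", schema="public", database="shop", host="db.internal"
)
print(first == second)  # True
```

Because the segment order is fixed inside the helper, two producers that describe the same table arrive at the same string, which is what allows the platform to treat their output as one entity.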

Where to read the code

The mapping from this overview to the actual code lives in the workspace's navigation domain pages: each navigation/domains/{feature}.md file keeps controller, service, configuration, and UI pointers for one feature, so a reader does not need to grep. The contributor-facing entry points on the public doc tree are:

  • GitHub organization overview — every ODD repository with a one-line summary.

  • Build and run — Platform and Collector build / deploy walkthroughs, plus the Build a custom collector developer guide.

  • Main Concepts — the producer-side vocabulary (Adapter / Plugin / Collector / Push adapter / Data source) and the Data Governance map (which pillars ODD covers, which are roadmap).

  • API Reference — the canonical hub for every HTTP endpoint, with per-feature sub-pages.
