Architecture
This page gives the structural mental model of an Open Data Discovery deployment: what runs where, how metadata flows from a source system to a user's screen, and which architectural concerns cut across every component. Read it before Features, Integrations, and Configure ODD Platform. For the producer-side vocabulary (Adapter, Plugin, Collector, Push adapter), see Main Concepts → The architecture chain; this page frames the same components by client-server topology (Push-client, Collector, Platform, Server).

Data flow
Metadata moves through five stages between a source system and a catalog user:
Produce. A source system has metadata that needs to surface in the catalog — a database schema, a job graph, a dbt manifest, a Spark lineage event, a Lookup-Table row.
Ingest. A producer (Collector or Push-client) sends metadata to the platform's Ingestion API. Pull producers (Collectors) poll on a schedule; push producers (in-process plugins, gateways, SDK callers) emit on the source's own cadence. Both speak the ODD Specification — the wire contract.
Store. The platform writes the metadata to PostgreSQL keyed by ODDRN. Same-ODDRN means same entity across ingests, across producers, over time — that is what makes cross-system lineage possible.
Query. UI calls and external scripts hit the Platform API (/api/...). Reads serve the catalog (search, lineage, alerts, glossary, query examples, relationships); writes mutate the catalog (ownership, tags, alert status, halt configuration, lookup-table rows). See the API Reference hub for the full surface.
Render. The platform UI (served from the same process) renders the catalog: search, entity pages, lineage graphs, alert tabs, the Directory drill-down, the Catalog Overview home page.
The Push-client / Collector split is only at stage 2 — every later stage is identical regardless of which producer family fed the catalog.
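The Store stage's "same ODDRN means same entity" rule can be pictured with a toy in-memory catalog. This is a minimal sketch, not the platform's storage code: the `Catalog` class, its methods, the field names, and the ODDRN string are all hypothetical, and a dictionary stands in for PostgreSQL.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy stand-in for the platform's store: one record per ODDRN."""

    entities: dict = field(default_factory=dict)

    def ingest(self, producer: str, items: list[dict]) -> None:
        # Ingest + Store: upsert every item under its ODDRN.
        for item in items:
            record = self.entities.setdefault(item["oddrn"], {"producers": set()})
            record["producers"].add(producer)
            # Merge the remaining fields into the existing record.
            record.update({k: v for k, v in item.items() if k != "oddrn"})

    def get(self, oddrn: str) -> dict:
        # Query: read one entity back by its ODDRN.
        return self.entities[oddrn]


catalog = Catalog()
table = "//example/warehouse/tables/orders"  # hypothetical ODDRN

catalog.ingest("collector", [{"oddrn": table, "name": "orders"}])
catalog.ingest("collector", [{"oddrn": table, "name": "orders"}])  # repeated ingest
catalog.ingest("airflow-plugin", [{"oddrn": table, "last_run": "success"}])
```

After the three calls the toy catalog still holds a single entity, enriched by both producers, which is the property the later stages rely on.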
Deployment topology
Platform (Server)
The Spring-Boot application: Ingestion API, Platform API, UI, scheduled jobs (housekeeping, alerting, data-collaboration sender).
One Platform process plus PostgreSQL.
Collector
Container of pull adapters plus the runtime around them (adapter launcher, logger, Platform-API client, scheduler). The canonical implementation is odd-collector; cloud-specific siblings are odd-collector-aws, odd-collector-azure, odd-collector-gcp, and odd-collector-profiler.
One Collector container per cloud / source-family group, each holding many configured plugins (one per source instance).
Push-client (in-process plugin)
A push-strategy adapter that runs inside the source system's runtime — a dbt plugin, an Airflow plugin, a Great Expectations checkpoint action, a Spark listener, an odd-cli invocation.
Installed alongside the source application; emits metadata on the source's own cadence.
Per-tool repos under opendatadiscovery on GitHub; see Integrations.
Push-client (standalone gateway)
A push-strategy adapter that runs as its own service. Source systems push over an externally-defined wire protocol (today: OpenTelemetry/OTLP for odd-tracing-gateway); the gateway processes the input and exposes the inferred entities for the Platform / a collector to pull through the standard adapter-contract entities API.
One gateway process per network perimeter that needs aggregated push ingress.
UI
Single-page React application served from the Platform process at /.
Same process as the Platform — operators do not deploy the UI separately.
(Configured indirectly through odd.platform-base-url; see Configure ODD Platform.)
Centralised: the Platform (one server, one PostgreSQL) and the UI (served from the same process). Distributed: every Collector and every Push-client lives in or beside its source system. The reason ODD scales to many sources is that the producer side is horizontally distributable while the catalog stays a single coherent surface.
Cross-cutting concerns
A landing-level pointer per concern; every link below has its own canonical home with the full operator detail.
Authentication. Three deliberately decoupled surfaces: UI / Platform-API auth (Disabled / Login form / OAUTH2 / LDAP), separate server-to-server (S2S) tokens for programmatic clients, and an independent Ingestion-API filter for producer traffic. See Enable security.
Alerting. Platform-detected (failed jobs, failed DQ tests, schema-incompatible changes, distribution anomalies) and externally-injected (Prometheus AlertManager via /ingestion/alert/alertmanager). Dispatch goes to in-app tabs, optional Slack webhook, optional email. See Active platform features → Alerting and Active platform features → Notifications, with the operator-side configuration on Configure ODD Platform → Enable Alert Notifications.
Lineage. Cross-system upstream / downstream graphs at entity granularity, plus group lineage for Data Entity Groups (including ML experiments). See Data Lineage → Data Objects Lineage and the API Reference → Lineage sub-page.
Search. Free-text plus seven facets — Datasource, Type, Namespace, Owner, Tag, Groups, Statuses. Complemented by the Directory's hierarchy-driven browse. See Data Discovery and the dedicated Search and Filtering page for the per-facet semantics and the per-result transparency icons.
Attachments. Per-entity files (PNG / PDF / docs) stored locally or to a REMOTE S3-compatible bucket. The default is local file system (./attachments/); explicitly switch to REMOTE for production deployments. See Configure ODD Platform → Attachment storage for the operator caveats.
Data Collaboration. Optional Slack-based per-entity discussion threads (full Slack app via OAuth + Events API webhook). Distinct from the alert webhook. See Active platform features → Data Collaboration and Configure ODD Platform → Enable Data Collaboration.
GenAI proxy. Optional thin proxy from the platform to an external AI service the operator runs. The platform itself does not embed an LLM. See Active platform features → GenAI assistant.
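For the externally-injected alert path listed under Alerting, the sketch below shows how alerts could be grouped by the entity they refer to. It assumes the generic Alertmanager webhook shape (an `alerts` list whose items carry `labels`) and assumes `entity_oddrn` travels as one of those labels; the function name, the sample payload, and the ODDRN string are hypothetical, and the platform's real handler may differ.

```python
from collections import defaultdict


def alerts_by_entity(payload: dict) -> dict:
    """Group alert names by the ODDRN found in each alert's labels."""
    grouped = defaultdict(list)
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        oddrn = labels.get("entity_oddrn")
        if oddrn is None:
            continue  # no routing key, so the alert cannot be tied to an entity
        grouped[oddrn].append(labels.get("alertname", "unnamed"))
    return dict(grouped)


# Hypothetical payload in the general shape Alertmanager posts to webhooks.
sample = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "HighNullRate",
                    "entity_oddrn": "//example/warehouse/tables/orders"}},
        {"labels": {"alertname": "StaleTable",
                    "entity_oddrn": "//example/warehouse/tables/orders"}},
        {"labels": {"alertname": "DiskFull"}},  # not tied to a catalog entity
    ],
}

routed = alerts_by_entity(sample)
```

Alerts without the label are skipped here rather than rejected; that is a choice made for the sketch, not a statement about platform behaviour.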
Pull vs Push — when to choose which
Both topologies feed the same catalog through the same Ingestion API. The choice is operational:
Pull (Collector) when the source is a passive data store (database, warehouse, BI tool, ML registry, message broker) and you want point-in-time snapshots on a cadence. The Collector drives; the source has no awareness of the catalog. Most data-source integrations work this way.
Push (Push-client) when the source is an already-running application that you can instrument (Airflow DAGs, dbt runs, Spark jobs, Great Expectations validations, your own services calling odd-cli) and you want per-run lineage and results reported as they happen. The source drives; latency from event to catalog is bounded by the producer's own emit cadence.
Both at once is normal: a pull Collector indexes the warehouse catalog while an Airflow Push-client reports per-run lineage on top of it.
For the in-the-spec view of push-strategy producers, see the push model section of the ODD Specification.
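The difference in who drives can be sketched in a few lines. Everything below is illustrative: `Sink` stands in for "send one batch to the Ingestion API", and `run_pull_cycle`, `PushEmitter`, the payload fields, and the ODDRN strings are hypothetical names, not part of any ODD package.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for "send one batch to the Ingestion API".
Sink = Callable[[list], None]


def run_pull_cycle(read_snapshot: Callable[[], Iterable[dict]], sink: Sink) -> None:
    """Collector style: the producer decides when to read the source."""
    sink(list(read_snapshot()))


class PushEmitter:
    """Push-client style: the source calls in when something happens."""

    def __init__(self, sink: Sink) -> None:
        self._sink = sink

    def on_run_finished(self, job_oddrn: str, status: str) -> None:
        # Emit one small batch per event, on the source's own cadence.
        self._sink([{"oddrn": job_oddrn, "status": status}])


received: list = []       # batches in arrival order
sink = received.append    # both producer styles share the same sink

# Pull: a scheduler would call this on a cadence; one cycle is shown.
run_pull_cycle(lambda: [{"oddrn": "//example/warehouse/tables/orders"}], sink)

# Push: the instrumented application reports a finished run.
PushEmitter(sink).on_run_finished("//example/airflow/dags/load_orders", "SUCCESS")
```

Both calls end up at the same sink, which mirrors the point that the producer families differ only in how a batch is triggered.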
ODDRN
ODDRN (Open Data Discovery Resource Name) is the stable string that identifies every entity in the system — a dataset, a column, a data source, a pipeline run. Producers generate an ODDRN for each entity they emit so the platform can recognise the same entity across ingests, across producers, and over time. ODDRN is what makes cross-system lineage possible, what makes idempotent ingests possible, and what gives the AlertManager webhook its entity_oddrn routing key.
Operators rarely interact with ODDRNs directly — they become relevant when authoring a custom adapter. See ODDRN for the format, examples, and the generator libraries for Python and Java; see Build a custom collector for the end-to-end Python pattern.
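To make the "stable string" idea concrete, here is a hand-rolled sketch of a deterministic identifier builder. It is not the ODD generator library, and the path layout it produces is an assumption chosen for illustration; the ODDRN page defines the real format. The `build_oddrn` name and the host, database, and table values are hypothetical.

```python
def build_oddrn(source_type: str, **path: str) -> str:
    """Join a source type and ordered key/value path segments into one string."""
    parts = [f"//{source_type}"]
    for key, value in path.items():  # keyword order is preserved
        parts.extend([key, value])
    return "/".join(parts)


# The same inputs always yield the same identifier, whichever producer runs this.
from_collector = build_oddrn(
    "postgresql", host="db.example.com", databases="shop", tables="orders"
)
from_plugin = build_oddrn(
    "postgresql", host="db.example.com", databases="shop", tables="orders"
)
```

Because the string depends only on the entity's coordinates, two independent producers that describe the same table arrive at the same key without coordinating.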
Where to read the code
The mapping from this overview to the actual code lives in the workspace's navigation domain pages — navigation/domains/{feature}.md files maintain controller / service / configuration / UI pointers per feature so a reader does not need to grep. The contributor-facing entry points on the public doc tree are:
GitHub organization overview — every ODD repository with a one-line summary.
Build and run — Platform and Collector build / deploy walkthroughs, plus the Build a custom collector developer guide.
Main Concepts — the producer-side vocabulary (Adapter / Plugin / Collector / Push adapter / Data source) and the Data Governance map (which pillars ODD covers, which are roadmap).
API Reference — the canonical hub for every HTTP endpoint, with per-feature sub-pages.