Main Concepts

Core vocabulary and mental model for the Open Data Discovery project — what the pieces are, how they fit together, and where to dive deeper.

This page introduces the core vocabulary of the Open Data Discovery (ODD) project. It is a map — each concept gets a short definition and a link to its canonical deep-dive page.

Not the Business Glossary. ODD Platform ships an in-app Business Glossary feature (term entities you can link to datasets, term-to-term relationships, ownership). That is a different thing from this docs page. See the Business Glossary feature page for the product feature.

The architecture chain

Metadata flows from data systems into the platform along two paths — pull (a collector polls the source) and push (an adapter embedded inside the source's runtime emits directly to the platform):

  • Pull path: Data source ← Pull adapter (wrapped as a Plugin inside a Collector) → ODD Platform

  • Push path: Data source's application runtime → Push adapter → ODD Platform

The producer-side concepts:

  • Data source — a system holding data or data-adjacent metadata: a database, a warehouse, a BI tool, an ML training registry, an orchestrator.

  • Adapter — a set of scripts that map metadata from a source system (PostgreSQL, MySQL, Airflow, Kafka, …) to the ODD specification — Data Entities, data types, lineage edges, quality tests. An adapter's job is extract-and-map, nothing more. Adapters come in two flavours: pull (reads from the source on a schedule) and push (emits from inside the source's runtime). An adapter never runs alone; it is either hosted by a collector (pull) or packaged as a push adapter (push).

  • Plugin — a configured adapter instance inside a collector. One plugin carries one adapter's connection and schedule settings (source host, database, credentials, cadence). A single collector can host many plugins — multiple instances of the same adapter type (e.g., two PostgreSQL plugins pointing at different hosts or databases) or plugins for different adapter types.

  • Collector — a container of pull adapters plus the runtime around them: adapter launcher, logging system, Platform-API client, configuration reader, scheduling. A collector is what you deploy; the pull adapters inside it are the mappers, and each one is configured via a plugin. The canonical implementation is odd-collector with 40+ bundled pull adapters; specialist collectors exist for AWS, GCP, Azure, and data profiling. A collector is not a synonym for "pull adapter".

  • Push adapter (also known as push-client) — a push-strategy adapter; the source initiates the data flow and the adapter knows the platform's endpoint. Push adapters ship in three deployment shapes:

    • In-process plugin / extension — embedded in the source system's own runtime: a dbt plugin, a Great Expectations checkpoint action, an Airflow plugin, a Spark listener. The most common shape today.

    • Standalone gateway: a separate service that source systems push to (today's only example: odd-tracing-gateway, which receives OpenTelemetry traces). From the operator's point of view this is push; the leg between the gateway and the Platform is a pull, which the gateway's standalone deployment hides.

    • Direct SDK / CLI use — push via a CLI or library call from custom code (odd-cli invocation, custom Python using odd-models-package).

    All three shapes are extract-and-map adapters; what differs from a pull adapter is deployment topology — the adapter does not live in a collector container.

  • ODD Platform — the central server: stores the metadata, provides search, lineage, ownership, alerts, DQ dashboards, and the UI.
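
The relationship between collector, plugin, and pull adapter described above can be sketched in a few lines of Python. This is an illustrative model only: the class names, fields, and the `run_once` method are hypothetical and are not the odd-collector API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration; not the odd-collector API.
# An adapter only extracts and maps: connection settings in, entities out.
Adapter = Callable[[Dict[str, str]], List[dict]]


def postgresql_adapter(settings: Dict[str, str]) -> List[dict]:
    """Stand-in pull adapter: pretends to read one table from the source."""
    return [{"name": "ex_table", "source_host": settings["host"]}]


@dataclass
class Plugin:
    """One configured adapter instance: the adapter plus its settings."""
    name: str
    adapter: Adapter
    settings: Dict[str, str]


@dataclass
class Collector:
    """Hosts many plugins and supplies the runtime around them."""
    plugins: List[Plugin] = field(default_factory=list)

    def run_once(self) -> Dict[str, List[dict]]:
        # A real collector would schedule this and send the results to the
        # platform; here we only gather each plugin's output.
        return {p.name: p.adapter(p.settings) for p in self.plugins}


# Two plugins of the same adapter type pointing at different hosts.
collector = Collector(plugins=[
    Plugin("pg_sales", postgresql_adapter, {"host": "10.0.0.1"}),
    Plugin("pg_billing", postgresql_adapter, {"host": "10.0.0.2"}),
])
results = collector.run_once()
```

The point of the sketch is the ownership: the collector is the deployed unit, each plugin is one configuration of an adapter, and the adapter itself does nothing but map.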

Pick pull when the source is a data store and you want point-in-time snapshots on a cadence; most data-source integrations work this way, since the source is passive and the collector drives. Pick push when the source is an application already running code you can instrument (Airflow DAGs, dbt runs, Spark jobs, Great Expectations validations) and you want each run's lineage and results reported as they happen. Some ecosystems combine both: a pull collector indexes the catalog while a push-client reports per-run lineage.

See Architecture.md for the diagram, developer-guides/build-and-run/build-and-run-odd-collectors.md for deployment detail, and the specification's push-model note for protocol-level detail.

ODDRN

ODDRN (Open Data Discovery Resource Name) is the unique, stable string that identifies every entity in the system — a dataset, a column, a data source, a pipeline run, a transformer. Producers (collectors, push adapters, custom agents) must generate an ODDRN for each entity they report so the platform can recognise the same entity across ingests, across producers, and over time. ODDRNs are what make cross-system lineage possible.

Format. Every ODDRN starts with a double slash and the data-source family, followed by the connection coordinates that uniquely locate the entity: the host for self-hosted databases, the AWS account ID and region for cloud services, and so on. The format follows REST URL conventions. For example, a PostgreSQL table:

    //postgresql/host/1.2.3.4/databases/ex_database/schemas/public/tables/ex_table

where:

  • 1.2.3.4 — the PostgreSQL instance host

  • ex_database — the target database

  • public — the target schema

  • ex_table — the target table
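
A minimal sketch of how such a string could be assembled from the coordinates listed above. The helper `build_oddrn` and the segment names (`host`, `databases`, `schemas`, `tables`) are assumptions for illustration; in practice the generator libraries produce ODDRNs, and their output is authoritative.

```python
from typing import Sequence, Tuple


def build_oddrn(family: str, coordinates: Sequence[Tuple[str, str]]) -> str:
    """Join a data-source family and (segment, value) pairs into one path."""
    parts = [family]
    for segment, value in coordinates:
        parts.extend([segment, value])
    return "//" + "/".join(parts)


# Segment names below are assumed, not taken from the generator libraries.
table_oddrn = build_oddrn("postgresql", [
    ("host", "1.2.3.4"),
    ("databases", "ex_database"),
    ("schemas", "public"),
    ("tables", "ex_table"),
])
print(table_oddrn)
# //postgresql/host/1.2.3.4/databases/ex_database/schemas/public/tables/ex_table
```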

Usage. ODDRNs power the Ingestion API — the same string identifying the same entity across ingests is what lets the platform decide whether to create new entities, update existing ones, or delete obsolete ones on each payload. Operators rarely see ODDRNs directly; they become relevant when writing a custom agent. To assist, ODD ships open-source generator libraries for Python and Java; the Build a custom collector walkthrough covers the Python pattern end-to-end, including which Generator subclass to use per source family.
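
The create / update / delete decision can be pictured as a set comparison keyed on ODDRN. The sketch below is an assumption about the general idea, not the platform's implementation; the function name `reconcile` and the payload shape (a mapping from ODDRN to entity) are hypothetical.

```python
from typing import Dict, Set, Tuple


def reconcile(known: Set[str], payload: Dict[str, dict]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split ODDRNs into (create, update, delete) for one data source."""
    incoming = set(payload)
    to_create = incoming - known   # never seen before
    to_update = incoming & known   # same ODDRN, so same entity
    to_delete = known - incoming   # no longer reported by the producer
    return to_create, to_update, to_delete


known = {"//postgresql/host/1.2.3.4/databases/db/schemas/public/tables/a",
         "//postgresql/host/1.2.3.4/databases/db/schemas/public/tables/b"}
payload = {
    "//postgresql/host/1.2.3.4/databases/db/schemas/public/tables/b": {"name": "b"},
    "//postgresql/host/1.2.3.4/databases/db/schemas/public/tables/c": {"name": "c"},
}
create, update, delete = reconcile(known, payload)
```

If a producer changed the string it emits for table `b`, the comparison would treat it as a brand-new entity and the old one as obsolete, which is why ODDRNs have to stay stable.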

Known limitation. All consumers of the Ingestion API must use the same ODDRN string for the same entity. Since ODDRNs encode connection coordinates, this means agents reporting on the same data infrastructure must agree on hostnames or static IPs — coordinate identifiers across your deployment if multiple agents touch the same source.
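
To make the limitation concrete, here is a sketch of two agents that reach the same PostgreSQL instance under different names, and one possible mitigation: a shared host mapping that every agent applies before building ODDRNs. The mapping, the helper names, and the path segment names are all hypothetical.

```python
# Hypothetical shared mapping that every agent in the deployment loads.
CANONICAL_HOSTS = {
    "db.internal": "1.2.3.4",
    "pg-primary": "1.2.3.4",
}


def canonical_host(host: str) -> str:
    """Return the agreed identifier for a host, or the host unchanged."""
    return CANONICAL_HOSTS.get(host, host)


def table_oddrn(host: str, database: str, schema: str, table: str) -> str:
    # Segment names are assumptions for illustration.
    return (f"//postgresql/host/{host}/databases/{database}"
            f"/schemas/{schema}/tables/{table}")


# Agent A connects by DNS name, agent B by IP address.
raw_a = table_oddrn("db.internal", "ex_database", "public", "ex_table")
raw_b = table_oddrn("1.2.3.4", "ex_database", "public", "ex_table")
mismatch = raw_a != raw_b  # two different strings for one physical table

# After both agents canonicalise the host, the strings agree.
fixed_a = table_oddrn(canonical_host("db.internal"), "ex_database", "public", "ex_table")
fixed_b = table_oddrn(canonical_host("1.2.3.4"), "ex_database", "public", "ex_table")
```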

ODD Specification

The ODD Specification is the wire contract between producers (collectors, push-clients) and the platform — the Ingestion API schema. It decouples the two sides: any producer that speaks the specification can feed any compliant platform. This is what makes custom agents and third-party collectors possible.
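
As a rough picture of what "speaking the specification" means for a producer, the sketch below serialises a list of entities into one JSON document. The field names (`data_source_oddrn`, `items`, `oddrn`, `name`, `type`), the `TABLE` value, and the required-field check are assumptions for illustration; the specification itself defines the real schema.

```python
import json
from typing import List

# Assumed minimal field set; consult the specification for the real schema.
REQUIRED_ENTITY_FIELDS = {"oddrn", "name", "type"}


def build_payload(data_source_oddrn: str, entities: List[dict]) -> str:
    """Serialise entities into one JSON document a producer could send."""
    for entity in entities:
        missing = REQUIRED_ENTITY_FIELDS - set(entity)
        if missing:
            raise ValueError(f"entity missing fields: {sorted(missing)}")
    return json.dumps({"data_source_oddrn": data_source_oddrn, "items": entities})


payload = build_payload(
    "//postgresql/host/1.2.3.4/databases/ex_database",
    [{
        "oddrn": "//postgresql/host/1.2.3.4/databases/ex_database"
                 "/schemas/public/tables/ex_table",
        "name": "ex_table",
        "type": "TABLE",
    }],
)
```

Whether the document comes from a collector or a push-client makes no difference to the receiver, which is the decoupling the specification provides.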

Data Governance map

A structured view of how ODD's functionality maps onto recognised data governance pillars. Use this to answer "does ODD do X?" for your governance framework.

  • Data Discovery [available]. The core of the platform: catalog search with multiple facets, entity pages (datasets, transformers, consumers, quality tests, ML models), tags, ownership, and the Directory view. See the Data Discovery pillar landing for the four entry paths (Search, Directory, Tagging, Data Entity Groups & Domains) and the Catalog Overview home page.

  • Data Lineage [available]. Upstream and downstream lineage across the full entity model, not just datasets: pipelines, ML experiments, and quality tests all participate, plus microservices traced through OpenTelemetry. See the Data Lineage pillar landing.

  • Data Quality [available]. Per-entity test results surfaced on entity pages, the catalog-wide Data Quality dashboard, and operator-set Minor / Major / Critical SLA statuses. See the Data Quality pillar landing and Visibility for Data Quality Engineer.

  • Data Modelling [partially available]. Data Entity Groups (DEGs) for logical grouping and entity-relationship (ERD) views are available today. Schema evolution signals (backwards-incompatible change triggers) are surfaced in alerts. See Dataset schema diff and the Data Modelling pillar.

  • Data Glossary [available]. The in-app Business Glossary feature: term entities with term-to-term and term-to-data-entity linking, ownership, and tags. Distinct from this Main Concepts page: Business Glossary is a product feature, Main Concepts is documentation. See the Data Glossary pillar landing and the Business Glossary reference.

  • Master Data Management (incl. Reference Data Management) [partially available]. Lookup Tables provide operator-managed reference data as first-class entities in the catalog. Full MDM semantics (golden records, survivorship rules, stewardship workflows) are not part of ODD today; what ships is reference-data management. See the Master Data Management pillar landing and the Lookup Tables feature page.

  • Data Cost [roadmap]. Cost attribution to datasets, pipelines, and owners is not implemented today.

  • Data Security (governance-level) [roadmap]. Data classification, sensitivity tagging, PII/PHI handling, and fine-grained data-access control sit on the roadmap. This is different from platform-access security (who can log in, what roles they have, what policies apply to the UI/API); that is already shipped and documented under configuration-and-deployment/enable-security/README.md.

Pillar differentiation

The six available / partially-available pillars are conceptually distinct because each captures a different operator workflow:

  • Data Discovery is location-oriented — finding existing entities by search, browse, or home-page surfacing. Entities come from collectors and push adapters; this pillar provides the navigation paths into the catalog.

  • Data Modelling is contract-oriented — describing how a dataset is queried (Query Examples) and connected (Relationships / ERDs). The dataset itself comes from outside; the platform records intent and structure on top.

  • Master Data Management is operator-curated reference data — the canonical lookup tables managed inside the platform. There is no external source; the platform is the system of record.

  • Data Lineage is connection-oriented: it describes how entities flow into and out of each other across pipelines and microservices. Lineage is the cross-pillar record, because every entity has a lineage alongside its structure, meaning, location, and quality signal.

  • Data Glossary is meaning-oriented — naming and describing the concepts the data represents. Terms are first-class catalog entities with their own lifecycle, ownership, RBAC, and search surface; not metadata attached to other entities.

  • Data Quality is correctness-oriented — test results, anomaly classes, dataset SLAs. Every catalogued dataset has a quality story, even if it is only "no checks defined".

That difference shows up in where the data lives: Data Modelling artefacts attach to existing entities; Master Data artefacts are entities (Lookup Tables exist as Data Entities of type LOOKUP_TABLE); Data Quality results are pushed in by external frameworks; Lineage edges are computed from the connection graph. The six pillars sit alongside each other in the Data Governance map above, not nested.

AI aspects

ODD integrates AI/GenAI capabilities in a few places:

  • GenAI assistant — opt-in proxy from a single platform endpoint to an external AI service the operator runs (the platform does not embed an LLM). API-only today. See the GenAI assistant page for configuration, the external service contract, and operator caveats.

  • Data profiling — automatic statistical profiles for datasets (null ratios, distributions, cardinality) via odd-collector-profiler. Surfaces on entity pages.

  • ML experiment / model lineage — experiments and trained models are first-class entities with their own lineage edges; useful for reproducibility and governance of ML pipelines.
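
To give a feel for the statistics named in the data profiling item above, here is a minimal sketch that computes a null ratio, a cardinality, and a value distribution for a single column. It uses only the standard library and is not how odd-collector-profiler computes its profiles; the function name and the output keys are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def profile_column(values: Sequence[Any]) -> Dict[str, Any]:
    """Compute a null ratio, cardinality, and value counts for one column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        # Share of values that are missing; 0.0 for an empty column.
        "null_ratio": (total - len(non_null)) / total if total else 0.0,
        # Number of distinct non-null values.
        "cardinality": len(set(non_null)),
        # How often each non-null value occurs.
        "distribution": dict(Counter(non_null)),
    }


profile = profile_column(["a", "b", "a", None])
```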

Terms & Aliases

A living record of synonyms and aliases users may search for. If you know a feature by a different name, start here.

| Canonical term | Also known as | What it is | Details |
|---|---|---|---|
| Server-to-server (S2S) authentication | Machine-to-machine (M2M) tokens, M2M auth | Static API-key authentication for programmatic clients | |
| Ingestion authentication filter | Ingestion filter, ingestion API key | Token-based auth for /ingestion/**; independent of UI auth, off by default | Ingestion authentication in Enable security |
| Collector secrets backend | Alternative secrets backend | Store collector credentials in an external secret store (AWS SSM) instead of YAML | |
| ODDRN | Open Data Discovery Resource Name | Stable string identifying every entity in the system | |
| Business Glossary (feature) | Glossary, Terms | In-app feature for managing term entities and linking them to datasets; not this Main Concepts page | |
| Data Entity Group | DEG | Logical grouping of data entities inside the catalog | |
| ML Experiments | ML Experiment Logging (deprecated) | A Data Entity Group collecting the entities produced by one training run: inputs, jobs, models, artifacts. Catalog view, not a metrics tracker. | |
| ODD Specification | Ingestion API spec, ingress API | Wire contract between producers and the platform | |
| Integration | | Umbrella term for any path metadata takes from a source into the Platform: collectors (pull) and push adapters (push). Prefer in user-facing prose unless direction (pull/push) matters. | |
| Adapter | | Source→spec mapper (push or pull); extract-and-map only, never runs alone. Classified by strategy (pull / push) and deployment shape. | |
| Pull adapter | | Pull-strategy adapter that reads from the source on a cadence; the adapter knows the source endpoint and credentials. Today always paired with the collector-hosted deployment shape (configured via a plugin). | |
| Plugin | Adapter instance, adapter config | A configured pull-adapter instance inside a collector. Push adapters do not use the plugin term. | |
| Collector | Pull-adapter container (informal) | Container of pull adapters + runtime; the collector-hosted deployment shape. Not a synonym for "pull adapter". | |
| Push adapter | Push-client (client-server framing) | Push-strategy adapter; the source initiates the data flow. Three deployment shapes: in-process plugin / extension (dbt, GE, Airflow, Spark), standalone gateway (odd-tracing-gateway), direct SDK / CLI use (odd-cli). Used when the discussion is about extract-and-map mechanics. | |
| Push-client | Push adapter (extract-and-map framing) | Same component as Push adapter, framed from client-server topology: a producer-side client of the Platform server using push strategy. Used when the discussion is about deployment topology or network position. | |
| Standalone gateway | Push-adapter standalone shape, OTel gateway, tracing gateway | A push-adapter deployment shape: a separate service that source systems push to over an externally-defined wire protocol (today: OpenTelemetry/OTLP), with the Platform pulling the inferred entities. Today's only example is odd-tracing-gateway. Distinct from in-process plugins (which live inside a source tool's runtime) and from collectors (which host pull adapters). | |
| Catalog Overview page | Overview page, Main page, Data Entity Report (deprecated as page synonym) | The catalog's home page: main search, top tags, domains, the per-class Entities report, directory, and (when auth is on) owner association. Distinct from a data entity's own Overview tab, which is the per-entity landing view inside a detail page. | Data Discovery (the bucket landing the home page surfaces) |
| Master Data Management | MDM, Reference Data Management, Reference Data | The Data Governance pillar covering operator-curated reference data managed inside the platform. ODD ships the Reference-Data subset (Lookup Tables); golden records / survivorship / stewardship workflows are not part of ODD today. | |
| Lookup Tables | Reference tables, Master Data tables | Operator-curated reference tables managed inside the platform: schema, data, RBAC, API surface. Exposed in the catalog as Data Entities of type LOOKUP_TABLE. UI section: Master Data top-level tab. | |
| Slack alert webhook | Slack notifications, Slack incoming webhook | Outgoing-only HTTP POST of alert messages into a Slack channel via notifications.receivers.slack.url. One-way write: no thread state, no replies read back. Distinct from the Slack collaboration app. Consumer: SlackNotificationSender (gated by @ConditionalOnProperty(name = "notifications.receivers.slack.url")). | |
| Slack collaboration app | Slack Events API, Slack OAuth integration, Data Collaboration Slack | Full Slack app for in-app per-entity discussion threads: OAuth (datacollaboration.slack-oauth-token) plus the Slack Events API webhook to read replies back into the platform; bidirectional. Distinct from the Slack alert webhook. Routes gated by @ConditionalOnDataCollaboration (returns 404 Not Found when datacollaboration.enabled=false). | |

New aliases get added as they're discovered. If you notice a term that is missing or ambiguous, open an issue or a PR — the goal is that searching any common name lands you on the right page.
