Main Concepts
Core vocabulary and mental model for the Open Data Discovery project — what the pieces are, how they fit together, and where to dive deeper.
This page introduces the core vocabulary of the Open Data Discovery (ODD) project. It is a map — each concept gets a short definition and a link to its canonical deep-dive page.
Not the Business Glossary. ODD Platform ships an in-app Business Glossary feature (term entities you can link to datasets, term-to-term relationships, ownership). That is a different thing from this docs page. See the Business Glossary feature page for the product feature.
The architecture chain
Metadata flows from data systems into the platform along two paths — pull (a collector polls the source) and push (an adapter embedded inside the source's runtime emits directly to the platform):
Pull path: Data source ← Pull adapter (wrapped as a Plugin inside a Collector) → ODD Platform
Push path: Data source's application runtime → Push adapter → ODD Platform
The producer-side concepts:
Data source — a system holding data or data-adjacent metadata: a database, a warehouse, a BI tool, an ML training registry, an orchestrator.
Adapter — a set of scripts that map metadata from a source system (PostgreSQL, MySQL, Airflow, Kafka, …) to the ODD specification — Data Entities, data types, lineage edges, quality tests. An adapter's job is extract-and-map, nothing more. Adapters come in two flavours: pull (reads from the source on a schedule) and push (emits from inside the source's runtime). An adapter never runs alone; it is either hosted by a collector (pull) or packaged as a push adapter (push).
Plugin — a configured adapter instance inside a collector. One plugin carries one adapter's connection and schedule settings (source host, database, credentials, cadence). A single collector can host many plugins — multiple instances of the same adapter type (e.g., two PostgreSQL plugins pointing at different hosts or databases) or plugins for different adapter types.
Collector — a container of pull adapters plus the runtime around them: adapter launcher, logging system, Platform-API client, configuration reader, scheduling. A collector is what you deploy; the pull adapters inside it are the mappers, and each one is configured via a plugin. The canonical implementation is odd-collector with 40+ bundled pull adapters; specialist collectors exist for AWS, GCP, Azure, and data profiling. A collector is not a synonym for "pull adapter".
Push adapter (also known as push-client) — a push-strategy adapter; the source initiates the data flow and the adapter knows the platform's endpoint. Push adapters ship in three deployment shapes:
In-process plugin / extension — embedded in the source system's own runtime: a dbt plugin, a Great Expectations checkpoint action, an Airflow plugin, a Spark listener. The most common shape today.
Standalone gateway — a separate service that source systems push to (today's only example: odd-tracing-gateway, which receives OpenTelemetry traces). The operator's mental model is "push"; the Platform-side leg is a pull hidden behind the gateway's standalone deployment.
Direct SDK / CLI use — push via a CLI or library call from custom code (odd-cli invocation, custom Python using odd-models-package).
All three shapes are extract-and-map adapters; what differs from a pull adapter is deployment topology — the adapter does not live in a collector container.
ODD Platform — the central server: stores the metadata, provides search, lineage, ownership, alerts, DQ dashboards, and the UI.
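The relationship between collector, plugin, and adapter type described above can be sketched as a small data model. This is an illustration only: the `Plugin` and `Collector` classes, their field names, and the example hosts and URL are hypothetical and are not the configuration schema of odd-collector.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Plugin:
    """One configured pull-adapter instance (hypothetical fields)."""
    adapter_type: str      # which adapter does the mapping, e.g. "postgresql"
    name: str              # operator-chosen label for this instance
    host: str              # connection settings for the source
    database: str
    interval_minutes: int  # polling cadence


@dataclass
class Collector:
    """Container that hosts plugins; the real runtime would also schedule them."""
    platform_url: str
    plugins: list[Plugin] = field(default_factory=list)

    def add(self, plugin: Plugin) -> None:
        # Plugin names must be unique inside one collector in this sketch.
        if any(p.name == plugin.name for p in self.plugins):
            raise ValueError(f"duplicate plugin name: {plugin.name}")
        self.plugins.append(plugin)

    def adapter_types(self) -> set[str]:
        return {p.adapter_type for p in self.plugins}


collector = Collector(platform_url="http://odd-platform.example:8080")
# Two instances of the same adapter type pointing at different hosts...
collector.add(Plugin("postgresql", "orders_db", "10.0.0.5", "orders", 60))
collector.add(Plugin("postgresql", "billing_db", "10.0.0.6", "billing", 60))
# ...and one plugin for a different adapter type.
collector.add(Plugin("mysql", "legacy_crm", "10.0.0.7", "crm", 120))

print(len(collector.plugins), sorted(collector.adapter_types()))
```

The point of the model is the cardinality: one deployed collector, many plugins, and each plugin binding one adapter type to one set of connection and schedule settings.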
Pick pull when the source is a data store and you want point-in-time snapshots on a cadence — most data-source integrations work this way, since the source is passive and the collector drives. Pick push when the source is an application already running code we can instrument — Airflow DAGs, dbt runs, Spark jobs, Great Expectations validations — and you want each run's lineage and results reported as they happen. Some ecosystems combine both: a pull collector indexes the catalog while a push-client reports per-run lineage.
See Architecture.md for the diagram, developer-guides/build-and-run/build-and-run-odd-collectors.md for deployment detail, and the specification's push-model note for protocol-level detail.
ODDRN
ODDRN (Open Data Discovery Resource Name) is the unique, stable string that identifies every entity in the system — a dataset, a column, a data source, a pipeline run, a transformer. Producers (collectors, push adapters, custom agents) must generate an ODDRN for each entity they report so the platform can recognise the same entity across ingests, across producers, and over time. ODDRNs are what make cross-system lineage possible.
Format. Every ODDRN starts with a double slash and the data-source family, followed by the connection coordinates that uniquely locate the entity in the world — host for self-hosted databases, AWS account ID + region for cloud services, etc. The format follows REST URL conventions. In an ODDRN for a PostgreSQL table, for example, the path segments carry:
1.2.3.4 — the PostgreSQL instance host
ex_database — the target database
public — the target schema
ex_table — the target table
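A minimal sketch of how such a string could be assembled from the four coordinates listed above. The segment keywords (`host`, `databases`, `schemas`, `tables`) and the helper name `postgres_table_oddrn` are assumptions made for illustration; real producers should rely on the ODD generator libraries rather than hand-built strings.

```python
def postgres_table_oddrn(host: str, database: str, schema: str, table: str) -> str:
    """Join the data-source family and connection coordinates into one path."""
    parts = [
        "postgresql",          # data-source family
        "host", host,          # where the instance lives
        "databases", database,
        "schemas", schema,
        "tables", table,
    ]
    # Leading double slash, then REST-style key/value path segments.
    return "//" + "/".join(parts)


oddrn = postgres_table_oddrn("1.2.3.4", "ex_database", "public", "ex_table")
print(oddrn)
# //postgresql/host/1.2.3.4/databases/ex_database/schemas/public/tables/ex_table
```

Because the string is built only from stable coordinates, reporting the same table twice yields the same identifier.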
Usage. ODDRNs power the Ingestion API — the same string identifying the same entity across ingests is what lets the platform decide whether to create new entities, update existing ones, or delete obsolete ones on each payload. Operators rarely see ODDRNs directly; they become relevant when writing a custom agent. To assist, ODD ships open-source generator libraries for Python and Java; the Build a custom collector walkthrough covers the Python pattern end-to-end, including which Generator subclass to use per source family.
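The create / update / delete decision can be pictured as a set comparison keyed by ODDRN. The sketch below assumes each payload is a full snapshot for one data source and that entities are plain dictionaries; `plan_ingest` is a hypothetical helper, not the platform's actual reconciliation logic.

```python
def plan_ingest(stored: dict[str, dict], incoming: dict[str, dict]):
    """Compare stored and incoming entities, both keyed by ODDRN."""
    stored_keys, incoming_keys = set(stored), set(incoming)
    # Present only in the payload: new entities.
    create = sorted(incoming_keys - stored_keys)
    # Present only in storage: no longer reported by the producer.
    delete = sorted(stored_keys - incoming_keys)
    # Present in both, but with different metadata.
    update = sorted(
        key for key in stored_keys & incoming_keys
        if stored[key] != incoming[key]
    )
    return create, update, delete


base = "//postgresql/host/1.2.3.4/databases/ex_database/schemas/public/tables"
stored = {
    f"{base}/users": {"columns": 4},
    f"{base}/orders": {"columns": 7},
}
incoming = {
    f"{base}/users": {"columns": 5},      # changed
    f"{base}/payments": {"columns": 3},   # new
}

create, update, delete = plan_ingest(stored, incoming)
print(create, update, delete)
```

If a producer changed the ODDRN of an unchanged table between ingests, this comparison would report one deletion and one creation instead of no change, which is why stable identifiers matter.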
Known limitation. All consumers of the Ingestion API must use the same ODDRN string for the same entity. Since ODDRNs encode connection coordinates, this means agents reporting on the same data infrastructure must agree on hostnames or static IPs — coordinate identifiers across your deployment if multiple agents touch the same source.
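One way to coordinate identifiers is a shared alias table that every agent applies before building an ODDRN. The sketch below is an assumption about how a deployment might do this; `HOST_ALIASES`, `canonical_host`, and the path layout are hypothetical and are not part of any ODD library.

```python
# Hypothetical alias table distributed to every agent in the deployment.
HOST_ALIASES = {
    "pg-primary.internal": "10.0.0.5",
    "orders-db": "10.0.0.5",
}


def canonical_host(host: str) -> str:
    """Map any known alias to the one agreed-upon host string."""
    key = host.strip().lower()
    return HOST_ALIASES.get(key, key)


def table_oddrn(host, database, schema, table, normalize=True):
    if normalize:
        host = canonical_host(host)
    return f"//postgresql/host/{host}/databases/{database}/schemas/{schema}/tables/{table}"


# Without coordination, two agents describe the same table differently.
raw_a = table_oddrn("pg-primary.internal", "orders", "public", "invoice", normalize=False)
raw_b = table_oddrn("10.0.0.5", "orders", "public", "invoice", normalize=False)

# With the shared alias table, both agents converge on one string.
fixed_a = table_oddrn("pg-primary.internal", "orders", "public", "invoice")
fixed_b = table_oddrn("10.0.0.5", "orders", "public", "invoice")

print(raw_a == raw_b, fixed_a == fixed_b)
```

The mismatch in the first pair is exactly the situation the limitation warns about: two distinct strings for one physical entity.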
ODD Specification
The ODD Specification is the wire contract between producers (collectors, push-clients) and the platform — the Ingestion API schema. It decouples the two sides: any producer that speaks the specification can feed any compliant platform. This is what makes custom agents and third-party collectors possible.
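To make the idea of a wire contract concrete, here is a minimal sketch of a producer assembling a payload and checking it before sending. The field names (`data_source_oddrn`, `items`, `oddrn`, `name`, `type`) and the helper functions are illustrative assumptions; the ODD Specification itself is the authoritative schema.

```python
import json


def build_payload(data_source_oddrn: str, tables: list[tuple[str, str]]) -> dict:
    """Assemble a list of entities for one data source (illustrative shape)."""
    items = [
        {
            "oddrn": f"{data_source_oddrn}/schemas/{schema}/tables/{name}",
            "name": name,
            "type": "TABLE",
        }
        for schema, name in tables
    ]
    return {"data_source_oddrn": data_source_oddrn, "items": items}


def validate(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the sketch's checks pass."""
    problems = []
    if not payload.get("data_source_oddrn", "").startswith("//"):
        problems.append("data_source_oddrn must start with //")
    for index, item in enumerate(payload.get("items", [])):
        for key in ("oddrn", "name", "type"):
            if not item.get(key):
                problems.append(f"items[{index}] is missing {key}")
    return problems


payload = build_payload(
    "//postgresql/host/1.2.3.4/databases/ex_database",
    [("public", "ex_table")],
)
print(json.dumps(payload, indent=2))
print(validate(payload))
```

Any producer that emits a payload matching the agreed schema can feed the platform, regardless of the language or runtime it is written in.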
Data Governance map
A structured view of how ODD's functionality maps onto recognised data governance pillars. Use this to answer "does ODD do X?" for your governance framework.
Data Discovery — available. The core of the platform: catalog search with multiple facets, entity pages (datasets, transformers, consumers, quality tests, ML models), tags, ownership, and the Directory view. See the Data Discovery pillar landing for the four entry paths (Search, Directory, Tagging, Data Entity Groups & Domains) and the Catalog Overview home page.
Data Lineage — available. Upstream and downstream lineage across the full entity model, not just datasets — pipelines, ML experiments, and quality tests all participate, plus microservices traced through OpenTelemetry. See the Data Lineage pillar landing.
Data Quality — available. Per-entity test results surfaced on entity pages, the catalog-wide Data Quality dashboard, and operator-set Minor / Major / Critical SLA statuses. See the Data Quality pillar landing and Visibility for Data Quality Engineer.
Data Modelling — partially available. Data Entity Groups (DEGs) provide logical grouping, and entity relationship / ERD views are available today. Schema evolution signals (backwards-incompatible change triggers) are surfaced in alerts. See Dataset schema diff and the Data Modelling pillar.
Data Glossary — available. The in-app Business Glossary feature — term entities with term-to-term and term-to-data-entity linking, ownership, tags. Distinct from this Main Concepts page: Business Glossary is a product feature, Main Concepts is documentation. See the Data Glossary pillar landing and the Business Glossary reference.
Master Data Management (incl. Reference Data Management) — partially available. Lookup Tables provide operator-managed reference data as first-class entities in the catalog. Full MDM semantics (golden records, survivorship rules, stewardship workflows) are not part of ODD today — what ships is reference-data management. See the Master Data Management pillar landing and the Lookup Tables feature page.
Data Cost — roadmap. Cost attribution to datasets, pipelines, and owners is not implemented today.
Data Security (governance-level) — roadmap. Data classification, sensitivity tagging, PII/PHI handling, and fine-grained data-access control sit on the roadmap. This is different from platform-access security (who can log in, what roles they have, what policies apply to the UI/API) — that is already shipped and documented under configuration-and-deployment/enable-security/README.md.
Pillar differentiation
The six available / partially-available pillars are conceptually distinct because each captures a different operator workflow:
Data Discovery is location-oriented — finding existing entities by search, browse, or home-page surfacing. Entities come from collectors and push adapters; this pillar provides the navigation paths into the catalog.
Data Modelling is contract-oriented — describing how a dataset is queried (Query Examples) and connected (Relationships / ERDs). The dataset itself comes from outside; the platform records intent and structure on top.
Master Data Management is operator-curated reference data — the canonical lookup tables managed inside the platform. There is no external source; the platform is the system of record.
Data Lineage is connection-oriented — describing how entities flow into and out of each other across pipelines and microservices. The lineage is the cross-pillar record because every entity has a structure, a meaning, a location, a quality signal, and a lineage.
Data Glossary is meaning-oriented — naming and describing the concepts the data represents. Terms are first-class catalog entities with their own lifecycle, ownership, RBAC, and search surface; not metadata attached to other entities.
Data Quality is correctness-oriented — test results, anomaly classes, dataset SLAs. Every catalogued dataset has a quality story, even if it is only "no checks defined".
That difference shows up in where the data lives: Data Modelling artefacts attach to existing entities; Master Data artefacts are entities (Lookup Tables exist as Data Entities of type LOOKUP_TABLE); Data Quality results are pushed in by external frameworks; Lineage edges are computed from the connection graph. The six pillars sit alongside each other in the Data Governance map above, not nested.
AI aspects
ODD integrates AI/GenAI capabilities in a few places:
GenAI assistant — opt-in proxy from a single platform endpoint to an external AI service the operator runs (the platform does not embed an LLM). API-only today. See the GenAI assistant page for configuration, the external service contract, and operator caveats.
Data profiling — automatic statistical profiles for datasets (null ratios, distributions, cardinality) via odd-collector-profiler. Surfaces on entity pages.
ML experiment / model lineage — experiments and trained models are first-class entities with their own lineage edges; useful for reproducibility and governance of ML pipelines.
Terms & Aliases
A living record of synonyms and aliases users may search for. If you know a feature by a different name, start here.
| Term | Aliases | Meaning | See also |
| --- | --- | --- | --- |
| Server-to-server (S2S) authentication | Machine-to-machine (M2M) tokens, M2M auth | Static API-key authentication for programmatic clients | |
| Ingestion authentication filter | Ingestion filter, ingestion API key | Token-based auth for /ingestion/** — independent of UI auth, off by default | Ingestion authentication in Enable security |
| Collector secrets backend | Alternative secrets backend | Store collector credentials in an external secret store (AWS SSM) instead of YAML | |
| ODDRN | Open Data Discovery Resource Name | Stable string identifying every entity in the system | |
| Business Glossary (feature) | Glossary, Terms | In-app feature for managing term entities and linking them to datasets — not this Main Concepts page | |
| Data Entity Group | DEG | Logical grouping of data entities inside the catalog | |
| ML Experiments | ML Experiment Logging (deprecated) | A Data Entity Group collecting the entities produced by one training run — inputs, jobs, models, artifacts. Catalog view, not a metrics tracker. | |
| ODD Specification | Ingestion API spec, ingress API | Wire contract between producers and the platform | |
| Integration | — | Umbrella term for any path metadata takes from a source into the Platform — collectors (pull) and push adapters (push). Prefer in user-facing prose unless direction (pull/push) matters. | |
| Adapter | — | Source→spec mapper (push or pull); extract-and-map only, never runs alone. Classified by strategy (pull / push) and deployment shape. | |
| Pull adapter | — | Pull-strategy adapter — reads from the source on a cadence; the adapter knows the source endpoint and credentials. Today always paired with the collector-hosted deployment shape (configured via a plugin). | |
| Plugin | Adapter instance, adapter config | A configured pull-adapter instance inside a collector. Push adapters do not use the plugin term. | |
| Collector | Pull-adapter container (informal) | Container of pull adapters + runtime; the collector-hosted deployment shape. Not a synonym for "pull adapter". | |
| Push adapter | Push-client (client-server framing) | Push-strategy adapter — the source initiates the data flow. Three deployment shapes: in-process plugin / extension (dbt, GE, Airflow, Spark), standalone gateway (odd-tracing-gateway), direct SDK / CLI use (odd-cli). Used when the discussion is about extract-and-map mechanics. | |
| Push-client | Push adapter (extract-and-map framing) | Same component as Push adapter, framed from client-server topology — a producer-side client of the Platform server using push strategy. Used when the discussion is about deployment topology or network position. | |
| Standalone gateway | Push-adapter standalone shape, OTel gateway, tracing gateway | A push-adapter deployment shape: a separate service that source systems push to over an externally-defined wire protocol (today: OpenTelemetry/OTLP), with the Platform pulling the inferred entities. Today's only example is odd-tracing-gateway. Distinct from in-process plugins (which live inside a source tool's runtime) and from collectors (which host pull adapters). | |
| Catalog Overview page | Overview page, Main page, Data Entity Report (deprecated as page synonym) | The catalog's home page — main search, top tags, domains, the per-class Entities report, directory, and (when auth is on) owner association. Distinct from a data entity's own Overview tab, which is the per-entity landing view inside a detail page. | Data Discovery (the bucket landing the home page surfaces) |
| Master Data Management | MDM, Reference Data Management, Reference Data | The Data Governance pillar covering operator-curated reference data managed inside the platform. ODD ships the Reference-Data subset (Lookup Tables); golden records / survivorship / stewardship workflows are not part of ODD today. | |
| Lookup Tables | Reference tables, Master Data tables | Operator-curated reference tables managed inside the platform — schema, data, RBAC, API surface. Exposed in the catalog as Data Entities of type LOOKUP_TABLE. UI section: Master Data top-level tab. | |
| Slack alert webhook | Slack notifications, Slack incoming webhook | Outgoing-only HTTP POST of alert messages into a Slack channel via notifications.receivers.slack.url. One-way write — no thread state, no replies read back. Distinct from the Slack collaboration app. Consumer: SlackNotificationSender (gated by @ConditionalOnProperty(name = "notifications.receivers.slack.url")). | |
| Slack collaboration app | Slack Events API, Slack OAuth integration, Data Collaboration Slack | Full Slack app for in-app per-entity discussion threads — OAuth (datacollaboration.slack-oauth-token) plus the Slack Events API webhook to read replies back into the platform; bidirectional. Distinct from the Slack alert webhook. Routes gated by @ConditionalOnDataCollaboration (returns 404 Not Found when datacollaboration.enabled=false). | |
New aliases get added as they're discovered. If you notice a term that is missing or ambiguous, open an issue or a PR — the goal is that searching any common name lands you on the right page.