Overview
Hub for every way metadata reaches the ODD Platform — pull adapters (collector-hosted), push adapters (in-process plugins, standalone gateways, direct SDK use).
An integration is any path metadata takes from a source system into the ODD Platform. ODD ships two strategies — pull (a collector polls the source on a schedule) and push (a push adapter lives inside or alongside the source and emits as the source runs). Pick by where the work happens: pull when the source is a passive data store you want snapshotted on a cadence, push when the source is an application or a stream you want reporting per-run lineage and results in real time.
Push adapters ship in three deployment shapes:
In-process plugin / extension: the adapter is embedded inside the source tool's own runtime (odd-airflow-2, odd-dbt, odd-spark-adapter, odd-great-expectations). Operators install the adapter into the existing source application.
Standalone gateway: the adapter is its own service that source systems push to over an externally defined wire protocol (today: odd-tracing-gateway over OpenTelemetry/OTLP). Operators deploy the gateway as a separate process and point their existing observability pipeline at it.
Direct SDK / CLI use: the adapter is invoked as a CLI or a library call from custom code (odd-cli).
Pull vs push at a glance
| Integration | Strategy | Deployment shape | What it ingests |
| --- | --- | --- | --- |
| odd-collector | pull | collector-hosted | 41 generic adapters: databases, BI tools, streams, MLOps |
| odd-collector-aws | pull | collector-hosted | 11 AWS adapters: Glue, S3, Athena, Kinesis, SageMaker, … |
| odd-collector-azure | pull | collector-hosted | 4 Azure adapters: PowerBI, Azure SQL, Blob Storage, Data Factory |
| odd-collector-gcp | pull | collector-hosted | 4 GCP adapters: BigQuery, BigTable, GCS, GCS Delta |
| odd-collector-profiler | pull | collector-hosted | Statistical data profiling for Postgres / Azure SQL |
| odd-airflow-2 | push | in-process plugin | Airflow DAG / task / lineage metadata via a Listener |
| odd-spark-adapter | push | in-process plugin | Spark job lineage (RDD, JDBC, Kafka batch, Snowflake, S3 Delta) |
| odd-great-expectations | push | in-process plugin | Great Expectations checkpoint results |
| odd-tracing-gateway | push | standalone gateway | Microservice identities and dependencies inferred from OpenTelemetry traces (HTTP, JDBC, Kafka, gRPC, AWS) |
The same vocabulary appears in Main Concepts: a collector is the deployable container for pull adapters; a push adapter runs inside the source's runtime, beside it as a standalone gateway, or as a direct SDK / CLI call; a plugin is one configured pull-adapter instance inside a collector. "Pull adapter" is not a synonym for "collector" — pull adapters live inside collectors, plural per collector.
Which integration do I need?
A database, data warehouse, or BI tool (PostgreSQL, MySQL, Snowflake, Redshift, Tableau, …) → odd-collector.
An AWS service (Glue, S3, Athena, Kinesis, …) → odd-collector-aws.
An Azure service (PowerBI, Azure SQL, Blob Storage, Data Factory) → odd-collector-azure.
A GCP service (BigQuery, GCS, BigTable) → odd-collector-gcp.
Dataset statistics / profiling for Postgres or Azure SQL → odd-collector-profiler.
An Airflow scheduler running DAGs you want lineage for → odd-airflow-2.
dbt models and tests you want surfaced in the catalog → odd-dbt.
Spark jobs you want lineage from → odd-spark-adapter.
Great Expectations quality results → odd-great-expectations.
Local CSV / Parquet files, or an ad-hoc push from a script or CI step → odd-cli.
Microservices instrumented with OpenTelemetry (identities and HTTP / JDBC / Kafka / gRPC / AWS-SDK dependencies inferred from distributed traces) → odd-tracing-gateway. Reach for this when your stack already collects OpenTelemetry traces and you want the catalog to reflect the microservices and dependencies that your existing observability pipeline already sees.
A single deployment commonly mixes strategies and shapes — e.g., one odd-collector container ingesting your warehouses on a schedule, odd-airflow-2 reporting DAG-level lineage as the orchestrator runs, and odd-tracing-gateway populating microservice identities from your OpenTelemetry pipeline. The platform is the same on the receiving end; pick per source.
Common configuration (collectors)
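All collectors share the same top-level configuration schema, defined once in the SDK. The full reference, with every field, lives in Build and run ODD Collectors → Full configuration reference. In abridged form, the file carries platform_host_url, token, and default_pulling_interval at the top level, followed by a plugins: list with one entry per configured plugin.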
Push adapters are configured separately by their host tool (Airflow Connection, Spark configs, dbt env vars, GE action block) — they do not consume collector_config.yaml. See each push-adapter page for the per-tool configuration.
One collector hosts many plugins
A single collector instance — one container, one process — hosts as many plugins as you list in plugins:. Plugins can mix adapter types, and you can add multiple plugins of the same type to ingest from several sources of the same kind (three PostgreSQL databases on different hosts, two S3 buckets in different accounts, …). Each plugin needs a unique name; that's the discriminator the collector uses in logs and metrics.
Running several plugins of the same type is a routine deployment pattern: one container covers your full pull-side surface, so you do not need one container per source.
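Because plugin names must be unique within a collector, a quick check before deploying a config can catch copy-paste mistakes. The following is a minimal sketch, not part of the SDK: the config is shown as an already-parsed Python dict, the `type` key and its values are assumed for illustration, and the host, token, interval, and plugin names are placeholders.

```python
from collections import Counter

# Hypothetical parsed form of a collector_config.yaml. The top-level field
# names come from this page; the "type" key, all values, and all plugin names
# are illustrative placeholders.
config = {
    "platform_host_url": "http://localhost:8080",
    "token": "<COLLECTOR_TOKEN>",
    "default_pulling_interval": 10,
    "plugins": [
        {"type": "postgresql", "name": "orders_db"},
        {"type": "postgresql", "name": "billing_db"},  # same type, different source
        {"type": "snowflake", "name": "warehouse"},
    ],
}


def duplicate_plugin_names(cfg: dict) -> list:
    """Return plugin names that appear more than once, sorted alphabetically."""
    counts = Counter(plugin["name"] for plugin in cfg.get("plugins", []))
    return sorted(name for name, seen in counts.items() if seen > 1)


duplicates = duplicate_plugin_names(config)
if duplicates:
    raise SystemExit(f"duplicate plugin names: {duplicates}")
```

Two plugins of the same type pass the check as long as their names differ, which matches the multi-source pattern described above.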
Beyond connection settings: per-adapter features
Many pull adapters expose features that go past "connect and read schema". Two of the most-used ones are surfaced once here so you know to look for them on individual adapter pages:
Ingestion filters: schemas_filter (PostgreSQL, Snowflake), filename_filter (S3, Azure Blob Storage, GCS), datasets_filter (BigQuery), pipeline_filter (Azure Data Factory) and similar. Each takes regex include / exclude lists. When omitted, the default is "include everything", i.e. the adapter ingests every schema / file / dataset it can see. Use filters to scope a plugin to the slice you actually want catalogued. See the dedicated Ingestion filters page for the per-key shape, the include / exclude interaction rule, and a worked PostgreSQL example.
Foreign-key (ERD) relationships: PostgreSQL and Snowflake plugins emit ENTITY_RELATIONSHIP entities for tables connected by foreign keys (cross-schema relations included). The platform renders these as ERD diagrams on the dataset detail page. Other adapters do not currently extract foreign-key relationships.
The full per-adapter capability matrix (which adapters support filters, which support ERD, which have additional knobs like dataset partitioning) lives on the per-collector pages.
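To make the include / exclude idea behind ingestion filters concrete, here is a small illustrative sketch. It is not the adapters' implementation. It assumes that a name is kept when it matches at least one include pattern and no exclude pattern, that exclude wins over include, and that patterns must match the whole name. The authoritative interaction rule is on the Ingestion filters page. The function name and schema names are hypothetical.

```python
import re


def apply_filter(names, include=None, exclude=None):
    """Keep names that match any include pattern and no exclude pattern.

    Assumptions made for this sketch: an omitted include list means
    "include everything", exclude takes precedence over include, and
    patterns are applied with re.fullmatch.
    """
    include = include if include is not None else [".*"]
    exclude = exclude or []
    kept = []
    for name in names:
        included = any(re.fullmatch(pattern, name) for pattern in include)
        excluded = any(re.fullmatch(pattern, name) for pattern in exclude)
        if included and not excluded:
            kept.append(name)
    return kept


schemas = ["public", "sales", "sales_tmp", "pg_catalog"]
print(apply_filter(schemas))  # no filter configured: all four schemas are kept
print(apply_filter(schemas, include=["sales.*"], exclude=[".*_tmp"]))  # ['sales']
```

Dry-running a filter against a list of known schema or file names like this is a cheap way to see what a plugin would pick up before the collector's first run.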
Secrets backend (optional)
Any field in collector_config.yaml can be sourced from AWS SSM Parameter Store instead of inline YAML — see Collector secrets backend. Only odd-collector (the generic collector) ships with a Secrets Backend hook today; the cloud and profiler collectors read configuration from YAML and environment variables.
Integration Wizard (in-app UI)
To shorten the path from "I picked an integration" to "I have a working collector_config.yaml snippet", the platform ships an Integration Wizard under Management → Integrations. The wizard is data-driven by manifests on the platform's classpath (META-INF/wizard/*.yaml) and exposes the same set through GET /api/integrations / GET /api/integrations/{integration_id}. For each integration the wizard shows a description, walks the operator through prerequisites, and renders a parameterised YAML snippet — fill in host / port / credentials and copy the result into the plugins: block of your collector_config.yaml.
The wizard is a starting point, not a replacement for collector_config.yaml: it generates one plugin's worth of YAML, not the full file. Operators still hand-author platform_host_url, token, default_pulling_interval, additional plugins, filters, and any secrets-backend references. For the per-card flow, the static-parameter substitution context (today only platform_url, resolved from odd.platform-base-url), and the API surface, see Integration Wizard.
Token and datasource registration
Every integration — pull or push — authenticates to the platform with a collector token issued by the ODD Platform. The token is created in the UI under Management → Collectors (see Try locally → Create Collector entity for the step-by-step). The same flow issues the token regardless of whether the integration that consumes it is pull or push; what differs is how each integration consumes the token:
| Integration | How it consumes the token |
| --- | --- |
| odd-collector* (all pull collectors) | token: <COLLECTOR_TOKEN> field in collector_config.yaml, or TOKEN environment variable |
| odd-airflow-2 | Airflow Connection named odd, password field |
| odd-dbt | ODD_PLATFORM_TOKEN env var (or --platform-token flag) |
| odd-spark-adapter | spark.odd.host.url / spark.odd.oddrn.key Spark configuration (no static token; the JAR identifies itself by oddrn.key) |
| odd-great-expectations | platform_token field in the ODDAction block |
| odd-cli | ODD_PLATFORM_TOKEN env var |
On the platform side, every integration registers its data sources via POST /ingestion/datasources, which the platform exposes as part of the Ingress API. Pull collectors call this from their SDK; push adapters call it (or rely on the platform recognising entity ODDRNs implicitly on first push) per the ODD Specification.
The metadata-push endpoint accepts any caller by default. The collector token above is not checked on the entity-push path unless you turn it on. The /ingestion/** namespace is whitelisted in Spring Security, and the one filter that does validate the token (POST /ingestion/entities) is gated by auth.ingestion.filter.enabled, which defaults to false. With the default in place and the platform reachable on the network, any caller who can speak the Ingress API can push a spec-valid DataEntityList into any existing datasource — by writing that datasource's ODDRN in the payload — and the catalog renders the result to every user as authoritative metadata. The companion POST /ingestion/entities/datasets/stats endpoint is never covered by that filter under any setting. Enable the filter (and read the per-endpoint posture) before exposing the platform on any untrusted network: Enable security → Ingestion authentication.
Ingestion error contract
The POST /ingestion/entities endpoint is the load-bearing call every collector makes for each metadata batch. Three client-side error conditions currently surface as HTTP 5xx (rather than 4xx) because the platform's controller has no @ExceptionHandler advice for the underlying exception classes. Collector authors writing retry-with-backoff logic against the public contract should treat the conditions below as client errors that look like server errors — retrying them compounds platform pressure without any chance of success.
| Condition | Underlying exception | HTTP status | Client guidance |
| --- | --- | --- | --- |
| Duplicate ODDRN inside one batch | IllegalStateException: Duplicate key (thrown by Collectors.toMap's default merger) | 5xx | Deduplicate the batch on the collector side before submitting; do not retry. |
| Unknown data_source_oddrn in the batch | NotFoundException | 5xx | Verify the data source has been registered via POST /ingestion/datasources before the entity batch; do not retry on this exception class. |
| Payload exceeds the configured codec limit | DataBufferLimitException (Spring WebFlux body codec) | 5xx | Reduce batch size or coordinate with the operator to raise spring.codec.max-in-memory-size; do not retry the same payload. |
Use a per-condition pre-flight (a GET against the target data source, an in-batch deduplication pass, a payload-size check against the operator-documented limit) rather than blanket exponential backoff on 5xx. The platform-side hardening to convert these conditions to structured 4xx responses (400, 404, 413) is on the roadmap; until it ships, the doc-side contract above is the canonical client guidance.
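Two of the pre-flight checks above, in-batch deduplication and the payload-size check, can be sketched on the client side as follows. This is an illustration under stated assumptions, not SDK code: entities are plain dicts carrying an `oddrn` key, the payload is assumed to be a JSON object with `data_source_oddrn` and `items`, and `max_bytes` stands in for whatever limit the operator documents. The helper names are hypothetical, and the data-source existence check is not covered here.

```python
import json


def dedupe_by_oddrn(entities):
    """Drop repeated ODDRNs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for entity in entities:
        if entity["oddrn"] not in seen:
            seen.add(entity["oddrn"])
            unique.append(entity)
    return unique


def encoded_size(data_source_oddrn, items):
    """Size in bytes of the (assumed) JSON payload for one batch."""
    payload = {"data_source_oddrn": data_source_oddrn, "items": items}
    return len(json.dumps(payload).encode("utf-8"))


def split_by_size(data_source_oddrn, entities, max_bytes):
    """Greedily split entities into batches that each stay within max_bytes."""
    batches, current = [], []
    for entity in entities:
        if encoded_size(data_source_oddrn, [entity]) > max_bytes:
            # Retrying cannot help here; surface the problem to the operator.
            raise ValueError(f"entity {entity['oddrn']} alone exceeds {max_bytes} bytes")
        if current and encoded_size(data_source_oddrn, current + [entity]) > max_bytes:
            batches.append(current)
            current = []
        current.append(entity)
    if current:
        batches.append(current)
    return batches
```

The size is measured with `json.dumps`, so it only approximates what a real client serialiser puts on the wire; leaving some headroom below the documented limit is prudent. The greedy split re-encodes the growing batch on every step, which is fine for a sketch but worth optimising for very large batches.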
A separate concern — per-data-source serialisation under contention. The POST /ingestion/entities pipeline holds a PostgreSQL SELECT … FOR UPDATE row-lock on the resolved data_source row for the entire pipeline duration (data-source resolve + the 14-step ingestion processor chain + OTLP metric export). Two collectors emitting concurrently to the same data source serialise on that lock; the loser may exceed the transaction timeout and fail with a 5xx — there is no Retry-After header, no 429 Too Many Requests signal. Collectors that may emit concurrently to the same data source should apply per-data-source backoff with jitter at the client layer (a small randomised delay between consecutive POST /ingestion/entities calls for the same data source) until the platform-side fix ships either a narrower lock scope or an explicit contention signal.
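The client-side spacing suggested above might look like the sketch below: a small helper that enforces a randomised minimum gap between consecutive sends to the same data source. The class name, the default delays, and the injectable clock and sleep functions are all choices made for this illustration; none of them come from the ODD SDK.

```python
import random
import time


class PerDataSourcePacer:
    """Space out consecutive sends to the same data source with random jitter."""

    def __init__(self, base_delay=0.5, jitter=0.5, rng=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay  # minimum gap in seconds (illustrative default)
        self.jitter = jitter          # extra random gap, drawn from [0, jitter]
        self.rng = rng or random.Random()
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = {}       # data source ODDRN -> earliest next send time

    def wait(self, data_source_oddrn):
        """Block until this data source may be sent to again; return seconds slept."""
        now = self.clock()
        ready_at = self._next_allowed.get(data_source_oddrn, now)
        slept = max(0.0, ready_at - now)
        if slept:
            self.sleep(slept)
        gap = self.base_delay + self.rng.uniform(0, self.jitter)
        self._next_allowed[data_source_oddrn] = max(now, ready_at) + gap
        return slept
```

A collector would call `pacer.wait(data_source_oddrn)` immediately before each POST /ingestion/entities for that data source. The helper only paces calls within one process. Separate collectors writing to the same data source do not coordinate through it, although the jitter makes it less likely that their requests stay aligned.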
A third structural concern — destructive-path observability. The ingestion service's rollback paths (every condition above plus the in-pipeline processor failures) have no structured logging today; operators investigating an ingestion failure need platform-side access to the application log to find the actual exception trace. The discriminators an SRE needs (collector identity, target data source ODDRN, entity count, batch identifier) are not present in the platform's logs on the rollback paths — only the raw exception trace surfaces. If your deployment treats ingestion failures as user-impacting, route the collector's own logs through your observability stack as the primary diagnostic surface rather than expecting the platform to mirror the failure detail.
Where to next
Scoping what a plugin ingests → Ingestion filters (regex include / exclude per plugin).
Bootstrapping a collector_config.yaml snippet from the in-app wizard → Integration Wizard.
Storing collector secrets in AWS SSM → Collector secrets backend.
Building / running a collector locally → Build and run ODD Collectors.
The wire contract between any integration and the platform → ODD Specification.
Authoring a brand-new adapter (when an existing one doesn't fit) → Build a custom collector. The SDK lives at odd-collectors/odd-collector-sdk.
Existing repository overview — GitHub organization overview lists every ODD repo with one-line summaries.