# Overview

An **integration** is any path metadata takes from a source system into the ODD Platform. ODD ships two strategies: **pull** (a [collector](/introduction/main-concepts.md) polls the source on a schedule) and **push** (a [push adapter](/introduction/main-concepts.md) lives inside or alongside the source and emits metadata as the source runs). Pick by where the work happens: use pull when the source is a passive data store you want snapshotted on a cadence, and push when the source is an application or a stream that should report per-run lineage and results in real time.

Push adapters ship in three deployment shapes:

* **In-process plugin / extension** — the adapter is embedded inside the source tool's own runtime (`odd-airflow-2`, `odd-dbt`, `odd-spark-adapter`, `odd-great-expectations`). Operators install the adapter into the existing source application.
* **Standalone gateway** — the adapter is its own service that source systems push to over an externally-defined wire protocol (today: `odd-tracing-gateway` over OpenTelemetry/OTLP). Operators deploy the gateway as a separate process and point their existing observability pipeline at it.
* **Direct SDK / CLI use** — the adapter is invoked as a CLI or a library call from custom code (`odd-cli`).

## Pull vs push at a glance

| Integration              | Strategy | Deployment shape   | What it integrates                                                                                         | Repo                                                                                                                    | Page                                                                           |
| ------------------------ | -------- | ------------------ | ---------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| `odd-collector`          | pull     | collector-hosted   | 41 generic adapters: databases, BI tools, streams, MLOps                                                   | [odd-collectors/odd-collector](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector)             | [odd-collector](/integrations/integrations/odd-collector.md)                   |
| `odd-collector-aws`      | pull     | collector-hosted   | 11 AWS adapters: Glue, S3, Athena, Kinesis, SageMaker, …                                                   | [odd-collectors/odd-collector-aws](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-aws)     | [odd-collector-aws](/integrations/integrations/odd-collector-aws.md)           |
| `odd-collector-azure`    | pull     | collector-hosted   | 4 Azure adapters: PowerBI, Azure SQL, Blob Storage, Data Factory                                           | [odd-collectors/odd-collector-azure](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-azure) | [odd-collector-azure](/integrations/integrations/odd-collector-azure.md)       |
| `odd-collector-gcp`      | pull     | collector-hosted   | 4 GCP adapters: BigQuery, BigTable, GCS, GCS Delta                                                         | [odd-collectors/odd-collector-gcp](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-gcp)     | [odd-collector-gcp](/integrations/integrations/odd-collector-gcp.md)           |
| `odd-collector-profiler` | pull     | collector-hosted   | Statistical data profiling for Postgres / Azure SQL                                                        | [odd-collector-profiler](https://github.com/opendatadiscovery/odd-collector-profiler)                                   | [odd-collector-profiler](/integrations/integrations/odd-collector-profiler.md) |
| `odd-airflow-2`          | push     | in-process plugin  | Airflow DAG / task / lineage metadata via a Listener                                                       | [odd-airflow-2](https://github.com/opendatadiscovery/odd-airflow-2)                                                     | [odd-airflow-2](/integrations/integrations/odd-airflow-2.md)                   |
| `odd-dbt`                | push     | in-process plugin  | dbt model lineage and test results                                                                         | [odd-dbt](https://github.com/opendatadiscovery/odd-dbt)                                                                 | [odd-dbt](/integrations/integrations/odd-dbt.md)                               |
| `odd-spark-adapter`      | push     | in-process plugin  | Spark job lineage (RDD, JDBC, Kafka batch, Snowflake, S3 Delta)                                            | [odd-spark-adapter](https://github.com/opendatadiscovery/odd-spark-adapter)                                             | [odd-spark-adapter](/integrations/integrations/odd-spark-adapter.md)           |
| `odd-great-expectations` | push     | in-process plugin  | Great Expectations checkpoint results                                                                      | [odd-great-expectations](https://github.com/opendatadiscovery/odd-great-expectations)                                   | [odd-great-expectations](/integrations/integrations/odd-great-expectations.md) |
| `odd-cli`                | push     | direct SDK / CLI   | Local files and ad-hoc dataset metadata                                                                    | [odd-cli](https://github.com/opendatadiscovery/odd-cli)                                                                 | [odd-cli](/integrations/integrations/odd-cli.md)                               |
| `odd-tracing-gateway`    | push     | standalone gateway | Microservice identities and dependencies inferred from OpenTelemetry traces (HTTP, JDBC, Kafka, gRPC, AWS) | [odd-tracing-gateway](https://github.com/opendatadiscovery/odd-tracing-gateway)                                         | [odd-tracing-gateway](/integrations/integrations/odd-tracing-gateway.md)       |

The same vocabulary appears in [Main Concepts](/introduction/main-concepts.md): a **collector** is the deployable container for pull adapters; a **push adapter** runs inside the source's runtime, beside it as a standalone gateway, or as a direct SDK / CLI call; a **plugin** is one configured pull-adapter instance inside a collector. "Pull adapter" is **not** a synonym for "collector": pull adapters live inside collectors, and a single collector can host several.

## Which integration do I need?

* **A database, data warehouse, or BI tool** (PostgreSQL, MySQL, Snowflake, Redshift, Tableau, …) → [`odd-collector`](/integrations/integrations/odd-collector.md).
* **An AWS service** (Glue, S3, Athena, Kinesis, …) → [`odd-collector-aws`](/integrations/integrations/odd-collector-aws.md).
* **An Azure service** (PowerBI, Azure SQL, Blob Storage, Data Factory) → [`odd-collector-azure`](/integrations/integrations/odd-collector-azure.md).
* **A GCP service** (BigQuery, BigTable, GCS, GCS Delta) → [`odd-collector-gcp`](/integrations/integrations/odd-collector-gcp.md).
* **Dataset statistics / profiling** for Postgres or Azure SQL → [`odd-collector-profiler`](/integrations/integrations/odd-collector-profiler.md).
* **An Airflow scheduler** running DAGs you want lineage for → [`odd-airflow-2`](/integrations/integrations/odd-airflow-2.md).
* **dbt models and tests** you want surfaced in the catalog → [`odd-dbt`](/integrations/integrations/odd-dbt.md).
* **Spark jobs** you want lineage from → [`odd-spark-adapter`](/integrations/integrations/odd-spark-adapter.md).
* **Great Expectations** quality results → [`odd-great-expectations`](/integrations/integrations/odd-great-expectations.md).
* **Local CSV / Parquet files**, or an ad-hoc push from a script or CI step → [`odd-cli`](/integrations/integrations/odd-cli.md).
* **Microservices instrumented with OpenTelemetry** — identities, HTTP / JDBC / Kafka / gRPC / AWS-SDK dependencies inferred from distributed traces → [`odd-tracing-gateway`](/integrations/integrations/odd-tracing-gateway.md). Reach for this when your stack already collects OpenTelemetry traces and you want the catalog to also reflect the microservices and the dependencies your existing observability pipeline already sees.

A single deployment commonly mixes strategies and shapes — e.g., one `odd-collector` container ingesting your warehouses on a schedule, `odd-airflow-2` reporting DAG-level lineage as the orchestrator runs, and `odd-tracing-gateway` populating microservice identities from your OpenTelemetry pipeline. The platform is the same on the receiving end; pick per source.

## Common configuration (collectors)

All collectors share the same top-level configuration schema, defined once in the SDK. The full reference, with every field, lives in [Build and run ODD Collectors → Full configuration reference](/developer-guides/build-and-run/build-and-run-odd-collectors.md#full-configuration-reference); the abridged shape is:

```yaml
platform_host_url: http://your.odd.platform:8080  # required
token: <COLLECTOR_TOKEN>                          # required (see "Token and datasource registration" below)
default_pulling_interval: 10                      # optional, in minutes — when unset, the collector runs once and exits
plugins:                                          # required, list — see below
  - type: postgresql                              # adapter type literal
    name: warehouse_main                          # operator-chosen, must be unique within the file
    # …per-adapter fields
```

Push adapters are configured separately by their host tool (Airflow Connection, Spark configs, dbt env vars, GE action block) — they do **not** consume `collector_config.yaml`. See each push-adapter page for the per-tool configuration.

### One collector hosts many plugins

A single collector instance — one container, one process — hosts as many plugins as you list in `plugins:`. Plugins can mix adapter types, and you can add **multiple plugins of the same type** to ingest from several sources of the same kind (three PostgreSQL databases on different hosts, two S3 buckets in different accounts, …). Each plugin needs a unique `name`; that's the discriminator the collector uses in logs and metrics.

```yaml
platform_host_url: http://localhost:8080
token: <COLLECTOR_TOKEN>
default_pulling_interval: 10
plugins:
  # Two PostgreSQL plugins → two databases on different hosts.
  - type: postgresql
    name: warehouse_eu
    host: pg-eu.internal
    port: 5432
    database: warehouse
    user: odd_reader
    password: !ENV ${PG_EU_PASSWORD}
  - type: postgresql
    name: warehouse_us
    host: pg-us.internal
    port: 5432
    database: warehouse
    user: odd_reader
    password: !ENV ${PG_US_PASSWORD}
  # A different adapter type → MySQL, same container.
  - type: mysql
    name: legacy_billing
    host: mysql.internal
    port: 3306
    database: billing
    user: odd_reader
    password: !ENV ${MYSQL_PASSWORD}
```

Running two plugins of the same type is a routine deployment pattern: one container scales to your full pull-side surface, so you don't need one container per source.
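Because `name` is the discriminator the collector uses in logs and metrics, it can be worth checking a configuration for repeated names before deploying it. The sketch below is illustrative only: `duplicate_plugin_names` is a hypothetical helper, not part of the collector SDK, and it assumes the YAML file has already been parsed into a plain dictionary.

```python
from collections import Counter


def duplicate_plugin_names(config: dict) -> list:
    """Return plugin names that appear more than once, sorted alphabetically."""
    names = [plugin.get("name") for plugin in config.get("plugins", [])]
    counts = Counter(names)
    # Plugins without a name are ignored here; they are a separate kind of error.
    return sorted(name for name, n in counts.items() if name is not None and n > 1)


# Mirrors the three-plugin example above, reduced to the fields that matter here.
config = {
    "plugins": [
        {"type": "postgresql", "name": "warehouse_eu"},
        {"type": "postgresql", "name": "warehouse_us"},
        {"type": "mysql", "name": "legacy_billing"},
    ]
}

if __name__ == "__main__":
    print(duplicate_plugin_names(config))
```

An empty result means every plugin name is unique within the file, regardless of how many plugins share a `type`.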

### Beyond connection settings: per-adapter features

Many pull adapters expose features that go past "connect and read schema". Two of the most-used ones are surfaced once here so you know to look for them on individual adapter pages:

* **Ingestion filters** — `schemas_filter` (PostgreSQL, Snowflake), `filename_filter` (S3, Azure Blob Storage, GCS), `datasets_filter` (BigQuery), `pipeline_filter` (Azure Data Factory) and similar. Each takes regex `include` / `exclude` lists. When omitted, the default is "include everything" — i.e. the adapter ingests every schema / file / dataset it can see. Use filters to scope a plugin to the slice you actually want catalogued. See the dedicated [Ingestion filters](/integrations/integrations/ingestion-filters.md) page for the per-key shape, the include / exclude interaction rule, and a worked PostgreSQL example.
* **Foreign-key (ERD) relationships** — PostgreSQL and Snowflake plugins emit `ENTITY_RELATIONSHIP` entities for tables connected by foreign keys (cross-schema relations included). The platform renders these as ERD diagrams on the dataset detail page. Other adapters do not currently extract foreign-key relationships.

The full per-adapter capability matrix (which adapters support filters, which support ERD, which have additional knobs like dataset partitioning) lives on the per-collector pages.
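To make the `include` / `exclude` idea concrete, here is a minimal sketch of how such a filter could be evaluated. It rests on two assumptions that this page does not confirm: that an `exclude` match overrides an `include` match, and that patterns must match the whole name. The authoritative interaction rule is on the Ingestion filters page; `passes_filter` is a hypothetical function, not the SDK's implementation.

```python
import re
from typing import Optional, Sequence


def passes_filter(
    name: str,
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> bool:
    """Decide whether a schema / file / dataset name is in scope.

    Assumptions (not taken from the ODD source): exclude wins over include,
    and each pattern has to match the full name.
    """
    include_patterns = include if include else [".*"]  # omitted means include everything
    exclude_patterns = exclude or []
    included = any(re.fullmatch(p, name) for p in include_patterns)
    excluded = any(re.fullmatch(p, name) for p in exclude_patterns)
    return included and not excluded


if __name__ == "__main__":
    for schema in ["public", "sales", "tmp_load"]:
        print(schema, passes_filter(schema, exclude=["tmp_.*"]))
```

With no filter supplied the function accepts every name, which matches the "include everything" default described above.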

### Secrets backend (optional)

Any field in `collector_config.yaml` can be sourced from AWS SSM Parameter Store instead of inline YAML — see [Collector secrets backend](/configuration-and-deployment/collectors-secrets-backend.md). Only `odd-collector` (the generic collector) ships with a Secrets Backend hook today; the cloud and profiler collectors read configuration from YAML and environment variables.

## Integration Wizard (in-app UI)

To shorten the path from "I picked an integration" to "I have a working `collector_config.yaml` snippet", the platform ships an **Integration Wizard** under **Management → Integrations**. The wizard is data-driven by manifests on the platform's classpath (`META-INF/wizard/*.yaml`) and exposes the same set through `GET /api/integrations` / `GET /api/integrations/{integration_id}`. For each integration the wizard shows a description, walks the operator through prerequisites, and renders a parameterised YAML snippet — fill in host / port / credentials and copy the result into the `plugins:` block of your `collector_config.yaml`.

The wizard is a **starting point**, not a replacement for `collector_config.yaml`: it generates one plugin's worth of YAML, not the full file. Operators still hand-author `platform_host_url`, `token`, `default_pulling_interval`, additional plugins, filters, and any [secrets-backend](/configuration-and-deployment/collectors-secrets-backend.md) references. For the per-card flow, the static-parameter substitution context (today only `platform_url`, resolved from `odd.platform-base-url`), and the API surface, see [Integration Wizard](/integrations/integrations/integration-wizard.md).

## Token and datasource registration

Integrations, pull or push, authenticate to the platform with a **collector token** issued by the ODD Platform; the one exception in the table below is `odd-spark-adapter`, which carries no static token. The token is created in the UI under [**Management → Collectors**](/features/management.md) (see [Try locally → Create Collector entity](/configuration-and-deployment/trylocally.md#create-collector-entity) for the step-by-step). The same flow issues the token regardless of whether the integration that consumes it is pull or push; what differs is **how each integration consumes the token**:

| Integration                            | How the token is supplied                                                                                                     |
| -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `odd-collector*` (all pull collectors) | `token: <COLLECTOR_TOKEN>` field in `collector_config.yaml`, or `TOKEN` environment variable                                  |
| `odd-airflow-2`                        | Airflow `Connection` named `odd`, `password` field                                                                            |
| `odd-dbt`                              | `ODD_PLATFORM_TOKEN` env var (or `--platform-token` flag)                                                                     |
| `odd-spark-adapter`                    | `spark.odd.host.url` / `spark.odd.oddrn.key` Spark configuration (no static token — the JAR identifies itself by `oddrn.key`) |
| `odd-great-expectations`               | `platform_token` field in the `ODDAction` block                                                                               |
| `odd-cli`                              | `ODD_PLATFORM_TOKEN` env var                                                                                                  |

On the platform side, every integration registers its data sources via `POST /ingestion/datasources`, which the platform exposes as part of the Ingress API. Pull collectors call this from their SDK; push adapters call it (or rely on the platform recognising entity ODDRNs implicitly on first push) per the [ODD Specification](https://github.com/opendatadiscovery/opendatadiscovery-specification).

{% hint style="danger" %}
**The metadata-push endpoint accepts any caller by default.** The collector token above is *not* checked on the entity-push path unless you turn it on. The `/ingestion/**` namespace is whitelisted in Spring Security, and the one filter that does validate the token (it covers `POST /ingestion/entities` only) is gated by `auth.ingestion.filter.enabled`, which **defaults to `false`**. With the default in place and the platform reachable on the network, any caller who can speak the Ingress API can push a spec-valid `DataEntityList` into **any** existing datasource by writing that datasource's ODDRN in the payload, and the catalog renders the result to every user as authoritative metadata. The companion `POST /ingestion/entities/datasets/stats` endpoint is never covered by that filter under any setting. Enable the filter (and read the per-endpoint posture) before exposing the platform on any untrusted network: [Enable security → Ingestion authentication](/configuration-and-deployment/enable-security.md#ingestion-authentication).
{% endhint %}

## Ingestion error contract

The `POST /ingestion/entities` endpoint is the load-bearing call every collector makes for each metadata batch. Three client-side error conditions currently surface as **HTTP 5xx** (rather than 4xx) because the platform's controller has no `@ExceptionHandler` advice for the underlying exception classes. Collector authors writing retry-with-backoff logic against the public contract should treat the conditions below as **client errors that look like server errors** — retrying them compounds platform pressure without any chance of success.

| Client-side condition                      | Underlying exception                                                                   | Current HTTP shape | Recommended client behaviour                                                                                                                |
| ------------------------------------------ | -------------------------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Duplicate ODDRN inside one batch           | `IllegalStateException: Duplicate key` (thrown by `Collectors.toMap`'s default merger) | 5xx                | Deduplicate the batch on the collector side before submitting; do not retry.                                                                |
| Unknown `data_source_oddrn` in the batch   | `NotFoundException`                                                                    | 5xx                | Verify the data-source has been registered via `POST /ingestion/datasources` before the entity batch; do not retry on this exception class. |
| Payload exceeds the configured codec limit | `DataBufferLimitException` (Spring WebFlux body codec)                                 | 5xx                | Reduce batch size or coordinate with the operator to raise `spring.codec.max-in-memory-size`; do not retry the same payload.                |

Use a per-condition pre-flight (a GET against the target data source, an in-batch deduplication pass, a payload-size check against the operator-documented limit) rather than blanket exponential backoff on 5xx. The platform-side hardening to convert these conditions to structured 4xx responses (`400`, `404`, `413`) is on the roadmap; until it ships, the doc-side contract above is the canonical client guidance.
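Two of those pre-flight checks, in-batch deduplication and the payload-size check, need no network access and can be sketched in a few lines. Everything named here is illustrative: `dedupe_by_oddrn` and `preflight` are hypothetical helpers, `max_bytes` stands in for whatever limit your operator documents, and the payload shape (`data_source_oddrn` plus an `items` list of entities keyed by `oddrn`) is assumed from the ODD Specification rather than taken from this page.

```python
import json


def dedupe_by_oddrn(entities: list) -> list:
    """Keep one entity per ODDRN: the last one seen, at its first-seen position."""
    unique = {}
    for entity in entities:
        unique[entity["oddrn"]] = entity  # a later duplicate overwrites the earlier value
    return list(unique.values())


def payload_size_bytes(payload: dict) -> int:
    """Size of the JSON-encoded payload as it would travel on the wire."""
    return len(json.dumps(payload).encode("utf-8"))


def preflight(data_source_oddrn: str, entities: list, max_bytes: int) -> dict:
    """Build a batch that avoids the duplicate-key and oversize conditions."""
    payload = {
        "data_source_oddrn": data_source_oddrn,
        "items": dedupe_by_oddrn(entities),
    }
    size = payload_size_bytes(payload)
    if size > max_bytes:
        # Fail locally instead of sending a payload that cannot succeed.
        raise ValueError(f"batch is {size} bytes, limit is {max_bytes}; split it")
    return payload
```

Keeping the last duplicate is a design choice for the sketch; a collector may prefer to fail loudly on duplicates if they indicate a bug in the adapter.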

A separate concern is **per-data-source serialisation under contention**. The `POST /ingestion/entities` pipeline holds a PostgreSQL `SELECT … FOR UPDATE` row-lock on the resolved `data_source` row for the entire pipeline duration (data-source resolve + the 14-step ingestion processor chain + OTLP metric export). Two collectors emitting concurrently to the **same** data source serialise on that lock; the loser may exceed the transaction timeout and fail with a 5xx, with no `Retry-After` header and no `429 Too Many Requests` signal. Collectors that may emit concurrently to the same data source should apply **per-data-source backoff with jitter** at the client layer (a small randomised delay between consecutive `POST /ingestion/entities` calls for the same data source) until the platform-side fix ships either a narrower lock scope or an explicit contention signal.
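One way to picture the per-data-source delay is a small pacer that tracks, for each data source, the earliest time the next push may start. This is a sketch under assumptions, not SDK code: `PerDataSourcePacer` and its default delays are hypothetical, and it only spaces out calls made by one process. Independent collectors do not coordinate through it; the random interval merely makes it less likely that they hit the lock at the same moment.

```python
import random
import time


class PerDataSourcePacer:
    """Spaces out consecutive pushes to the same data source by a random delay."""

    def __init__(self, min_delay=0.2, max_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep, jitter=random.uniform):
        self._min_delay = min_delay
        self._max_delay = max_delay
        self._clock = clock      # injectable so the pacer can be tested without waiting
        self._sleep = sleep
        self._jitter = jitter
        self._ready_at = {}      # data source ODDRN -> earliest start of the next push

    def wait(self, data_source_oddrn):
        """Block until this data source's next slot; return the seconds waited."""
        now = self._clock()
        ready_at = self._ready_at.get(data_source_oddrn, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the following slot a random interval after this push starts.
        start = now + delay
        self._ready_at[data_source_oddrn] = start + self._jitter(
            self._min_delay, self._max_delay
        )
        return delay
```

A collector would call `pacer.wait(data_source_oddrn)` immediately before each `POST /ingestion/entities`; pushes to different data sources are never delayed by one another.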

A third, structural concern is **destructive-path observability**. The ingestion service's rollback paths (every condition above plus the in-pipeline processor failures) have no structured logging today, so operators investigating an ingestion failure need platform-side access to the application log to find the actual exception trace. The discriminators an SRE needs (collector identity, target data source ODDRN, entity count, batch identifier) are not present in the platform's logs on the rollback paths; only the raw exception trace surfaces. If your deployment treats ingestion failures as user-impacting, route the collector's own logs through your observability stack as the primary diagnostic surface rather than expecting the platform to mirror the failure detail.
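On the collector side, the discriminators listed above can be attached to every failed push as one structured log line. The following is a minimal sketch using only the Python standard library; the function names, the logger name, the field names, and the client-generated batch identifier are all assumptions made for illustration, and the platform does not know about the identifier.

```python
import json
import logging
import uuid

logger = logging.getLogger("collector.ingestion")  # hypothetical logger name


def new_batch_id() -> str:
    """Client-side identifier used to correlate log lines for one batch."""
    return uuid.uuid4().hex


def failure_record(collector, data_source_oddrn, entity_count, batch_id, status, error):
    """Gather the fields an SRE needs to triage a failed push."""
    return {
        "event": "ingestion_push_failed",
        "collector": collector,
        "data_source_oddrn": data_source_oddrn,
        "entity_count": entity_count,
        "batch_id": batch_id,
        "http_status": status,
        "error": str(error),
    }


def log_push_failure(**fields):
    """Emit the record as a single JSON line and return it for further use."""
    record = failure_record(**fields)
    logger.error(json.dumps(record, sort_keys=True))
    return record
```

Emitting one JSON object per failure keeps the line easy to parse in most log pipelines, and the batch identifier lets you line up the collector-side record with the timestamp of the raw trace in the platform log.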

## Where to next

* **Scoping what a plugin ingests** → [Ingestion filters](/integrations/integrations/ingestion-filters.md) — regex `include` / `exclude` per plugin.
* **Bootstrapping a `collector_config.yaml` snippet from the in-app wizard** → [Integration Wizard](/integrations/integrations/integration-wizard.md).
* **Storing collector secrets in AWS SSM** → [Collector secrets backend](/configuration-and-deployment/collectors-secrets-backend.md).
* **Building / running a collector locally** → [Build and run ODD Collectors](/developer-guides/build-and-run/build-and-run-odd-collectors.md).
* **The wire contract** between any integration and the platform → [ODD Specification](https://github.com/opendatadiscovery/opendatadiscovery-specification).
* **Authoring a brand-new adapter** (when an existing one doesn't fit) → [Build a custom collector](/developer-guides/build-and-run/custom-collectors.md). The SDK lives at [odd-collectors/odd-collector-sdk](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-sdk).
* **Existing repository overview** → [GitHub organization overview](/developer-guides/github-organization-overview.md), which lists every ODD repo with a one-line summary.

