# Build a custom collector

{% hint style="info" %}
**Audience: Python developers** extending ODD when an existing adapter doesn't fit. Most source systems are already covered by the bundled collectors — start at the [Integrations hub](/integrations/integrations.md) before you reach for the SDK.
{% endhint %}

The `odd-collector-sdk` Python library is what powers every pull collector in the [`odd-collectors`](https://github.com/opendatadiscovery/odd-collectors) monorepo (`odd-collector` generic, `odd-collector-aws`, `odd-collector-azure`, `odd-collector-gcp`) and the standalone [`odd-collector-profiler`](https://github.com/opendatadiscovery/odd-collector-profiler). It handles the parts that every collector shares — config loading, adapter discovery, scheduling, the Platform Ingress API client, signal-based shutdown — so authoring a new adapter is mostly about writing the source-specific extract-and-map step.

This guide walks through building a custom collector or a custom adapter against the SDK. It covers the plugin / config pattern, the adapter contract (sync, async, and async-generator variants), wiring up the entry point, packaging, and the seams where ODDRN generation and runtime configuration plug in.

## When to author a custom adapter or collector

Before writing code, confirm that the existing collectors don't already cover your case:

* **A new database / BI / ML / streaming source not yet in `odd-collector`** → add a new adapter to the [generic collector](/integrations/integrations/odd-collector.md) and contribute it back upstream.
* **An AWS / Azure / GCP managed service not yet in the cloud collectors** → add an adapter to [`odd-collector-aws`](/integrations/integrations/odd-collector-aws.md), [`odd-collector-azure`](/integrations/integrations/odd-collector-azure.md), or [`odd-collector-gcp`](/integrations/integrations/odd-collector-gcp.md).
* **A push-strategy integration** (the source already runs your code — Airflow, dbt, Spark, Great Expectations, custom CI/CD) → the source-embedded pattern is implemented separately in repos like [`odd-airflow-2`](/integrations/integrations/odd-airflow-2.md), [`odd-dbt`](/integrations/integrations/odd-dbt.md), [`odd-spark-adapter`](/integrations/integrations/odd-spark-adapter.md), [`odd-great-expectations`](/integrations/integrations/odd-great-expectations.md). The SDK described here targets pull collectors; push integrations follow the host system's plugin or listener API rather than this SDK.
* **A standalone collector container with no overlap with the bundled set** (proprietary SaaS, internal data system, an isolated research source) → build a brand-new collector against the SDK using the layout below.
* **Just want a one-off ad-hoc push** → consider [`odd-cli`](/integrations/integrations/odd-cli.md) before authoring code.

Adding a pull adapter to an existing collector is by far the most common path: every adapter inside `odd-collector` started life as a `Plugin` subclass, an `Adapter` class, and a `Generator` subclass.

## SDK packages and versions

* **Python:** the SDK pins `python = "^3.9"` in [`odd-collector-sdk/pyproject.toml`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/pyproject.toml). Any Python 3.9.x or later 3.x interpreter works.
* **Pydantic:** the SDK pins `pydantic = "^2.7.1"`. The `Plugin` base class uses Pydantic v2's `BaseSettings` from the separate `pydantic-settings` package — be careful not to import `BaseSettings` from `pydantic` itself (Pydantic v2 moved it).
* **Scheduler:** APScheduler v3 (`apscheduler = "^3.8.1"`).
* **Async transport:** `aiohttp = "^3.8.1"`.
* **Models:** `odd-models = "^2.0.47"` (the Python Pydantic model package generated from the [ODD Specification](https://github.com/opendatadiscovery/opendatadiscovery-specification)). `DataEntity`, `DataEntityList`, and `DataSource` come from here.
* **ODDRN generator:** `oddrn-generator = "^0.1.101"`. Per-source `Generator` subclasses live in this package (`PostgresqlGenerator`, `SnowflakeGenerator`, `KafkaGenerator`, `FeastGenerator`, …).

Install the SDK into your project with Poetry:

```bash
poetry add odd-collector-sdk
```

## Anatomy of a collector

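A pull collector is a long-running process that performs five steps. The first three run once at startup; the last two repeat on every scheduled run (or run a single time if no schedule is configured):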

1. **Load configuration** from `collector_config.yaml` (and the environment), validated against a Pydantic model.
2. **Discover adapters** by importing the package referenced by each plugin's `type` literal and instantiating the `Adapter` class found in that package.
3. **Register data sources** with the Platform via `POST /ingestion/datasources` once at startup.
4. **Run each adapter on the schedule** via APScheduler, or once if no schedule is configured.
5. **Send the resulting `DataEntityList` to the Platform** via `POST /ingestion/entities`, chunked to keep individual requests bounded.

The SDK provides:

| Component                                                | Purpose                                                                                         | Source                                                                                                                                                                                                                                                               |
| -------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Collector`                                              | Lifecycle entry point — config loading, adapter discovery, register, schedule, shutdown         | [`odd_collector_sdk.collector`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/collector.py)                                                                                                                      |
| `CollectorConfig`                                        | Pydantic model for the top-level YAML schema (token, platform URL, plugins list, runtime knobs) | [`odd_collector_sdk.domain.collector_config`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/collector_config.py)                                                                                          |
| `Plugin`                                                 | Base class for adapter configuration objects (Pydantic `BaseSettings`, `extra="allow"`)         | [`odd_collector_sdk.domain.plugin`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/plugin.py)                                                                                                              |
| `PluginFactory`                                          | `Dict[str, Type[Plugin]]` mapping type literal → Plugin subclass; the discriminator             | [`odd_collector_sdk.types`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/types/__init__.py)                                                                                                                     |
| `AbstractAdapter`, `BaseAdapter`, `AsyncAbstractAdapter` | Three contracts an adapter can implement; SDK dispatches automatically                          | [`odd_collector_sdk.domain.adapter`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/adapter.py)                                                                                                            |
| `Filter`                                                 | Reusable `include` / `exclude` regex filter for ingestion scoping                               | [`odd_collector_sdk.domain.filter`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/filter.py)                                                                                                              |
| `PlatformApi`                                            | Async client for `/ingestion/datasources` and `/ingestion/entities`                             | [`odd_collector_sdk.api.datasource_api`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/api/datasource_api.py)                                                                                                    |
| `BaseSecretsBackend`                                     | Optional pluggable secrets-backend hook for sourcing config from external stores                | [`odd_collector_sdk.secrets.base_secrets`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/secrets/base_secrets.py) — see [Collector secrets backend](/configuration-and-deployment/collectors-secrets-backend.md) |

## Project layout

A custom collector named `my_collector` follows this canonical layout — the same shape every monorepo collector uses:

```
my_collector/
├── my_collector/
│   ├── __init__.py
│   ├── __main__.py                 # entry point — instantiates Collector and calls .run()
│   ├── adapters/
│   │   ├── __init__.py
│   │   ├── my_source/              # adapter package; name MUST match the type literal
│   │   │   ├── __init__.py
│   │   │   ├── adapter.py          # class Adapter(AbstractAdapter | BaseAdapter | AsyncAbstractAdapter)
│   │   │   └── mappers/            # optional — source rows → ODD DataEntity
│   │   └── another_source/
│   │       └── adapter.py
│   └── domain/
│       ├── __init__.py
│       └── plugin.py               # Plugin subclasses + PLUGIN_FACTORY
├── collector_config.yaml           # operator-supplied runtime config
├── pyproject.toml
├── Dockerfile                      # optional, for container deployment
└── README.md
```

Two non-negotiable rules:

* **The directory name under `adapters/` must equal the plugin's `type` literal.** The SDK's [`load_adapters`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/load_adapter.py) imports `{root_package}.{plugins_package}.{plugin.type}` and looks for `package.adapter.Adapter` inside it. A mismatch between the YAML's `type:` field and the directory name will surface as a missing-module ImportError at startup.
* **Each adapter package must expose a class named exactly `Adapter`** in `adapter.py`. The class can extend any of the three adapter contracts below — the SDK detects which one by inspection.
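
A pre-flight check can catch a naming mismatch before the collector starts. The sketch below is not part of the SDK: `adapter_module_path` and `missing_adapter_packages` are hypothetical helpers that only mirror the naming rule described above (`{root_package}.{plugins_package}.{plugin.type}` plus an `adapter.py` module), assuming the default `adapters` subpackage.

```python
import tempfile
from pathlib import Path
from typing import List


def adapter_module_path(root_package: str, plugin_type: str,
                        plugins_package: str = "adapters") -> str:
    """Dotted module path where an `Adapter` class is expected for a type literal."""
    return f"{root_package}.{plugins_package}.{plugin_type}.adapter"


def missing_adapter_packages(project_root: Path, root_package: str,
                             plugin_types: List[str],
                             plugins_package: str = "adapters") -> List[str]:
    """Return the type literals that have no `<type>/adapter.py` on disk."""
    base = project_root / root_package / plugins_package
    return [t for t in plugin_types if not (base / t / "adapter.py").is_file()]


if __name__ == "__main__":
    # Build a throwaway project tree with one adapter package, then check two types.
    with tempfile.TemporaryDirectory() as tmp:
        package_dir = Path(tmp) / "my_collector" / "adapters" / "my_source"
        package_dir.mkdir(parents=True)
        (package_dir / "adapter.py").write_text("class Adapter: ...\n")
        print(adapter_module_path("my_collector", "my_source"))
        print(missing_adapter_packages(Path(tmp), "my_collector",
                                       ["my_source", "another_source"]))
```

Running a check like this in CI against the keys of your `PLUGIN_FACTORY` would flag a renamed directory before deployment. It only verifies that the file exists, not that it defines a class named `Adapter`.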

## Define the plugin (config schema)

A plugin is a Pydantic model that mirrors the YAML shape an operator writes into `collector_config.yaml` for one configured instance of an adapter. Every plugin extends `Plugin` from the SDK:

```python
# my_collector/domain/plugin.py
from typing import Literal, Optional

from odd_collector_sdk.domain.plugin import Plugin
from odd_collector_sdk.types import PluginFactory
from pydantic import SecretStr


class MySourcePlugin(Plugin):
    type: Literal["my_source"]              # MUST match the adapters/{name}/ directory
    host: str
    port: int = 8080
    user: str
    password: SecretStr
    enable_lineage: bool = False


class AnotherSourcePlugin(Plugin):
    type: Literal["another_source"]
    api_key: SecretStr
    base_url: str = "https://api.example.com"


PLUGIN_FACTORY: PluginFactory = {
    "my_source": MySourcePlugin,
    "another_source": AnotherSourcePlugin,
}
```

The base [`Plugin`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/plugin.py) provides three fields every plugin inherits — `name` (required, operator-chosen, unique per collector), `description` (optional metadata), and `namespace` (optional metadata). The base also sets `extra="allow"` on the underlying `pydantic_settings.BaseSettings`, so adapter-specific fields are accepted without further declaration. Use `pydantic.SecretStr` for secrets so they are masked in repr output.

`PLUGIN_FACTORY` is the discriminator the SDK uses to map an entry's `type:` value to the right Pydantic class. **The directory name under `adapters/` must equal the type literal**, and the type literal must equal a key in `PLUGIN_FACTORY`. Three names, one string.
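
To make the discriminator idea concrete, here is a stdlib-only illustration of type-keyed dispatch. It deliberately uses dataclasses as stand-ins rather than the SDK's Pydantic `Plugin`, so `PluginStandIn`, `MySourceStandIn`, `FACTORY`, and `build_plugin` are all hypothetical names; the SDK's own loader may validate and construct plugins differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type


@dataclass
class PluginStandIn:
    """Stand-in for the SDK's Plugin base; only `type` and `name` are modelled."""
    type: str
    name: str


@dataclass
class MySourceStandIn(PluginStandIn):
    """Stand-in for a source-specific plugin with two extra fields."""
    host: str = "localhost"
    port: int = 8080


# Maps the `type:` value of a YAML entry to the class that models it.
FACTORY: Dict[str, Type[PluginStandIn]] = {"my_source": MySourceStandIn}


def build_plugin(raw: Dict[str, Any],
                 factory: Dict[str, Type[PluginStandIn]]) -> PluginStandIn:
    """Pick the class by the entry's `type` value, then construct it from the entry."""
    plugin_type = raw.get("type")
    if plugin_type not in factory:
        raise ValueError(f"unknown plugin type {plugin_type!r}; known: {sorted(factory)}")
    return factory[plugin_type](**raw)


if __name__ == "__main__":
    entry = {"type": "my_source", "name": "prod_db", "host": "db.internal"}
    print(build_plugin(entry, FACTORY))
```

An entry whose `type` is missing from the mapping fails immediately with a message listing the known types, which is the behaviour you want from a discriminator: an unknown source should never be silently parsed as some other plugin.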

For a small inheritance example modelled on the bundled collectors, see [`AwsPlugin` and its subclasses](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-aws/odd_collector_aws/domain/plugin.py) — every AWS adapter inherits a common-auth base; specific adapters declare only the source-specific fields. The same pattern — a private base class plus per-source subclasses — works in any collector.

## Implement the adapter

An adapter does two things: returns the data source's ODDRN and produces a `DataEntityList`. The SDK supports three implementation contracts and dispatches automatically based on the shape of `get_data_entity_list`:

| Base class                                             | When to use                                                                          | `get_data_entity_list` signature                                                                                     |
| ------------------------------------------------------ | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| `AbstractAdapter`                                      | Synchronous source (typical SQL adapter using `psycopg2` or similar blocking driver) | `def get_data_entity_list(self) -> DataEntityList`                                                                   |
| `AsyncAbstractAdapter`                                 | Async source (`aiohttp`, `aiomysql`, `asyncpg`, …)                                   | `async def get_data_entity_list(self) -> DataEntityList`                                                             |
| `BaseAdapter` (concrete subclass of `AbstractAdapter`) | Synchronous source where you already have an `oddrn_generator.Generator` subclass    | inherits `get_data_source_oddrn` from the generator; you implement `create_generator()` and `get_data_entity_list()` |

The SDK's [`create_job`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/job.py) inspects `get_data_entity_list` and instantiates one of:

* `SyncJob` — adapter returns a `DataEntityList` directly; SDK iterates over `.items` in chunks of `chunk_size` (default 250).
* `AsyncJob` — adapter awaits a `DataEntityList`; SDK chunks per-call.
* `AsyncGeneratorJob` — adapter is an `async def ... yield` generator yielding multiple `DataEntityList` objects (or a flat iterable thereof); useful for very large catalogs where building the full list in memory is not affordable.

You don't pick the job type explicitly — define `get_data_entity_list` in whichever shape fits your source's I/O model and the SDK does the rest.
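
The dispatch can be pictured with the standard library's `inspect` module. This is a sketch of the idea only: the SDK's `create_job` may use different checks, and `job_kind` plus the three stand-in adapter classes are hypothetical names introduced for illustration.

```python
import inspect


def job_kind(adapter: object) -> str:
    """Classify an adapter by the shape of its `get_data_entity_list` method."""
    method = getattr(adapter, "get_data_entity_list")
    if inspect.isasyncgenfunction(method):   # `async def` containing `yield`
        return "AsyncGeneratorJob"
    if inspect.iscoroutinefunction(method):  # plain `async def`
        return "AsyncJob"
    return "SyncJob"                          # ordinary `def`


class SyncStandIn:
    def get_data_entity_list(self):
        return []


class AsyncStandIn:
    async def get_data_entity_list(self):
        return []


class StreamingStandIn:
    async def get_data_entity_list(self):
        yield []


if __name__ == "__main__":
    for adapter in (SyncStandIn(), AsyncStandIn(), StreamingStandIn()):
        print(type(adapter).__name__, "->", job_kind(adapter))
```

The practical consequence is that adding a single `yield` to an `async def get_data_entity_list` changes how the method is classified, so keep the method's shape deliberate.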

### `AbstractAdapter` — minimum contract

```python
# my_collector/adapters/my_source/adapter.py
from odd_collector_sdk.domain.adapter import AbstractAdapter
from odd_models.models import DataEntityList
from oddrn_generator import PostgresqlGenerator   # pick the generator that fits

from my_collector.domain.plugin import MySourcePlugin


class Adapter(AbstractAdapter):
    def __init__(self, config: MySourcePlugin) -> None:
        self.config = config
        self._generator = PostgresqlGenerator(
            host_settings=config.host,
            databases=config.name,    # one plugin = one logical source
        )
        # Open whatever client your source needs:
        # self._client = MyClient(host=config.host, port=config.port, ...)

    def get_data_source_oddrn(self) -> str:
        return self._generator.get_data_source_oddrn()

    def get_data_entity_list(self) -> DataEntityList:
        # 1. Fetch raw catalog rows from the source.
        # 2. Map each row to an odd_models.DataEntity (a mappers/ subpackage is the
        #    conventional place — it isolates ODDRN-stitching code from I/O).
        # 3. Return them under the source's ODDRN.
        items = []   # list[DataEntity]
        return DataEntityList(
            data_source_oddrn=self.get_data_source_oddrn(),
            items=items,
        )
```

The SDK injects the matching `Plugin` instance as the constructor argument — so `config` is already validated against your Pydantic model. Always store it as `self.config`; if you don't, the SDK's [`load_adapters`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/load_adapter.py) will fall back to setting `adapter.config = plugin` after construction so the scheduler can read `adapter.config.name` for log labels.

For a real working example of this shape, the [Feast adapter](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector/odd_collector/adapters/feast/adapter.py) is concise and self-contained — it constructs a `FeatureStore`, holds a `FeastGenerator`, and returns a sync `DataEntityList` derived from `self.__feature_store.list_feature_views()`.

### `AsyncAbstractAdapter` — async I/O

```python
import aiohttp
from odd_collector_sdk.domain.adapter import AsyncAbstractAdapter
from odd_models.models import DataEntityList


class Adapter(AsyncAbstractAdapter):
    def __init__(self, config) -> None:
        self.config = config
        # async client setup is typically deferred to the first call so the
        # event loop is available; or use an aiohttp.ClientSession context.

    def get_data_source_oddrn(self) -> str:
        ...  # return the data source ODDRN, e.g. from an oddrn_generator Generator

    async def get_data_entity_list(self) -> DataEntityList:
        items = []  # list[DataEntity]
        async with aiohttp.ClientSession() as session:
            ...  # fetch from the source with `session` and map rows into `items`
        return DataEntityList(
            data_source_oddrn=self.get_data_source_oddrn(),
            items=items,
        )
```

### `BaseAdapter` — generator-driven shape

`BaseAdapter` is a concrete `AbstractAdapter` subclass that bundles the common pattern of "I have an `oddrn_generator.Generator` and I want it to drive `get_data_source_oddrn`":

```python
from odd_collector_sdk.domain.adapter import BaseAdapter
from odd_models.models import DataEntityList
from oddrn_generator import Generator, PostgresqlGenerator

from my_collector.domain.plugin import MySourcePlugin


class Adapter(BaseAdapter):
    config: MySourcePlugin     # type the inherited attribute for IDE help

    def create_generator(self) -> Generator:
        return PostgresqlGenerator(host_settings=self.config.host, databases=self.config.name)

    def get_data_entity_list(self) -> DataEntityList:
        # self.generator is already populated by BaseAdapter.__init__
        ...
```

`BaseAdapter` saves the boilerplate of declaring `__init__` and `get_data_source_oddrn` yourself; pick it whenever the generator-on-self pattern fits.

### Async generator — for very large catalogs

If your source's catalog is large enough that materialising it as a single `DataEntityList` is a memory concern, declare `get_data_entity_list` as an async generator and yield batches:

```python
from typing import AsyncGenerator

from odd_collector_sdk.domain.adapter import AsyncAbstractAdapter
from odd_models.models import DataEntityList


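class Adapter(AsyncAbstractAdapter):
    ...  # __init__ and get_data_source_oddrn as in the earlier examples

    async def get_data_entity_list(self) -> AsyncGenerator[DataEntityList, None]:
        # `self._client.iter_pages()` and `map_row_to_data_entity` are placeholders
        # for your own paging client and row mapper; they are not SDK APIs.
        async for page in self._client.iter_pages():
            yield DataEntityList(
                data_source_oddrn=self.get_data_source_oddrn(),
                items=[map_row_to_data_entity(row) for row in page],
            )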
```

The SDK's `AsyncGeneratorJob` further sub-chunks each yielded list to `chunk_size` items per request, so you can yield large pages and still respect the request-size limit.
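
The sub-chunking described above can be sketched with plain lists. This is an illustration under stated assumptions, not the SDK's code: `chunked`, `pages`, and `collect_requests` are hypothetical names, integers stand in for `DataEntity` objects, and each inner list represents the items of one request.

```python
import asyncio
from typing import AsyncIterator, Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Split a sequence into consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


async def pages() -> AsyncIterator[List[int]]:
    """Stand-in for an adapter that yields one large page after another."""
    yield list(range(5))
    yield list(range(5, 7))


async def collect_requests(chunk_size: int) -> List[List[int]]:
    """Sub-chunk every yielded page; each inner list models one request body."""
    requests: List[List[int]] = []
    async for page in pages():
        requests.extend(chunked(page, chunk_size))
    return requests


if __name__ == "__main__":
    print(asyncio.run(collect_requests(chunk_size=2)))
```

Note that chunks never span two yielded pages in this sketch: a page of five items with a chunk size of two produces three requests, and the next page starts a fresh one.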

## Wire up the entry point

`__main__.py` instantiates the SDK's `Collector` and hands control to its `run()` method, which sets up signal handlers (`SIGHUP`, `SIGTERM`, `SIGINT`), registers data sources, and either schedules the polling loop or runs once and exits depending on whether `default_pulling_interval` is set:

```python
# my_collector/__main__.py
import logging
from os import path

from odd_collector_sdk.collector import Collector

from my_collector.domain.plugin import PLUGIN_FACTORY


def main() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="[%(asctime)s] %(levelname)s in %(module)s: %(message)s",
    )

    config_path = path.join(path.dirname(path.realpath(__file__)), "../collector_config.yaml")

    collector = Collector(
        config_path=config_path,
        root_package="my_collector",        # parent of the adapters/ subpackage
        plugin_factory=PLUGIN_FACTORY,
    )
    collector.run()


if __name__ == "__main__":
    main()
```

Then run with:

```bash
poetry run python -m my_collector
```

The `Collector` constructor signature is `Collector(config_path, root_package, plugin_factory, plugins_package="adapters")`. Override `plugins_package` only if your project nests adapters under a non-default subpackage. `run(loop=None)` accepts an existing event loop — supply one when integrating into a host process; omit it for standalone collector containers.

## End-to-end skeleton

A minimal-but-complete custom collector that does nothing useful but starts cleanly:

```python
# my_collector/domain/plugin.py
from typing import Literal

from odd_collector_sdk.domain.plugin import Plugin
from odd_collector_sdk.types import PluginFactory


class HelloPlugin(Plugin):
    type: Literal["hello"]
    greeting: str = "hello"


PLUGIN_FACTORY: PluginFactory = {"hello": HelloPlugin}
```

```python
# my_collector/adapters/hello/adapter.py
from odd_collector_sdk.domain.adapter import AbstractAdapter
from odd_models.models import DataEntityList


class Adapter(AbstractAdapter):
    def __init__(self, config) -> None:
        self.config = config

    def get_data_source_oddrn(self) -> str:
        # Replace with a real ODDRN once you know what your source's identity is.
        return f"//hello/host/{self.config.name}"

    def get_data_entity_list(self) -> DataEntityList:
        return DataEntityList(
            data_source_oddrn=self.get_data_source_oddrn(),
            items=[],   # no real entities yet
        )
```

```yaml
# collector_config.yaml
platform_host_url: http://localhost:8080
token: <COLLECTOR_TOKEN>
default_pulling_interval: 10
plugins:
  - type: hello
    name: hello_world
    greeting: hi
```

```python
# my_collector/__main__.py — see the entry point section above
```

Run it:

```bash
poetry add odd-collector-sdk odd-models
poetry run python -m my_collector
```

The collector starts, registers a `//hello/host/hello_world` data source against the platform, attempts to send a `DataEntityList` every 10 minutes, and exits cleanly on SIGINT.

{% hint style="warning" %}
**The skeleton above emits an *empty* `items` list, and the platform rejects an empty batch.** `POST /ingestion/entities` requires a non-empty item list and raises `BadUserRequestException("Ingestion payload is empty")`, an **HTTP 400**, when `items` is empty. This scaffold therefore logs a `400` on every scheduled cycle until `get_data_entity_list` returns real entities; it does not silently succeed. That is expected for an empty skeleton: replace the `items=[]` body with real source extraction and the 400 stops. The empty-payload rule lives platform-side in `IngestionController.postDataEntityList`. The success status for a non-empty push is **`200`** (the Ingress API spec declares `201`, but the platform currently returns `200`, so code your client to accept `200`).
{% endhint %}

Replace the `get_data_entity_list` body with real source extraction and you have a working adapter.

## Generate ODDRNs

Every entity the adapter emits — the data source itself, datasets, columns, jobs, transformers — needs an [ODDRN](/introduction/main-concepts.md#oddrn). ODDRNs are how the platform recognises the same entity across ingests, across collectors, and over time; getting them right is what makes cross-system lineage possible.

Use the [`oddrn-generator`](https://pypi.org/project/oddrn-generator/) Python package — it ships per-source `Generator` subclasses (`PostgresqlGenerator`, `SnowflakeGenerator`, `KafkaGenerator`, `FeastGenerator`, `MysqlGenerator`, `MssqlGenerator`, …). Pick the one that matches your source category, or subclass `Generator` for a brand-new source family. Set the host / database / namespace once at adapter construction and let the generator stitch path components onto each entity ODDRN.

The Java equivalent (`oddrn-generator-java`) exists for JVM-side push adapters. See [ODDRN](/introduction/main-concepts.md#oddrn) for the format, examples, and consumer requirements (the same ODDRN must identify the same entity across producers — coordinate hostnames and identifiers across your deployment).
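
To show why coordinated identifiers matter, the sketch below stitches path components onto a source prefix by hand. It is not the `oddrn-generator` API: `build_oddrn` is a hypothetical helper, and the path layout (`//<source>/host/<host>/<kind>/<name>/...`) is used here for illustration only. In a real adapter, let the `Generator` subclass produce these strings.

```python
from typing import Sequence, Tuple


def build_oddrn(source: str, host: str,
                path: Sequence[Tuple[str, str]] = ()) -> str:
    """Join a source prefix, a host, and ordered (kind, name) pairs into one string."""
    parts = [f"//{source}/host/{host}"]
    parts += [f"{kind}/{name}" for kind, name in path]
    return "/".join(parts)


if __name__ == "__main__":
    table_path = [("databases", "sales"), ("schemas", "public"), ("tables", "orders")]

    # Two producers describing the same table, one by DNS name and one by IP.
    from_collector = build_oddrn("postgresql", "db.internal", table_path)
    from_push_job = build_oddrn("postgresql", "10.0.0.5", table_path)

    print(from_collector)
    print(from_push_job)
    print("same entity on the platform:", from_collector == from_push_job)
```

Because the identifier is just a string, the two producers above end up describing two different entities even though they point at the same table. Agreeing on one hostname form across the deployment avoids that split.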

## Runtime configuration

The full reference for the top-level `collector_config.yaml` shape — `platform_host_url`, `token`, `default_pulling_interval`, `plugins`, plus `connection_timeout_seconds`, `chunk_size`, `misfire_grace_time`, `max_instances`, `verify_ssl` — lives once at [Build and run ODD Collectors → Full configuration reference](https://github.com/opendatadiscovery/documentation/blob/main/docs/developer-guides/build-and-run/build-and-run/build-and-run-odd-collectors.md#full-configuration-reference). That page is the operator-side companion to this developer guide; treat it as the runtime-config source of truth.

Custom collectors share that exact schema — your adapter only adds per-plugin fields under each `plugins[*]` entry. Setting `default_pulling_interval` to a positive integer (in minutes) makes the collector poll on that cadence; leaving it unset makes the collector run all adapters once and exit (useful for one-shot ingestion in CI / cron).
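
The poll-or-run-once rule can be written down as a small decision function. This is a sketch of the rule as stated above, not the SDK's implementation: `polling_interval` is a hypothetical helper, and rejecting non-positive values is an assumption of this example rather than documented SDK behaviour.

```python
from datetime import timedelta
from typing import Any, Mapping, Optional


def polling_interval(config: Mapping[str, Any]) -> Optional[timedelta]:
    """Return the polling cadence, or None when the collector should run once and exit."""
    minutes = config.get("default_pulling_interval")
    if minutes is None:
        return None  # unset: one-shot run
    # Assumption of this sketch: anything other than a positive integer is an error.
    if not isinstance(minutes, int) or minutes <= 0:
        raise ValueError("default_pulling_interval must be a positive integer (minutes)")
    return timedelta(minutes=minutes)


if __name__ == "__main__":
    print(polling_interval({"default_pulling_interval": 10}))
    print(polling_interval({}))
```

For CI or cron usage, the unset case is the one to rely on: the schedule lives in the external scheduler and the collector process exits after a single pass.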

For sourcing config values from an external secret store instead of inline YAML, see [Collector secrets backend](/configuration-and-deployment/collectors-secrets-backend.md). The SDK ships an `AWSSystemsManagerParameterStore` provider and a `BaseSecretsBackend` abstract class for additional providers.

## Package and deploy

The bundled monorepo collectors package as Docker images. The same Dockerfile shape works for a custom collector:

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install Poetry and copy the project.
RUN pip install --no-cache-dir poetry==1.8.3
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create false && poetry install --only main --no-root

COPY my_collector ./my_collector

# collector_config.yaml is mounted at runtime, not baked in.
ENV CONFIG_PATH=/app/collector_config.yaml

ENTRYPOINT ["python", "-m", "my_collector"]
```

```bash
docker build -t my-collector:latest .
docker run --rm \
  -v "$(pwd)/collector_config.yaml:/app/collector_config.yaml" \
  -e TOKEN="$COLLECTOR_TOKEN" \
  my-collector:latest
```

The SDK's [`CollectorConfigLoader`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/domain/collector_config_loader.py) (instantiated and called from `Collector.__init__` at `collector.py:61` as `CollectorConfigLoader(config_path, plugin_factory).load()`) reads the path passed to `Collector(config_path=...)`, falling back to `$CONFIG_PATH`, then to `./collector_config.yaml`. Mount the operator-supplied YAML at the path your `__main__.py` resolves to. Token and other secrets are best supplied via environment variables resolved by `pyaml-env` (the SDK's YAML loader) rather than baked into the image.
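
The lookup order described above (explicit argument, then `$CONFIG_PATH`, then `./collector_config.yaml`) can be sketched as follows. `resolve_config_path` and `DEFAULT_CONFIG_PATH` are hypothetical names for illustration; the SDK's loader is the source of truth for the actual behaviour.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "./collector_config.yaml"


def resolve_config_path(explicit: Optional[str],
                        env: Mapping[str, str] = os.environ) -> str:
    """Apply the documented order: explicit path, then $CONFIG_PATH, then the default."""
    if explicit:
        return explicit
    return env.get("CONFIG_PATH") or DEFAULT_CONFIG_PATH


if __name__ == "__main__":
    fake_env = {"CONFIG_PATH": "/app/collector_config.yaml"}
    print(resolve_config_path("/etc/odd/config.yaml", fake_env))  # explicit wins
    print(resolve_config_path(None, fake_env))                    # env fallback
    print(resolve_config_path(None, {}))                          # default
```

Under this order, an entry point that always passes `config_path` never consults `CONFIG_PATH`, so the environment variable only matters if your `__main__.py` passes no explicit path.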

For Kubernetes deployments, follow the same pattern as the bundled collectors' Helm charts at [`opendatadiscovery/charts`](https://github.com/opendatadiscovery/charts) — a single Deployment running one container per collector, with `collector_config.yaml` as a ConfigMap or Secret volume.

## Testing locally

Run a one-shot pull against a local Platform to smoke-check the adapter loads and the data source registers:

1. Start the [ODD Platform locally](/configuration-and-deployment/trylocally.md).
2. In the platform UI, create a Collector entity (Management → Collectors) and copy the issued token.
3. Set `token: <COLLECTOR_TOKEN>` in `collector_config.yaml` and remove `default_pulling_interval` (so the collector runs once and exits — much faster feedback than a 10-minute schedule loop).
4. `poetry run python -m my_collector` and watch the log for the `DataSource` registration POST and the per-adapter `[name] collecting metadata started.` / `[name] metadata collected in …` lines.
5. Open the platform UI's Catalog page and confirm the data source appears.

When something doesn't work, the SDK logs are the first place to look — `loguru` writes structured per-adapter lines (the format is set in [`logger.py`](https://github.com/opendatadiscovery/odd-collectors/blob/main/odd-collector-sdk/odd_collector_sdk/logger.py)). Most failures fall into three buckets:

* **`LoadConfigError` at startup** — a YAML field doesn't match the Pydantic model. The exception body names the offending field path.
* **`ImportError` looking for `package.adapter.Adapter`** — the directory under `adapters/` doesn't match the plugin's `type` literal, or `adapter.py` doesn't expose a class named exactly `Adapter`.
* **`PlatformApiError` on `register_datasource` / `ingest_data`**: the token, platform URL, or TLS verification (`verify_ssl`) is wrong. Double-check `platform_host_url` and that the token belongs to a Collector entity that is registered in the platform. Two platform-side specifics are worth knowing:
  1. Data-source registration (`POST /ingestion/datasources`) resolves the collector token to a *session* and throws `IllegalStateException("Collector id is null")`, surfacing as a **5xx, not a 401**, when the token does not resolve to a registered Collector (a wrong or unregistered token, or a non-sticky load balancer that lands the request on an instance without the collector session).
  2. Several distinct client-side errors on `POST /ingestion/entities` (duplicate ODDRN within a batch, unknown `data_source_oddrn`, oversized payload) also surface as opaque 5xx rather than 4xx. See the [ingestion error contract](/integrations/integrations.md#ingestion-error-contract) on the Integrations hub for the full table and the recommended client behaviour (do not blindly retry these).

## Where to look in the SDK

When the doc above doesn't cover a specific question, read the source — it is small enough to navigate directly:

* `odd-collector-sdk/odd_collector_sdk/collector.py` — the `Collector` class and its lifecycle (`run`, `start_polling`, `register_data_sources`, `one_time_run`).
* `odd-collector-sdk/odd_collector_sdk/domain/adapter.py` — the three adapter contracts.
* `odd-collector-sdk/odd_collector_sdk/domain/plugin.py` — the `Plugin` base and the `Config` alias.
* `odd-collector-sdk/odd_collector_sdk/domain/collector_config.py` — `CollectorConfig` (the runtime Pydantic model). The module also defines a `load_config` helper, but it is **test-only** (called only from `tests/test_module_importer.py`) — do not use it for runtime config loading.
* `odd-collector-sdk/odd_collector_sdk/domain/collector_config_loader.py` — `CollectorConfigLoader` (the runtime config loader). `Collector.__init__` instantiates it as `CollectorConfigLoader(config_path, plugin_factory).load()`; it integrates with the optional secrets backend, merges priority-ordered settings and plugins, and returns a validated `CollectorConfig`.
* `odd-collector-sdk/odd_collector_sdk/load_adapter.py` — adapter package discovery and instantiation.
* `odd-collector-sdk/odd_collector_sdk/job.py` — `SyncJob` / `AsyncJob` / `AsyncGeneratorJob` and the `create_job` dispatch.
* `odd-collector-sdk/odd_collector_sdk/api/datasource_api.py` — the Ingress API client.
* `odd-collector-sdk/odd_collector_sdk/secrets/` — the `BaseSecretsBackend` and the AWS SSM provider.

For working examples of each adapter shape, the bundled collectors in [`odd-collectors`](https://github.com/opendatadiscovery/odd-collectors) are the best teachers — every type literal you see on the [generic collector page](/integrations/integrations/odd-collector.md) maps to a real adapter under `odd-collector/odd_collector/adapters/{type}/adapter.py` that you can read end-to-end.

## Further reading

* [Core concepts of creating a new Adapter for ODD Collector](https://medium.com/opendatadiscovery/core-concepts-of-creating-a-new-adapter-for-odd-collector-fa9d7b6ca7a6) — a high-level walkthrough on Medium that complements this reference. Useful as a tour of the same ground in narrative form, with end-to-end framing of the plugin → adapter → generator → entry-point progression.

## Contribute back

If your custom adapter targets a source that other operators are likely to use, contribute it to the appropriate bundled collector instead of maintaining a fork. The contribution flow follows the standard ODD process — fork, branch, PR — see [How to contribute](https://github.com/opendatadiscovery/documentation/blob/main/docs/developer-guides/build-and-run/how-to-contribute.md). New adapters generally go into:

* `odd-collector` — generic data sources (databases, BI, streams, MLOps).
* `odd-collector-aws` / `odd-collector-azure` / `odd-collector-gcp` — cloud-native sources, when there's a clear cloud affinity.
* A standalone push-adapter repo (`odd-dbt`, `odd-airflow-2`, `odd-spark-adapter`, …) — for source-runtime-embedded integrations.

Custom collectors that don't fit the bundled pattern can also live as separate community-maintained repos that depend on `odd-collector-sdk` directly; see [GitHub organization overview](https://github.com/opendatadiscovery/documentation/blob/main/docs/developer-guides/build-and-run/github-organization-overview.md) for the existing repo set and the role of each.

