# Build and run ODD Collectors

For instructions on how to run the ODD Platform and ODD Collectors locally in a Docker environment, follow the [Try locally](/configuration-and-deployment/trylocally.md) article.

## ODD Collectors tech stack

There are 4 main collectors at the moment, all bundled in the [`odd-collectors`](https://github.com/opendatadiscovery/odd-collectors) monorepo:

* [**ODD Collector**](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector) — covering databases, BI tools, data warehouses, etc. (per-adapter reference: [odd-collector](/integrations/integrations/odd-collector.md))
* [**ODD Collector AWS**](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-aws) — covering AWS services (per-adapter reference: [odd-collector-aws](/integrations/integrations/odd-collector-aws.md))
* [**ODD Collector GCP**](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-gcp) — covering GCP services (per-adapter reference: [odd-collector-gcp](/integrations/integrations/odd-collector-gcp.md))
* [**ODD Collector Azure**](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-azure) — covering Azure services (Azure SQL, Data Factory, Blob Storage, Power BI; per-adapter reference: [odd-collector-azure](/integrations/integrations/odd-collector-azure.md))

A specialist profiler collector — [**ODD Collector Profiler**](https://github.com/opendatadiscovery/odd-collector-profiler) — lives in its own repo and produces statistical profiles for Postgres / Azure SQL sources; its per-adapter reference is at [odd-collector-profiler](/integrations/integrations/odd-collector-profiler.md). For the broader pull / push picture (how the four monorepo collectors compare to push adapters like dbt, Spark, Airflow, Great Expectations), see the [Integrations hub](/integrations/integrations.md).

ODD Collector AWS uses [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), ODD Collector GCP uses Google's Cloud SDKs, and ODD Collector Azure uses the Azure SDK family; ODD Collector itself relies on per-data-source libraries.

> The previously standalone `odd-collector`, `odd-collector-aws`, `odd-collector-gcp`, and `odd-collector-azure` repositories were archived on 2023-11-06 and consolidated into the `odd-collectors` monorepo above. New work happens only in the monorepo.

The general tech stack is:

* Python
* Poetry
* asyncio

## Prerequisites

* Python 3.9 or higher (the monorepo's `pyproject.toml` files all declare `python = "^3.9"`, which Poetry reads as `>=3.9,<4.0`, so any 3.9.x or later 3.x interpreter works)
* [Poetry](https://python-poetry.org/) 1.2.0
* [Docker Engine 19.03.0+](https://docs.docker.com/engine/install/)
* preferably the latest [docker-compose](https://docs.docker.com/compose/install/)
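To confirm that the interpreter on your `PATH` meets the Python constraint listed above, a quick check along these lines may help. This is a minimal sketch: the function name `satisfies_caret_3_9` is hypothetical and not part of the collectors' codebase; it only mirrors how Poetry interprets `^3.9`.

```python
import sys


def satisfies_caret_3_9(version=None):
    """Return True if `version` falls in >=3.9,<4.0 (Poetry's reading of ^3.9).

    `version` is any (major, minor, ...) tuple; defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and minor >= 9


if __name__ == "__main__":
    status = "OK" if satisfies_caret_3_9() else "unsupported (need 3.9 or a later 3.x)"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the same interpreter you plan to hand to Poetry, since a machine can have several Python versions installed side by side.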

## Build ODD Collector into Docker container

Fork and clone the [`odd-collectors`](https://github.com/opendatadiscovery/odd-collectors) monorepo if you haven't already.

```shell
git clone https://github.com/{username}/odd-collectors.git
```

Go into the directory of the sub-collector you want to build (one of `odd-collector`, `odd-collector-aws`, `odd-collector-gcp`, `odd-collector-azure`):

```shell
cd odd-collectors/odd-collector
```

Run the following command, replacing `<tag>` with any tag name you'd like:

```shell
docker build . -t odd-collector:<tag>
```

## Run ODD Collector locally

### Run ODD Platform locally as a target for ODD Collector

To run ODD Platform locally, follow [this guide](/configuration-and-deployment/trylocally.md).

### Activate environment

Go into the sub-collector's directory inside the monorepo (substituting the sub-collector you want to run):

```shell
cd odd-collectors/odd-collector
```

Run the following command to create a local Python environment and install the dependencies:

```shell
poetry install
```

Switch your Python context to the newly created environment:

```shell
poetry shell
```

### Configure ODD Collector to send request to target catalog

Create a collector in the ODD Platform and copy the generated token by following [this guide](/configuration-and-deployment/trylocally.md#create-collector-entity).

Configure `collector-config.yaml` for the adapters you intend to run. The canonical per-adapter reference (field-by-field plugin shape, defaults, supported features) lives on the per-collector pages of the [Integrations hub](/integrations/integrations.md):

* [odd-collector](/integrations/integrations/odd-collector.md)
* [odd-collector-aws](/integrations/integrations/odd-collector-aws.md)
* [odd-collector-azure](/integrations/integrations/odd-collector-azure.md)
* [odd-collector-gcp](/integrations/integrations/odd-collector-gcp.md)
* [odd-collector-profiler](/integrations/integrations/odd-collector-profiler.md)

The raw upstream YAML examples are also browsable in each collector's `config_examples/` directory:

* [odd-collector](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector/config_examples)
* [odd-collector-aws](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-aws/config_examples)
* [odd-collector-gcp](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-gcp/config_examples)
* [odd-collector-azure](https://github.com/opendatadiscovery/odd-collectors/tree/main/odd-collector-azure/config_examples)

Replace `<COLLECTOR_TOKEN>` with the token obtained in the previous step.

```yaml
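default_pulling_interval: 10 # minutes between pulls
token: <COLLECTOR_TOKEN>
platform_host_url: http://localhost:8080
plugins:
  - type: my_adapter # placeholder adapter type
    some_field_one: str # placeholder fields; the real ones depend on the adapter
    some_field_two: int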
```
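Before starting the collector, it can help to sanity-check the parsed configuration. The following is a minimal sketch, not part of the collector SDK: it assumes the YAML has already been loaded into a plain `dict` (for example with PyYAML), treats `platform_host_url`, `token`, and `plugins` as the required top-level keys listed in the configuration reference on this page, and assumes every plugin entry carries a `type` key as in the example. The helper name `check_collector_config` is hypothetical.

```python
REQUIRED_FIELDS = ("platform_host_url", "token", "plugins")
TOKEN_PLACEHOLDER = "<COLLECTOR_TOKEN>"


def check_collector_config(config):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in config
    ]

    # The template token must be swapped for a real one issued by the platform.
    if config.get("token") == TOKEN_PLACEHOLDER:
        problems.append("token is still the <COLLECTOR_TOKEN> placeholder")

    plugins = config.get("plugins")
    if "plugins" in config and not isinstance(plugins, list):
        problems.append("plugins must be a list")
    elif isinstance(plugins, list):
        # Assumption: each plugin entry is a mapping with a `type` key.
        for index, plugin in enumerate(plugins):
            if not isinstance(plugin, dict) or "type" not in plugin:
                problems.append(f"plugins[{index}] has no 'type' key")

    return problems


if __name__ == "__main__":
    example = {
        "default_pulling_interval": 10,
        "token": "<COLLECTOR_TOKEN>",
        "platform_host_url": "http://localhost:8080",
        "plugins": [{"type": "my_adapter"}],
    }
    for problem in check_collector_config(example):
        print(problem)
```

The check is deliberately shallow: adapter-specific fields are left to each adapter's own validation, so it only catches mistakes that would affect any collector.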

### Full configuration reference

The example above is the minimum that gets a collector running. The collector SDK's `CollectorConfig` accepts the following top-level fields:

| Field                        | Type              | Default                                               | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| ---------------------------- | ----------------- | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `platform_host_url`          | string            | required                                              | URL of the ODD Platform that the collector pushes metadata to.                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| `token`                      | string            | required                                              | Collector token issued by the platform (see [Create Collector entity](/configuration-and-deployment/trylocally.md#create-collector-entity)).                                                                                                                                                                                                                                                                                                                                                                  |
| `plugins`                    | list              | required                                              | Adapter configurations — each entry is one configured connection.                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| `default_pulling_interval`   | integer (minutes) | unset                                                 | Polling cadence applied to every plugin. When unset, each plugin runs once and the collector exits — useful for one-shot ingestion in CI / cron.                                                                                                                                                                                                                                                                                                                                                              |
| `connection_timeout_seconds` | integer           | `300`                                                 | HTTP timeout for requests from the collector to the ODD Platform's Ingestion API. Does **not** affect adapter-to-source connections — those are governed by each adapter's own client. Raise it when the platform is slow to acknowledge large pushes.                                                                                                                                                                                                                                                        |
| `chunk_size`                 | integer           | `250`                                                 | Maximum number of data entities batched into a single Ingestion API request. Lower values reduce memory pressure when ingesting very large catalogs; higher values reduce request count.                                                                                                                                                                                                                                                                                                                      |
| `misfire_grace_time`         | integer (seconds) | unset (falls back to `default_pulling_interval × 60`) | APScheduler grace window — if a scheduled run is delayed by more than this many seconds (e.g. previous run still executing, host pause), the missed run is dropped instead of firing late. When this field is unset, the SDK substitutes `default_pulling_interval × 60` seconds — i.e. one full polling interval expressed in seconds — so missed runs are tolerated for up to one interval before being dropped. Set explicitly only when you want a tighter or looser tolerance than one polling interval. |
| `max_instances`              | integer           | `1`                                                   | Maximum concurrent runs of the same plugin. The default prevents a slow source from queuing overlapping pulls; raise it only when a plugin is explicitly safe to run in parallel.                                                                                                                                                                                                                                                                                                                             |
| `verify_ssl`                 | boolean           | `true`                                                | Whether the collector verifies the ODD Platform's TLS certificate on every Ingestion API call. Set to `false` only when the platform is served behind a self-signed certificate (development clusters, air-gapped deployments) — disabling certificate verification in production is a security risk.                                                                                                                                                                                                         |
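Two of the rules in the table are easier to see in code: the `misfire_grace_time` fallback and the batching implied by `chunk_size`. The sketch below only illustrates the documented behaviour and is not the SDK's actual implementation. The function names `effective_misfire_grace_time` and `chunked` are hypothetical, and returning `None` when no polling interval is set is an assumption (in one-shot mode nothing is scheduled, so no grace window applies).

```python
from itertools import islice


def effective_misfire_grace_time(default_pulling_interval, misfire_grace_time=None):
    """Grace window in seconds, following the fallback rule from the table.

    `default_pulling_interval` is in minutes, `misfire_grace_time` in seconds.
    """
    if misfire_grace_time is not None:
        return misfire_grace_time
    if default_pulling_interval is None:
        # Assumption: one-shot mode schedules nothing, so there is no window.
        return None
    return default_pulling_interval * 60


def chunked(entities, chunk_size=250):
    """Yield successive lists of at most `chunk_size` items from `entities`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(entities)
    while batch := list(islice(iterator, chunk_size)):
        yield batch


if __name__ == "__main__":
    print(effective_misfire_grace_time(10))                 # interval of 10 minutes
    print([len(batch) for batch in chunked(range(600))])    # 600 entities, default size
```

With the default `chunk_size`, 600 entities would go out as three requests (250, 250, and 100 entities); lowering `chunk_size` trades more requests for smaller payloads.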

### Run ODD Collector

Run ODD Collector locally using the following command:

```shell
sh ./start.sh
```

## How to implement new integration

For authoring a new pull adapter — defining a `Plugin` subclass, implementing the `AbstractAdapter` / `BaseAdapter` / `AsyncAbstractAdapter` contract, wiring `PLUGIN_FACTORY`, generating ODDRNs, packaging, and contributing back — see the dedicated [Build a custom collector](/developer-guides/build-and-run/custom-collectors.md) developer guide.

For push-strategy integrations (Airflow, dbt, Spark, Great Expectations, custom CI/CD), see the per-tool pages under the [Integrations hub](/integrations/integrations.md) — those follow each host system's plugin or listener API rather than the pull-collector SDK.

## Troubleshooting

### Running ODD Collector on M1

The `pyodbc`, `confluent-kafka`, and `grpcio` libraries are known to fail during installation and build on M1 MacBooks. See the upstream issues:

* [mkleehammer/pyodbc#846](https://github.com/mkleehammer/pyodbc/issues/846)
* [confluentinc/confluent-kafka-python#1190](https://github.com/confluentinc/confluent-kafka-python/issues/1190)
* [grpc/grpc#25082](https://github.com/grpc/grpc/issues/25082)

Possible solution:

{% hint style="info" %}
The easiest way is to add all of the `export` statements to your `.bashrc` or `.zshrc` file.
{% endhint %}

```shell
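# pyodbc dependencies
brew install unixodbc freetds openssl

# Compiler and linker flags pointing at the Homebrew-installed libraries.
# The Cellar paths embed version numbers (2.3.11 here); adjust them to the
# versions reported by `brew list --versions unixodbc`.
export LDFLAGS="-L/opt/homebrew/lib -L/opt/homebrew/Cellar/unixodbc/2.3.11/lib -L/opt/homebrew/opt/freetds/lib -L/opt/homebrew/opt/openssl@3/lib"
export CFLAGS="-I/opt/homebrew/Cellar/unixodbc/2.3.11/include -I/opt/homebrew/opt/freetds/include"
export CPPFLAGS="-I/opt/homebrew/include -I/opt/homebrew/Cellar/unixodbc/2.3.11/include -I/opt/homebrew/opt/openssl@3/include"

# confluent-kafka (builds against librdkafka; adjust 1.9.0 to your installed version,
# see `brew list --versions librdkafka`)
brew install librdkafka
export C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/1.9.0/include
export LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/1.9.0/lib
export PATH="/opt/homebrew/opt/openssl@3/bin:$PATH"

# grpcio: build against the system OpenSSL and zlib instead of the bundled copies
export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1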
```

