odd-collector-profiler

Statistical data-profiling collector — runs DataProfiler against datasets and pushes per-dataset statistics to the platform.

Status: Stable, narrow scope. Released as its own Docker image. Currently supports profiling of PostgreSQL and Azure SQL sources only.

odd-collector-profiler is a separate single-purpose collector that runs Capital One's DataProfiler library against your datasets and pushes the resulting statistics (column-level distributions, null ratios, type detection, basic anomaly tags) into the catalog. It is the source of the Statistics view shown on a dataset's detail page in the platform UI.

It complements (does not replace) one of the regular pull collectors — odd-collector-profiler reads sample rows out of the source to compute statistics; the regular collector still does the schema-level catalog ingestion.

For the broader pull-vs-push picture, start at the Integrations hub.

Supported sources

| Profiler type literal | Source system |
| --- | --- |
| postgres | PostgreSQL |
| azure_sql | Azure SQL Database |

Source: profiler README. Other source types (Snowflake, BigQuery, MySQL, etc.) are not currently profiled by this collector.

Installation

docker pull ghcr.io/opendatadiscovery/odd-collector-profiler:latest

Or build from source with docker build . -t odd_collector_profiler from the repo root.

Configuration shape

The profiler config layout resembles that of a regular collector, with two differences: the list field is profilers: rather than plugins:, and the per-profiler fields differ slightly. Most notably, each profiler takes an explicit tables: list that scopes which tables get profiled. You typically do not profile every table in a database, because DataProfiler reads sample rows and running it across the full surface would be expensive.

| Top-level field | Description |
| --- | --- |
| default_pulling_interval | Minutes between profiling runs. Profilers can be expensive; common values are 60–360. |
| token | Collector token issued by the platform. |
| platform_host_url | ODD Platform URL. |
| profilers | List of per-source profiler configs. |

Source: profiler README → Config example.
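
As a rough illustration of the shape described above, the following Python sketch mirrors the four top-level fields as a dictionary and checks that they are present. The type and tables keys come from this page; the placeholder values, the sample table name, and the check_top_level helper are hypothetical and not part of the collector. The actual config file format and the remaining per-profiler fields are documented in the profiler README.

```python
# Hypothetical helper: checks only the four top-level fields named in the table above.
REQUIRED_TOP_LEVEL = (
    "default_pulling_interval",
    "token",
    "platform_host_url",
    "profilers",
)


def check_top_level(config: dict) -> list[str]:
    """Return a list of problems found in the top-level config shape."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in config]
    interval = config.get("default_pulling_interval")
    if interval is not None and (not isinstance(interval, int) or interval <= 0):
        problems.append("default_pulling_interval should be a positive number of minutes")
    if "profilers" in config and not isinstance(config["profilers"], list):
        problems.append("profilers should be a list")
    return problems


example_config = {
    "default_pulling_interval": 120,  # minutes, within the 60-360 range mentioned above
    "token": "<COLLECTOR_TOKEN>",  # placeholder, not a real token
    "platform_host_url": "http://localhost:8080",  # placeholder URL
    "profilers": [
        {"type": "postgres", "tables": ["public.orders"]},  # sample table name
    ],
}

print(check_top_level(example_config))  # prints []
```

The sketch deliberately stops at the top level; per-profiler connection fields are left out because this page does not list them.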

Multiple profilers in one container

The same multi-plugin pattern as the regular collectors applies: a single container can profile multiple sources, including multiple sources of the same type.
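
A minimal sketch of what such a profilers list might look like, written as a Python structure purely for illustration. Only type and tables are taken from this page; the name key, the table names, and the count_by_type helper are assumptions made for the example, so check the profiler README for the real per-profiler fields.

```python
from collections import Counter

# Illustrative profilers list: two PostgreSQL sources and one Azure SQL source.
# The "name" key and all table names are hypothetical.
profilers = [
    {"type": "postgres", "name": "orders_db", "tables": ["public.orders", "public.customers"]},
    {"type": "postgres", "name": "billing_db", "tables": ["invoices"]},
    {"type": "azure_sql", "name": "reporting_db", "tables": ["dbo.daily_sales"]},
]


def count_by_type(entries: list[dict]) -> dict[str, int]:
    """Count how many profiler entries exist per source type."""
    return dict(Counter(entry["type"] for entry in entries))


print(count_by_type(profilers))  # prints {'postgres': 2, 'azure_sql': 1}
```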

Known limitations

  • Two source types only: postgres and azure_sql. Profiling other source types isn't supported by this collector today.

  • Explicit tables: list required. The profiler does not auto-discover tables; you must list each schema.table (or just table) in the per-profiler tables field. This is a deliberate cost guard — DataProfiler reads sample rows.

  • Heavy dependency stack. DataProfiler pulls TensorFlow as a dependency for its automatic data-labeling. On Apple Silicon (M1/M2) tensorflow and pyodbc need extra build steps — see the profiler README → M1 Issue.

  • Independent of the catalog collector. The profiler does not register data sources — it expects each profiled table to already exist in the catalog (ingested by odd-collector or another integration). Run a catalog collector first, then point the profiler at the same source.
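
The first two limitations lend themselves to a quick pre-flight check before starting the container. The sketch below is one way to do that in Python; check_profiler is a hypothetical helper, not something the collector ships, and it does not verify that the listed tables already exist in the catalog.

```python
SUPPORTED_TYPES = {"postgres", "azure_sql"}  # the two types listed on this page


def check_profiler(entry: dict) -> list[str]:
    """Return problems for one profiler entry, based on the limitations above."""
    problems = []
    if entry.get("type") not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {entry.get('type')!r}")
    tables = entry.get("tables")
    # The profiler does not auto-discover tables, so an explicit list is needed.
    if not isinstance(tables, list) or not tables:
        problems.append("tables must be a non-empty list")
    return problems


# A Snowflake entry with no tables fails both checks.
print(check_profiler({"type": "snowflake"}))
# A PostgreSQL entry with one (sample) table passes.
print(check_profiler({"type": "postgres", "tables": ["public.orders"]}))
```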
