odd-collector-profiler
Statistical data-profiling collector — runs DataProfiler against datasets and pushes per-dataset statistics to the platform.
Status: Stable, narrow scope. Released as its own Docker image. Currently supports profiling of PostgreSQL and Azure SQL sources only.
odd-collector-profiler is a separate single-purpose collector that runs Capital One's DataProfiler library against your datasets and pushes the resulting statistics (column-level distributions, null ratios, type detection, basic anomaly tags) into the catalog. It is the source of the Statistics view shown on a dataset's detail page in the platform UI.
It complements (does not replace) one of the regular pull collectors — odd-collector-profiler reads sample rows out of the source to compute statistics; the regular collector still does the schema-level catalog ingestion.
For the broader pull-vs-push picture, start at the Integrations hub.
Supported sources
postgres: PostgreSQL
azure_sql: Azure SQL Database
Source: profiler README. Other source types (Snowflake, BigQuery, MySQL, etc.) are not currently profiled by this collector.
Installation
docker pull ghcr.io/opendatadiscovery/odd-collector-profiler:latest
Or build from source with docker build . -t odd_collector_profiler from the repo root.
Configuration shape
The profiler config layout is similar to a normal collector's, but the list field is profilers: rather than plugins:, and the per-profiler fields differ slightly. Most notably, each profiler takes an explicit tables: list that scopes which tables get profiled. You typically do not profile every table in a database, because DataProfiler reads sample rows and would be expensive across the full surface.
default_pulling_interval: Minutes between profiling runs. Profilers can be expensive; common values are 60–360.
token: Collector token issued by the platform.
platform_host_url: ODD Platform URL.
profilers: List of per-source profiler configs.
Source: profiler README → Config example.
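As a rough illustration of the layout described above, a minimal single-profiler config might look like the sketch below. The four top-level keys and the per-profiler tables list come from this page. The per-profiler type, name, and connection fields (host, port, database, user, password), along with every placeholder value and table name, are assumptions modeled on the regular collector's plugin entries, so check them against the profiler README before use.

```yaml
# Sketch only. Per-profiler field names other than `tables` are assumptions.
default_pulling_interval: 120                 # minutes between profiling runs
token: "<collector-token>"                    # issued by the platform
platform_host_url: "http://localhost:8080"    # placeholder platform URL
profilers:                                    # note: `profilers`, not `plugins`
  - type: postgres                            # assumed key; value is a supported source type
    name: postgres_profiler                   # assumed field, hypothetical name
    host: localhost                           # assumed connection fields
    port: 5432
    database: analytics
    user: profiler
    password: "<password>"
    tables:                                   # only these tables are profiled
      - public.orders                         # schema.table form
      - customers                             # bare table form
```

Keeping the tables list short and the interval long is the main lever for controlling how much sampling load the profiler puts on the source.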
Multiple profilers in one container
The same multi-plugin pattern as the regular collectors applies: a single container can profile multiple sources, including multiple sources of the same type.
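A sketch of what a multi-source profilers list could look like, with two PostgreSQL sources and one Azure SQL source in the same container. Only the profilers key, the tables list, and the postgres and azure_sql type values are taken from this page. The remaining field names, the hosts, and the table names are hypothetical and assume the Azure SQL entry takes the same connection fields as the PostgreSQL one.

```yaml
# Sketch only. Field names other than `profilers` and `tables` are assumptions.
profilers:
  - type: postgres
    name: orders_db_profiler          # hypothetical name
    host: orders-db.internal          # hypothetical host
    port: 5432
    database: orders
    user: profiler
    password: "<password>"
    tables:
      - public.orders
  - type: postgres                    # second source of the same type
    name: billing_db_profiler
    host: billing-db.internal
    port: 5432
    database: billing
    user: profiler
    password: "<password>"
    tables:
      - public.invoices
  - type: azure_sql
    name: crm_profiler
    host: example.database.windows.net   # hypothetical server
    port: 1433
    database: crm
    user: profiler
    password: "<password>"
    tables:
      - dbo.accounts
```

Giving each entry a distinct name is assumed here to be what keeps two sources of the same type apart, as with the regular collector's plugins.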
Known limitations
Two source types only: postgres and azure_sql. Profiling other source types isn't supported by this collector today.
Explicit tables: list required. The profiler does not auto-discover tables; you must list each schema.table (or just table) in the per-profiler tables field. This is a deliberate cost guard, since DataProfiler reads sample rows.
Heavy dependency stack. DataProfiler pulls TensorFlow as a dependency for its automatic data-labeling. On Apple Silicon (M1/M2), tensorflow and pyodbc need extra build steps; see the profiler README → M1 Issue.
Independent of the catalog collector. The profiler does not register data sources; it expects each profiled table to already exist in the catalog (ingested by odd-collector or another integration). Run a catalog collector first, then point the profiler at the same source.
Where to next
odd-collector: generic catalog collector. Run this before the profiler so the tables exist in the catalog.
Quality Dashboard: where profiler-driven statistics surface in the UI.
DataProfiler upstream: the library doing the heavy lifting.