Ingestion filters

Ingestion filters are collector-side regex include / exclude rules that scope which schemas, tables, files, datasets, or pipelines a plugin ingests. They are configured per plugin in `collector_config.yaml`.

Pull adapters in ODD's collectors ingest everything they can see by default — every schema in a database, every file in a bucket, every dataset in a warehouse. Ingestion filters scope a plugin to a slice of that surface using regex include / exclude rules, so an operator can keep the catalog focused on what their teams actually care about.

This page covers the filter mechanism — the per-key shape, how include and exclude interact, and a worked PostgreSQL example. For per-adapter filter coverage (which adapter exposes which filter keys), see the per-collector pages under Integrations.

Not the same as the platform's "ingestion filter". The ODD Platform has a separate, unrelated feature that also carries the name ingestion filter — a token-based authentication gate on the /ingestion/entities endpoint, enabled with the auth.ingestion.filter.enabled setting (off by default). It controls who may push ingestion requests to the platform, not what a collector reads from a source. The filters on this page are collector-side and decide which schemas, tables, files, datasets, or pipelines an adapter ingests; they have nothing to do with authentication. If you came here to secure the ingestion endpoint, see Ingestion authentication instead.

Where filters are configured

Filters live in collector_config.yaml under the per-plugin block — not at the collector level. Each plugin type exposes its own filter keys named after the dimension being filtered:

  • schemas_filter — PostgreSQL, Snowflake (filter by database schema).

  • filename_filter — S3, Azure Blob Storage, GCS (filter by file path / name).

  • datasets_filter — BigQuery (filter by dataset).

  • pipeline_filter — Azure Data Factory (filter by pipeline name).

Other adapters expose filters under names that match their domain. The shape — include and exclude regex lists — is consistent across them.

Shape of a filter

Every filter takes two regex lists:

```yaml
schemas_filter:
  include: ['regex_1', 'regex_2', ...]
  exclude: ['regex_1', 'regex_2', ...]
```

  • include — the plugin only ingests items matching at least one regex in the list. If include is set and no regex matches, the item is skipped.

  • exclude — the plugin skips items matching at least one regex in the list, even if they matched include.

When both lists are set, the rule is "included AND not excluded":

  1. The item must match at least one include pattern.

  2. The item must match zero exclude patterns.

Either list is optional. Omitting include means "include everything that isn't excluded"; omitting exclude means "ingest everything that matches include". Omitting the filter entirely means "ingest everything the adapter can see", which is the default.

Patterns are regular expressions, not glob patterns. Anchor with ^ / $ if you need exact-prefix or exact-suffix matching; otherwise the regex matches anywhere in the candidate string.
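The rules above can be sketched in a few lines of Python. This is an illustration of the semantics described on this page, not the collector's actual implementation: the function name `passes_filter` is hypothetical, and the sketch assumes Python's `re.search` (match anywhere) and that an empty list behaves like an omitted one.

```python
import re

def passes_filter(name, include=None, exclude=None):
    """Return True if `name` would be ingested under the rules described above."""
    # Assumption: an omitted or empty include list means every item is a candidate.
    included = not include or any(re.search(p, name) for p in include)
    # Any exclude match wins, even if the item also matched include.
    excluded = bool(exclude) and any(re.search(p, name) for p in exclude)
    return included and not excluded

# Unanchored patterns match anywhere in the candidate string.
print(passes_filter("my_sales_db", include=["sales"]))                   # True
# Anchoring with ^ turns the pattern into a prefix match.
print(passes_filter("my_sales_db", include=["^sales"]))                  # False
# Exclude overrides include.
print(passes_filter("sales_tmp", include=["sales"], exclude=["_tmp$"]))  # False
# No filter at all: everything is ingested.
print(passes_filter("anything"))                                         # True
```

Trying candidate names against a pattern list this way before deploying a config is a cheap way to catch a glob-style pattern (for example `sales*`) that does not mean what it would mean as a glob.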

Worked example — PostgreSQL schemas_filter

Suppose a PostgreSQL source has these schemas:

  • test_prod

  • application_dev

  • data_in_prod

  • test_data_in_prod_for_application

The PostgreSQL plugin is configured with a schemas_filter whose include list is ['test', '^in.*prod'] and whose exclude list is ['prod$'].

The plugin processes each schema:

  • test_prod → matches include[0] (test) ✓ → matches exclude[0] (prod$) ✗ — excluded.

  • application_dev → matches no include rule — skipped (not included).

  • data_in_prod → matches no include rule — skipped (not included). (Note: ^in.*prod requires the schema name to start with in, which data_in_prod does not.)

  • test_data_in_prod_for_application → matches include[0] (test) ✓ → matches no exclude rule ✓ — ingested.

Net effect: only test_data_in_prod_for_application is ingested. The other three are filtered out at collection time and never appear in the catalog.
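The walkthrough can be reproduced with a short Python sketch. The include patterns `test` and `^in.*prod` and the exclude pattern `prod$` are the ones the walkthrough refers to; the helper names `first_match` and `decide` are hypothetical, and checking include before exclude is an assumption made for readability (the outcome is the same in either order).

```python
import re

INCLUDE = ["test", "^in.*prod"]  # include[0], include[1]
EXCLUDE = ["prod$"]              # exclude[0]

def first_match(patterns, name):
    """Index of the first pattern found anywhere in `name`, or None."""
    for i, pattern in enumerate(patterns):
        if re.search(pattern, name):
            return i
    return None

def decide(name):
    """Describe what happens to one schema name under the filter."""
    inc = first_match(INCLUDE, name)
    if inc is None:
        return "skipped (no include match)"
    exc = first_match(EXCLUDE, name)
    if exc is not None:
        return f"excluded (include[{inc}], exclude[{exc}])"
    return f"ingested (include[{inc}])"

schemas = [
    "test_prod",
    "application_dev",
    "data_in_prod",
    "test_data_in_prod_for_application",
]
for schema in schemas:
    print(f"{schema}: {decide(schema)}")
# test_prod: excluded (include[0], exclude[0])
# application_dev: skipped (no include match)
# data_in_prod: skipped (no include match)
# test_data_in_prod_for_application: ingested (include[0])
```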

When filters apply

Filters apply at ingestion time, on the collector side — the platform never sees the filtered-out items. This means:

  • Filtered-out items consume zero database storage, zero search index, zero entity-page rendering cost. The filter is not a UI hide; it is a non-ingest.

  • Changing a filter rule and restarting the collector does not retroactively remove already-ingested items. To prune them, the operator must also remove them from the platform, either by deleting them manually or through a controlled re-ingest after the filter change, so that they drop out of the set of entities the source reports.

  • Per-source coverage on Management → Datasources reflects what the filter let through — the entity counts there are post-filter.
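Because tightening a filter does not remove what was already ingested, it can help to work out which existing items the new rules would reject before deciding what to delete. The following is a minimal sketch under stated assumptions: the list of already-ingested names is hypothetical input that the operator gathers themselves, the function names are illustrative, and nothing here calls the platform.

```python
import re

def allowed(name, include, exclude):
    # Same rule as described on this page: at least one include match, no exclude match.
    return any(re.search(p, name) for p in include) and not any(
        re.search(p, name) for p in exclude
    )

def stale_after_filter_change(already_ingested, include, exclude):
    """Names present in the catalog that the new filter would no longer let through."""
    return sorted(n for n in already_ingested if not allowed(n, include, exclude))

# Hypothetical catalog contents, ingested before any filter was configured.
catalog = {
    "test_prod",
    "application_dev",
    "data_in_prod",
    "test_data_in_prod_for_application",
}
print(stale_after_filter_change(catalog, include=["test", "^in.*prod"], exclude=["prod$"]))
# ['application_dev', 'data_in_prod', 'test_prod']
```

The result is only a candidate list for cleanup; the deletion itself still has to happen on the platform side.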

Default behaviour without filters

When a plugin's filter block is absent or empty, the plugin ingests everything the source exposes. This is the default — for a fresh deployment, every schema, file, or dataset shows up in the catalog until the operator scopes the surface down.

Per-adapter coverage

Most pull adapters that read from sources with multiple "namespaceable" dimensions (schemas, datasets, paths, projects, ...) expose a corresponding filter. The complete adapter-by-adapter capability list lives in the odd-collectors repository's filtering documentation. When in doubt, consult the per-adapter page under Integrations for the exact key name.
