odd-collector-aws

AWS-services pull collector — 11 adapters for Glue, S3, Athena, Kinesis, SageMaker, and more.

Status: Stable. Released as a tagged Docker image alongside the rest of the odd-collectors monorepo.

odd-collector-aws packages adapters for AWS managed services. Like the other pull collectors, it ships as a daemon container that hosts one or more configured plugins, in any combination of adapter types.

For the broader pull-vs-push picture, start at the Integrations hub. For deployment-side detail (build, Docker, env vars), see Build and run ODD Collectors.

Supported adapters

The 11 adapters below are registered in odd_collector_aws/domain/plugin.py (PLUGIN_FACTORY). Every adapter has per-field documentation on this page: two (glue, s3) get longer spotlights with deployment guidance and feature notes, and the remaining 9 are catalogued in the per-adapter configuration reference section.

| Type literal | AWS service | Spotlighted below |
| --- | --- | --- |
| athena | Amazon Athena | no |
| dms | AWS Database Migration Service | no |
| dynamodb | DynamoDB | no |
| glue | AWS Glue Data Catalog | yes |
| kinesis | Amazon Kinesis | no |
| quicksight | Amazon QuickSight | no |
| s3 | Amazon S3 (object catalog) | yes |
| s3_delta | Amazon S3, Delta Lake tables | no |
| sagemaker | Amazon SageMaker | no |
| sagemaker_featurestore | SageMaker Feature Store | no |
| sqs | Amazon SQS | no |

The reference YAML for each adapter lives at odd-collectors/odd-collector-aws/config_examples/. The Pydantic models that define accepted fields live at odd-collector-aws/odd_collector_aws/domain/plugin.py.

Common AWS authentication

Every adapter inherits from AwsPlugin and accepts the same set of optional AWS auth fields. When unset, the underlying boto3 client falls back to its standard credential chain — environment variables, ~/.aws/credentials, EC2 / EKS instance profile, etc.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| aws_access_key_id | string | None | Static access key. |
| aws_secret_access_key | string | None | Static secret key. |
| aws_session_token | string | None | Required when using temporary credentials. |
| aws_region | string | None | AWS region. Required for region-bound services when no environment default exists. |
| aws_account_id | string | None | Account ID. Required by kinesis. |
| profile_name | string | None | Named profile from ~/.aws/credentials. |
| aws_role_arn | string | None | Role ARN to assume. |
| aws_role_session_name | string | None | Session name for the assumed role. |
| endpoint_url | string | None | Override the AWS endpoint; used for LocalStack and S3-compatible stores like MinIO. |

Source: AwsPlugin base in odd-collector-aws/odd_collector_aws/domain/plugin.py.
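
As an illustration of the role-based fields, a plugin entry might look like the sketch below. Only the field names come from the table above; the plugins list wrapper, the plugin name, the region, the account ID inside the ARN, and the session name are placeholders chosen for this example.

```yaml
plugins:
  - type: athena
    name: athena_prod                 # hypothetical plugin name
    aws_region: eu-west-1             # placeholder region
    # Assume a role instead of embedding static keys (placeholder ARN).
    aws_role_arn: arn:aws:iam::123456789012:role/odd-collector-read
    aws_role_session_name: odd-collector-aws
    # aws_access_key_id / aws_secret_access_key are left unset here, so the
    # base credentials come from the boto3 default chain.
```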

Installation

Mount a collector_config.yaml at /app/collector_config.yaml. A reference Compose snippet is in the aws collector README.

Minimal config
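
A minimal collector_config.yaml could look roughly like the sketch below. The top-level keys (default_pulling_interval, platform_host_url, token) are assumptions based on how ODD collector configs are usually laid out and are not described on this page; the URL, token, plugin name, and region are placeholders. The files under config_examples/ remain the authoritative reference.

```yaml
# Assumed top-level keys; verify against config_examples/ before use.
default_pulling_interval: 10                   # assumed: interval between pulls
platform_host_url: http://odd-platform:8080    # placeholder ODD Platform URL
token: "<collector-token>"                     # placeholder collector token
plugins:
  - type: glue                # type literal from the Supported adapters table
    name: glue_catalog        # hypothetical, must be unique within the file
    aws_region: us-east-1     # placeholder; Glue is region-bound
```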

Multiple plugins in one container

A single odd-collector-aws instance commonly fans out across multiple AWS accounts or regions:
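
A sketch of such a fan-out is shown below, assuming a plugins list as the container for plugin entries. The plugin names, regions, bucket name, and role ARN are placeholders invented for this example; only the type literals and field names come from this page.

```yaml
plugins:
  # Same adapter type, two regions: one plugin entry per region.
  - type: glue
    name: glue_us_east_1
    aws_region: us-east-1
  - type: glue
    name: glue_eu_west_1
    aws_region: eu-west-1
  # A different adapter type reaching into a second account via an assumed role.
  - type: s3
    name: s3_analytics_account
    aws_region: us-east-1
    aws_role_arn: arn:aws:iam::210987654321:role/odd-collector-read   # placeholder
    dataset_config:
      bucket: analytics-landing     # placeholder bucket
```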

Each plugin name must be unique within the file.

Spotlight: Glue (type: glue)

Pulls the AWS Glue Data Catalog — databases, tables, columns, partition keys.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| aws_region | string | recommended | None | AWS region. Glue is region-bound. |
| AWS auth fields | | | | See the common AWS authentication section above. |

Source: GluePlugin in odd-collector-aws/.../plugin.py; reference YAML at config_examples/glue.yaml.

Spotlight: S3 (type: s3)

Pulls a curated set of S3 objects (or folders treated as datasets) and infers their schema. Supports CSV / TSV / Parquet, with explicit support for Hive-style partitioning.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| dataset_config.bucket | string | yes | | S3 bucket name. |
| dataset_config.prefix | string | no | empty | Path prefix inside the bucket. |
| dataset_config.folder_as_dataset | object | no | | Treat a folder as a single partitioned dataset (see partitioned example below). |
| endpoint_url | string | no | None | Override for S3-compatible stores (MinIO, LocalStack). |
| filename_filter.include | list of regex | no | [".*"] | Object names to include. |
| filename_filter.exclude | list of regex | no | [] | Object names to drop after include matches. |
| AWS auth fields | | | | See the common AWS authentication section above. |

Source: S3Plugin in odd-collector-aws/.../plugin.py; reference YAML at config_examples/s3.yaml.
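
The two entries below sketch the filtered and the partitioned (folder-as-dataset) shapes. The nesting follows the dotted field names in the table above, and the folder_as_dataset keys (file_format, flavor) follow the feature matrix on this page. Bucket names, prefixes, plugin names, and regex patterns are placeholders; treat config_examples/s3.yaml as the authoritative layout.

```yaml
plugins:
  # Individual objects under a prefix, narrowed by regex filters.
  - type: s3
    name: s3_raw_files
    aws_region: us-east-1
    dataset_config:
      bucket: raw-landing            # placeholder bucket
      prefix: exports/               # placeholder prefix
    filename_filter:
      include: ['.*\.parquet$', '.*\.csv$']
      exclude: ['.*_tmp.*']          # dropped after the include match

  # A folder treated as one Hive-partitioned dataset.
  - type: s3
    name: s3_events_partitioned
    aws_region: us-east-1
    dataset_config:
      bucket: curated-events         # placeholder bucket
      prefix: events/                # folder holding e.g. year=2024/month=01/...
      folder_as_dataset:
        file_format: parquet         # parquet / csv / tsv
        flavor: hive                 # hive / presto
```

Because dataset_config is a single object, each bucket gets its own s3 plugin entry, as in this sketch.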

Per-adapter configuration reference

The two spotlights above cover the deployment-shape questions; this section enumerates the per-field config schema for the remaining 9 adapters. Field names, types, and defaults are sourced from the Pydantic plugin classes in odd_collector_aws/domain/plugin.py; each adapter links to its config_examples/{type}.yaml reference YAML.

Every adapter inherits from AwsPlugin (documented under Common AWS authentication above) and therefore accepts the standard AWS auth fields aws_access_key_id, aws_secret_access_key, aws_session_token, aws_region, aws_account_id, profile_name, aws_role_arn, aws_role_session_name, and endpoint_url. The per-adapter tables below list only adapter-specific fields plus call out which AWS auth fields a given service requires in practice.

Amazon Athena (type: athena)

Catalogs Athena workgroups, databases, tables, and views.

The plugin declares no fields beyond the AwsPlugin base — aws_region is required in practice (Athena is region-bound) and the rest of the AWS auth set follows the boto3 default credential chain when unset.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: AthenaPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/athena.yaml.

AWS Database Migration Service (type: dms)

Catalogs DMS replication instances, endpoints, and tasks.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: DmsPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/dms.yaml.

DynamoDB (type: dynamodb)

Catalogs DynamoDB tables and infers attribute types from a row sample. The adapter scopes to one region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| exclude_tables | list of string or null | no | [] | Literal table-name list to skip (e.g., to exclude internal / staging tables). Plain name match, not regex. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice; endpoint_url works for LocalStack and other DynamoDB-compatible local stores. |

Source: DynamoDbPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/dynamodb.yaml.
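
A sketch of a dynamodb entry pointed at a local LocalStack instance follows. The plugin name, table names, and region are placeholders; the endpoint assumes LocalStack's default edge port (4566) on localhost, and the plugins list wrapper is assumed as the container for plugin entries.

```yaml
plugins:
  - type: dynamodb
    name: dynamodb_local
    aws_region: us-east-1                  # placeholder region
    endpoint_url: http://localhost:4566    # assumed LocalStack endpoint
    exclude_tables:                        # literal names, not regex
      - staging_orders
      - internal_migrations
```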

Amazon Kinesis (type: kinesis)

Catalogs Kinesis streams in one account / region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| aws_account_id | string | yes | | Required for kinesis. Other adapters inherit aws_account_id as Optional[str] from AwsPlugin; KinesisPlugin redeclares it as required. The collector errors out on startup if it isn't set. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: KinesisPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/kinesis.yaml.
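
A kinesis entry therefore needs at least the fields in the sketch below. The plugin name, region, and account ID are placeholders, and the plugins list wrapper is assumed. Quoting the account ID keeps YAML from parsing it as a number, which matters for IDs with leading zeros.

```yaml
plugins:
  - type: kinesis
    name: kinesis_streams
    aws_region: us-east-1            # placeholder; required in practice
    aws_account_id: "012345678901"   # placeholder; quoted so it stays a string
```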

Amazon QuickSight (type: quicksight)

Catalogs QuickSight datasets, dashboards, and analyses.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: QuicksightPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/quicksight.yaml.

S3 Delta Lake (type: s3_delta)

Catalogs Delta Lake tables stored in S3 (or any S3-compatible storage). The adapter reads the Delta _delta_log/ to recover the table's evolved schema rather than inferring it from the underlying Parquet files.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| delta_tables | object | yes | | A single Delta-table descriptor (bucket, prefix, optional filter). Note: this is one object, not a list, unlike gcs_delta on the GCP collector, which takes a list. To catalog multiple Delta tables, use multiple s3_delta plugin entries. |
| delta_tables.bucket | string | yes | | S3 bucket name. |
| delta_tables.prefix | string | yes | | Path prefix inside the bucket pointing at the Delta table root (the directory containing _delta_log/). |
| delta_tables.filter.include | list of regex | no | [".*"] | Per-table regex include list applied during enumeration. |
| delta_tables.filter.exclude | list of regex | no | [] | Per-table regex exclude list. |
| delta_tables.scheme | string (alias schema) | no | "s3" | Storage scheme. Defaults to s3; override only when pointing the adapter at a non-S3 Delta location. The model accepts schema as an alias for backward compatibility. |
| endpoint_url | string or null | no | null | Override for S3-compatible endpoints (LocalStack, MinIO). Re-declared on S3DeltaPlugin over the inherited AwsPlugin field for clarity. |
| aws_storage_allow_http | boolean or null | no | false | Permit plain-HTTP access to the storage backend. Set to true for local MinIO / LocalStack deployments using http://; leave false for production S3. |
| AWS auth fields | | | | See Common AWS authentication. |

Source: S3DeltaPlugin and DeltaTableConfig in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/s3_delta.yaml.
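
A sketch of an s3_delta entry against a local MinIO instance follows, built from the fields in the table above. The plugin name, bucket, prefix, regex, and credentials are placeholders; the endpoint assumes MinIO's default API port (9000) on localhost, and the plugins list wrapper is assumed. Check config_examples/s3_delta.yaml for the authoritative layout.

```yaml
plugins:
  - type: s3_delta
    name: delta_local
    endpoint_url: http://localhost:9000   # assumed local MinIO endpoint
    aws_storage_allow_http: true          # needed because the endpoint is http://
    aws_access_key_id: minioadmin         # placeholder local credentials
    aws_secret_access_key: minioadmin
    delta_tables:                         # one object, not a list
      bucket: lakehouse                   # placeholder bucket
      prefix: tables/orders               # directory containing _delta_log/
      filter:
        exclude: ['.*_backup.*']          # placeholder regex
```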

Amazon SageMaker (type: sagemaker)

Catalogs SageMaker experiments, trials, and model artifacts.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| aws_secret_access_key | string or null | yes | | Required at the model level (re-declared without default). |
| aws_access_key_id | string or null | yes | | Required at the model level (re-declared without default). |
| aws_region | string or null | yes | | Required at the model level (re-declared without default). |
| aws_session_token | string or null | yes | | Required at the model level (re-declared without default). |
| aws_account_id | string or null | yes | | Required at the model level (re-declared without default). |
| experiments | list of string or null | yes | | Allowlist of SageMaker experiment names to scope ingestion. The model is Optional[list[str]] with no default: pass an explicit list to scope or null to ingest every experiment. Literal name list, not a regex. |
| profile_name, aws_role_arn, aws_role_session_name, endpoint_url | | | | Inherited from AwsPlugin; see Common AWS authentication. |

Source: SagemakerPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sagemaker.yaml.
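
Because these fields have no defaults, a sagemaker entry has to spell each of them out, even when the value is null. The sketch below assumes the boto3 credential chain supplies the keys; the plugin name, region, account ID, and experiment names are placeholders, and the plugins list wrapper is assumed.

```yaml
plugins:
  - type: sagemaker
    name: sagemaker_experiments
    aws_region: us-east-1              # placeholder region
    aws_account_id: "123456789012"     # placeholder; use null if not needed
    # Present but null: fall back to the boto3 default credential chain.
    aws_access_key_id: null
    aws_secret_access_key: null
    aws_session_token: null
    # Literal experiment names; set to null to ingest every experiment.
    experiments:
      - churn-model-v2
      - pricing-baseline
```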

SageMaker Feature Store (type: sagemaker_featurestore)

Catalogs SageMaker Feature Store feature groups and feature definitions.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: SagemakerFeaturestorePlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sagemaker_featurestore.yaml.

Amazon SQS (type: sqs)

Catalogs SQS queues in one region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| AWS auth fields | | | | See Common AWS authentication. aws_region is required in practice. |

Source: SQSPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sqs.yaml.

Per-adapter feature matrix

| Feature | Where it applies |
| --- | --- |
| Ingestion filters (filename_filter) | s3, s3_delta (via delta_tables.filter). Regex include / exclude lists; default includes everything. |
| Folder-as-dataset / Hive partitioning | s3. dataset_config.folder_as_dataset accepts file_format (parquet / csv / tsv), flavor (hive / presto), and an optional field_names list for non-Hive layouts. |
| Cross-region credentials via aws_role_arn | Every adapter inheriting from AwsPlugin. |
| endpoint_url override | s3, s3_delta, dynamodb (and any AWS-SDK call boto3 routes through the configured endpoint). Used for LocalStack and MinIO. |
| exclude_tables | dynamodb. Plain list of table names to skip. |
| aws_storage_allow_http toggle | s3_delta. Enables plain-HTTP storage access (MinIO / LocalStack); off by default. |
| Literal-name allowlist filters | sagemaker.experiments (experiment names). Plain list, not a regex. Required at the model level (no default); pass null to ingest every experiment, or a list to scope. |
| Required aws_account_id | kinesis. Re-declared as required (the rest of the AWS adapters take aws_account_id as Optional[str]). |

Source: PLUGIN_FACTORY in odd-collector-aws/.../plugin.py.

Unlike the generic odd-collector, the cloud collectors do not ship with the AWS SSM secrets backend hook; see Collector secrets backend for the supported scope.

Known limitations

  • Static credentials in YAML are not the recommended path. IAM roles via aws_role_arn (or pod identity / instance profile) avoid leaking long-lived keys into config files. The reference Compose template wires credentials only as environment variables.

  • The s3.datasets field is rejected at validation. Use dataset_config (singular) instead. The collector errors out on startup if datasets: is present; see the validate_datasets validator on S3Plugin.

  • No foreign-key / ERD extraction in any AWS adapter — that capability is PostgreSQL- and Snowflake-only on the generic collector.

  • kinesis requires aws_account_id explicitly. It is the only field from the common AWS auth set that KinesisPlugin re-declares as required.

  • sagemaker re-declares the AWS auth fields without defaults, making them effectively required even when you intend to inherit from the boto3 credential chain. Provide each field explicitly (use null when you want the boto3 fallback). This is an asymmetry with the rest of the adapters in the collector.

  • s3_delta.delta_tables is a single object, not a list. To catalog multiple Delta tables in one collector, use multiple s3_delta plugin entries. This is an asymmetry with gcs_delta on the GCP collector, which takes a list of delta_tables.

  • s3.dataset_config is a single object, not a list. To catalog multiple S3 buckets, use multiple s3 plugin entries.
