odd-collector-aws
AWS-services pull collector — 11 adapters for Glue, S3, Athena, Kinesis, SageMaker, and more.
Status: Stable. Released as a tagged Docker image alongside the rest of the odd-collectors monorepo.
odd-collector-aws packages adapters for AWS managed services. Like the other pull collectors, it ships as a daemon container that hosts one or more configured plugins, in any combination of types.
For the broader pull-vs-push picture, start at the Integrations hub. For deployment-side detail (build, Docker, env vars), see Build and run ODD Collectors.
Supported adapters
The 11 adapters listed here are registered in odd_collector_aws/domain/plugin.py (PLUGIN_FACTORY). Every adapter has per-field documentation on this page: two (glue, s3) get longer spotlights with deployment guidance and feature notes, and the remaining 9 are catalogued in the per-adapter configuration reference section.
| Type | Service | Spotlight |
| --- | --- | --- |
| athena | Amazon Athena | |
| dms | AWS Database Migration Service | |
| dynamodb | DynamoDB | |
| glue | AWS Glue Data Catalog | ✓ |
| kinesis | Amazon Kinesis | |
| quicksight | Amazon QuickSight | |
| s3 | Amazon S3 (object catalog) | ✓ |
| s3_delta | Amazon S3, Delta Lake tables | |
| sagemaker | Amazon SageMaker | |
| sagemaker_featurestore | SageMaker Feature Store | |
| sqs | Amazon SQS | |
The reference YAML for each adapter lives at odd-collectors/odd-collector-aws/config_examples/. The Pydantic models that define accepted fields live at odd-collector-aws/odd_collector_aws/domain/plugin.py.
Common AWS authentication
Every adapter inherits from AwsPlugin and accepts the same set of optional AWS auth fields. When unset, the underlying boto3 client falls back to its standard credential chain — environment variables, ~/.aws/credentials, EC2 / EKS instance profile, etc.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| aws_access_key_id | string | None | Static access key. |
| aws_secret_access_key | string | None | Static secret key. |
| aws_session_token | string | None | Required when using temporary credentials. |
| aws_region | string | None | AWS region. Required for region-bound services when no environment default exists. |
| aws_account_id | string | None | Account ID. Required by kinesis. |
| profile_name | string | None | Named profile from ~/.aws/credentials. |
| aws_role_arn | string | None | Role ARN to assume. |
| aws_role_session_name | string | None | Session name for the assumed role. |
| endpoint_url | string | None | Override the AWS endpoint. Used for LocalStack and S3-compatible stores like MinIO. |
Source: AwsPlugin base in odd-collector-aws/odd_collector_aws/domain/plugin.py.
The container image pulls credentials from environment variables by convention (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY). Inline YAML credentials work but are typically left empty for IAM-role-based deployments — the reference Compose file wires them as env-vars only. Prefer IAM roles over static keys in production.
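As an illustration of the role-based path, a plugin entry might set only the role fields from the common AWS authentication fields and leave the static keys unset. This is a sketch rather than a tested configuration: the role ARN, account number, session name, region, and plugin name are placeholders, and glue is used only as an example adapter type.

```yaml
plugins:
  - type: glue
    name: glue_via_role                      # hypothetical plugin name
    aws_region: eu-west-1                    # placeholder region
    # Placeholder ARN and session name; substitute the role the collector should assume.
    aws_role_arn: arn:aws:iam::123456789012:role/odd-collector-read
    aws_role_session_name: odd-collector-session
    # aws_access_key_id / aws_secret_access_key are intentionally omitted,
    # so boto3 resolves the base credentials through its standard chain.
```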
Installation
Mount a collector_config.yaml at /app/collector_config.yaml. A reference Compose snippet is in the aws collector README.
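A Compose service that follows the description above could look roughly like the sketch below. The mount path and the three environment variable names come from this page; the service name, image reference, and tag are assumptions and should be checked against the aws collector README.

```yaml
services:
  odd-collector-aws:                         # hypothetical service name
    # Assumed image reference and tag; confirm against the README.
    image: ghcr.io/opendatadiscovery/odd-collector-aws:latest
    volumes:
      # Mount the local config at the path the collector reads.
      - ./collector_config.yaml:/app/collector_config.yaml
    environment:
      # Credentials are passed as env vars only, not written into the YAML config.
      - AWS_REGION=${AWS_REGION}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
```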
Minimal config
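A minimal collector_config.yaml with a single Glue plugin might look like the sketch below. The plugin fields (type, name, aws_region) are the ones documented on this page; the top-level keys (default_pulling_interval, token, platform_host_url) are assumed from the common ODD collector config layout and are not described here, and every value is a placeholder.

```yaml
# Top-level collector settings: assumed keys, placeholder values.
default_pulling_interval: 10
token: "<collector-token>"
platform_host_url: http://localhost:8080

plugins:
  - type: glue
    name: glue_adapter        # any operator-chosen unique name
    aws_region: us-east-1     # placeholder region; Glue is region-bound
```

With no auth fields set, credentials would be resolved through the boto3 credential chain described under Common AWS authentication.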
Multiple plugins in one container
A single odd-collector-aws instance commonly fans out across multiple AWS accounts or regions by declaring one plugin entry per account or region.
Each plugin name must be unique within the file.
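A multi-plugin layout along those lines could be sketched as follows. Only the plugins list is shown; the regions, account ID, bucket, and plugin names are placeholders chosen for illustration.

```yaml
plugins:
  # Same adapter type twice, one entry per region, each with its own unique name.
  - type: glue
    name: glue_us_east_1
    aws_region: us-east-1
  - type: glue
    name: glue_eu_west_1
    aws_region: eu-west-1
  # Different adapter types can share the same container.
  - type: kinesis
    name: kinesis_prod
    aws_region: us-east-1
    aws_account_id: "123456789012"   # quoted so YAML keeps it a string
  - type: s3
    name: s3_landing
    dataset_config:
      bucket: example-landing-bucket
```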
Spotlight: Glue (type: glue)
Pulls the AWS Glue Data Catalog: databases, tables, columns, and partition keys.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| aws_region | string | recommended | None | AWS region. Glue is region-bound. |
| AWS auth fields | — | — | — | See the common AWS authentication section above. |
Source: GluePlugin in odd-collector-aws/.../plugin.py; reference YAML at config_examples/glue.yaml.
Spotlight: S3 (type: s3)
Pulls a curated set of S3 objects (or folders treated as datasets) and infers their schema. Supports CSV / TSV / Parquet, with explicit support for Hive-style partitioning.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| dataset_config.bucket | string | yes | — | S3 bucket name. |
| dataset_config.prefix | string | no | empty | Path prefix inside the bucket. |
| dataset_config.folder_as_dataset | object | no | — | Treat a folder as a single partitioned dataset (see partitioned example below). |
| endpoint_url | string | no | None | Override for S3-compatible stores (MinIO, LocalStack). |
| filename_filter.include | list of regex | no | [".*"] | Object names to include. |
| filename_filter.exclude | list of regex | no | [] | Object names to drop after include matches. |
| AWS auth fields | — | — | — | See the common AWS authentication section above. |
Source: S3Plugin in odd-collector-aws/.../plugin.py; reference YAML at config_examples/s3.yaml.
The legacy datasets: field on S3Plugin is deprecated and rejected at validation time. Use dataset_config (singular) — the reference YAML and the Pydantic validator both enforce this.
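A partitioned-folder configuration using the fields above might look like this sketch. The bucket, prefix, region, and regex patterns are placeholders; file_format and flavor use values listed in the feature matrix on this page, and how exactly the adapter applies the regexes to object names is not specified here, so treat the patterns as illustrative.

```yaml
plugins:
  - type: s3
    name: s3_events                  # hypothetical plugin name
    aws_region: us-east-1            # placeholder region
    dataset_config:                  # single object, not a list
      bucket: example-data-bucket    # placeholder bucket
      prefix: events/                # placeholder prefix
      folder_as_dataset:
        file_format: parquet         # parquet / csv / tsv
        flavor: hive                 # hive / presto
        # field_names: [year, month] # optional, for non-Hive layouts
    filename_filter:
      include: ['.*\.parquet']       # illustrative pattern
      exclude: ['_tmp_.*']           # illustrative pattern
```

To catalog a second bucket, add another s3 plugin entry with its own name rather than a second dataset_config.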
Per-adapter configuration reference
The two spotlights above cover the deployment-shape questions; this section enumerates the per-field config schema for the remaining 9 adapters. Field names, types, and defaults are sourced from the Pydantic plugin classes in odd_collector_aws/domain/plugin.py; each adapter links to its config_examples/{type}.yaml reference YAML.
Every adapter inherits from AwsPlugin (documented under Common AWS authentication above) and therefore accepts the standard AWS auth fields aws_access_key_id, aws_secret_access_key, aws_session_token, aws_region, aws_account_id, profile_name, aws_role_arn, aws_role_session_name, and endpoint_url. The per-adapter tables below list only adapter-specific fields plus call out which AWS auth fields a given service requires in practice.
Amazon Athena (type: athena)
Catalogs Athena workgroups, databases, tables, and views.
The plugin declares no fields beyond the AwsPlugin base — aws_region is required in practice (Athena is region-bound) and the rest of the AWS auth set follows the boto3 default credential chain when unset.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
Source: AthenaPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/athena.yaml.
AWS Database Migration Service (type: dms)
Catalogs DMS replication instances, endpoints, and tasks.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
Source: DmsPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/dms.yaml.
DynamoDB (type: dynamodb)
Catalogs DynamoDB tables and infers attribute types from a row sample. The adapter scopes to one region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| exclude_tables | list of string or null | no | [] | Literal table-name list to skip (e.g., to exclude internal / staging tables). Plain name match, not regex. |
| AWS auth fields | — | — | — | See Common AWS authentication. aws_region is required in practice; endpoint_url works for LocalStack and other DynamoDB-compatible local stores. |
Source: DynamoDbPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/dynamodb.yaml.
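A DynamoDB plugin pointed at a local endpoint, with two tables skipped, could be sketched as below. The table names, region, and plugin name are placeholders; the endpoint assumes LocalStack listening on its usual local port.

```yaml
plugins:
  - type: dynamodb
    name: dynamodb_local                  # hypothetical plugin name
    aws_region: us-east-1                 # required in practice
    endpoint_url: http://localhost:4566   # assumed LocalStack endpoint
    exclude_tables:                       # literal names, not regexes
      - staging_events
      - internal_audit
```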
Amazon Kinesis (type: kinesis)
Catalogs Kinesis streams in one account / region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| aws_account_id | string | yes | — | Required for kinesis. Other adapters inherit aws_account_id as Optional[str] from AwsPlugin; KinesisPlugin redeclares it as required. The collector errors out on startup if it isn't set. |
Source: KinesisPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/kinesis.yaml.
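Because aws_account_id is mandatory here, a Kinesis entry would need at least the fields in the sketch below. The account ID, region, and plugin name are placeholders.

```yaml
plugins:
  - type: kinesis
    name: kinesis_adapter            # hypothetical plugin name
    aws_region: us-east-1            # placeholder region
    aws_account_id: "123456789012"   # required; quoted so YAML keeps it a string
```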
Amazon QuickSight (type: quicksight)
Catalogs QuickSight datasets, dashboards, and analyses.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
Source: QuicksightPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/quicksight.yaml.
S3 Delta Lake (type: s3_delta)
Catalogs Delta Lake tables stored in S3 (or any S3-compatible storage). The adapter reads the Delta _delta_log/ to recover the table's evolved schema rather than inferring it from the underlying Parquet files.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| delta_tables | object | yes | — | A single Delta-table descriptor (bucket, prefix, optional filter). Note: this is one object, not a list, unlike gcs_delta on the GCP collector, which takes a list. To catalog multiple Delta tables, use multiple s3_delta plugin entries. |
| delta_tables.bucket | string | yes | — | S3 bucket name. |
| delta_tables.prefix | string | yes | — | Path prefix inside the bucket pointing at the Delta table root (the directory containing _delta_log/). |
| delta_tables.filter.include | list of regex | no | [".*"] | Per-table regex include list applied during enumeration. |
| delta_tables.filter.exclude | list of regex | no | [] | Per-table regex exclude list. |
| delta_tables.scheme | string (alias schema) | no | "s3" | Storage scheme. Defaults to s3; override only when pointing the adapter at a non-S3 Delta location. The model accepts schema as an alias for backward compatibility. |
| endpoint_url | string or null | no | null | Override for S3-compatible endpoints (LocalStack, MinIO). Re-declared on S3DeltaPlugin over the inherited AwsPlugin field for clarity. |
| aws_storage_allow_http | boolean or null | no | false | Permit plain-HTTP access to the storage backend. Set to true for local MinIO / LocalStack deployments using http://; leave false for production S3. |
Source: S3DeltaPlugin and DeltaTableConfig in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/s3_delta.yaml.
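For a local MinIO-style setup, the fields above might combine as in this sketch. The endpoint, credentials, bucket, and prefix are placeholders for a local test environment and are not taken from the reference YAML.

```yaml
plugins:
  - type: s3_delta
    name: s3_delta_local                  # hypothetical plugin name
    endpoint_url: http://localhost:9000   # assumed local MinIO endpoint
    aws_storage_allow_http: true          # needed because the endpoint is plain HTTP
    aws_access_key_id: minioadmin         # placeholder local credentials
    aws_secret_access_key: minioadmin
    delta_tables:                         # one object, not a list
      bucket: example-bucket
      prefix: delta/orders                # directory containing _delta_log/
      filter:
        include: ['.*']
        exclude: []
```

A second Delta table would go in a separate s3_delta plugin entry with a different name.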
Amazon SageMaker (type: sagemaker)
Catalogs SageMaker experiments, trials, and model artifacts.
SagemakerPlugin re-declares the AWS auth fields and experiments without defaults, which makes them effectively required in Pydantic — the adapter will not start until you provide aws_access_key_id, aws_secret_access_key, aws_region, aws_session_token, aws_account_id, and experiments (each can be null if you intend to fall back to the boto3 credential chain or to ingest every experiment, but the keys must be present in the YAML). This is asymmetric with every other AWS adapter in the collector. Set values explicitly or pass null per field.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
| aws_secret_access_key | string or null | yes | — | Required at the model level (re-declared without default). |
| aws_access_key_id | string or null | yes | — | Required at the model level (re-declared without default). |
| aws_region | string or null | yes | — | Required at the model level (re-declared without default). |
| aws_session_token | string or null | yes | — | Required at the model level (re-declared without default). |
| aws_account_id | string or null | yes | — | Required at the model level (re-declared without default). |
| experiments | list of string or null | yes | — | Allowlist of SageMaker experiment names to scope ingestion. The model is Optional[list[str]] with no default; pass an explicit list to scope or null to ingest every experiment. Literal name list, not a regex. |
| profile_name, aws_role_arn, aws_role_session_name, endpoint_url | — | — | — | Inherited from AwsPlugin; see Common AWS authentication. |
Source: SagemakerPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sagemaker.yaml.
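Following the required-but-nullable behaviour described for this adapter, an entry that relies on the boto3 credential chain and ingests every experiment might be written as in the sketch below. The region and plugin name are placeholders; each key is present with an explicit null because the page states that the keys themselves must appear in the YAML.

```yaml
plugins:
  - type: sagemaker
    name: sagemaker_adapter       # hypothetical plugin name
    aws_access_key_id: null       # present but null: fall back to boto3 chain
    aws_secret_access_key: null
    aws_session_token: null
    aws_region: us-east-1         # placeholder region
    aws_account_id: null
    experiments: null             # null ingests every experiment
    # To scope ingestion instead, pass literal names, e.g.:
    # experiments: [churn-model, fraud-model]
```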
SageMaker Feature Store (type: sagemaker_featurestore)
Catalogs SageMaker Feature Store feature groups and feature definitions.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
Source: SagemakerFeaturestorePlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sagemaker_featurestore.yaml.
Amazon SQS (type: sqs)
Catalogs SQS queues in one region per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | — | Operator-chosen unique plugin name. |
Source: SQSPlugin in odd_collector_aws/domain/plugin.py; reference YAML at config_examples/sqs.yaml.
Per-adapter feature matrix
| Feature | Adapters and notes |
| --- | --- |
| Ingestion filters (filename_filter) | s3, s3_delta (via delta_tables.filter). Regex include / exclude lists; default includes everything. |
| Folder-as-dataset / Hive partitioning | s3. dataset_config.folder_as_dataset accepts file_format (parquet / csv / tsv), flavor (hive / presto), and an optional field_names list for non-Hive layouts. |
| Cross-region credentials via aws_role_arn | Every adapter inheriting from AwsPlugin. |
| endpoint_url override | s3, s3_delta, dynamodb (and any AWS-SDK call boto3 routes through the configured endpoint). Used for LocalStack and MinIO. |
| exclude_tables | dynamodb. Plain list of table names to skip. |
| aws_storage_allow_http toggle | s3_delta. Enables plain-HTTP storage access (MinIO / LocalStack); off by default. |
| Literal-name allowlist filters | sagemaker.experiments (experiment names). Plain list, not a regex. Required at the model level (no default); pass null to ingest every experiment, or a list to scope. |
| Required aws_account_id | kinesis. Re-declared as required (the rest of the AWS adapters take aws_account_id as Optional[str]). |
Source: PLUGIN_FACTORY in odd-collector-aws/.../plugin.py.
The cloud collectors do not ship with the AWS SSM secrets backend hook that odd-collector (the generic one) ships — see Collector secrets backend for the supported scope.
Known limitations
- Static credentials in YAML are not the recommended path. IAM roles via aws_role_arn (or pod-identity / instance-profile) avoid leaking long-lived keys into config files. The reference Compose template wires credentials only as env-vars.
- s3.datasets field rejected at validation. Use dataset_config (singular). The collector errors out on startup if datasets: is present; see the validate_datasets validator on S3Plugin.
- No foreign-key / ERD extraction in any AWS adapter. That capability is PostgreSQL- and Snowflake-only on the generic collector.
- kinesis requires aws_account_id explicitly. It is the only field from the common AWS auth set that kinesis makes required.
- sagemaker re-declares the AWS auth fields without defaults, making them effectively required even when you intend to inherit from the boto3 credential chain. Provide each field explicitly (use null when you want the boto3 fallback). This is an asymmetry with the rest of the adapters in the collector.
- s3_delta.delta_tables is a single object, not a list. To catalog multiple Delta tables in one collector, use multiple s3_delta plugin entries. This is an asymmetry with gcs_delta on the GCP collector, which takes a list of delta_tables.
- s3.dataset_config is a single object, not a list. To catalog multiple S3 buckets, use multiple s3 plugin entries.
Where to next
- odd-collector: generic collector with PostgreSQL, Snowflake, etc.
- odd-collector-azure / odd-collector-gcp: sibling cloud collectors.
- Build and run ODD Collectors: common SDK schema and from-source build flow.