odd-collector-gcp

Pull collector for GCP services: adapters for BigQuery, Cloud Bigtable, Google Cloud Storage, and Delta Lake tables stored in GCS.

Status: Stable. Released as a tagged Docker image alongside the rest of the odd-collectors monorepo.

odd-collector-gcp packages adapters for Google Cloud managed services. Like the other pull collectors, it ships as a daemon container that hosts one or more configured plugins, in any combination of adapter types.

For the broader pull-vs-push picture, start at the Integrations hub. For deployment-side detail, see Build and run ODD Collectors.

Authentication

Authentication uses the Google Cloud Application Default Credentials chain. When running outside GCP, set GOOGLE_APPLICATION_CREDENTIALS to the path of a JSON key file inside the container; the adapters do not accept inline credentials in the plugin config. When running on GCP (GKE, GCE, Cloud Run), the workload identity or service account attached to the runtime is used automatically.

Supported adapters

The 4 adapters registered in odd_collector_gcp/domain/plugin.py (PLUGIN_FACTORY):

| Type literal | GCP service | Spotlighted below |
| --- | --- | --- |
| bigquery_storage | BigQuery (datasets, tables, views, columns) | Yes |
| bigtable | Cloud Bigtable | Yes |
| gcs | Google Cloud Storage (object catalog) | Yes |
| gcs_delta | Google Cloud Storage (Delta Lake tables) | Yes |

The reference YAML for each adapter lives in the GCP collector README. The Pydantic models that define accepted fields live at odd-collector-gcp/odd_collector_gcp/domain/plugin.py.

Installation

docker pull ghcr.io/opendatadiscovery/odd-collector-gcp:latest

Mount a collector_config.yaml at /app/collector_config.yaml and the GCP service-account JSON key at the path named by GOOGLE_APPLICATION_CREDENTIALS. A reference Compose snippet is in the GCP collector README.
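As a rough illustration of the two mounts described above, a Compose service might look like the following sketch. The service name, the host-side paths, and the in-container key path /secrets/gcp-key.json are placeholders chosen for this example, not values taken from the collector README; treat the README's Compose snippet as the authoritative version.

```yaml
# docker-compose.yaml (illustrative sketch; service name and paths are placeholders)
services:
  odd-collector-gcp:
    image: ghcr.io/opendatadiscovery/odd-collector-gcp:latest
    environment:
      # Must match the container-side path of the key mount below.
      GOOGLE_APPLICATION_CREDENTIALS: /secrets/gcp-key.json
    volumes:
      # Collector configuration, mounted at the path the collector reads.
      - ./collector_config.yaml:/app/collector_config.yaml:ro
      # Service-account JSON key (hypothetical host path), mounted read-only.
      - ./gcp-key.json:/secrets/gcp-key.json:ro
```

When the container runs on GCP with an attached service account, the environment variable and the key mount can presumably be dropped, since the credentials chain picks up the runtime identity.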

Minimal config
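A minimal configuration might look like the sketch below. The plugin fields (type, name, project) come from the field tables on this page. The top-level keys (default_pulling_interval, token, platform_host_url) follow the layout used by other ODD collectors and are an assumption here, as are all values. The GCP collector README remains the reference for the exact YAML.

```yaml
# collector_config.yaml (minimal sketch; every value is a placeholder)
default_pulling_interval: 10                # assumed top-level key
token: "<collector-token>"                  # assumed: token issued by the ODD Platform
platform_host_url: "http://localhost:8080"  # assumed: URL of the ODD Platform
plugins:
  - type: bigquery_storage       # one of the four type literals listed above
    name: bigquery_adapter       # operator-chosen unique plugin name
    project: my-gcp-project      # GCP project ID
```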

Multiple plugins in one container
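Because one container can host several plugins, a single config file can list more than one entry under plugins. The sketch below combines a BigQuery plugin and a Bigtable plugin for two different projects; the top-level keys are assumed from the usual ODD collector layout, and all names and values are placeholders.

```yaml
# collector_config.yaml (sketch: two plugins in one container; values are placeholders)
default_pulling_interval: 10                # assumed top-level key
token: "<collector-token>"                  # assumed top-level key
platform_host_url: "http://localhost:8080"  # assumed top-level key
plugins:
  - type: bigquery_storage
    name: bigquery_analytics     # plugin names must be unique
    project: analytics-project
  - type: bigtable
    name: bigtable_serving
    project: serving-project     # each plugin scans exactly one project
```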

Spotlight: BigQuery (type: bigquery_storage)

Pulls BigQuery datasets, tables, views, and column metadata for one project per plugin.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| project | string | yes | | GCP project ID; one plugin = one project. |
| page_size | integer | no | 100 | Pagination size for BigQuery list calls. |
| datasets_filter.include | list of regex | no | [".*"] | Dataset names to include. |
| datasets_filter.exclude | list of regex | no | [] | Dataset names to drop after include matches. |

Source: BigQueryStoragePlugin in odd-collector-gcp/.../plugin.py.
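The fields in the table above could be combined as in the following plugin entry. This is a sketch only: the project ID, plugin name, and regex patterns are invented for illustration, and how the patterns are anchored against dataset names is not specified on this page.

```yaml
plugins:
  - type: bigquery_storage
    name: bigquery_adapter
    project: my-gcp-project
    page_size: 100                 # the default, written out explicitly
    datasets_filter:
      include: ["analytics_.*"]    # keep datasets matching this pattern
      exclude: [".*_tmp"]          # then drop matches of this pattern
```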

Spotlight: Google Cloud Storage (type: gcs)

Pulls GCS objects (or folders treated as datasets) and infers schema. Supports CSV / Parquet, with explicit support for Hive-style partitioning.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| project | string | yes | | GCP project ID. |
| datasets | list of objects | yes | | List of { bucket, prefix?, folder_as_dataset? } entries. |
| filename_filter.include | list of regex | no | [".*"] | Object names to include. |
| filename_filter.exclude | list of regex | no | [] | Object names to drop after include matches. |
| parameters | object | no | | Optional pyarrow.fs.GcsFileSystem knobs; see "GCS parameters" below. |

The parameters block accepts the optional GCS-client knobs documented in the GCP collector README → GoogleCloudStorage: anonymous, access_token, target_service_account, credential_token_expiration, default_bucket_location, scheme, endpoint_override, default_metadata, retry_time_limit. They map onto pyarrow.fs.GcsFileSystem and are typically left at their defaults.

Source: GCSPlugin in odd-collector-gcp/.../plugin.py.
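A gcs plugin entry using these fields might look like the sketch below. Bucket names, prefixes, and patterns are placeholders. The folder_as_dataset keys (file_format, flavor) are taken from the feature matrix on this page; consult the GCP collector README for the full set of accepted keys.

```yaml
plugins:
  - type: gcs
    name: gcs_adapter
    project: my-gcp-project
    filename_filter:
      include: ['.*\.parquet']     # keep Parquet objects
      exclude: ['dev_.*']          # then drop objects matching this pattern
    datasets:
      # Bucket only; prefix is optional.
      - bucket: my-bucket
      # Restrict ingestion to a prefix inside the bucket.
      - bucket: my-bucket
        prefix: raw/events/
      # Treat a Hive-partitioned folder as a single dataset.
      - bucket: my-bucket
        prefix: partitioned_data/
        folder_as_dataset:
          file_format: parquet
          flavor: hive
```

The parameters block is omitted here because its options are typically left at their defaults.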

Spotlight: BigTable (type: bigtable)

Pulls Cloud Bigtable instances, tables, and column families. The adapter samples the first N rows of each table (N is rows_limit) to infer the column-family-to-qualifier shape. Bigtable has no fixed schema, so a row sample is the only way to reflect the real data layout in the catalog.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| project | string | yes | | GCP project ID; one plugin = one project (inherited from GcpPlugin). |
| rows_limit | integer | no | 10 | Number of rows the adapter reads per table to derive the type combination across qualifiers. Higher values widen the sample (better schema coverage on heterogeneous tables) at the cost of more read units; the literal README phrasing is "get combination of all types in table used across the first N rows". |

Source: BigTablePlugin in odd-collector-gcp/.../plugin.py; reference YAML in the GCP collector README → BigTable.
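For a table whose rows vary a lot in shape, the sample can be widened by raising rows_limit, as in this sketch. The plugin name, project ID, and the value 50 are illustrative only.

```yaml
plugins:
  - type: bigtable
    name: bigtable_adapter
    project: my-gcp-project
    rows_limit: 50   # sample 50 rows per table instead of the default 10
```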

Spotlight: GCS Delta tables (type: gcs_delta)

Pulls Delta Lake tables stored in GCS — schemas, columns, and partitioning. Each entry under delta_tables: declares one Delta table by bucket + prefix; the adapter reads the Delta _delta_log/ to recover the table's evolved schema rather than inferring it from underlying Parquet.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | | Operator-chosen unique plugin name. |
| project | string | yes | | GCP project ID (inherited from GcpPlugin). |
| delta_tables | list of objects | yes | | One entry per Delta table; each is a DeltaTableConfig (bucket, prefix, optional filter). The list is required (not the same pattern as gcs.datasets, which the GCS plugin treats differently). |
| delta_tables[*].bucket | string | yes | | GCS bucket name. |
| delta_tables[*].prefix | string | yes | | Path prefix inside the bucket; should point at the Delta table root (the directory that contains _delta_log/). |
| delta_tables[*].filter.include | list of regex | no | [".*"] | Per-table regex include list applied during enumeration. |
| delta_tables[*].filter.exclude | list of regex | no | [] | Per-table regex exclude list. |
| parameters | object | no | | Optional GCSAdapterParams block, same shape as on the gcs plugin (anonymous, access_token, target_service_account, credential_token_expiration, default_bucket_location, scheme, endpoint_override, default_metadata, retry_time_limit). Documented inline in the GCP collector README → GoogleCloudStorageDeltaTables; the Pydantic model defers to it. |

Source: GCSDeltaPlugin in odd-collector-gcp/.../plugin.py; reference YAML in the GCP collector README → GoogleCloudStorageDeltaTables.

delta_tables[*].scheme also exists in the Pydantic model, with default gs (mapped from a schema alias for backward compatibility with older configs). Leave it at the default unless you are pointing the adapter at a non-GCS Delta location.
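A gcs_delta plugin entry built from the fields above might look like the following sketch. Bucket names, prefixes, and filter patterns are placeholders, and what exactly the filter patterns are matched against during enumeration is not spelled out on this page.

```yaml
plugins:
  - type: gcs_delta
    name: gcs_delta_adapter
    project: my-gcp-project
    delta_tables:
      # prefix points at the table root, the directory containing _delta_log/.
      - bucket: my-bucket
        prefix: warehouse/orders
      # A second table with an explicit per-table filter.
      - bucket: my-bucket
        prefix: warehouse/events
        filter:
          include: [".*"]        # the default, written out explicitly
          exclude: ["tmp_.*"]    # hypothetical pattern
```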

Per-adapter feature matrix

| Feature | Where it applies |
| --- | --- |
| Ingestion filters (datasets_filter) | bigquery_storage. Regex include / exclude for BigQuery dataset names; default includes everything. |
| Ingestion filters (filename_filter) | gcs. Regex include / exclude for GCS object names. |
| Ingestion filters (filter inside delta_tables) | gcs_delta. Per-Delta-table regex include / exclude. |
| Folder-as-dataset / Hive partitioning | gcs. folder_as_dataset with file_format (parquet / csv / tsv) and flavor (hive / presto). |
| GCS client parameter overrides | gcs, gcs_delta. Overrides for endpoint, bucket location, retry, etc. |
| Row-sampling for schema inference | bigtable. rows_limit controls the sample size used to derive column-family / qualifier types. |

Source: PLUGIN_FACTORY in odd-collector-gcp/.../plugin.py.

Known limitations

  • No inline credentials: every adapter expects credentials via GOOGLE_APPLICATION_CREDENTIALS or the GCP runtime's workload identity. There is no plugin-level auth field.

  • One project per plugin: each bigquery_storage / bigtable / gcs / gcs_delta plugin scans exactly one project. Cross-project ingestion needs additional plugins (one per project).

  • No foreign-key / ERD extraction in any GCP adapter.

  • The README's BigQuery / GCS examples are the canonical reference for parameters (the Pydantic model defers to GCSAdapterParams which is documented in the README, not in plugin.py). The repo's anchor #googlecloudstoragedeltatables has a typo in the markdown (## instead of #) — link from the README directly if anchors don't resolve.
