odd-collector-gcp
GCP-services pull collector — adapters for BigQuery, BigTable, Google Cloud Storage, and GCS Delta Lake tables.
Status: Stable. Released as a tagged Docker image alongside the rest of the odd-collectors monorepo.
odd-collector-gcp packages adapters for Google Cloud managed services. Like the other pull collectors, it ships as a daemon container that hosts one or more configured plugins, in any combination of types.
For the broader pull-vs-push picture, start at the Integrations hub. For deployment-side detail, see Build and run ODD Collectors.
Authentication
Authentication uses the Google Cloud Application Default Credentials chain. When running outside GCP, set GOOGLE_APPLICATION_CREDENTIALS to a JSON key file path on the container — the adapters do not accept inline credentials in the plugin config. When running on GCP (GKE, GCE, Cloud Run), the workload identity / service-account attached to the runtime is used automatically.
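When running outside GCP, a quick pre-flight check can confirm that GOOGLE_APPLICATION_CREDENTIALS points at a usable key file before the container starts. The sketch below is illustrative only: check_key_file is a hypothetical helper, not part of the collector, and it assumes the file is a service-account key (other Application Default Credentials file types would be flagged as problems).

```python
import json
import os


def check_key_file(env=None):
    """Return a list of problems with the key file; an empty list means OK.

    Hypothetical helper for illustration; the collector does not ship it.
    Assumption: the file is a service-account JSON key.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"key file not found: {path}"]
    try:
        with open(path, encoding="utf-8") as fh:
            key = json.load(fh)
    except (OSError, ValueError) as exc:
        return [f"key file is not readable JSON: {exc}"]
    if not isinstance(key, dict):
        return ["key file does not contain a JSON object"]

    problems = []
    if key.get("type") != "service_account":
        problems.append("key 'type' is not 'service_account'")
    for field in ("client_email", "private_key"):
        if not key.get(field):
            problems.append(f"key is missing '{field}'")
    return problems


if __name__ == "__main__":
    for problem in check_key_file():
        print(problem)
```

Run it with the same environment and mounts the container will receive; no output means the variable is set and the file looks like a service-account key.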
Supported adapters
The 4 adapters registered in odd_collector_gcp/domain/plugin.py (PLUGIN_FACTORY):
| Plugin type | Source | |
| --- | --- | --- |
| bigquery_storage | BigQuery (datasets, tables, views, columns) | ✓ |
| bigtable | Cloud Bigtable | |
| gcs | Google Cloud Storage (object catalog) | ✓ |
| gcs_delta | Google Cloud Storage: Delta Lake tables | |
The reference YAML for each adapter lives in the GCP collector README. The Pydantic models that define accepted fields live at odd-collector-gcp/odd_collector_gcp/domain/plugin.py.
Installation
docker pull ghcr.io/opendatadiscovery/odd-collector-gcp:latest

Mount a collector_config.yaml at /app/collector_config.yaml and the GCP service-account JSON key at the path named by GOOGLE_APPLICATION_CREDENTIALS. A reference Compose snippet is in the GCP collector README.
Minimal config
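A minimal plugin entry only needs the required fields for its type. The sketch below collects those fields from the Spotlight sections on this page and checks an entry against them. It models the entry as Python data rather than YAML; REQUIRED_FIELDS and missing_fields are hypothetical names, the project value is a placeholder, and the Pydantic models in plugin.py remain the authoritative definition.

```python
# Required fields per plugin type, as listed in the Spotlight sections.
# Illustrative only; the Pydantic models in plugin.py are authoritative.
REQUIRED_FIELDS = {
    "bigquery_storage": {"type", "name", "project"},
    "bigtable": {"type", "name", "project"},
    "gcs": {"type", "name", "project", "datasets"},
    "gcs_delta": {"type", "name", "project", "delta_tables"},
}


def missing_fields(plugin):
    """Return the sorted required fields that a plugin entry lacks."""
    plugin_type = plugin.get("type")
    if plugin_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown plugin type: {plugin_type!r}")
    return sorted(REQUIRED_FIELDS[plugin_type] - set(plugin))


# Smallest BigQuery entry; optional fields fall back to their defaults.
minimal_bigquery = {
    "type": "bigquery_storage",
    "name": "bigquery_main",      # placeholder plugin name
    "project": "my-gcp-project",  # placeholder project ID
}

print(missing_fields(minimal_bigquery))                    # []
print(missing_fields({"type": "gcs", "name": "gcs_raw"}))  # ['datasets', 'project']
```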
Multiple plugins in one container
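Because one container can host any mix of plugin types, and each plugin covers exactly one project, a multi-plugin setup is a list of entries with distinct names. The sketch below models such a list as Python data and checks two constraints this page states: known plugin types and unique names. All names, projects, and buckets are placeholders, and check_plugins is a hypothetical helper, not collector code.

```python
from collections import Counter

KNOWN_TYPES = {"bigquery_storage", "bigtable", "gcs", "gcs_delta"}

# Two BigQuery plugins (one per project) plus a GCS plugin in one container.
plugins = [
    {"type": "bigquery_storage", "name": "bq_analytics", "project": "analytics-project"},
    {"type": "bigquery_storage", "name": "bq_finance", "project": "finance-project"},
    {
        "type": "gcs",
        "name": "gcs_landing",
        "project": "analytics-project",
        "datasets": [{"bucket": "landing-bucket", "prefix": "raw/"}],
    },
]


def check_plugins(entries):
    """Return human-readable problems found in a plugin list."""
    problems = []
    for entry in entries:
        if entry.get("type") not in KNOWN_TYPES:
            problems.append(f"unknown type: {entry.get('type')!r}")
    counts = Counter(entry.get("name") for entry in entries)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate plugin name: {name!r}")
    return problems


print(check_plugins(plugins))  # []
```

Scanning a second project with the same adapter means adding another entry with its own name, as the two BigQuery entries show.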
Spotlight: BigQuery (type: bigquery_storage)
Pulls BigQuery datasets, tables, views, and column metadata for one project per plugin.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | - | Operator-chosen unique plugin name. |
| project | string | yes | - | GCP project ID; one plugin = one project. |
| page_size | integer | no | 100 | Pagination size for BigQuery list calls. |
| datasets_filter.include | list of regex | no | [".*"] | Dataset names to include. |
| datasets_filter.exclude | list of regex | no | [] | Dataset names to drop after include matches. |
Source: BigQueryStoragePlugin in odd-collector-gcp/.../plugin.py.
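The include/exclude semantics of datasets_filter can be sketched as follows: a dataset is kept when it matches at least one include pattern and no exclude pattern. This is an illustration, not the collector's code. In particular, applying patterns with re.match (anchored at the start of the name) is an assumption, so check the SDK before relying on partial-match behaviour.

```python
import re


def is_allowed(name, include=(".*",), exclude=()):
    """Keep a name that matches any include pattern and no exclude pattern.

    Assumption: patterns are applied with re.match (anchored at the start).
    """
    included = any(re.match(pattern, name) for pattern in include)
    excluded = any(re.match(pattern, name) for pattern in exclude)
    return included and not excluded


datasets = ["sales", "sales_tmp", "marketing", "scratch"]
kept = [
    d for d in datasets
    if is_allowed(d, include=["sales.*", "marketing"], exclude=[".*_tmp"])
]
print(kept)  # ['sales', 'marketing']
```

With the defaults (include [".*"], exclude []), every dataset passes, which matches the "default includes everything" behaviour described for this adapter.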
Spotlight: Google Cloud Storage (type: gcs)
Pulls GCS objects (or folders treated as datasets) and infers schema. Supports CSV / Parquet, with explicit support for Hive-style partitioning.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | - | Operator-chosen unique plugin name. |
| project | string | yes | - | GCP project ID. |
| datasets | list of objects | yes | - | List of { bucket, prefix?, folder_as_dataset? } entries. |
| filename_filter.include | list of regex | no | [".*"] | Object names to include. |
| filename_filter.exclude | list of regex | no | [] | Object names to drop after include matches. |
| parameters | object | no | - | Optional pyarrow.fs.GcsFileSystem knobs; see the parameters paragraph below the table. |
The parameters block accepts the optional GCS-client knobs documented in the GCP collector README → GoogleCloudStorage — anonymous, access_token, target_service_account, credential_token_expiration, default_bucket_location, scheme, endpoint_override, default_metadata, retry_time_limit. They map onto pyarrow.fs.GcsFileSystem and are typically left at their defaults.
Source: GCSPlugin in odd-collector-gcp/.../plugin.py.
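Hive-style partitioning encodes partition columns as key=value directory segments in the object path. The sketch below shows that naming convention only; how the adapter itself discovers partitions is not described on this page, and hive_partitions is a hypothetical helper.

```python
def hive_partitions(object_name):
    """Extract key=value directory segments from an object path.

    Only directory segments count; the final segment is the file name.
    """
    partitions = {}
    for segment in object_name.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


print(hive_partitions("events/year=2024/month=01/part-0000.parquet"))
# {'year': '2024', 'month': '01'}
```

A folder treated as a dataset would typically be the prefix above the first key=value segment (events/ in this example).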
Spotlight: BigTable (type: bigtable)
Pulls Cloud Bigtable instances, tables, and column families. The adapter samples the first N rows per table to infer the column-family-to-qualifier shape. Bigtable has no fixed schema, so a row sample is the only way to reflect real-world data layout into the catalog.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | - | Operator-chosen unique plugin name. |
| project | string | yes | - | GCP project ID; one plugin = one project (inherited from GcpPlugin). |
| rows_limit | integer | no | 10 | Number of rows the adapter reads per table to derive the type combination across qualifiers. Higher values widen the sample (better schema coverage on heterogeneous tables) at the cost of more read units; the literal README phrasing is "get combination of all types in table used across the first N rows". |
Source: BigTablePlugin in odd-collector-gcp/.../plugin.py; reference YAML in the GCP collector README → BigTable.
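The effect of rows_limit can be pictured with a small sketch: read at most N rows and record, per column family and qualifier, every value type seen. Rows are modelled here as nested dicts of Python values purely for illustration; real Bigtable cells are bytes returned by the client library, and infer_shape is a hypothetical name rather than the adapter's function.

```python
from itertools import islice


def infer_shape(rows, rows_limit=10):
    """Map family -> qualifier -> sorted type names seen in the first N rows."""
    shape = {}
    for row in islice(rows, rows_limit):
        for family, cells in row.items():
            for qualifier, value in cells.items():
                seen = shape.setdefault(family, {}).setdefault(qualifier, set())
                seen.add(type(value).__name__)
    return {
        family: {q: sorted(types) for q, types in qualifiers.items()}
        for family, qualifiers in shape.items()
    }


rows = [
    {"profile": {"age": 31, "city": "Oslo"}},
    {"profile": {"age": "unknown"}, "stats": {"visits": 4}},
    {"profile": {"nickname": "late"}},  # outside the sample when rows_limit=2
]
print(infer_shape(rows, rows_limit=2))
# {'profile': {'age': ['int', 'str'], 'city': ['str']}, 'stats': {'visits': ['int']}}
```

Raising rows_limit to 3 would surface the nickname qualifier, which is the coverage-versus-read-cost trade-off described above.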
Spotlight: GCS Delta tables (type: gcs_delta)
Pulls Delta Lake tables stored in GCS: schemas, columns, and partitioning. Each entry under delta_tables: declares one Delta table by bucket + prefix; the adapter reads the Delta _delta_log/ to recover the table's evolved schema rather than inferring it from underlying Parquet.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | - | Operator-chosen unique plugin name. |
| project | string | yes | - | GCP project ID (inherited from GcpPlugin). |
| delta_tables | list of objects | yes | - | One entry per Delta table; each is a DeltaTableConfig (bucket, prefix, optional filter). The list is required (not the same pattern as gcs.datasets, which the GCS plugin treats differently). |
| delta_tables[*].bucket | string | yes | - | GCS bucket name. |
| delta_tables[*].prefix | string | yes | - | Path prefix inside the bucket; should point at the Delta table root (the directory that contains _delta_log/). |
| delta_tables[*].filter.include | list of regex | no | [".*"] | Per-table regex include list applied during enumeration. |
| delta_tables[*].filter.exclude | list of regex | no | [] | Per-table regex exclude list. |
| parameters | object | no | - | Optional GCSAdapterParams block, same shape as on the gcs plugin (anonymous, access_token, target_service_account, credential_token_expiration, default_bucket_location, scheme, endpoint_override, default_metadata, retry_time_limit). Documented inline in the GCP collector README → GoogleCloudStorageDeltaTables; the Pydantic model defers to it. |
Source: GCSDeltaPlugin in odd-collector-gcp/.../plugin.py; reference YAML in the GCP collector README → GoogleCloudStorageDeltaTables.
delta_tables[*].scheme exists in the Pydantic model with default gs (mapped from a schema alias for backward compatibility with older configs); leave it at the default unless you are pointing the adapter at a non-GCS Delta location.
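The log is a better schema source than the Parquet files because each commit in _delta_log/ is newline-delimited JSON, and a metaData action carries the full schema as a JSON string in schemaString together with partitionColumns. The sketch below replays commit lines in order and keeps the latest metaData action. It illustrates the log format only: the adapter's actual reader is not shown on this page, and latest_schema and _meta are hypothetical helpers working on in-memory strings rather than on GCS.

```python
import json


def latest_schema(commit_lines):
    """Return (column names, partition columns) from the last metaData action.

    commit_lines: JSON action lines from _delta_log commits, oldest first.
    """
    columns, partitions = [], []
    for line in commit_lines:
        if not line.strip():
            continue
        meta = json.loads(line).get("metaData")
        if meta is None:
            continue  # add / remove / commitInfo actions carry no schema
        schema = json.loads(meta["schemaString"])
        columns = [field["name"] for field in schema["fields"]]
        partitions = list(meta.get("partitionColumns", []))
    return columns, partitions


def _meta(names, partition_columns):
    """Build a metaData action line for the demo data below."""
    fields = [
        {"name": n, "type": "string", "nullable": True, "metadata": {}}
        for n in names
    ]
    return json.dumps({
        "metaData": {
            "schemaString": json.dumps({"type": "struct", "fields": fields}),
            "partitionColumns": partition_columns,
        }
    })


log = [
    _meta(["id", "country"], ["country"]),
    json.dumps({"add": {"path": "country=NO/part-0.parquet"}}),
    _meta(["id", "country", "email"], ["country"]),  # schema evolved here
]
print(latest_schema(log))  # (['id', 'country', 'email'], ['country'])
```

Files written before the email column was added would not show it if inspected on their own, which is why the schema is taken from the log.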
Per-adapter feature matrix
| Feature | Adapters | Notes |
| --- | --- | --- |
| Ingestion filters (datasets_filter) | bigquery_storage | Regex include / exclude for BigQuery dataset names; default includes everything. |
| Ingestion filters (filename_filter) | gcs | Regex include / exclude for GCS object names. |
| Ingestion filters (filter inside delta_tables) | gcs_delta | Per-Delta-table regex include / exclude. |
| Folder-as-dataset / Hive partitioning | gcs | folder_as_dataset with file_format (parquet / csv / tsv) and flavor (hive / presto). |
| GCS client parameter overrides | gcs, gcs_delta | Overrides for endpoint, bucket location, retry, etc. |
| Row-sampling for schema inference | bigtable | rows_limit controls the sample size used to derive column-family / qualifier types. |
Source: PLUGIN_FACTORY in odd-collector-gcp/.../plugin.py.
Known limitations
No inline credentials: every adapter expects credentials via GOOGLE_APPLICATION_CREDENTIALS or the GCP runtime's workload identity. There is no plugin-level auth field.

One project per plugin: each bigquery_storage / bigtable / gcs / gcs_delta plugin scans exactly one project. Cross-project ingestion needs additional plugins (one per project).

No foreign-key / ERD extraction in any GCP adapter.

The README's BigQuery / GCS examples are the canonical reference for parameters (the Pydantic model defers to GCSAdapterParams, which is documented in the README, not in plugin.py). The repo's anchor #googlecloudstoragedeltatables has a typo in the markdown (## instead of #); link from the README directly if anchors don't resolve.
Where to next
odd-collector: generic collector for databases, BI, streams.

odd-collector-aws / odd-collector-azure: sibling cloud collectors.

Build and run ODD Collectors: common SDK schema and from-source build flow.