odd-spark-adapter

Spark Listener that captures Spark job lineage and pushes it to the ODD Platform.

odd-spark-adapter is a push adapter for Apache Spark distributed as a JVM JAR. It runs as a Spark Listener attached to the driver, captures lineage from each job's read / write operations, and pushes the resulting metadata to the ODD Platform.

For the broader pull-vs-push picture, start at the Integrations hub.

Requirements

  • Spark 3.3.1. No other Spark version is supported by v0.0.1 today.

  • The driver must be able to reach the ODD Platform over HTTP.
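Both requirements can be checked before a job is submitted. The following is a minimal preflight sketch, not part of the adapter: the function names are hypothetical, and it assumes that any HTTP response from the platform URL, even an error status, is enough to show the driver host can reach it.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# The only Spark version v0.0.1 supports, per the requirements above.
SUPPORTED_SPARK_VERSION = "3.3.1"


def spark_version_supported(version: str) -> bool:
    """Return True only for an exact match with the supported Spark version."""
    return version.strip() == SUPPORTED_SPARK_VERSION


def platform_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP request to `url` receives any HTTP response.

    An HTTP error status (4xx/5xx) still proves the host is reachable,
    so only connection-level failures and malformed URLs return False.
    """
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        return True  # the server answered, just not with a success status
    except (URLError, OSError, ValueError):
        return False  # refused, timed out, unresolved host, or bad URL
```

Run from the driver host, `platform_reachable("http://odd-platform.internal:8080")` exercises the same network path the listener would use. In PySpark, the version string for the first check is available as `spark.version`.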

Supported lineage sources

The v0.0.1 release captures lineage from:

  • RDD low-level jobs.

  • JDBC data sources (read and write).

  • Kafka topics in batch mode.

  • Snowflake tables.

  • S3 Delta tables.

Spark Structured Streaming is not currently supported.

Source: odd-spark-adapter README.

Installation

Download the latest JAR from the Releases page of the repository. Attach it to your spark-submit invocation with --jars, or add it to your Spark cluster's classpath through your platform's standard mechanism (EMR bootstrap actions, Databricks libraries, and so on).

Configuration

The adapter reads two configuration parameters from Spark's configuration:

Spark config key | Value
spark.odd.host.url | URL of your ODD Platform deployment (e.g. http://odd-platform.internal:8080).
spark.odd.oddrn.key | A unique string identifier for this Spark cluster. The adapter uses it to construct ODDRNs for entities emitted from this cluster.

A representative spark-submit attaches the JAR, sets both keys with --conf, and registers the adapter's listener class. Refer to the README for the current listener class name, in case it changes.
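As a hedged illustration of how these pieces fit together, the sketch below assembles a spark-submit argument list from the two documented keys. It assumes the listener is registered through Spark's standard spark.extraListeners setting. The function name, JAR file name, cluster key, and application file are hypothetical, and the listener class is left as a placeholder to be taken from the README.

```python
import shlex
from urllib.parse import urlparse


def build_spark_submit(jar_path, host_url, oddrn_key, listener_class, app):
    """Assemble a spark-submit argument list that attaches the adapter."""
    parsed = urlparse(host_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"spark.odd.host.url must be an http(s) URL, got {host_url!r}")
    if not oddrn_key.strip():
        raise ValueError("spark.odd.oddrn.key must be a non-empty string")
    return [
        "spark-submit",
        "--jars", jar_path,
        "--conf", f"spark.odd.host.url={host_url}",
        "--conf", f"spark.odd.oddrn.key={oddrn_key}",
        # Assumption: the listener is registered via spark.extraListeners;
        # the class name itself must come from the adapter's README.
        "--conf", f"spark.extraListeners={listener_class}",
        app,
    ]


if __name__ == "__main__":
    args = build_spark_submit(
        jar_path="odd-spark-adapter.jar",           # hypothetical file name
        host_url="http://odd-platform.internal:8080",
        oddrn_key="my-spark-cluster",               # hypothetical cluster key
        listener_class="<listener class from the README>",
        app="my_job.py",                            # hypothetical application
    )
    # Print a shell-safe command line for copy and paste.
    print(" ".join(shlex.quote(a) for a in args))
```

Validating the URL and key up front surfaces a misconfiguration before the job starts, rather than after it has run without emitting lineage.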

What gets sent

  • Spark applications — application name, run ID, start / end times.

  • Lineage edges — from each supported source (JDBC, Kafka, Snowflake, S3 Delta) to each supported sink, derived from the Spark plan as the listener observes it.

The platform stitches Spark-emitted lineage to the rest of the catalog via ODDRNs — JDBC reads of a table already catalogued by odd-collector (PostgreSQL, MySQL, …) connect to the existing dataset entity automatically.
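The stitching idea can be pictured with a toy model. This is not the platform's implementation: it assumes only that two entities are the same when their ODDRN strings are equal. The class, function, and ODDRN values below are illustrative and should not be read as the exact format the adapter emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One source-to-sink edge, identified purely by ODDRN strings."""
    source_oddrn: str
    target_oddrn: str


def stitch(edges, catalog):
    """Split edge endpoints into ODDRNs already catalogued and new ones.

    `catalog` maps ODDRN -> existing entity; matching is exact string equality.
    """
    known, new = set(), set()
    for edge in edges:
        for oddrn in (edge.source_oddrn, edge.target_oddrn):
            (known if oddrn in catalog else new).add(oddrn)
    return known, new


# Illustrative ODDRNs for a PostgreSQL table and a derived table.
ORDERS = "//postgresql/host/db.internal/databases/shop/schemas/public/tables/orders"
DAILY = "//postgresql/host/db.internal/databases/shop/schemas/public/tables/orders_daily"

catalog = {ORDERS: "orders (already ingested by odd-collector)"}
known, new = stitch([LineageEdge(ORDERS, DAILY)], catalog)
```

Here `known` contains the `orders` ODDRN, so the Spark edge attaches to the existing dataset entity, while `orders_daily` lands in `new` because nothing in the catalog carries that ODDRN yet.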

Known limitations

  • Spark 3.3.1 only. Other 3.x versions and any 2.x version are not supported by v0.0.1. The roadmap mentions broader version coverage, but that coverage has not shipped.

  • No Structured Streaming. Streaming jobs are not captured.

  • No static collector token. The adapter identifies itself via spark.odd.oddrn.key, so the platform must be configured to accept ingestion from that ODDRN under whatever ingestion authentication you have enabled (see Enable security → Ingestion authentication).

  • JAR is the only delivery format. No PyPI package; no container image. Operators bundle the JAR with their Spark deployments.
