odd-spark-adapter
Spark Listener that captures Spark job lineage and pushes it to the ODD Platform.
Status: v0.0.1 — Spark 3.3.1 only. Other Spark versions are not currently supported. Spark Structured Streaming is on the roadmap but not yet implemented.
odd-spark-adapter is a push adapter for Apache Spark distributed as a JVM JAR. It runs as a Spark Listener attached to the driver, captures lineage from each job's read / write operations, and pushes the resulting metadata to the ODD Platform.
For the broader pull-vs-push picture, start at the Integrations hub.
Requirements
Spark 3.3.1. No other Spark version is supported by v0.0.1 today.
The driver must be able to reach the ODD Platform over HTTP.
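Before attaching the adapter, it can help to confirm the second requirement from the driver host. The sketch below is one minimal way to do that with the Python standard library. It only checks that something answers over HTTP at the given URL; it does not call any ODD Platform API, and the function name is made up for this illustration.

```python
import urllib.error
import urllib.request


def platform_reachable(base_url, timeout=5.0):
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with a 4xx/5xx status, so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return False
```

Run from the driver host, `platform_reachable("http://odd-platform.internal:8080")` would use the same example URL as the configuration section. A False result points at DNS, firewall, or routing rather than at the adapter.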
Supported lineage sources
The v0.0.1 release captures lineage from:
RDD low-level jobs.
JDBC data sources (read and write).
Kafka topics in batch mode.
Snowflake tables.
S3 Delta tables.
Spark Structured Streaming is not currently supported.
Source: odd-spark-adapter README.
Installation
Download the latest JAR from the Releases page on the repo. Attach it to your job with spark-submit --jars, or add it to your Spark cluster's classpath through your platform's standard mechanism (EMR bootstrap actions, Databricks libraries, etc.).
Configuration
The adapter reads two configuration parameters from Spark's configuration:
spark.odd.host.url
URL of your ODD Platform deployment (e.g. http://odd-platform.internal:8080).
spark.odd.oddrn.key
A unique string identifier for this Spark cluster. The adapter uses it to construct ODDRNs for entities emitted from this cluster.
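A representative spark-submit passes the adapter JAR with --jars, sets both parameters with --conf, and registers the adapter's listener class.
Refer to the README for the current listener class name, which may change between releases.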
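As a rough illustration of how the JAR and the two parameters fit together, the Python sketch below assembles a spark-submit command line. It assumes the listener is registered through Spark's standard spark.extraListeners property. The listener class name, JAR path, and application file are hypothetical placeholders, not values taken from the adapter's README.

```python
import shlex

# Hypothetical placeholder: substitute the listener class name from the README.
PLACEHOLDER_LISTENER = "<listener-class-from-README>"


def build_spark_submit(app, jar_path, odd_url, oddrn_key,
                       listener_class=PLACEHOLDER_LISTENER):
    """Return the spark-submit argument list with the adapter attached."""
    return [
        "spark-submit",
        "--jars", jar_path,
        # Assumption: the listener is attached via Spark's spark.extraListeners.
        "--conf", f"spark.extraListeners={listener_class}",
        "--conf", f"spark.odd.host.url={odd_url}",
        "--conf", f"spark.odd.oddrn.key={oddrn_key}",
        app,
    ]


if __name__ == "__main__":
    args = build_spark_submit(
        app="my_job.py",                      # hypothetical application file
        jar_path="/opt/jars/odd-spark-adapter.jar",  # hypothetical JAR location
        odd_url="http://odd-platform.internal:8080",
        oddrn_key="my-spark-cluster",         # hypothetical cluster identifier
    )
    # Print a shell-quoted command line for copy and paste.
    print(shlex.join(args))
```

Building the arguments as a list keeps each value as a separate argument, so URLs or keys containing shell-special characters are quoted correctly when printed with shlex.join.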
What gets sent
Spark applications — application name, run ID, start / end times.
Lineage edges — from each supported source (JDBC, Kafka, Snowflake, S3 Delta) to each supported sink, derived from the Spark plan as the listener observes it.
The platform stitches Spark-emitted lineage to the rest of the catalog via ODDRNs — JDBC reads of a table already catalogued by odd-collector (PostgreSQL, MySQL, …) connect to the existing dataset entity automatically.
Known limitations
Spark 3.3.1 only. Other 3.x versions and any 2.x version are not supported by v0.0.1. Broader version coverage is on the roadmap but has not shipped.
No Structured Streaming. Streaming jobs are not captured.
No static collector token. The adapter identifies itself via spark.odd.oddrn.key; the platform must be configured to accept ingestion from that ODDRN under your authentication posture (see Enable security → Ingestion authentication).
JAR is the only delivery format. No PyPI package; no container image. Operators bundle the JAR with their Spark deployments.
Where to next
Lineage feature in the catalog — what the platform does with the lineage edges this adapter emits.
Repo — sources, releases, and issues.