Visibility for Data Quality Engineer

Key words: data quality metrics, Great Expectations, dbt tests, DataProfiler, custom DQ frameworks.

Challenge

As a Quality Assurance Engineer, I cannot cover all data quality monitoring activities by hand. I know that some book orders can be mapped to the wrong dimensions or even be missing crucial fields. I want to automate the DQ monitoring process and have one place where my team and our users can check pipeline health on any given day.

Solution

The ODD Platform ingests test results from Great Expectations and dbt tests (both push-clients), plus statistical profiles from odd-collector-profiler (powered by Capital One's DataProfiler). Teams with a custom DQ framework can push results through the POST /ingestion/entities/datasets/stats endpoint of the ODD Specification. See the Test Results Import page under Data Quality for the platform-side view.
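As a rough illustration of the custom-framework path, the sketch below builds an HTTP POST request for the endpoint named above using only the Python standard library. The host `http://localhost:8080` and the helper names are assumptions made for this example, and the payload passed in is a placeholder: the actual request body must follow the schema defined in the ODD Specification, which is not reproduced here.

```python
import json
import urllib.request

# Hypothetical platform address; replace with your own deployment.
ODD_PLATFORM_URL = "http://localhost:8080"
# Endpoint path as named in the text above.
STATS_PATH = "/ingestion/entities/datasets/stats"


def build_stats_request(payload, base_url=ODD_PLATFORM_URL):
    """Build (but do not send) a JSON POST request for the stats endpoint.

    `payload` is whatever dict your custom DQ framework produces; its shape
    must match the ODD Specification, which this sketch does not validate.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + STATS_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push(request, timeout=10):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes it possible to inspect or log the outgoing body before anything reaches the platform.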

Scenario

  1. My team’s pipeline processes more than two billion book orders daily and draws on two OLTP systems and ten dimension tables as its sources.

  2. I want to check the following DQ KPIs, one for each of the six DQ dimensions:

  • Timeliness: how long does it take for an order to become available in my product?

  • Completeness: are any values missing in the most crucial fields, e.g. date, book ID, amount?

  • Uniqueness: are there any duplicated book orders in my dataset?

  • Validity: do the values comply with the expected format, e.g. does a book ISBN have the expected number of digits?

  • Consistency: when I look up a book name in a dimension table, is every book ID covered?

  • Accuracy: does my sales data reconcile with other sources?

  3. I cover the Timeliness, Completeness, Uniqueness and Validity dimensions with Great Expectations test suites and statistical profiles produced by odd-collector-profiler, both of which land in ODD alongside every other dataset's metadata.

  4. For the Consistency and Accuracy dimensions I need to compare several profiles across datasets, which the out-of-the-box frameworks don't cover. I write a small SQL script, run it on a schedule, and push its results through the POST /ingestion/entities/datasets/stats endpoint so the custom KPIs show up next to the framework-produced ones.

  5. I import test suite results from Great Expectations to ODD.

  6. As ODD allows a DQ import not only from pre-defined libraries but also from custom frameworks, I add my custom test suite results to the Platform as well.

  7. I can expose all my DQ KPIs in the ODD Platform and share them with my stakeholders: both my team and my users.
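The custom Consistency and Accuracy checks described in the scenario could look roughly like the sketch below. It uses an in-memory SQLite database so that it runs anywhere. The table names (`orders`, `dim_book`, `finance_orders`), the column names, and the sample rows are all made up for illustration; a real pipeline would run comparable SQL against its own warehouse.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with tiny, invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER, book_id TEXT, amount REAL);
        CREATE TABLE dim_book (book_id TEXT, book_name TEXT);
        CREATE TABLE finance_orders (order_id INTEGER, amount REAL);

        -- Order 3 references a book ID that is absent from the dimension table.
        INSERT INTO orders VALUES (1, 'B1', 10.0), (2, 'B2', 15.5), (3, 'B9', 7.0);
        INSERT INTO dim_book VALUES ('B1', 'First Book'), ('B2', 'Second Book');
        -- The second source is missing order 3, so totals will not reconcile.
        INSERT INTO finance_orders VALUES (1, 10.0), (2, 15.5);
        """
    )
    return conn


def run_custom_checks(conn):
    """Return a dict of KPI name -> measured value (0 means the check passes)."""
    # Consistency: orders whose book ID has no match in the dimension table.
    unmatched = conn.execute(
        """
        SELECT COUNT(*)
        FROM orders o
        LEFT JOIN dim_book b ON b.book_id = o.book_id
        WHERE o.book_id IS NOT NULL AND b.book_id IS NULL
        """
    ).fetchone()[0]

    # Accuracy: absolute difference between total sales in the two sources.
    ours = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
    theirs = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM finance_orders"
    ).fetchone()[0]

    return {
        "consistency_unmatched_book_ids": unmatched,
        "accuracy_amount_difference": round(abs(ours - theirs), 2),
    }


if __name__ == "__main__":
    print(run_custom_checks(build_sample_db()))
```

Reporting each KPI as a count or a difference, rather than a bare pass/fail flag, keeps the size of the problem visible once the results are shown next to the framework-produced ones.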

Result: I provide a transparent and accessible way to monitor pipeline health, and I also use this feature to assess the reliability of other sources I am interested in.
