Visibility for Data Quality Engineer
Key words: data quality metrics, Great Expectations, dbt tests, DataProfiler, custom DQ frameworks.
Challenge
As a Quality Assurance Engineer, I cannot cover all data quality monitoring activities. I know that some book orders can be mapped to wrong dimensions or even miss crucial fields associated with an order. I want to automate the DQ monitoring process and have a place where my team and our users can monitor pipeline health on a given day.
Solution
The ODD Platform ingests test results from Great Expectations and dbt tests (both push-clients), plus statistical profiles from odd-collector-profiler (powered by Capital One's DataProfiler). Teams with a custom DQ framework can push results through the POST /ingestion/entities/datasets/stats endpoint of the ODD Specification. See the Test Results Import page under Data Quality for the platform-side view.
Scenario
My team’s pipeline processes more than two billion book orders daily and uses two OLTP systems and ten dimension tables as its sources.
I want to check the following DQ KPIs based on six DQ dimensions: \
Timeliness: how much time does it take for an order to become available in my product? \
Completeness: do I have any missing values in the most crucial fields, e.g. date, book ID, amount, etc.? \
Uniqueness: do I have any duplicated book orders in my dataset? \
Validity: do the values comply with the expected format, e.g. does a book ISBN have the expected number of digits? \
Consistency: when I do a lookup on a dimension table to return a book name, are all book IDs covered? \
Accuracy: does my sales data reconcile with other sources?
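To make the first four KPIs concrete, here is a minimal Python sketch of how each one could be computed as a ratio over a batch of orders. It is not how Great Expectations or odd-collector-profiler work internally. The sample rows, field names, the 30-minute SLA, and the 13-digit ISBN rule are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sample of book orders; field names are illustrative only.
ORDERS = [
    {"order_id": 1, "book_id": "B1", "isbn": "9780306406157", "amount": 12.5,
     "ordered_at": datetime(2024, 1, 1, 10, 0), "loaded_at": datetime(2024, 1, 1, 10, 20)},
    {"order_id": 2, "book_id": None, "isbn": "9780306406157", "amount": 8.0,
     "ordered_at": datetime(2024, 1, 1, 11, 0), "loaded_at": datetime(2024, 1, 1, 11, 40)},
    {"order_id": 2, "book_id": "B2", "isbn": "12345", "amount": 8.0,
     "ordered_at": datetime(2024, 1, 1, 12, 0), "loaded_at": datetime(2024, 1, 1, 12, 10)},
    {"order_id": 3, "book_id": "B3", "isbn": "9780306406157", "amount": 20.0,
     "ordered_at": datetime(2024, 1, 1, 13, 0), "loaded_at": datetime(2024, 1, 1, 13, 5)},
]


def timeliness(rows, sla):
    """Share of rows that became available within the SLA after being ordered."""
    if not rows:
        return 1.0
    on_time = sum((r["loaded_at"] - r["ordered_at"]) <= sla for r in rows)
    return on_time / len(rows)


def completeness(rows, fields):
    """Share of rows in which every listed field is populated."""
    if not rows:
        return 1.0
    complete = sum(all(r.get(f) is not None for f in fields) for r in rows)
    return complete / len(rows)


def uniqueness(rows, key):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    if not rows:
        return 1.0
    return len({r[key] for r in rows}) / len(rows)


def is_isbn13_shaped(value):
    """Assumed validity rule: exactly 13 digits (no checksum verification)."""
    return len(value) == 13 and value.isdigit()


def validity(rows, field, predicate):
    """Share of non-null values of `field` that satisfy `predicate`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(predicate(v) for v in values) / len(values)


if __name__ == "__main__":
    print("timeliness  ", timeliness(ORDERS, timedelta(minutes=30)))
    print("completeness", completeness(ORDERS, ["ordered_at", "book_id", "amount"]))
    print("uniqueness  ", uniqueness(ORDERS, "order_id"))
    print("validity    ", validity(ORDERS, "isbn", is_isbn13_shaped))
```

Expressing every KPI as a ratio between 0 and 1 keeps the dimensions comparable on one dashboard. In practice the same rules would be declared as expectations or derived from profiler output rather than hand-coded.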
I cover the Timeliness, Completeness, Uniqueness and Validity dimensions with Great Expectations test suites and statistical profiles produced by odd-collector-profiler, both of which land in ODD alongside every other dataset's metadata.
For the Consistency and Accuracy dimensions I need to compare several profiles across datasets, which the out-of-the-box frameworks don't cover. I write a small SQL script, run it on a schedule, and push its results through the POST /ingestion/entities/datasets/stats endpoint so the custom KPIs show up next to the framework-produced ones.
Because ODD accepts DQ results not only from pre-defined libraries but also from custom frameworks, my custom test suite results sit in the Platform alongside the ones imported from Great Expectations.
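A custom SQL check for the Consistency and Accuracy dimensions might look roughly like the sketch below. It uses an in-memory SQLite database so that it runs anywhere. The table names (`orders`, `dim_book`, `finance_totals`), the sample rows, and the KPI names are hypothetical. The JSON produced by `build_payload` is a placeholder shape, not the schema defined by the ODD Specification, which should be consulted before posting to the ingestion endpoint.

```python
import json
import sqlite3

# Consistency: orders whose book_id has no match in the book dimension.
CONSISTENCY_SQL = """
SELECT COUNT(*)
FROM orders o
LEFT JOIN dim_book d ON d.book_id = o.book_id
WHERE d.book_id IS NULL
"""

# Accuracy: difference between our sales total and a reference source's total.
ACCURACY_SQL = """
SELECT (SELECT SUM(amount) FROM orders)
     - (SELECT SUM(total_amount) FROM finance_totals)
"""


def make_demo_db():
    """Build a tiny in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, book_id TEXT, amount REAL);
        CREATE TABLE dim_book (book_id TEXT PRIMARY KEY, book_name TEXT);
        CREATE TABLE finance_totals (total_amount REAL);
        INSERT INTO orders VALUES (1, 'B1', 10.0), (2, 'B2', 15.0), (3, 'B9', 5.0);
        INSERT INTO dim_book VALUES ('B1', 'Dune'), ('B2', 'Emma');
        INSERT INTO finance_totals VALUES (30.0);
    """)
    return conn


def run_checks(conn):
    """Run both SQL checks and return the KPI values as a dict."""
    unmatched = conn.execute(CONSISTENCY_SQL).fetchone()[0]
    difference = conn.execute(ACCURACY_SQL).fetchone()[0]
    return {
        "consistency_unmatched_book_ids": unmatched,
        "accuracy_amount_difference": difference,
    }


def build_payload(dataset_name, kpis):
    """Serialize KPIs to JSON. The structure is a placeholder, NOT the ODD schema."""
    return json.dumps({"dataset": dataset_name, "kpis": kpis}, sort_keys=True)


if __name__ == "__main__":
    connection = make_demo_db()
    print(build_payload("orders", run_checks(connection)))
```

A scheduler such as cron or an orchestrator would run the script, and an HTTP client would send the serialized body to the ingestion endpoint. Both steps are left out here because they depend on the deployment.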
I can expose all my DQ KPIs to the ODD Platform and share them with my stakeholders: both my team and my users.
Result: I provide a transparent and accessible way to monitor pipeline health, and I use the same feature to assess the reliability of other sources I am interested in.