> Validating the correctness of a large-scale data pipeline can be incredibly difficult, as the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices like unit testing only go so far.
I'm glad to see someone calling this out, because the comments here are a sea of "data engineering needs more unit tests." Reliably getting data into a database is rarely where I've experienced issues. That's the easy part.
This is the biggest opportunity in this space, IMHO, since validation and data completeness/accuracy is where I spend the bulk of my work. Something that can analyze datasets and provide ongoing monitoring for confidence in the completeness and accuracy of the data would be great. These tools seem to exist mainly in the network security realm, but I'm sure they could be generalized to the DE space. When I can't leverage a second system for validation, I generally run some rudimentary statistics to check whether the volume and types of data I'm getting are similar to what's expected.
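For what it's worth, a minimal sketch of that kind of rudimentary batch check, assuming a pandas DataFrame and hand-picked expectations (the thresholds and column names here are purely illustrative):

```python
import pandas as pd

# Illustrative expectations; in practice these come from historical runs or domain knowledge.
EXPECTED_MIN_ROWS = 90_000
EXPECTED_MAX_NULL_RATE = 0.01
EXPECTED_DTYPES = {"user_id": "int64", "event_ts": "datetime64[ns]", "amount": "float64"}

def sanity_check(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch looks plausible."""
    problems = []
    if len(df) < EXPECTED_MIN_ROWS:
        problems.append(f"row count {len(df)} below expected minimum {EXPECTED_MIN_ROWS}")
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            problems.append(f"missing column {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"column {col} has dtype {df[col].dtype}, expected {dtype}")
    # Highest null fraction across all columns.
    null_rate = df.isna().mean().max()
    if null_rate > EXPECTED_MAX_NULL_RATE:
        problems.append(f"max null rate {null_rate:.2%} exceeds {EXPECTED_MAX_NULL_RATE:.2%}")
    return problems
```

It won't tell you the data is right, but it will catch the "yesterday's load was half the size and full of nulls" class of problems before a consumer does.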
There is a whole wave of "data observability" startups that address exactly this. As a category it was overfunded prior to the VC squeeze. Some of them are actually good.
They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.
Datafold, Metaplane, Validio, Monte Carlo, Bigeye
Great Expectations has always been an open source standby as well and is being turned into a product.
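For a flavor of the expectation-style checks these tools encode, here's a minimal sketch using Great Expectations' classic pandas dataset API (newer releases use a different, "fluent" API, and the file and column names here are illustrative):

```python
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectation methods become available (classic API).
df = ge.from_pandas(pd.read_parquet("events.parquet"))  # hypothetical source file

# Declarative expectations replace ad-hoc assertions scattered through pipeline code.
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
df.expect_table_row_count_to_be_between(min_value=90_000, max_value=150_000)

results = df.validate()
print(results.success)
```

The value is less in any single check and more in having the expectations versioned, documented, and run on every load.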
Engineers demanding unit tests for data are a perfect tell for weeding out the SWEs who aren't really DEs. Ask about experience with data quality and data testing when you interview candidates and you'll distinguish the people who will solve a problem with a simple relational join in an hour (DEs) from those who will unknowingly build a shitty implementation of a database engine to solve the same problem in a month (SWEs trying to solve data problems with C++ or Java).
Unit testing is a means to an end: how do we verify that code is correct the first time, and how do we set ourselves up to evolve the code safely and prevent regressions in the future?
Strong typing can reduce the practical need for some unit testing. Systems written in dynamically typed languages like Python and JavaScript often see real-world robustness improvements from paranoid unit tests that validate sane behavior when fed wrongly-typed arguments. Those particular unit tests may not be needed in a more strongly typed language like Java, TypeScript, or Rust.
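As a concrete (hypothetical) illustration, a dynamically typed codebase tends to accumulate tests like this, which a compiler would make redundant:

```python
import pytest

def apply_discount(price: float, rate: float) -> float:
    """Return the price after applying a fractional discount rate."""
    if not isinstance(price, (int, float)) or not isinstance(rate, (int, float)):
        raise TypeError("price and rate must be numbers")
    return price * (1 - rate)

def test_rejects_string_arguments():
    # In Java/TypeScript/Rust this call wouldn't compile, so the test would be unnecessary.
    with pytest.raises(TypeError):
        apply_discount("100", 0.1)
```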
Similarly, SQL is a better way to solve certain data problems, and may not need certain checks.
Nevertheless, my experience as a software developer has taught me that in addition to the code that implements the functionality, I need to write code that proves I got it right. That's on top of QA spot-checking that the code functions as expected in context (in dev, in prod, etc.). Doing both automated testing and QA gets the code to what I consider an acceptable level of robustness, despite the human tendency to write incorrect code the first time.
There are plenty of software developers who disagree about that and eschew unit testing in particular and error checking in general — and we tend to disagree about how best to achieve high development velocity. I expect there will always be such a bifurcation within the field of data engineering as well.
If you rely on some sort of deductive correctness (i.e., my inputs are correct and my code is correct, therefore my outputs are also correct), you're only going to cover a tiny fraction of real-world problems.
Data engineering is typically closely aligned with the business, and its processes are inherently fuzzy. Things are 'correct' as long as no people or quality checks are complaining. There is no deductive reasoning, no true axioms, no 'correctness'. You can only measure non-quality by how many complaints you have received, not the actual quality, since it's not a closed deductive system.
Correctness is also defined by somebody downstream from you. What one team considers correct, another complains about. You don't want to start throwing out good data for one team just because somebody else complained, but many people do, typically those coming from SWE into DE, before they learn.