Testing & Monitoring
Before we talk about building a robust testing strategy, it is worth discussing some common approaches to testing and monitoring that are widely adopted across the industry but not optimal.
Anti-patterns
Redundant testing
Data platforms are composed of data assets — tables, metrics, dashboards and more — created by transformation tools like dbt models, Spark jobs or ingest pipelines.
A simple mental model I frequently use is to see data platforms as a network of data assets, interconnected by transformations that build them on top of each other. Data flows through the network from source systems through layers of transformations into its final use case, whether a dashboard, data export, or machine learning model.
To ensure these complex chains of transformations work, data practitioners are increasingly applying various testing techniques such as data tests, anomaly monitors, data contracts or unit tests. The built-in tests in transformation frameworks, such as the unique, not_null, accepted_values, or relationships tests in dbt, are the most popular.
To encourage adoption of testing, many teams set test coverage targets and push for tests on every model. But while the intention might be good, this ‘test every model’ approach might not yield the best results.
The model-centric approach does not consider the wider context, which almost always leads to redundancies. Tests are applied in a sequence, following the DAG of transformations, often without any chance for this data to go wrong as it moves between the models. This doesn’t just lead to unnecessary computing costs. More severely, it creates a false sense of safety.
Example: Redundant basic dbt tests
We’ve built a model called stg_latest_issues that pulls data from a stg_issues model. It has relatively simple code:
select
    id,
    workspace,
    ...
from
    {{ ref('stg_issues') }}
order by
    ingested_at desc
limit
    1 by id
The upstream table contains one record for every modification of a given issue, and the model pulls the latest record for each issue by id, which is the system’s unique identifier for the problem.
One possible issue with such data is that an id is empty, suggesting that we are working with a corrupted issue record. That is why the upstream table stg_issues verifies that such a situation can’t happen with the following test:
select * from {{ ref('stg_issues') }} where id = ''
If this test returns any records, it will fail.
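For reference, the same guarantee is often expressed declaratively in the source’s schema file rather than as a singular test. A minimal sketch, assuming the dbt_utils package is installed and using the column name from the example above:

# models/staging/stg_issues.yml
version: 2
models:
  - name: stg_issues
    columns:
      - name: id
        tests:
          - not_null
    tests:
      - dbt_utils.expression_is_true:
          expression: "id != ''"

Either form works; what matters for the rest of this section is where in the DAG the check lives, not how it is written.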
But what happens when we pull data into the downstream model stg_latest_issues? Do we test for an empty id again? How could this id possibly become empty? The short answer is that it can’t.
The above example is a very simplified case of a problem that can lead to significant redundancies in testing.
Mechanically re-applied tests without awareness of the broader data pipeline do not add more safety to our data.
Disconnected tools
As data systems evolve, so does the data quality tooling. Today, it’s common to see teams using a combination of tools for data quality: dbt for testing, a separate platform for anomaly monitoring, and another system to monitor the overall ETL pipelines that coordinate execution.
Teams involved in analytics across Data Engineering, Analytics Engineering, Data Science, and Data Analytics also have different needs. While data engineers are predominantly focused on executing pipelines and want to ensure that all necessary jobs run correctly, analytics engineers and data analysts are more concerned with the content of the data itself, deploying data tests and anomaly monitoring. On top of this, governance teams have their own expectations: they want to understand the strategic evolution of data quality, for which they frequently profile the data and measure it along several governance dimensions.
This leads to fragmentation.
Example: Freshness monitoring on tables goes wild
A practical example is a dbt project with hundreds of tables, complemented by a sub-optimally configured anomaly monitoring tool. dbt runs in hourly jobs, so under normal operation the entire DAG refreshes once per hour.
On the side, the team has deployed an anomaly monitoring system, and one of its rules monitors freshness, i.e., detects whether the data gets abnormally delayed. These monitors are now table stakes in many monitoring tools, and they will quickly learn the correct pattern of data refreshing every hour.
One day, this pipeline breaks at 2 am. One dbt model didn’t execute correctly due to a timeout issue with the underlying data platform. This causes the entire DAG to fail, with one failed model and hundreds of others skipped.
Thanks to a well-organized dbt alert, the right team gets notified about the failure with the proper context of what failed and the fact that the entire downstream pipeline is affected.
But what about the freshness monitors? Every monitor for every skipped model reports an issue just an hour later. The team is hit with unactionable alerts that add little value to their issue resolution; the dbt job failure has already provided all the context about which models were skipped.
A more cohesive and strategic approach to testing could have avoided the unpleasant experience of hundreds of unnecessary alerts.
For example:
If we correctly monitor our dbt jobs, why do we want to deploy any freshness anomaly monitors on any tables created by models in these dbt pipelines?
There should be no scenario in which we wouldn’t know about the failure—deterministically—by monitoring the jobs. In the above scenario, an anomaly monitor that behaves probabilistically introduces a risk of false alerts, and at best, it will learn to predict delays in our scheduler, which we already know and control.
‘Test/monitor all’
Data ecosystems are becoming increasingly complicated, and testing approaches that recommend testing or anomaly monitoring on every table lack the context of critical use cases. One example is hundreds of freshness tests deployed where they add no value.
One common piece of industry advice is to monitor a wide range of tables (or all of them) for table-level anomaly patterns. One possible reason for this advice is that such monitors have become relatively easy to execute at scale. But such a setup is often noisy. Data anomalies tend to cascade across data stacks, so an anomaly detected in one table can trigger dozens of other tables to report abnormal data patterns without much additional value.
The same applies to testing. If we attempt to test everything, we will likely receive a proportionally higher number of alerts, but without a clear sense of importance we will lack the incentive or urgency to fix them.
Therefore, the desire to have a wholly tested data stack, whether with anomaly monitoring or data testing, is academic. It sounds like a good thing to do in theory, but in practice today’s data systems contain a mixture of use cases and data flows with different risks and expectations of reliability.
Focus on models
Today’s prevalent approach to testing in the industry is focused on models, which is why we see advice such as ‘every model should be tested.’ We also see teams setting targets, such as test coverage, to incentivize teams to make measurable progress towards such goals. While such advice takes some inspiration from software engineering, it lacks nuance. Most of the software engineering code in systems we take inspiration from is in production. If such systems fail, they cause an immediate and direct impact on customers, so their reliability is paramount. But that is not the only type of code software engineers write.
In some cases, engineers write one-off scripts to execute a finite task. Such code will not become part of the production system and, in some ways, becomes obsolete once the task is done. For posterity, such scripts are frequently checked into a source code repository and live alongside the production codebase. Still, engineers typically apply much more rigorous testing to code that has to run in production than to ad-hoc code.
This distinction is even more critical in analytics data platforms.
Today’s analytics systems don’t clearly distinguish between production and scripting. In software engineering, the software is typically composed of reusable modules and libraries that can be freely reused in scripts, often executed on an engineer’s local computer. Data teams have to deal with many more constraints. Their systems run on cloud data platforms and must process vast amounts of data to perform experiments. They often have no option but to develop their work alongside the rest of the data in their production data platforms.
This forces data teams to experiment in a way that is much more integrated with their other, often critical, data. In other words, the experimental code is mixed into a DAG of models built on top of each other.
Case Study: Life-cycle of a reverse ETL
You are an analytics engineer at a mid-sized organization with 500 dbt models. Since your organization has invested in data for a while, many of these models feed into various dashboards and other data applications. Some of these data outputs are very important to your company. One of them is a reverse ETL that feeds data back into Salesforce, which helps your commercial team colleagues prioritize which accounts they reach out to tomorrow. Another essential use case is in marketing. Your demand gen team uses customer CLTV data to feed look-alike models in advertising platforms. These use cases are critical.
You’ve been asked to develop a new version of customer scoring. The analysts created a solid hypothesis for how the new scoring can work and validated with stakeholders that the new metrics are indeed meaningful and valuable to the commercial teams. But it needs some work: so far the data has only been wrangled in dozens of spreadsheets, which is fragile and can’t be used as is. That is why you come in to make it more robust. You will need to bring several new data sources into the data stack and write a series of models that process them. This will take at least a week and collaboration with data engineers.
In such a scenario, it’s natural and expected that all this code starts incrementally landing into the dbt project, layer by layer. From this point, it may take weeks before this new pipeline matures enough to replace the current production system.
One day, you get a ping from your colleague on the weekly rota for triaging incoming data errors: some of your models are timing out. The next day, one of the data sources didn’t refresh. The first problem happened when an unexpectedly large amount of data arrived at the warehouse; the second was the pipeline skipping one crucial task, with the retry not working. You knew this could happen, as the data engineering team is still working on the new pipelines. It’s not a big deal; it’s a work in progress, and all of this will be solved before the system gets to production, and the team is well aware.
And this is just your project. A dozen others are being developed in parallel.
It’s essential to understand this dynamic.
It’s not uncommon for data platforms to feed into dozens or hundreds of final use cases that need to be operated with vastly different reliability guarantees.
In some cases, the specific use case is a work in progress. In others, it’s an experimental dashboard one of the product analysts put together for their team. In another case, it’s a daily pulse one of the teams in operations sets as a reminder of something nice to know. In contrast, you might have use cases such as an automated marketing pipeline deciding how to allocate hundreds of thousands of dollars monthly. All these use cases are intertwined.
The result: Alert overload
Combine all of the above and we have a solid recipe for alert overload. Many teams are drowning in a large volume of hard-to-action, unclear, or unimportant alerts.
This highlights the challenge: we need a better, more strategic approach. The best approach to testing has to balance several key, and to a degree contradictory, forces:
- Minimize the number of alerts received to prevent alert fatigue.
- Minimize the number of tests and monitors we create and execute to reduce overall complexity and cost.
- Maximize the chance of detecting issues.
Testing software well is an art and skill that software engineers spend years mastering. Data teams are no different. However, the benefits are clear: well-done testing means more reliable systems and more robust data.
There are several vital considerations to make.
Designing a Testing Strategy
Testing pipelines, not models
The above example can inspire an alternative way of looking at our data platform. Instead of seeing it as a network of interconnected models with dozens of outputs, we can start from the end use cases and data products and expand them into pipelines.
A data product pipeline is a directed acyclic graph of the product’s upstream dependencies, including all models, tables, and pipelines that move data from its source into the product.
Consequently, the pipeline can be seen as an end-to-end horizontal slice through the entire data platform, traversing all system layers. This approach brings several benefits, especially if we’ve done diligent work on the definition of data products and their severity:
- Severity can back-propagate. By definition, the pipeline of models feeding into a P1 critical data product has to be treated with the same severity. Otherwise, we create misaligned incentives: the downstream team is held to a P1 priority while someone else upstream is not, and that doesn’t work. Severity back-propagates.
- Clearer impact assessment. Failures identified within the data product pipeline can be directly linked to the product. This can significantly improve communication across teams, as the pipeline is expected to run through data assets owned by different teams.
- Feedback on architecture. Since the content of the pipeline carries the severity of the end use case, we can use pipelines to understand how much use cases are entangled across different priorities (P1 to Pn). More intersections between them increase the chance that work on a P4 pipeline unintentionally impacts a P1 system.
Another way to look at the pipeline-centric approach is that, besides various technical and workflow benefits, identifying the pipelines, especially the ones feeding critical data use cases, gives us a lot of focus. As a result, we can evolve our approach to testing, focusing our energy and effort on pipelines that feed highly critical products.
—
However, despite the above benefits, the data pipeline-centric approach is rare in the analytics community. At SYNQ, we believe this is mainly because we don’t have the proper tools to work that way. To establish this new approach, we need to solve the following:
- Anchor data product definition into observability and catalogs and use underlying metadata, particularly lineage, to identify up-to-date dependency chains leading to the data products.
- Create pipeline automation workflows. Once we identify the pipeline, we need better systems to govern it. We should be able to deploy monitoring and focus our testing on specific pipeline(s), as well as create workflows that help us improve the architecture of our system, making it easier to maintain large numbers of such pipelines in a single data ecosystem.
- Establish data product pipeline oversight. Instead of looking at entire systems with metrics such as model test coverage, we will need to reason about pipelines by understanding failure rates, SLAs, issues, and incidents by pipelines.
Case Study: Establishing a P1 Data Product pipeline for SYNQ user-facing analytics
At SYNQ, we operate a data platform with hundreds of tables and models and face similar challenges as most data teams: our data platform, ClickHouse, feeds a broad mixture of data use cases, from P1 user-facing systems to exploratory analysis of feature adoption in our product (P4).
Our latest pipeline was for user-facing analytics that we expose to customers to help them understand quality metrics such as the percentage of failed tests, time to resolution, and the number of issues, which they can analyze by data product or team.
As a consequence of being customer-facing, this pipeline is P1, i.e., of the highest importance, and we treat it as any other production system powering our user-facing product.
The end product is a React.js application that interactively exposes data to our customers. Under the hood, it queries multiple ClickHouse Parameterised Views that act as an interface to the frontend. These views leverage data from 4 different source systems (SYNQ micro-services) to calculate the final metrics.
To create clarity, we defined these final views as a data product, which anchored this use case in our observability approach.
With this definition in place, we used lineage to identify the pipeline.
Understanding our pipeline allows us to reason about our testing in a particular way: ensuring that the data arriving in the final views is robust and reliable, with little room for error.
Testing Layers
With an approach to testing anchored around data pipelines, we’ve created a lot more clarity and focus on specific models that we want to test to an elevated degree since they feed the P1 product. This is beneficial for the particular pipeline, but without additional concepts, it could be dangerous in the long term.
As we discussed earlier, it’s common and reasonable to expect data product pipelines to be interconnected. However, developing a unique, bespoke strategy for every pipeline would be complex and infeasible.
This is why we think about testing in layers.
Data platform layers are a standard mechanism for tackling their complexity. We often hear about raw, staging, mart layers, or medallion architectures with bronze/silver/gold layers.
For data architecture and modeling, the data platform layers help us decide what kind of transformations to apply where. Testing is the same; we want a testing strategy outlining what type of testing to use at every data platform layer.
We do so with several objectives:
- Alignment around what to expect. Layers literally ‘slice’ the big testing problem into smaller parts, making it easier to reason about what kind of testing we can expect on data assets.
- A common standard. We codify the decisions in the strategy, removing them from everyday work. By deciding what to test at each layer, we align the team.
- Fewer redundancies. Retesting the same thing, like asserting that customer_id is not_null in every single model across transformation layers, doesn’t bring more reliability; it generates more cost and contributes to alert fatigue. We want to eliminate redundancies.
The golden rule to strive for (the same as we do when testing in software engineering) is as follows:
For any potential failure mode of our data, we want the minimum number of tests or monitors, ideally exactly one, to fail.
Let’s say we have five customer records with customer_id NULL. This is unexpected. But if more than one test fails, it’s a code smell. It contributes to clutter in alerting, adds more code to maintain, and reduces clarity. One test should have failed.
Defining the testing layers
Testing layers should be tightly coupled with architecture. In other words, design your testing layers to fit your architecture.
At SYNQ, on the highest level, we slice our data platform into the following layers:
- Pipelines start at the data source (operational system) and end when data enters the analytical system, typically the first layer of tables in the data warehouse. Pipelines often involve ETL jobs, raw data lakes, or streaming systems.
- Sources are the first layer of data inside the data warehouse. This is typically the first layer where we can apply transformations in SQL and implement tests with dbt or similar analytics engineering tooling.
- Transformations are layers of data transformations where we clean up the data, model key business concepts, and create data assets ready for analytics use cases, typically our core data models.
- Marts are the last layer, where data leaves the platform for a given use case (or use cases).
The above model is generic and should apply well to most of today’s typical data platforms.
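As an illustration, here is a minimal sketch of how these layers might map onto a dbt project configuration; the project and folder names are hypothetical and should follow your own architecture:

# dbt_project.yml (excerpt)
models:
  analytics:              # hypothetical project name
    staging:              # the 'Sources' layer: thin models on top of dbt sources
      +materialized: view
    intermediate:         # the 'Transformations' layer: cleaned, modelled business concepts
      +materialized: view
    marts:                # the 'Marts' layer: data leaving the platform into use cases
      +materialized: table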
The Strategic Approach
We’ve finally set enough foundations to move on to the meaty part of the testing and monitoring strategy design: Deciding what types of tests to apply and where.
We will work with the following fundamental principles:
- Tests and monitors are designed to complement each other. We decide what failure modes we test within each layer of our data platform and pipelines.
- We minimize redundancy. We avoid putting similar or identical tests into a sequence of models in the DAG to ensure that only one test fails on failure.
- We combine data testing and anomaly monitors into a cohesive testing strategy that avoids duplication.
Testing Sources
Let’s start with the definition.
What do we mean by a source?
The source layer of a data platform is a collection of data assets ingested from third-party systems (CRM, ERP, the company’s own products) or from other, non-data teams (engineering, business operations) within or outside the company.
In other words, it’s the layer of data that is under the control of the data team for the first time.
In many organizations, this maps directly to the concept of dbt sources: we frequently load data into data platforms from ingest pipelines or cold storage systems that are not primarily used for analysis, and start the modeling activities inside of dbt.
While this definition is practical and applicable to most data teams, some teams have additional pre-processing steps before dbt, and therefore, their data sources are more upstream.
A practical guideline to decide what our sources are would be as follows:
Find a layer of data as close as possible to the source, where you can control the data and add monitoring and tests without dependency on other teams.
dbt sources or the first layers of models in dbt perfectly fit this definition. But so would an upstream platform with well-established testing tooling such as great_expectations.
Why invest in testing sources?
The above definition of data sources as the upstream layer in the data team’s control has a reason. A source layer of data that fits this definition essentially becomes an interface.
It is an interface with many exciting properties, which are very helpful for testing:
- It separates systems built predominantly by the data team from other systems. Everything downstream of sources is data models and other analytical assets. Most of the business happens upstream.
- Detecting issues at source means detecting issues from upstream. Since we test data that arrives at the data platform as far upstream as possible, we are not testing any logic within the models. We are testing assumptions about the input data to the analytics system, which means we are testing the consistency of data that arrives from the source.
- It aligns well with ownership—downstream of the source, data and analytics teams own the assets, while upstream is owned by data engineering, operations, and engineering teams across the business. Well-tested sources can be a great mechanism to engage the upstream teams in data quality.
Sources are input we must work with throughout the rest of the pipeline. They are the most upstream assets within data analytics modeling, and so failure to detect errors at this layer often has broad and severe consequences.
Testing sources is a high-leverage activity. We are verifying the quality of data that feeds every other model in our system, the layer with the most significant number of downstream dependencies and attributable usage.
This definition is also the reason for the following advice, which we apply internally at SYNQ and also give to our customers:
When in doubt about where to start with testing, start with sources.
We will combine this advice with our shifted focus on testing pipelines to create a practical prioritization framework for quality improvements.
What does a well-tested source look like?
Previously, we’ve challenged a common technique: put unique and not_null tests on the primary key and consider the source (or model) well-tested. Such a strategy promotes a setup where most data flows through the system untested, creating a false sense of security.
This is particularly important at the source.
One critical decision we will make is what kind of tests we want to apply at the source. To some degree, this is generic, but you should always consider your context critically, especially from the risk perspective: What is the risk of not detecting an issue at the source?
We follow with a list of typical data issues we want to prevent at the source, outline what types of tests are most applicable, and provide a rationale for why and how we should set them up.
Testing structure
Data comes from the business in many shapes and forms, and since the business tends to evolve, so does the data structure. It is essential to verify that the critical expectations we rely on are met.
When reasoning about how deep to go, especially when the criticality of the use cases for the data being brought to the data platform is not fully known, it’s worth being defensive and testing sources well so they can support use cases with the highest criticality.
Another way to assess the depth of testing sources is to look at the data source from the business perspective:
How central is this data to our business?
I’ll give a few examples that bring this advice to life:
- In fintech, data sources that represent transactions
- In e-commerce or logistics, data sources that represent orders or shipments
- In SaaS or infrastructure, the data that represents usage and engagement
- In marketplaces, the data that represents demand and supply
These sources will likely be central to their respective data platforms. Of course, this is a small set of examples to illustrate the principle. Think about how central every source is to your business domain; this will likely correlate with the criticality of the use cases built on top of it.
With that in mind, let’s dive into the testing itself. For well-tested sources, we generally want to have the following in place (a consolidated dbt sketch follows this list):
- Data is present and complete. We are mainly focused on detecting not_null and empty values. This can be well done with dbt and similar frameworks, as these tests are readily available and widely used.
- Ensuring uniqueness—duplicate data is one of the most common ways data can go wrong. Applying unique tests from dbt is a way to prevent issues. Besides the built-in unique dbt test, there are also several extensions, such as unique_combination_of_columns or even conditional unique tests, that could test for more nuanced business scenarios around unique data.
- Testing values—for low-cardinality fields, it’s worth verifying the expected (or blocked) values to catch deviations early. We tend to recommend an approach with explicit enumeration (the accepted_values test that comes with dbt Core) because the data team gets explicitly notified about new values, which often have to be handled in downstream models; otherwise they might get rolled up into ‘other’ logic branches or remain completely unhandled, causing issues.
- Testing the format of values—for fields with higher cardinality where it’s infeasible to enumerate the values explicitly, it’s worth testing the format of the values in another form. For numeric values, this could be done via min/max logic; for text values, this could be a regular expression for the expected format (like a birthdate or email). These tests are beneficial in detecting corrupted records at scale early.
- Monitoring ingests of data—one of the most common sources of issues in analytics systems is a broken flow of data, typically as an unintentional side-effect of an upstream change. There are multiple layers of ensuring a consistent data flow, worth breaking down further:
- Freshness testing is an excellent tool for scenarios where we want to enforce concrete SLAs regarding the timeliness of the data. For data sources that we know have to be refreshed at least once per given period, a deterministic freshness test (like the dbt source freshness test) is the right tool.
- Freshness monitoring—in comparison, many data sources are not as specific in their update frequencies. Further, setting up hard thresholds on hundreds or thousands of sources is impractical. Freshness anomaly monitoring is a viable alternative, as the threshold can be learned from historical data patterns. This approach is superior when freshness has a seasonality or pattern that is hard to express in simple freshness tests, like a data pipeline that goes quiet every night or every weekend. Freshness tests would either alert too frequently (and cause alert fatigue) or have to accept considerable delays to account for the longest gaps of inactivity. These criteria can be used to decide whether a test or a monitor is the better solution.
- Volume monitoring—ensuring a solid volume of data is even more firmly in the anomaly monitoring domain. Maintaining expected thresholds for increments of data is impractical and tedious. Especially given the typical evolution of a business, which grows and changes, data volumes tend to fluctuate much more than the frequency of loads (freshness). Therefore, anomaly monitoring that can also model seasonality is the more feasible option.
- Monitoring deeper—the final consideration is whether tracking data at the table level is sufficient or if deeper monitoring is required. Consider a scenario where the data source in the platform is a fan-in of many underlying data sources—like hundreds of events tracked from a website. A single missing event would most likely go unnoticed by all other solutions: tests, volume, or freshness monitors set up at the table level will not pick up nuanced changes where perhaps less than 0.1% of the data stops flowing. Yet this could still be 100% of the typical data for a specific website traffic event that broke due to a faulty website update. For critical data sources, deeper monitoring that works on segments of data (e.g., volume monitoring per event) is a great tool.
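To make the above concrete, here is a minimal dbt sketch of a well-tested source; the source, table, and column names are hypothetical, and the regex test assumes the dbt_expectations package is installed. Volume monitoring and segment-level monitoring would typically live in an anomaly monitoring tool rather than in this file.

# models/staging/sources.yml (hypothetical names)
version: 2
sources:
  - name: payments
    loaded_at_field: ingested_at
    freshness:
      warn_after: {count: 2, period: hour}    # deterministic freshness test (SLA)
      error_after: {count: 6, period: hour}
    tables:
      - name: transactions
        columns:
          - name: id
            tests:
              - not_null      # data is present and complete
              - unique        # no duplicates
          - name: status
            tests:
              - accepted_values:    # explicit enumeration of low-cardinality values
                  values: ['pending', 'settled', 'failed']
          - name: customer_email
            tests:
              - dbt_expectations.expect_column_values_to_match_regex:   # format of high-cardinality values
                  regex: '^[^@]+@[^@]+\.[^@]+$'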
Does this sound like a lot? It’s indeed more than ‘one test per model,’ but this is what it takes to test sources well. I’ve previously mentioned that mechanical testing is problematic because we are not intentionally designing test suites. I’d like to add some nuance at this point.
By defining a testing strategy for sources like the one above, we have applied that intention strategically—we’ve deliberately decided how we will test sources, choosing which tests we use (and which we don’t). With a chosen strategy, we can roll out the above setup at scale, to a degree with automation, making the daunting task of writing all these tests much more approachable.
The role of data contracts
At this point, it’s worth reflecting on the relationship of source testing with another concept: data contracts.
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and its consumers. It is like an API but for data. —https://datacontract.com/
This definition is essential as it clarifies that well-tested sources and data contracts are complementary techniques. Data contracts remain reliant on source testing: data contract specifications and tooling focus on an explicit definition of the contract, including its quality expectations—which might include a definition of data tests—but the contract itself is a description. It doesn’t enforce quality by itself.
With that in mind, teams that already use data contracts should integrate testing data sources and implementing data contracts into a single workstream. A data contract can act as a prescription for what should be tested, while tooling such as dbt or SYNQ can act as an execution environment where these tests get provisioned, executed, and tracked, as sketched below.
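As a small illustration of that workstream (this is ordinary dbt YAML, not the Data Contract Specification syntax; the column names and contract clauses are hypothetical), the contract’s expectations can be mirrored by tests in the execution layer:

# models/staging/stg_payments.yml (hypothetical)
version: 2
models:
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - not_null     # contract: field is required
          - unique       # contract: field is unique
      - name: currency
        tests:
          - accepted_values:        # contract: enumerated values
              values: ['EUR', 'USD', 'GBP']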
Testing Transformations
Because we are still applying a pipeline-centric approach to testing, the volume of tests we add from here on starts to decrease. In other words, we’ve done a lot of testing at the source, which we don’t have to repeat.
With a well-established testing strategy for sources, transformations can have much lighter testing suites, where we focus only on what has changed.
This fundamentally differs from model-centric testing, where we look at each model in isolation. Instead, going forward, we consider the tests we’ve already established upstream (at source) and fill the gaps.
Several key types of transformations are essential to test because they tend to be prone to errors:
- Data cleaning and normalization—often simple logic that is easy to test.
- New combined/derived columns—when new columns are created by SQL logic that transforms or combines other columns, it’s new business logic. This could be anything from simple transformations, such as recoding numerical codes into more user-friendly values, to complex CASE/WHEN statements that introduce a big chunk of business logic.
- Table joins—a persistent source of errors, typically caused by duplicates in the joining keys.
- Data aggregation—where we create new groupings of data, typically using GROUP BY clauses or analytical functions.
- Data structuring and reshaping—where the structure of the table is fundamentally reorganized, typically using GROUP BY clauses or, in some cases, more advanced concepts like pivots.
Given each model in our ecosystem could be a different combination of the above, it could be helpful to define testing techniques for each type of concept, which can then be combined into a final testing suite for the given model.
Data Cleaning and Normalization
Cleaning and normalization are often very lightweight data transformations; therefore, testing them could be equally lightweight.
Cleaning and normalization code is also very frequently a source of redundancy. This is because lightweight transformations are very unlikely (or in some cases impossible) to be a source of data errors.
Think about simple changes such as normalizing column names:
select
    id as issue_id,
    workspace as tenant,
    toDateTime(created_at) as started_date,
    ...
from
    {{ source('incidents', 'issues') }}
We could retest issue_id, tenant, or started_date, but such tests would be redundant with the testing we have already done at the source.
For example, the tests we’ve already written to verify that the upstream id is not null and unique, that workspace is always present and correctly references the workspaces table, and that created_at exists all contribute significantly to the reliability of this downstream model, too. The upstream id uniqueness test, for instance, guarantees that issue_id will be unique as well.
Eliminating such redundant tests has several key benefits:
- It is easier to reason about—we have a more unified place for each type of test
- It’s less likely to generate redundant alerting—especially if you don’t use dbt build, or if the tests have warn-level severity, downstream copies of the same test will still run and fail, producing more failed test results than you need (see the severity sketch after this list).
- It’s less compute—each test is an additional query to your data, and therefore, eliminating redundant tests also eliminates redundant traffic.
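For reference, this is what warn-level severity looks like on a dbt test; a minimal sketch with illustrative model and column names:

# models/staging/stg_latest_issues.yml
version: 2
models:
  - name: stg_latest_issues
    columns:
      - name: id
        tests:
          - unique:
              config:
                severity: warn   # failures are reported but don't block the build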
Case Study: Removing redundant uniqueness tests to save 10% of warehouse cost
At the start of 2024, we engaged with a customer looking for ways to save costs on their data platform. They have done an excellent job with models, switching many to incremental materialization or optimizing their underlying SQL structure, but they forgot about testing.
After a quick analysis, we identified a sequence of 3 tables in their DAG, each testing for the uniqueness of the primary key (id) on a large table. Each test was verifying the uniqueness of more than a billion records every hour. To make matters worse, these tests were not just slow on their own but also redundant; just one of them would have created enough safety. By removing the redundant downstream tests and changing their frequency, we reduced data warehouse compute cost by 10% in a single commit, saving tens of thousands of dollars per year.
Newly derived columns
Newly derived columns could be looked at through a simple conceptual lens.
A model that creates derived columns is their source; therefore, we can apply a playbook from testing data at the source to test derived columns.
In other words, derived columns are worth testing deeply, in proportion to the complexity of the logic applied to create them. We use various typical data testing methods, such as uniqueness, non-null, or accepted values. We should always keep the logic in mind and consider where it can break.
For example, consider the following logic:
CASE
    WHEN ... THEN 'active'
    WHEN ... THEN 'pending'
    ELSE 'inactive'
END AS status
Testing such a column for not_null values doesn’t make sense in many scenarios, as non-null values are implicitly ensured by the ELSE clause.
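What usually does make sense here is asserting the expected set of values, so a new, unhandled branch surfaces explicitly. A minimal sketch, assuming the column lives on a model called stg_issues (an illustrative name):

# models/staging/stg_issues.yml
version: 2
models:
  - name: stg_issues
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'pending', 'inactive']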
Derived columns often contain much more complex logic, which could combine data from many columns, potentially on top of joins across multiple tables. In such scenarios, it might be beneficial to bring in a technique we haven’t discussed yet: unit testing.
Unit testing differs from data testing as it covers a complete testing cycle, from seeding the input data to executing the logic and asserting the output.
This makes it a great tool of choice for testing multiple scenarios where we seed, execute, and assert different setups. Doing such testing on the transformations that power derived columns can be an efficient and not-so-complex way to introduce unit tests into your quality strategy.
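A minimal sketch of such a unit test using dbt’s built-in unit tests (dbt Core 1.8+); the input columns and the mapping to status are hypothetical and only mirror the shape of the CASE logic above:

# models/staging/stg_issues.yml (excerpt)
unit_tests:
  - name: status_is_derived_correctly
    model: stg_issues                           # hypothetical model containing the CASE logic
    given:
      - input: source('incidents', 'issues')
        rows:
          - {id: 1, resolved_at: null, muted: false}
          - {id: 2, resolved_at: null, muted: true}
          - {id: 3, resolved_at: "2024-05-01", muted: false}
    expect:
      rows:
        - {id: 1, status: "active"}
        - {id: 2, status: "pending"}
        - {id: 3, status: "inactive"}

Each scenario is seeded and asserted explicitly, which is what gives unit tests their precision.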
But keep one crucial thing in mind.
The maintenance cost of unit tests is higher than the maintenance cost of data testing.
When we decide to change the SQL logic, we must update unit test expectations for all scenarios. It’s a natural trade-off: the more diligent reliability guarantees we get, the more work we must do to maintain the tests.
It’s essential to balance robust test coverage and ease of maintenance. Every unit test should have a purpose and typically be introduced in scenarios where more straightforward data tests are insufficient.
Joins
Joins are notoriously error-prone, typically due to unexpected duplicates in any of the joining variables.
The scenario is simple: we join two tables on a variable that is — unexpectedly — not unique. As a result, we duplicate rows. This problem is called ‘fanout’.
It’s why testing models with joins is particularly important, especially with the right types of tests that specifically aim to detect the ‘fanout’ problem.
At SYNQ we frequently use one of the following strategies.
Row count validation
In cases where we’re joining data without aggregating or filters, we can apply the row count check.
Consider the following scenario:
select
    issues.id as issue_id,
    incidents.id as incident_id,
    ...
from
    issues
    left join incidents on issues.incident_id = incidents.id
The issues table is the base table of the join. In our domain, one issue can be part of zero or one incident, which means that under no circumstances should a single issue join to multiple incidents. But it would be naive to assume that this important business assumption is always met. This is why we need a test.
Given we’ve constructed the join mainly to enrich issues with additional data from incidents, without additional logic, we can write a simple check: the number of rows in our new table should match the number of records in the issues table. In other words, we verify that no additional rows were created.
One of the benefits of the row count test is that we can execute it as a comparison of select count(*) from issues and select count(*) from the joined model. Executing such a query is typically cheap, as the data platform can use underlying metadata, potentially without scanning any data from the underlying storage.
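As a sketch, such a check can be written as a dbt singular test; issues_with_incidents is a hypothetical name for the joined model:

-- tests/assert_no_join_fanout_issues_with_incidents.sql
-- Fails (returns a row) when the join changed the row count compared to the base table
with base as (
    select count(*) as row_count from {{ ref('issues') }}
),
joined as (
    select count(*) as row_count from {{ ref('issues_with_incidents') }}
)
select base.row_count as base_rows, joined.row_count as joined_rows
from base, joined
where base.row_count != joined.row_count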
Verify cardinality of join keys
An alternative approach is to verify the cardinality of the columns involved in the join. As we discussed, the assumption is that an issue can only be part of one incident. This can be directly verified by one of the tests in dbt_utils:
# models/issues.yml
version: 2
models:
  - name: issues
    columns:
      - name: incident_id
        tests:
          - dbt_utils.cardinality_equality:
              field: id
              to: ref('incidents')
This test will fail if the number of distinct values in incident_id is different from the number of distinct values of id in the incidents table.
The above test has the potential to be an even more accurate way to prevent the fanout problem, but it has a caveat: compared with the row count test, its SQL logic is considerably more complex. This might not be a problem for small models, but it can become a challenge if you’re dealing with a large number of rows.