We’ve led data and software teams at Google, GWI, Pleo, and Monzo and spoken with thousands of data teams over the past years. Again and again we see the same struggle: data teams lack a framework to support the growing expectations placed on data use cases in their companies.
Data has evolved from being a nice-to-have to becoming a critical component of core business processes. Data used for decision-making is important and if data is incorrect it may lead to wrong decisions and over time a loss of trust in data. But data-forward businesses have data that is truly business-critical. If this data is wrong or stale you have a hair-on-fire moment and there is an immediate business impact: Tens of thousands of customers may get the wrong email as the reverse ETL tool is reading from a stale data model. You’re reporting incorrect data to regulators and your C-suite can be held personally liable. Or your forecasting model is not running and hundreds of employees in customer support can’t get their next shift schedules before the holidays.
15% of dbt projects have more than 1,000 models, and 5% have more than 5,000 models. These aren’t just large numbers—they represent growing data ecosystems connected to hundreds of upstream data sources and feeding hundreds of downstream destinations. Data Mesh has gained traction as a solution for scaling data work but teams struggle translating the principles into actions.
This scale comes with a cost. Speed and agility take a hit, leading to frustration both within the data team and among stakeholders. Collaboration becomes increasingly challenging as no one has full visibility into the entire codebase. Meetings start consuming more time than actual data work. Ensuring quality becomes more difficult across an expanding surface area, resulting in a rise in user-reported errors. SLA performance also declines as job failures become more frequent, with retrospectives offering little to reverse the trend.
The root cause of the problem is that data teams are not well-equipped to take a strategic approach to building reliable data. This leads to tests being placed sporadically, with little consideration for the use case of the data, and unclear ownership, creating ‘broken windows’—tests that remain failing for too long.
As a result, data teams become overwhelmed by alerts, stakeholders are often the first to uncover issues, and confidence in the value of testing begins to erode.
We recommend you think systemically about the reliability of your data stack through a 5-step framework – from defining data use cases as products and setting ownership & severity to deploying strategic tests & monitors and establishing quality metrics. We call it the Data Product Reliability Workflow.
In this guide, we’ll walk you through actionable steps you can take to adopt the Data Product Reliability Workflow at your company, including specific examples, tips & tricks, and pitfalls.
Reliable data isn’t just an operational requirement, it’s a competitive advantage. By prioritizing trust and quality, your team will be trusted to own new critical processes and move faster.
Happy reading!
Petr & Mikkel
If you have questions about specific chapters or want to book a free consulting session with the authors of this guide, Petr Janda or Mikkel Dengsøe, find time here.
A data product is a group of data assets structured based on their use case. If you define your data products well, everything follows. Ownership becomes clear, it guides your testing and monitoring strategy, and you can manage and maintain SLAs in a way that’s consistent with data consumers’ expectations.
The building blocks of a data product – ownership, prioritization, code & logic, tests, documentation, and metrics.
A data product rarely works in isolation but most often relies on input, either other data products or data directly from operational systems. Therefore, expectations for the data product should be set depending on its entire lineage. In this chapter, we’ll look at everything you need to know to get started with data products – from identifying and defining them to determining their priority and setting SLAs.
The data team at Aiven, the open-source AI & data platform, struggled with untangling dependencies across their data stack of more than 900 dbt models. Circular dependencies made it difficult to make sense of the lineage, which slowed down root cause analysis and made system-design decisions harder. It was particularly difficult for their core data products with hundreds of dependencies, such as the ARR calculation, one of the most important metrics the data team provides.
By encapsulating data products across data-producing and data-consuming domains, they could clearly understand the lineage, and instead of having to understand hundreds of upstream tables, they could visualize and reason about the lineage through the lens of just a handful of data products.
With this at hand, they can trace back an issue to a faulty upstream data product and easily identify its owner for escalation, or see which data products are impacted downstream from an upstream failure–all without everyone needing to understand all the internals of each data product.
Another benefit is that system design issues stand out, so for example, if a circular dependency is introduced, it’s much easier to understand it across a few dozen data products instead of a spaghetti lineage of hundreds of data models.
If you can identify the most critical business processes your data team supports, those are most often the data products you should define. If you’re unsure which ones they are, look for signals such as what was impacted in your most recent critical data failure, or which data assets have the highest usage across your company.
Here are some examples of what can make up a data product:
At SYNQ, we think about our data products as either producer or consumer products. Producer data products read from operational systems (e.g., API or Salesforce data) and are owned by data or platform engineers. Encapsulating them in data products provides an easy-to-understand gateway for downstream teams to escalate issues upstream without grasping the full complexity of each data product’s internals. Consumer data products directly expose data to consumers and read from other data products or assets (e.g., the CLTV or Marketing Attribution Model); they are often owned by data analysts or scientists. Encapsulating these in data products gives everyone a good understanding of the direct impact of issues and which stakeholders should be notified.
The number of data products is highly dependent on your company’s size and complexity. Some companies have dozens or hundreds of data products while others have just a handful. Start small, by identifying a handful of your most important data products, and then build out from there.
The data team at Aiven started with high-level products such as Sales and Marketing but realized they needed to go a step deeper to have the most impact. “If the Marketing data product has an issue, that may be fine. However, if the Attribution data product within Marketing has an issue, we must immediately jump on it. This is the level of detail our data products need to be able to capture.” - Stijn, Data Engineering Manager
Once you’ve identified your data products, the next step is to define them. When defining your data products, we recommend following these five steps:
As you scale your data products, you can group them into domains to make them easier to manage. We recommend you keep data products as focused as possible – for example, it’s rarely a good idea to have a data product consist of many hundreds of assets, as it will create a web of interdependencies.
The data team at Aiven groups data products into business groups such as Sales, Finance, and Marketing as well as Core and API for more technical data products. “With this at hand, we can monitor the health across different groups to get a high-level overview, while also zooming into specific data products. This is also helpful for us when mapping priority–for example, our Attribution Marketing product is P1 while our Market Research product is P3.” - Stijn, Data Engineering Manager
The definitions of data products should follow existing workflows. For example, if you already have a process for defining asset ownership and priority, this will also fit data product definitions.
At SYNQ, we take a pragmatic approach. Producer data products are defined based on specific assets such as Postgres tables and dbt sources. Intermediate data products are defined in code through dbt metadata tags and groups. Consumer data products are defined through their folder structure in dbt and Looker which resembles their use cases.
version: 2

data_products:
  - name: marketing_attribution
    description: >
      This data product includes models for tracking marketing attribution
      across various channels. It powers the Marketing Attribution Dashboard
      and is critical for assessing campaign performance and optimizing
      marketing spend.
    owner:
      name: marketing_data_team
      slack_channel: "#marketing-data-team"
    priority: P1
    assets:
      - name: dim_marketing_campaigns
      - name: fct_channel_performance
      - name: fct_attribution_model
A data product consisting of key marketing assets with priority, owner, and Slack channel defined in a dbt yml file
While you’ll want visibility into the data product’s dependencies as far upstream as possible, we don’t recommend managing this manually. It’s not uncommon to have hundreds of dependencies upstream of a data product, and new dependencies can be added without the data product owner being aware of it. Instead, you should rely on automated lineage tooling, such as that from a data catalog, dbt, or a data reliability platform.
As you identify your data products, you should carefully consider their priority. This serves as a guiding principle for key data product workflows: how soon issues should be addressed, whether they require an on-call schedule, and what a reasonable SLA is.
Determining the priority of your data products is significantly easier if you’ve defined your data products at the right level of granularity. For example, if you’ve defined an entire product as Marketing, many people will have a different take on its priority. But if you’ve split Marketing into Market Research, Attribution Reporting, and CLTV calculations, some will naturally be more important than others.
At SYNQ, we use the following priority levels:
Defining severity based on the use case. P1 is critical. P4 is exploratory
The priority of your data products must be closely linked to expectations for how to deal with data incidents. We’ll look more into how to define the severity of data issues in the chapter Ownership & Alerting.
One pitfall when defining data product priority – especially in larger data teams with many stakeholders – is that they’ll have widely different opinions on what’s important. You should establish a set of boundaries for how these are defined. After all, not everything can be P1. We recommend working with senior stakeholders to agree on what the priority is and get their buy-in.
The Danish fintech Lunar established a data governance framework with exec buy-in to define critical data elements across the company. “Every three months, we meet with the chief risk officer, chief technology officer, and bank CEO to update them on the latest developments, risks, and opportunities. This helps everyone contribute to and have a stake in our data quality“ - Bjørn, Data Manager
An SLA (Service Level Agreement) is a contract that defines the expected service level between a provider and consumer, including remedies if expectations aren’t met. For example, a P1 SLA for a critical customer-facing dashboard might guarantee 99.9% uptime. If the dashboard is stale or has other data quality issues, the response time is 30 minutes with a two-hour resolution. In contrast, a less critical P3 dashboard may have 95% uptime and slower response times, like 48 hours for resolution.
We don’t recommend measuring the entire system with metrics such as model test coverage, as these can give a false sense of security and lead your team to optimize toward the wrong goals.
Measuring data product SLAs directly ties to the experience your data consumers have and is therefore a much better metric for the quality of your team’s output.
We recommend that you consider two metrics to get the most complete picture.
Only by looking at both metrics can you confidently say whether the data product SLA is meeting the expectations you’ve set. Both metrics should be monitored across the lineage of the data product to include the status of any upstream dependencies.
The data team at Lunar reports on data quality KPIs of their most important data products to the C-level every 3 months. “As a regulated company, we need to be able to demonstrate to regulators that we have sufficient data controls and an ability to trace these across all dependencies. This also gives us something demonstrable we can show to our regulators.” - Bjørn, Data Manager
Coverage–define the data control expectations (coverage) for your data products based on their priority. This helps align everyone around expectations and the level of monitoring that’s sufficient. For example, for P1 data products you may agree on expectations such as not_null and uniqueness tests. You can read more in the chapter Testing & Monitoring for our recommendations on how to monitor your data products.
Quality score/SLA–the best way to measure the SLA of a data product is to look at all data controls through the lens of different KPIs, also known as Service Level Indicators (SLIs). Grouping your SLIs into meaningful areas helps isolate problematic areas and fits well with the types of tests you can perform in tools like dbt:
(1) Accuracy—does the data reflect real-world facts
(2) Completeness—is all required data present and available
(3) Consistency—is the data uniform across systems or sources
(4) Uniqueness—are there no duplicate records in the dataset
(5) Timeliness—is the data updated and fresh
(6) Validity—does the data conform to the required formats and business rules
To measure the SLA of data products, you can sum up the performance of the SLIs. The SLA is then calculated as the share of passing SLIs: 1 - sum(errored SLIs) / sum(SLIs).
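As a minimal sketch of how this can be computed, assuming a hypothetical check_results table with one row per executed data control and data_product, executed_at, and status columns (ClickHouse syntax, as used elsewhere in this guide):

-- Hypothetical: one row per executed data control (test, monitor, etc.)
-- Returns the monthly share of passing controls per data product.
select
    data_product,
    toStartOfMonth(executed_at) as month,
    countIf(status = 'passed') / count(*) as sla
from check_results
group by data_product, month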
Some data products may have different sensitivity to different SLIs. For example, for an ML model training set, accurate and complete data may be more important than timely data. In these cases, you should define separate Service Level Objectives (SLOs) for each SLI.
A benefit of the approach above is that once you’ve set up adequate data controls, monitoring and measurement are fully automated and objective. We recommend that you also track the number of declared incidents by priority, as these give you a more subjective view into when a data product was impacted by an issue that was escalated to an incident.
See the chapter Continuous Improvement for more information about how to select your metrics and obtain the relevant data.
Internally at SYNQ, we rely on automated grouping for all our SLIs. This means that new data controls are automatically grouped and counted towards the data product SLA and coverage as they’re added.
The number of failures you can tolerate in a given month at different SLA levels, given 1,000 data checks on and upstream of the data product
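To make this concrete, applying the formula above to 1,000 monthly checks: a 99.9% SLA tolerates 1 failing check, 99% tolerates 10, 95% tolerates 50, and 90% tolerates 100.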
For Shalion – the eCommerce data solution platform – data is their product. If they provide unreliable insights to end-users through their customer-facing dashboards, this directly impacts customer trust and retention. Therefore, the data team’s next step is to contractually commit to SLAs around the accuracy and correctness of their data products and thoroughly measure the time it takes to detect and notify impacted stakeholders of issues.
In the ideal world, you would be able to guarantee 99.9% SLA of all your P1 data products. In reality, this may be a steep goal to reach. In some cases, the SLA may be something you commit to contractually such as the uptime of a customer-facing dashboard. But in most cases, it will serve as a guideline, and help align data consumers and producers around a common goal.
You should align your incident management processes closely to the priority and SLA of your data products. A 24-hour response time on a P2 data product with a 90% SLA may be OK. But you may need to jump on issues on P1 data products much faster than that.
We’ll look more into this in the chapter Ownership & Alerting.
Before we talk about building a robust testing strategy, it is worth discussing some common approaches to testing and monitoring that are widely adopted across the industry but not optimal.
Data platforms are composed of data assets — tables, metrics, dashboards and more — created by transformation tools like dbt models, Spark jobs or ingest pipelines.
A simple mental model I frequently use is to see data platforms as a network of data assets, interconnected by transformations that build them on top of each other. Data flows through the network from source systems through layers of transformations into its final use case, whether a dashboard, data export, or machine learning model.
To ensure these complex chains of transformations work, data practitioners are increasingly applying various testing techniques such as data tests, anomaly monitors, data contracts, or unit tests. Built-in tests in transformation frameworks – such as unique, not_null, accepted_values, or relationships tests – are especially popular.
To encourage the adoption of testing, many teams set test coverage targets and generally encourage teams to write tests for every model. But while the intention might be good, this ‘test every model’ approach might not yield the best results.
The model-centric approach does not consider the wider context, which almost always leads to redundancies. Tests are applied in a sequence, following the DAG of transformations, often without any chance for this data to go wrong as it moves between the models. This doesn’t just lead to unnecessary computing costs. More severely, it creates a false sense of safety.
We’ve built a model called stg_latest_issues that pulls data from a stg_issues model. It has relatively simple code:

select
    id,
    workspace,
    ...
from
    {{ ref('stg_issues') }}
order by
    ingested_at desc
limit
    1 by id
The upstream table contains one record for every modification of a given issue, and the model pulls the latest record for each issue by id, which is the system’s unique identifier for the problem.
One possible issue with such data is that an id will be empty, suggesting that we are working with a corrupted issue record. That is also why the upstream table stg_issues verifies that such a situation can’t happen with the following test:
select * from {{ ref('stg_issues') }} where id = ''
If this test returns any records, it will fail.
But what happens when we pull data into the downstream model stg_latest_issues? Do we test for an empty id again? How could this id possibly become empty? The short answer is that it can’t.
The above example is a very simplified case of a problem that can lead to significant redundancies in testing.
Mechanically re-applied tests without awareness of the broader data pipeline do not add more safety to our data.
As data systems evolve, so does the data quality tooling. Today, it’s common to see teams using a combination of tools for data quality: dbt for testing, a separate platform to execute anomaly monitoring, and another system to monitor the overall ETL pipelines that coordinate their execution.
Teams involved in analytics across Data Engineering, Analytics Engineering, Data Science, and Data Analytics also have different needs. While data engineers are predominantly focused on executing pipelines and want to ensure that all necessary jobs run correctly, analytics engineers and data analysts are more concerned with the content of the data itself, deploying data tests and anomaly monitoring. On top of this, governance teams have their own expectations: they want to understand the strategic evolution of data quality, for which they frequently profile the data and measure its quality along several governance dimensions.
It leads to fragmentation.
A practical example is a setup with a dbt project with hundreds of tables complemented by a sub-optimally configured anomaly monitoring tool. Dbt executes in hourly jobs, so the entire DAG refreshes once per hour under normal system operations.
On the side, the team has deployed an anomaly monitoring system, and one of its rules is monitoring for freshness, e.g., detecting if the data gets abnormally delayed. These types of monitors are now table stakes in many monitoring tools, and therefore, they will quickly learn a correct pattern of data refreshing every hour.
One day, this pipeline breaks at 2 am. One dbt model didn’t execute correctly due to a timeout issue with the underlying data platform. It causes the entire DAG to fail, with one failed model and hundreds of others skipped.
Thanks to a well-organized dbt alert, the right team gets notified about the failure with the proper context of what failed and the fact that the entire downstream pipeline is affected.
But what about the freshness monitors? Every monitor for every skipped model reports an issue just an hour later. The team is hit with unactionable alerts, which add little value to their issue resolution: the dbt job failure has already provided all the context about which models were skipped.
A more cohesive and strategic approach to testing could have avoided the unpleasant experience of hundreds of unnecessary alerts.
For example:
If we correctly monitor our dbt jobs, why do we want to deploy any freshness anomaly monitors on any tables created by models in these dbt pipelines?
There should be no scenario in which we wouldn’t know about the failure—deterministically—by monitoring the jobs. In the above scenario, an anomaly monitor that behaves probabilistically introduces a potential risk of false alerts, and at best it will learn to predict delays in our scheduler, which we already know and control.
Data ecosystems are becoming increasingly complicated, and testing approaches recommending testing or anomaly monitoring every table lack context of criticality of use cases.
One standard piece of industry advice is to monitor a wide range of tables (or all of them) for table-level anomaly patterns. One possible reason for this advice is that such monitors are relatively easy to execute at scale. But such a setup is often noisy. Every business has natural anomalies in its data. Data anomalies also tend to cascade across data stacks—an anomaly detected in one table can trigger dozens of other tables to report abnormal data patterns too, without much additional value.
The same applies to testing. If we attempt to test everything, we will likely receive a proportionally higher number of alerts, but we might lack an incentive or urgency to fix them.
Therefore, the desire to have a wholly tested data stack, whether with anomaly monitoring or data testing, is academic. It sounds like a good thing to do in theory.
Today’s data systems practically contain a mixture of use cases, and data flows with different risks and expectations of reliability. The testing and monitoring approach has to take it into account.
Today’s prevalent approach to testing in the industry is focused on models, which is why we see advice such as ‘every model should be tested.’ We also see teams setting targets, such as test coverage, to incentivize teams to make measurable progress towards such goals. While such advice takes some inspiration from software engineering, it lacks nuance. Most of the software engineering code in systems we take inspiration from is in production. If such systems fail, they cause an immediate and direct impact on customers, so their reliability is paramount. But that is not the only type of code software engineers write.
In some cases, engineers write one-off scripts to execute a finite task. Such code will often not become part of the production system and, in some ways, becomes obsolete once the task is done. However, for posterity, scripts are frequently checked in a source code repository and live alongside the production codebase. Still, engineers typically apply much more rigorous testing on code that will have to run in production than ad-hoc code.
This distinction is even more critical in analytics data platforms.
Today’s analytics systems don’t clearly distinguish between production and scripting. In software engineering, the software is typically composed of reusable modules and libraries that can be freely reused in scripts, often executed on an engineer’s local computer. Data teams have to deal with many more constraints. Their systems run on cloud data platforms and must process vast amounts of data to perform experiments. They often have no option but to include their work in the rest of the data in their production data platforms.
This forces data teams to experiment in a way that is much more integrated with their other, often critical data. In other words, the experimental code is mixed in a DAG of models created on each other.
You are an analytics engineer at a mid-sized organization with 500 dbt models. Since your organization has invested in data for a while, many of these models feed into various dashboards and other data applications. Some of these data outputs are very important to your company. One of them is a reverse ETL that feeds data back into Salesforce, which helps your commercial team colleagues prioritize which accounts they reach out to tomorrow. Another essential use case is in marketing. Your demand gen team uses customer CLTV data to feed look-alike models in advertising platforms. These use cases are critical.
You’ve been asked to develop a new version of customer scoring. The analysts created a solid hypothesis on how the new scoring can work and validated with stakeholders that new metrics are indeed meaningful and would be valuable to the commercial teams, but it needs some work; they only wrangled data in dozens of spreadsheets, but it’s fragile and can’t be used as is. That is why you come in to make it more robust. You will need to bring several new data sources into the data stack and write a series of models that process them. This will take at least a week and collaboration with data engineers.
In such a scenario, it’s natural and expected that all this code starts incrementally landing into the dbt project, layer by layer. From this point, it may take weeks before this new pipeline matures enough to replace the current production system.
One day, you get a ping from your colleague on the weekly rota to triage incoming data errors: some of your models are timing out. Another day, one of the data sources doesn’t refresh. The first problem happened when excessive data arrived in the warehouse; in the second, the pipeline skipped one crucial task and the retry didn’t work. You knew this could happen, and because it’s still a new, non-production pipeline, it’s not a big deal; it’s a work in progress. All this will be solved before the system gets to production, and the team is well aware.
And this is just your project. A dozen others are being developed in parallel.
It’s essential to understand this dynamic.
It’s not uncommon for data platforms to feed into dozens or hundreds of final use cases that need to be operated with vastly different reliability guarantees.
In some cases, the specific use case is a work in progress. In others, it’s an experimental dashboard one of the product analysts put together for their team. In another case, it’s a daily pulse one of the teams in operations sets as a reminder of something nice to know. In contrast, you might have use cases such as an automated marketing pipeline deciding how to allocate hundreds of thousands of dollars monthly. All these use cases are intertwined.
Combine all of the above and we have a solid recipe for alert overload. Many teams are drowning in a large volume of hard-to-action, unclear, or unimportant alerts.
This nicely highlights the challenge: we need a better, more strategic approach. The best approach to testing has to balance several key, and to a degree contradictory, forces:
Testing software well is an art and skill that software engineers spend years mastering. Data teams are no different. However, the benefits are clear: well-done testing means more reliable systems and more robust data.
There are several vital considerations to make.
The above example can inspire an alternative way of looking at our data platform. Instead of seeing it as a network of interconnected models with dozens of outputs, we can start from the end use cases and data products and expand them into pipelines.
A data product pipeline is a directed acyclic graph of its upstream dependencies, including all models, tables, and pipelines that move data from its source into the product.
Consequently, the pipeline can be seen as an end-to-end horizontal slice through the entire data platform, traversing all system layers. This approach brings several benefits, especially if we’ve done diligent work on the definition of data products and their severity:
Another way to look at the pipeline-centric approach is that, besides various technical and workflow benefits, identifying the pipelines, especially the ones feeding into critical data use cases, gives us a lot of focus. As a result, we can evolve our approach to testing, focusing our energy and effort on pipelines that feed highly critical products.
—
Despite the above benefits, the data pipeline-centric approach is rare in the analytics community. At SYNQ, we believe this is mainly because we don’t have the proper tools to work that way. To establish this new approach, we need to solve the following:
At SYNQ, we operate a data platform with hundreds of tables and models and face similar challenges as most data teams: our data platform, ClickHouse, feeds a broad mixture of data use cases, from P1 user-facing systems to exploratory analysis of feature adoption in our product (P4).
Our latest pipeline was for user-facing analytics that we expose to customers to help them understand quality metrics such as the percentage of failed tests, time to resolution, and the number of issues they can analyze by data products or teams.
As a consequence of being customer-facing, this pipeline is P1, i.e., of the highest importance, and we treat it as any other production system powering our user-facing product.
The end product is a React.js application that interactively exposes data to our customers. Under the hood, it queries multiple ClickHouse parameterised views that act as an interface to the frontend system. These views leverage data from 4 different source systems (SYNQ micro-services) to calculate the final metrics.
To create clarity, we have defined the final views as a data product, anchoring this use case in our observability approach.
With this definition in place, we’ve used lineage to identify the pipeline.
Understanding our pipeline allows us to reason about our testing in a particular way—ensuring that data arriving in the final views is robust and reliable, with little room for error.
With an approach to testing anchored around data pipelines, we’ve created a lot more clarity and focus on specific models that we want to test to an elevated degree since they feed the P1 product. This is beneficial for the particular pipeline, but without additional concepts, it could be dangerous in the long term.
As we discussed earlier, it’s widespread and reasonable to expect data product pipelines to be interconnected. However, developing a unique and specific strategy for every pipeline could also be complex and infeasible.
This is why we think about testing in layers.
Data platform layers are a standard mechanism for tackling their complexity. We often hear about raw, staging, mart layers, or medallion architectures with bronze/silver/gold layers.
For data architecture and modeling, the data platform layers help us decide what kind of transformations to apply where. Testing is the same; we want a testing strategy outlining what type of testing to use at every data platform layer.
We do so with several objectives:
The golden rule to strive for (the same as we do when testing in software engineering) is as follows:
For any potential failure mode of our data, we want the minimum, ideally one, test or monitor to fail.
Let’s say we have five customer records with a NULL customer_id. This is unexpected. But if more than one test fails because of it, it’s a code smell: it contributes to clutter in alerting, adds more code to maintain, and reduces clarity. Exactly one test should have failed.
Testing layers should be tightly coupled with architecture. Put another way: design your testing layers to fit your architecture design.
At SYNQ, on the highest level, we slice our data platform into the following layers:
The above model is generic and should apply well to most of today’s typical data platforms.
We’ve finally set enough foundations to move on to the meaty part of the testing and monitoring strategy design: Deciding what types of tests to apply and where.
We will work with the following fundamental principles:
Let’s start with the definition.
The source layer of a data platform is a collection of data assets ingested from third-party systems like CRM, ERP, company products, or other non-data teams like engineers and business operations within or outside the company.
In other words, it’s the layer of data that is under the control of the data team for the first time.
In many organizations, this is directly linked to the concept of dbt sources, as we frequently load data into data platforms from ingest pipelines or cold storage systems that are not primarily meant for analysis, and start the modeling activities inside of dbt.
While this definition is practical and applicable to most data teams, some teams have additional pre-processing steps before dbt, and therefore, their data sources are more upstream.
A practical guideline to decide what our sources are would be as follows:
Find a layer of data as close as possible to the source, where you can control the data and add monitoring and tests without dependency on other teams.
dbt sources or the first layers of models in dbt perfectly fit this definition. But so would an upstream platform with well-established testing tooling such as great_expectations.
The above definition of data sources as the upstream layer in the data team’s control has a reason. A source layer of data that fits this definition essentially becomes an interface.
It is an interface with many exciting properties, which are very helpful for testing:
Sources are input we must work with throughout the rest of the pipeline. They are the most upstream assets within data analytics modeling, and so failure to detect errors at this layer often has broad and severe consequences.
Testing sources is a high-leverage activity. We are verifying the quality of data that feeds into every other model in our system, which has the most significant number of downstream dependencies and attributable usage.
This definition is also a reason for the following advice, which we apply internally at SYNQ and also give to our customers:
When in doubt about where to start with testing, start with sources.
We will combine this advice with our shifted focus on testing pipelines to create a practical prioritization framework for quality improvements.
Previously, we’ve called out a common technique: put unique and not_null tests on the primary key and consider the source (or model) well-tested. Such a strategy promotes a setup where most data flows through the system untested, creating a false sense of security.
This is particularly important at the source.
One critical decision we will make is what kind of tests we want to apply at the source. To some degree, this is generic, but you should always consider your context critically, especially from the risk perspective: What is the risk of not detecting an issue at the source?
We follow with a list of typical data issues we want to prevent at the source, outline what types of tests are most applicable, and provide a rationale for why and how we should set them up.
Data comes from the business in many shapes and forms. Given that the business tends to evolve, and so does the data structure, it is essential to verify that the critical expectations we rely on are met.
When reasoning about how deep to go, especially when the criticality of the use cases for the data being brought to the data platform is not fully known, it’s worth being defensive and testing sources well so they can support use cases with the highest criticality.
Another way to assess the depth of testing sources is to look at the data source from the business perspective:
How central is this data to our business?
I’ll give a few examples that bring this advice to life:
These sources will likely be central to the respective data platforms. Of course, this is a small example to illustrate the principle. Think about how key every source is to your business domain, which will likely correlate with the criticality of use cases built on top of it.
With that in mind, let’s dive into the testing itself. For well-tested sources, we, in general, want to have the following in place:
Does this sound like a lot? It’s indeed more than ‘one test per model,’ but this is what it takes to test sources well. I’ve previously mentioned that mechanical testing is problematic because we are not intentionally designing test suites. But I’d like to add some nuance at this point.
By defining a testing strategy for sources like the one above, we have applied strategic nuance—we’ve intentionally decided how we will test sources, deciding what tests we use (and what tests don’t). But with a chosen strategy, we can roll out the above setup at scale, to a degree with automation, making the daunting task of writing all these tests much more approachable.
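As an illustration, here is a minimal sketch of what such a source-level setup can look like in dbt; the source, table, and column names are hypothetical, and the exact tests should follow the strategy you’ve chosen:

# models/staging/sources.yml (illustrative)
version: 2

sources:
  - name: incidents
    schema: raw_incidents
    loaded_at_field: ingested_at
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
    tables:
      - name: issues
        columns:
          - name: id
            tests:
              - not_null
              - unique
          - name: workspace
            tests:
              - not_null
              - relationships:
                  to: source('incidents', 'workspaces')
                  field: id

For sources that are central to your business, you can complement tests like these with anomaly monitors from a data reliability platform.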
At this point, it’s worth reflecting on the relationship of source testing with another concept: data contracts.
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and its consumers. It is like an API but for data. —https://datacontract.com/
This definition is essential as it clarifies that well-tested sources and data contract definitions are complementary techniques. Data contracts still rely heavily on source testing. This is because data contract specifications and tooling focus on an explicit definition of data contracts, including the definition of quality—which might include a definition of data testing—but data contracts themselves are a description. They don’t enforce the quality itself.
With that in mind, teams that already use data contracts should integrate testing data sources and implementing data contracts into a single workstream. A data contract can act as a prescription for what should be tested while tooling such as dbt or SYNQ can act as an execution environment where these tests get provisioned, executed, and tracked.
It’s precisely because we are applying a pipeline-centric approach to testing that the volume of tests we add going forward will start to decrease. In other words, we’ve done a lot of testing at the source, which we don’t have to repeat.
With a well-established testing strategy for sources, transformations can have much lighter testing suites, where we focus only on what has changed.
This fundamentally differs from model-centric testing, where we look at each model in isolation. Instead, going forward, we consider the tests we’ve already established upstream (at source) and fill the gaps.
Several key types of transformations are essential to test because they tend to be prone to errors:
Given each model in our ecosystem could be a different combination of the above, it could be helpful to define testing techniques for each type of concept, which can then be combined into a final testing suite for the given model.
Cleaning and normalization are often very lightweight data transformations; therefore, testing them could be equally lightweight.
Cleaning and normalization code is also very frequently a source of redundancy. That’s because lightweight transformations are very unlikely (or in some cases impossible) to be the source of data errors.
Think about simple changes such as normalizing column names:
select
    id as issue_id,
    workspace as tenant,
    toDateTime(created_at) as started_date
    ...
from
    {{ source('incidents', 'issues') }}
We could retest issue_id and tenant or started_date, but such tests would be redundant to the testing we have already done at the source. Examples are tests we’ve already written to verify that the upstream id is not null and unique, that the workspace is always present and has a correct reference to the workspaces table, and that created_at exists; all of these tests significantly contribute to the reliability of this downstream model, too. For example, the upstream id uniqueness test guarantees that issue_id will be unique as well.
Eliminating such redundant tests has several key benefits:
At the start of 2024, we engaged with a customer looking for ways to save costs on their data platform. They have done an excellent job with models, switching many to incremental materialization or optimizing their underlying SQL structure, but they forgot about testing.
After a quick analysis, we identified a sequence of 3 tables in their DAG, each testing for the uniqueness of the primary key (id) on a large table. The test was verifying the uniqueness of more than a billion records every hour, and to make matters worse, the tests were not just slow on their own but also redundant; just one test would have created enough safety. By removing redundant downstream tests and changing their frequency, we reduced data warehouse computing cost by 10% in a single commit, saving tens of thousands of dollars per year.
Newly derived columns could be looked at through a simple conceptual lens.
A model that creates derived columns is their source; therefore, we can apply a playbook from testing data at the source to test derived columns.
In other words, derived columns are worth testing deeply, in proportion to the complexity of the logic applied to create them. We use various typical data testing methods, such as uniqueness, not-null, and accepted-values tests. We should always keep the logic in mind and consider where it can break.
For example, consider the following logic:
CASE
    WHEN … THEN 'active'
    WHEN … THEN 'pending'
    ELSE 'inactive'
END AS status
Testing such a column for not_null values in many scenarios doesn’t make sense, as we implicitly ensure non-null values through the ELSE clause.
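What often does make sense is asserting the closed set of values the CASE statement can produce. A minimal sketch using dbt’s built-in accepted_values test; the model and column names are illustrative:

models:
  - name: stg_issues_enriched
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'pending', 'inactive']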
Derived columns often contain much more complex logic, which could combine data from many columns that might come on top of join across multiple tables. In such scenarios, bringing a technique we haven’t discussed might be beneficial: unit testing.
Unit testing differs from data testing as it creates a complete testing cycle, from seeding the correct data to executing the logic and asserting the output.
This makes it a great tool of choice when testing multiple scenarios where we seed/test/assert different setups. Doing such testing on the transformations that power derived columns can be an efficient and not-so-complex way to introduce unit tests to your quality strategy.
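As a sketch, dbt 1.8+ ships with native unit tests that follow exactly this seed/execute/assert cycle. The model and column names below are hypothetical; assume stg_issues_enriched selects id and derives status from the state column of stg_issues:

unit_tests:
  - name: unit_status_derivation
    model: stg_issues_enriched
    given:
      # seed: input rows for the model's only upstream dependency
      - input: ref('stg_issues')
        rows:
          - {id: 1, state: "open"}
          - {id: 2, state: "unknown"}
    expect:
      # assert: the derived status for each seeded row
      rows:
        - {id: 1, status: "active"}
        - {id: 2, status: "inactive"}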
But keep one crucial thing in mind.
The maintenance cost of unit tests is higher than the maintenance cost of data testing.
When we decide to change the SQL logic, we must update unit test expectations for all scenarios. It’s a natural trade-off: the more diligent reliability guarantees we get, the more work we must do to maintain the tests.
It’s essential to balance robust test coverage and ease of maintenance. Every unit test should have its purpose and be typically introduced in scenarios when more straightforward data tests are insufficient.
Joins are notoriously error-prone, typically due to unexpected duplicates in any of the joining variables.
The scenario is simple: we join two tables on a variable that is — unexpectedly — not unique. As a result, we duplicate rows. This problem is called ‘fanout’.
It’s why testing models with joins is particularly important, especially with the right types of tests that specifically aim to detect the ‘fanout’ problem.
At SYNQ we frequently use one of the following strategies.
In cases where we’re joining data without aggregating or filters, we can apply the row count check.
Consider the following scenario:
select
    issues.id as issue_id,
    incidents.id as incident_id,
    ...
from
    issues
    left join incidents on issues.incident_id = incidents.id
The issues table is the base table of the join. In our domain, one issue can be part of no incident or exactly one incident. This means that under no circumstances should we join multiple incidents to a single issue. But it would be naive to assume that this important business assumption is always met. This is why we need a test.
Given we’ve constructed our join mainly to enrich issues with additional data from incidents, without additional logic, we can write a simple check: the number of rows in our new table should match the number of records in the issues table. In other words, we verify that no additional rows were created.
One of the benefits of the row count test is that we can execute it as a comparison of select count(*) from issues and select count(*) from issues_with_incidents. Executing such a query is typically cheap, as the data platform can use underlying metadata, potentially without scanning any data from the underlying storage.
An alternative approach is to verify the cardinality of the columns involved in the join. As we discussed, the assumption is that an issue can only be part of one incident. This can be directly verified by one of the tests in dbt_utils:
# models/issues.yml
version: 2

models:
  - name: issues
    columns:
      - name: incident_id
        tests:
          - dbt_utils.cardinality_equality:
              field: id
              to: ref('incidents')
This test will fail if the number of distinct values in incident_id is different from the number of distinct values of id in the incidents table.
The above test has the potential to be an even more accurate way to prevent the fanout problem, but it has a caveat. Compared with the row count test, its SQL logic is a lot more complex. This might not be a problem for small models, but it can become a challenge if you’re dealing with a large number of rows.
Aggregations are another essential building block of data products. Whether it’s a simple count of rows or a complex aggregation, aggregations typically combine data from multiple rows into more meaningful measures.
Data models with aggregation logic have a number of interesting testing challenges:
This means that correctly verifying aggregations can be a complex task, which we can approach with a number of techniques.
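One practical technique is a reconciliation check: a singular test that compares an aggregated measure back to the underlying rows it was built from. A minimal sketch, with hypothetical model and column names:

-- Fails if the aggregated daily revenue drifts from the sum of the
-- underlying transactions.
with aggregated as (
    select sum(daily_revenue) as total from {{ ref('fct_daily_revenue') }}
),
underlying as (
    select sum(amount) as total from {{ ref('stg_transactions') }}
)
select
    aggregated.total as aggregated_total,
    underlying.total as underlying_total
from aggregated, underlying
where aggregated.total != underlying.total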
The final layer is frequently called the data mart layer, which describes a set of datasets purposely assembled for a given set of use cases. The exact structure varies from company to company. In some cases, marts align around functions such as marketing, sales, and product. In other cases, marts align around the purpose of the data, like fraud analytics or customer health.
Whatever organisation you choose, the marts should have one thing in common — they should contain data products that are ready to be consumed by processes in the business. This also means we must ensure that they work reliably, with testing.
To test data products, we can shift the approach again. Since we’ve done the work of methodically working through the layers of the system and tested reconciliation with operations, sources, and transformations well, we no longer need to test the basics.
Instead of focusing on technical aspects of data, when testing data products, we verify the logic encoded in SQL transformations.
Testing software well is always a challenge. This is no different in data and other software systems. This is also why over time the software industry developed a whole range of testing techniques — unit testing, integration testing, contract testing, behavioral testing, end-to-end testing, smoke testing etc.
In software, unit testing is a very valuable tool that complements the development process. We build functionality with unit tests in small increments and continue enhancing our tests as we build our software. The flow goes like this:
If you follow this process rigorously, it’s called TDD (test-driven development). I personally don’t think it’s worth following religiously. Sometimes it’s better to code first, sometimes it’s better to start with a test. The important bit is that we consider writing tests an integral part of building software.
But unit testing has its limits.
Most software consists of a large number of units (functions / modules / services). It has many inputs and outputs that interact in non-trivial ways. Testing all units in isolation is only possible if we mock the inputs of each tested unit. This is often impractical and potentially very fragile: the mocked interfaces that drive our tests tend to change, and a mock that mimics a changed interface is a faulty interface. Our tests might continue to confirm that our software units work as expected, but the system as a whole no longer works correctly. Once the units are deployed to production together, the system fails at the interface.
This is also why software engineers use a whole range of other testing techniques beyond unit tests. One such technique is integration testing.
In integration tests, we don’t focus on granular units of software but, as the name hints, on testing the units together, integrated. The integration test is therefore intended to verify large pieces of software together. In some cases that could be large modules with many units; in other cases we can integration-test the functionality of the entire system altogether.
Integration tests, of course, also have their downsides. They are often slower and more complex to set up, as they test larger system components, which might involve setting up databases and other heavy software components.
The key is to combine unit and integration tests well. They both have their function in the testing strategy and work the best together.
While integration testing is not really formalised or widely adopted in data platforms and SQL systems, we can take inspiration from this approach and apply it in our data product tests.
Since final data products are directly linked to a specific use case, we can verify the accuracy of our data system by mimicking queries specific to the use case.
In other words, we run real-world queries and assert that the response we get is correct. This type of testing is close to integration testing: we are not mocking any interfaces; we are querying data from the final layer of transformation, which will only be correct if all upstream models function correctly too.
Beyond generic tests, dbt ships with so-called singular tests. A singular test is simply a SQL query that should isolate faulty data.
In the context of our incident management system, which produces information about the mean time to resolution of issues, we could write an integration test that verifies that under no circumstances can our system return a negative time to resolution for any issue.
Conceptually it could be as simple as:
select * from incidents where time_to_resolution < 0
Another key metric we produce in our final data product is the percentage of failed tests. If, for any reason, our logic counted more failing tests than tests that actually ran (the data is joined from multiple sources), we would want to know about it.
select
    workspace,
    toStartOfDay(created_at) as day,
    pct_test_failure
from incidents
where pct_test_failure < 0 or pct_test_failure > 100
We created this test after identifying a logical issue with some of our calculations, which under some circumstances would return an invalid failure rate. For example, we counted that 10 out of 7 tests had failed, which is clearly not possible.
Tests like the above are very useful for catching logical issues that can appear anywhere in the data system. They test the system from the outside in, in some ways mimicking the queries that are actually run by the business.
As outlined, another good use of these tests is to guard against regressions.
In software engineering, when an issue is reported, engineers often spend considerable time replicating it. This means they can replay the faulty scenario and write a test that will fail. The same can be done with these data product tests. This approach acts both as final confirmation that the issue is fixed and as a regression guard.
Almost all analytical systems provide information about history. Whether it is historical revenue, product usage or customer acquisition, we often analyse data over periods of time.
This attribute of analytics systems can be leveraged for testing.
Suppose our analytics system has measured 10,000 resolved incidents. The resolved status is actually an attribute of the incident that logically combines multiple facts. We could verify our business logic by testing historical consistency: comparing live data from our system, queried for a historical period, with a historical snapshot.
The test would be as simple as:
select
    concat('Expected 10000 resolved incidents but found ', toString(countIf(status = 'resolved'))) as error_message
from
    incidents
where
    year = 2024 and
    workspace = 'acme'
having
    countIf(status = 'resolved') != 10000
If the historical consistency of this data is not correct, we would want to know about it.
Note: Tests for historical consistency need to be used with caution. They are excellent for catching issues that would unintentionally change historical data, but they are also expensive to maintain when calculations actually change. In such cases, it’s difficult to find and verify the new correct asserted value.
In this final section, we’ve covered how to test data products. We’ve taken inspiration from software engineering and focused on end-to-end verification of our data products (and the entire data system). This approach builds on our previous work on testing sources and transformations and complements it with a final verification that our data systems work reliably.
As the data stack grows in complexity, it’s no longer possible for one person to keep everything in their head, and more often than not, the person who notices a problem is not the right person to fix it. Simultaneously, the number of upstream and downstream dependencies has exploded, making it challenging to locate the right upstream owner or notify impacted stakeholders. Well-defined ownership and incident management processes can help with this by clarifying who’s responsible, and how they’re notified and streamlining the process around incident and issue response.
While there’s no one-size-fits-all answer to ownership and incident management, working through the four steps below will set you up well, whether you’re a 5-person data team or a data team in the Fortune 500.
Ownership can be daunting, as it’s both a technical and cultural challenge. If ownership works well, boundaries of responsibility are clear, and ownership is brought into action – both within and outside of the data team. If not, it’s only sporadically defined and rarely actioned.
Whether you’re just getting started with ownership or have existing owner processes in place already, we recommend thinking through these steps.
In the ideal world, you’d neatly group your stack into well-defined areas with clear boundaries. But in reality, ownership lines can get blurry, so don’t be discouraged if you can’t easily assign ownership to all assets. Data rarely starts or ends with the data team. Instead, data is ingested from 1st and 3rd party data sources, loaded and transformed in the data warehouse, and exposed to end-users in a BI tool or use cases such as ML/AI.
This is how we manage ownership at SYNQ: we assign owners to dbt models using the owner tag and dbt groups, and we enforce that ownership tags are set using CI checks. More than 50% of our data assets are already encapsulated into data products, making ownership seamless to define.
The data lineage of an internal data product at SYNQ. Ownership can be mapped and overlaid across all dependencies
The result is that we manage ownership of data across the entire company and not just as something that’s owned by the analytics engineering team.
Use groups that already exist in your company, such as Google Groups or Slack team channels formed around existing teams. These will always be up-to-date as people leave or join without you having to manage groups for data ownership specifically
Below are step-by-step instructions for how to define ownership.
Data products should be your starting point–while you may not want to set ownership for all your data assets, we recommend you at least do it for important data. Having assets with clear ownership makes it more likely that the right people act on issues or are notified if the data they use is wrong. If you’ve defined data products with well-established priority levels, the highest-priority data products are a great place to start.
Ownership definitions will closely follow your data product definitions. For example:
Set ownership at too high a level and you risk that no individuals take responsibility. For example, by defining the owner as “data-team”, you risk that the definition is too broad to act on. Setting ownership on an individual level creates a lot of accountability but little scalability – you run the risk that people move to different teams, go on holiday, or leave the company.
At SYNQ, we assign ownership based on teams and their associated Slack channels. Where possible, we use dbt groups so we only have to keep ownership metadata such as Slack channel updated in one place. As all alerts on input sources go to one channel, we also assign individuals based on sources they own, so that they’re tagged in Slack to bring attention to these issues.
Use this option if you’ve already organized your dbt project or data warehouse schemas to resemble your ownership structure, such as marketing, finance, and operations. With folder-based ownership, you typically need less time to get set up, and as you add new data models, they, by design, fall into an existing owner group, reducing your upkeep. If you work with non-technical stakeholders who don’t contribute to your code base, such as business analysts or data stewards, this approach makes it easier for them.
dbt has built-in support for designating owners using the meta: owner tag. Owners are displayed in dbt Docs, making it easy for everyone to see who’s responsible.
models:
  - name: users
    meta:
      owner: "analytics-engineering"
You can extend this to dbt sources to assign ownership to upstream teams. If issues happen on sources, before any data transformations, it indicates that upstream teams should own the issue.
An added benefit is that this approach lets you use CI checks, such as check-model tags from the pre-commit dbt package, to ensure that each data model has an owner tag assigned.
With dbt 1.5, dbt launched support for groups. Groups are helpful for larger dbt projects where you want to encapsulate parts of the internal logic so they’re only accessible to members of that group – similar to how you’d only expose certain endpoints in a public API to end-users. If a model’s access property is private, only owners within its group can reference it.
models/marts/finance/finance.yml
groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@acme.com
      slack: finance-data
      github: finance-data-team
models/schema.yml
models:
  - name: finance_private_model
    access: private
    config:
      group: finance

  # in a different group!
  - name: marketing_model
    config:
      group: marketing
There are situations where you want to manage ownership across multiple tools – from databases to multiple data warehouses and dashboarding tools. This is useful when you want to find dashboards owned by specific teams or build out capabilities to notify downstream impacted stakeholders when you have a data incident. Managing cross-tool ownership in code can be difficult as there’s often no coherent way to define this. Tools such as a data catalog or data reliability platform are built for this.
Data ownership doesn’t have to stop with the data team. Below, we’ll look at ways you can notify: (1) the data team, (2) upstream teams, and (3) business stakeholders.
One of the top pitfalls we see is when teams spend a lot of time mapping out and defining ownership, but then let it sit stale on a Confluence page that gradually gets out of sync with reality.
Managing ownership within the data team is the most straightforward. Your team is in control, and the tools are typically within the stack you manage. You can use your existing ownership definitions to ensure the right owner knows about the right issue. The two most effective ways to do this, assuming you use a communication tool like Slack, are tagging owners and routing alerts based on your ownership definitions:
As a rule of thumb, you’ll see the most impact once your data team moves past a handful of people, and everybody no longer has full visibility into all data assets.
We typically see two kinds of upstream teams that produce data and need to be alerted differently: technical teams, such as engineering, and non-technical teams, such as a SalesOps team owning Salesforce customer data.
a. Technical teams – the alerts you send don’t need to look different from those in the example from the data team above. If you’ve placed tests at your sources and detected issues before any data transformations, engineers should be able to connect the dots between the source and the error message and trace the issue back to their systems. For larger teams with a clear split between teams that ingest data (e.g., data platform) and teams that produce data (e.g., frontend engineers), it can be helpful to complement the error message with details about what event or service it relates to.
b. Non-technical teams – bringing ownership of source-system quality to non-technical teams is underrated. Too often, tedious input errors such as an incorrect customer amount or a duplicate employee_id end up with the data team to debug, triage, and find the right owner. With the right context, these teams can start owning this without the data team being involved.
Starting to send alerts to an upstream team is one thing; getting them to consistently act on those alerts is another. “Before we started to send alerts to our operations and lending team about faulty customer records, we got our COO to buy into the initiative. Only then were we able to ensure that the team was incentivized to act on and prioritize alerts from the data team” – Rupert, data lead at LendInvest
Sometimes, you can alert your stakeholders to notify them of issues proactively with alerts similar to those you send to the data team. This works best if your stakeholders are data-savvy teams, such as a group of analysts in a business domain.
Sending Slack alerts to your marketing director that five rows are failing a unique test on the orders data model is not the best idea.
Unless the stakeholder is technical, the best way to notify impacted stakeholders is most often for the person with the relevant context to “declare” the issue. In most cases, we recommend that the data owner notify the stakeholder directly and link to an ongoing incident page so they can follow along with the issue resolution. Another way is displaying issues directly in dashboards so end-users are aware of them – but this can be risky and difficult to interpret for non-technical users, so we recommend using it with caution.
At SYNQ, our Technical Account Managers rely on a ‘Usage Report’ data product to see usage across our customers’ workspaces. As our Technical Account Managers are technical stakeholders, alerts on or upstream of this data product are sent directly to the #tech-ops channel. This helps them be the first to know if data is unreliable and avoid making decisions based on incorrect information.
If you spam a Slack channel with alerts, chances are people will stop paying attention to them. Preventing alert overload is addressed throughout the data reliability workflow – from architecture design to monitoring and testing deployment, and alerting & ownership rules. Not all issues have to be sent to the #data-team Slack channel. A better workflow is to be deliberate about what’s sent to the main alerting channel. Issues on less critical data assets can be sent to a different channel, or not sent as alerts at all and instead managed in a weekly backlog review.
With adequate ownership in place, you’re well-positioned to start streamlining responses to issues and adopting incident management for more severe issues.
If done well and combined with well-defined ownership definitions, this has several benefits–(1) Time to resolution is shortened as issues are brought to the right owners with the relevant context. (2) Important issues are prioritized based on established incident management response processes. (3) Time spent resolving data issues is reduced as ownership lines are less blurred. (4) You start building an institutionalized way of adopting learnings from incidents and systematically improving.
There’s no set way to manage issues and incident response, and you should always adapt to your company’s existing ways of working. With that being said, when adopting incident management for rapid response in data teams, we recommend working through these five steps:
Start by defining expectations for what it means for the data team to be on call. For smaller data teams, this may mean the sole data person is on “data responder” duty for the week. For others, it may mean that relevant owners address issues within a predefined SLA as they come up. For teams owning core business processes, this may involve being paged or notified outside business hours, closely tied to the SLA definitions of your data products. We recommend you consider them across these groups:
If the existing ownership activation rules you’ve set up are not working out of hours, you can create a “@slack-responder” group and only have the people on-call in it to avoid tagging the entire data team when there are issues out of hours.
If you’re not specific about expectations for on-call (e.g., we only look at issues within business hours), people will start adopting different expectations which can create an uneven workload across the team.
At SYNQ, our data platform powers core in-product functionality such as our Quality Analytics tab. If there are issues here, we strive to detect the issue within an hour and resolve it within a few hours.
An incident begins when something goes wrong—whether it’s a critical dbt job failing, a table no longer receiving new data, or an SQL test for data integrity breaking down. At this point, it’s still just an issue, and you should be able to achieve these three objectives:
Without sufficient testing and monitoring in place, you’ll be caught on the back foot, only learning about issues when you have critical system failures or when stakeholders detect issues directly.
The chapter Testing & Monitoring goes into more detail about how to set up the right monitoring to help you avoid this.
When an alert is triggered, you should assess the situation thoroughly. The alert may not provide all the context needed, and a system failure might be connected to other relevant issues. At this stage, you should have an understanding of all other issues and their connectivity. Internally at SYNQ, we always aim to be able to answer three questions:
Example summary card–system-wide issue impacting critical data products
Answering these questions provides the context needed to evaluate the incident’s urgency and decide whether it warrants a full incident response. This triage step helps separate critical issues from minor ones that don’t require full incident management.
An efficient triage workflow should alert relevant team members using your ownership definitions and offer a comprehensive view of failures and their context. This enables data engineers and analysts to assess issue severity and interrelations effectively.
Linking incidents to data products, or to your other business-critical datasets, is especially useful, as it highlights the potential impact of each issue and automatically identifies affected datasets, streamlining the triage process. This is otherwise nearly impossible to do, especially if you have many hundreds of data assets downstream of an issue.
Once you’re able to answer these questions, we recommend that you set clear expectation levels closely tied to your on-call setup. At SYNQ, we use the following benchmarks for our MTTD (mean time to detect) and MTTR (mean time to resolve).
SYNQ’s internal MTTD and MTTR metrics
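If you log incidents in a table (or can export them from your incident tooling), these metrics are a simple aggregation. A sketch in the same SQL style as earlier, assuming a hypothetical data_incidents log with started_at, detected_at, and resolved_at timestamps:
-- Monthly MTTD and MTTR in minutes, from a hypothetical incident log
select
    toStartOfMonth(started_at) as month,
    avg(dateDiff('minute', started_at, detected_at)) as mttd_minutes,
    avg(dateDiff('minute', detected_at, resolved_at)) as mttr_minutes,
    count() as incidents
from data_incidents
where resolved_at is not null
group by month
order by month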
If your data team or engineering team already uses an incident management tool like PagerDuty, Opsgenie, or incident.io, we recommend you link your data-related issues to it, so that an alert can be automatically connected to existing workflows and core metadata about the impact is brought directly into those platforms.
While there’s no universal approach to handling data-related incidents, a structured and well-documented incident management process significantly helps make the response smoother and more effective.
Not all issues should be incidents, but erring on the side of declaring too many rather than too few gives you a traceable log of issues, so if something goes wrong, you can go back and look at what happened last time. You can always adjust the incident severity and prioritize it accordingly.
Creating a dedicated space for communication (such as a document, a Slack channel, or an incident in an external tool) ensures that stakeholders are kept in the loop and avoids having to switch between multiple apps and tabs, allowing you to focus your efforts on identifying and resolving the root cause of the issue.
At SYNQ, we use these steps for coordinating key communication and root cause analysis.
Each of these steps will be much easier with well-defined data products and ownership in place–you’ll know who to communicate or escalate issues to, you’ll be able to see if important data products are impacted downstream, and you’ll have a log to trace previous related incidents and how they were solved last time.
It can be tempting to set an incident aside once resolved, but doing so may mean missing valuable insights and failing to establish processes to prevent it from reoccurring—or leaving someone else to grapple with a similar issue later on.
For lower-severity incidents, such as a test failure, follow-up checklists with assigned task owners can help ensure that guardrails are put in place to prevent recurrence. For more critical incidents, like a critical data product failure, a postmortem review can offer deeper insights and guide process improvements.
INC-123: Data Product Incident Title
ℹ️ Key Information                ⏱️ Timestamps
+--------------+------------+     +-------------------+-----+
| Data Product | …          |     | Reported at       | …   |
| Priority     | P1, P2, …  |     | Impact started at | …   |
| SLA          | 99.9%, …   |     | Resolved at       | …   |
+--------------+------------+     +-------------------+-----+
👪 Ownership & Teams              ⏳ Durations
+---------------+------------+    +------------------+-----+
| Product Owner | …          |    | Time to Identify | …   |
| Slack Channel | #team-name |    | Time to Fix      | …   |
+---------------+------------+    +------------------+-----+
🪞 Related Incidents              🔗 Useful Links
+--------------------+            +--------------------+
| INC-456            |            | Incident Homepage  |
| INC-789            |            | Slack Channel      |
+--------------------+            +--------------------+
🖊️ Summary
+------------------------------------------------------------+
| Summary of incident impact on "Marketing Attribution" data |
| product affecting downstream assets…                       |
+------------------------------------------------------------+
📆 Incident Timeline
+------------+-------+---------------------------+
| Date       | Time  | Event                     |
+------------+-------+---------------------------+
| 2024-01-01 | 12:00 | Reported by Data Engineer |
| 2024-01-01 | 12:30 | Priority escalated to P1  |
| 2024-01-02 | 12:00 | Incident Resolved         |
+------------+-------+---------------------------+
⬆️ Root Cause                     ⬇️ Mitigators
+----------------------------+    +-----------------------------+
| - Missing upstream checks  |    | - Add source freshness      |
| - No quality assurance     |    | - Raise priority of source  |
|   in source system         |    |   data product              |
+----------------------------+    +-----------------------------+
👀 Risks                          ⏩ Follow-up Actions
+----------------------------+    +-----------------------------+
| - Low ownership of         |    | - Add review with core      |
|   source data quality      |    |   engineering team          |
| - Weak completeness SLIs   |    | - Raise priority of issue   |
+----------------------------+    +-----------------------------+
Example template for a post-mortem of a critical data incident
Additionally, a periodic review of incident trends can reveal patterns, such as recurring errors in specific data assets, or highlight imbalances, such as specific upstream teams being responsible for a disproportionately large share of issues.
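Such a review can start from a simple count over the same hypothetical data_incidents log (here assuming it also carries upstream_source and priority columns):
-- Which upstream sources caused the most incidents in the last 90 days?
select
    upstream_source,
    count() as incident_count,
    countIf(priority = 'P1') as p1_incidents
from data_incidents
where started_at >= now() - interval 90 day
group by upstream_source
order by incident_count desc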
You should tie these metrics together with your wider data reliability workflow and be able to measure metrics such as
In the chapter Continuous improvement, we’ll look more into how you can establish feedback loops and learning processes to continuously enhance data reliability practices.
While it’s common for engineering teams to have established a set of metrics to monitor the performance, uptime, and velocity over time, it’s less common for data teams. It is increasingly important to be able to report on the SLA, performance, and uptime of your data as you take on business-critical data products–but even if the main output of your team is dashboards and analysis for decision-making, it’s still a good idea to establish benchmarks for when data should be ready and feedback loops for learning.
Here are some indicators that it’s time to put metrics in place
If it’s only the data team that cares about the metrics you picked, you’ve likely picked the wrong ones. Get buy-in from the stakeholders who are impacted by each metric and understand how they’re impacted. For example, a product or account management team may be directly impacted if a customer-facing dashboard is down and there’s a contractual SLA commitment towards customers. Work closely with these teams to identify use cases and tie the metrics back to them.
With your use case in mind, you should assess a list of metrics that you can track. It helps to group them into key areas–if your goal is to improve the SLA of key data products, focus on High-level metrics and SLIs. If your goal is to improve the usability of data, focus on Usability-related metrics.
Measuring metrics such as model test coverage without a clear end goal in mind can create a false sense of security and lead teams to optimize toward the wrong goals.
If you’re just starting, you likely have few or no metrics at all. Some metrics will be easier to get than others – for example, if you have existing tests and monitors in place, you’ll be able to group these into SLIs. On the other hand, if you don’t have an established incident management process, tracking incident mean time to resolution may not be the right place to start.
Start with just a few metrics based on the use case you have in mind. If you support a business-critical data product such as a customer-facing dashboard, you’ll likely want to track both coverage and a quality score/SLA. It’s important to consider both – if you only track the quality score/SLA without considering the coverage of data controls, you’ll establish a false sense of security about the actual quality of the underlying data.
With this in mind, start decomposing your key metrics into dimensions. In the example below, you can see that SLA is only satisfied for 4 of the last 12 weeks. But this is largely due to the Revenue Forecasting data product consistently falling below the SLA target, giving you a good sense of where to focus and improve.
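How you compute this depends on your tooling, but conceptually it’s a roll-up of check results per data product per week. A hedged sketch, assuming a hypothetical check_results table with one row per executed test or monitor, tagged with its data product:
-- Share of passing checks per data product per week, against a 99% SLA target
select
    data_product,
    toStartOfWeek(executed_at) as week,
    countIf(status = 'pass') / count() as pass_rate,
    countIf(status = 'pass') / count() >= 0.99 as sla_met
from check_results
group by data_product, week
order by data_product, week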
Think about SLIs as groups of data quality controls. By grouping your checks into SLIs, you can zoom in on whether specific areas are causing the SLA to fall behind. The six SLI areas we identified earlier provide a good starting point for grouping your existing data controls and are also the ones we’ve decided to use internally at SYNQ.
- Accuracy (e.g., accepted_values test for valid statuses, custom SQL checks for calculated metrics).
- Completeness (e.g., not_null test for critical columns, row count checks).
- Consistency (e.g., relationships test to check foreign key integrity, unique test across datasets).
- Timeliness (e.g., dbt source freshness test, custom timestamp lag checks).
- Validity (e.g., accepted_values test for categorical data, regex-based custom tests for formatting).
With clearly established SLIs, you can go a step further to understand what’s causing the SLA to fall behind for our Revenue Forecasting data product.
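Once each check is tagged with an SLI, the drill-down is just another grouping over the same hypothetical check_results table (here assuming it carries an sli column), for a Revenue Forecasting data product over the last 12 weeks:
-- Pass rate per SLI for a single data product over the last 12 weeks
select
    sli,
    countIf(status = 'pass') / count() as pass_rate,
    count() as checks_executed
from check_results
where data_product = 'revenue_forecasting'
    and executed_at >= now() - interval 12 week
group by sli
order by pass_rate asc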
With these insights at hand, the next step is clear – to reach your SLA goals, you should focus on the timeliness of data, especially for sources feeding into the Revenue Forecasting data product. In our case, we weigh all SLIs equally when calculating the SLA. In some cases, you’ll want to set different targets per SLI. For example, for an ML model, fresh data may be less important, so you might accept a 95% threshold for data being loaded on time, while a lower tolerance for completeness or accuracy issues leads you to set those SLI targets to 99.9%.
You may already have the data available to start measuring the key metrics, or you may be starting from scratch, uncovering where data lives in source systems or defining the processes behind the metrics.
As you build out the metrics, do it with the following four principles in mind
Below are some ways you can obtain the data, depending on the tools you use.
Internally at SYNQ, we’ve automated the SLA, coverage, and SLI tracking, so that we can monitor and report on the uptime on all data products at any given time. This helps make sure that monitoring uptime is not an afterthought, but instead something we review regularly.
The 2024 MAD (ML, AI & Data) Landscape gives a good overview of all tools and vendors across data and AI tooling.
You’ll want to put the insights you uncover from monitoring data quality into action, whether that’s improving a particular area, showing stakeholders how you’re improving, or something else.
While there’s no one-size-fits-all solution, we’ve seen the following approaches work well.
Automated accountability with a weekly email digest – being the person having to slide into other teams’ Slack channels to tell them that their data quality is not great is not always fun (we’ve been there). Scheduling an automated weekly email with the quality score over time and per owner domain and data product is a great way to bring accountability without one person having to point fingers (It does wonders when people see their team scoring lower than their peers).
Be religious about including metadata – the most common reason we see data quality initiatives failing is that everybody owns data quality, and thus, nobody feels responsible. Only by enforcing metadata such as data product definitions and owner or domain can you hold people accountable for data quality in their area. Build it into key processes such as using the check-model tags CI checks to enforce that certain tags are present.
Beware of the broken windows theory – the broken windows theory comes from criminology and suggests that if a broken window in a building is left unrepaired, the rest soon starts to fall apart; once residents see things falling apart, they stop caring about the rest. We can draw the same analogy to data quality.
If you’ve got many failing tests, it’s often a symptom that the signal-to-noise ratio is too low or that you don’t implement tests in the right places. Don’t let failing data checks sit around. Instead, set aside dedicated time, such as “fix-it Fridays” every other week, to work on these types of issues and remove data checks that are no longer needed.
If you’re only sporadically looking at your data quality metrics, it’s harder to establish a benchmark and systematically track improvements. The data team at Lunar meets with key C-level stakeholders every 3 months to update them on data quality KPI progress, new initiatives, and any regulatory risks.
Create run books for data quality – if you’re in a larger team, include clear steps around addressing each data quality dimension so it’s clear for everyone. For example, if the Timeliness score is low, you can recommend steps such as adding a dbt source freshness check or an automated freshness monitor.
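If you prefer a custom check over dbt source freshness, the same SQL style used earlier works for timeliness too; a sketch, assuming a hypothetical raw_orders table with a loaded_at column:
-- Fail if no new rows have been loaded in the last 24 hours
select
    concat('Expected data loaded within 24 hours but latest row is from ',
        toString(max(loaded_at))) as error_message
from raw_orders
having max(loaded_at) < now() - interval 24 hour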
If you’ve made it this far, you’ve understood the key components of building a reliable data stack – from defining data use cases as products, setting ownership & severity, deploying strategic tests & monitors, and establishing quality metrics.
Maintaining a reliable, high-quality data platform is not a one-off exercise but requires continuous investment.
New data use cases should be defined as data products with ownership and severity clearly defined.
Tests and monitors should be evaluated on an ongoing basis based on quality metrics, adding new checks in cases where issues are caught by stakeholders, removing low signal-to-noise tests, and keeping teams that score low on quality metrics accountable.
Taking these steps will put your team among the top 10% of data teams.
Good luck!
If you have questions about specific chapters or want to book a free consulting session with the authors of this guide, Petr Janda or Mikkel Dengsøe, find time here.