Setting Expectations
A data product is a group of data assets structured around their use case. If you define your data products well, everything else follows: ownership becomes clear, your testing and monitoring strategy has a natural scope, and you can manage and maintain SLAs in a way that’s consistent with data consumers’ expectations.
The building blocks of a data product – ownership, prioritization, code & logic, tests, documentation, and metrics.
A data product rarely works in isolation but most often relies on input, either other data products or data directly from operational systems. Therefore, expectations for the data product should be set depending on its entire lineage. In this chapter, we’ll look at everything you need to know to get started with data products – from identifying and defining them to determining their priority and setting SLAs.
Case Study: How data products helped Aiven untangle their spaghetti lineage
The data team at Aiven, the open-source AI & data platform, struggled to untangle dependencies across a data stack of more than 900 dbt models. Circular dependencies made it difficult to make sense of the lineage, which slowed down root cause analysis and made system-design decisions harder. This was particularly true for their core data products with hundreds of dependencies, such as the ARR calculation, one of the most important metrics provided by the data team.
By encapsulating data products across data-producing and data-consuming domains, the lineage became clear: instead of having to understand hundreds of upstream tables, they could visualize and reason about the lineage through the lens of just a handful of data products.
With this at hand, they can trace an issue back to a faulty upstream data product and easily identify its owner for escalation, or see which data products are impacted downstream by an upstream failure – all without everyone needing to understand the internals of each data product.
Another benefit is that system design issues stand out: if a circular dependency is introduced, for example, it’s much easier to spot across a few dozen data products than in a spaghetti lineage of hundreds of data models.
Identifying your data products
If you can identify the most critical business processes your data team supports, those most often point to the data products you should define. If you’re unsure which ones they are, look for signals such as what was impacted in your most recent critical data failure, or which data assets have the highest usage across your company.
Here are some examples of what can make up a data product:
- A set of dbt models and metrics within a specific dbt folder, like a finance mart
- A group of dbt models linked by an exposure, for instance, models used by a customer lifetime value (CLTV) model that powers marketing automation (see the sketch after this list)
- A selected collection of dashboards in a BI tool, such as core KPI reporting
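For the exposure-based example, dbt exposures are a natural way to group the models behind a single use case. A minimal sketch of what this could look like; the model, owner, and contact names are illustrative:

```yaml
version: 2

exposures:
  - name: cltv_model
    type: ml                        # dbt exposure types include dashboard, ml, application, ...
    maturity: high
    description: Customer lifetime value model that powers marketing automation
    owner:
      name: Data Science Team
      email: data-science@example.com   # illustrative contact
    depends_on:                     # the dbt models that make up the data product
      - ref('fct_orders')
      - ref('dim_customers')
```

dbt can then resolve the full upstream lineage of the exposure, which becomes useful later when measuring coverage and SLAs across all dependencies.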
At SYNQ, we think about our data products as either producer or consumer products. Producer data products read from operational systems (e.g., APIs or Salesforce) and are owned by data or platform engineers. Encapsulating them in data products provides an easy-to-understand gateway for downstream teams to escalate issues upstream without grasping the full complexity of the data products’ internals. Consumer data products expose data directly to consumers and read from other data products or assets (e.g., a CLTV or marketing attribution model); they’re often owned by data analysts or scientists. Encapsulating these in data products gives everyone a good understanding of the direct impact of issues and which stakeholders should be notified.
The number of data products is highly dependent on your company’s size and complexity. Some companies have dozens or hundreds of data products, while others have just a handful. Start small by identifying a handful of your most important data products, and then build out from there.
Case study: Establish the right granularity when identifying data products
The data team at Aiven started with high-level products such as Sales and Marketing but realized they needed to go a step deeper to have the most impact. “If the Marketing data product has an issue, that may be fine. However, if the Attribution data product within Marketing has an issue, we must immediately jump on it. This is the level of detail our data products need to be able to capture.” - Stijn, Data Engineering Manager
Defining your data products
Once you’ve identified your data products, the next step is to define them. When defining your data products, we recommend following these five steps:
- Data products should be defined as close to where they’re used as possible to reflect the experience of the data consumer
- Data products should account for dependencies as far upstream as possible to give a complete overview
- Data products should have an owner responsible for operations across their entire lifecycle, such as continuous monitoring to ensure quality and availability
- Data products should have a priority assigned indicating their importance (P1, P2, …)
- Data products should have a description so that people with less context can understand their use case
As you scale your data products, you can group them into domains to make them easier to manage. We recommend keeping data products as focused as possible – for example, it’s rarely a good idea to have a data product consist of many hundreds of assets, as it will create a web of interdependencies.
Case study: Group data products into related domains
The data team at Aiven groups data products into business groups such as Sales, Finance, and Marketing as well as Core and API for more technical data products. “With this at hand, we can monitor the health across different groups to get a high-level overview, while also zooming into specific data products. This is also helpful for us when mapping priority–for example, our Attribution Marketing product is P1 while our Market Research product is P3.” - Stijn, Data Engineering Manager
The definitions of data products should follow existing workflows. For example, if you already have a process for defining asset ownership and priority, the same process will also work for data product definitions.
At SYNQ, we take a pragmatic approach. Producer data products are defined based on specific assets such as Postgres tables and dbt sources. Intermediate data products are defined in code through dbt metadata tags and groups. Consumer data products are defined through their folder structure in dbt and Looker, which mirrors their use cases.
```yaml
version: 2

data_products:
  - name: marketing_attribution
    description: >
      This data product includes models for tracking marketing attribution
      across various channels. It powers the Marketing Attribution Dashboard
      and is critical for assessing campaign performance and optimizing
      marketing spend.
    owner:
      name: marketing_data_team
      slack_channel: "#marketing-data-team"
    priority: P1
    assets:
      - name: dim_marketing_campaigns
      - name: fct_channel_performance
      - name: fct_attribution_model
```
A data product consisting of key marketing assets with priority, owner, and Slack channel defined in a dbt yml file
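For the intermediate data products mentioned above, which are defined in code through dbt metadata tags and groups, the same information can be attached directly to models. A minimal sketch, assuming dbt 1.5+ groups; the meta keys and model names are illustrative and not built-in dbt fields:

```yaml
version: 2

groups:
  - name: marketing
    owner:
      name: marketing_data_team
      email: marketing-data@example.com   # illustrative contact

models:
  - name: int_attribution_touchpoints     # illustrative intermediate model
    group: marketing
    config:
      meta:                                # arbitrary tags, not built-in dbt fields
        data_product: marketing_attribution
        priority: P1
```

Tooling can then read the meta tags to resolve which models belong to which data product.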
While you’ll want visibility into the data product’s dependencies as far upstream as possible, we don’t recommend managing this manually. It’s not uncommon to have hundreds of dependencies upstream of a data product, and new dependencies can be added without the data product owner being aware of it. Instead, rely on automated lineage tooling such as that provided by a data catalog, dbt, or a data reliability platform.
Determining the priority of data products
As you identify your data products, you should carefully consider their priority. This serves as a guiding principle for key data product workflows: how soon issues should be addressed, whether they require an on-call schedule, and what a reasonable SLA is.
Determining the priority of your data products is significantly easier if you’ve defined your data products at the right level of granularity. For example, if you’ve defined an entire product as Marketing, many people will have a different take on its priority. But if you’ve split Marketing into Market Research, Attribution Reporting, and CLTV calculations, some will naturally be more important than others.
At SYNQ, we use the following priority levels:
- Product operations (P1) when a user-facing operational system runs queries against the dataset. The data is either directly displayed to the customer or powers product internals (such as our anomaly engine), with a direct impact on the functionality of our product.
- Client exports (P2) when a given dataset is shared with any customer in an “offline” or ad-hoc manner, which isn’t operational but could still impact the customer directly.
- Business critical workflows (P2) when analytical data feeds other operational systems with indirect impact on customers such as marketing automation or critical business decisions such as health scoring.
- Business intelligence (P3), which includes standardized reports and datasets that we use for the company’s strategic and tactical decisions.
- Every other use case is classified as Others (P4).
Defining priority based on the use case: P1 is critical, P4 is exploratory
The priority of your data products must be closely linked to expectations for how to deal with data incidents. We’ll look more into how to define the severity of data issues in the chapter Ownership & Alerting.
Case study: Get exec buy-in for setting data product priority at fintech scaleup
One pitfall when defining data product priority – especially in larger data teams with many stakeholders – is that they’ll have widely different opinions on what’s important. You should establish a set of boundaries for how these are defined. After all, not everything can be P1. We recommend working with senior stakeholders to agree on what the priority is and get their buy-in.
The Danish fintech Lunar established a data governance framework with exec buy-in to define critical data elements across the company. “Every three months, we meet with the chief risk officer, chief technology officer, and bank CEO to update them on the latest developments, risks, and opportunities. This helps everyone contribute to and have a stake in our data quality“ - Bjørn, Data Manager
Establishing SLAs for your data products
An SLA (Service Level Agreement) is a contract that defines the expected service level between a provider and consumer, including remedies if expectations aren’t met. For example, a P1 SLA for a critical customer-facing dashboard might guarantee 99.9% uptime. If the dashboard is stale or has other data quality issues, the response time is 30 minutes with a two-hour resolution. In contrast, a less critical P3 dashboard may have 95% uptime and slower response times, like 48 hours for resolution.
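To make such an agreement concrete, it helps to write the targets down next to the data product definition. A purely illustrative sketch that mirrors the numbers above; the keys are hypothetical rather than the schema of any specific tool:

```yaml
slas:
  - priority: P1                  # e.g., a customer-facing dashboard
    uptime_target: 99.9%
    response_time: 30 minutes
    resolution_time: 2 hours
  - priority: P3                  # e.g., an internal reporting dashboard
    uptime_target: 95%
    resolution_time: 48 hours
```

Whether or not you formalize it like this, the targets should reflect how the data is actually used.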
We don’t recommend measuring the entire system with metrics such as model test coverage, as these can give a false sense of security and lead your team to optimize toward the wrong goals.
Data product SLAs, by contrast, tie directly to the experience your data consumers have and are therefore a much better measure of the quality of your team’s output.
Measuring data product SLAs
We recommend that you consider two metrics to get the most complete picture.
- Coverage – what % of assets have the required data controls in place
- Quality score/SLA – what % of controls in place are passing successfully
Only by looking at both metrics can you confidently say whether the data product SLA is meeting the expectations you’ve set. Both metrics should be monitored across the lineage of the data product to include the status of any upstream dependencies.
Case Study: Data product SLA metrics should include all upstream dependencies
The data team at Lunar reports on data quality KPIs of their most important data products to the C-level every 3 months. “As a regulated company, we need to be able to demonstrate to regulators that we have sufficient data controls and an ability to trace these across all dependencies. This also gives us something demonstrable we can show to our regulators.” - Bjørn, Data Manager
Coverage – define the data control expectations for your data products based on their priority. This helps align everyone around expectations and the level of monitoring that’s sufficient. For example, for P1 data products you may agree on expectations such as the following (a dbt sketch follows the list):
- All sources must have freshness checks – either explicitly or implicitly through self-learning monitors
- Key metrics must have relevant accuracy checks in place
- Row count must be tested before and after joins are performed
- Key fields must have not_null and uniqueness tests
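In dbt, most of these expectations map to source freshness configuration and column-level tests. A minimal sketch with illustrative source, model, and column names; row-count comparisons before and after joins typically require a custom or package test and are omitted here:

```yaml
version: 2

sources:
  - name: salesforce
    loaded_at_field: _loaded_at
    freshness:                              # freshness check on the source
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: opportunities

models:
  - name: fct_attribution_model
    columns:
      - name: attribution_id
        tests:                              # key field checks
          - not_null
          - unique
```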
You can read more in the chapter Testing & Monitoring for our recommendations on how to monitor your data products.
Quality score/SLA – the best way to measure the SLA of a data product is to look at all data controls through the lens of different KPIs, also known as Service Level Indicators (SLIs). Grouping your SLIs into meaningful areas helps isolate problematic areas and fits well with the type of tests you can perform in tools like dbt:
- Accuracy – does the data reflect real-world facts
- Completeness – is all required data present and available
- Consistency – is the data uniform across systems or sources
- Uniqueness – are there no duplicate records in the dataset
- Timeliness – is the data updated and fresh
- Validity – does the data conform to the required formats and business rules
To measure the SLA of a data product, you can sum up the performance of its SLIs. The SLA is then calculated as 1 - sum(errored SLIs) / sum(SLIs). For example, if 4 out of 200 SLIs are errored, the SLA is 1 - 4/200 = 98%.
Some data products may have different sensitivity to different SLIs. For example, for an ML model training set, accurate and complete data may be more important than timely data. In these cases, you should define separate Service Level Objectives (SLOs) for each SLI.
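One way to make these per-SLI objectives explicit is to record a target per dimension alongside the data product definition. A purely illustrative sketch with hypothetical keys, not a dbt or SYNQ schema:

```yaml
data_products:
  - name: cltv_training_set        # illustrative ML training set product
    slos:                          # one objective per SLI
      accuracy: 99.5%
      completeness: 99.5%
      timeliness: 95%              # timeliness matters less for this product
```

The SLA can then be reported per SLI against its own objective rather than as a single blended number.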
A benefit of the approach above is that once you’ve set up adequate data controls, monitoring and measurement are fully automated and objective. We also recommend monitoring the number of declared incidents by priority, as these give you a more subjective view of when a data product was impacted by an issue that was escalated to an incident.
See the chapter Continuous Improvement for more information about how to select your metrics and obtain the relevant data.
Internally at SYNQ, we rely on automated grouping for all our SLIs. This means that new data controls are automatically grouped and counted towards the data product SLA and coverage as they’re added.
Setting expectations for SLA levels and remediation
Over a 30-day month, given 1,000 data checks on and upstream of the data product, this is the level of failures you can tolerate at each SLA level:
- 99.9% reliability: 1 failure, equivalent to 43 minutes of downtime.
- 99.5% reliability: 5 failures, equal to 3.6 hours of downtime.
- 99% reliability: 10 failures, around 7.2 hours of downtime.
- 95% reliability: 50 failures, equivalent to 1.5 days of downtime.
- 90% reliability: 100 failures, up to 3 days of downtime.
Case Study: Shalion contractually commits data product SLAs to external customers
For Shalion – the eCommerce data solution platform – data is their product. If they provide unreliable insights to end-users through their customer-facing dashboards, this directly impacts customer trust and retention. Therefore, the data team’s next step is to contractually commit to SLAs around the accuracy and correctness of their data products and thoroughly measure the time it takes to detect and notify impacted stakeholders of issues.
In an ideal world, you would be able to guarantee a 99.9% SLA for all your P1 data products. In reality, this may be a steep goal to reach. In some cases, the SLA may be something you commit to contractually, such as the uptime of a customer-facing dashboard. But in most cases, it will serve as a guideline and help align data consumers and producers around a common goal.
You should align your incident management processes closely to the priority and SLA of your data products. A 24-hour response time on a P2 data product with a 90% SLA may be OK. But you may need to jump on issues on P1 data products much faster than that.
We’ll look more into this in the chapter Ownership & Alerting.