Continuous Improvement
While it’s common for engineering teams to have an established set of metrics to monitor performance, uptime, and velocity over time, it’s less common for data teams. Being able to report on the SLA, performance, and uptime of your data becomes increasingly important as you take on business-critical data products. But even if the main output of your team is dashboards and analysis for decision-making, it’s still a good idea to establish benchmarks for when data should be ready and feedback loops for learning.
Here are some indicators that it’s time to put metrics in place:
- Business-critical data – Your team now owns data products like customer-facing dashboards where any downtime impacts customers directly.
- Data quality perception – You’re hearing complaints about data “unreliability” or slow dashboard readiness without being able to systematically pinpoint issues.
- Inconsistent data quality – You’re seeing inconsistencies across data teams and want to establish consistent, higher standards.
- External accountability – You need to objectively assess data quality and dependencies for regulators or external board members.
- Low signal-to-noise controls – You want to understand and improve the ratio of data control alerts that indicate real business issues.
Tip: Get buy-in outside the data team for your metrics
If only the data team cares about the metrics you picked, you’ve likely picked the wrong ones. Get buy-in from the stakeholders who are impacted by each metric and understand how they’re impacted. For example, a product or account management team may be directly affected if a customer-facing dashboard is down and they’re contractually committed to an SLA with those customers. Work closely with these teams to identify use cases and tie the metrics back to them.
Picking the right metrics
With your use case in mind, assess a list of metrics you can track. It helps to group them into key areas: if your goal is to improve the SLA of key data products, focus on high-level metrics and SLIs; if your goal is to improve the usability of data, focus on usability-related metrics.
Measuring metrics such as model test coverage without a clear end goal in mind can create a false sense of security and lead teams to optimize toward the wrong goals.
| Metric Group | Metric | Description |
|---|---|---|
| High-Level Coverage | % of assets with required data controls in place | Percentage of assets with the required data controls in place |
| Quality Score / SLA | % of SLIs passing | Calculated as (passed SLIs) / (total SLIs) |
| | Accuracy | Data reflects real-world facts |
| | Completeness | All required data is present and available |
| | Consistency | Data is uniform across systems and sources |
| | Uniqueness | No duplicate records exist within the dataset |
| | Timeliness | Data is updated and remains fresh |
| | Validity | Data conforms to required formats and business rules |
| Usability | Ownership Defined | % of assets with a defined owner |
| | Priority Level | % of assets with an assigned priority level |
| | Data Product Association | % of assets belonging to a data product |
| | Description | % of assets with a description |
| | Active Users | Number of users actively interacting with the asset |
| | Dashboard Load Time | Average time for dashboards to load |
| Operational Metrics | Mean Time to Resolution | Average time to resolve incidents |
| | Mean Time to Detection | Average time to detect an incident |
| | Number of Incidents | Total incidents impacting data products |
| | Number of Issues | Total issues logged (not escalated to incidents) |
| | Issue-to-Incident Rate | The signal-to-noise ratio for different data controls |
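To make the two high-level metrics in the table concrete, here is a minimal Python sketch of how coverage and the quality score could be computed, assuming you already collect per-asset control and SLI check results. The `Asset` structure and the example data are illustrative, not a prescribed schema.

```python
# Minimal sketch: coverage and quality score from per-asset check results.
# The Asset fields and the example data below are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    has_required_controls: bool                            # e.g. freshness + not_null checks exist
    sli_results: list[bool] = field(default_factory=list)  # True = SLI check passed


def coverage(assets: list[Asset]) -> float:
    """% of assets with the required data controls in place."""
    return 100 * sum(a.has_required_controls for a in assets) / len(assets)


def quality_score(assets: list[Asset]) -> float:
    """Quality score / SLA: (passed SLIs) / (total SLIs) across all assets."""
    results = [r for a in assets for r in a.sli_results]
    return 100 * sum(results) / len(results) if results else 0.0


assets = [
    Asset("orders", True, [True, True, False]),
    Asset("revenue_forecast", False, [True, False]),
]
print(f"coverage: {coverage(assets):.0f}%")            # 50%
print(f"quality score: {quality_score(assets):.0f}%")  # 60%
```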
Tip: Consider the metric availability
If you’re just starting, you likely have few or no metrics at all. Some metrics will be easier to get than others: for example, if you already have tests and monitors in place, you’ll be able to group these into SLIs. On the other hand, if you don’t have an established incident management process, tracking mean time to resolution for incidents may not be the right place to start.
Selecting your North Star metrics
Start with just a few metrics based on the use case you have in mind. If you support a business-critical data product such as a customer-facing dashboard, you’ll likely want to track both coverage and quality score/SLA. It’s important to consider both: if you only track quality score/SLA without considering the coverage of data controls, you’ll create a false sense of security about the actual quality of the underlying data.
With this in mind, start decomposing your key metrics into dimensions. In the example below, the SLA is only met in 4 of the last 12 weeks. This is largely due to the Revenue Forecasting data product consistently falling below the SLA target, which gives you a good sense of where to focus and improve.
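As a rough illustration of that decomposition, the sketch below aggregates weekly SLI pass rates per data product. The `check_results` rows are made-up placeholders for whatever your test and monitor history actually contains.

```python
# Illustrative decomposition of the SLA by week and data product.
# The check_results rows are placeholders for your test/monitor history.
from collections import defaultdict

check_results = [
    # (iso_week, data_product, check_passed)
    ("2024-W01", "Revenue Forecasting", False),
    ("2024-W01", "Customer Dashboard", True),
    ("2024-W01", "Customer Dashboard", True),
    ("2024-W02", "Revenue Forecasting", True),
    ("2024-W02", "Customer Dashboard", True),
]

totals: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
for week, product, passed in check_results:
    totals[(week, product)][0] += passed   # passing checks
    totals[(week, product)][1] += 1        # total checks

for (week, product), (passed, total) in sorted(totals.items()):
    print(f"{week} {product}: {100 * passed / total:.0f}% of SLIs passing")
```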
Establishing service level indicators (SLIs)
Think of SLIs as groups of data quality controls. Grouping your data controls into SLIs lets you zoom in on the specific areas that are causing the SLA to fall behind. The six SLI areas we identified earlier provide a good starting point for grouping your existing data controls, as sketched after the list below, and are also the ones we’ve decided to use internally at SYNQ.
- Accuracy: Ensures data correctly represents real-world facts (e.g., `accepted_values` test for valid statuses, custom SQL checks for calculated metrics).
- Completeness: Confirms all necessary data is present (e.g., `not_null` test for critical columns, row count checks).
- Consistency: Verifies uniform data across sources (e.g., `relationships` test to check foreign key integrity, `unique` test across datasets).
- Uniqueness: Ensures no duplicate entries exist (e.g., `unique` test on primary key columns).
- Timeliness: Checks data freshness and update frequency (e.g., `dbt source freshness` test, custom timestamp lag checks).
- Validity: Confirms data adheres to formats and rules (e.g., `accepted_values` test for categorical data, regex-based custom tests for formatting).
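Here is a minimal sketch of one way to do that grouping in Python, assuming your checks can be labeled by test type. The mapping from test types to SLIs is a choice you’d adapt to your own conventions, not a fixed standard.

```python
# Sketch: map existing test types to the six SLIs and aggregate pass rates.
# The mapping and the sample results are assumptions to adapt to your setup.
from collections import defaultdict

TEST_TO_SLI = {
    "custom_sql_metric": "Accuracy",
    "not_null": "Completeness",
    "row_count": "Completeness",
    "relationships": "Consistency",
    "unique": "Uniqueness",
    "source_freshness": "Timeliness",
    "accepted_values": "Validity",
}

# (test_type, passed) pairs, e.g. parsed from your test run history
test_results = [
    ("not_null", True),
    ("unique", True),
    ("source_freshness", False),
    ("accepted_values", True),
]

sli_scores: dict[str, list[int]] = defaultdict(lambda: [0, 0])
for test_type, passed in test_results:
    sli = TEST_TO_SLI.get(test_type, "Accuracy")  # default bucket is a choice
    sli_scores[sli][0] += passed
    sli_scores[sli][1] += 1

for sli, (passed, total) in sli_scores.items():
    print(f"{sli}: {passed}/{total} checks passing")
```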
With clearly established SLIs, you can go a step further and understand what’s causing the SLA to fall behind for the Revenue Forecasting data product.
With these insights at hand, the next step is clear: to reach your SLA goals, focus on the timeliness of data, especially for sources feeding into the Revenue Forecasting data product. In our case, we weigh all SLIs equally when calculating the SLA. In some cases, you’ll want to set a different target for each SLI. For example, fresh data may matter less for an ML model, so you accept a 95% threshold for data being loaded on time, while a lower tolerance for completeness or accuracy issues leads you to set those SLI targets to 99.9%.
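The sketch below illustrates that setup: an equal-weight SLA next to per-SLI targets. The pass rates and thresholds are the example numbers from the text, not recommendations.

```python
# Sketch: equal-weight SLA alongside per-SLI targets. The pass rates and
# thresholds are example numbers from the text, not recommendations.
sli_pass_rates = {   # measured share of passing checks per SLI
    "Timeliness": 0.96,
    "Completeness": 0.995,
    "Accuracy": 0.999,
}
sli_targets = {      # per-SLI tolerance, e.g. looser for freshness on an ML model
    "Timeliness": 0.95,
    "Completeness": 0.999,
    "Accuracy": 0.999,
}

# Equal weighting: every SLI contributes the same to the overall SLA.
sla = sum(sli_pass_rates.values()) / len(sli_pass_rates)
print(f"overall SLA: {sla:.1%}")

# Per-SLI check against its own target.
for sli, rate in sli_pass_rates.items():
    status = "OK" if rate >= sli_targets[sli] else "BELOW TARGET"
    print(f"{sli}: {rate:.1%} (target {sli_targets[sli]:.1%}) -> {status}")
```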
Tracking and obtaining the metrics
You may already have the data available to start measuring the key metrics, or you may be starting from scratch, uncovering where data lives in source systems or defining the processes that will produce the metrics in the first place.
As you build out the metrics, do it with the following four principles in mind:
- Metrics – select metrics that fit the business outcome you’re optimizing for
- Action – the insights your metrics provide should lead to action
- Segment – metrics should be segmentable by key dimensions (owner, data product, …)
- Trend – your metrics should be measured consistently so they can be compared over time
Below are some ways to obtain the data, depending on the tools you use.
| Metric Group | Metrics | How to Obtain |
|---|---|---|
| High-level | Coverage, Quality Score/SLA | Export from dbt artifacts, dashboards from data observability tools |
| Specific Quality (SLIs) | Accuracy, Completeness, Consistency, Uniqueness, Timeliness, Validity | dbt test results, dbt artifacts, data quality monitoring tools (e.g., SYNQ, Great Expectations) |
| Usability | % Ownership Defined, % Priority Level, % Belonging to a Data Product, % Descriptions, Number of Active Users | Data catalog exports (e.g., Atlan, Collibra), manual assessments, usage logs |
| Operational Metrics | Mean Time to Resolution, Number of Incidents, Number of Issues | Incident management tools (e.g., PagerDuty, Opsgenie), internal ticketing system reports |
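For dbt users, a quick way to get SLI inputs is to parse the `run_results.json` artifact that `dbt test` or `dbt build` writes to the `target/` directory. The sketch below only reads the `unique_id` and `status` fields; the path and how you aggregate the counts are up to your setup.

```python
# Sketch: count passing/failing dbt tests from target/run_results.json.
# Assumes the artifact exists where dbt writes it; adjust the path as needed.
import json
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())["results"]

test_results = [r for r in results if r["unique_id"].startswith("test.")]
passed = sum(r["status"] == "pass" for r in test_results)
failed = sum(r["status"] in ("fail", "error") for r in test_results)

print(f"{passed} passing / {failed} failing of {len(test_results)} tests")
```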
Internally at SYNQ, we’ve automated the SLA, coverage, and SLI tracking so that we can monitor and report on the uptime of all data products at any given time. This helps make sure that monitoring uptime is not an afterthought but something we review regularly.
The 2024 MAD (ML, AI & Data) Landscape gives a good overview of tools and vendors across the data and AI ecosystem.
Operationalizing insights
You’ll want to put the insights you uncover from monitoring data quality into action, whether that’s improving a particular area, showing stakeholders how you’re progressing, or something else.
While there’s no one-size-fits-all solution, we’ve seen the following approaches work well.
Automated accountability with a weekly email digest – being the person who has to slide into other teams’ Slack channels to tell them their data quality isn’t great is not always fun (we’ve been there). Scheduling an automated weekly email with the quality score over time, broken down by owner domain and data product, is a great way to bring accountability without one person having to point fingers (it does wonders when people see their team scoring lower than their peers).
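A trivial sketch of what the digest body could look like, with quality scores per owner domain and data product sorted worst-first; where the scores come from and how you deliver the digest (email, Slack) depends on your stack, and the numbers below are made up.

```python
# Sketch of a digest body: quality score per owner domain and data product,
# sorted worst-first. The scores below are made-up placeholders.
scores = [
    # (owner_domain, data_product, quality_score_pct)
    ("Finance", "Revenue Forecasting", 91.2),
    ("Product", "Customer Dashboard", 99.1),
    ("Marketing", "Attribution Model", 96.4),
]

lines = ["Weekly data quality digest"]
for owner, product, score in sorted(scores, key=lambda s: s[2]):
    lines.append(f"- {owner} / {product}: {score:.1f}% of SLIs passing")

digest = "\n".join(lines)
print(digest)  # hand this off to your email or Slack integration
```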
Be religious about including metadata – the most common reason we see data quality initiatives fail is that everybody owns data quality and, therefore, nobody feels responsible. Only by enforcing metadata such as data product definitions and owner or domain can you hold people accountable for data quality in their area. Build it into key processes, such as using check-model-tags CI checks to enforce that certain tags are present.
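If you don’t use a ready-made hook, a small CI script along these lines can enforce the metadata by reading dbt’s `manifest.json` and failing the build when a model lacks owner or data product information. The required field names here are assumptions you’d adapt to your own schema.

```python
# Sketch of a CI metadata check against dbt's manifest.json. The required
# meta fields ("owner", "data_product") are assumptions; adapt to your schema.
import json
import sys
from pathlib import Path

REQUIRED_META = ("owner", "data_product")

nodes = json.loads(Path("target/manifest.json").read_text())["nodes"]
missing = []
for unique_id, node in nodes.items():
    if node["resource_type"] != "model":
        continue
    meta = node.get("meta") or node.get("config", {}).get("meta", {})
    for key in REQUIRED_META:
        if not meta.get(key):
            missing.append(f"{unique_id}: missing meta '{key}'")

if missing:
    print("\n".join(missing))
    sys.exit(1)  # fail the CI job so untagged models can't be merged
else:
    print("All models define owner and data_product metadata")
```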
Beware of the broken windows theory – the broken windows theory comes from criminology and suggests that if one window in a building is left broken, more soon follow: once residents see things falling apart, they stop caring about the rest. The same analogy applies to data quality.
If you have many failing tests, it’s often a symptom that the signal-to-noise ratio is too low or that tests aren’t implemented in the right places. Don’t let failing data checks sit around. Instead, set aside dedicated time, such as “fix-it Fridays” every other week, to work through these issues and remove data checks that are no longer needed.
Case Study: Present metrics to stakeholders with a regular cadence
If you only look at your data quality metrics sporadically, it’s harder to establish a benchmark and systematically track improvements. The data team at Lunar meets with key C-level stakeholders every three months to update them on data quality KPI progress, new initiatives, and any regulatory risks.
Create runbooks for data quality – if you’re in a larger team, include clear steps for addressing each data quality dimension so the process is clear to everyone. For example, if the Timeliness score is low, recommend steps such as adding a dbt source freshness check or an automated freshness monitor.
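As an example of what such a runbook step might point to, here is a sketch of a custom timestamp lag check for Timeliness. The threshold, table, and column names are placeholders for your own stack; in practice the latest load timestamp would come from a warehouse query.

```python
# Sketch of a custom timestamp lag check for Timeliness. The threshold and the
# source of latest_loaded_at are placeholders; in practice the timestamp would
# come from a warehouse query such as: SELECT max(_loaded_at) FROM raw.payments
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(hours=6)  # freshness threshold agreed for this SLI


def is_fresh(latest_loaded_at: datetime) -> bool:
    """Return True if the source was loaded within the allowed lag."""
    return datetime.now(timezone.utc) - latest_loaded_at <= MAX_LAG


latest_loaded_at = datetime.now(timezone.utc) - timedelta(hours=8)  # example value
print("fresh" if is_fresh(latest_loaded_at) else "stale: escalate per runbook")
```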
Data Product reliability workflow for continuous improvement
If you’ve made it this far, you understand the key components of building a reliable data stack: from defining data use cases as products, setting ownership and severity, and deploying strategic tests and monitors, to establishing quality metrics.
Maintaining a reliable, high-quality data platform is not a one-off exercise but requires continuous investment.
New data use cases should be defined as data products with ownership and severity clearly defined.
Tests and monitors should be evaluated on an ongoing basis against your quality metrics: add new checks where issues are being caught by stakeholders, remove low signal-to-noise tests, and hold teams that score low on quality metrics accountable.
Taking these steps will put your team among the top 10% of data teams.
Good luck!
If you have questions about specific chapters or want to book a free consulting session with the authors of this guide, Petr Janda or Mikkel Dengsøe, find time here.