Continuous Improvement

While it’s common for engineering teams to have an established set of metrics to monitor performance, uptime, and velocity over time, it’s less common for data teams. As you take on business-critical data products, it becomes increasingly important to report on the SLA, performance, and uptime of your data. But even if your team’s main output is dashboards and analysis for decision-making, it’s still a good idea to establish benchmarks for when data should be ready and feedback loops for learning.

Here are some indicators that it’s time to put metrics in place:

  • Business-critical data – Your team now owns data products like customer-facing dashboards where any downtime impacts customers directly.
  • Data quality perception – You’re hearing complaints about data “unreliability” or slow dashboard readiness without being able to systematically pinpoint issues.
  • Inconsistent data quality – You’re seeing inconsistencies across data teams and want to establish consistent, higher standards.
  • External accountability – You need to objectively assess data quality and dependencies for regulators or external board members.
  • Low signal-to-noise controls – You want to understand and improve the ratio of data control alerts that indicate real business issues.

Tip: Get buy-in outside the data team for your metrics

If only the data team cares about the metrics you picked, you’ve likely picked the wrong ones. Get buy-in from the stakeholders who are impacted by each metric and understand how they’re impacted. For example, a product or account management team may be directly affected if a customer-facing dashboard that’s contractually committed to an SLA with customers goes down. Work closely with these teams to identify use cases and tie the metrics back to them.

Picking the right metrics

With your use case in mind, assess the list of metrics you can track. It helps to group them into key areas: if your goal is to improve the SLA of key data products, focus on High-level metrics and SLIs; if your goal is to improve the usability of data, focus on Usability-related metrics.

Measuring metrics such as model test coverage without a clear end goal in mind can create a false sense of security and lead teams to optimize toward the wrong goals.

| Metric Group | Metric | Description |
| --- | --- | --- |
| High-Level Coverage | % of assets with required data controls in place | Percentage of assets with required data controls in place |
| Quality Score / SLA | % of SLIs passing | Calculated as (passed SLIs) / (total SLIs) |
| | Accuracy | Data reflects real-world facts |
| | Completeness | All required data is present and available |
| | Consistency | Data is uniform across systems and sources |
| | Uniqueness | No duplicate records exist within the dataset |
| | Timeliness | Data is updated and remains fresh |
| | Validity | Data conforms to required formats and business rules |
| Usability | Ownership Defined | % of assets with a defined owner |
| | Priority Level | % of assets with an assigned priority level |
| | Data Product Association | % of assets belonging to a data product |
| | Description | % of assets with a description |
| | Active Users | Number of users actively interacting with the asset |
| | Dashboard Load Time | Average time for dashboards to load |
| Operational Metrics | Mean Time to Resolution | Average time to resolve incidents |
| | Mean Time to Detection | Average time to detect an incident |
| | Number of Incidents | Total incidents impacting data products |
| | Number of Issues | Total issues logged (not escalated to incidents) |
| | Issue-to-Incident Rate | The signal-to-noise ratio for different data controls |
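
To make the Quality Score / SLA row concrete, here’s a minimal sketch of the (passed SLIs) / (total SLIs) calculation. The SLI names and results are invented for illustration, not tied to any particular tool.

```python
# Minimal sketch: quality score / SLA = (passed SLIs) / (total SLIs).
# The check results below are illustrative only.
sli_results = [
    {"sli": "completeness", "passed": True},
    {"sli": "timeliness", "passed": False},
    {"sli": "uniqueness", "passed": True},
    {"sli": "validity", "passed": True},
]

passed = sum(1 for r in sli_results if r["passed"])
quality_score = passed / len(sli_results)

print(f"Quality score / SLA: {quality_score:.1%}")  # 75.0%
```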

Tip: Consider metric availability

If you’re just starting, you likely have few or no metrics at all. Some metrics will be easier to get – for example, if you have existing tests and monitors in place, you’ll be able to group these into SLIs. On the other hand, if you don’t have an established incident management process, tracking incident mean time to resolution may not be the right place to start.

Selecting your North Star metrics

Start with just a few metrics based on the use case you have in mind. If you support a business-critical data product such as a customer-facing dashboard, you’ll likely want to track both coverage and quality score/SLA. It’s important to consider both – if you only track quality score/SLA without considering the coverage of data controls, you’ll establish a false sense of security about the actual quality of the underlying data.

With this in mind, start decomposing your key metrics into dimensions. In the example below, you can see that the SLA is only met in 4 of the last 12 weeks. This is largely due to the Revenue Forecasting data product consistently falling below the SLA target, giving you a good sense of where to focus your improvement efforts.

[Figure: weekly SLA status broken down by data product]
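
As a rough illustration of this kind of decomposition, the sketch below groups hypothetical weekly SLI pass rates by data product to see which one drags the overall SLA down. The data, product names, and the 99% target are all made up.

```python
import pandas as pd

# Hypothetical weekly SLI pass rates per data product (illustrative numbers only).
df = pd.DataFrame(
    [
        {"week": "2024-W01", "data_product": "Revenue Forecasting", "sli_pass_rate": 0.92},
        {"week": "2024-W01", "data_product": "Customer Dashboard", "sli_pass_rate": 0.995},
        {"week": "2024-W02", "data_product": "Revenue Forecasting", "sli_pass_rate": 0.95},
        {"week": "2024-W02", "data_product": "Customer Dashboard", "sli_pass_rate": 0.999},
    ]
)

SLA_TARGET = 0.99  # assumed target

# Overall SLA status per week vs. the same metric broken down by data product.
weekly = df.groupby("week")["sli_pass_rate"].mean().ge(SLA_TARGET)
by_product = df.groupby(["data_product", "week"])["sli_pass_rate"].mean().ge(SLA_TARGET)

print(weekly)      # which weeks met the SLA overall
print(by_product)  # which data product is pulling the SLA below target
```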

Establishing service level indicators (SLIs)

Think of SLIs as groups of data quality controls. By grouping your controls into SLIs, you can zoom in on whether specific areas are causing the SLA to fall behind (a small grouping sketch follows the list below). The six SLI areas we identified earlier provide a good starting point for grouping your existing data controls and are also the ones we’ve decided to use internally at SYNQ.

  • Accuracy: Ensures data correctly represents real-world facts (e.g., accepted_values test for valid statuses, custom SQL checks for calculated metrics).
  • Completeness: Confirms all necessary data is present (e.g., not_null test for critical columns, row count checks).
  • Consistency: Verifies uniform data across sources (e.g., relationships test to check foreign key integrity, unique test across datasets).
  • Uniqueness: Ensures no duplicate entries exist (e.g., unique test on primary key columns).
  • Timeliness: Checks data freshness and update frequency (e.g., dbt source freshness test, custom timestamp lag checks).
  • Validity: Confirms data adheres to formats and rules (e.g., accepted_values test for categorical data, regex-based custom tests for formatting).
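
As a rough sketch of this grouping, you could map test types to SLI areas with a simple lookup. The mapping mirrors the examples above; the test results are invented, and in practice they would come from your own test runs.

```python
from collections import defaultdict

# Map test types to SLI areas, mirroring the examples above.
TEST_TYPE_TO_SLI = {
    "accepted_values": "validity",
    "not_null": "completeness",
    "relationships": "consistency",
    "unique": "uniqueness",
    "source_freshness": "timeliness",
    "custom_sql": "accuracy",
}

# Invented test results for illustration.
test_results = [
    {"test_type": "not_null", "passed": True},
    {"test_type": "unique", "passed": True},
    {"test_type": "source_freshness", "passed": False},
    {"test_type": "accepted_values", "passed": True},
]

per_sli = defaultdict(lambda: {"passed": 0, "total": 0})
for result in test_results:
    sli = TEST_TYPE_TO_SLI.get(result["test_type"], "accuracy")
    per_sli[sli]["total"] += 1
    per_sli[sli]["passed"] += int(result["passed"])

for sli, counts in per_sli.items():
    print(f"{sli}: {counts['passed']}/{counts['total']} checks passing")
```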

With clearly established SLIs, you can go a step further and understand what’s causing the SLA to fall behind for the Revenue Forecasting data product.

[Figure: SLI breakdown for the Revenue Forecasting data product]

With these insights at hand, the next step is clear: to reach your SLA goals, you should focus on the timeliness of data, especially for sources feeding into the Revenue Forecasting data product. In our case, we weigh all SLIs equally when calculating the SLA. In some cases, you’ll want to set a different target for each SLI. For example, for an ML model, fresh data may be less important, so you might accept a 95% target for data being loaded on time while setting a 99.9% target for completeness and accuracy, where your tolerance for issues is lower.
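
One way to express differentiated targets is a small lookup of target pass rates per SLI, compared against measured values. The numbers below are illustrative and follow the ML-model example above.

```python
# Illustrative per-SLI targets: timeliness is allowed a lower target
# than completeness or accuracy, following the ML-model example above.
sli_targets = {
    "timeliness": 0.95,
    "completeness": 0.999,
    "accuracy": 0.999,
}

# Measured pass rates over the reporting window (made-up values).
measured = {
    "timeliness": 0.97,
    "completeness": 0.992,
    "accuracy": 0.999,
}

for sli, target in sli_targets.items():
    status = "OK" if measured[sli] >= target else "BELOW TARGET"
    print(f"{sli}: measured {measured[sli]:.1%} vs target {target:.1%} -> {status}")
```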

Tracking and obtaining the metrics

You may already have the data available to start measuring your key metrics, or you may be starting from scratch – uncovering where data lives in source systems or defining the processes that produce the metrics in the first place.

As you build out the metrics, do it with the following four principles in mind (a minimal schema sketch follows the list):

  1. Metrics – select metrics that fit the business outcome you’re optimizing for
  2. Action – the insights your metrics provide should lead to action
  3. Segment – metrics should be segmentable by key dimensions (owner, data product, …)
  4. Trend – your metrics should be measured consistently and measurable over time
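
A minimal sketch of what these principles imply for how you store metrics: each measurement carries the dimensions you want to segment by and a date so trends can be tracked. The field names and values are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MetricPoint:
    metric: str        # e.g. "quality_score"
    value: float       # the measured value
    owner: str         # segment: owning team or domain
    data_product: str  # segment: data product
    measured_on: date  # enables trending over time

points = [
    MetricPoint("quality_score", 0.97, "finance-data", "Revenue Forecasting", date(2024, 6, 3)),
    MetricPoint("quality_score", 0.99, "product-data", "Customer Dashboard", date(2024, 6, 3)),
]

# Segment: average score per owner (easily extended to per data product or per week).
for owner in {p.owner for p in points}:
    vals = [p.value for p in points if p.owner == owner]
    print(owner, sum(vals) / len(vals))
```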

Below are some ways you can obtain the data, depending on the tools you use.

| Metric Group | Metrics | How to Obtain |
| --- | --- | --- |
| High-level | Coverage, Quality Score/SLA | Export from dbt artifacts, dashboards from data observability tools |
| Specific Quality (SLIs) | Accuracy, Completeness, Consistency, Uniqueness, Timeliness, Validity | dbt test results, dbt artifacts, data quality monitoring tools (e.g., SYNQ, Great Expectations) |
| Usability | % Ownership Defined, % Priority Level, % Belonging to a Data Product, % Descriptions, Number of Active Users | Data catalog exports (e.g., Atlan, Collibra), manual assessments, usage logs |
| Operational Metrics | Mean Time to Resolution, Number of Incidents, Number of Issues | Incident management tools (e.g., PagerDuty, Opsgenie), internal ticketing system reports |
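
If you use dbt, a minimal sketch of pulling test outcomes from the run_results.json artifact might look like the following. Exact artifact fields can vary between dbt versions, so treat the structure here as an assumption to verify against your own target/ directory.

```python
import json
from pathlib import Path

# Parse dbt's run_results.json (written to the target/ directory after `dbt test`).
# Field names can differ between dbt versions; verify against your artifacts.
run_results = json.loads(Path("target/run_results.json").read_text())

statuses = [r.get("status") for r in run_results.get("results", [])]
tests_passed = sum(1 for s in statuses if s == "pass")
tests_total = len(statuses)

print(f"Tests passing: {tests_passed}/{tests_total}")
```

From here, the same results can be grouped into SLIs using a mapping like the one sketched earlier.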

Internally at SYNQ, we’ve automated the SLA, coverage, and SLI tracking so that we can monitor and report on the uptime of all data products at any given time. This helps make sure that monitoring uptime is not an afterthought, but something we review regularly.

The 2024 MAD (ML, AI & Data) Landscape gives a good overview of the tools and vendors across data and AI.

Operationalizing insights

You’ll want to put the insights you uncover from monitoring data quality into action – whether that’s improving a particular area, showing stakeholders how you’re improving, or something else.

While there’s no one-size-fits-all solution, we’ve seen the following approaches work well.

Automated accountability with a weekly email digest – being the person who has to slide into other teams’ Slack channels to tell them their data quality isn’t great is not always fun (we’ve been there). Scheduling an automated weekly email with the quality score over time, broken down by owner domain and data product, is a great way to bring accountability without one person having to point fingers (it does wonders when people see their team scoring lower than their peers).
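
A rough sketch of building such a digest, assuming per-domain scores are already available from your metric source; the scores below are illustrative, and sending the email via your provider of choice is left out.

```python
# Build the body of a weekly digest from per-domain quality scores.
# The scores are illustrative; plug in your own metric source and email sender.
scores_by_domain = {
    "finance-data": {"this_week": 0.94, "last_week": 0.97},
    "product-data": {"this_week": 0.99, "last_week": 0.99},
}

lines = ["Weekly data quality digest", ""]
for domain, s in sorted(scores_by_domain.items(), key=lambda kv: kv[1]["this_week"]):
    trend = "down" if s["this_week"] < s["last_week"] else "flat/up"
    lines.append(f"- {domain}: {s['this_week']:.0%} (last week {s['last_week']:.0%}, {trend})")

digest = "\n".join(lines)
print(digest)  # hand this off to your email tooling on a weekly schedule
```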

Be religious about including metadata – the most common reason we see data quality initiatives fail is that everybody owns data quality and thus nobody feels responsible. Only by enforcing metadata such as data product definitions and owner or domain can you hold people accountable for data quality in their area. Build it into key processes, for example by using check-model-tags CI checks to enforce that certain tags are present.
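
If you’d rather roll your own check than use a pre-built hook, a minimal sketch along the same lines could read dbt’s manifest.json and fail CI when models lack required tags. The required tag names here are assumptions; adapt them to your own conventions.

```python
import json
import sys
from pathlib import Path

# Fail CI if any dbt model is missing the required tags (tag names are assumptions).
REQUIRED_TAGS = {"owner", "data_product"}

manifest = json.loads(Path("target/manifest.json").read_text())
missing = []
for unique_id, node in manifest.get("nodes", {}).items():
    if node.get("resource_type") == "model":
        if not REQUIRED_TAGS.issubset(set(node.get("tags", []))):
            missing.append(unique_id)

if missing:
    print("Models missing required tags:", *missing, sep="\n  ")
    sys.exit(1)  # non-zero exit fails the CI step
```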

Beware of the broken windows theory – the broken windows theory comes from criminology and suggests that if a broken window in a building is left unrepaired, further neglect follows: once residents see that things are falling apart, they stop caring for the rest. The same analogy applies to data quality.

If you’ve got many failing tests, it’s often a symptom that the signal-to-noise ratio is too low or that tests aren’t implemented in the right places. Don’t let failing data checks sit around. Instead, set aside dedicated time, such as “fix-it Fridays” every other week, to work on these issues and remove data checks that are no longer needed.

Case Study: Present metrics to stakeholders on a regular cadence

If you only look at your data quality metrics sporadically, it’s harder to establish a benchmark and systematically track improvements. The data team at Lunar meets with key C-level stakeholders every three months to update them on data quality KPI progress, new initiatives, and any regulatory risks.

Create run books for data quality – if you’re in a larger team, include clear steps for addressing each data quality dimension so everyone knows what to do. For example, if the Timeliness score is low, you can recommend steps such as adding a dbt source freshness check or an automated freshness monitor.
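
As a companion to such a run book step, here is a minimal sketch of a custom timestamp-lag check, independent of dbt’s built-in source freshness test. The threshold and the loaded-at value are assumptions; in practice the latest timestamp would come from a query against your warehouse.

```python
from datetime import datetime, timedelta, timezone

# Minimal freshness check: flag the Timeliness SLI if the latest loaded
# timestamp lags too far behind now.
MAX_LAG = timedelta(hours=6)  # assumed threshold

latest_loaded_at = datetime(2024, 6, 3, 4, 30, tzinfo=timezone.utc)  # illustrative value
lag = datetime.now(timezone.utc) - latest_loaded_at

if lag > MAX_LAG:
    print(f"Timeliness SLI failing: data is {lag} behind (allowed {MAX_LAG})")
else:
    print("Timeliness SLI passing")
```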

Data Product reliability workflow for continuous improvement

If you’ve made it this far, you understand the key components of building a reliable data stack – from defining data use cases as products and setting ownership & severity to deploying strategic tests & monitors and establishing quality metrics.

[Figure: data product reliability workflow]

Maintaining a reliable, high-quality data platform is not a one-off exercise but requires continuous investment.

New data use cases should be defined as data products with ownership and severity clearly defined.

Tests and monitors should be evaluated on an ongoing basis using your quality metrics: add new checks where issues are being caught by stakeholders, remove low signal-to-noise tests, and hold teams that score low on quality metrics accountable.

Taking these steps will put your team among the top 10% of data teams.

Good luck!

If you have questions about specific chapters or want to book a free consulting session with the authors of this guide, Petr Janda or Mikkel Dengsøe, find time here.
