We’ve had to solve more data issues than we can count. Some were minor, with no material business impact; others forced us to drop everything to fix a critical issue.
These two types of issues call for very different approaches, but too often the difference is never explicitly defined. As a result, important issues aren’t acted on fast enough, or unimportant ones derail the data team.
In this article, we’ll walk through:
- Things to consider when setting severity levels
- How to act on different severity levels
- How to get started
Things to consider when setting severity levels
Well-designed severity levels create clear expectations for what happens when there is an issue and ensure that everyone is on the same page.
How to assess the severity of an issue
We recommend using three parameters to assess the severity of an issue:
- Critical use case: Does the data have a business-critical use case?
- Downstream impact: How many downstream assets and users are affected?
- Magnitude: What is the impact on the underlying data?
You should aim to be able to assess all three within 5 minutes of being made aware of an issue.
Beyond these three parameters, a few more steps help you assess the severity of an issue with confidence.
Think of data importance as a chain
It’s not enough to know if an issue happens on a data asset with a business-critical use case. You should also know if the issue is on the critical path. Any issue on a data asset that sits upstream of business-critical data is on the critical path and should be treated as such.
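As a minimal sketch of the critical-path idea, the check below walks a lineage graph downstream from the affected asset and looks for anything tagged as business-critical. The asset names, the `LINEAGE` mapping and the `CRITICAL_ASSETS` set are hypothetical; in practice you would load them from whatever lineage metadata your stack exposes.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets directly downstream of it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "orders_scratch"],
    "fct_revenue": ["ceo_dashboard"],
    "orders_scratch": [],
    "ceo_dashboard": [],
}

# Hypothetical set of assets tagged as business-critical.
CRITICAL_ASSETS = {"ceo_dashboard"}


def downstream_assets(asset, lineage):
    """Return every asset reachable downstream of `asset` (breadth-first)."""
    seen = set()
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return seen


def is_on_critical_path(asset, lineage, critical_assets):
    """True if the asset is critical itself or sits upstream of a critical asset."""
    if asset in critical_assets:
        return True
    return bool(downstream_assets(asset, lineage) & critical_assets)
```

With this example graph, `raw_orders` is on the critical path because it feeds `ceo_dashboard`, while `orders_scratch` is not.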
Severity should be managed across the stack
To confidently assess the severity of an issue, you need to look across the stack. For example, if you don’t consider the BI layer, you may miss that a data issue impacts a dashboard that half the company uses. Or you may miss that a data model has a critical use case because the marketing team uses it in a Hightouch sync to decide which users receive a campaign.
Automate your severity levels
Whenever you have a data issue, the last thing you want to do is to have to manually go through all the steps above. Instead, it should be automated and built into your existing workflows.
For example, if you receive a Slack alert about a test failure, it should highlight whether the issue is on a data model with a critical use case or whether there are any critical use cases downstream. It should also show the total number of assets impacted downstream and give you a sense of the issue’s magnitude.
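One way to picture such an alert is a small function that assembles the message text from the three parameters. This is only a sketch: the function name, its arguments and the choice to express magnitude as the share of failing rows are all assumptions, and sending the text to Slack is left out.

```python
def build_alert_text(test_name, asset, is_critical, critical_dependents,
                     downstream_count, failed_rows, total_rows):
    """Assemble a plain-text alert covering use case, impact and magnitude.

    `critical_dependents` is a (hypothetical) list of critical assets that
    depend on `asset`, for example as found through a lineage lookup.
    """
    lines = [f"Test failed: {test_name} on {asset}"]

    if is_critical:
        lines.append("Critical use case: yes (this asset)")
    elif critical_dependents:
        names = ", ".join(sorted(critical_dependents))
        lines.append(f"Critical use case: yes (affects {names})")
    else:
        lines.append("Critical use case: no")

    lines.append(f"Downstream assets impacted: {downstream_count}")

    # Assumption: magnitude is expressed as the share of rows failing the test.
    pct = 100 * failed_rows / total_rows if total_rows else 0.0
    lines.append(f"Magnitude: {failed_rows} of {total_rows} rows failing ({pct:.1f}%)")

    return "\n".join(lines)


print(build_alert_text("not_null_order_id", "stg_orders", False,
                       ["ceo_dashboard"], 12, 50, 1000))
```

The point of the sketch is that everything needed to judge severity arrives in the alert itself, so nobody has to look it up by hand.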
How to act on issues based on severity levels
To act quickly and with confidence on data issues, you should be able to answer the following questions
With this checklist in mind, we recommend following these steps to act confidently on issues.
Be systematic when acting on different levels of severity
Have guidelines for how data issues should be dealt with. For example:
- Low: Add it to the backlog and fix it by the end of the week. E.g. a non-critical issue with low downstream impact
- Medium: Let stakeholders know and fix the issue by the end of the day. E.g. a non-critical issue with high downstream impact
- High: Stop everything you’re doing and fix the issue right away. E.g. a critical use case with high downstream impact
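The guidelines above can be sketched as a small classification function. The threshold for "high" downstream impact is a hypothetical value, and treating every critical-path issue as high, even one with few downstream assets, is an assumption this sketch makes because the examples above don't cover that case. Magnitude is left out to keep it short.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical cut-off for what counts as "high" downstream impact.
HIGH_IMPACT_THRESHOLD = 10

# Expected response per level, taken from the guidelines above.
RESPONSE = {
    Severity.LOW: "Add to the backlog and fix by the end of the week.",
    Severity.MEDIUM: "Notify stakeholders and fix by the end of the day.",
    Severity.HIGH: "Stop everything and fix right away.",
}


def classify(on_critical_path: bool, downstream_count: int) -> Severity:
    """Map critical-path status and downstream impact to a severity level."""
    if on_critical_path:
        # Assumption: critical-path issues are always high severity.
        return Severity.HIGH
    if downstream_count >= HIGH_IMPACT_THRESHOLD:
        return Severity.MEDIUM
    return Severity.LOW


severity = classify(on_critical_path=False, downstream_count=25)
print(severity.value, "->", RESPONSE[severity])
```

Writing the rules down as code forces the ambiguous cases into the open, which is where teams usually disagree.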
Clearly define owners and expectations
It should be clear who should look at an issue. Ideally, you have well-defined ownership that maps individuals or teams to data assets, and these teams are well organised around how to address issues.
But what should happen if an issue occurs on the weekend or after work hours? It’s fine to decide that you don’t act on issues outside of work hours, but you should be explicit about it. If you don’t make an explicit choice, you’re implicitly asking the people who care the most to take on these issues without being rewarded for it.
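A sketch of what an explicit policy could look like in code: an ownership map plus a single setting that states which severities, if any, are handled outside work hours. The team names, the 9-to-17 weekday window and the policy itself are hypothetical; an empty `OUT_OF_HOURS_SEVERITIES` would encode "we never act outside work hours".

```python
from datetime import datetime

# Hypothetical ownership map from data asset to owning team.
OWNERS = {"fct_revenue": "finance-data", "stg_orders": "core-data"}
DEFAULT_OWNER = "data-platform"

# The explicit choice: severities that are acted on outside work hours.
OUT_OF_HOURS_SEVERITIES = {"high"}


def is_work_hours(ts: datetime) -> bool:
    """Assumption: work hours are Monday to Friday, 09:00 to 17:00."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def route_issue(asset: str, severity: str, detected_at: datetime):
    """Return the owning team and whether to notify now or wait."""
    owner = OWNERS.get(asset, DEFAULT_OWNER)
    if is_work_hours(detected_at) or severity in OUT_OF_HOURS_SEVERITIES:
        return owner, "notify_now"
    return owner, "queue_until_next_work_day"


# A Saturday at noon: only high-severity issues reach the owner immediately.
print(route_issue("fct_revenue", "high", datetime(2024, 1, 6, 12)))
print(route_issue("stg_orders", "low", datetime(2024, 1, 6, 12)))
```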
Have guidelines for who to communicate to
If you’ve defined clear owners, they often know best which stakeholders to notify when there’s an issue. You should have clear guidelines for how this works. For example, you may have a rule to notify the leadership team of issues with high magnitude on business-critical data. Or you may have a Slack group with your dashboard users where you announce issues impacting their dashboards. Whichever way you do it, be consistent and proactive so stakeholders don’t discover data issues on their own and start losing trust.
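The two example rules above could be written down roughly as follows. The channel names and the dashboard-to-channel mapping are hypothetical, and the function only decides where to announce an issue; it does not send anything.

```python
# Hypothetical channel for the leadership team.
LEADERSHIP_CHANNEL = "#leadership-data-incidents"

# Hypothetical mapping from dashboard to the Slack group of its users.
DASHBOARD_CHANNELS = {
    "ceo_dashboard": "#exec-dashboard-users",
    "marketing_kpis": "#marketing-dashboard-users",
}


def channels_to_notify(is_business_critical, high_magnitude, impacted_dashboards):
    """Return the channels to announce an issue in, without duplicates."""
    channels = []

    # Rule 1: high-magnitude issues on business-critical data go to leadership.
    if is_business_critical and high_magnitude:
        channels.append(LEADERSHIP_CHANNEL)

    # Rule 2: announce in the user group of every impacted dashboard.
    for dashboard in impacted_dashboards:
        channel = DASHBOARD_CHANNELS.get(dashboard)
        if channel and channel not in channels:
            channels.append(channel)

    return channels


print(channels_to_notify(True, True, ["ceo_dashboard"]))
```

Keeping the rules in one place makes it easier to apply them the same way every time, which is what keeps communication consistent.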
How to get started with severity levels
Here are four practical steps to get started with severity levels:
- Define your business-critical data explicitly, but start small: If you ask different stakeholders, they may all say that their data is important, and then you’re back where you started. Instead, start by defining the smallest possible set of data assets that are most critical for your business
- Be consistent: Interpret what’s important the same way everywhere, so different teams don’t make their own interpretations
- Automate severity as part of your workflow: Knowing whether a critical data asset is impacted, and what the downstream impact is, should be automated and fit into your existing workflows
- Don’t overthink it: Avoid making this a month-long project. Instead, get started by defining your handful of most critical data assets and systematise how you act depending on the severity
If you’ve done this well, severity levels should save you time and help you proactively catch and address more critical data issues.