
— Written by Mikkel Dengsøe in Articles — 1/20/2023

Designing severity levels for data issues

Things to consider when setting severity levels, how to act based on different severity levels and simple steps to getting started

We’ve had to solve more data issues than we can count. Some have been minor issues that didn’t have any material business impact, and others have meant that we had to drop everything to fix a critical issue.

Dealing with these two types of issues requires very different approaches, but too often the difference is not explicitly defined. This leads to negative side effects, such as important issues not being acted on fast enough or unimportant data issues derailing the data team.

In this article, we’ll walk through:

  • Things to consider when setting severity levels
  • How to act on different severity levels
  • How to get started

Things to consider when setting severity levels

Well-designed severity levels create clear expectations for what happens when there is an issue and ensure that everyone is on the same page.

How to assess the severity of an issue

We recommend using three parameters to assess the severity of an issue:

  1. Critical use case: Does the data have a business-critical use case?
  2. Downstream impact: How many downstream assets and users are affected?
  3. Magnitude: What is the impact on the underlying data?

You should aim to be able to assess all three within 5 minutes of being made aware of an issue.

There are a few more steps to take to confidently understand the severity of an issue.

Think of data importance as a chain

It’s not enough to know if an issue happens on a data asset with a business-critical use case. You should also know if the issue is on the critical path. Any issue on a data asset that sits upstream of business-critical data is on the critical path and should be treated as such.
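
One way to find the critical path is to walk the lineage graph backwards from each business-critical asset. The sketch below assumes lineage is available as a simple mapping from each asset to the assets that read from it; the asset names and the `critical_path` helper are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders", "orders_debug"],
    "orders": ["revenue_dashboard"],
    "orders_debug": [],
    "revenue_dashboard": [],
}
# Hypothetical set of assets with a business-critical use case.
CRITICAL = {"revenue_dashboard"}


def critical_path(downstream, critical):
    """Return every asset that is critical or sits upstream of a critical asset."""
    # Invert the edges so we can walk from each critical asset back to its sources.
    upstream = {}
    for parent, children in downstream.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)

    on_path = set(critical)
    queue = deque(critical)
    while queue:
        asset = queue.popleft()
        for parent in upstream.get(asset, []):
            if parent not in on_path:
                on_path.add(parent)
                queue.append(parent)
    return on_path


print(sorted(critical_path(DOWNSTREAM, CRITICAL)))
# ['orders', 'raw_orders', 'revenue_dashboard', 'stg_orders']
```

In this example, an issue on `raw_orders` is on the critical path even though no one uses it directly, while `orders_debug` is not.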


Severity should be managed across the stack

To confidently assess the severity of an issue, you need to look across the stack. For example, if you don’t consider the BI layer, you may miss that a data issue impacts a dashboard that half the company uses. Or you may miss that a data model has a critical use case because the marketing team uses it in a Hightouch sync to decide which users to send a campaign to.
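
As a rough illustration of looking across the stack, the sketch below counts the downstream assets of a model per layer, so that impact on BI and reverse ETL tools is not hidden behind a warehouse-only view. The lineage mapping, the layer labels and the `downstream_by_layer` helper are all assumptions made for this example.

```python
from collections import Counter, deque

# Hypothetical lineage spanning the warehouse, the BI tool and a reverse ETL sync.
DOWNSTREAM = {
    "stg_users": ["users"],
    "users": ["growth_dashboard", "campaign_audience_sync"],
    "growth_dashboard": [],
    "campaign_audience_sync": [],
}
# Hypothetical layer label for each asset.
LAYER = {
    "stg_users": "warehouse",
    "users": "warehouse",
    "growth_dashboard": "bi",
    "campaign_audience_sync": "reverse_etl",
}


def downstream_by_layer(asset, downstream, layer):
    """Count all direct and indirect downstream assets of `asset`, grouped by layer."""
    seen = set()
    queue = deque([asset])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return Counter(layer[name] for name in seen)


impact = downstream_by_layer("stg_users", DOWNSTREAM, LAYER)
for layer_name, count in sorted(impact.items()):
    print(f"{layer_name}: {count}")
```

Counting only the warehouse layer here would report a single impacted model and miss both the dashboard and the campaign sync.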

Automate your severity levels

Whenever you have a data issue, the last thing you want to do is to have to manually go through all the steps above. Instead, it should be automated and built into your existing workflows.

For example, if you receive a Slack alert about a test failure, it should highlight whether the issue is on a data model with a critical use case or on a model that sits upstream of one. It should also show the total number of assets impacted downstream and give you a sense of the issue’s magnitude.
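
A minimal sketch of what such an alert could contain is shown below. It only builds the message text and does not call any Slack API. The `FailureContext` fields, the `format_alert` helper and the use of "share of rows failing" as a magnitude measure are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FailureContext:
    """Hypothetical context gathered automatically when a test fails."""
    model: str
    has_critical_use_case: bool
    critical_assets_downstream: list = field(default_factory=list)
    downstream_count: int = 0
    failing_row_share: float = 0.0  # assumed magnitude measure: share of rows failing


def format_alert(ctx):
    """Render the alert text with criticality, downstream impact and magnitude."""
    lines = [f"Test failure on {ctx.model}"]
    if ctx.has_critical_use_case:
        lines.append("Critical use case: yes")
    elif ctx.critical_assets_downstream:
        # The model is not critical itself but feeds critical assets.
        lines.append("On critical path: feeds " + ", ".join(ctx.critical_assets_downstream))
    else:
        lines.append("Critical use case: none found")
    lines.append(f"Downstream assets impacted: {ctx.downstream_count}")
    lines.append(f"Magnitude: {ctx.failing_row_share:.0%} of rows failing")
    return "\n".join(lines)


print(format_alert(FailureContext(
    model="stg_orders",
    has_critical_use_case=False,
    critical_assets_downstream=["revenue_dashboard"],
    downstream_count=14,
    failing_row_share=0.12,
)))
```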

How to act on issues based on severity levels

To act quickly and with confidence on data issues, you should be able to answer the following questions:

[Image: checklist of questions to answer when acting on a data issue]

With this checklist in mind, we recommend following these steps to act confidently on issues.

Be systematic when acting on different levels of severity

Have guidelines for how data issues should be dealt with. For example:

  • Low: Add to the backlog to fix it by the end of the week. E.g. non-critical issue with low downstream impact
  • Medium: Let stakeholders know and fix the issue by the end of the day. E.g. non-critical issue with high downstream impact
  • High: Stop everything you’re doing to fix the issue right away. E.g. critical use case with high downstream impact
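
The example guidelines above can be encoded as a small lookup so that every issue gets the same treatment. This is a minimal sketch, not a prescribed implementation: the threshold of 10 downstream assets for "high" impact is an assumption, and so is treating a critical issue with low downstream impact as medium, a case the example guidelines don't cover. Magnitude is left out because the examples don't say how it should shift the level.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Assumption: 10 or more downstream assets counts as "high downstream impact".
HIGH_IMPACT_THRESHOLD = 10

# Expected response per level, taken from the example guidelines.
RESPONSE = {
    Severity.LOW: "Add to the backlog and fix by the end of the week",
    Severity.MEDIUM: "Let stakeholders know and fix by the end of the day",
    Severity.HIGH: "Stop everything and fix right away",
}


def classify(on_critical_path: bool, downstream_count: int) -> Severity:
    """Map criticality and downstream impact to a severity level."""
    high_impact = downstream_count >= HIGH_IMPACT_THRESHOLD
    if on_critical_path and high_impact:
        return Severity.HIGH
    if on_critical_path or high_impact:
        # Non-critical with high impact is medium per the guidelines.
        # Critical with low impact is not covered there; medium is an assumption.
        return Severity.MEDIUM
    return Severity.LOW


severity = classify(on_critical_path=False, downstream_count=25)
print(severity.value, "->", RESPONSE[severity])
```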

Clearly define owners and expectations

It should be clear who should look at an issue. Ideally, you have well-defined ownership that maps individuals or teams to data assets, and these teams are well organised around how to address issues.

But what should happen if an issue occurs on the weekend or outside work hours? It’s fine to decide that you won’t act on issues outside work hours, but be explicit about it. If you don’t make that choice explicitly, you implicitly leave the people who care the most to pick up these issues without being rewarded for it.
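
One way to make both ownership and the out-of-hours choice explicit is to write them down as configuration. The sketch below is illustrative only: the owner mapping, the working hours, the policy values and the `route` helper are all assumptions, and a team could just as well decide that every level waits until the next working day.

```python
from datetime import datetime

# Hypothetical mapping from data assets to owning teams.
OWNERS = {"orders": "team-finance-data"}

# Assumed working time: Monday to Friday, 09:00 to 16:59.
WORK_DAYS = range(0, 5)
WORK_HOURS = range(9, 17)

# Explicit out-of-hours policy per severity level (assumed values).
OUT_OF_HOURS_POLICY = {"low": "wait", "medium": "wait", "high": "page_on_call"}


def route(asset, severity, now):
    """Return the owning team and the action to take for an issue raised at `now`."""
    owner = OWNERS.get(asset, "unowned")
    in_hours = now.weekday() in WORK_DAYS and now.hour in WORK_HOURS
    action = "notify_owner" if in_hours else OUT_OF_HOURS_POLICY[severity]
    return owner, action


# A high-severity issue on a Saturday morning follows the written policy.
print(route("orders", "high", datetime(2023, 1, 21, 10, 0)))
```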

Have guidelines for who to communicate to

If you’ve defined clear owners, they often know best which stakeholders to notify when there’s an issue. You should have clear guidelines for how this works. For example, you may have a rule to notify the leadership team of issues with high magnitude on business-critical data. Or you may have a Slack group with your dashboard users where you announce issues impacting their dashboards. Whichever way you do it, be consistent and proactive, so stakeholders don’t discover data issues on their own and start losing trust.
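
The two example rules above could be written down as a small function, so the same people are notified every time. The channel names and the `who_to_notify` helper are hypothetical; the sketch only returns a list of targets and does not send anything.

```python
# Hypothetical channel names.
LEADERSHIP_CHANNEL = "#leadership"
DASHBOARD_CHANNELS = {"revenue_dashboard": "#revenue-dashboard-users"}


def who_to_notify(owner, business_critical, high_magnitude, impacted_dashboards):
    """Return the notification targets for an issue, owner first."""
    targets = [owner]
    # Rule 1: high-magnitude issues on business-critical data go to leadership.
    if business_critical and high_magnitude:
        targets.append(LEADERSHIP_CHANNEL)
    # Rule 2: announce the issue to the users of each impacted dashboard.
    for dashboard in impacted_dashboards:
        channel = DASHBOARD_CHANNELS.get(dashboard)
        if channel and channel not in targets:
            targets.append(channel)
    return targets


print(who_to_notify(
    owner="team-finance-data",
    business_critical=True,
    high_magnitude=True,
    impacted_dashboards=["revenue_dashboard"],
))
# ['team-finance-data', '#leadership', '#revenue-dashboard-users']
```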

How to get started with severity levels

Four practical steps to getting started with severity levels.

  1. Explicitly define your business-critical data, but start small: If you ask different stakeholders, they may all say that their data is important, and then you’re back where you started. Instead, start by defining the smallest possible set of data assets that are most critical for your business
  2. Be consistent: Interpret what’s important in the same way everywhere, so you avoid different teams making their own interpretations
  3. Automate severity as part of your workflow: Understanding whether a critical data asset is impacted, and what the downstream impact is, should be automated and fit into your existing workflows
  4. Don’t overthink it: Avoid making this a month-long project. Instead, get started by defining your handful of most critical data assets and systematise how you act depending on the severity

If you’ve done this well, you should find that severity levels save you time and help you proactively catch and address more critical data issues.