As the data stack grows in complexity, it’s no longer possible for one person to keep everything in their head, and more often than not, the person who notices a problem is not the right person to fix it. At the same time, the number of upstream and downstream dependencies has exploded, making it challenging to locate the right upstream owner or notify impacted stakeholders. Well-defined ownership and incident management processes help by clarifying who’s responsible and how they’re notified, and by streamlining incident and issue response.
While there’s no one-size-fits-all answer to ownership and incident management, working through the four steps below will set you up well whether you’re a five-person data team or a data team at a Fortune 500 company.
Ownership can be daunting, as it’s both a technical and a cultural challenge. When ownership works well, boundaries of responsibility are clear, and ownership is acted on – both within and outside the data team. When it doesn’t, it’s only sporadically defined and rarely acted on.
Whether you’re just getting started with ownership or already have ownership processes in place, we recommend thinking through these steps.
In an ideal world, you’d neatly group your stack into well-defined areas with clear boundaries. In reality, ownership lines can get blurry, so don’t be discouraged if you can’t easily assign ownership to every asset. Data rarely starts or ends with the data team. Instead, data is ingested from first- and third-party sources, loaded and transformed in the data warehouse, and exposed to end-users in a BI tool or in use cases such as ML/AI.
This is how we manage ownership at SYNQ: we define owners in dbt using the meta: owner tag and dbt groups, and we enforce that ownership tags are set using CI checks. More than 50% of our data assets are already encapsulated into data products, making ownership straightforward to define.
The data lineage of an internal data product at SYNQ. Ownership can be mapped and overlaid across all dependencies
The result is that ownership of data is managed across the entire company, not just by the analytics engineering team.
Use groups that already exist in your company, such as Google Groups or Slack channels formed around existing teams. These will always stay up to date as people join or leave, without you having to manage separate groups for data ownership.
Below are step-by-step instructions for how to define ownership.
Data products should be your starting point–while you may not want to set ownership for all your data assets, we recommend you at least do it for important data. Having assets with clear ownership makes it more likely that the right people act on issues or are notified if the data they use is wrong. If you’ve defined data products with well-established priority levels, the highest-priority data products are a great place to start.
Ownership definitions will closely follow your data product definitions. For example:
Set ownership at too high a level and you risk that no individual takes responsibility. For example, defining the owner as “data-team” is often too broad to act on. Setting ownership at the individual level creates a lot of accountability but little scalability: people move to different teams, go on holiday, or leave the company.
At SYNQ, we assign ownership based on teams and their associated Slack channels. Where possible, we use dbt groups so we only have to keep ownership metadata, such as the Slack channel, updated in one place. As all alerts on input sources go to one channel, we also assign individuals to the sources they own, so they’re tagged in Slack to bring attention to those issues.
Use this option if you’ve already organized your dbt project or data warehouse schemas to resemble your ownership structure, such as marketing, finance, and operations. With folder-based ownership, you typically need less time to get set up, and as you add new data models, they fall into an existing owner group by design, reducing upkeep. If you work with non-technical stakeholders who don’t contribute to your code base, such as business analysts or data stewards, this approach is also easier for them to navigate.
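As a sketch, folder-level owners can be set once in dbt_project.yml using the meta config (the project and folder names below are illustrative):

dbt_project.yml

models:
  acme_analytics:
    marketing:
      +meta:
        owner: "marketing-analytics"   # every model under models/marketing inherits this owner
    finance:
      +meta:
        owner: "finance-data"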
dbt has built-in support for designating owners using the meta: owner tag. Owners are displayed in dbt Docs, making it easy for everyone to see who’s responsible.
models:
  - name: users
    meta:
      owner: "analytics-engineering"
You can extend this to dbt sources to assign ownership to upstream teams. If issues occur on sources, before any data transformations, it’s a sign that the upstream team should own the issue.
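For example, a source-level owner could be declared like this (the salesforce source and sales-ops team are illustrative):

models/staging/salesforce/sources.yml

sources:
  - name: salesforce
    meta:
      owner: "sales-ops"   # upstream team that owns the raw data
    tables:
      - name: accounts
      - name: opportunities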
An added benefit is that this approach lets you use CI checks, such as check-model-tags from the pre-commit-dbt package, to ensure that each data model has an owner assigned.
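Since owner lives under meta in the example above, the package’s check-model-has-meta-keys hook is a close fit. Below is a minimal sketch of a pre-commit configuration, assuming the package is installed as dbt-checkpoint (formerly pre-commit-dbt); the revision shown is illustrative:

.pre-commit-config.yaml

repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint   # formerly pre-commit-dbt
    rev: v1.2.0   # illustrative – pin to the release you use
    hooks:
      - id: check-model-has-meta-keys
        args: ["--meta-keys", "owner", "--"]   # fail CI if a model has no owner meta key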
With dbt 1.5, dbt launched support for groups. Groups are helpful for larger dbt projects where you want to encapsulate parts of the internal logic so they’re only accessible to models within that group – similar to how you’d only expose certain endpoints in a public API to end-users. If a model’s access property is private, only models within its group can reference it.
models/marts/finance/finance.yml

groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@acme.com
      slack: finance-data
      github: finance-data-team

models/schema.yml

models:
  - name: finance_private_model
    access: private
    config:
      group: finance

  # in a different group!
  - name: marketing_model
    config:
      group: marketing
There are situations where you want to manage ownership across multiple tools – from databases to multiple data warehouses and dashboarding tools. This is useful when you want to find dashboards owned by specific teams or build out capabilities to notify downstream impacted stakeholders when you have a data incident. Managing cross-tool ownership in code can be difficult as there’s often no coherent way to define this. Tools such as a data catalog or data reliability platform are built for this.
Data ownership doesn’t have to stop with the data team. Below, we’ll look at ways you can notify: (1) the data team, (2) upstream teams, and (3) business stakeholders.
One of the top pitfalls we see is when teams spend a lot of time mapping out and defining ownership, but let it sit stale on a Confluence page that gradually gets out of sync with reality.
Managing ownership within the data team is the most straightforward: your team is in control, and the tools are typically within the stack you manage. You can use your existing ownership definitions to ensure the right owner knows about the right issue. The two most effective ways to do this, assuming you use a communication tool like Slack, are tagging owners and routing alerts based on your ownership definitions.
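The exact mechanics depend on your alerting tool, but the idea is that the destination of an alert is derived from the ownership metadata you already maintain, not from a separate list that drifts out of date. A hypothetical routing config (the format below is illustrative, not any specific tool’s schema) might look like:

alert_routing.yml

default_channel: "#data-team"        # fallback for assets without a matching owner
routes:
  - owner: "finance"                 # matches the dbt group / meta: owner value
    slack_channel: "#finance-data"
    mention: "@finance-oncall"       # tag the owning team on critical failures
  - owner: "analytics-engineering"
    slack_channel: "#analytics-engineering"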
As a rule of thumb, you’ll see the most impact once your data team moves past a handful of people, and everybody no longer has full visibility into all data assets.
We typically see two kinds of upstream teams that produce data and need to be alerted differently: technical teams, such as engineering, and non-technical teams, such as a SalesOps team owning Salesforce customer data.
a. Technical teams – the alerts you send don’t need to look different from the data team example above. If you’ve placed tests at your sources and detect issues before any data transformations, engineers should be able to connect the dots between the source and the error message and trace the issue back to their systems. For larger organizations with a clear split between teams that ingest data (e.g., data platform) and teams that produce data (e.g., frontend engineers), it can be helpful to complement the error message with details about which event or service it relates to.
b. Non-technical teams – bringing ownership of source system quality to non-technical teams is underrated. Too often, tedious input errors, such as an incorrect customer amount or a duplicate employee_id, end up with the data team to debug, triage, and route to the right owner. With the right context, these teams can start owning this without the data team being involved.
Starting to send alerts to an upstream team is one thing; getting them to consistently act on them is another. “Before we started to send alerts to our operations and lending team about faulty customer records, we got our COO to buy into the initiative. Only then were we able to ensure that the team was incentivized to act on and prioritize alerts from the data team” – Rupert, data lead at LendInvest
Sometimes, you can alert your stakeholders to notify them of issues proactively with alerts similar to those you send to the data team. This works best if your stakeholders are data-savvy teams, such as a group of analysts in a business domain.
Sending Slack alerts to your marketing director that five rows are failing a unique test on the orders data model is not the best idea.
Unless the stakeholder is technical, the best way to notify impacted stakeholders is most often for the person with the relevant context to “declare” the incident. In most cases, we recommend that the data owner notify the stakeholder directly and link to an ongoing incident page so they can follow along with the resolution. Another option is displaying issues directly in dashboards so end-users are aware of them – but this can be riskier and harder to interpret for non-technical users, so we recommend using it with caution.
At SYNQ, our Technical Account Managers rely on a ‘Usage Report’ data product to see usage across our customers’ workspaces. As our Technical Account Managers are technical stakeholders, alerts on or upstream of this data product are sent directly to the #tech-ops channel. This helps them be the first to know if data is unreliable and avoid making decisions based on incorrect information.
If you spam a Slack channel with alerts, chances are people will stop paying attention to them. Preventing alert overload is addressed throughout the data reliability workflow – from architecture design to monitoring and testing deployment to alerting and ownership rules. Not all issues have to be sent to the #data-team Slack channel. A better workflow is to be deliberate about what’s sent to the main alerting channel. Issues on less critical data assets can be sent to a different channel, or not sent as alerts at all and managed in a weekly backlog review.
With adequate ownership in place, you’re well-positioned to start streamlining responses to issues and adopting incident management for more severe issues.
If done well and combined with well-defined ownership definitions, this has several benefits–(1) Time to resolution is shortened as issues are brought to the right owners with the relevant context. (2) Important issues are prioritized based on established incident response processes. (3) Time spent resolving data issues is reduced as ownership lines are less blurred. (4) You start building an institutionalized way of adopting learnings from incidents and systematically improving.
There’s no set way to manage issues and incident response, and you should always adapt to the ways of working in your company. That being said, when adopting incident management for rapid response in data teams, we recommend working through these five steps.
Start by defining expectations within the data team for what it means to be on call. For smaller data teams, this may be the sole data person being on “data responder duty” for the week. For others, it may mean that relevant owners address issues within a predefined SLA as they come up. For teams owning core business processes, it may involve being paged or notified outside business hours, closely tied to the SLA definitions of your data products. We recommend you consider these expectations across the following groups:
If the ownership routing rules you’ve set up don’t apply out of hours, you can create a “@slack-responder” group that only includes the people on call, to avoid tagging the entire data team when issues occur out of hours.
If you’re not specific about expectations for on-call (e.g., we only look at issues within business hours), people will adopt different expectations, which can create an uneven workload across the team.
At SYNQ, our data platform powers core in-product functionality such as our Quality Analytics tab. If there are issues here, we strive to detect the issue within an hour and resolve it within a few hours.
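If it helps to make these expectations concrete, they can be written down per priority level. The structure and values below are purely illustrative and should be tied to your own data product SLAs:

oncall_expectations.yml

priorities:
  P1:
    coverage: "24/7"                     # paged outside business hours
    detect_within: "1h"
    resolve_within: "4h"
  P2:
    coverage: "business hours"
    detect_within: "4h"
    resolve_within: "next business day"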
An incident begins when something goes wrong—whether it’s a critical dbt job failing, a table no longer receiving new data, or a SQL test for data integrity failing. At this point, it’s still just an issue, and you should be able to achieve these three objectives:
Without sufficient testing and monitoring in place, you’ll be caught on the back foot, only learning about issues when there’s a critical system failure or when stakeholders detect issues directly.
The chapter Testing & Monitoring goes into more detail about how to set up the right monitoring to help you avoid this.
When an alert is triggered, you should assess the situation thoroughly. The alert may not provide all the context needed, and a system failure might be connected to other relevant issues. At this stage, you should have an understanding of all other open issues and how they’re connected. Internally at SYNQ, we always aim to be able to answer three questions:
Example summary card–system-wide issue impacting critical data products
Answering these questions provides the context needed to evaluate the incident’s urgency and decide whether it warrants a full incident response. This triage step helps separate critical issues from minor ones that don’t require full incident management.
An efficient triage workflow should alert relevant team members using your ownership definitions and offer a comprehensive view of failures and their context. This enables data engineers and analysts to assess issue severity and interrelations effectively.
Linking incidents to data products, or other business-critical datasets, is especially useful, as it highlights the potential impact of each issue and automatically identifies affected datasets, streamlining the triage process. This can otherwise be nearly impossible to do, especially if you have many hundreds of data assets downstream of an issue.
Once you’re able to answer these questions, we recommend that you set clear expectation levels closely tied to your on-call setup. At SYNQ, we use the following benchmarks for MTTD (mean time to detect) and MTTR (mean time to resolve).
SYNQ’s internal MTTD and MTTR metrics
If your data or engineering team already uses an incident management tool like PagerDuty, Opsgenie, or incident.io, we recommend linking your data-related issues to it, so that an alert is automatically tied into existing workflows and core metadata about the impact is brought directly into those platforms.
While there’s no universal approach to handling data-related incidents, a structured and well-documented incident management process significantly helps make the response smoother and more effective.
Not all issues should become incidents, but erring on the side of declaring too many rather than too few gives you a traceable log of issues, so if something goes wrong, you can go back and look at what happened last time. You can always adjust the incident severity and prioritize it accordingly.
Creating a dedicated space for communication (such as a document, a Slack channel, or an incident in an external tool) ensures that stakeholders are kept in the loop and avoids having to switch between multiple apps and tabs, allowing you to focus your efforts on identifying and resolving the root cause of the issue.
At SYNQ, we use these steps for coordinating key communication and root cause analysis.
Each of these steps will be much easier with well-defined data products and ownership in place–you’ll know who to communicate or escalate issues to, you’ll be able to see whether important data products are impacted downstream, and you’ll have a log to trace previous related incidents and how they were resolved.
It can be tempting to set an incident aside once resolved, but doing so may mean missing valuable insights and failing to establish processes to prevent it from reoccurring—or leaving someone else to grapple with a similar issue later on.
For lower-severity incidents, such as a test failure, follow-up checklists with assigned task owners can help ensure that guardrails are put in place to prevent recurrence. For more critical incidents, like a critical data product failure, a postmortem review can offer deeper insights and guide process improvements.
INC-123: Data Product Incident Title
ℹ️ Key Information                    ⏱️ Timestamps
+---------------+-------------+       +-------------------+-------+
| Data Product  | …           |       | Reported at       | …     |
| Priority      | P1, P2, …   |       | Impact started at | …     |
| SLA           | 99.9%, …    |       | Resolved at       | …     |
+---------------+-------------+       +-------------------+-------+

👪 Ownership & Teams                  ⏳ Durations
+---------------+-------------+       +-------------------+-------+
| Product Owner | …           |       | Time to Identify  | …     |
| Slack Channel | #team-name  |       | Time to Fix       | …     |
+---------------+-------------+       +-------------------+-------+

🪞 Related Incidents                  🔗 Useful Links
+--------------------+                +--------------------+
| INC-456            |                | Incident Homepage  |
| INC-789            |                | Slack Channel      |
+--------------------+                +--------------------+

🖊️ Summary
+------------------------------------------------------------+
| Summary of incident impact on "Marketing Attribution" data |
| product affecting downstream assets…                       |
+------------------------------------------------------------+

📆 Incident Timeline
+------------+-------+---------------------------+
| Date       | Time  | Event                     |
+------------+-------+---------------------------+
| 2024-01-01 | 12:00 | Reported by Data Engineer |
| 2024-01-01 | 12:30 | Priority escalated to P1  |
| 2024-01-02 | 12:00 | Incident Resolved         |
+------------+-------+---------------------------+

⬆️ Root Cause                         ⬇️ Mitigators
+----------------------------+        +------------------------------+
| - Missing upstream checks  |        | - Add source freshness       |
| - No quality assurance     |        | - Raise priority of source   |
|   in source system         |        |   data product               |
+----------------------------+        +------------------------------+

👀 Risks                              ⏩ Follow-up Actions
+----------------------------+        +------------------------------+
| - Low ownership of         |        | - Add review with core       |
|   source data quality      |        |   engineering team           |
| - Weak completeness SLIs   |        | - Raise priority of issue    |
+----------------------------+        +------------------------------+
Example template for a post-mortem of a critical data incident
Additionally, a periodic review of incident trends can reveal patterns—such as recurring errors in specific data assets—or highlight imbalances, such as specific upstream teams being responsible for a disproportionately large share of issues.
You should tie these metrics together with your wider data reliability workflow and be able to measure metrics such as:
In the chapter Continuous improvement, we’ll look more into how you can establish feedback loops and learning processes to continuously enhance data reliability practices.