Ownership & Alerting
As the data stack grows in complexity, it’s no longer possible for one person to keep everything in their head, and more often than not, the person who notices a problem is not the right person to fix it. At the same time, the number of upstream and downstream dependencies has exploded, making it challenging to locate the right upstream owner or notify impacted stakeholders. Well-defined ownership and incident management processes help by clarifying who’s responsible and how they’re notified, and by streamlining the response to incidents and issues.
While there’s no one-size-fits-all answer to ownership and incident management, working through the four steps below will set you up well whether you’re a five-person data team or a data team at a Fortune 500 company.
- Getting started with ownership
- Defining ownership
- Activating ownership by alerting the right people with the right context
- Adopting incident management in your data team for rapid response
Getting started with ownership
Ownership can be daunting, as it’s both a technical and a cultural challenge. When ownership works well, boundaries of responsibility are clear and ownership is acted on – both within and outside of the data team. When it doesn’t, it’s only sporadically defined and rarely actioned.
Whether you’re just getting started with ownership or have existing owner processes in place already, we recommend thinking through these steps.
- Integrate metadata–before you begin, you need a central place where ownership is defined. At its simplest, this may mean using dbt’s built-in metadata; in a more sophisticated setup, it may mean bringing all your data assets together – from source systems to data warehouse tables and BI tools – in a tool such as a data catalog.
- Define data products–start by identifying your most important data assets as data products. That’s where the stakes are highest, and they should be your starting point for defining ownership.
- Assign ownership–assign ownership to responsible teams or individuals, ideally reusing existing ownership structures such as Google Groups.
- Deploy data controls–with ownership definitions in place, strategically place monitors based on the owners’ domain knowledge of the data they own.
- Notify relevant owners–activate ownership through relevant alerting or escalations to incident management tooling.
Defining ownership
In an ideal world, you’d neatly group your stack into well-defined areas with clear boundaries. In reality, ownership lines can get blurry, so don’t be discouraged if you can’t easily assign ownership to all assets. Data rarely starts or ends with the data team: it’s ingested from 1st and 3rd party data sources, loaded and transformed in the data warehouse, and exposed to end-users in a BI tool or in use cases such as ML/AI.
This is how we manage ownership at SYNQ:
- Well-defined ownership at the input layer–we ensure that ownership at sources is clearly defined so that escalation paths to upstream engineering teams are unambiguous, and so that they can be notified directly of issues on source systems before any data transformations are done.
- Clearly defined boundaries in the staging and transformation layer–our analytics engineers assign ownership of models through dbt metadata definitions using the owner meta tag and dbt groups. We enforce that ownership tags are set using CI checks.
- Stakeholder ownership on consumer-facing marts–for our consumer-facing data marts and products, we’ve organized them into mart folders based on their use case and assigned relevant owners, such as our Technical Account Management team, who get notified if there are issues with data they rely on.
More than 50% of the data assets above are already encapsulated in data products, making ownership seamless to define.
The data lineage of an internal data product at SYNQ. Ownership can be mapped and overlaid across all dependencies
The result is that we manage ownership of data across the entire company and not just as something that’s owned by the analytics engineering team.
Tip: Use existing owner groups
Use groups that already exist in your company, such as Google Groups or Slack team channels formed around existing teams. These stay up to date as people join or leave, without you having to manage separate groups for data ownership.
Below are step-by-step instructions for how to define ownership.
Defining ownership based on data products
Data products should be your starting point–while you may not want to set ownership for all your data assets, we recommend you at least do it for important data. Having assets with clear ownership makes it more likely that the right people act on issues or are notified if the data they use is wrong. If you’ve defined data products with well-established priority levels, the highest-priority data products are a great place to start.
Ownership definitions will closely follow your data product definitions, for example (a sketch follows the list below):
- The Marketing Attribution Data Product will be owned by marketing data
- The Users Data Product will be owned by analytics engineering
- The ARR Data Product will be owned by finance data
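To make this concrete, the mapping can be captured in a simple config file. Below is a minimal sketch – the file name (data_products.yml), field names, priorities, and channel names are purely illustrative, not a standard format:

data_products.yml (hypothetical)
data_products:
  - name: marketing_attribution
    priority: P1
    owner: marketing-data
    slack_channel: "#marketing-data-alerts"
  - name: users
    priority: P1
    owner: analytics-engineering
    slack_channel: "#analytics-engineering-alerts"
  - name: arr
    priority: P1
    owner: finance-data
    slack_channel: "#finance-data-alerts"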
Tip: Ownership should be set at the right level
Set ownership at too high a level and you risk that no individual takes responsibility. For example, defining the owner as “data-team” risks the definition being too broad to act on. Setting ownership at an individual level creates a lot of accountability but little scalability: people move to different teams, go on holiday, or leave the company.
At SYNQ, we assign ownership based on teams and their associated Slack channels. Where possible, we use dbt groups so we only have to keep ownership metadata, such as the Slack channel, updated in one place. As all alerts on input sources go to one channel, we also assign individuals to the sources they own, so they’re tagged in Slack to bring attention to these issues.
Defining ownership using metadata
Use existing folder structures to tie into your existing architecture design
Use this option if you’ve already organized your dbt project or data warehouse schemas to resemble your ownership structure, such as marketing, finance, and operations. With folder-based ownership, you typically need less time to get set up, and as you add new data models, they, by design, fall into an existing owner group, reducing your upkeep. If you work with non-technical stakeholders who don’t contribute to your code base, such as business analysts or data stewards, this approach makes it easier for them.
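For example, owners can be set once per folder in dbt_project.yml so that every model in that folder inherits them – a minimal sketch, assuming a project named acme_analytics with marts organized by domain:

dbt_project.yml
models:
  acme_analytics:
    marts:
      marketing:
        +meta:
          owner: "marketing-data"
      finance:
        +meta:
          owner: "finance-data"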
Use dbt owner meta tags to manage ownership through code
dbt has built-in support for designating owners using the meta: owner tag. Owners are displayed in dbt Docs, making it easy for everyone to see who’s responsible.
models:
  - name: users
    meta:
      owner: "analytics-engineering"
You can extend this to dbt sources to define ownership for upstream teams. If issues happen on sources before any data transformations, it indicates that upstream teams should own the issue.
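For example, ownership can be declared directly on a source definition – a minimal sketch, assuming a Salesforce source owned by a SalesOps team (the source, table, and team names are illustrative):

models/staging/salesforce/sources.yml
sources:
  - name: salesforce
    meta:
      owner: "sales-ops"    # upstream team, outside the data team
    tables:
      - name: accounts
      - name: opportunities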
An added benefit is that this approach lets you use CI checks, such as check-model-tags from the pre-commit-dbt package, to ensure that each data model has an owner assigned.
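A sketch of such a CI check using the pre-commit-dbt hooks is shown below. For tag-based ownership, the check-model-tags hook fits; for an owner key under meta, check-model-has-meta-keys is the closer match. The rev shown is illustrative – pin it to the version you actually use:

.pre-commit-config.yaml
repos:
  - repo: https://github.com/offbi/pre-commit-dbt
    rev: v1.0.0    # illustrative – pin to your version
    hooks:
      - id: check-model-has-meta-keys
        args: ["--meta-keys", "owner", "--"]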
Use dbt groups to enable intentional collaboration
With dbt 1.5, dbt introduced support for groups. Groups are helpful for larger dbt projects where you want to encapsulate parts of the internal logic so they’re only accessible to members of that group – similar to how you’d only expose certain endpoints in a public API to end-users. If a model’s access property is private, only owners within its group can reference it.
models/marts/finance/finance.yml
groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@acme.com
      slack: finance-data
      github: finance-data-team
models/schema.yml
models:
  - name: finance_private_model
    access: private
    config:
      group: finance

  # in a different group!
  - name: marketing_model
    config:
      group: marketing
Defining cross-tool ownership
There are situations where you want to manage ownership across multiple tools – from databases to multiple data warehouses and dashboarding tools. This is useful when you want to find dashboards owned by specific teams or build out capabilities to notify downstream impacted stakeholders when you have a data incident. Managing cross-tool ownership in code can be difficult as there’s often no coherent way to define this. Tools such as a data catalog or data reliability platform are built for this.
Notifying the right people
Data ownership doesn’t have to stop with the data team. Below, we’ll look at ways you can notify: (1) the data team, (2) upstream teams, and (3) business stakeholders.
One of the top pitfalls we see is teams spending a lot of time mapping out and defining ownership, only to let it sit stale on a Confluence page that gradually gets out of sync with reality.
Managing alerts within the data team
Managing ownership within the data team is the most straightforward: your team is in control, and the tools are typically within the stack you manage. You can use your existing ownership definitions to ensure the right owner knows about the right issue. Assuming you use a communication tool like Slack, the two most effective ways to do this are tagging owners and routing alerts based on your ownership definitions (see the sketch after this list):
- Tagging owners – associate owners with Slack handles to tag groups or individuals and drive awareness of issues.
- Routing alerts – tie Slack channels with ownership and send alerts to the relevant team’s channel. This is a great way to overcome alert overload in the central channel.
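As an illustration, the routing rules most teams end up with look roughly like this – a hypothetical config, not the syntax of any specific tool:

alert_routing.yml (hypothetical)
routes:
  - owner: analytics-engineering
    slack_channel: "#analytics-engineering-alerts"
    tag: "@analytics-engineering"
  - owner: finance-data
    slack_channel: "#finance-data-alerts"
    tag: "@finance-data"
default:
  slack_channel: "#data-team"    # fallback for assets without an owner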
As a rule of thumb, you’ll see the most impact once your data team moves past a handful of people, and everybody no longer has full visibility into all data assets.
Notifying upstream teams
We typically see two kinds of upstream teams that produce data and need to be alerted differently: technical teams, such as engineering, and non-technical teams, such as a SalesOps team owning Salesforce customer data.
a. Technical teams – the alerts you send don’t need to look different from the data team example above. If you’ve placed tests at your sources and detect issues before any data transformations, engineers should be able to connect the dots between the source and the error message and trace the issue back to their systems. For larger organizations with a clear split between teams that ingest data (e.g., data platform) and teams that produce data (e.g., frontend engineers), it can be helpful to complement the error message with details about what event or service it relates to (see the sketch after this list).
b. Non-technical teams – bringing ownership of quality of source systems to non-technical teams is underrated. Too often, tedious input errors such as an incorrect customer amount or a duplicate employee_id end up on the data team to debug, triage, and find the right owner. With the right context, these teams can start owning this without the data team being involved.
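One lightweight way to carry that extra context is to attach it as metadata on the source itself, so it can be surfaced in the alert – a sketch, with illustrative source, service, and channel names:

models/staging/events/sources.yml
sources:
  - name: product_events
    meta:
      owner: "backend-engineering"
    tables:
      - name: checkout_completed
        meta:
          service: "checkout-service"    # which service emits this event
          team_channel: "#checkout-eng"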
Case study: The cultural challenge of upstream ownership
Starting to send alerts to an upstream team is one thing; getting them to consistently act on those alerts is another. “Before we started to send alerts to our operations and lending team about faulty customer records, we got our COO to buy into the initiative. Only then were we able to ensure that the team was incentivized to act on and prioritize alerts from the data team” – Rupert, data lead at LendInvest
Notifying stakeholders
Sometimes, you can alert your stakeholders to notify them of issues proactively with alerts similar to those you send to the data team. This works best if your stakeholders are data-savvy teams, such as a group of analysts in a business domain.
Sending Slack alerts to your marketing director that five rows are failing a unique test on the orders data model is not the best idea.
Unless the stakeholder is technical, the best way to notify impacted stakeholders is most often for the person with the relevant context to “declare” the issue. In most cases, we recommend that the data owner notify the stakeholder directly and link to an ongoing incident page so they can follow along with the issue resolution. Another option is displaying issues directly in dashboards so end-users are aware of them – but this can be risky and difficult to interpret for non-technical users, so we recommend using it with caution.
At SYNQ, our Technical Account Managers rely on a ‘Usage Report’ data product to see the usage across our customers’ workspaces. As our Technical Account Managers are technical stakeholders, alerts on or upstream of this data product are sent directly to the #tech-ops channel. This helps them be the first to know if data is unreliable and not make decisions based on incorrect information.
Tip: Beware of alert overload
If you spam a Slack channel with alerts, chances are people will stop paying attention to them. Preventing alert overload cuts across the entire data reliability workflow – from architecture design to monitoring and testing deployment to alerting and ownership rules. Not all issues have to be sent to the #data-team Slack channel. A better workflow is to be deliberate about what’s sent to the main alerting channel; issues on less critical data assets can be sent to a different channel, or not sent as alerts at all and managed in a weekly backlog review.
Adopting incident management
With adequate ownership in place, you’re well-positioned to start streamlining responses to issues and adopting incident management for more severe issues.
If done well and combined with well-defined ownership definitions, this has several benefits: (1) time to resolution is shortened as issues are brought to the right owners with the relevant context; (2) important issues are prioritized based on established incident management response processes; (3) time spent resolving data issues is reduced as ownership lines are less blurred; (4) you start building an institutionalized way of adopting learnings from incidents and systematically improving.
There’s no set way to manage issues and incident response, and you should always adapt to the ways of working in your company. That said, when adopting incident management for rapid response in a data team, we recommend working through these five steps:
- Getting the data team on call
- Detecting issues
- Triaging issues
- Handling the incident
- Post-incident analysis
Getting the data team on call
Start by defining expectations for what it means for the data team to be on call. For smaller data teams, this may be the sole data person being on “data responder duty” for the week. For others, it may mean that relevant owners address issues within a predefined SLA as they come up. For teams owning core business processes, it may involve being paged or notified outside business hours, closely tied to the SLA definitions of your data products. We recommend you consider on-call responsibilities across these groups:
- Overseeing data-related incidents (e.g., P1 data product errors or failures in dbt core pipelines)
- Managing smaller failures such as a dbt test warning to ensure they’re addressed promptly and maintain their relevance
If your existing ownership activation rules don’t work out of hours, you can create a “@slack-responder” group containing only the people currently on call, to avoid tagging the entire data team when issues occur out of hours.
Tip: Be explicit about on-call expectations
If you’re not specific about expectations for being on call (e.g., we only look at issues within business hours), people will start adopting different expectations, which can create an uneven workload across the team.
At SYNQ, our data platform powers core in-product functionality such as our Quality Analytics tab. If there are issues here, we strive to detect the issue within an hour and resolve it within a few hours.
Detecting issues
An incident begins when something goes wrong – whether it’s a critical dbt job failing, a table no longer receiving new data, or a SQL test for data integrity breaking. At this point, it’s still just an issue, and you should be able to achieve these three objectives (an example follows the list below):
- Detect the issue promptly through predefined tests and monitors
- Alert the appropriate person through the ownership model you’ve defined
- Provide relevant context for resolution
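A minimal dbt sketch of the detection side – source freshness to catch tables that stop receiving data, plus tests for data integrity – with illustrative source, table, and column names:

models/staging/payments/sources.yml
sources:
  - name: payments
    loaded_at_field: _loaded_at    # illustrative load-timestamp column
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: transactions
        columns:
          - name: transaction_id
            tests:
              - not_null
              - unique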
Without sufficient testing and monitoring in place, you’ll be caught on the back foot, only learning about issues when you have critical system failures or when stakeholders detect issues directly.
The chapter Testing & Monitoring goes into more detail about how to set up the right monitoring to help you avoid this.
Triaging issues
When an alert is triggered, you should assess the situation thoroughly. The alert may not provide all the context needed, and a system failure might be connected to other relevant issues. At this stage, you should have an understanding of all other issues and how they’re connected. Internally at SYNQ, we always aim to be able to answer three questions:
- Scope–Is this an isolated failure, or is it related to other issues?
- Impact–What is the potential effect on critical data products and assets?
- Severity–Does this involve data being unavailable, corrupted, or unreliable?
Example summary card–system-wide issue impacting critical data products
Answering these questions provides the context needed to evaluate the incident’s urgency and decide whether it warrants a full incident response. This triage step helps separate critical issues from minor ones that don’t require full incident management.
An efficient triage workflow should alert relevant team members using your ownership definitions and offer a comprehensive view of failures and their context. This enables data engineers and analysts to assess issue severity and interrelations effectively.
Linking incidents to data products, or to your other business-critical datasets, is especially useful, as it highlights the potential impact of each issue and automatically identifies affected datasets, streamlining the triage process. This can otherwise be nearly impossible to do, especially if you have many hundreds of data assets downstream of an issue.
Once you’re able to answer these questions, we recommend setting clear expectation levels closely tied to your on-call setup. At SYNQ, we use the following benchmarks for our MTTD (mean time to detect) and MTTR (mean time to resolve).
Severity | MTTD (mean time to detect) | MTTR (mean time to resolve)
---|---|---
P1 | 1h | ASAP (hours)
P2 | 12h | 1d
P3 | 24h | 3d
P4 | 24h | 7d
SYNQ’s internal MTTD and MTTR metrics
If your data team or engineering team already uses an incident management tool like PagerDuty, Opsgenie, or incident.io, we recommend linking your data-related issues to it, so that an alert is automatically connected to existing workflows and core metadata about the impact is brought directly into those platforms.
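As an illustration, the link usually boils down to a severity-based escalation rule along these lines – a hypothetical config, since the exact setup depends on the incident management tool you use:

escalation_rules.yml (hypothetical)
escalations:
  - severity: P1
    create_incident: true    # open an incident in PagerDuty / Opsgenie / incident.io
    attach:
      - affected_data_products
      - downstream_owners
  - severity: P3
    create_incident: false
    route_to: "#data-team-backlog"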
Handling the incident
While there’s no universal approach to handling data-related incidents, a structured and well-documented incident management process significantly helps make the response smoother and more effective.
Tip: When to declare an incident
Not all issues should be incidents, but erring on the side of declaring too many rather than too few gives you a traceable log of issues, so that if something goes wrong, you can go back and look at what happened last time. You can always adjust the incident severity and prioritize accordingly.
Creating a dedicated space for communication (such as a document, a Slack channel, or an incident in an external tool) ensures that stakeholders are kept in the loop and avoids having to switch between multiple apps and tabs, allowing you to focus your efforts on identifying and resolving the root cause of the issue.
At SYNQ, we use these steps for coordinating key communication and root cause analysis.
- Inform stakeholders promptly, focusing on the owners of critical data assets impacted.
- Organize data team efforts by clearly tracking who is responsible for each part of the issue.
- Attach GitHub pull requests to monitor the fixes made and their deployment status.
- Look for similar past incidents to learn from how they were handled.
- Document key steps taken and insights gained during the root cause analysis.
Each of these steps is much easier with well-defined data products and ownership in place: you’ll know who to communicate or escalate issues to, you’ll be able to see whether important data products are impacted downstream, and you’ll have a log to trace previous related incidents and how they were solved last time.
Post-incident analysis
It can be tempting to set an incident aside once resolved, but doing so may mean missing valuable insights and failing to establish processes to prevent it from reoccurring—or leaving someone else to grapple with a similar issue later on.
For lower-severity incidents, such as a test failure, follow-up checklists with assigned task owners can help ensure that guardrails are put in place to prevent recurrence. For more critical incidents, like a critical data product failure, a postmortem review can offer deeper insights and guide process improvements.
INC-123: Data Product Incident Title
ℹ️ Key Information ⏱️ Timestamps
+--------------------+-------------+ +-----------------------+---------------+
| Data Product | … | | Reported at | … |
| Priority | P1, P2, … | | Impact started at | … |
| SLA | 99.9%, … | | Resolved at | … |
+--------------------+-------------+ +-----------------------+---------------+
👪 Ownership & Teams ⏳ Durations
+--------------------+-------------+ +-----------------------+---------------+
| Product Owner | … | | Time to Identify | … |
| Slack Channel | #team-name | | Time to Fix | … |
+--------------------+-------------+ +-----------------------+---------------+
🪞 Related Incidents 🔗 Useful Links
+--------------------------+ +--------------------------+
| INC-456 | | Incident Homepage |
| INC-789 | | Slack Channel |
+--------------------------+ +--------------------------+
🖊️ Summary
+------------------------------------------------------------+
| Summary of incident impact on "Marketing Attribution" data |
| product affecting downstream assets… |
+------------------------------------------------------------+
📆 Incident Timeline
+------------+-------+---------------------------+
| Date       | Time  | Event                     |
+------------+-------+---------------------------+
| 2024-01-01 | 12:00 | Reported by Data Engineer |
| 2024-01-01 | 12:30 | Priority escalated to P1  |
| 2024-01-02 | 12:00 | Incident Resolved         |
+------------+-------+---------------------------+
⬆️ Root Cause ⬇️ Mitigators
+--------------------------+ +----------------------------+
| - Missing upstream checks | | - Add source freshness |
| - No quality assurance | | - Raise priority of source |
| in source system | | data product |
+--------------------------+ +----------------------------+
👀 Risks ⏩ Follow-up Actions
+--------------------------+ +----------------------------+
| - Low ownership of | | - Add review with core |
| source data quality | | engineering team |
| - Weak completeness SLIs | | - Raise priority of issue |
+--------------------------+ +----------------------------+
Example template for a post-mortem of a critical data incident
Additionally, a periodic review of incident trends can reveal patterns – such as recurring errors in specific data assets – or highlight imbalances, such as specific upstream teams being responsible for a disproportionately large share of issues.
You should tie these metrics together with your wider data reliability workflow and be able to measure metrics such as:
- Mean time to resolution – the average time from an issue being detected to it being resolved, broken down by severity
- # of issues and incidents by team – are there specific teams or team members with an outsized number of incidents, and how do these trace back (e.g., are some source systems systematically triggering incidents outside the data team’s control)?
- Issue-to-incident rate – to identify low-signal data controls that could potentially be removed
In the chapter Continuous improvement, we’ll look more into how you can establish feedback loops and learning processes to continuously enhance data reliability practices.