A service I wish to will into existence would help us make better use of our observability work (our traces, metrics, and logs) in incident response during an outage.
- I want the service to be part of a channel where we’re working on an outage, in Slack or Microsoft Teams or in some incident response tool, so the agent can see what is being worked on and get context of what we’re looking at.
- I want the service to continue helping as we close the incident and proceed to follow up on the issues noted.
- I want the service to guide us to do more extensive followup after an incident, not just the most proximate causes.
I want it to offer proactive help during the stressful work for the team, for example:
- Unprompted interjection of new information.
- Handle comms and quell panic.
- Guide us to declare the end of the Incident.
- Help draft the Postmortem.
- Make the Postmortem better.
1. Unprompted interjection of new information
I want the agent to call out outlier metrics, traces, and so on which we don’t appear to already be aware of. It is important to set the threshold appropriately: repeatedly interrupting us about things we already know would be unwelcome, but interjecting about something relevant which we don’t appear to have noticed could be a lifesaver. I think this part of the functionality is already being addressed, for example by Honeycomb Canvas. It is a natural progression for an observability tool.
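To make the threshold idea concrete, here is a minimal sketch of how such a filter might look. It is purely illustrative and does not describe how Honeycomb Canvas or any other product works. The function names, the robust z-score cutoff of 3.5, and the naive "was the metric name mentioned in the channel" check are all assumptions; a real agent would need a far better notion of what responders already know.

```python
from statistics import median


def robust_z(history, latest):
    """Robust z-score of `latest` against `history`, using median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: any deviation at all is treated as extreme.
        return 0.0 if latest == med else float("inf")
    return 0.6745 * (latest - med) / mad


def unnoticed_outliers(metrics, channel_messages, threshold=3.5):
    """Return names of outlier metrics nobody in the channel has mentioned.

    metrics: hypothetical mapping of name -> (recent history, latest value).
    channel_messages: text of messages seen so far in the incident channel.
    """
    chatter = " ".join(channel_messages).lower()
    flagged = []
    for name, (history, latest) in metrics.items():
        if abs(robust_z(history, latest)) < threshold:
            continue  # within the usual range, nothing to say
        if name.lower() in chatter:
            continue  # responders already appear to know about this one
        flagged.append(name)
    return flagged


metrics = {
    "checkout_latency_ms": ([100, 102, 98, 101, 99, 100, 103], 400),
    "db_connections": ([50, 52, 49, 51, 50, 48, 51], 51),
    "queue_depth": ([10, 12, 11, 9, 10, 11, 10], 90),
}
messages = ["checkout_latency_ms is through the roof"]
print(unnoticed_outliers(metrics, messages))
```

Raising `threshold` makes the agent quieter; lowering it makes it chattier. The interesting tuning work is in the second check, deciding what counts as "already known."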
2. Handle comms and quell panic
Someone needs to reflect pithy summaries of what is happening to a Slack or Teams channel for the rest of the company, to provide reassurance and quell panic. We write SOPs to emphasize Comms, but people in the thick of it get focused on the firefighting and sometimes neglect to do so. Summarizing the state of the response and adjusting its level of detail for a broader and less technical audience seems like something a suitable LLM could do.
3. Guide us to declare the end of the Incident
Sometimes a problem just trails off once we’ve addressed enough of its causes, but we may not decide to close the Incident until well after the point where keeping it open is really warranted. I'd like the service to let us know that things appear to be on the path to normalcy and estimate how long before the changes it is seeing would bring us back into the usual range.
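One simple way to produce such an estimate is to fit a straight line to recent samples of a recovering metric and extrapolate to where it re-enters the usual range. The sketch below assumes a metric where high values are bad, a hypothetical `normal_max` bound for the usual range, and a linear recovery; real recoveries are often not linear, so treat the result as a rough hint rather than a prediction.

```python
def minutes_to_normal(samples, normal_max):
    """Estimate minutes from the last sample until the metric is back in range.

    samples: list of (minute, value) pairs, oldest first.
    normal_max: assumed upper bound of the usual range for this metric.
    Returns 0.0 if already in range, or None if the metric is not trending down.
    """
    if len(samples) < 2:
        return None
    t_last, v_last = samples[-1]
    if v_last <= normal_max:
        return 0.0

    # Ordinary least-squares fit of value against time.
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # flat or getting worse: no recovery to extrapolate

    intercept = mean_v - slope * mean_t
    t_cross = (normal_max - intercept) / slope
    return max(0.0, t_cross - t_last)


# Error rate falling by 100 per minute, usual range tops out at 200.
recovering = [(0, 900), (1, 800), (2, 700), (3, 600)]
print(minutes_to_normal(recovering, normal_max=200))
```

Returning `None` when the trend is flat or worsening matters as much as the estimate itself: the agent should stay quiet about closing the Incident unless the data actually points toward normalcy.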
4. Help draft the Postmortem
After the incident, the agent should help us write the Postmortem. Tooling often focuses only on the straightforward parts of that: the summarized description of the problem and a timeline, especially if it can annotate the timeline with graphs of impacted metrics and relevant details.
That would be dandy and would save us time, and help get the Postmortem out with less delay, which is important, to be sure. We want to let people give feedback and contribute to the Postmortem while everything is still fresh. I think this part of the functionality is already being addressed, for example by incident.io.
but also...
5. Make the Postmortem better
Vastly more valuable in the Postmortem would be to help us extract more actionable followups: not just the immediate triggers, but as many things which could be better as we can find. If the agent can see negative trends in metrics or traces which do not appear to be a direct result of what we’ve identified as the root cause, it should point them out. Being able to implement more fixes and improve the system's robustness without needing to suffer through another incident first would be very valuable.
Most of the incident-focused products and work that I've seen focus on the description, the timeline, and the supporting data in a Postmortem, which is dandy and saves time in the writing of it, but those are part of the Archaeology when what I really want to focus on is the Future. Even for things we might not be positioned to take on for a while, we could still try to address them proactively before they happen again.
Postmortem Culture
We do our best to design in redundancy and robustness and build reliable systems, but we always end up responding to failures which we didn't adequately control for, and improving the reliability of the system over time as it operates. One of the primary tools to do this is the Postmortem, where we describe a problem which happened and list what we are going to do about it.
We want to learn as much from every incident as we can. We want to address as many weaknesses in our system as we can, without having an outage for every one of them. If we can identify more things which went wrong, things which were perhaps not the primary problem but nonetheless still a problem, we accelerate the process of improvement. Making maximal use of Postmortems to improve the system is Postmortem Culture.
Every outage starts with stepping on a rake and being hit in the face. We should be able to look past the rake which just hit us in the face, and look around for nearby rakes which we haven't stepped on yet. Postmortem Culture is the rakes we did not step on.
Existing Products
1. Honeycomb Canvas is an existing product in this space, particularly the live assistance during an incident using observability data.
2. incident.io is another product in this space, especially in helping to draft postmortems — the Archaeology part of the postmortem, at least. incident.io is evolving from an on-call and incident management tool, not an OpenTelemetry collector. The assistance it can currently provide during the incident is more in looking for patterns with prior incidents.
3. There are a few products which describe themselves as a Virtual SRE team, though I don't really like that term. An experienced SRE team is a hugely valuable resource and the tools I've seen are at best automating a small part of what SRE would do. I'll write more about these kinds of products as I learn more about them.