Incidents may be one of the best measures of maturity, effectiveness, and progress in any highly operational environment, including but not limited to security operations and technology operations (including site reliability engineering, or SRE). Beyond measurement, incident management done right can be an invaluable tool that you can point at virtually any problem-prone or failure-prone system to make it better.

What you can learn from your incidents

If you have defined incident severity levels and follow even the most basic incident management practices (tracking and classifying incidents, and handling them with some consistency), your incidents will quickly become an invaluable way to learn and measure:

  • what’s going wrong
  • where it’s going wrong
  • how often it’s going wrong
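As a minimal sketch of how basic tracking answers these three questions, the Python below tallies a handful of incident records by severity, by affected system, and by month. The `Incident` fields, the `SEV1` to `SEV3` labels, and the sample records are assumptions made up for illustration, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields, chosen only for this illustration.
    opened: date
    severity: str  # assumed labels: "SEV1" (most severe) through "SEV3"
    system: str    # where it went wrong
    summary: str   # what went wrong


def summarize(incidents):
    """Count incidents by severity, by system, and by calendar month."""
    return {
        "by_severity": Counter(i.severity for i in incidents),
        "by_system": Counter(i.system for i in incidents),
        "by_month": Counter(i.opened.strftime("%Y-%m") for i in incidents),
    }


# Made-up sample data.
incidents = [
    Incident(date(2024, 1, 4), "SEV2", "checkout", "elevated error rate"),
    Incident(date(2024, 1, 19), "SEV3", "search", "slow queries"),
    Incident(date(2024, 2, 2), "SEV1", "checkout", "full outage"),
]

report = summarize(incidents)
print(report["by_system"].most_common(1))  # [('checkout', 2)]
```

Even a tally this simple is enough to start a regular report and a conversation about where problems cluster.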

If you’re doing more mature incident management, and in particular if you’re performing post-incident analysis, your incident data should help you understand:

  • why things are going wrong (root causes)
  • where things are going wrong repeatedly (patterns or hot spots)
  • signs that something is about to go wrong (leading indicators)
  • how adept you are at responding when things go wrong (performance)
  • whether you’re learning from and continuously improving based on all of the above (trending)
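Response performance is one of the easier items on this list to quantify. The sketch below assumes each incident record carries three timestamps (when the problem started, when it was detected, and when it was resolved) and computes the mean time to detect and the mean time to resolve. The record layout, metric names, and sample values are illustrative assumptions.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


def response_metrics(records):
    """records: list of (started, detected, resolved) datetime tuples."""
    time_to_detect = [minutes_between(s, d) for s, d, _ in records]
    time_to_resolve = [minutes_between(d, r) for _, d, r in records]
    return {
        "mean_time_to_detect": mean(time_to_detect),
        "mean_time_to_resolve": mean(time_to_resolve),
    }


# Made-up sample data: two incidents.
records = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 10), datetime(2024, 3, 1, 10, 10)),
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 14, 20), datetime(2024, 3, 8, 14, 50)),
]

metrics = response_metrics(records)
print(metrics)  # {'mean_time_to_detect': 15.0, 'mean_time_to_resolve': 45.0}
```

Computing the same metrics for each reporting period and comparing them over time is one simple way to approach the trending question.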

These lists are simplified, but they are representative of what you can expect to learn from your incidents.

Continuous improvement through incident management

Start small, and report regularly on as many aspects of your incidents as you can. If all you know is (1) that an incident occurred and (2) how to classify it against your defined severity levels, you can already start reporting on this information, driving transparency and discussion. Make it your first goal to account for incidents and share high-level data with your team.

Over time, you can mature your overall incident management practices. As you begin to perform more frequent and more thorough post-incident analysis, you can do things like:

  • capture playbooks to make your response to classes of incidents more repeatable
  • set goals to address areas of importance, ranging from things like improving your ability to observe system state and detect incidents in the first place to improving the performance of your response teams
  • evaluate trends and set thresholds for different levels of escalation and prioritization
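To make the last item concrete, here is a minimal sketch of threshold-based escalation: the number of incidents in a reporting window is mapped to an escalation level. The threshold values and level names are hypothetical; a real policy would come from your own trend data.

```python
# Hypothetical policy: (minimum incidents in the window, escalation level),
# ordered from the highest threshold to the lowest.
ESCALATION_THRESHOLDS = [
    (5, "executive review"),
    (3, "team lead review"),
    (1, "standard follow-up"),
]


def escalation_level(incident_count, thresholds=ESCALATION_THRESHOLDS):
    """Return the first level whose minimum the incident count meets."""
    for minimum, level in thresholds:
        if incident_count >= minimum:
            return level
    return "no action"


for count in (0, 1, 4, 7):
    print(count, escalation_level(count))
```

Keeping the thresholds in a plain data structure, rather than hard-coding them in the logic, makes it easier to revisit them as trends change.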

Using incidents to build resilience (in creative ways)

One neat thing about incidents is that they can and will be defined based on the types of things that you care about controlling. Some common types of incidents are security incidents (e.g., intrusions or insider threats) and operational incidents (e.g., outages or system degradation). In global organizations like airlines, incident management may involve detecting and responding to anything ranging from personnel availability to severe weather to geopolitical events.

Similarly, if you’re experiencing issues related to cost, you may declare an incident when certain cost-driving events, such as wild auto-scaling events or excessive data ingestion, take place. If your business depends heavily on in-person presence, you may declare incidents based on weather events, a global pandemic, and more.
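As one hedged illustration of a cost-driven incident definition, the sketch below compares current hourly spend against a baseline and returns a severity label when spend exceeds an assumed multiple of that baseline. The function name, the multipliers, and the `SEV` labels are all hypothetical choices, not recommended values.

```python
def classify_cost_event(hourly_spend, baseline_hourly_spend):
    """Return a severity label, or None if no incident should be declared.

    The multipliers below are illustrative assumptions.
    """
    if baseline_hourly_spend <= 0:
        raise ValueError("baseline_hourly_spend must be positive")
    ratio = hourly_spend / baseline_hourly_spend
    if ratio >= 5:
        return "SEV1"
    if ratio >= 2:
        return "SEV2"
    if ratio >= 1.5:
        return "SEV3"
    return None


# Made-up figures: spend has jumped to 2.5x the baseline.
print(classify_cost_event(hourly_spend=250, baseline_hourly_spend=100))  # SEV2
```

The same shape works for other incident types: pick a signal you care about, define what counts as abnormal, and map it onto the severity levels you already use.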

Because you can be flexible in how you define incidents and their severity while still being consistent in how they’re handled, organizations with great incident management practices will build valuable muscle for identifying and clearly defining critical events of all types, then leverage their incident-related systems and processes to develop organizational resilience.