Incidents: An organizational Swiss Army knife

Incidents may be one of the best measures of maturity, effectiveness, and progress in any highly operational environment, including but not limited to security operations and technology operations (including site reliability engineering, or SRE). However, incident management done right can be an invaluable tool that you can point at virtually any problem- or failure-prone system to make it better.

What you can learn from your incidents

If you have defined incident severity levels coupled with the most basic incident management practices–tracking and classifying incidents, and handling them with some consistency–they will quickly become an invaluable way to learn and measure:

  • what’s going wrong
  • where it’s going wrong
  • how often it’s going wrong

If you’re doing more mature incident management, and in particular if you’re performing post-incident analysis, your incident data should help you understand:

  • why things are going wrong (root causes)
  • where things are going wrong repeatedly (patterns or hot spots)
  • signs that something is about to go wrong (leading indicators)
  • how adept you are at responding when things go wrong (performance)
  • whether you’re learning from and continuously improving based on all of the above (trending)

These are overly simplistic, but very representative of the things you can expect to learn from your incidents.

Continuous improvement through incident management

Start small, and report regularly on as many aspects of your incidents as you can. If all that you know is 1) that an incident occurred and 2) you can classify the incident based on defined severity levels, then you can start to report on this information, driving transparency and discussion. Make your first goal being able to account for incidents and share high-level data amongst your team.

Over time, you can mature your overall incident management practices. As you begin to perform more frequent and more thorough post-incident analysis, you can do things like:

  • capture playbooks to make your response to classes of incidents more repeatable
  • set goals to address areas of importance, ranging from things like improving your ability to observe system state and detect incidents in the first place to improving the performance of your response teams
  • evaluate trends and set thresholds for different levels of escalation and prioritization

Using incidents to build resilience (in creative ways)

One neat thing about incidents is that they can and will be defined based on the types of things that you care about controlling. Some common types of incidents are security incidents (e.g., intrusions or insider threats) and operational incidents (e.g., outages or system degradation). In global organizations like airlines, incident management may involve detecting and responding to anything ranging from personnel availability to severe weather to geopolitical events.

However, if you’re experiencing issues related to cost, you may declare an incident when certain cost-driving events, such as wild auto-scaling events or excessive data ingestion take place. If your business depends heavily on in-person presence, you may declare incidents based on weather events, global pandemic, and more.

Because you can be flexible in how you define incidents and their severity while still being consistent in how they’re handled, organizations with great incident management practices will build valuable muscle for identifying and clearly defining critical events of all types, then leverage their incident-related systems and processes to develop organizational resilience.

Visibility, observability, detection, and mitigation in cybersecurity

The concepts of visibility, observability, detection, and mitigation are foundational to cybersecurity–security architecture and detection engineering in particular–and technology operations in general. They’re useful for communicating at almost every level, within technical teams but also to organizational peers and leadership.

I’ve found myself thinking about and explaining the relationship between them time and again, specifically as a model for how we enable and mature security operations.

alt

Visibility

Visibility is knowing that an asset exists.

Of these terms, visibility is the lowest common denominator. For assets that are visible and that we’re responsible for protecting, we then want to consider gaining observability.

Observability

Observability indicates that we’re collecting some useful combination of activity, state, or outputs from a given system.

Practically speaking, it means that we’re collecting system, network, application, and other relevant logs, and ideally storing them in a central location such as a log aggregator or SIEM.

We can’t detect what we can’t see, so observability is a prerequisite for detection. And while it may not be required for all mitigations (e.g., you may purchase a point solution that mitigates a particular condition but provides no observability), there’s a strong argument to be made that observability is foundational to any ongoing security operations function, of which mitigation activities are a key part.

Lastly, observability as a metric may be the most important of all. As a security leader, I can think of few metrics that are more useful for gauging whether assets or attack surface are manageable–simply knowing about an asset means little if you have no insight into where, how, and by whom it’s being used.

Read more in-depth thoughts on observability here.

Detection

Detection is the act of applying analytics or intelligence to observable events for the purpose of identifying things of interest. In the context of cybersecurity, we detect threats.

Detection happens in a number of ways (not an exhaustive list, and note that there are absolutely relationships between these):

  • Turnkey or out-of-the-box detection, typically in the form of an alert from a security product
  • Detection engineering, a process where the output is analytics aimed to make detection repeatable, measurable, and scalable
  • Threat hunting, a process where the output is investigative leads that must be run to ground (and should feed detection engineering)
  • More . . .

There are plenty of ways to measure detection. We have outcome-based measures like time to detect (MTTD), respond, and recover (together, the MTTRs). And we have internal or maturity measures, like detection coverage.

Mitigation

Mitigation is easiest to think of as a harm reduction measure taken in response to a detected threat.

Mitigations can be preventative, detective, or response measures. Mitigation in the form of early-stage prevention, such as patching a software vulnerability or wholesale blocking of undesirable data, is ideal but not always possible. There are plenty of cases where the best you can do is respond faster or otherwise contain damage.

In practice, mitigation is an exercise in looking at the things that happen leading up to and following threat detection:

  • Where did it come from?
  • What did it do?
  • What’s its next move?

And then figuring out how and at what points you’re able to disrupt the threat, in whole or in part.

An open source catalog of offensive security tools Permalink

From Gwendal Le Coguic (@gwen001 / @gwendallecoguic), offsec.tools is a fairly wide-ranging collection of offensive security tools. At the time of publication, it includes close to 700 tools, though some very popular free tools (e.g., mimikatz, impacket) are missing, and the project’s appetite for cataloging commericial tools (e.g., Pegasus, FinFisher, etc.) is unclear.

Roundup of commercial spyware, digitial forensics technology use by governments Permalink

From the Carnegie Endowment for International Peace:

The dataset provides a global inventory of commercial spyware & digital forensics technology procured by governments. It focuses on three overarching questions: Which governments show evidence of procuring and using commercial spyware? Which private sector companies are involved and what are their countries of origin? What activities have governments used the technology for?

The leaderboard is interesting, albeit predictable.