What is incident management?

Incident management, or incident response management, is the process of identifying and resolving incidents that threaten or disrupt a company’s IT services. As part of IT Service Management (ITSM), incident management aims to keep services running or, if they go offline, to restore them as quickly as possible. The primary goal of incident management is to minimize the impact an incident has on the business. In this article, we look at the phases and best practices of incident management that can help you reduce unnecessary and costly downtime in your company.

What is an IT incident?

According to the Information Technology Infrastructure Library (ITIL), an "incident" is an "unplanned interruption to an IT service or a reduction in the quality of an IT service." Based on this description, the term "incident" can be defined broadly – from a degradation of network quality to insufficient storage space or a cyberattack. Detecting and responding to security-related incidents is referred to as security incident management or incident response management.

Why incident management?

There are numerous approaches to incident management: policies, tools, and SLAs (Service-Level Agreements) can vary significantly from company to company. In general, IT teams try to prevent problems and incidents through regular software updates, event monitoring, and other methods. An Incident Response plan is typically established to quickly address and resolve incidents and determine their root cause to prevent them in the future.

Incident management is crucial because service disruptions can be extremely costly, amounting to hundreds of thousands of euros per hour – not to mention regulatory fines and customer losses.

Security incident response

What does security incident response include?

Responding to security incidents means detecting, analyzing, and resolving security threats or incidents in real time. This requires a combination of investigation and analysis – either computer-assisted or by personnel – to minimize or prevent negative impacts on the company.

The process usually begins with the security system informing the incident response team of an incident. The response team investigates and analyzes the incident to determine its authenticity and scope. The incident's impact is assessed, and a damage containment plan is developed as part of Incident Response Management.

What is a security incident?

A security incident can be anything from an active threat to a data security breach. Incidents can occur both inside and outside a company. For instance, an employee might cause an incident by accessing a gambling website on their work computer, or a vendor might download data they are not authorized to access. Malware attacks are also considered security incidents.

Besides troubleshooting, responding to an incident also includes preventive measures to thwart future attacks or incidents. For example, after the disclosure of the Heartbleed vulnerability and the EternalBlue exploit, administrators of affected companies immediately checked and secured their systems and IT infrastructure to prevent attackers from gaining access and compromising the systems again.

What is incident management according to ITIL?

The response to security incidents is a specific process within the broader Incident Response Management. According to ITIL, incident management deals with any "unplanned interruption of an IT service or reduction in the quality of an IT service." Disruptions or incidents can be caused by human or technical failure, security breaches, or various other events. The goal of incident management is to identify the cause of an incident, understand its impact and urgency, and establish a response to restore normal operations as quickly as possible.
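Impact and urgency are often combined into a single priority value through a lookup matrix. The following is a minimal sketch of that idea in Python. The three levels, the 3x3 matrix, and the priority numbers are assumptions chosen for illustration, not values prescribed by ITIL or by this article.

```python
from enum import IntEnum


class Level(IntEnum):
    """Illustrative rating scale for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical matrix: (impact, urgency) -> priority, where 1 is most critical.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.HIGH): 3,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}


def priority(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency rating."""
    return PRIORITY_MATRIX[(impact, urgency)]


# Example: an outage affecting many users (high impact) that has a
# workaround available (medium urgency).
example_priority = priority(Level.HIGH, Level.MEDIUM)
```

Keeping the matrix as explicit data rather than a formula makes it easy to adapt the mapping to your own SLAs.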

How does Security Incident Response relate to incident management?

Security Incident Response is a similar process to incident management but specifically applies to security incidents. A security incident can be an attempted intrusion, a policy violation, a malware infection, or another event that poses a threat to computer security. When an organization identifies a security incident, the incident response team – sometimes referred to as the CSIRT (Computer Security Incident Response Team) – assesses the scope, determines the necessary remediation steps, and executes them. Effective remediation is essential to prevent or mitigate the damages and liabilities that can arise from such incidents.

What are the phases of the Incident Response lifecycle?

The National Institute of Standards and Technology (NIST) describes four phases in the lifecycle of an incident response:

Preparation: the first phase helps companies identify risks to their systems and data, outline problem management strategies, and establish mechanisms for handling security incidents. This may include conducting a formal risk assessment or implementing tools and processes for incident analysis and containment. It also involves prioritizing threats, forming and training an incident response team, and creating an Incident Response plan in accordance with NIST lifecycle guidelines.
Detection and analysis: in this phase, the service operations department sets up systems for proactive monitoring, detection, prioritization, and analysis of high-priority incidents. The goal is to identify any irregular and suspicious threats or activities in the network environment that could disrupt workflows. Detection and analysis of incidents typically involve a combination of manual investigations and security tools that automate security processes. Automation and effective execution in this phase often help minimize the spread of the incident, resulting in less noticeable impact.
Containment, eradication, and recovery: the third phase deals with addressing security incidents. Containment aims to prevent further damage from an incident. For example, malware incidents can be halted by disconnecting the affected server from the network and implementing firewall rules to block the attacker. Security administrators, support staff, or Incident Managers remove the threat at the scene by eradicating the malware from the infected server and ensuring it is not present elsewhere in the system. Finally, support staff restore the system to its pre-infection state and re-establish service quality by reloading applications or restoring data from backups.
Post-incident activities: the fourth phase involves steps to prevent similar incidents from recurring. Based on the data collected from the incident and post-mortem meetings, the company determines how the incident occurred, which preventive measures need to be reinforced or introduced, how monitoring and notification processes can be improved, and how helpdesk and service requests, remediation, and recovery processes can be optimized. This phase also involves addressing compliance and regulatory issues.

Overall, the four phases build on one another: the effectiveness of phase three depends heavily on the success of phases one and two. To provide optimal protection through incident management and ensure the restoration of IT services in the company, all four phases must be successfully implemented together.
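One way to make the order of the four phases explicit in tooling is a small state model. The sketch below is illustrative only: the `Phase` and `advance` names are hypothetical, and the step from post-incident activities back to preparation is an assumption that reflects lessons learned feeding into the next round of preparation.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "preparation"
    DETECTION_AND_ANALYSIS = "detection and analysis"
    CONTAINMENT_ERADICATION_RECOVERY = "containment, eradication, and recovery"
    POST_INCIDENT_ACTIVITIES = "post-incident activities"


# Assumed transitions: each phase leads to the next one, and the last
# phase feeds back into preparation.
NEXT_PHASE = {
    Phase.PREPARATION: Phase.DETECTION_AND_ANALYSIS,
    Phase.DETECTION_AND_ANALYSIS: Phase.CONTAINMENT_ERADICATION_RECOVERY,
    Phase.CONTAINMENT_ERADICATION_RECOVERY: Phase.POST_INCIDENT_ACTIVITIES,
    Phase.POST_INCIDENT_ACTIVITIES: Phase.PREPARATION,
}


def advance(current: Phase) -> Phase:
    """Return the phase that follows the current one."""
    return NEXT_PHASE[current]


# Walk once around the lifecycle, starting from preparation.
phase = Phase.PREPARATION
for _ in range(len(Phase)):
    print(phase.value)
    phase = advance(phase)
```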

incident response lifecycle diagram

How to create a modern security incident response plan

Effective response to security incidents depends on having a strategy for effective Incident Response Management in place before an incident occurs. The ISO/IEC Standard 27035 outlines a five-part process for Incident Response Management:

Preparation for incident handling.
Identification and reporting of potential security incidents.
Assessment of incidents and decision on actions to be taken.
Response to incidents through containment, investigation, and remediation.
Documentation of key findings and lessons learned from each incident.

Each organization executes this plan in its own way. However, there are several best practices that can help tailor the response to security incidents to your company’s needs:

Inventory of all assets: determine which systems and data are most crucial for your business operations and establish the order in which they need to be investigated and restored following a security incident.
Assemble a Security Incident Response team: assign roles and responsibilities to team members and ensure inclusion of representatives from departments outside of IT, such as finance, operations, and legal, to communicate with the appropriate people during a security incident.
Search for security alerts: define what constitutes a security incident for your company so that you know what to look for. Develop policies on how they should be detected and reported.
Create an action plan for security incidents: this should include a list of all relevant tasks and the people responsible for them, based on the threat or specific incident. Test the plan to determine its effectiveness and refine it as needed.
Evaluate your team’s response: by analyzing the successes and failures of a response in terms of service delivery, you can improve the plan for the next incident.
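The asset inventory from the first practice above can be kept as simple structured data, so that the restoration order is available immediately during an incident. This is a minimal sketch in Python. The `Asset` type, the criticality scale, and the example assets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    criticality: int  # assumed scale: 1 = most critical to business operations


def restoration_order(assets):
    """Sort assets so the most business-critical ones come first.

    Ties are broken alphabetically by name to keep the order stable.
    """
    return sorted(assets, key=lambda asset: (asset.criticality, asset.name))


# Hypothetical inventory for illustration.
inventory = [
    Asset("intranet wiki", criticality=3),
    Asset("payment database", criticality=1),
    Asset("customer portal", criticality=2),
]

for asset in restoration_order(inventory):
    print(asset.criticality, asset.name)
```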

How are threats assessed and the appropriate response level determined?

The assessment of threats and the response to critical incidents varies from company to company. However, there are several best practices for categorizing and prioritizing incidents that can provide a framework for an effective and efficient process within Incident Response Management:

Identification: once an incident is confirmed, start collecting evidence. This includes analyzing log files and other data sources to identify compromised or infected endpoints.
Investigation: after gathering all evidence related to an incident, piece it together to understand the attacker's path. Following the course of the incident can also help determine the attacker’s goal.
Remediation: visualizing the attack path enables you to identify the most business-critical targets and prioritize your responses accordingly. You can remove the malware using the information gathered during the investigation and restore infected systems in order of their priority for your business operations.

Cybersecurity tools can support the assessment process and make it more effective. Automation and orchestration can relieve security teams or Incident Managers of the time-consuming task of collecting and analyzing data, allowing them to focus on investigating and resolving critical incidents.
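As an illustration of how part of the identification step can be automated, the following sketch scans log lines for repeated failed logins and lists the affected hosts. The log format, the field names (`host`, `event`), and the threshold are assumptions made for this example. Real log sources will differ, and dedicated security tools do far more than this.

```python
from collections import Counter


def parse_line(line: str) -> dict:
    """Parse 'key=value' tokens from a log line, skipping the leading timestamp."""
    fields = {}
    for token in line.split()[1:]:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def suspicious_hosts(lines, threshold=3):
    """Return hosts with at least `threshold` failed logins, sorted by name."""
    failures = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("event") == "login_failed" and "host" in fields:
            failures[fields["host"]] += 1
    return sorted(host for host, count in failures.items() if count >= threshold)


# Hypothetical log excerpt in an assumed 'timestamp key=value ...' format.
SAMPLE_LOG = [
    "2024-05-01T10:00:01 host=web-01 event=login_failed user=admin",
    "2024-05-01T10:00:03 host=web-01 event=login_failed user=admin",
    "2024-05-01T10:00:04 host=db-01 event=login_failed user=backup",
    "2024-05-01T10:00:05 host=web-01 event=login_failed user=root",
    "2024-05-01T10:00:09 host=web-01 event=login_ok user=admin",
]

print(suspicious_hosts(SAMPLE_LOG))
```

A failed-login count is only one possible signal. The point of the sketch is that evidence collection can be scripted so that analysts spend their time on the investigation itself.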

Incident management systems

What role does DevOps play in incident management?

DevOps supports security monitoring in software applications and the development environment through Incident Response Management. While ITIL provides guidance on incident management for ITSM, there is no official guide for DevOps teams. Instead, incident management in this context is based on the core principles of DevOps: overcoming organizational barriers, improving collaboration and transparency, and focusing on lean processes. This process can be summarized in a few steps:

Detection: DevOps incident response teams jointly identify system vulnerabilities and plan responses to potential incidents. Incident Managers also set up various monitoring tools and notification systems and maintain runbooks that describe what to do when an incident is detected.
Response: most DevOps incident management teams receive their information from monitoring tools, assess the severity and impact of the incident, and follow the runbook to escalate the issue to the appropriate contacts via the proper communication channels.
Remediation: the Incident Manager works with the relevant teams to fix the problem, restore systems and data, and return the application to normal operation.
Analysis: in this "wrap-up phase," the incident management team meets to share insights in a "blameless post-incident review." The goal is to improve systems and prevent similar incidents from recurring.
Readiness: incident management teams assess their readiness for the next incident, applying lessons learned from the blameless post-incident review. They adjust their monitoring and notification tools, update runbook processes and team responsibilities, discuss potential workarounds, and implement permanent fixes for the resolved issue in the development pipeline.
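The response step above, in which a team assesses severity and follows the runbook to escalate, can be partly expressed as data. The sketch below assumes a hypothetical alert with a `service` name and an `error_rate`, illustrative severity thresholds, and made-up contacts and channels. It is not tied to any specific monitoring or paging tool.

```python
# Hypothetical runbook: who to notify and through which channel, per severity.
RUNBOOK = {
    "critical": {"notify": ["on-call engineer", "incident manager"], "channel": "phone"},
    "major": {"notify": ["on-call engineer"], "channel": "chat"},
    "minor": {"notify": ["service desk"], "channel": "ticket"},
}


def classify(error_rate: float) -> str:
    """Map an error rate (0.0 to 1.0) to a severity using assumed thresholds."""
    if error_rate >= 0.5:
        return "critical"
    if error_rate >= 0.1:
        return "major"
    return "minor"


def escalate(alert: dict) -> dict:
    """Build an escalation record for an alert by following the runbook."""
    severity = classify(alert["error_rate"])
    step = RUNBOOK[severity]
    return {
        "service": alert["service"],
        "severity": severity,
        "notify": list(step["notify"]),
        "channel": step["channel"],
    }


# Example alert as it might arrive from a monitoring tool.
print(escalate({"service": "checkout", "error_rate": 0.6}))
```

Keeping the runbook as data means the readiness step can update contacts and thresholds without touching the escalation logic.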

What is a blameless post-incident review?

Blameless post-incident reviews are a crucial part of the incident lifecycle and Incident Response Management. DevOps teams need an open analysis of their incident response process to continuously improve operational efficiency. The blameless post-incident review allows for this analysis by examining both the technical and human shortcomings of their response efforts.

In a blameless post-incident review, members of the incident response team and others involved or affected by the incident come together to better understand the incident and prevent it from recurring. The review aims to identify tools and processes that can be improved, rather than assigning blame. This not only enables on-call staff and Incident Managers to act without hesitation during an incident but also leads to more innovative ideas and better applications.

What techniques are there for Incident Response Management for major security incidents within systems?

A prepared plan of action, as part of successfully implemented incident management, is the best way to endure the stress and uncertainty of a major incident. While ITIL provides a detailed guide to major incident management, the following steps also present a general framework for approaching any incident:

Collect all facts: before taking action, it is important to understand the nature and scope of the incident. Quickly determine which services and users are affected, the potential business impact, who is dealing with the issue, who needs to be notified, and whether the problem raises compliance or legal concerns.
Communicate with the right people: in the event of an incident, you need a list of contact information for the relevant individuals. Beyond the incident management team members, communicate with other stakeholders throughout the company, the user base of the affected service, and any relevant regulatory authorities.
Develop an action plan: based on the collected facts, the key teams must determine and implement the optimal response to the incident. The Incident Manager must coordinate all team activities and ensure the response plan is executed efficiently and in accordance with Incident Response Management guidelines.
Keep all parties informed: while teams work on the problem, the Incident Manager must regularly check on the status to ensure all deadlines are met. Simultaneously, they must proactively update other stakeholders on the progress.
Request approvals for emergency changes: once a solution for the incident is found, conduct tests to ensure it works. If needed, the Incident Manager must initiate the emergency change process so that response teams can quickly implement the fix.
Inform stakeholders that the problem is resolved: once the corrective action is identified and verified, a small user control group checks if the service functions properly. The incident response team then notifies everyone that the incident has been resolved.
Conduct a brief review: take the time to briefly recap with the teams what actions they took and what lessons they learned, ideally while the event is still fresh in everyone’s minds. Schedule a blameless post-incident review for a deeper evaluation once everything is running smoothly again.

Downtime risks: what costs can an incident actually cause?

According to a 2020 ITIC survey on the hourly cost of downtime, 40% of the companies surveyed indicated that a single hour of downtime can cost between $1 million and over $50 million – excluding legal fees, fines, or compliance penalties.

Data suggests that any interruption in work productivity, including downtime, can have massive impacts. A UC Irvine study shows that it takes about 23 minutes to regain concentration after a productivity interruption. While the actual costs associated with downtime due to an incident vary from company to company, it is generally known that a single system outage can cost a company millions of dollars – not including the related costs due to missed business opportunities, decreased productivity, and damaged reputation.
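To relate such hourly figures to your own environment, a rough direct-cost estimate multiplies the hourly cost of downtime by the length of the outage. The sketch below uses purely hypothetical numbers and ignores the indirect costs mentioned above, such as missed business opportunities and reputational damage.

```python
def downtime_cost(hourly_cost: float, outage_minutes: float) -> float:
    """Estimate the direct cost of an outage from an hourly cost figure."""
    if hourly_cost < 0 or outage_minutes < 0:
        raise ValueError("hourly_cost and outage_minutes must be non-negative")
    return hourly_cost * outage_minutes / 60


# Hypothetical example: 300,000 per hour and a 45-minute outage.
estimate = downtime_cost(hourly_cost=300_000, outage_minutes=45)
print(estimate)  # 225000.0
```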

System outages due to incidents are unavoidable for any company, but their frequency and impact can be significantly reduced by shifting incident management from a reactive to a proactive approach.

cost of system outage graph

What does MTTD/MTTR mean?

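MTTD stands for "Mean Time To Detect" or "Mean Time To Discover," and MTTR stands for "Mean Time To Respond" (the abbreviation is also commonly expanded as "Mean Time To Repair," "Mean Time To Recover," or "Mean Time To Resolve"). Both are metrics used to quantify the effectiveness of a team's incident management processes.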

MTTD: this is a key performance indicator for Incident Response Management. It measures how long a problem or incident exists before the organization or responsible parties become aware of it. A shorter MTTD indicates that the company experiences less downtime and disruption than it would with a longer MTTD. Moreover, the lower the MTTD, the lower the costs incurred by the organization due to downtime. Companies identify issues either through end-users reporting a failure to the service desk or through the various monitoring and management tools used in incident management.

MTTR: this is the average time required to repair a component or system and restore it to functionality. It reflects the maintenance level of a company’s equipment and the efficiency of the team in resolving IT incidents. MTTR starts the moment a fault or incident is detected and includes diagnosis time, repair time, testing, and all other activities until the service is back to normal. Together, MTTD and MTTR cover the full duration of an incident, from its onset to its resolution.

MTTR is important because it is a strong indicator of the costs of IT incidents. The higher the MTTR of an IT team, the greater the risk of significant downtime during IT disruptions, leading to business interruptions, lower customer satisfaction, and revenue loss.
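Both metrics are simple averages over past incidents. The following minimal sketch assumes that each incident record holds three timestamps: when the problem started, when it was detected, and when service was back to normal. The `IncidentRecord` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentRecord:
    started: datetime   # when the problem began
    detected: datetime  # when the team became aware of it
    resolved: datetime  # when service was back to normal


def mean_delta(deltas) -> timedelta:
    """Average a collection of timedeltas."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident record is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(records) -> timedelta:
    """Mean time from the start of an incident to its detection."""
    return mean_delta(r.detected - r.started for r in records)


def mttr(records) -> timedelta:
    """Mean time from detection until service is back to normal."""
    return mean_delta(r.resolved - r.detected for r in records)


# Hypothetical incident history.
history = [
    IncidentRecord(datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 10), datetime(2024, 5, 1, 10, 10)),
    IncidentRecord(datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 14, 30), datetime(2024, 5, 8, 15, 0)),
]

print(mttd(history))  # 0:20:00
print(mttr(history))  # 0:45:00
```

In this sample, the sum of MTTD and MTTR (65 minutes) equals the average total duration of the two incidents, which matches the way the two metrics are described above.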

First Steps

Examples of incident management tools you can implement to take protective measures

An Incident Response Management platform is the first line of defense in an incident. It provides essential support at every stage of the incident management process, including incident identification, logging, diagnosis and investigation, escalation, and problem resolution. Numerous platforms are available, and the one suitable for your incident management largely depends on the size and scope of your organization, compliance requirements, and budget considerations.

Implementing an effective incident response management strategy

The first step to implementing an effective incident management plan is to establish an incident response team, consisting of internal or external staff or a mix of both. Next, you need to define what constitutes an incident for your company and conduct a threat analysis by assessing potential threats, risks, and infrastructure failures.

Subsequently, you can design response plans for various scenarios, train employees or Incident Managers, and practice through simulated breaches to continuously improve your incident response.

Conclusion: effective incident management is essential for any company

With the growing complexity of IT environments and the increasing number and sophistication of threats, companies face unprecedented risks. Effective incident response management can mitigate this risk by enabling quicker detection and resolution of incidents. While outages and other incidents are inevitable for any company, incident management is the most effective way to initiate an immediate response and prevent costly downtimes that can jeopardize your company's reputation and bottom line.
