Technology leaders, can you describe the steps you took to resolve a significant IT system failure?

Question

When a significant IT system failure strikes, the expertise of technology leaders is crucial in navigating the crisis. From the perspective of a Chief Technology Officer and a Director of IT, we've compiled four insightful strategies. These range from conducting regular system vulnerability checks to following a clear incident management playbook, offering a roadmap for resolving such daunting challenges.

Rubens Basso · Answer

To be effective during a significant system failure, check your systems for vulnerabilities regularly so that your defense against IT failures is proactive rather than reactive. Use scanning tools to find potential gaps in protection, and monitoring tools to detect early signs of system failure or security flaws. Staying on top of your systems can greatly shorten recovery time and lessen the damage when an incident does occur.
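
As a rough illustration of the monitoring idea above, the sketch below flags metric readings that cross a warning threshold. The metric names, threshold values, and system names are all hypothetical; a real deployment would take readings from whatever monitoring tool is in use.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sampled metric for one system (all names are illustrative)."""
    system: str
    metric: str
    value: float


# Hypothetical warning thresholds; sensible values depend on your environment.
THRESHOLDS = {"disk_used_pct": 85.0, "cpu_used_pct": 90.0, "error_rate_pct": 2.0}


def find_warnings(readings):
    """Return the readings whose value exceeds the threshold for their metric.

    Metrics without a configured threshold are ignored.
    """
    warnings = []
    for r in readings:
        limit = THRESHOLDS.get(r.metric)
        if limit is not None and r.value > limit:
            warnings.append(r)
    return warnings


if __name__ == "__main__":
    sample = [
        Reading("mail-server", "disk_used_pct", 91.5),
        Reading("web-frontend", "cpu_used_pct", 42.0),
        Reading("billing-api", "error_rate_pct", 3.1),
    ]
    for w in find_warnings(sample):
        print(f"WARNING: {w.system} {w.metric}={w.value}")
```

Running a check like this on a schedule is one simple way to catch early signs of trouble before they turn into an outage.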

Craig Bird · Answer

Addressing a significant IT system failure involves a multi-staged approach to ensure a quick resolution. Our first move was to alert our Incident Response Team (IRT), which comprises IT, cybersecurity, and communications experts. 
Next, we conducted a thorough analysis to identify the root cause, isolating the affected systems to prevent further damage and contain the issue.
Communication was critical to our response. We provided transparent and consistent updates to all stakeholders, including employees, clients, and partners. This helped manage their expectations while our technical teams worked on resolving the issue.
Post-resolution, we conducted a comprehensive review of the incident to identify lessons learned and areas for improvement. This involved updating our incident response protocols and investing in additional training for the future.

David Albaugh · Answer

IT systems fail. Sometimes catastrophically. We don't know when, we don't know how, and we don't know which device will fail next. Most IT departments lack an on-staff psychic, and my desk has no room for a crystal ball, or even a Magic 8-Ball. The trick is to be prepared for such catastrophes in advance so that when the catastrophe hits, recovery takes as little time as possible. Everything short of your building burning down can be prepped for and recovered from. While you can't predict what is going to fail, or when, disaster recovery is an important part of the job. Identify which systems are critical for your company to keep ticking over and provide redundancies for each of those systems. If you've got the budget, you can even recover quickly from your building burning down.
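
One way to act on the advice above is to keep an inventory of systems and list the critical ones that still have no redundancy. This is only a sketch: the `System` fields and the rule that "redundant" means at least two independent instances are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    """An entry in a hypothetical system inventory."""
    name: str
    critical: bool  # does the company stop ticking over if this is down?
    instances: int  # independent instances, counting the primary


def missing_redundancy(inventory):
    """Return the names of critical systems that have only a single instance."""
    return [s.name for s in inventory if s.critical and s.instances < 2]


if __name__ == "__main__":
    inventory = [
        System("file-server", critical=True, instances=1),
        System("email", critical=True, instances=2),
        System("lobby-display", critical=False, instances=1),
    ]
    print(missing_redundancy(inventory))
```

The output of a check like this is a to-do list for disaster recovery planning: every name on it is a single point of failure for something the business depends on.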

Aaron Larue · Answer

When there's a significant system failure, there are actually two things that failed: your system and the internal processes you use to build and maintain that system. 
When leading engineering teams, I care a lot about the process we use to identify, address, and iterate on our failures. Bugs will happen, systems will go down, issues are inevitable. How you respond to these issues, and how you communicate with your executive team and other stakeholders throughout the incident, can make a huge difference when it comes to how the rest of the company perceives the system failure. 
We have a playbook that is simple but helps drive clear ownership and communication. First, we make it low-friction for anyone in the company to alert us to a potential system failure or technical issue. Second, we respond to every submission within 10 minutes, either confirming there's no issue or identifying a potential concern. Third, for submissions that are legitimate, we have a dedicated on-call engineer who is responsible for triaging the issue. Ideally, when it's minor, this on-call engineer can address the issue themselves, and that's the end of the process. However, in some cases, issues get escalated as an 'incident', where other engineers or IT support professionals are roped in to analyze the issue and find a fix.
For each incident, we make sure there's a clear incident owner on the engineering team who is responsible for coordinating the rest of the team's efforts, communicating status updates to leadership, and ensuring the issue is managed through to completion and the system is restored. These incident owners are also responsible for creating an 'incident report', which covers the '5 Whys' that describe the root cause for the system failure. On a monthly basis, we have a blameless meeting where we do a retrospective, reviewing each incident report as a team, and using that to identify learnings and incorporate new best practices or procedures to make our systems more stable.
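
The playbook described in this answer can be sketched as a small data model. This is not the author's actual tooling; the class names, fields, and the `triage` helper are hypothetical, and only the 10-minute response target, the single incident owner, and the '5 Whys' report come from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

ACK_TARGET = timedelta(minutes=10)  # response target stated in the playbook


@dataclass
class Submission:
    """A report from anyone in the company that something may be wrong."""
    description: str
    reported_at: datetime
    acknowledged_at: Optional[datetime] = None

    def ack_overdue(self, now: datetime) -> bool:
        """True if the response took (or is taking) longer than the target."""
        responded = self.acknowledged_at if self.acknowledged_at is not None else now
        return responded - self.reported_at > ACK_TARGET


@dataclass
class Incident:
    """An escalated issue with one owner who coordinates the response."""
    submission: Submission
    owner: str
    whys: List[str] = field(default_factory=list)  # '5 Whys' chain for the report
    resolved: bool = False

    def report_ready(self) -> bool:
        """Assumption: a report needs a resolved incident and a non-empty chain.

        Five is the convention the technique is named for, not a rule this
        sketch enforces.
        """
        return self.resolved and bool(self.whys)


def triage(submission: Submission, is_minor: bool, owner: str) -> Optional[Incident]:
    """Minor issues end with the on-call engineer; others become incidents."""
    if is_minor:
        return None  # handled directly, no incident opened
    return Incident(submission=submission, owner=owner)
```

Incidents for which `report_ready()` is true would be the natural input to the monthly blameless retrospective.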

How Do You Handle Significant IT System Failures?

Conduct Regular System Vulnerability Checks

Implement a Multi-Staged Incident Response

Prepare for Disaster Recovery

Follow a Clear Incident Management Playbook