Introduction

Whether it was caused by a fire, flood or other natural disaster, or resulted from a malicious criminal act, an IT system outage can cripple your business. Technology is essential to almost all operational processes today, so experiencing downtime means you can’t answer customer inquiries, develop new products, run production lines, ship your product, or keep your employees productive. System outages are costly and stress-inducing at best. In the worst-case scenario, they can cause firms to close their doors permanently.

According to research firm Gartner, downtime costs companies an average of $5,600 per minute, with hourly costs ranging from $140,000 to $540,000, depending on the business’s size and vertical. Gartner also found that 43 percent of small to midsize businesses (SMBs) shut down immediately in the wake of a “major loss” of data, with as many as 51 percent ceasing operations within two years of such an event.
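To put the per-minute figure in context, here is a rough back-of-the-envelope sketch in Python. The $5,600 average comes from the Gartner figure quoted above; the function name and the outage lengths are hypothetical, chosen only for illustration, and your own costs may differ substantially.

```python
# Average downtime cost per minute, as quoted above (Gartner).
COST_PER_MINUTE_USD = 5_600


def estimated_downtime_cost(minutes: float,
                            cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return a rough cost estimate for an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Hypothetical outage lengths, for illustration only.
for label, minutes in [("30 minutes", 30), ("1 hour", 60), ("8 hours", 480)]:
    print(f"{label}: ${estimated_downtime_cost(minutes):,.0f}")
```

At the average rate, one hour of downtime works out to $336,000, which sits inside the hourly range quoted above.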

IT outages can and do affect businesses of all sizes in all industries. Cloud computing giant Google spent over $13 billion on data center infrastructure in 2019, but still saw multiple network-wide system outages. Smaller businesses with leaner technology budgets are naturally more vulnerable to hardware failures, but even the newest and most advanced computers, servers, and storage devices can still break down unexpectedly.

 

In fact, the majority of unplanned downtime (59 percent of it) is caused by human error, and this rate has remained consistent over the past few years. Even as the inherent reliability of hardware continues to increase, IT systems are growing more complex, and thus present their users with additional opportunities to make mistakes. People are imperfect by nature, and all businesses are at significant risk of system outages caused by employee errors and accidents.

What’s changing, however, is the amount of data loss and downtime that’s now being caused by cyberattacks. Whereas hardware failures and human errors have held steady positions as top causes of data loss since 2014, malicious acts now account for a share of the problem that’s 11 percentage points greater than it was five years ago. Ransomware alone is estimated to have cost global businesses more than $11.5 billion in 2019, and it’s said that an attack takes place every 14 seconds. Forecasters predict that by the close of 2020, ransomware will be attacking businesses once every 11 seconds.

The sobering reality is that today’s companies are more likely to experience major IT system outages than ever before, even as their day-to-day operations become increasingly dependent upon technology. The reason is simple: as the amounts of downtime caused by human error, natural disaster, and hardware failure remain relatively stable, cybercriminals continue to become more sophisticated and resourceful, and their attacks are better targeted and more likely to cause harm.

It’s incumbent upon business leaders who value their customers’ trust and who wish to protect their company’s reputation—and safeguard its future—to develop a plan for managing these risks.

 

The Power of Preparation

Simply put, there’s nothing that an ill-prepared organization can do to avoid significant costs, process interruptions, and employee anxiety or panic in case of a major system outage.

The benefits of advance planning for how your company will handle downtime and disasters are twofold. Well-prepared firms will experience far less downtime, lower costs, and fewer business disruptions. At the same time, they’re less likely to experience significant system outages in the first place.

Business continuity and disaster recovery (BCDR) planning is crucial for boosting your organization’s resilience. Having a well-thought-out, intentional BCDR plan based on best practices can increase employees’ confidence, protect your business’s reputation, and improve your ability to manage risks. For instance, in the case of ransomware attacks, businesses with comprehensive BCDR plans in place are 92 percent less likely to experience significant downtime than those that don’t have them.

The “business continuity” component of BCDR planning involves establishing step-by-step procedures that your employees can follow in order to return your business to regular operations as soon as possible in case of a natural disaster, IT system outage, or other catastrophic event. These steps may include temporary manual replacements for technology-dependent workflows, but it’s also important to outline the human resources and third-party services you’ll need to call in for help, as well as the specific functions they’ll perform.

In contrast, disaster recovery planning consists of implementing technologies and best practices to ensure business-critical IT systems get up and running again as quickly as possible after a crisis, and that data loss is minimized or prevented.

The most important element in any business’s disaster recovery strategy is conducting regular backups of critical systems. No matter your business’s size, having reliable backups that are stored offsite and isolated from your central IT environment will dramatically reduce your risk of incurring significant costs or damages in case of a system outage or cyberattack.

Backups alone don’t constitute a complete disaster recovery strategy, however. Besides maintaining copies of your business-critical data in a secure secondary location, you should also be testing your recovery procedures regularly. It takes time to restore data and applications from backup, so it’s essential to ensure that the recovery process can be completed quickly enough to protect the continuity of your business’s operations. When planning and testing your recovery procedures, you’ll want to keep track of two important metrics.

  1. Recovery time objective (RTO) designates how long your business can survive without access to the data or application in question. It’s a measure of the maximum amount of time it can take to restore the system to full working operations.
  2. Recovery point objective (RPO) describes how much data your business can afford to lose, and thus dictates how frequently backup copies of your data should be made.
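As a minimal sketch of how these two metrics can be checked against a backup schedule, the Python below compares a backup interval with the RPO and a measured restore time with the RTO. The class, function names, and all values are hypothetical, and the sketch assumes the simplification that worst-case data loss equals the time between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # longest tolerable time to restore the system
    rpo: timedelta  # largest tolerable window of lost data


def meets_rpo(backup_interval: timedelta, objectives: RecoveryObjectives) -> bool:
    """Simplifying assumption: worst-case data loss equals the backup interval."""
    return backup_interval <= objectives.rpo


def meets_rto(measured_restore_time: timedelta, objectives: RecoveryObjectives) -> bool:
    """Compare a restore time measured in a test against the RTO."""
    return measured_restore_time <= objectives.rto


# Hypothetical objectives for one business-critical application.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Nightly backups cannot satisfy a one-hour RPO.
print(meets_rpo(timedelta(hours=24), objectives))  # False
# A three-hour test restore fits within a four-hour RTO.
print(meets_rto(timedelta(hours=3), objectives))   # True
```

In this example the restore is fast enough, but backups would need to run at least hourly to meet the stated RPO.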

Like all mechanical systems, backup hardware devices don’t last forever. Repeated use will eventually cause tape drives and spinning hard disk drives (HDDs) to malfunction, and even solid state or all-flash storage arrays can only be written to a finite number of times. Thus an important element of testing backups is ascertaining the ongoing health of the systems, and making sure that they’ll work when needed.
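One simple way to confirm that a backup still works is to restore a file in a test and compare its checksum with the original. The Python sketch below is an illustration only, not a complete backup test: the function names are hypothetical, and the throwaway files stand in for real production data and a real test restore.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """Return True when the restored copy is byte-for-byte identical."""
    return sha256_of(source) == sha256_of(restored)


# Demo with throwaway files (hypothetical stand-ins for real data).
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.dat"
    good_restore = Path(workdir) / "restore_ok.dat"
    bad_restore = Path(workdir) / "restore_bad.dat"
    source.write_bytes(b"customer records")
    good_restore.write_bytes(b"customer records")
    bad_restore.write_bytes(b"customer rec")  # simulates a truncated restore
    print(restore_matches_source(source, good_restore))  # True
    print(restore_matches_source(source, bad_restore))   # False
```

A mismatch does not say what went wrong, only that the restored copy cannot be trusted and the backup media or process needs a closer look.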

With the advent of cloud computing, reliable backup and recovery infrastructures have become more accessible and affordable for businesses of all sizes. It used to be that only the largest of enterprises could bear the cost of building full-scale redundant systems that could automatically take over computing capabilities in case of primary system failure. The cloud’s resource-sharing models have made backup and recovery as-a-service options cost-effective for even the smallest of businesses, and easy for those with small IT staffs to take advantage of. These systems can only help, however, if they are put in place before disaster strikes.

Key elements every comprehensive business continuity and disaster recovery plan should include:

  1. Established recovery time objectives (RTOs) and recovery point objectives (RPOs) for your backup and restore strategy.
  2. Regularly scheduled testing of backups to ensure that restores will be successful, that RTOs and RPOs can be met, and that failover and failback can be accomplished without significant loss of productivity or downtime.
  3. Training for employees on workflows and procedures.
  4. Clear designation of roles and responsibilities among stakeholders.
  5. Well-defined criteria for when to launch the planned actions.
  6. Step-by-step procedures outlining what to do in case of a natural disaster, hardware failure, cyberattack, or an outage resulting from any other cause.
  7. Schedules for reviewing, re-evaluating and updating the plan on an ongoing basis.

When Downtime Strikes: 5 Critical Steps to Take for the Fastest-Possible Recovery

In most cases, how long it takes a business to recover from a full-scale IT system outage (and thus avoid devastating financial losses, reputational damage, and other severe consequences) is a function of how well prepared that business was to face the event. There’s a direct relationship between preparedness and downtime: the better your disaster recovery plan, the fewer disruptions you’ll experience.

Businesses that don’t have a well-developed plan in place will incur higher costs and more disruptions, and should engage an IT service provider with extensive experience helping companies in this situation mitigate their losses. Generally speaking, the less prepared a business is, the more specialized expertise it will require to get its systems up and running again.

With that said, here are the five steps that every organization suffering downtime should take when moving towards recovery:

#1: Determine the cause and scope of the problem.

In some situations, such as a hardware failure, fire, flood, or natural disaster, this will be obvious. Hardware damage will be readily apparent, and its extent will be clearly defined. 

When businesses are struck by ransomware or are the victims of other malware-based cyberattacks, however, it’s essential to get expert assistance in this area. Cybercriminals deliberately employ cloak-and-dagger tactics whenever possible. Ransomware is frequently engineered to re-infect machines after their hard drives have been re-imaged and data restored from backups, or designed to infiltrate and corrupt backup systems that aren’t properly isolated from primary networks. Identifying which individual endpoint device was the original source of infection is key to thwarting these sorts of persistence strategies. Accurately identifying the exact strain of malware that’s involved in a cyberattack is also critical to ensuring that all infectious elements have truly been wiped from your systems.

During this phase, your team will also need to gather documentation on the software, hardware, and configuration settings that are in place in your IT environment. This information will be essential in forensic investigations, and will also be useful when you’re assessing whether systems or resources should be immediately replaced.

#2: Establish or call in your incident response team.

An essential part of disaster recovery is defining the roles and responsibilities that members of your team—or any external consultants you’re working with—will need to assume in times of crisis. The first few hours after an incident begins to unfold are times of uncertainty and stress for all your business’s employees. Giving everyone clear directions and a well-defined set of steps to follow can reduce stress and panic.

The duties involved in responding to a major security incident are wide-ranging. Employees in multiple departments (including marketing and PR as well as IT) will have critical roles to play, and the company’s leadership should stay involved and informed at all times. Depending on the nature and scope of the incident, you may need to involve external cybersecurity consultants and law enforcement officials, and may need to communicate with employees, customers and the general public. At a minimum, your technical team should include a cybersecurity expert, a networking expert, a senior system administrator, and at least two people who can manage desktop remediation.

An experienced IT service provider can lead you in managing the full process.

#3: Evaluate your options.

Within the first 24 hours of any major IT systems outage, you’ll need to assess the availability and reliability of the backups you have in place. If you’ve got recent and granular snapshots of all affected data on hand and the problem is limited in scope or was immediately contained, you could be back in business the next day.

Conversely, if you don’t have backups at all, or if your backup systems were also compromised in the incident, full recovery may take months. 

This phase should include a thorough cost-benefit analysis of all available recovery options. In some cases, alternatives that might seem more expensive initially (such as replacing all affected hardware) might actually save you money over the longer term by reducing the number of hours of emergency IT services you’ll need to consume, and enabling you to upgrade to better-performing, more secure, and more resilient systems that will reduce your risks in the future. 
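A cost-benefit comparison of this kind can be sketched in a few lines of Python. Every name and number below is hypothetical and chosen only to illustrate the point that a higher upfront cost can still produce a lower total; a real analysis would use your own quotes, rates, and downtime estimates.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    upfront_cost: float             # hardware, licenses, and so on
    emergency_it_hours: float       # billable emergency service hours
    expected_downtime_hours: float  # time until operations resume


def total_cost(option: RecoveryOption,
               it_hourly_rate: float,
               downtime_cost_per_hour: float) -> float:
    """Sum upfront spend, emergency labor, and the cost of lost time."""
    return (option.upfront_cost
            + option.emergency_it_hours * it_hourly_rate
            + option.expected_downtime_hours * downtime_cost_per_hour)


# Hypothetical rates and options, for illustration only.
IT_HOURLY_RATE = 200
DOWNTIME_COST_PER_HOUR = 2_000

options = [
    RecoveryOption("Repair existing hardware", 5_000, 120, 72),
    RecoveryOption("Replace affected hardware", 40_000, 40, 24),
]

for option in options:
    cost = total_cost(option, IT_HOURLY_RATE, DOWNTIME_COST_PER_HOUR)
    print(f"{option.name}: ${cost:,.0f}")

cheapest = min(options,
               key=lambda o: total_cost(o, IT_HOURLY_RATE, DOWNTIME_COST_PER_HOUR))
print(f"Lowest total cost: {cheapest.name}")
```

With these made-up inputs, replacement costs $96,000 in total versus $173,000 for repair, even though its upfront cost is eight times higher.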

Be sure to evaluate the extent of your insurance coverage, and what your policy will or won’t pay for. Some cybersecurity insurance companies will cover, and may even encourage, paying ransoms to criminals in exchange for their promise to restore your data. Bear in mind, however, that although data restoration rates in cases where victims have elected to pay the criminals are improving, you never have any guarantee that the encryption key you’re provided will actually work. The decision whether to pay a ransom should take the reputation of the criminals into consideration.

#4: Develop and execute the plan that will get your business back on its feet the most quickly.

Which plan is right for your business depends on the cause of your IT systems outage, your budget, and your tolerance for downtime. There’s no one-size-fits-all plan that’ll work for all companies in all industries.

There are, however, industry-wide best practices and protocols that should guide your recovery efforts. In the case of a cyberattack, for instance, responding to a major incident requires the right tools and technologies, the expertise to know how to use them, and an understanding of the proper protocols to follow. The more readily available these critical elements are when disaster strikes, the less time your recovery will take.

For many companies, there may be a silver lining to IT disasters. Often cyberattacks or hardware failures give those who suffer them the opportunity to modernize technology infrastructure, move to cloud-based services, upgrade software, and improve resilience overall. A crisis’s short-term costs may actually translate into long-term business benefits that ultimately increase efficiency and productivity.

#5: Evaluate—and retain value from—lessons learned from the incident.

The old saying “what doesn’t kill you makes you stronger” is clearly applicable here. Though a significant percentage of SMBs do go out of business in the weeks and months following a severe IT system outage, those that do not often emerge from the incident with more resilient systems and better processes in place.

The key to making lemonade from the lemons of disaster recovery is reliable documentation and honest discussion. Be sure to keep records of all activities that take place during the incident response and recovery process so that you can assess what worked well and what could have gone better. It’s especially important to conduct a post-mortem session where your team can assess the lessons you’ve learned from the incident, and analyze how best to incorporate them into your business continuity and recovery plan for the future.

7 Key Questions to Ask Your Managed Service Provider (MSP) or CTO.

To get the list of the 7 key questions to ask your managed service provider or CTO, download your personal copy of this eBook by completing the form to the right.

 

Conclusion

The most important step your business can take to safeguard its future from the potentially devastating consequences of unplanned downtime is to develop a comprehensive business continuity and disaster recovery plan. Such preparedness can make the difference between success and business failure, especially in today’s landscape of increasingly sophisticated cybersecurity threats. 

But BCDR planning isn’t always straightforward or easy. A managed IT service provider with specific experience in your industry can help you lay the groundwork for real resilience and preparedness. The MSP can guide you in selecting backup and recovery solutions that fit your business needs and budget, can help you choose cybersecurity technologies that will protect the hardware and equipment you rely on, and can assist you in building business processes for increased resilience.

Here at CNS Partners, we have more than 20 years of experience working with midsize organizations in the manufacturing sector. We have a deep understanding of the specific IT security challenges you face—including the recent escalation of the threat posed by ransomware—and we know how important it is to control quality and optimize production in your facilities.  

To learn more about the ways we’ve helped other companies just like yours build resilience into their IT environments cost-effectively, download CNS Partners’ Expert Guide to High-Performing IT Systems today.
