
You may have heard the joke about how "Somebody tripped over a power cable in the hosting center."

Joking aside, a study sponsored by Emerson Network Power in 2016 showed that 22 percent of outages are caused by human error, and that percentage has remained stable for several years. (This does not count outages caused by deliberate human action such as hacking or sabotage.) Other studies put the figure as high as 75 percent. Some of those outages literally are caused by someone accidentally unplugging something.

 

So, what are the most common ways human error causes outages?

 

1. Power problems. Yes, this includes accidentally unplugging something. It also includes plugging too much equipment into one circuit, blowing a fuse or tripping a circuit breaker. A lot of IT equipment is dual-corded, which multiplies the opportunities for error, as do worn-out labels on circuits; together they make it all too easy to power down the wrong server. In some cases, major outages have been caused by somebody turning off the UPS and nobody noticing until mains power went off. Remember the three-day outage of British Airways' computers that delayed or cancelled 1,000 flights? It was caused by somebody disconnecting a power supply.

2. Software updates. System administrators tend to hate them, and in some cases problems are caused by not performing updates at all, especially security updates. Problems can also be caused by not checking system requirements and installing an update that legacy equipment cannot handle. The most common human error, though, is installing an untested update or not taking a proper backup before installing it. In May 2017, Starbucks installed an update to their point-of-sale systems that went wrong and turned many Starbucks locations into cash-only businesses for half a day.

3. New installations. Switching a site over from an old server to a new one is always going to be a high-risk process, and when it goes wrong it can result in significant downtime. New installations are complex and can rarely be fully automated. It can happen to the best of us: when Salesforce did a site switch of their servers in 2016, the database lost file integrity and they had to restore from backups.

4. Not having a disaster recovery plan. Not planning for downtime is the easiest way to make sure downtime lasts longer. Even small companies should have an IT disaster recovery plan to allow them to get back on their feet.

5. Not keeping (or checking) backups. Automated backups handle 90 percent of situations, but sometimes a human has to step in. When an automated backup system fails, administrators often don't notice until they need their backups and discover they are corrupted or non-existent. (A minimal sanity check for this is sketched after this list.)

6. Failing to anticipate load. This is how Lowe's and Macy's both had major outages on Black Friday 2017. They did not realize just how much strain would be put on their servers and didn't set up the extra capacity needed, which resulted in load-related failures of their payment servers. Lowe's website went down altogether for 21 minutes, an eternity in e-commerce time.

7. Routine maintenance issues. Remember when AWS went down for hours on February 28, 2017, taking half the internet with it? Even the AWS status page depended on the affected servers, and both DownDetector and IsItDownRightNow.com went down from the load. Some people were unable to use their smart devices. The cause? A typo. Literally, a typo made by the person taking some servers down for maintenance.

8. Failures of computer hygiene. A worker opens the wrong email or the wrong file and all of the servers go down. Many cyber attacks rely on somebody making a mistake or not paying attention to get in.

9. Mistakes with climate control. Confusing Fahrenheit and Celsius on a thermostat can rapidly cause equipment to overheat. Many datacenters now avoid this by taking the thermostat out of human hands entirely.
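Point 5 is one of the easier failure modes to catch automatically. Below is a minimal sketch, in Python, of a backup sanity check; the directory path, age limit, and size threshold are hypothetical placeholders you would adapt to your own environment, and the alert is just a print statement standing in for real paging.

```python
# Minimal backup sanity check (illustrative sketch only).
# Assumes backups are written as files into BACKUP_DIR; the path,
# age limit, and size threshold below are placeholder values.
import os
import sys
import time

BACKUP_DIR = "/var/backups/db"     # hypothetical backup location
MAX_AGE_SECONDS = 24 * 60 * 60     # expect at least one backup per day
MIN_SIZE_BYTES = 1024              # anything smaller is probably truncated

def newest_backup(path):
    """Return the most recently modified file in path, or None."""
    files = [os.path.join(path, name) for name in os.listdir(path)]
    files = [f for f in files if os.path.isfile(f)]
    return max(files, key=os.path.getmtime) if files else None

problems = []
latest = newest_backup(BACKUP_DIR) if os.path.isdir(BACKUP_DIR) else None
if latest is None:
    problems.append("no backup files found in " + BACKUP_DIR)
else:
    if time.time() - os.path.getmtime(latest) > MAX_AGE_SECONDS:
        problems.append("newest backup is more than a day old: " + latest)
    if os.path.getsize(latest) < MIN_SIZE_BYTES:
        problems.append("newest backup looks truncated: " + latest)

if problems:
    # In practice this is where you would page someone (email, SMS, chat webhook).
    print("BACKUP CHECK FAILED: " + "; ".join(problems), file=sys.stderr)
    sys.exit(1)

print("Backup check OK:", latest)
```

Run from cron right after the backup window, a check like this turns a silent backup failure into a same-day alert instead of a surprise during a restore.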


 

So, how can you reduce the human factor? Automating everything possible is always a good idea, but here are some other things to consider:

 

1. Use monitoring systems such as BinaryCanary.com to alert staff immediately when a system goes offline. The first step to resolving an outage is knowing it has happened. Even for downtime that is not caused by human error, a common reason problems go unresolved is that the right personnel were never notified. You can't rely on employees or customers to notice a problem right away; BinaryCanary.com makes sure you know the moment a key system goes down. (The basic idea behind such a check is sketched after this list.)

2. Security training for all employees. Everyone who accesses your systems should be properly drilled on how to spot phishing attempts, why not to open unsolicited attachments, and so on. Sending the occasional spoofed email from IT can help identify the people who are vulnerable and need a reminder.

3. Improved datacenter design. Better labels can prevent the wrong server or cabinet from being unplugged. Wiring things in the simplest, most intuitive way makes life easier for everyone who does maintenance in the datacenter. Color-coding plugs and plug inserts is a great way to make sure everything gets, and stays, plugged into the right power systems; using red cords for mission-critical equipment that should never be unplugged has worked well for some organizations. Secure plugs that require a key to unplug are also available and can be useful in some circumstances.

4. Better training. Making training mandatory for everyone who goes in or near a datacenter is a good idea. Set a good example by taking at least some of the courses yourself. Training is expensive, but it has a high ROI when it prevents an incident. Hold refresher courses at least annually, ideally twice a year.

5. Limit access. If people have no business being in the server room, keep them out of the server room. This reduces mistakes and also helps improve physical datacenter security, keeping out thieves.

6. Have and test a disaster recovery plan. Having a plan is not good enough. You need to test it and retest it periodically, especially if something has changed.

7. Keep all of your software up to date, while performing updates carefully, to help prevent downtime caused by a bad upgrade.
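On point 1, a hosted monitor does the work for you, but the underlying idea is simple enough to sketch. The following is a minimal illustration in Python of an external uptime check; the URL and polling interval are hypothetical, the alert is a stand-in print statement, and this is not BinaryCanary's actual implementation or API.

```python
# Minimal external uptime check (illustrative sketch, not BinaryCanary's API).
# Polls a URL and reports transitions between up and down; the URL and
# interval below are placeholders.
import time
import urllib.error
import urllib.request

URL = "https://example.com/health"   # hypothetical health-check endpoint
INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = 10

def site_is_up(url):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False

was_up = True
while True:
    up = site_is_up(URL)
    if up != was_up:
        # Replace print with real alerting (email, SMS, chat webhook) in practice.
        print(("RECOVERED: " if up else "DOWN: ") + URL)
        was_up = up
    time.sleep(INTERVAL_SECONDS)
```

The advantage of running this kind of check from a hosted service such as BinaryCanary.com is that it sits outside your own infrastructure, so the alert still reaches you when the problem is on your side of the network.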

Human error is a major cause of website and IT outages, and likely always will be. Minimizing it is best done with training and planning to make sure that people make fewer mistakes and fix them faster.

 


Sources:

https://www.computerweekly.com/news/2240179651/Human-error-most-likely-cause-of-datacentre-downtime-finds-study

https://www.greenhousedata.com/blog/despite-automation-human-error-is-a-top-cause-of-downtime-how-to-avoid-it

https://www.cloudendure.com/blog/7-outages-system-downtime-incidents-q2-2017/

https://www.pcworld.com/article/3068699/salesforce-outage-continues-in-some-parts-of-the-us.html

https://blog.bluematador.com/major-downtime-blunders-2017

http://www.raritan.com/blog/detail/preventing-human-error-in-a-data-center

https://blog.newcloudnetworks.com/5-ways-to-prevent-human-error-disasters