IT organizations spend a lot of time trying to eliminate single points of failure in their systems. Equally problematic is the “single person of failure” problem, which can have consequences that are just as disastrous.
At the recent DevOps Days Silicon Valley conference, 10th Magnitude consultant Sasha Rosenbaum defined the single person of failure problem, outlined the symptoms of the problem and offered solutions for fixing it.
Who is the “Single Person of Failure”?
Building highly available systems is complicated and expensive. Despite IT organizations’ best efforts, most systems have at least a few single points of failure. But let’s say it’s possible to make all of your company’s software and infrastructure perfect—so there are no single points of failure, everything is backed up, every server is load balanced, every database is in a cluster, etc.
Even in this technically perfect scenario, the potential for disaster exists because you might have forgotten about a dependency—let’s call that dependency “Bob.” Bob is the IT person who knows everything and manages everything, and everybody comes to Bob to get what they need. But one day Bob goes on vacation, something goes wrong and no one knows what to do because Bob is the only person with the knowledge or tools to fix the problem. Bob is the single person of failure.
Many IT teams have ended up with at least one “Bob” who puts one or more systems at risk. It’s not that “Bob” is a bad person—he or she could have ended up in the single person of failure position for a number of reasons:
- A lack of IT budget that made it difficult to hire more people to help manage systems
- A tight market for IT talent that made it difficult to find more qualified people to hire
- A desire to implement more control and consistency across systems
It’s natural for IT organizations to drift into the single person of failure problem because of these factors. But humans are not highly available—unlike technology, they have limited uptime, have to deal with regular interruptions and can’t be replicated. So IT organizations have to guard against falling into the single person of failure problem and take steps to fix it when it arises.
Three Signs of a “Single Person of Failure” Problem
How do you recognize that you have a single person of failure problem? Three scenarios can help you spot it.
#1 – “Keys to the Kingdom” Scenario
In this scenario, there is a single person serving as an administrator on a production server or some part of the system. He or she has sole access to that server or system.
How does this scenario happen? The problem might have developed organically with the growth of the company, if no one ever thought about adding more people to the team to manage systems. Sometimes people just take over on their own. They mean well by trying to enforce consistency, but they end up creating a bigger problem.
This problem can be solved with role-based access, where a role is defined to encompass a group of people (the group must have more than one person for this to work!). For example, access can be granted to a Production Administrators group. Another way to approach the problem is to make sure that whoever is on call has access to everything they need to solve any issue that might arise.
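As a rough illustration of the "more than one person" rule, the sketch below audits a role-based access setup and flags any system whose administrator roles resolve to fewer than two people. The role names, system names and data layout are hypothetical and do not come from any particular access-control product; in practice the memberships would be read from your directory or cloud provider.

```python
from typing import Dict, List, Set

# Hypothetical role memberships: role name -> people who hold that role.
ROLE_MEMBERS: Dict[str, Set[str]] = {
    "production-administrators": {"alice", "bob"},
    "database-administrators": {"bob"},
}

# Hypothetical mapping: system name -> roles allowed to administer it.
SYSTEM_ADMIN_ROLES: Dict[str, List[str]] = {
    "web-frontend": ["production-administrators"],
    "billing-db": ["database-administrators"],
}


def admins_for(
    system: str,
    role_members: Dict[str, Set[str]],
    system_roles: Dict[str, List[str]],
) -> Set[str]:
    """Return everyone who can administer a system through any of its roles."""
    people: Set[str] = set()
    for role in system_roles.get(system, []):
        people |= role_members.get(role, set())
    return people


def single_person_systems(
    role_members: Dict[str, Set[str]],
    system_roles: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Return systems that fewer than two people can administer."""
    at_risk: Dict[str, Set[str]] = {}
    for system in system_roles:
        admins = admins_for(system, role_members, system_roles)
        if len(admins) < 2:
            at_risk[system] = admins
    return at_risk


if __name__ == "__main__":
    for system, admins in single_person_systems(ROLE_MEMBERS, SYSTEM_ADMIN_ROLES).items():
        print(f"{system}: only {sorted(admins)} can administer this system")
```

With the sample data, only the hypothetical `billing-db` system is flagged, because its one administrator role has a single member. The same check could be run against whoever is currently on call to confirm they can reach every system they might need to fix.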
#2 – “Beware the Expert” Scenario
This scenario can be summed up in a quote that will be familiar to many IT practitioners: “This issue will take 15 minutes to fix and 8 hours to explain.” A lack of willingness or time to transfer knowledge to others on the team can lead to a single person of failure problem. This means that there is some part of your system that is not documented, not automated and can only be fixed by relying on the knowledge in the single person of failure’s head.
This scenario offers a great opportunity to involve juniors on the IT team in solving the problem (“juniors” meaning new hires who haven’t caught the that’s-how-it’s-always-been virus). They will ask tough questions, and they aren’t emotionally invested in the code base or the system the way long-timers are. Ask these team members to solve the problem with documentation, testing and/or automation. And if you need to justify the time spent on knowledge transfer, think of it this way: if explaining and documenting an issue takes a day, then investigating and fixing the problem without any documentation will probably take three or four days.
#3 – “Can’t Afford to Take a Vacation” Scenario
This scenario occurs when people volunteer themselves to achieve the impossible—to be that person who’s always available, works long hours and can solve any problem. But humans aren’t highly available and are susceptible to burnout.
Many people assume that working long hours improves productivity, but it doesn’t. Research shows that productivity actually declines after a certain number of hours worked per week. When the single person of failure in this scenario gets tired, they get slow and sloppy and start making mistakes. Or they reach their limit and finally do take time off, leaving others on the team to scramble to fill in for them.
Mandatory vacation goes a long way toward preventing this scenario. If you think mandatory vacation is impossible, remember that it’s a routine practice in the financial services industry to prevent embezzlement. Another option is to send each of your team members to training or to intern with another part of the organization for a week. During each person’s absence, have the other team members document any gaps in access or knowledge that arise.
No matter how a “Single Person of Failure” situation has evolved, recognizing that it exists and making changes to solve it are big steps on the path toward fostering a true DevOps culture.
To get even more tips on identifying and solving the single person of failure problem, watch Sasha’s engaging and informative presentation.