
Designed to Fail

March 26th, 2014
Alan Shepard in the capsule Freedom 7 before launch, 1961. Shepard became the first American in space on May 5, 1961, flying the Mercury-Redstone 3 mission in a capsule he named Freedom 7; the suborbital flight lasted 15 minutes. (Photo credit: Wikipedia)

I’m writing this partly for my wife and my non-technical friends. As I thought the idea through, it occurred to me that it isn’t particularly technical, and that it’s a good way of explaining what I spend a lot of my time doing when I’m actually designing solutions.

Recently both my wife’s phone and mine have broken in various ways: she dropped hers and it has never been the same since; mine suffered some dodgy hardware failure, and trying to fix it made it worse. The technology we buy is generally put together by the cheapest hardware provider, from the lowest-cost commodity components (not always the case, but largely true). Mass production lowers costs; so does cutting corners and reducing redundancy (multiple components that exist solely to take over when one fails).

In general home life most of us don’t bother with redundancy. Some of us may have more than one desktop PC or laptop, but most of us have only one personal phone, one big-screen TV, one Blu-ray player, and so on. We buy commodity kit and we accept that it is a commodity, prone to failure. Some of us don’t realise this and get upset when technology fails (perhaps a lot of us don’t realise it!), but once you remember that the technology you own isn’t necessarily the best (the best is expensive) but the cheapest, you’ll see why it fails so often. This is why the home technology insurance industry is so strong!

You don’t see technology insurance in the IT world, though, not in the same guise. You have support contracts (which in many ways you could consider an insurance policy!), and these generally guarantee a replacement within four hours or one business day. That is too long to wait for a recovery, and this is where my job comes in.

I design systems to fail: I push systems until they fail, write down why they failed, and work out whether we should, or could, prevent it. One thing we always try to eliminate is the single point of failure (SPOF), that is, any single unit, device or component whose failure would take systems offline. But SPOFs scale: a computer room is a SPOF, so is a country, so is the planet Earth (some organisations have genuinely considered satellites for keeping data replicated off-planet). This is where you weigh cost against benefit. Protecting against the failure of a whole country is probably irrelevant for most people I talk to in the UK; if the UK is gone, is a UK company really going to care that it’s still online and available? Global banks, insurance firms and other global corporations certainly will care, but most “normal” companies won’t.
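
To see why removing a SPOF matters so much, here is a minimal back-of-envelope sketch in Python (the 99% figure is an assumption for illustration, not a measurement of any real kit):

    # Hypothetical illustration: availability of a single component vs. a
    # redundant pair. A component that is up 99% of the time is down roughly
    # 3.65 days a year; put two independent ones in parallel and the service
    # is only down when BOTH fail at once.

    single = 0.99                      # availability of one component (assumed)
    pair = 1 - (1 - single) ** 2       # probability at least one of the pair is up

    hours_per_year = 24 * 365
    print(f"Single component: {single:.4%} up, "
          f"{(1 - single) * hours_per_year:.1f} hours down per year")
    print(f"Redundant pair:   {pair:.4%} up, "
          f"{(1 - pair) * hours_per_year:.2f} hours down per year")

In practice component failures are rarely independent (a shared power feed or computer room correlates them), which is exactly why SPOFs scale up from components to rooms, sites and countries.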

So everything I design is designed in pairs, or more. At the most basic layer we start with pairs of components: power connections, network connections, servers, rack cabinets (the frames that hold computers in a data centre), power substations, internet feeds and, increasingly often, complete data centres (what we generally call Disaster Recovery, DR, or Disaster Avoidance). Depending on the size, we increase the tolerance for failure, N+1 being the minimum (N being the number of systems you need to actually meet the technical requirements, 1 being failover capacity). More often than not this is now N+2, and with DR we increasingly go to (N+2) + (N+2), because a second data centre needs the same capacity as the primary in order to recover from a catastrophic failure.
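
To make the sizing arithmetic concrete, here is a rough sketch, again in Python, with entirely made-up workload numbers:

    import math

    def servers_needed(workload, per_server, spares=0, sites=1):
        """Servers to buy: N to carry the workload, plus spares, per site."""
        n = math.ceil(workload / per_server)   # N: bare minimum to meet requirements
        return (n + spares) * sites

    workload = 10_000      # e.g. requests/sec the service must sustain (hypothetical)
    per_server = 1_200     # requests/sec one server can handle (hypothetical)

    print("N:           ", servers_needed(workload, per_server))
    print("N+1:         ", servers_needed(workload, per_server, spares=1))
    print("N+2:         ", servers_needed(workload, per_server, spares=2))
    print("(N+2)+(N+2): ", servers_needed(workload, per_server, spares=2, sites=2))

The jump from N+2 to a full (N+2) + (N+2) second site doubles the hardware bill, which is exactly the cost-versus-benefit trade-off described above.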

There is an old saying, “fail to plan, plan to fail”, but in the IT industry we flip it around: everything should be designed to fail. I push people annoyingly hard to understand what happens when different components fail, how that will impact the application or service, and how we can design around it. Making things fail and understanding failure events is a huge part of my work.
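
For the curious, here is a toy sketch of what one of those failure drills looks like, assuming a hypothetical service with two redundant backends (the names are invented):

    # Hypothetical failure drill: kill each component in turn and check the
    # service still answers. The backend names are made up for illustration.

    backends = {"server-a": True, "server-b": True}   # True = healthy

    def service_up():
        """The service survives as long as at least one backend is healthy."""
        return any(backends.values())

    for name in list(backends):
        backends[name] = False            # inject the failure
        status = "OK" if service_up() else "OUTAGE: single point of failure!"
        print(f"Failed {name}: {status}")
        backends[name] = True             # restore before the next drill

    backends["server-a"] = backends["server-b"] = False   # site-level failure
    print("Failed both:", "OK" if service_up() else "OUTAGE: needs DR site")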

Somewhere, somehow I am probably related to Gene Kranz (although not directly, to my knowledge), the well-known NASA Flight Director, and I find it a happy coincidence that his famous book is titled “Failure Is Not an Option”. My entire job revolves around the same ethos, but failure is not an option because I have to find every failure before it happens. NASA go to N+2 or N+3 in most circumstances, as human lives are directly at stake, as well as hundreds of millions of dollars! Yet Alan Shepard, asked how he felt just before the launch of the Redstone rocket, famously replied: “… every part of this ship was built by the low bidder.”

So next time a technology device at home breaks, remember that what you bought was produced by the cheapest bidder, in the most efficient mass-production model. And every time you read a news story about a banking system going down, Facebook being unavailable, or some other IT system failure, one of two things has happened:

  1. Someone like me didn’t do their job thoroughly enough (less likely, I like to think)
  2. Someone like me did their job thoroughly, but someone higher up decided the cost of removing that chance of failure wasn’t worth the benefit

You might think that, given the publicity and bad press, option 2 would be the rarity, but you would be surprised how often things don’t fail; and the longer things don’t fail, the harder it is to justify the cost of introducing redundancy.

Cover of "Failure Is Not an Option: Missi...

Cover via Amazon


  1. Eric | #1

    Hi Chris,
    Interesting and nice article. For systems designed with such low availability/reliability, like mobile phones and tablets, we should not keep critical information on them, and should certainly back them up (to the cloud).
    I very much agree that even global enterprises put a low priority on their backup/recovery/DR systems and are not 100% sure those systems work and meet the RTO/RPO. It is very risky, as most decision makers underestimate the impact of a failure until it happens or, as you said, find it too hard to justify… @~@
    Could you kindly share more on how to keep data/DR off-planet? An interesting topic.

  2. | #2

    What’s your definition of critical? Syncing or distributing data so that any single device becomes dispensable is a great tactic that I think all enterprises can learn from; it doesn’t need to be low priority or non-critical. Look at the database and filesystem world and you’ll see this happening with NoSQL systems like Cassandra and MongoDB, and distributed storage like Lustre and Ceph. The problem is that removing the dependency on big, important IT equipment is not in the best interest of the companies that lead the big, important IT industry. The horrible cliché “No one got fired for buying IBM” is so irrelevant now; the reality is that if you don’t stay current with new technologies, you will get left behind!
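
    As a toy illustration of making a single node dispensable, here’s a rough sketch of quorum-style replication in Python, in the spirit of Cassandra-like systems (the node names and failure rate are invented; this isn’t real Cassandra code):

        import random

        REPLICAS = ["node-1", "node-2", "node-3"]   # hypothetical three-node cluster
        QUORUM = len(REPLICAS) // 2 + 1             # majority = 2 of 3

        def write(value):
            """Write to all replicas; commit only if a majority acknowledge."""
            acks = [n for n in REPLICAS if random.random() > 0.1]  # ~10% chance each node is down
            ok = len(acks) >= QUORUM
            print(f"write {value!r}: {len(acks)}/{len(REPLICAS)} acks ->",
                  "committed" if ok else "failed")
            return ok

        write("customer record")   # losing any single node leaves the data intact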

    Regarding off-planet DR, these guys have something although I don’t really know anything about it: http://offworldbackup.com/Public

