Tonight I am writing about disaster recovery planning, something that, as a technical leader, I have been involved in for some years. During that time I have been fortunate enough to work with some amazingly talented and knowledgeable people, each of whom has taught me something new.

I write this in the hope that you will smile at some of the more unexpected twists and turns, but also that you will find a nugget of useful information that will help you or your organisation look at disaster recovery through a different lens.

This is post number 1 in a series of examples of disaster planning either working... or not.

Today’s Moral: Expect the unexpected

When I first started working in the technology sector in my very early twenties, I provided contract software services to a UK telco (name omitted deliberately to spare blushes!).

Based in central London, the telco had commissioned the replacement of a number of its routeing switches. For those of you fortunate enough never to have worked in telecommunications, the routeing switches of the day were multi-million-pound affairs, about the same height as a full-size server cabinet but three times the width and depth, and full of costly tech.

As is often the case, the building itself was not new. In fact, it had been refitted on more occasions than was healthy, and over the years the data centre had been migrated to the top floor (an interesting decision in my opinion!). Because of this, a crane had to be used to lift the cabinets into place, and to do so a big main road with a bridge in the City of London had to be closed with the permission of the council (at much cost!). The deal was done, and the road was allowed to be closed between 4 and 6 am in the early hours of a Monday morning (mistake number one!).

The crane arrived and, after a bit of jostling, the cabinets were secured and the lift commenced. Eight storeys later, the cabinet hovered precariously over the edge of the property. With the roof fully open, the cabinet was swung over and into position. This is where the story starts to go horrendously wrong!

As the crane began to rotate over the roof edge, a small gust of wind caused the driver to adjust his controls. While he did so, the cabinet dropped very slightly (a matter of millimetres). As a result, the underside of the enclosure snagged on the capping stones of the roof and peeled the cabinet open like a tin of sardines (if you’ve never had sardines, they are very nice, but for the purpose of this post think about a rip-top tin of baked beans!).

At about this point in the overall programme of work, the project manager stood aghast as months of planning, months of negotiation and months of difficult board-level convincing fell apart. The cabinet in question spilt its entire contents from the eighth floor onto the road below. Thankfully nobody was hurt, and there was only minimal peripheral damage.

At this point, some things became abundantly clear.

  • Telco cabinets are incredibly delicate things
  • Items dropped from 8 storeys spread over quite some distance
  • The likelihood of clearing up the road and allowing traffic and pedestrians through by 6 am immediately dropped to 0%
  • Risk management is an amazingly important topic

For those with a healthy curiosity, I can tell you that the road re-opened at 10 am on Monday morning and that, as a result, the organisation concerned was fined an inordinate sum of money. The cabinet was irreparably damaged and, to make matters worse, an even more unexpected downpour caused a few thousand pounds’ worth of damage to the exposed data centre.

So what did we learn from this experience?

  • When trouble breaks, the things you do in the first few minutes will define you in the situation
  • If you remain calm and give clear instructions, you will find that people will act positively to help recover the situation (if at all possible)
  • It takes a lot to sweep half a mile of main road by hand, removing components that can be as small as 3 mm!

But anyway, joking aside, as a junior member of the engineering team I learnt the true importance of risk management in this context. For the first time I understood that it was not simply a box-ticking exercise, nor was it simply about filing a risk rating report and carrying on. This lesson has stuck with me throughout my career, and I think that, as a result, I have avoided some unexpected disasters.

Speaking to the project manager concerned a long time after the event, I asked about his understanding of risk management at the time. He calmly explained that the focus was always on the big, obvious hazards (project running over, lack of budget or resources, failure of functional and non-functional testing). In hindsight, he learnt one basic truth:

It is never the obvious risks that bite you. You are always prepared for the generic hazards; it is always the unexpected that causes business catastrophes.

As such, you should at the very least understand the following for each risk.

  • What is the likelihood of the risk?
  • What is the worst possible impact of the risk (if left untreated)?
  • What is the impact of the risk after treatment?

Armed with this information, you have the metrics required to make informed decisions based on your organisation’s risk appetite, and to clearly articulate the real financial impact of your actions (should they not pan out as you expect).
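For those who like to see this sort of thing written down, here is a minimal Python sketch of a risk register built on those three questions. It is an illustration only: the `Risk` class, the `rank_by_exposure` helper and every figure in it are hypothetical, and multiplying likelihood by impact to get an "exposure" is just one simple, commonly used way of comparing risks, not the method used on the project in this story.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One entry in a hypothetical risk register."""
    name: str
    likelihood: float        # estimated probability, between 0 and 1
    impact_untreated: float  # worst-case cost in pounds if left untreated
    impact_treated: float    # estimated cost in pounds after treatment

    def __post_init__(self) -> None:
        if not 0.0 <= self.likelihood <= 1.0:
            raise ValueError("likelihood must be between 0 and 1")

    def exposure_untreated(self) -> float:
        # Assumption: exposure = likelihood x impact (a simple expected cost).
        return self.likelihood * self.impact_untreated

    def exposure_treated(self) -> float:
        return self.likelihood * self.impact_treated

    def within_appetite(self, appetite: float) -> bool:
        # Appetite here is the maximum exposure (in pounds) the
        # organisation is willing to carry for a single risk.
        return self.exposure_treated() <= appetite


def rank_by_exposure(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest untreated exposure."""
    return sorted(risks, key=lambda r: r.exposure_untreated(), reverse=True)


if __name__ == "__main__":
    # All figures below are invented purely for illustration.
    register = [
        Risk("Project overruns schedule", 0.5, 100_000, 40_000),
        Risk("Load snags during crane lift", 0.125, 2_000_000, 200_000),
    ]
    appetite = 22_000
    for risk in rank_by_exposure(register):
        status = "within" if risk.within_appetite(appetite) else "outside"
        print(f"{risk.name}: untreated £{risk.exposure_untreated():,.0f}, "
              f"treated £{risk.exposure_treated():,.0f} ({status} appetite)")
```

With these made-up numbers, the less likely event still ranks first because its worst-case impact is so much larger, which is exactly the kind of risk that tends to get overlooked when attention goes to the obvious hazards.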

Now, don’t get me wrong, there will always be risks that we do not see coming or that cannot be thoroughly mitigated. That said, I learnt as a junior that there are also some fairly significant risks that, if we choose to overlook them, will have a detrimental effect on the companies we strive to support and, in turn, potentially on our own careers.

In this instance, the project manager concerned chose to change his career after this event, suggesting that the stress of project management was not for him. Let us not make the same mistakes in the future!

For the record, I have spoken to the project manager concerned very recently and shared my thoughts about posting an article on that event. I was pleasantly surprised that he was not only happy for me to write this but actually encouraged it.

So what is the moral of this tale? Plan, Review, Assess Risks, Put in Place Controls and Continually Reassess the Risks as they change.

That concludes this article. Thank you for taking the time to read it, and please stay tuned for the next Disaster Recovery Lessons edition.