Disaster Management and Preparation

Version Relevance: All

Issue: How do we recover from catastrophic failure?

December 14th, 2016

Background: Caliach Vision runs our daily operations. If it goes down, no one can work and the business effectively stops operating. Even when we get running again it can take days or even weeks to fully recover. Are there some golden rules to follow?

Feedback: Serious problems can arrive at any time and in pretty much any form. The key objective of any response should be to act quickly to prevent a problem from turning into a disaster. There have been three recent (2016) major site crises, all very different in cause and effect. All could have been handled better, with less painful outcomes. We discuss here the lessons learned.

All three of the recent incidents had causes external to Caliach Vision itself: one a change in the operating system, one plain bad luck and one a criminal ransomware attack. None could have been foreseen, but all could have been planned for in a generic sense. What is clear is that however much IT protection and redundancy you think you have, you are still vulnerable to major incidents that could never have been anticipated.

The most destructive and costly of the three incidents was the plain-bad-luck one - it was the scale of the business that contributed to this ranking. By far the most alarming was the ransomware attack. Such attacks typically take the form of a Trojan with a payload that encrypts any file the user has access to, including all network shares. You are then asked to pay up to reverse the process. You either pay up or recover all the encrypted files from some pre-attack off-line backup.

This makes pre-V5 Caliach Vision systems very vulnerable (unless you are using Omnis Data Bridge) because all users must have read/write access to the database files. In V5 your database is secure behind the effective firewall of the DBMS (PostgreSQL), at least against currently known malware. By good judgement the subject site had upgraded to V5, so although a lot of damage was done to ordinary file-system files, the database itself was untouched. All operating systems are vulnerable to this - Windows, macOS and Linux.
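The "firewall" point can be illustrated with PostgreSQL's host-based access control. In the sketch below (the database name, user name and network range are illustrative assumptions, not Caliach defaults), workstations authenticate to the database server over the network with a password, but never receive read or write access to the underlying data files - so file-encrypting malware running on a workstation has nothing to encrypt:

```
# pg_hba.conf -- illustrative sketch only; names and addresses are assumptions
#
# TYPE   DATABASE   USER          ADDRESS          METHOD
local    all        postgres                       peer
host     caliach    caliach_app   192.168.1.0/24   scram-sha-256
host     all        all           0.0.0.0/0        reject
```

The data files themselves are owned by the database service account alone; no ordinary user, and hence no malware running as an ordinary user, can touch them.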

When is a crisis not a crisis?

This is one of the most difficult aspects to get right. A problem starts with a bunch of strange errors. At this point it just looks like another user problem, as others are working away without issue. In fact, whatever has happened has corrupted the system, so continuing to work is not only pointless but is making the situation worse. A four-hour shutdown is much less damaging than a week spent coordinating the re-work of 30 users' activity for a day: a morning's inefficiency turns into a week of double-working. At some point a decision needs to be made to get everyone off the system and, if it is a serious problem, the sooner the better. Identifying that can be hard unless someone competent is on hand to make the judgement.

Number one, therefore, is to have someone on-site at all times who has enough experience and knowledge of the system to spot when something is going terribly wrong - someone who can call up Caliach Support to clarify what might have happened and how serious the issue is. By the nature of these unexpected issues it may take time to answer this, so Support will by default tell you to stop using the system. The reason is that if the database is corrupted, the only recovery may be last night's backup.

Do not panic! Do not scream and shout or send five emails an hour to four CCs! Resist the temptation to get hysterical; it is counter-productive. All three recent cases demonstrated elements of this behaviour. One competent person, and one only, should be tasked with co-ordinating the recovery and left alone to get on with it. Crisis management by committee is a big mistake.

What priorities should we have?

First and foremost, you need to establish the root cause of the problem. From that you can establish what damage has been done and therefore your road to recovery. In two of our recent cases the problem was encountered at logon (a time when all parts of the system have to come together). In one case this confused the heck out of the system manager (and our support) because other users who had logged on before the attack were still running normally. After a few hours, when the true cause was established, it became apparent that a restore of the previous night's backup was needed.

In the meantime, about 1,000 transactions had been performed. To recover, all 1,000 had to be repeated in exactly the order they had been performed originally (to maintain document numbering sequences and relationships). The site was, in this instance, lucky because we could recover enough information from the corrupted database to allow them to sequence their recovery activity. The recovery process was also very well managed internally. Even so, it took much of a week to get back to normal. In retrospect, a quicker shutdown of the working system could have saved several days of recovery disruption.

Second, and affecting the first, it is important to maintain site system know-how. Managers running around saying "I don't know" are no good to anyone - our support answer will be "Why don't you?". You also cannot rely on your IT support company, as they will know nothing about the Caliach Vision system (with the exception of www.lineal.co.uk). If you have run down or dumbed down the Caliach system manager function, you will eventually learn to regret it the hard way. There seems to be a trend developing where people and businesses think all software should behave like iPhone apps. Caliach Vision is not a toy, and if you hold that view you should not be using it.

Plan for the worst and sleep better

Here is the good news! Short of a nuclear strike, there is pretty much no event that you cannot ultimately recover from by taking sensible precautions. Fire, a disgruntled employee, plain stupidity - all can do huge damage and cost a great deal. But here are some tips on how to minimize the damage and speed recovery when the inevitable crisis occurs:

  1. Develop a Caliach Vision disaster recovery plan. Brainstorm the possibilities and synthesize a strategy for survival. Then write a procedure and make it available to everyone - on paper, because the computers may be down!
  2. Have a sensible, working backup system going back far enough, preferably onto external media. With pre-V5 backups of the database you can only realistically have one per day; from V5 on you can have one every minute if you have the disk space. We had one case five years ago when we called for a backup recovery and the last backup was six months old. The business never recovered from the experience and is no longer operational.
  3. Prepare your staff for a system shutdown. They should be able to keep going with pen and paper, at least for a few hours, without you losing business.
  4. Do dangerous things in the early morning. That way, if you screw up, you don't lose a whole day's work if you have to restore a night-time backup.
  5. Maintain good file and access security. You may trust your people, but remember they can be unwitting carriers of pretty unpleasant stuff. The less access they have, the less access their malware has.
  6. Have strict rules about outsiders getting into your network. Even Caliach Consultants can be carrying dangerous stuff.
  7. Do not rely on an IT support service provider to get you out of a Caliach Vision crisis. They will not have a clue how it works, and Caliach Support will not train them for free. In our experience they just muddy the waters (www.lineal.co.uk excepted).
  8. Keep your systems up to date, and that includes Caliach Vision. Every new version of the program incrementally builds on experience to improve reliability, protection and recoverability.
  9. Keep your users trained and prepared for an emergency shutdown.
  10. Stay disciplined and keep calm.
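Tips 2 and 4 above can be sketched as a simple scheduled backup script for a V5 (PostgreSQL) database. This is a minimal illustration, not Caliach documentation: the database name, destination directory and retention period are assumptions you would adapt to your own site, and external media is preferable to a local path.

```shell
#!/bin/sh
# Backup-rotation sketch for a V5 (PostgreSQL) database.
# All names are illustrative assumptions, not Caliach defaults.

DB="caliach"                                        # assumed database name
DEST="${CALIACH_BACKUP_DIR:-/tmp/caliach-backups}"  # prefer external media
KEEP_DAYS=30                                        # how far back history goes

STAMP=$(date +%Y%m%d-%H%M)
mkdir -p "$DEST"

# pg_dump -Fc writes a compressed archive restorable with pg_restore.
# A failed or partial dump is deleted rather than kept as a false backup.
if command -v pg_dump >/dev/null 2>&1; then
    pg_dump -Fc "$DB" > "$DEST/$DB-$STAMP.dump" 2>/dev/null \
        || rm -f "$DEST/$DB-$STAMP.dump"
fi

# Prune archives older than KEEP_DAYS so the media never fills up.
find "$DEST" -name "$DB-*.dump" -mtime +"$KEEP_DAYS" -delete
```

Run from cron, for example every 15 minutes (`*/15 * * * * /usr/local/bin/caliach-backup.sh`), and take a manual run immediately before any "dangerous thing" so the restore point is minutes old, not a day old.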

Chris Ross - Senior Consultant