We had a major outage on Friday this week on nine of our servers hosted in a particular data center in Florida. We have restored almost all the servers and the sites on them. As we come to the end of this recovery phase, it is worth studying how a major outage lasting a full day could have been avoided.
Here are some of the facts surrounding this outage:
1. The service maintenance was planned to take six to seven hours, but it was carried out on a weekday, during prime working hours in Australia, Asia and Europe.
2. This particular set of servers was operated by an organization in which we had contact with only one individual, and hence we strongly believe it is a one-member organization.
3. The outage started right after the ‘service maintenance’ task carried out by this service provider.
4. All the servers stayed down beyond the ‘service maintenance’ period for extended periods of time (ranging from 12 to 24 hours).
5. Even after the servers came back up, we had issues with at least five websites hosted on those servers, all of which were due to the server migration.
Some of the lessons learnt from this disaster are:
a. Never agree to planned maintenance on business days.
b. Avoid data centers or service providers which are ‘single person organizations’.
c. Protect the clients better by having a set of backup/secondary servers.



