Planning to Fail – A New Breed of Expectations
Let’s face it: as an IT professional, it is unlikely that you would ever deploy a production workload without first validating that the solution has been properly configured and meets business, load, and security requirements. Simply stated, if we want to fully understand how our workloads will behave in a real-world scenario, we must test them in real-world conditions. Perform an IT self-health evaluation based on the following three questions:
- How will your infrastructure, applications, and data react in the event of a service outage?
I’m not talking about the Visio diagram that your engineering staff provided to your technology leadership which explains the failover protections – I am asking: “Have you ever actually watched it work?”
- What happens when a given workload experiences a sharp increase in demand?
How will your infrastructure react to load that surpasses its provisioned capabilities? Have you watched it work? Are there upstream or downstream systems that would be impacted?
- How would your systems and applications recover from a replication failure that left your distributed systems with data synchronization discrepancies?
How would you repair and reconcile these systems? How long would it take you? Do you have tools queued up for such a reconciliation?
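As one illustration of what a "queued up" reconciliation tool might look like, here is a minimal Python sketch that compares two snapshots of the same data set and reports where they disagree. Everything in it is an assumption made for illustration: the function names `row_digest` and `find_discrepancies` are hypothetical, the snapshots are plain in-memory dictionaries keyed by record ID, and a real tool would read from your actual data stores and handle far larger volumes.

```python
import hashlib
import json


def row_digest(row):
    """Return a stable hash of a record so two copies can be compared cheaply."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_discrepancies(primary, replica):
    """Compare two {key: record} snapshots and classify the differences."""
    # Records the primary has but the replica never received.
    missing = sorted(k for k in primary if k not in replica)
    # Records that exist only on the replica.
    extra = sorted(k for k in replica if k not in primary)
    # Records present on both sides whose contents differ.
    mismatched = sorted(
        k for k in primary
        if k in replica and row_digest(primary[k]) != row_digest(replica[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Hypothetical sample data standing in for two out-of-sync systems.
primary = {"a": {"qty": 1}, "b": {"qty": 2}, "c": {"qty": 3}}
replica = {"a": {"qty": 1}, "b": {"qty": 5}, "d": {"qty": 4}}

report = find_discrepancies(primary, replica)
print(report)
```

A report like this only tells you where the copies differ. Deciding which side is correct, and how long the repair takes, is exactly what a drill should measure before a real replication failure forces the question.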
Consider the concept of an elementary school fire drill. Most would agree that the reason fire drills are practiced regularly – and in nearly all modern school districts – is that history has taught us that pressure and stress can disintegrate even the best-laid plans. The same principle holds true for IT infrastructure systems availability and Business Continuity Planning (BCP). However well formed your business continuity plans are, plan on your first line of defense failing. Adopting an IT culture that plans to fail better positions your engineering staff and Managed Services Provider to react to a service outage.
Here are some questions you should be asking your IT Managed Services Provider to determine your organization’s readiness for a technology service interruption:
- While planning for how systems will operate under ‘normal’ conditions is needed, what planning have we accomplished for disaster scenarios and how have those plans been exercised?
- If your Production infrastructure is not cloneable (read: you cannot run a CloudFormation, Chef or Puppet script to recreate a parcel of your Production footprint in a test environment), can you provision a test environment in the Cloud for only a few hours to run Business Continuity testing?
- What type of Business Continuity and/or Disaster Recovery “Fire Drills” are currently run and at what frequency?
- Are the notification and escalation plans that are associated with your systems tested – and at what frequency?
The time to determine that you need to shore up your IT engineering and MSP responses to failure events is before an outage occurs.
After all, when did Noah build the ark – before or after the flood?
Written by Eric Klotzko
Eric Klotzko is the Vice President of Enterprise Cloud for InterCloud Systems, Inc. He has functioned as a Cloud Architect, Application Developer, Graphic Designer, Database Administrator, and Project Manager for a wide variety of business, e-commerce, advertising and entertainment applications. Eric makes his home in the high-peaks region of the Adirondack Park in upstate NY with his wife and two children.