Graceful Failure: How to Survive the Inevitable Outage
Anyone who runs a small software company or online service knows the importance of a strong web presence. A lot has been written about how you should structure your site, what message you should present to visitors, and what you can do to increase your conversion rate and ultimately drive revenue. However, all of these assume that your website is up and running and that visitors can actually reach your site.
If you’ve worked with technology for any time at all, you know that Murphy’s Law is alive and well, and that things can and will break. I’ve spent the past ten years in the hosting industry, the first eight years running the technology side of a major hosting company. In that time, I saw the ins and outs of the infrastructure that makes up the Internet, including failures of every imaginable sort. For the past two years I’ve run my own startup, Panopta, which provides advanced server monitoring and outage management tools. My goal with this series of articles is to share what I’ve learned with other startups and software companies, and hopefully save you some pain and stress down the road.
Perhaps you’re just starting out designing your online presence, and are looking for ideas on how to design your infrastructure. This is the best time to think carefully about how to avoid problems down the road. If you already have an online presence, don’t worry: these strategies can be incorporated as incremental improvements to existing infrastructure, and there’s no better time than the present to begin preparing.
Hosting a website depends on many components: physical (datacenter building, electricity, air conditioning), network (backbone connectivity, global routing tables, intra-datacenter networking), and software (operating system, web server, application server, etc.). With that many pieces, there’s plenty that can go wrong. Outages can be the result of good events too: everyone has heard stories of sites that have been knocked offline by a front-page mention on Digg or Slashdot. That’s great exposure, but it’s wasted if visitors can’t actually reach your content. This goes double for software-as-a-service companies, as an outage means not only that potential customers can’t find you but also that current customers are unable to use the service they are paying for.
This is the first in a series of articles that will explore what you can do to maximize the availability of your website and other components that make up your online presence so that your current and future customers can always find you.

Where to Start?
Before getting involved with technical details, there are some general principles that should guide the design of your infrastructure. These transcend any particular technology that you’re using, and should be kept in mind both when initially putting together your infrastructure and at each step in its evolution.
Take responsibility for your own infrastructure
Unless you’re launching a huge VC-funded operation, you’ll likely involve a number of service providers in building your infrastructure. We’ll talk later about how to pick these wisely, but it’s important to recognize that the availability of your infrastructure is ultimately your responsibility.
While you can (and should) count on your providers to assist when there are problems, their interests can never be 100% aligned with yours, as they are also responsible to many other customers. Many people think that they are safe because they have a service level agreement (SLA), but most SLAs only compensate you for the cost of the service during the downtime, not for your business losses. For any sizable outage, your business losses will likely be the far greater of the two.
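To make the gap between an SLA credit and your real losses concrete, here is a rough back-of-the-envelope sketch. Every number in it is hypothetical, and it assumes the simplest kind of SLA: a pro-rated refund of the monthly fee for the hours the service was down. Your provider's actual terms may well differ, so treat this as an illustration rather than a model of any particular contract.

```python
def sla_credit(monthly_fee: float, outage_hours: float,
               hours_in_month: float = 730.0) -> float:
    """Pro-rated refund of the hosting fee for the time the service was down.

    Assumption: the SLA simply refunds the share of the fee that
    corresponds to the downtime. Real SLAs vary.
    """
    return monthly_fee * (outage_hours / hours_in_month)


def business_loss(revenue_per_hour: float, outage_hours: float) -> float:
    """Revenue you did not earn while the site was unreachable."""
    return revenue_per_hour * outage_hours


# Hypothetical figures, for illustration only.
monthly_fee = 200.0        # what you pay the provider per month
revenue_per_hour = 150.0   # what the site earns per hour when it is up
outage_hours = 4.0         # length of the outage

credit = sla_credit(monthly_fee, outage_hours)
loss = business_loss(revenue_per_hour, outage_hours)

print(f"SLA credit:    ${credit:,.2f}")
print(f"Business loss: ${loss:,.2f}")
```

With these made-up inputs, the credit comes to roughly a dollar while the lost revenue is $600. Plug in your own fee and revenue figures to see how the two compare for your business.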
At the end of the day, you’re the one that suffers most from an outage — take the lead and do whatever makes sense to minimize the impact.
Know what can break
There are quite a few moving parts to even the simplest online presence, most of which are under your control. It’s important to know what all of these pieces are, how they interact, and what their dependencies are. Then, while you are in a calm, non-crisis situation, try to imagine all of the possible things that can break. In later articles, we’ll walk through the core components together and review common areas to focus on, but especially with custom applications, there are lots of areas that will be unique to your setup.
As we describe below, we don’t necessarily need solutions to all of these, but we want to make sure we’re not surprised down the road with something unexpected.
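One lightweight way to do this inventory is to write the pieces and their dependencies down as data and then ask, for each piece, what else goes down with it. The sketch below is only an illustration: the component names and the `DEPENDS_ON` map are hypothetical, and a real inventory will be larger and specific to your setup.

```python
from collections import deque

# Hypothetical inventory: each component maps to the components it depends on.
DEPENDS_ON = {
    "website": ["web_server", "dns"],
    "web_server": ["database", "datacenter"],
    "database": ["datacenter"],
    "dns": ["registrar"],
    "datacenter": [],
    "registrar": [],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map: component -> the components that depend on it.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed component.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


for component in DEPENDS_ON:
    impact = sorted(affected_by(component, DEPENDS_ON))
    print(f"{component} fails -> also down: {impact}")
```

Components whose failure takes down everything else are the ones to look at first. In this toy map, losing the datacenter takes out the database, the web server, and the website.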
Minimize the impact of outages
The worst feeling that you can have when something breaks is to find that you’re stuck and unable to act to resolve the problem. To avoid this, think through a response to each of the ways that things can break. This may involve building and maintaining additional infrastructure, such as deploying redundant systems, ensuring you have proper backups of critical data, and spreading your services across different providers. Again, we’ll cover these in detail as we work our way through the key pieces of your infrastructure. This extra infrastructure comes at an additional cost, but it can pay off handsomely when you start to have problems by keeping your options open and giving you more choices in how to respond to a particular situation.
Without this, you might find that the only copy of important data sits on a server in a blacked-out datacenter or will take hours to restore from backup. Or that you are unable to direct traffic to another server because you can’t access your DNS servers. The resulting frustration can take years off your life. If you design things carefully, you can keep the pain and stress to a minimum and ensure that an outage of one component doesn’t take you entirely offline.
Match your efforts to your acceptable level of risk
It’s impossible to avoid every possible outage, and it doesn’t make sense to prepare for all of them regardless of their likelihood. Especially for small businesses, it’s important to evaluate the possible risks to your infrastructure and their impact on your business, and then decide what it makes sense to pay (in dollars, time and complexity) to avoid each risk. There isn’t any single set of answers here that applies to all businesses, as each business is different. You have to make intelligent choices about which risks to prepare for and incorporate those into your plan.
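As a starting point for that evaluation, you can compare the expected yearly loss from each risk with what it would cost per year to guard against it. The sketch below is deliberately crude. The risks, frequencies, durations, and costs are all hypothetical. It assumes the mitigation eliminates the loss entirely, and it ignores harder-to-price factors such as added complexity and damage to your reputation.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    outages_per_year: float          # how often you expect it to happen
    hours_per_outage: float          # how long each outage would last
    mitigation_cost_per_year: float  # dollars, including your own time


def expected_annual_loss(risk: Risk, revenue_per_hour: float) -> float:
    """Expected revenue lost per year if the risk is left unmitigated."""
    return risk.outages_per_year * risk.hours_per_outage * revenue_per_hour


def worth_mitigating(risk: Risk, revenue_per_hour: float) -> bool:
    """True when the expected loss exceeds the yearly cost of avoiding it."""
    return expected_annual_loss(risk, revenue_per_hour) > risk.mitigation_cost_per_year


# Hypothetical figures, for illustration only.
REVENUE_PER_HOUR = 150.0
risks = [
    Risk("single web server failure", 2.0, 6.0, 1200.0),
    Risk("loss of an entire datacenter", 0.05, 48.0, 12000.0),
]

for risk in risks:
    loss = expected_annual_loss(risk, REVENUE_PER_HOUR)
    verdict = "mitigate" if worth_mitigating(risk, REVENUE_PER_HOUR) else "accept for now"
    print(f"{risk.name}: expected loss ${loss:,.0f}/yr, "
          f"mitigation ${risk.mitigation_cost_per_year:,.0f}/yr -> {verdict}")
```

With these invented numbers, a redundant web server pays for itself while a second datacenter does not. A different business with different revenue or a lower tolerance for downtime could easily reach the opposite conclusion.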
Plan your response
The worst time to think about how things should function is after something has broken: the stress and pressure of lost revenue and unhappy customers make dealing with an outage even harder. Under this kind of pressure, we professional problem solvers tend to fix the immediate issues, sometimes “shooting from the hip” as we do so, and often overlook larger issues that might cause things to cascade into bigger problems.
Ideally, when an outage hits you have a plan ready and just need to carry out the pre-arranged steps to take care of things. For most companies, this doesn’t mean that you have to put together a 500-page book describing every scenario, but you should take time before a problem occurs to assess the key components that make up your infrastructure, understand how they might fail, and plan your response.
What’s Next?
So far we’ve covered the theory of how to keep your infrastructure reachable to your customers. Fortunately, there are a number of relatively simple things that you can do to make yourself more resilient. Over the next few articles we’ll cover all the elements of your online presence, starting with domain name registration and DNS.
Preparing for an outage in advance certainly adds to the complexity of bringing your company online, but it greatly simplifies keeping your business online. Remember, at the end of the day it’s your company, and no one can take better care of it than you!
—
Jason Abate is the founder and CEO of Panopta, a complete server monitoring solution. Learn more at panopta.com











