What are the key considerations when designing for availability?
This topic matters to architects because they are expected to be experts in availability design.
We’ll cover some basics in this post. Most importantly, we will cover the key areas for achieving 99.999% availability in “The Road to 99.999%” below. Each of those areas is a good topic for another post.
Measuring Availability – How Many 9s?
When we talk about availability, we generally speak of how many 9s are in the percentage of availability. There are generally three broad measures of availability in terms of percentages as applied to information systems.
Three Nines: 99.9%. Most commercial systems have this level of availability. This equates to just under nine hours of unplanned downtime per year. Generally speaking, if you can stand up a server and have normal IT department skills, this is not difficult to achieve.
Five Nines: 99.999%. “Critical” software systems generally require this level of availability. “Critical” systems would include most large-scale financial systems. Companies that do stock trading, for instance, typically aim for “Five Nines”. This equates to about five minutes of unplanned downtime per year.
Seven Nines: 99.99999%. Achieving this level of reliability is generally restricted to “life support” systems such as air traffic control systems. This equates to about three seconds of unplanned downtime per year.
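The downtime figures above follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming a 365-day year; the function name `downtime_seconds_per_year` is just an illustrative choice.

```python
# Convert an availability percentage into the unplanned downtime it allows per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the seconds of downtime per year permitted at this availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("Three Nines", 99.9), ("Five Nines", 99.999), ("Seven Nines", 99.99999)]:
    seconds = downtime_seconds_per_year(pct)
    print(f"{label} ({pct}%): {seconds:,.1f} seconds per year")
```

That works out to roughly 8.8 hours, 5.3 minutes, and 3 seconds respectively.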
Availability across many disciplines. Availability engineering is important for IT, but many other technical disciplines are concerned about “availability” engineering as well. A prominent example would be power plant engineering for electrical systems.
Multidisciplinary. Availability design falls on the architect because achieving high availability is a “multidisciplinary” task. That is, the only way to achieve high availability is to work across multiple IT teams and disciplines and make tradeoffs among them. That is architect work rather than developer work.
“Alive and reachable” is not enough. The simple way to conceive of availability is whether or not an application is reachable. Reachability is necessary, of course, but capacity and speed also need to be considered, among other factors. If a system does not respond in time – even under heavy load – then it is not “available”. Most IT professionals have seen someone troubleshooting an application “ping” a server and declare everything to be fine, but that is not sufficient.
Availability gets expensive fast, so work to SLAs. Most experienced IT professionals assume – correctly – that if we spend a lot on hardware, then our systems will have improved availability. Investing in hardware for any non-trivial system is not, however, the most “effective” way to achieve availability. Rather, good architects will always start with a “Service Level Agreement” (SLA) to define things like concurrent capacity, response time, peak load, and maximum number of allowable failures. You’ll want to take action to achieve the criteria in the SLA – then stop. You’ll want to “overengineer” some parts of your system, but not availability because of the expense.
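To make the “ping is not enough” point concrete, here is a small sketch of a check that treats a slow answer as a failure. The `probe` callable and the `max_seconds` threshold are hypothetical stand-ins for whatever request and response-time target your SLA defines. Note that this only measures how long the probe took; it does not interrupt a probe that hangs.

```python
import time
from typing import Callable


def is_available(probe: Callable[[], bool], max_seconds: float) -> bool:
    """Count the system as available only if the probe succeeds within max_seconds."""
    start = time.monotonic()
    try:
        ok = probe()
    except Exception:
        return False  # unreachable or erroring counts as unavailable
    elapsed = time.monotonic() - start
    # Reachable but too slow is still "not available".
    return bool(ok) and elapsed <= max_seconds


def slow_probe() -> bool:
    time.sleep(0.05)  # simulate a system that answers, but slowly
    return True


print(is_available(lambda: True, max_seconds=1.0))
print(is_available(slow_probe, max_seconds=0.01))
```

The second probe “responds”, so a ping-style check would pass it, but it misses the response-time target and is reported as unavailable.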
KEY POINT: A lot of organizations get hung up on delivering “five 9s” of reliability. You should not invest in five nines without business justification. For instance, I don’t need this blog post to be available 99.999% of the time.
Mainly about infrastructure/hardware. We are concerned with other availability topics, of course, but having the right hardware in place is definitely the most important concern. Without robust and redundant hardware in place, other efforts will not matter much.
Still need a “holistic” approach. While infrastructure is the key enabler, you still need to take a “holistic” approach and consider several key factors that we list below.
KEY POINT: Hardware redundancy is important, but do not get hung up on it.
“Planned” downtime does not count. Most highly available systems do have times that they need to be down for installations and maintenance. As long as these are scheduled and communicated periods, they do not count against measurements of downtime percentage.
The Road to 99.999%
Infrastructure. To achieve 99.999%, you need to have redundancy. More importantly, you will need geographic redundancy. If you have your application deployed in only one data center, you may be able to achieve 99.9%, but you cannot have 99.999%. Infrastructure concerns are conceptually the simplest concerns for availability, but they are also the most expensive.
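A rough way to see why redundancy matters is the textbook formula for independent replicas: the system is down only when every copy is down at once. The sketch below is an idealized model, not a prediction. It assumes failures are fully independent (which is the reason for putting copies in different locations) and ignores failover time and shared dependencies.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when any one surviving copy is enough.

    `single` is the availability of one copy as a fraction, e.g. 0.999.
    Assumption: copies fail independently of one another.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(f"{n} site(s) at 99.9% each: {combined_availability(0.999, n) * 100:.7f}%")
```

Under these assumptions, two 99.9% sites give a theoretical 99.9999%. Real systems land lower because failures are never perfectly independent.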
Data Management. Data management is more important than application design for availability. Many developers simply assume that everything is going to be OK with data management and that “the DBA team” has simply handled those issues. That may be a reasonable assumption if all you want is 99.9%, but if you want to do better than that, you have to partner with the DBA team. It is important to always bear in mind that you can have a lot of application servers, but logically have only a single database. If that single database computer goes offline, gets overloaded, or a transaction is not backed up, then you are subject to unplanned downtime.
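The single-database problem can be sketched with the matching formula for components in series: if every component must be up for the system to work, their availabilities multiply. The numbers below are made up for illustration, and the model again assumes independent failures.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up (fractions, e.g. 0.999)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


app_tier = 0.999999  # hypothetical: a well-replicated application tier
database = 0.999     # hypothetical: one logical database with no redundancy
print(f"End to end: {serial_availability(app_tier, database) * 100:.4f}%")
```

However many application servers you add, the end-to-end figure cannot exceed that of the single database.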
Application Design. There are a number of application design techniques that allow us to reduce the potential for failures. Those are not coding concerns, but are more properly considered “enterprise application design patterns”. These types of patterns are typically associated with service-oriented architecture (SOA) and microservices. Most application developers have not had much exposure to these techniques.
Maintenance and Deployments. Installation software is also software that needs to be validated along with the main application. Installation procedures can introduce availability problems, among other defects, if installs are not validated. Most IT professionals can recall instances where a server patch, a network configuration change, or a server missed during an install caused unplanned downtime.
Error-Handling. Most developers have been exposed to best practices in error-handling with respect to logging, bubbling up exceptions, and what kind of data to show to end users. Those are helpful techniques to get to 99.9% availability, but to do better we have to consider additional measures like retries and timeout handling. We have to design our systems with the assumption that anything they depend on is going to fail and that alternatives have been considered. Specifically, you should never assume that networks, databases, or file systems will be available or work correctly.
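As one illustration of the “retries” idea, here is a minimal retry helper with exponential backoff. The name `call_with_retries` and its parameters are hypothetical, and it assumes the wrapped operation enforces its own timeout and raises an exception when that timeout expires. A production version would usually retry only on errors known to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last failure is re-raised so the caller can fall back to an alternative.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter is only a convenience so the backoff can be tested without actually waiting.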
Monitoring. You can’t know if the system is up or down or what percentage of uptime you have without measuring it carefully and proactively. You must also alert human beings if the system fails to conform to an SLA. With a 99.9% system, you may be able to wait until a customer tells you it is down. To do better than 99.9%, you will have to keep poking the system with monitoring software.
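Measuring uptime from regular probes can be as simple as counting successes. The sketch below assumes you already collect one pass/fail result per probe interval; the function names and the sample data are hypothetical, and the actual alerting (paging a human) is left to the caller.

```python
def uptime_percent(probe_results: list[bool]) -> float:
    """Percentage of probes that succeeded."""
    if not probe_results:
        raise ValueError("no probe results to measure")
    return 100 * sum(probe_results) / len(probe_results)


def breaches_sla(probe_results: list[bool], sla_percent: float) -> bool:
    """True when measured uptime falls below the SLA target, i.e. someone should be alerted."""
    return uptime_percent(probe_results) < sla_percent


results = [True] * 999 + [False]  # hypothetical: one failed probe out of 1,000
print(f"Measured uptime: {uptime_percent(results):.3f}%")
print("Alert for a 99.999% SLA:", breaches_sla(results, 99.999))
print("Alert for a 99.9% SLA:", breaches_sla(results, 99.9))
```

A single failed probe in a thousand is within a 99.9% target but already breaches 99.999%, which is why tighter SLAs need frequent, automated probing.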
Disaster Recovery. No matter how well you design the system, there will be times it goes down anyway. There will be factors that are simply out of your control or that you could not have reasonably anticipated. You will need to document, resource, and rehearse your “business continuity” plan for the same reason that most cities and towns have a fire department and conduct fire drills. Accidents will happen and you have to be prepared for them.