Resiliency redefined: Architectural approaches to achieving service uptime
In the coming years, the need for resiliency and the way it is achieved at the datacenter and application level are expected to change significantly as the world gradually moves to a more cloud-based, hybrid and distributed environment. As we have discussed, advances in hybrid and cloud computing, DevOps, replication, distributed databases and global traffic management are now being combined to offer new ways of achieving higher availability and disaster recovery (DR).
For operators, service providers and CIOs, this area is challenging. For most, service availability is critical and non-negotiable, and so any move to newer, cheaper or more dynamic technologies or architectures must be undertaken with great care. The prize is better resiliency, with risk distributed and spread, and at lower cost of ownership; the price is more complexity, a possible loss of transparency, and a transitionary period of dependency on unproven – or less proven – IT. 451 Research, in conjunction with our sister company, The Uptime Institute, has been researching this area in depth. This report is one of many that will be published during 2017 examining definitions and terminology, architectures and component technologies, cloud service providers and market trends. The Uptime Institute is carefully evaluating how multi-site architectures can best be designed to support higher availability. In this report, we outline the four main architectures for achieving datacenter resiliency in the age of cloud services.
The 451 Take
The CIO's approach to resiliency, availability and recovery is set to change in the next decade – bringing all operators more into line with large-scale cloud providers. For many with mission-critical operations, proven approaches based on tight process control, redundancy and even over-provisioning will still be needed in places; but at the CIO level, there will increasingly be a need for an overall strategy that is less binary and more nuanced. This approach will involve trading risks and costs, and sometimes accepting or supplementing what service providers can offer. The types of resiliency described here will form the building blocks of resilient architectures, increasingly freeing up the applications teams (and cloud customers) to disregard most issues around availability, locality, site facilities and redundancy. The biggest mistake any CIO or operator can make is to assume that, in the new world of IT, resiliency strategy can simply be outsourced.
There are many terms to describe resiliency of IT systems and datacenters – or its aspects. These include uptime, resiliency, availability and DR. 451 Research and the Uptime Institute describe resiliency as:
"The extent to which a system, digital infrastructure or application architecture is able to maintain its intended service levels, with minimal or no impact on the users or business objectives, in spite of planned and unplanned disruptions. It also describes the ability of a system, infrastructure or application to recover full business operations after a disruption or disaster has occurred."
This definition represents a departure from some traditional views, which tend to separate availability (prevention) and DR (recovery). As systems evolve, with more complexity and interdependency, failure and recovery will often be less binary, with systems degraded or missing some services, but with other functions unaffected. There will also be increasing focus, with distributed cloud services, on 'zero' recovery time – which is effectively the same as high availability as a preventative strategy.
There are different ways of measuring availability/resiliency, which we view with considerable wariness. The most common of these is to measure the number of hours that a service is up and running as a percentage of total time – this usually ranges from 99.9% to 99.999%. Many of these terms, and the numbers, need to be treated with caution. As with a datacenter PUE (power usage effectiveness) number, there are many different ways to measure availability. There may, for example, be different numbers cited for design expectations, for maximum or average achieved availability, and for projected or likely availability. Similarly, many service providers will claim a service was not down, but 'impaired' or 'sub-optimal.' As we stated, the idea of services being simply up or down is becoming outmoded, and marketing teams will often seize on this gray area.
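To make the 'nines' concrete, the short sketch below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and treats availability as a simple up/down ratio, which is exactly the simplification this report cautions against. The function name `allowed_downtime_hours` is ours, for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability figure."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime a year, while 99.999% allows a little over five minutes. The calculation says nothing about whether that time was spent 'down' or merely 'impaired.'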
Types of resiliency architectures
In recent years, traditional approaches to achieving resiliency have been evolving, with the cloud-based models enabling some radical departures from traditional single- or dual-site resiliency. In the years ahead, we expect many more service providers and enterprises to adopt some form of distributed resiliency, and that eventually, as networks, IT and management understanding improve, this will become the de facto method for minimizing downtime and disruption. It is not clear, however, to what degree this will enable operators to reduce on-site resiliency at the physical level: datacenters will increasingly be components in a distributed fabric, and those without on-site resiliency will become a weak link in multiple chains. At the same time, where this is planned and controlled, costs can be reduced by spreading functions and risk.
Currently, 451 Research/Uptime Institute see the following four models being used:
- Single-site availability: This is the traditional setup, with high levels of redundancy at the infrastructure level, including facilities and basic IT. With sufficient redundancy and planned design, operations can continue in spite of planned maintenance (concurrent maintainability) and, in some cases, unplanned facilities failure.
At the IT level, resilience is further assured by internal replication (e.g., clusters), so that loads may be replicated elsewhere, and data/applications/configurations backed up off-site to a DR site.
- Linked site resiliency: This describes two or more lower-tier datacenters connected within a campus, region or zone using a dedicated network to achieve a higher level of availability than any individual site, typically within synchronous replication distance. This means that the datacenters are near enough to each other and to customers that they can always be kept synchronized; the distance will depend on the applications, but is usually within 50 miles. In order to achieve the same or a higher level of facility availability as an expensive high-availability single-site datacenter, it may be possible to double up and share infrastructure with nearby in-zone datacenters. This assumes resilient and sufficient network capacity with predictable and independent pathways.
In this configuration, concurrent maintainability (downtime at one site does not disrupt service) is possible as long as there is sufficient capacity and processes are in place to support full operations at either site.
At the IT level, this setup can be used to support either synchronous (fault-tolerant automated failover to the second site) or asynchronous replication (a second copy of applications, data and files is kept at the second site to pick up the load).
- Distributed site resiliency: This term describes two or more independent sites, in or out of region or globally distributed (cloud or not), using shared internet/VPN networks to provide resiliency through multiple asynchronously connected instances. This can provide very high availability but can result in some loss of integrity between instances, usually minor, if outages occur.
At the IT level, distributed site resiliency is the architecture that underpins most DR services, and especially the modern cloud iteration, DRaaS (DR as a service). Improved network capacity, software tools, database synchronization protocols and, critically, homogeneous IT infrastructure running virtualized workloads have now made this option far more practical, flexible and economically feasible, both for active/active operations and for backup and recovery. As more distributed management technologies are added, distributed site resiliency can support or blur into in-cloud resiliency.
- Cloud-based resiliency: This term describes resiliency provided by distributing virtualized applications, instances and/or containers, with associated data, across multiple datacenters, using middleware, orchestration and distributed databases. These will be under the control of a comprehensive and distributed control system. These systems will enable service or design choices to be made between, for example, absolute database integrity or immediate availability.
Effectively, cloud-based resiliency moves the resiliency up to the IT level. Any facility resilience achieved through redundancy provides added security, but may not prove essential. It does, however, assume that there is sufficient capacity in place, including the network, which is critical if loads are shifted from place to place. Importantly, developers do not need to concern themselves with location or infrastructure – this architecture is primarily suited for stateless or 'cloud native' applications.
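One way to reason about why linked or distributed sites can exceed the availability of any single site is the standard parallel-redundancy calculation, sketched below. It assumes that site failures are statistically independent and that any one surviving site can carry the full load. Neither holds automatically in practice (shared networks, shared software and capacity limits all introduce correlation), so the result is best read as an upper bound. The function name and the example figures are illustrative and are not drawn from 451 Research or Uptime Institute data.

```python
from math import prod


def combined_availability(site_availabilities: list[float]) -> float:
    """Availability of a service that stays up while at least one site is up.

    Assumes independent site failures and full failover capacity at each site.
    Inputs and output are fractions between 0 and 1.
    """
    if not site_availabilities:
        raise ValueError("at least one site is required")
    for a in site_availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availabilities must be between 0 and 1")
    # The service is down only when every site is down at the same time.
    all_down = prod(1.0 - a for a in site_availabilities)
    return 1.0 - all_down


if __name__ == "__main__":
    two_sites = combined_availability([0.999, 0.999])  # hypothetical figures
    print(f"Two sites at 99.9% each: {two_sites * 100:.4f}%")
```

On these idealized assumptions, two hypothetical sites at 99.9% each would combine to about 99.9999%. The gap between that figure and what is achieved in practice is largely a measure of how independent the sites and their network pathways really are.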
How differing approaches compare
The table below sets out the basic level of resiliency that may be achieved – with some added rows to describe some different architectures within these types. Once again, this is not binary. For example, a single Tier III datacenter may not in all cases support continuous service during a component loss – but in many or most cases, it will.
The column for data integrity is important. ACID databases maintain high integrity at all times, even at the cost of availability. BASE architectures allow datacenters in different locations to fall out of sync but maintain very high availability. This has been described in previous 451 Research reports and will also be updated in a forthcoming report.
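The trade-off can be illustrated with a toy two-site store. This is a deliberately simplified sketch, not a real database protocol: the class `TwoSiteStore`, its `mode` values and the `link_up` flag are all hypothetical names introduced here. In 'sync' mode a write is refused when the second site is unreachable (integrity over availability); in 'async' mode the write is accepted locally and queued, so the sites temporarily diverge until the link returns.

```python
class TwoSiteStore:
    """Toy model of two sites holding copies of the same key/value data."""

    def __init__(self, mode: str) -> None:
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.link_up = True  # whether site B is reachable from site A
        self.site_a: dict[str, str] = {}
        self.site_b: dict[str, str] = {}
        self.pending: list[tuple[str, str]] = []  # writes not yet applied at B

    def write(self, key: str, value: str) -> bool:
        """Write at site A; return True if the write was accepted."""
        if self.link_up:
            self.site_a[key] = value
            self.site_b[key] = value
            return True
        if self.mode == "sync":
            # Integrity first: refuse the write rather than let sites diverge.
            return False
        # Availability first: accept locally and replicate later.
        self.site_a[key] = value
        self.pending.append((key, value))
        return True

    def restore_link(self) -> None:
        """Bring the link back and replay queued writes at site B in order."""
        self.link_up = True
        for key, value in self.pending:
            self.site_b[key] = value
        self.pending.clear()

    def in_sync(self) -> bool:
        """Return True when both sites hold identical data."""
        return self.site_a == self.site_b


if __name__ == "__main__":
    store = TwoSiteStore("async")
    store.link_up = False            # simulate a network outage
    print(store.write("order", "42"), store.in_sync())
    store.restore_link()             # sites converge once the link returns
    print(store.in_sync())
```

The same outage handled in 'sync' mode would reject the write and leave both sites identical, which is the availability cost that the ACID approach accepts.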
Clearly, each type of resiliency architecture fulfills different purposes and has a different profile, in terms of objectives, cost, level of availability and technical maturity. Cloud-based resiliency is the newest, and currently the most expensive; it may provide good total cost of ownership, but effectively can only be achieved at scale and with considerable capital. The types are not mutually exclusive, at least at the facilities level. For CIOs setting out to assess how to develop appropriate resiliency strategies, this is a challenging period because engineering control is being eroded, to be replaced with a more nuanced and strategic approach where good assessments are needed.
With cloud services and architectures now part of the mix, or even the totality, the CIO must determine which type(s) of resiliency is most appropriate for each type of application and data, based on business needs and technical risk, and then architect the best technical combination of IT infrastructure. This will span datacenter resiliency, applications, databases and networking, and must take into account organizational structure, processes, tools and automation. From all this, the organization must then deliver comprehensive and consistent applications that meet or exceed customer expectations for service availability and resiliency.