Distributed resiliency is not trivial, not cheap, and not for all

Will it be possible, at some time in the not-too-distant future, for enterprises, colocation companies and cloud service providers to dispense with all the heavy infrastructural gear of the mission-critical datacenter and operate with lightweight distributed IT? How feasible is it to rely on emerging technologies that dynamically replicate or shift workloads and traffic whenever a failure looms? This prospect has been tantalizing many in IT for a decade or more, and for a few, it's a reality – of sorts. Big cloud service providers, in particular, have long boasted they can rapidly switch traffic between sites when problems arise, and so they build datacenters that are optimized for cost, not availability. In their environments, they say, developers needn't care about failures. That is all taken care of. Meanwhile, some operators and enterprises have replaced their expensive and mostly dormant DR sites with subscriptions to cloud services; others have replicated their loads across a distributed fabric of sites that are always active, and which can fail without great consequence.

All of this points to a potentially big and disruptive change in the areas of physical infrastructure, datacenters, business continuity and risk management. But as ever, the hype can obscure the reality: a new report, Next-Generation Resiliency, by 451 Research and Uptime Institute Research, suggests that although distributed resiliency is likely to be an increasingly used, and even dominant, architecture in the years ahead, it is also proving to be complex and demanding. The engineering diligence that is so key at the mission-critical infrastructure layer doesn't map easily onto the web of services that is the foundation of modern architectures.

The report is part of an in-depth and ongoing collaborative research initiative by 451 Research and its sister company, the Uptime Institute, which certifies designs and operations and advises on resiliency and efficiency at the datacenter site level and beyond. In association with this, Uptime is holding a series of free executive deep-dive workshops, the next of which is in London on October 13. See the end of this report for an invitation.

The 451 Take

The prize for those attempting to design a distributed, active, replication-based approach to resiliency can be very significant: availability can be higher overall, with risk distributed, and potentially at a lower cost of ownership than many current architectures. Indeed, we believe that even if there were no other advantages, advances in applications, cloud and IT management systems would dictate a move in this direction anyway – for most organizations. There is a price, however: more complexity in design and risk assessment is likely, as well as a potential loss of transparency and control, and a transition period that could involve considerable costs, complexity and a dependency on unproven – or less proven – IT. Negotiating a clear path ahead in this area presents challenges for operators, service providers and CIOs.

Next-generation resiliency
The Next-Generation Resiliency report identifies four levels of resiliency, which have been documented in previous research. These are single-site (plus backup or DR); linked-site (closely associated active-active); distributed-site resiliency (load sharing among three or more datacenters); and cloud-based resiliency (a single resilient environment spread across multiple datacenters). These are not necessarily alternatives – prudent CIOs will likely find themselves using some or all of these approaches. The report finds that each of these approaches introduces new layers of complexity, as well as the promise of higher availability and greater efficiency.
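The distributed-site model above can be reduced to a simple principle: load is shared among three or more active sites, and a failed site's share is redistributed among the survivors. A minimal, hypothetical sketch (the site names and the even-split policy are illustrative assumptions, not taken from the report):

```python
# Illustrative sketch of distributed-site resiliency: total load is
# shared evenly across whichever sites are currently healthy, so the
# loss of one site shifts its share onto the remaining sites.

def distribute_load(sites, total_load):
    """Split total_load evenly across sites reported as healthy.

    sites: dict mapping site name -> bool (True = healthy).
    Returns a dict mapping each healthy site to its share of the load.
    """
    healthy = [name for name, up in sites.items() if up]
    if not healthy:
        raise RuntimeError("no healthy sites available")
    share = total_load / len(healthy)
    return {name: share for name in healthy}

# Three active sites, all healthy: each carries one third of the load.
sites = {"london": True, "frankfurt": True, "dublin": True}
print(distribute_load(sites, 900))   # each site carries 300.0

# One site fails: its share is redistributed to the remaining two.
sites["frankfurt"] = False
print(distribute_load(sites, 900))   # london and dublin carry 450.0 each
```

In practice, of course, the redistribution policy is far more involved – real implementations must weigh capacity headroom, latency, data locality and compliance – which is precisely where the complexity the report describes comes from.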

Critically, the report suggests that for new cloud-native applications, it is trivial to take full advantage of distributed resiliency capabilities – not because resiliency is trivial, but because the cloud provider has made the investment in redundancy, replication, load management and distributed data management. But for most existing applications, including many that are cloud-optimized rather than cloud-native, it is much more complicated. Some applications need rewriting, some will never transfer across, and several factors such as cost, compliance, transparency, skills, latency and interdependencies add complexity to the decision. For these reasons, complicated hybrid architectures will prevail for many years. The report lists more than 20 technologies that may be involved in building truly or partially distributed resiliency architectures.

Among the key findings of the report are:

  • Many of the benefits associated with a move to distributed resiliency are compelling. The long-term vision is that resiliency ultimately becomes autonomic – managing itself, shifting loads and traffic across geographies according to needs, replicating data and optimizing for performance and costs.

  • Over the next decade, we expect that resiliency and redundancy at the individual datacenter level will, in whole or part, be complemented or replaced by end-to-end resiliency architecture at the software/data/networking level.

  • In spite of this, most critical datacenters will still need to be highly available. Even hyperscale cloud operators continue to use uninterruptible power supplies, generators and other protections for power and cooling.

The report concludes that the use of distributed resiliency and a complex, hybrid web of datacenters, distributed applications, and outsourcing services and partners will be problematic for executives seeking good visibility and governance of risk. Outsourcing can mean that cloud providers have power without responsibility, while CIOs have responsibility without power. This is creating a need for better governance, transparency, auditing and accountability.

Uptime Institute Research initiative
Throughout this year, Uptime Institute, in partnership with 451 Research, has been researching this area, consulting widely and holding discussion events with experts. Research reports will continue to be available to qualifying 451 Research clients and Uptime Institute network members.

The goals of this program are:

  • To understand and validate the value and benefits of new distributed models of IT resiliency, and spell out the implications for corporate IT, suppliers and operators.
  • To identify and understand the value, as well as any limitations, weaknesses and costs of more distributed models of resiliency (when compared with traditional approaches).
  • To consider methodologies and assessment processes for evaluating and rating resiliency, and the value of applying such methods across the industry. Uptime itself is assessing methods for accuracy.

The next discussion is the Hybrid Resiliency Summit, a one-day deep-dive workshop in London on October 13 (max 20 people). This is suited to all executives, technologists and IT enterprise managers with a deep interest in the subject. For an invitation, contact andy.lawrence@451research.com.

Andy Lawrence

Research Vice President - Datacenter Technologies (DCT) & Eco-Efficient IT
