A threat to socioeconomic stability: Are cloud providers too big to fail?

With much of society's critical infrastructure relying on just a few cloud players, the impact of a cloud provider's collapse could extend far beyond IT and into economic stability and day-to-day life.

The 451 Take

Companies whose revenue significantly depends on public cloud face the most risk from outages. Although the risks of a well-architected application being unavailable are tiny, the resultant loss in revenue from a major incident could be damaging to those businesses. But the broader socioeconomic impact is perhaps the greater threat. If a CRM goes down for a few days, how many companies wouldn't be able to process orders? How many would lose revenue? How would this affect debt and market stability or unemployment? And how would this affect you and me? We think there is a risk from concentrating workloads today, but most large enterprises are wisely splitting their workloads so that the broader impact of a failure is very small indeed. In particular, use of private cloud is reducing concentration risk by dispersing critical workloads across several locations. As such, cloud providers are not yet too big to fail: a failure would have a substantial impact, but wouldn't affect economic stability so much that government assistance would be needed. But this security will continue only if enterprises continue to put their eggs in multiple baskets.

Eggs in one basket
'Don't put all your eggs in one basket,' goes the proverb. If that basket breaks, you lose them all. If you distribute them among several baskets and one breaks, you still have the eggs in the others. One basket may break, but the odds of all the baskets breaking are very low.

This concept led to availability zones and regions in the cloud. Over a few years, it is probable that virtual machines or other resources in a single place will become inaccessible for a short period. Downtime is inevitable; the key is to build in robustness. Availability zones are separate locations in the same region, and the chance of multiple availability zones going down at the same time is far smaller than the chance of one going down.

This is reflected in Amazon Web Services' SLAs: there is no guarantee of availability of a virtual machine on a single availability zone. AWS is confident enough to offer such a guarantee only when a developer has architected the application across two or more availability zones.

If you wanted to further reduce the odds, you could deploy virtual machines across multiple regions. So even if there were a catastrophic power outage across a large area, the other region could take the strain. What are the odds of both regions going down at the same time? Really quite slim.
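The arithmetic behind this reasoning can be sketched in a few lines of Python. This is only an illustration: the 1% figure is a hypothetical probability invented for the example, not a measured failure rate for any provider, and the calculation assumes that locations fail independently of one another.

```python
def joint_failure_probability(p_single: float, locations: int) -> float:
    """Probability that every location is down at the same time.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if locations < 1:
        raise ValueError("locations must be at least 1")
    return p_single ** locations


# Hypothetical chance that one zone is unavailable at a given moment.
P_ZONE_DOWN = 0.01

for n in (1, 2, 3):
    p_all_down = joint_failure_probability(P_ZONE_DOWN, n)
    print(f"{n} location(s): probability all are down = {p_all_down:.6%}")
```

The independence assumption is the weak point. An event that hits several zones or regions at once makes their failures correlated, so the true joint probability would be higher than this simple product suggests.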

Here we introduce the theory of a 'black swan' event. For centuries, it was thought that black swans did not exist, to the extent that a 'black swan' was used to describe an impossible or nonexistent situation. Then in 1697, black swans were discovered in Australia. It was a revelation because no one thought they existed – all the historical data had suggested that they did not. There are black swan events today; things that, for example, 10 years ago would have been regarded as impossible: two of the most destructive hurricanes ever to hit the US landing within two weeks, Britain leaving the EU, the European sovereign debt crisis or the Fukushima nuclear disaster. The world is a complex system, and past performance and behavior are certainly not indicative of the future.

So what are the chances of a virtual machine going down over a year? High. How about across two availability zones at the same time? Slim. How about across a region? Very small. How about an entire major cloud provider going down? Surely the likelihood is tiny – perhaps akin to a black swan. But it is possible; no matter how much we reduce this risk, it is always there. It hasn't happened so far, but cloud has been in existence for only about ten years. If the odds are so slim, perhaps we just haven't had enough time to see such a black swan event.

The risk comes in the form of technology – a security breach could bring down all of AWS, Microsoft or Google, at least temporarily – but the steps these companies take reduce it to a highly remote possibility. Could an errant administration script delete resources from multiple regions? It's possible, but unlikely. Cloud providers can take steps to reduce the risk from security incidents, chaotic acts, messy code or keying errors. But as the complexity of cloud systems increases, so do the chances of a catastrophe. It's easy to protect the perimeter of a small apartment, but protecting an entire estate is harder. Similarly, as a cloud provider introduces new services, interactions, regions and access points, the risk of a blind spot increases. And this blind spot could be where the next catastrophic failure happens. Something no one expects – a black swan.

But perhaps the bigger threat is from external socioeconomic conditions. What is the likelihood of one of the following happening: a collapse of US-EU relations, a US general strike, an earthquake destroying submarine cables, the reversal of the magnetic poles or a world war? Very slim. These are extreme suggestions precisely because they are so unlikely; nevertheless, they are possible. How would these occurrences affect cloud providers? And more important, how would they affect the companies that put their applications on those clouds?

If any of these unexpected situations took place, there would be a dramatic impact on access to cloud services and thus cloud stability. New markets would open, others would close, large companies' share prices could be reduced to nothing, others could have a resurgence. Cloud companies with diverse offerings from books to search engines to software could be affected in other lines of business, thereby affecting the overall financial resiliency of their cloud divisions.

But the greater effect could be beyond cloud: the food supply chain could be disrupted, individuals could go into debt waiting for payments to be processed, 911 calls could go unanswered. This seems like hyperbole, but consider how much of day-to-day life depends on functioning IT systems. The more of these systems that depend on the same shared capability to stay online, the greater the impact of a failure.

The cloud market eggs are distributed among a few baskets. What is the chance of these baskets breaking? Tiny, but possible. What is the impact of the baskets breaking? Fortunately, we can measure it.

How many companies are fully reliant on public cloud?
The only defense against a black swan event, and not a guaranteed one, is robustness. This has a cost implication: running virtual machines in two zones is twice the cost of running them in one zone, for example. But if you fail to build in tolerance and the worst happens, the outcome can be far more expensive than the cost of duplication. Forty-five percent of respondents to 451 Research's Voice of the Enterprise: Cloud Transformation, Workloads and Key Projects 2017 survey stated their business would be measurably affected if their email went down for more than an hour. The loss resulting from such an outage would far outweigh the cost of duplicating zones.
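A rough way to frame this trade-off is to compare expected annual cost with and without duplication. The Python sketch below does so using figures that are entirely hypothetical (the infrastructure cost, the outage probability and the outage loss are invented for illustration), and it assumes the two zones fail independently.

```python
def expected_annual_cost(infra_cost: float, p_outage: float, outage_loss: float) -> float:
    """Infrastructure spend plus the probability-weighted loss from an outage."""
    return infra_cost + p_outage * outage_loss


# All figures below are hypothetical and chosen only to illustrate the trade-off.
COST_PER_ZONE = 10_000      # annual cost of running the workload in one zone
P_ZONE_OUTAGE = 0.05        # chance of a business-affecting outage in one zone per year
OUTAGE_LOSS = 1_000_000     # revenue lost if the application is unavailable

# One zone: pay once, exposed to a single zone's outage risk.
single = expected_annual_cost(COST_PER_ZONE, P_ZONE_OUTAGE, OUTAGE_LOSS)

# Two zones: pay twice, but the application is down only if both zones fail together.
dual = expected_annual_cost(2 * COST_PER_ZONE, P_ZONE_OUTAGE ** 2, OUTAGE_LOSS)

print(f"Single zone expected annual cost: {single:,.0f}")
print(f"Two zones expected annual cost:   {dual:,.0f}")
```

With these made-up inputs, doubling the infrastructure spend lowers the expected total cost because the probability-weighted loss shrinks by far more than the extra spend. Different inputs can reverse the result, which is why the decision depends on how much revenue rides on the application.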

But building resiliency across two separate cloud companies would be a far more significant expense: an enterprise would need two teams that each understand one cloud's technology, and two different versions of the workload, because it is unlikely the workload would be portable. Resolving issues would be more complicated across two cloud providers, with two points of contact and different technologies. Is it worth building such a high level of robustness for a small risk of a black swan event wiping a single cloud provider off the market temporarily (or even permanently)?

We can measure how at risk the cloud market is today through the concentration ratio: the percentage of the market represented by the four largest companies (the CR4). A low ratio suggests a market full of providers, none of which is dominant; a high ratio suggests a limited number of providers have dominance. In a concentrated market, if one provider disappears, a significant portion of the market will be affected. According to 451 Research's Market Monitor, the CR4 of the global IaaS market is 71%, which is medium-high as a ratio. The big risk is AWS, which accounts for the majority of that 71%.
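For readers unfamiliar with the measure, the CR4 is simply the sum of the four largest market shares. The Python sketch below shows the calculation. The provider names and share values are invented for illustration and were chosen only so that the top four sum to 71; they are not 451 Research's Market Monitor data.

```python
from typing import Iterable


def concentration_ratio(shares: Iterable[float], n: int = 4) -> float:
    """Sum of the n largest market shares (CR4 when n is 4)."""
    return sum(sorted(shares, reverse=True)[:n])


# Hypothetical market shares in percent; the remainder is a long tail of small providers.
hypothetical_shares = {
    "Provider A": 40,
    "Provider B": 15,
    "Provider C": 10,
    "Provider D": 6,
    "Provider E": 5,
    "Provider F": 4,
    "Provider G": 3,
}

cr4 = concentration_ratio(hypothetical_shares.values())
largest = max(hypothetical_shares.values())

print(f"CR4: {cr4}%")
print(f"Largest provider's portion of the CR4: {largest / cr4:.0%}")
```

The second figure matters as much as the first: a CR4 dominated by one provider concentrates risk more than the same CR4 split evenly among four.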

In February, a single AWS region, US-EAST-1, suffered high error rates on its S3 object storage service for less than a day. As a result, businesses relying on S3 were affected, including Giphy, Quora, Slack, Docker, GitHub, Expedia and Medium. These companies' business models rely on being accessible from the internet – without accessibility, they lose revenue. In fact, one estimate puts the cost of the outage for S&P 500 companies at $150m, and for US financial service companies at $160m.

Imagine the scenario, completely hypothetical and unlikely, where AWS or Microsoft Azure experiences a major incident that takes down all its datacenters for several days, perhaps because of a global extreme weather event. When AWS's S3 in a single region went down for a few hours, some users of IoT lights couldn't switch them off. Imagine the impact when an entire cloud goes down. Companies using cloud rely on it for their revenue. Without revenue, there are greater implications: bad debt, crumbling share prices, layoffs, foreclosure. Individuals at those companies are affected, as are global markets. If banks are affected, global financial security could be affected. If healthcare companies are affected, lives could be at risk.

This all sounds improbable: how can the failure of one cloud provider affect those beyond its domain of expertise? If a percentage of the world's supermarkets can't support their supply chain, there is an impact on the availability of food. If a percentage of the world's banks can't process payments, this can leave companies and individuals high and dry.

Fortunately, most companies aren't placing all their eggs in one basket. In fact, the majority are distributing their own workloads between public and private cloud in a hybrid model. Data from 451 Research's Voice of the Enterprise: Cloud Transformation, Vendor Evaluations 2017 survey shows over three-quarters of enterprises that use AWS are also using VMware, for example.

The addition of private cloud here is crucial in reducing concentration risk and isn't reflected in the CR4. If a private cloud goes down, only one company is affected and others can continue to operate, which drastically reduces the impact of a failure. If a public cloud goes down, multiple companies – many playing in the same domain – are likely to go down, multiplying the impact of the failure. Similarly, the global financial crisis in 2008 was amplified by financial institutions' reliance on each other – when one toppled, the others were affected as a result of bad debt. As such, private cloud is the most resilient cloud model in terms of broader economic and societal stability, because the failure of any single private cloud is unlikely to have a substantial impact on society or on global or national economies. 451 Research is conducting further research in next-generation resiliency.

Most enterprises are far more concerned with protecting their own revenue than with global socioeconomic stability. But often these go hand-in-hand: a hybrid cloud model reduces the concentration ratio for broader society, while reducing the impact of a failure on the public or private cloud side.

Too big to fail
'Too big to fail' refers to the issue of the government stepping in to save a failing company to protect its citizens, particularly in the case of a financial institution. Are the cloud providers now too big to fail?

We say no. Most sensible enterprises are splitting their workloads today, and aren't betting wholly on public cloud. In particular, the continued strong use of private cloud suggests a single cloud provider isn't too big to fail. The impact of AWS or Azure suddenly disappearing would be hugely significant and would have some impact on national economies, but could it trigger a recession or social disaster for the long term? We don't think so. Private cloud is reducing concentration risk, and the chance of an entire cloud provider disappearing is tiny, although it is still a risk. If the major cloud players take in workloads from on-premises infrastructure in the future, this risk will remain small, but the impact of a failure could have greater repercussions.

Owen Rogers

Research Director, Digital Economics Unit
