The outage at one of the Amazon Web Services (AWS) data centers last week has led many to question the viability of public cloud services as a whole.
As a company that provides storage solutions for private and public clouds (including AWS), we would respectfully disagree.
Failures of one form or another are endemic to data center operations. Routinely, servers and disks fail, power lines get cut, surges in usage overwhelm available resources, blizzards keep employees from getting to the data center, etc. Less routinely, floods, fires, natural disasters, or 9/11-type events take out entire data centers.
The question should really not be “Can failures occur in the public cloud?” Of course, failures will occur. The real questions should be “How frequently do those failures occur?” “How easy is it to minimize the impact of those failures?” and “How expensive is it to reach a particular availability level?”
While the outage at the AWS data center last week garnered a lot of attention (no doubt because of the high profile of impacted customers like Reddit and Quora), it is worth noting that several high-profile services that run their own data centers (Twitter, Gmail, etc.) have experienced similar failures over the past several months. Moreover, I would bet that most of the customers at this particular data center would have experienced far more frequent and serious failures had they been running their own operations.
There are several reasons for this:
a) Because of economies of scale, Amazon can invest far more in robust and redundant power supplies, cooling equipment, physical and logical security, etc. than a typical, self-run data center. They can also invest more in monitoring services and employees to respond to device failures.
b) Because of the basic Amazon infrastructure (as well as supporting technologies such as Gluster), it is possible to distribute workloads across multiple devices, thus minimizing the impact of any particular device failing.
c) For the same reasons, it is easy to flexibly and quickly provision additional resources to handle usage spikes.
d) Implementing all of the above in a self-run data center would be beyond the budgets of many organizations.
All of the above items relate to the frequency and impact of failures within a particular data center.
It is important to note that with AWS (and Gluster), companies also have the ability to minimize the impact of the failure of an entire data center.
Within any AWS geographic region (e.g., East, West, Europe, Japan), there are multiple availability zones. These are physically separate data centers (with separate power supplies, water supplies, personnel, etc.) connected by high-bandwidth links. Because Gluster makes it possible to easily and instantly replicate between two, three, or more availability zones, our customers in AWS East were not taken out by this incident, as their data was also replicated into availability zones that were not impacted.
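To make the mechanism concrete, here is a minimal sketch of setting up a replicated Gluster volume across two EC2 instances in different availability zones. The hostnames, brick paths, and volume name are hypothetical; the commands are the standard GlusterFS CLI.

```shell
# On node1 (availability zone A): add the peer in zone B to the trusted pool
gluster peer probe node2.example.com

# Create a volume with replica count 2, one brick per availability zone.
# Every file written to the volume is stored on both bricks, so losing
# either zone leaves a complete copy in the other.
gluster volume create gv0 replica 2 \
    node1.example.com:/data/brick1 \
    node2.example.com:/data/brick1

# Bring the volume online
gluster volume start gv0

# On a client: mount the volume via the GlusterFS native client
mount -t glusterfs node1.example.com:/gv0 /mnt/gv0
```

Because replication is synchronous, a client sees its write acknowledged only after both bricks have it; if one zone goes down, clients continue reading and writing against the surviving replica.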
Of course, there is always the possibility that large natural or man-made disasters could affect an entire geographic region. Fortunately, we are very close (within weeks) to offering the capability to easily keep data replicated and in sync across AWS regions and between a private cloud and a public cloud. Thus, even in the event of a widespread natural disaster affecting an entire region, it will be possible (and economical) to ensure wide geographic availability.
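As a rough sketch of how such asynchronous cross-region replication is driven from the Gluster CLI (the geo-replication feature referred to above), the shape of the commands is roughly as follows. The volume names and the remote host are hypothetical, and the exact syntax may differ in the shipped release.

```shell
# Start asynchronously replicating the local volume gv0 to a volume
# gv0-dr hosted in another AWS region (or a private cloud).
# The slave is identified as <remote-host>::<remote-volume>.
gluster volume geo-replication gv0 dr-node.example.com::gv0-dr start

# Check how far the remote copy lags behind the master
gluster volume geo-replication gv0 dr-node.example.com::gv0-dr status
```

Unlike the synchronous replication between availability zones, geo-replication is asynchronous: writes are acknowledged locally and trickled to the remote region, which keeps latency low while still bounding data loss in a regional disaster.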
More information: Gluster in Amazon EC2