Fault-Tolerance in Cloud Services
On February 28, 2017, companies such as Quora, Slack, Expedia, and GitHub found themselves inexplicably inaccessible to their customers. They, along with many other companies large and small, scrambled to find the problem, as each passing second meant significant losses in revenue. It was only about four hours later that their websites became functional again, but by then the damage was done: S&P 500 companies were estimated to have lost $150 million, while US financial services companies suffered a $160 million loss (http://www.businessinsider.com/aws-outage-hurt-internet-retailers-except-amazon-2017-3).
The one thing that these companies had in common: the use of Amazon Web Services.
Earlier that day, a member of the Amazon S3 (Simple Storage Service) team executed a command intended to remove servers from one of the S3 subsystems. Unfortunately, the input to the command was entered incorrectly, and a much larger number of servers was removed than intended (https://aws.amazon.com/message/41926/). With more than 140,000 websites using Amazon S3, a hefty chunk of the internet found itself offline for the duration of the outage (https://techcrunch.com/2017/02/28/amazon-aws-s3-outage-is-breaking-things-for-a-lot-of-websites-and-apps/).
While this may seem to be a cautionary tale against relying too much on services like AWS and Microsoft Azure, it is important to note that for the most part, the benefits of using these services far outweigh the risks. Indeed, it is likely safer to rely on these services rather than any home-grown alternatives.
In his talk at the first lecture, Chris Cruz, Deputy State CIO of the State of California, espoused the use of these services, citing benefits such as gains in speed and the economics of the pay-as-you-go pricing structure (https://canvas.stanford.edu/courses/67124/files?preview=1938288). In light of the AWS outage, however, it also seems prudent to examine exactly what mechanisms cloud services like Microsoft Azure and AWS have in place to provide reliable service.
First, let us examine Microsoft's Service Fabric platform. Its first fault-tolerance mechanism is the notion of "Fault Domains": logical segments that demarcate shared dependencies in the host hardware. Notably, if two VMs are in different fault domains, they do not depend on the same power source or network switch (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). Replicating one's data or service across VMs in different fault domains therefore allows Service Fabric to recover easily from the loss of a VM in any one fault domain. Even if multiple replicas are lost, as long as a majority of the replicas remain operational, the service can continue to serve requests as usual (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). Users can then choose their own target number of replicas according to how critical their service is.
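To make the majority rule concrete, here is a minimal Python sketch (my own illustration, not Service Fabric code): with a target of five replicas spread across five fault domains, the service remains available as long as at least three of them are healthy.

```python
# A minimal sketch (not Service Fabric's actual implementation) of the
# majority-quorum rule described above: a replicated service stays
# available as long as more than half of its replicas are healthy.

def is_available(total_replicas: int, failed_replicas: int) -> bool:
    """Return True if a majority of replicas are still operational."""
    healthy = total_replicas - failed_replicas
    return healthy > total_replicas // 2

# With 5 replicas spread over 5 fault domains, losing any 2 still
# leaves a majority (3 of 5), so the service keeps serving requests.
if __name__ == "__main__":
    for failed in range(6):
        print(f"5 replicas, {failed} failed -> available: {is_available(5, failed)}")
```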
Microsoft has even made provisions for extreme events such as the loss of an entire data center. To protect against such catastrophic events, users are able to back up their service state "to a geo-redundant store" (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). This is another advantage of using services from companies such as Amazon and Microsoft, which have data centers scattered across the world. Not only can users choose where they want their information physically stored (which could be important based on the sensitivity of the material), but they also have the option of backing up their data across very different locations, thus hedging against the risk of complete data loss in the event that a data center goes down.
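To sketch the idea, the snippet below (the region names and upload_backup helper are hypothetical placeholders, not an actual Azure API) writes the same service-state snapshot to stores in two geographically separate regions, so the loss of any single data center does not mean the loss of the data.

```python
# Illustrative sketch only: regions and upload_backup() are hypothetical,
# but the pattern mirrors the geo-redundant backups described above --
# one snapshot of service state, copied to stores in distant regions.

BACKUP_REGIONS = ["west-us", "north-europe"]  # hypothetical region names

def upload_backup(region: str, snapshot: bytes) -> None:
    # In a real system this would call the cloud provider's storage API
    # for the given region; here it is just a placeholder.
    print(f"uploaded {len(snapshot)} bytes of service state to {region}")

def backup_service_state(snapshot: bytes) -> None:
    """Write the same snapshot to every configured region."""
    for region in BACKUP_REGIONS:
        upload_backup(region, snapshot)

backup_service_state(b"serialized service state")
```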
Similarly, Amazon provides the notion of Availability Zones, each of which is isolated from failures in the other zones. Specifically, zones are "independently powered and cooled, and have their own network and security architectures" (https://www.singlehop.com/blog/aws-fault-tolerance-redundancy-ec2/). Once again, replicating one's service across these zones provides a form of redundancy that reduces the risk of downtime.
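As a rough sketch of what this can look like in practice, the snippet below uses boto3 (the AWS SDK for Python) to launch one replica into each of three Availability Zones. The AMI ID, instance type, and zone names are placeholders; a real deployment would more likely use an Auto Scaling group or a load balancer spanning the zones.

```python
# Hedged sketch of zone-level redundancy on AWS: identical instances are
# launched into different Availability Zones so that a failure in one
# zone does not take down every replica.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # example zones

for zone in ZONES:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},  # pin each replica to its own zone
    )
```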
All in all, while we cannot always protect against human error, as the Amazon S3 incident shows, the mechanisms that cloud services like AWS and Microsoft Azure have in place to provide fault tolerance and redundancy are indeed powerful, if not fully comprehensive. It is thus still a sensible choice to pursue these options.
5 comments on “Fault-Tolerance in Cloud Services”
Aaron,
Great blog post! I have been on the Architecture/Engineering side of construction for data centers for the last ten-plus years of my career. It is interesting to see how these companies have shifted their philosophy of maintaining facility uptime toward decentralizing their data centers rather than constructing a single facility with 2N redundancy of power and cooling infrastructure. As you mentioned, the biggest cause of interruptions to data center services is human error. One need only read up on the recent British Airways case to see how easy it can be to take down a data center, and the economic impact that follows. Just yesterday I gave a presentation to my company's Board of Directors on the latest trends in reliability engineering and design, and your examples above were noted by the group. – Robert Sty, PE
Interesting post on how companies are tackling fault-tolerance in cloud services. I have been running threat-prevention architecture and engineering for infrastructure for the past ten years. The number one cause of outages we have seen is human error, combined with the lack of proper peer review of the implementation plan, whether for a configuration change or a simple code fix.
We rely heavily on vendor products to support services for our employees and work with multiple vendors on hardware and software solutions. I think that to remediate these issues we should periodically run and execute a technical recovery plan covering:
1) Port failover test and validation
2) Device failover test and validation
3) Data Center failover test and validation
4) Service failover test and validation
These will ensure service recovery and reduce the impact to near zero. Another solution is to put proper process documentation in place and have the engineering and implementation teams follow it, to avoid human errors.
Any standard or break-fix change should be properly designed by engineering, then reviewed and approved before it goes out as a production change. This would ensure seamless execution of the change. Also, any new hardware or code should be properly certified before being released to production.
Rightly pointed out. The increasing demand for flexibility and scalability has led to the rise of giants such as AWS and Azure. The benefits are immense:
1. Reduction of Cost
2. Ease of maintaining applications
3. Disaster recovery
4. Enabling easier sharing & collaboration, etc.
Systems are being improved on a day-to-day basis to lessen faults; the difference between a human error and a machine committing an error is that the machine, once corrected, will not make the same mistake again.
In this case, a simple error by a human led to losses in the millions of dollars, and that risk should not be obscured by the benefits these services provide.
Great post! I am personally more inclined toward the software side, but your post inspired me to consider the physical conditions of the technology environment. From the locations where data are stored to the cooling of the hardware, there are plenty of things for us to take care of, or accidents might happen.
Very relevant. I do think it is often overlooked how rare these losses of availability are when you appreciate the scale of these systems. And, as Robert commented above, they are usually due to human error of some kind.
There has been an incredible amount of work put into the availability of cloud services at all layers. I was especially surprised to learn that data centers and WSCs (for the most part) use commodity components with the same failure rates that you would find in your desktop or laptop. Due to their scale, however, these enormous systems would fail every day if left to their own devices. If the disk drive in your computer has a .001% failure rate but the WSC contains tens of thousands of them, the probability of an error at a given moment is much higher. For this reason, the technology developed to achieve this level of availability has focused not on better individual components with lower failure rates, but on software and hardware mechanisms that detect and recover from the failures of modestly priced components without compromising availability. Like technology developed at NASA that later found uses in a variety of industries, I wonder what impact this shift toward reliability engineering at massive scale can and will have in other areas.
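As a back-of-the-envelope illustration of that point (treating the .001% figure above purely as an example and assuming independent failures), a few lines of Python show how quickly per-disk failure probabilities compound at warehouse scale:

```python
# Rough check of the scale argument: even a tiny per-component failure
# probability adds up once a warehouse-scale computer holds tens of
# thousands of disks.

def prob_any_failure(per_disk_prob: float, num_disks: int) -> float:
    """P(at least one disk failing), assuming independent failures."""
    return 1.0 - (1.0 - per_disk_prob) ** num_disks

p = 0.00001  # 0.001% expressed as a probability
for n in (1, 1_000, 10_000, 50_000):
    print(f"{n:>6} disks -> P(at least one failure) = {prob_any_failure(p, n):.3%}")
```

With these example numbers, the chance of at least one failure rises from essentially zero for a single disk to roughly 40% across 50,000 disks, which is why recovery mechanisms, rather than perfect components, carry the availability story.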