Fault-Tolerance in Cloud Services
On February 28, 2017, companies such as Quora, Slack, Expedia, and GitHub found themselves inexplicably inaccessible to their customers. They, along with many other companies, large and small alike, scrambled to diagnose the problem, as each passing second meant significant losses in revenue. It took roughly four hours for their websites to become functional again, and by then the damage was done. S&P 500 companies were estimated to have lost $150 million, while US financial services companies suffered a $160 million loss. (http://www.businessinsider.com/aws-outage-hurt-internet-retailers-except-amazon-2017-3)
The one thing that these companies had in common: the use of Amazon Web Services.
Earlier that day, a member of the Amazon S3 (Simple Storage Service) team executed a command intended to remove servers from one of the S3 subsystems. Unfortunately, the input to the command was entered incorrectly, and a much larger number of servers was removed than intended (https://aws.amazon.com/message/41926/). With more than 140,000 websites using Amazon S3, a hefty chunk of the internet found itself offline for the duration of the outage (https://techcrunch.com/2017/02/28/amazon-aws-s3-outage-is-breaking-things-for-a-lot-of-websites-and-apps/).
While this may seem to be a cautionary tale against relying too much on services like AWS and Microsoft Azure, it is important to note that for the most part, the benefits of using these services far outweigh the risks. Indeed, it is likely safer to rely on these services rather than any home-grown alternatives.
In his talk at the first lecture, Chris Cruz, Deputy State CIO of the State of California, espoused the use of these services, citing benefits such as gains in speed and the economic advantages of the pay-as-you-go pricing structure (https://canvas.stanford.edu/courses/67124/files?preview=1938288). In light of the AWS outage, however, it also seems prudent to examine exactly what mechanisms cloud services like Microsoft Azure and AWS have in place to provide reliable service.
First, let us examine Microsoft’s Service Fabric platform. Its primary fault tolerance mechanism is the “Fault Domain”: a logical segment demarcating shared dependencies in host hardware. Notably, if two VMs are in different Fault Domains, they will not depend on the same power source or network switch (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). Replicating one’s data or service across VMs in different Fault Domains allows Service Fabric to recover easily from the loss of a VM in any one domain. Even after the loss of multiple replicas, as long as a majority of the replicas remain operational, the service can continue to serve requests as usual (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). Users can then choose their target number of replicas according to how critical their service is.
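The majority-of-replicas rule above is the standard quorum idea, and it can be sketched in a few lines. This is an illustrative model only, not Service Fabric's actual API; the names `ReplicaSet`, `fail`, and `has_quorum` are our own inventions.

```python
class ReplicaSet:
    """Toy model of a replicated service that stays available while a
    strict majority of its replicas survives (a quorum)."""

    def __init__(self, total_replicas):
        self.total = total_replicas
        self.healthy = set(range(total_replicas))  # all replicas start healthy

    def fail(self, replica_id):
        """Mark one replica as lost, e.g. because its fault domain went down."""
        self.healthy.discard(replica_id)

    def has_quorum(self):
        """True while more than half the replicas are still operational."""
        return len(self.healthy) > self.total // 2

# With 5 replicas spread across fault domains, losing two still leaves
# a 3-of-5 majority, so the service keeps serving requests:
rs = ReplicaSet(5)
rs.fail(0)
rs.fail(1)
print(rs.has_quorum())  # True: 3 of 5 replicas remain
rs.fail(2)
print(rs.has_quorum())  # False: only 2 of 5 remain; the service must pause
```

This is why the target replica count matters: a 5-replica service tolerates two simultaneous fault-domain losses, while a 3-replica service tolerates only one.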
Microsoft has even made provisions for extreme events such as the loss of an entire data center. To protect against such catastrophic events, users are able to back up their service state “to a geo-redundant store” (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery). This is another advantage of using services by companies such as Amazon and Microsoft, which have datacenters scattered across the world. Not only can users choose where they want their information physically stored (which could be important based on the sensitivity of the material), they also have the option of backing up their data across very different locations, thus hedging against the risk of complete data loss in the event that a data center goes down.
Similarly, Amazon provides the notion of Availability Zones, each of which is isolated from failures in the other zones. Specifically, zones are “independently powered and cooled, and have their own network and security architectures” (https://www.singlehop.com/blog/aws-fault-tolerance-redundancy-ec2/). Once again, replicating one’s service across these zones provides a form of redundancy that reduces the risk of downtime.
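The placement discipline this implies can be sketched simply: keep each replica in a distinct zone, so a single-zone failure takes out at most one copy. The zone names below mimic AWS's naming style, but the `place_replicas` function is a hypothetical illustration, not an AWS API.

```python
def place_replicas(zones, n_replicas):
    """Assign each replica to a distinct availability zone.

    Refuses placements that would force two replicas into the same zone,
    since that would let one zone failure destroy multiple copies.
    """
    if n_replicas > len(zones):
        raise ValueError("not enough zones to keep replicas in distinct zones")
    return {f"replica-{i}": zones[i] for i in range(n_replicas)}

# Three replicas across three zones: losing any one zone loses one replica,
# leaving a 2-of-3 majority intact.
placement = place_replicas(["us-east-1a", "us-east-1b", "us-east-1c"], 3)
print(placement)
```

Combined with the quorum rule, this placement guarantees that any single-zone outage still leaves a majority of replicas serving requests.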
All in all, while we cannot always protect against human error, as the Amazon S3 incident shows, the mechanisms that cloud services like AWS and Microsoft Azure have in place to provide fault tolerance and redundancy are powerful, if not fully comprehensive. Pursuing these options thus remains a sensible choice.