A Question of Integrity: To MD5 or Not to MD5
Cloud Storage offers pay-per-drink off-site storage. Data to be saved travels from the customer to the Cloud Storage Provider over the network. This works wonderfully most of the time: what you upload is what you get back later. But what happens when the gremlins strike and what you send is not what is received?
This happened recently to some Amazon S3 customers. There were complaints in the AWS forums about ‘S3 Corruption’. The first post in the forum was recorded at Jun 22, 2008 5:05 PM PDT (although in subsequent posts some people reported emailing Amazon prior to this):
we are having some serious S3 issues.
all data we store on S3 has gone through the same code path for months. starting a couple days ago a small percentage of the objects we are retrieving are not checksumming to the correct values. we hash and store objects by checksum and rehash the objects when we retrieve to ensure there is no data corruption. all the objects we’re having issues with were uploaded at approximately the same time period a few days ago.
we’ve stored 10’s of millions of objects in S3 and never encountered such problems. please let me know ASAP if you have any idea what could be going on here. thanks.
Amazon responded 6 minutes later (!) and started investigating. To troubleshoot they asked customers to email aws@amazon.com with the ‘Bucket-Name and few keys that you believe are having issues’.
Others weighed in reporting similar problems. Amazon provided status updates and on Monday Jun 23rd at 6:10pm PDT, provided the following explanation:
We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream. When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied. When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted. Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.
What are some of the takeaways?
- If you are directly using the AWS S3 API, make sure to calculate MD5 checksums and send them (in the Content-MD5 header) along with the actual data. Check status return codes - an HTTP 400 error code means ‘something went wrong’ - and respond appropriately.
- If you are relying on 3rd party tools to access S3, be sure to check with your software vendor that they are following the advice from Amazon to use MD5. If they are not then your data can get silently corrupted…
- Downloads, aka HTTP GETs, can also be affected. The thread in the forum continues, and questions are asked as to whether the corruption caused by the load balancer was affecting both incoming and outgoing traffic. The conclusion was yes. If you are hosting media on S3, and the browser is using partial GET requests (to download in chunks), then the corruption will not be automatically detectable.
- If your business relies on Cloud Storage, are you prepared to wait 36 hours for a resolution? This isn’t a swipe at Amazon; this is true for any provider. Check your SLAs, check the trouble ticket resolution times, ask about availability of experts for troubleshooting etc.
- Cloud Providers will increasingly need to instrument their services such that they can ‘early detect’ negative operational events. In this case, Amazon has stated plans to use better logging and analysis to automate detection of unusual error patterns (i.e. anomaly detection).
- This incident - caused by a malfunctioning Amazon load balancer - did not make it onto the AWS status page at http://status.aws.amazon.com/. Taking Amazon at face value, this incident only affected a small number of transfers, relative to the total number of S3 transfers. But this raises the question: what level of outage or service problem needs to happen before Amazon will flag the issue on their status page? On a side note, based on the timestamps, 31 hours passed between the load balancer being taken out of service and Amazon providing the explanation on the forum.
- When Amazon update their S3 API documentation, it would be useful to have entries in the S3 API index for ‘checksum’, ‘MD5’, ‘integrity’ and ‘corruption’.
- Stepping back, will customers hold Cloud Service Providers to a higher standard than their own internal IT teams?
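To make the first takeaway concrete, here is a minimal Python sketch of the checksum side of things. It is not Amazon's code or any particular S3 client; it only shows how a Content-MD5 value is formed (the base64 encoding of the raw 16-byte MD5 digest, as the header is defined for HTTP) and how recomputing it over the received bytes exposes a single flipped byte. The function names and the sample payload are made up for illustration, and the network transfer itself is left out.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def matches(body: bytes, expected_content_md5: str) -> bool:
    """Recompute the digest over the bytes actually received and compare."""
    return content_md5(body) == expected_content_md5


def flip_one_byte(body: bytes, index: int) -> bytes:
    """Hypothetical helper: simulate a single corrupted byte in transit."""
    corrupted = bytearray(body)
    corrupted[index] ^= 0xFF
    return bytes(corrupted)


# Illustrative payload; in practice this would be the object being uploaded.
original = b"hello, cloud storage"
header_value = content_md5(original)      # value to send as Content-MD5
received = flip_one_byte(original, 3)     # what a faulty hop might deliver

print(matches(original, header_value))    # intact copy
print(matches(received, header_value))    # corrupted copy
```

The same comparison works in the other direction: if you keep the checksum you computed at upload time, you can rerun it over whatever a later GET returns, which is essentially what the original forum poster was doing.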
I’m sure there are more takeaways I didn’t cover. What say you?
###
Kudos for the heads-up on the S3 issue goes to my friend and colleague Jason Harper - network supremo and crypto-head. Thanks Jason!
If you are curious about Cloud Computing and security, don’t miss out on future posts: subscribe by RSS or subscribe by email.









Companies who are currently doing business in or considering doing business “in the cloud” should take this incident as a cautionary tale.
The Amazon S3 incident is a perfect example of the negative outcomes that companies are willing to overlook or minimize once the decision has been made that they can save a few dollars by doing business in the cloud.
The Amazon S3 incident also illustrates the complete lack of liability on the part of cloud service providers when errors inevitably occur.
Additionally, the Amazon S3 incident shows that the companies using their service are considered “just another user beneath the cloud”. This is illustrated by the fact that the only way the issue came to light was through a user’s desperate post to an Internet forum. In this case, someone responded in six minutes, but it took 36 hours on Amazon’s end to identify the problem - never mind the hours spent by the poor users under the cloud trying to figure out why their data was corrupted.
Is Amazon’s explanation that “Intermittently, under load, it was corrupting single bytes in the byte stream…Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.” supposed to make their users sleep better?
Amazon’s explanation boils down to:
- It was intermittent
- It only corrupted a single byte of data
- It wasn’t our fault (the load balancer did it)
- It only impacted a very small portion of PUTs
Amazon’s explanation attempts to minimize the impact of their error - much like the credit card companies attempted to minimize the number of customers affected when their data was breached. Investigations later determined that the number of affected customers reported by the credit card companies was woefully low. The users beneath the cloud must take Amazon’s explanation at face value. Amazon did not report the data corruption issue on their own, so users are now asking the question “how bad does a problem with S3 have to be before Amazon reports it?”
What is Amazon’s liability for time lost, business opportunities missed, customers’ loss of confidence or bad decisions made based on the data that they corrupted?
Now that this has happened, the affected companies may decide that they need to back up their data locally before transmitting it to S3 - just in case something else goes wrong. If companies feel the need to build their own data centers as a failsafe, then this effectively calls into question many of the reasons for using S3 in the first place (no data center, no support staff, etc…).
The Amazon S3 issue is a serious one for the users beneath the cloud because it undermines the confidence that users have in their data. If users beneath the cloud cannot be confident that the data they retrieve from S3 is identical to the data that they sent to S3, then every transaction or decision made based on S3 data must be called into question. Companies will end up spending resources to build out local data centers and to double check every byte of retrieved data - negating any savings that they hoped to realize by doing business in the cloud - a classic case of ‘pay me now or pay me later’.
You pose the question “will customers hold Cloud Service Providers to a higher standard than their own internal IT teams?” My position is that the users beneath the cloud have very little leverage to hold the cloud service providers to any standard.
Companies doing business in or considering doing business “in the cloud” should realize that they do so at their own peril.
Mike Thieruke