Hacker News

> "expect an annual failure rate (AFR) of between 0.1% –0.5%, where failure refers to a complete loss of the volume"

Well, I think the OP has just experienced a sample from the probability distribution characterized above.



There are people in the world who do not understand probability. Like the OP:

EBS is either reliable or not. You cannot be a little pregnant.

Even the OP's metaphor is broken! The saying "you cannot be a little bit pregnant" may be traditional, but it too is only an approximation. There are these things called miscarriages. They happen all the time, often before a woman realizes she is pregnant. Then there are the false pregnancies:

http://en.wikipedia.org/wiki/False_pregnancy

And these are just the common edge cases.


You can be "a little bit pregnant" when you're talking about large groups of people. Of the women in your country, how many are pregnant at any given time? It's probably "a little bit". Obviously someone is confusing a statistical sampling of 1 with the kind of volume Amazon must deal with.

I find it informative that a rare event such as this is being given so much press. If this were more routine, it wouldn't be news. Amazon seems to be doing better than 0.1% based on that alone.


I failed statistics. If out of a million hard drives 5000 die in a year and take 15 minutes to swap, what are the odds of 2 failing on the same machine?


Your assumptions are unreasonable and insufficiently specified, and you're asking the wrong question.

If in a year out of a million hard drives only 5000 die, then you're projecting a 200 year average lifetime per disk drive. No real disk has that. A more reasonable 5 year average lifespan gives you 200,000 failures per year. Which is much worse.
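As a back-of-the-envelope check, the arithmetic works out like this (the fleet size and lifetimes are the hypothetical numbers from this thread, not real AWS figures):

```python
# Hypothetical numbers from the comment above, not real AWS figures.
fleet = 1_000_000  # drives

# 5,000 failures per year implies an implausible ~200-year mean lifetime.
implied_lifetime_years = fleet / 5_000   # 200.0

# A more realistic 5-year mean lifetime implies far more failures per year.
failures_per_year = fleet / 5            # 200,000.0

print(implied_lifetime_years, failures_per_year)
```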

Next, you're asking about the odds of 2 failing on the same machine. How many disks are on a machine? 1? 10? 100? Are failures independent events? It makes a huge difference. In fact they are not independent because when the motherboard craps out you lose access to all disks on that machine at once. At their scale it is too much work to figure out whether some of that data is recoverable - you just assume there is another copy somewhere and throw away the stale data. If you're wrong, then oops.
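The effect of that shared failure mode can be sketched with a toy Monte Carlo simulation; the per-disk and per-host probabilities below are made up purely for illustration:

```python
import random

random.seed(0)

TRIALS = 100_000
p_disk = 0.02   # assumed per-disk annual failure probability (illustrative)
p_host = 0.01   # assumed probability a host-level fault kills all its disks

both_indep = both_corr = 0
for _ in range(TRIALS):
    # Independent model: each of two disks fails (or not) on its own.
    if random.random() < p_disk and random.random() < p_disk:
        both_indep += 1
    # Correlated model: a shared host fault takes out both disks at once.
    host_dead = random.random() < p_host
    d0 = host_dead or random.random() < p_disk
    d1 = host_dead or random.random() < p_disk
    if d0 and d1:
        both_corr += 1

# The shared failure mode makes "both disks lost" far more common
# than the independent model's ~p_disk**2 would predict.
print(both_indep, both_corr)
```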

You are also throwing out the 15 minute disk replacement time. It may take 15 minutes to replace a disk, but that figure is irrelevant. To replace a disk you have to locate the machine, and it has to matter enough to you to send a person out. I guarantee you that the time before a person gets involved is going to average more than 15 minutes. Generally a lot more than 15 minutes. (Google famously takes the attitude that it is generally more work than it is worth to find the broken machine, and lets most dead machines sit there indefinitely. I wouldn't be surprised if other cloud providers imitate this.)

Next you have to consider that the end user shouldn't care about machines. For the purpose of redundancy Amazon is not going to keep multiple copies of the same data on the same machine. They are going to put them in different machines, and hopefully in different places. That will reduce the odds of a single failure losing your data.
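A rough sketch of why separate machines help, assuming (optimistically) that replica losses are fully independent, with a made-up per-replica probability:

```python
# Illustrative only: p is an assumed per-replica annual loss probability.
p = 0.005

# With k independent replicas, every copy must be lost to lose the data,
# so the loss probability shrinks to roughly p**k.
for k in (1, 2, 3):
    print(f"{k} replica(s): loss probability ~ {p ** k:.2e}")

# Correlated failures (same machine, same rack, same power feed) make the
# real number worse than p**k, which is why placement matters.
```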

All of that said I am somewhat shocked that Amazon would advertise a 0.1%–0.5% rate of data loss as acceptable. I don't know Google's actual failure rate, but I'd be willing to bet large amounts of money that it is much lower than that.

For instance, search for "gmail lost data". The only significant gmail data loss that turns up was in 2006. (See http://techcrunch.com/2006/12/28/gmail-disaster-reports-of-m... for more.) A grand total of 60 accounts got wiped out. Subsequently most of the lost data was restored from backup. (I doubt that the error was at the data storage layer.)

That's not just better than what Amazon delivers. That is ridiculously better.


That failure rate is for EBS drives, which are essentially hot disks. EBS drives can almost instantly snapshot onto S3 backup stores.

Amazon has said they've never heard of anyone experiencing data loss on S3. So if you use EC2+EBS+S3 properly, you should never experience data loss beyond the data captured since your last snapshot to S3 (which should happen extremely frequently).


To be fair you'd have to include failures of EBS snapshots and failures spanning multiple datacenters to gain parity with Google in your comparison. I'm sure the gmail app doesn't use its storage subsystem naively no matter what the numbers are. You're absolutely right in general though.


I'm positive that there is nothing naive in how Gmail uses storage. However I suspect that they are using standard best practices that are common throughout Google.


Lots of people who didn't fail statistics assume that events are independent. A bit like: the chance of my machine catching fire is 1 in 1,000, my machine did catch fire and I lost both drives - the odds of that happening must be 1 in a million!


What's more relevant is that the odds of either drive failing are identical. The likelihood that the first drive will fail doesn't change because the second one failed -- which is the classic statistical blunder. ("I got 42 heads in a row, the next one's BOUND to be tails!")

Every single drive in the datacenter typically has the same likelihood of failure (since they're usually the same make and model, and similar production runs), so the odds that a drive somewhere in a large data center is failing RIGHT NOW are in reality rather high.
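That "rather high" intuition is just the complement rule, P(at least one) = 1 - (1 - p)^n; the per-drive probability below is invented for illustration:

```python
# Assumed chance that any given drive fails today (illustrative only).
p = 0.0001

# P(at least one of n drives failing) = 1 - (1 - p)**n
for n in (1, 1_000, 100_000):
    print(f"n={n:>7}: {1 - (1 - p) ** n:.4f}")

# At fleet scale the probability approaches 1, even though any single
# drive is almost certainly fine.
```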

The Big Mac ended up being a good example of this. The likelihood of having a memory error on a normal PC is low, because the likelihood of a single DIMM having a memory error is low.

Punch that up to the roughly 22,000 DIMMs that populated the Big Mac cluster, and now you're looking at a very HIGH likelihood that a large-scale computation using the cluster will experience memory errors, and therefore produce invalid results -- which is why UVa ended up replacing the entire cluster with ECC-equipped machines in short order. Until they did that, researchers had to run simulations multiple times and compare results to make sure that their simulations weren't contaminated by memory errors.
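The same scaling argument can be made concrete; the per-DIMM error probability here is invented purely for illustration, not a measured figure:

```python
# Assumed per-DIMM probability of an error during one long run (illustrative).
p_dimm = 0.0005
n_dimms = 22_000

# Probability that an entire run sees no memory error on any DIMM.
p_clean_run = (1 - p_dimm) ** n_dimms
print(f"clean-run probability: {p_clean_run:.2e}")

# With odds like these, re-running and comparing results is the only way
# to detect contamination on non-ECC hardware: two independent runs are
# unlikely to be corrupted in exactly the same way.
```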


>The likelihood that the first drive will fail doesn't change because the second one failed

No, that's the point: hardware failures on the same machine/rack/psu/site are not independent. A power supply spike that kills drive 0 in a RAID will probably kill the mirror drive 1 as well - that's why RAID isn't a backup strategy


You missed the point -- a power supply spike isn't a hard drive failure, even if it kills a couple of drives, it's a power supply failure.

I was explaining a common misconception about statistics, and using hard drives and DIMMs as examples.


If Amazon is advertising that single failures can't cause data losses, then a single exploding power supply (for instance) breaks that agreement.


If by "exploding" power supply you mean the potential for a power supply to explode with eruptive force and destroy other equipment in the rack, I believe that the other equipment being destroyed would be considered a multiple failure. It's very unlikely that a power supply could do that (although I guess anything is possible). A proper datacenter built with datacenter-grade equipment will have multiple redundant power supplies in a blade enclosure, fed by different rails, which come off of different mains lines into the building. So a single failed source of power won't cause a failure.


The percentages they give aren't for a single drive failure, they're for data loss (multiple drive failure). So the odds are 0.1–0.5%.



Very low. But that assumes they are independent events, which has never been the case in my experience.


No, he experienced a 'user error'.

And those happen far more frequently.


The OP's point:

Either you are reliable or not.

If you claim that the data is backed up and so on, then you are giving a guarantee. Now yes, if all of Amazon's datacenters burn down, or the one where your data is, yes it will be lost, but that should be the corner case that you prepare for. That's when you do an offsite backup in your house daily/weekly to ensure that at least there is no one place that burns down = company down the tube.

No matter the solution there is always the probability that shit will happen to all of it.

However if you are selling a reliable service, don't sell an unreliable reliable service. Reliable should mean that at least you back it up so that if one hard drive rack blows up, the data is not gone.


They didn't say it was bulletproof or impossible to lose data. They stated that data is replicated to more than one device to ensure that a single component failure would not result in data loss. They explain fairly well how things work and what the failure rates are, and they give you the tools to do the same risk analysis and cost/benefit calculations you would do anyway, whether using a cloud service or rolling your own.

Multiple equipment failures can happen. Even across data centers and availability zones. The larger the entire AWS system gets, the higher the chances of eventually seeing edge cases where the wrong equipment fails at just the wrong time and loses data.

If you want a bulletproof data storage system with such a ridiculously low failure rate that you are guaranteed not to lose data for a hundred years, you can get it - but it's going to cost a heck of a lot more than anything Amazon is selling you.



