
Your assumptions are unreasonable and insufficiently specified, and you're asking the wrong question.

If out of a million hard drives only 5000 die in a year, then you're projecting a 200-year average lifetime per drive. No real disk lasts that long. A more reasonable 5-year average lifespan gives you 200,000 failures per year, which is much worse.
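The arithmetic above is easy to check with a quick sketch (the figures are the hypothetical numbers from this thread, not real fleet statistics):

```python
# Back-of-envelope disk failure arithmetic. The numbers are the
# hypotheticals discussed above, not measured fleet data.

drives = 1_000_000

# Scenario 1: only 5,000 failures per year across the fleet.
failures_per_year = 5_000
annual_failure_rate = failures_per_year / drives    # 0.5% per year
implied_mean_lifetime = drives / failures_per_year  # 200 years -- implausible

# Scenario 2: a more realistic 5-year mean lifespan.
mean_lifetime_years = 5
expected_failures = drives / mean_lifetime_years    # 200,000 per year

print(implied_mean_lifetime)  # 200.0
print(expected_failures)      # 200000.0
```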

Next, you're asking about the odds of 2 failing on the same machine. How many disks are on a machine? 1? 10? 100? Are failures independent events? It makes a huge difference. In fact they are not independent because when the motherboard craps out you lose access to all disks on that machine at once. At their scale it is too much work to figure out whether some of that data is recoverable - you just assume there is another copy somewhere and throw away the stale data. If you're wrong, then oops.
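To see why the independence assumption matters so much, here is a toy comparison; all the rates are illustrative assumptions, not measured figures:

```python
# Toy model: annual probability of losing two specific disks on one
# machine. All probabilities here are illustrative assumptions.

p_disk = 0.2       # assumed annual failure probability per disk (5-year lifespan)
p_machine = 0.05   # assumed annual probability the whole machine dies

# If disk failures were independent, both dying in the same year is rare:
p_both_independent = p_disk * p_disk  # ~0.04

# But a single machine-level fault (motherboard, PSU) takes out every
# disk at once, so the correlated term dominates:
p_both_correlated = p_machine + (1 - p_machine) * p_disk * p_disk  # ~0.088

print(p_both_independent)
print(p_both_correlated)
```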

You are also throwing out the 15-minute disk replacement time. It may take 15 minutes to replace a disk, but that figure is irrelevant. To replace a disk you have to locate the machine, and it has to matter enough to you to send a person out. I guarantee you that the time before a person gets involved is going to average more than 15 minutes. Generally a lot more than 15 minutes. (Google famously takes the attitude that it is generally more work than it is worth to find the broken machine, and lets most dead machines sit there indefinitely. I wouldn't be surprised if other cloud providers imitate this.)

Next you have to consider that the end user shouldn't care about machines. For the purpose of redundancy Amazon is not going to keep multiple copies of the same data on the same machine. They are going to put them in different machines, and hopefully in different places. That will reduce the odds of a single failure losing your data.
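The benefit of spreading replicas across machines is easy to quantify under an independence assumption (the per-host loss rate below is made up for illustration):

```python
# Probability of losing every copy of a piece of data, assuming each
# replica lives on a different host and hosts fail independently.
# The per-host rate is an illustrative assumption, not a real figure.

p_host_loss = 0.01  # assumed annual probability a host loses its data

for replicas in (1, 2, 3):
    p_data_loss = p_host_loss ** replicas
    print(replicas, p_data_loss)
```

Each extra independent replica multiplies the loss probability by another factor of 0.01, which is why placement across machines (and ideally datacenters, to keep failures independent) matters more than any single disk's reliability.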

All of that said, I am somewhat shocked that Amazon would advertise a 0.1-0.5% rate of data loss as acceptable. I don't know Google's actual failure rate, but I'd be willing to bet large amounts of money that it is much lower than that.

For instance, search for "gmail lost data". The only significant Gmail data loss that turns up was in 2006. (See http://techcrunch.com/2006/12/28/gmail-disaster-reports-of-m... for more.) A grand total of 60 accounts got wiped out, and most of the lost data was subsequently restored from backup. (I doubt the error was at the data storage layer.)

That's not just better than what Amazon delivers. That is ridiculously better.



That failure rate is for EBS volumes, which are essentially hot disks. EBS volumes can be snapshotted to S3 almost instantly.

Amazon has said they've never heard of anyone experiencing data loss on S3. So if you use EC2+EBS+S3 properly, you should never experience data loss beyond the data captured since your last snapshot to S3 (which should happen extremely frequently).


To be fair, you'd have to include failures of EBS snapshots and failures across multiple datacenters to reach parity with Google in your comparison. I'm sure the Gmail app doesn't use its storage subsystem naively no matter what the numbers are. You're absolutely right in general, though.


I'm positive that there is nothing naive in how Gmail uses storage. However I suspect that they are using standard best practices that are common throughout Google.



