nick_craver's comments

nick_craver · on Feb 7, 2012

Both NY-DB01 (runs everything but Stack Overflow) and NY-DB03 (runs Stack Overflow) have identical backup counterparts: NY-DB02 and NY-DB04. NY-DB04 is on a mirrored config and is always a few minutes behind, while NY-DB02 is restoring scheduled backups. With SQL Server 2012, the backup/mirror is greatly improved and both boxes will have fully-hot spares in a replica configuration. In 2008 R2, SQL Server just doesn't handle mirroring 100+ databases well.

griffordson · on Feb 7, 2012

And those are all in the same data center, right? How much downtime do you guys have a year?

nick_craver · on Feb 7, 2012

The hot backup is in the same New York datacenter, yes. We also have daily backups across the country in our original Oregon datacenter. The whole OR setup is getting love to be a much more resilient failover location as we speak (that'll be the topic of my next post). As for downtime, Pingdom says we were down 7h 6m 23s last year, so 99.92% uptime (note: not nearly all of that was DB related).
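As a quick check of that uptime figure, here is a minimal sketch of the arithmetic. It assumes a 365-day year (2011 was not a leap year); the `Uptime` class and `Percent` method are made-up names for illustration only.

```csharp
using System;
using System.Globalization;

public static class Uptime
{
    // Percentage of the period during which the site was up.
    public static double Percent(TimeSpan downtime, TimeSpan period)
    {
        return 100.0 * (1.0 - downtime.TotalSeconds / period.TotalSeconds);
    }

    public static void Main()
    {
        var downtime = new TimeSpan(7, 6, 23);  // 7h 6m 23s, as reported
        var year = TimeSpan.FromDays(365);      // assumption: non-leap year
        double uptime = Percent(downtime, year);
        // Rounds to two decimals: 99.92
        Console.WriteLine(uptime.ToString("F2", CultureInfo.InvariantCulture));
    }
}
```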

nick_craver · on Feb 7, 2012

We'll end up paying about that for the 710s, but you have to look at what we're getting for the same price. If a Fusion IO card dies, we're dead in the water. It takes 2-3 of the Intel 710s (in a RAID 10 like they'll be) dying to do the same. Given that the IO of our storage isn't close to being a factor, we'll always stick with the much more robust route for the same price.

mrkurt · on Feb 7, 2012

Ah. Well, if your RAID controller dies, you're dead in the water, no? My understanding of the Fusion cards is that they're roughly as resilient as a RAIDed SSD setup.

nick_craver · on Feb 7, 2012

That doesn't really happen; a RAID controller dying is an extraordinary event. But can it happen? Sure, anything can happen... that's why we have an identical hot backup server only minutes behind on the database for just such an occasion.

nick_craver · on Feb 7, 2012

We have done just that: Lucene.Net runs on the web tier, which is otherwise severely under-utilized hardware (all servers sat at under 10% CPU before we moved search to it). But it will be moving again soon (we're actually discussing that today). I'll have another post around this architecture shift for search and a few other things coming up.

stavros · on Feb 7, 2012

Ah, thanks, I figured that was a good move.

nick_craver · on Oct 28, 2011

This wasn't a change of an entire codebase, only one portion of it (the tag engine). But more importantly, when only 6 web servers are running stackoverflow.com, there's a decent chance a non-triggered GC happens and stalls the servers in rotation anyway, though granted this lessens the chance of a user-facing stall overall. It also complicates the build process, since the queue for rotating servers out gets involved. For example, if servers 3-4 are in a GC rotation and 1-2 are starting the build loop, you're running only on 5 and 6. While that's fine and doesn't even hit 50% CPU, it's a little risky. Worse, you have the very real possibility of taking ALL servers down during a build and GC combo, which we'd never want to happen.

nick_craver · on June 9, 2011

In the case of an extension method, it's actually just syntactic sugar the compiler provides. An extension method is a static method underneath, which allows the null check to be centralized and keeps the code cleaner everywhere a step is used. The method looks like this: "public static IDisposable Step(this MiniProfiler profiler, string name, ProfileLevel level = ProfileLevel.Info) { return profiler == null ? null : profiler.StepImpl(name, level); }"
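To see why that works on a null reference, here is a small self-contained sketch. The `Step` method is the one quoted above; everything else (the bare-bones `MiniProfiler` class, the `Timing` helper, the `Verbose` enum member, and the body of `StepImpl`) is a stand-in written for this example and is not the real library code.

```csharp
using System;

public enum ProfileLevel { Info, Verbose }  // members assumed for the sketch

// Stand-in for the profiler type; only what the example needs.
public sealed class MiniProfiler
{
    // Hypothetical body: the real StepImpl is not shown in the comment above.
    internal IDisposable StepImpl(string name, ProfileLevel level)
    {
        Console.WriteLine("start " + name);
        return new Timing(name);
    }

    private sealed class Timing : IDisposable
    {
        private readonly string _name;
        public Timing(string name) { _name = name; }
        public void Dispose() { Console.WriteLine("end " + _name); }
    }
}

public static class MiniProfilerExtensions
{
    // Compiles to an ordinary static call, so 'profiler' may be null here.
    public static IDisposable Step(this MiniProfiler profiler, string name, ProfileLevel level = ProfileLevel.Info)
    {
        return profiler == null ? null : profiler.StepImpl(name, level);
    }
}

public static class Program
{
    public static void Main()
    {
        MiniProfiler off = null;            // profiling disabled
        using (off.Step("no profiler"))     // no NullReferenceException
        {
            Console.WriteLine("work");
        }

        var on = new MiniProfiler();        // profiling enabled
        using (on.Step("with profiler"))
        {
            Console.WriteLine("work");
        }
    }
}
```

The call `off.Step(...)` is rewritten by the compiler to `MiniProfilerExtensions.Step(off, ...)`, so no member is dereferenced on the null reference, and a `using` statement whose resource is null simply skips the `Dispose` call. That is what lets callers wrap code in a step without checking for a profiler first.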