For a service that is trying to be 'mainstream' I think their blog post is horri...

harryh · on Oct 6, 2010

How do you think it could have been better? We struggled a lot trying to decide how much technical detail to include. We decided that including more information (even if a lot of our users didn't understand it) was better than "something broke, it took a long time to fix."

Would love to hear suggestions on this topic.

mscarborough · on Oct 6, 2010

I liked that your post didn't start with "we use MongoDB and there's some problems with it". I haven't used MongoDB in production, and while I'm interested to learn about specific issues, the format of this post gives the reader the opportunity to evaluate your problem from a platform-agnostic perspective first. Instead of having Mongo interfere right away, I can think about how our systems might hit the same issues.

To echo others, I'm interested to read the more in-depth post-mortem.

Good luck!

shoesfullofdust · on Oct 6, 2010

The technical details were great. And I'm sure other devs can learn from your predicament. But, as a user, you didn't answer the question, "When can I use this again?"

It broke. You fixed it. But, "Can I expect this to work again?" "Reliably?" All I heard was that it was broken.

It sounded a lot like, "something broke, it took a long time to fix."

gonehome · on Oct 6, 2010

I liked your blog post and appreciated the level of detail you put into it. I'm always wondering why things went down and what's going on behind the scenes and I'm sure many other users appreciate it too. Those that don't likely won't mind the extra information.

thegoleffect · on Oct 6, 2010

Having more MongoDB-specific technical info might be a call-to-arms for other MongoDBAs who might offer their ideas/thoughts. You could also use that as a hiring/scouting opportunity.

Does 4sq have an engineering blog?

dotBen · on Oct 6, 2010

The downvoting on my previous reply is sad, but I probably shouldn't be too surprised. Being way too technical for a mainstream audience is a problem many people on Hacker News seem to have, so no wonder many would disagree with me.

It's silly for someone on Hacker News to say "well I thought the level of detail was fine" - of course you would, like the rest of us you're a technical geek. The point that seems to be lost is 95% of FourSquare's userbase ISN'T!

Also, FourSquare is one of those startups that, like the YC startups (for obvious reasons, I guess), people give a little more favoritism to than perhaps other startups of equal quality/interestingness.

How do you think it could have been better? We struggled a lot trying to decide how much technical detail to include. We decided that including more information (even if a lot of our users didn't understand it) was better than "something broke, it took a long time to fix." Would love to hear suggestions on this topic.

Well, I'm not suggesting you wrote "something broke, it took a long time to fix" - I'm all for transparency. But if you are going to be transparent you need to communicate at a level at which that transparency can be understood by all of your readers. I'm sorry if some people on Hacker News don't get that.

So OK, here's how I would have written your post (for time's sake I just did the intro; I'd have repeated the technical description after this block of copy):

Yesterday, we experienced a very long downtime. All told, we were down for about 11 hours, which is unacceptably long. It sucked for everyone (including our team – we all check in every day, too). We know how frustrating this was for all of you because many of you told us how much you’ve come to rely on foursquare when you’re out and about. For the 32 of us working here, that’s quite humbling. We’re really sorry.

Below is an explanation of what happened and what we’re doing to make sure it doesn’t happen again (a more technical explanation for those inclined appears further below).

What happened

As you can imagine, we store a huge amount of data from all of your check-ins. We split that data across many servers, as it's obviously far too big to fit onto just one. Starting around 11:00am EST yesterday, we noticed that one of these servers was performing poorly because it was receiving an unusually high volume of check-ins. Maybe there was an incredibly popular party that we missed out on! :)

Anyway, after trying various things to improve the performance of that server, we decided to add another server to take some of the strain off the original overloaded one. We wanted to move this data in the background while the site remained up; however, for some reason, when we added the new server the entire site went down. Ouch!

We tried all sorts of things to ease the strain but nothing seemed to work. By around 6:30pm EST (phew, what a day!) we decided to try one final idea, which fortunately worked. Yay!

However, it took a further 5 hours to properly test our fix, so it was only around 11:30pm EST that we were able to bring the site back up. Don't worry, all of your data remained safe at all times, and that hard-won mayorship is still yours!

...

Anyway, if people disagree that you should always communicate with your customers at a level they understand, then I'd urge you to read http://steveblank.com/2010/04/22/turning-on-your-reality-dis... or http://www.readwriteweb.com/start/2010/05/is-your-startup-to... (pitching to investors, media or customers, it's all the same issues).

roc · on Oct 6, 2010

While I agree on the technical-level bit, the tone in your sample floats between cavalier and condescending. That's worse than technical overkill IMO.

dotBen · on Oct 6, 2010

It would be interesting to know which bits you felt were like that.

Also keep in mind I tried to edit their original post, as kinda suggested. I'm not sure I'd have written the post quite the way they did, but I tried to work with what I had.

roc · on Oct 6, 2010

Pretty much every one of the attempts at editorial comedy. People are only reading a post like that if they've been inconvenienced. Comedy is hard to pitch at an angry crowd, so unless you're really good at it, it's probably best not to try.

harryh · on Oct 6, 2010

Interesting. Thx very much for the feedback. I'm sure you understand it's a hard balance to strike between technical detail and ease of understanding. Will strive to lean a bit more toward the "ease of understanding" side next time.

Also considering starting a separate engineering blog where it would probably be appropriate to go into more detail for those that are interested.

-harryh

lemming · on Oct 6, 2010

For what it's worth, I liked the original much better than the proposed replacement. People aren't idiots, if you give them a good explanation they appreciate it even if they don't fully understand all the details or implications. Think of it like going to the doctor - if I have something wrong with me, I want my doctor to explain it to me in a way I can understand, not just tell me that I have something generic wrong.

JabavuAdams · on Oct 6, 2010

Maybe you don't need to find a balance. Provide a "more (technical) info" link after the general-user description.

fauigerzigerk · on Oct 6, 2010

I disagree (but I never downvote). Leaving out technical detail and translating everything into end-user-friendly but meaningless phraseology is the worst thing you can do. Microsoft has done that in all its consumer products, and it's infuriating because users have no way of asking someone more knowledgeable for help.

I think end users have a problem if the one thing they need to know is expressed in a way they cannot understand. What they need to know is when the site is going to be up again and what the likelihood of it happening again is. Once they know that, I don't think they have a problem with added technical detail that's not meant for everyone.

marclove · on Oct 5, 2010

How many "regular users" do you really think would spend the time to read about why they were down? They just know it was down and wanted it back up ASAP. And how would you better explain it to a layman? The servers crashed. We worked on it. It took longer than we wanted. They're back up again and we're working to make sure it doesn't happen again. How do you go further than that without getting technical?

lipnitsk · on Oct 5, 2010

Actually, I think it wasn't even all that technical. They took the time to explain what shards are, etc., so their intended audience was still power users, at best. Not saying it's good or bad, just saying how I felt when I read the post.