
While this is good advice, I feel it is a bit oversimplified.

Counting incoming and outgoing requests misses a lot of potential data points when determining "is this my fault?"

I work mainly in system integrations. If I only check the ratio of input to output, then I may miss that some service providers return a 200 with a body of "<message>Error</message>".

A better message is to make sure your systems understand how feedback comes back from downstream submissions, and to have a universal way of translating that feedback into a format your own service understands.

HTTP codes are (pretty much) universal. But let's say you forgot to include a header, forgot to base64-encode login details, or are simply using the wrong value for an API key. If your system knows that "this XML element means Y for provider X, and means Z in our own system", then you can better gauge issues as they come up, instead of waiting for customers to complain. This is also where tools like Splunk are handy, so you can be alerted to these kinds of errors as they happen.
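To make the translation idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Outcome` type, the `classify_provider_x` function, and the convention that "provider X" reports failure as `<message>Error</message>` inside a 200 response. A real integration would have one such translator per provider.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Outcome:
    """Our own (hypothetical) internal view of a downstream result."""
    ok: bool
    reason: str


def classify_provider_x(status_code: int, body: str) -> Outcome:
    """Translate a response from hypothetical provider X into an Outcome."""
    # Transport-level failures are already visible in the status code.
    if status_code >= 400:
        return Outcome(False, f"http_{status_code}")

    # A 2xx status is not enough: the body may still carry an error.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return Outcome(False, "unparseable_body")

    # Assumed convention for provider X: <message>Error</message> means failure.
    if root.tag == "message" and (root.text or "").strip().lower() == "error":
        return Outcome(False, "provider_reported_error")

    return Outcome(True, "ok")
```

With something like this in place, `classify_provider_x(200, "<message>Error</message>")` is counted as a failure even though the HTTP status looks healthy, and the `reason` string is what you would alert on.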



The article never defines what an error is, so I think it is very reasonable to take your approach. I think you're mistaking very abstract advice for very simple advice :)


I agree that this is over-simplified. It also skips over the mess you can get when a downstream dependency is having issues.

If the "things calling you" can't be effectively throttled, you often run into issues like hitting limits on open sockets, file descriptors, receive queue depth, threads, etc.
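One way to keep callers from exhausting those resources is to cap in-flight work and reject the excess early. A rough sketch, assuming a threaded Python service; `InFlightLimiter` is a made-up name and the limit is something you would have to tune:

```python
import threading


class InFlightLimiter:
    """Caps concurrent requests; callers over the cap are rejected, not queued."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when every slot is taken,
        # so the caller can answer with "busy" instead of holding a socket open.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call once for each successful try_acquire, typically in a finally block.
        self._slots.release()
```

A request handler would call `try_acquire()` first, return something like a 503 if it gets `False`, and call `release()` when the request finishes.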

So, just saying "the downstream service is at fault" isn't really correct. Your service may also not be acting correctly in that situation. Those issues can also affect your logging and metrics.

It's not a trivial exercise to architect your service such that it always does the right thing (throttling input vs retries vs fail fast vs priority queues vs load balancing to multiple instances of a downstream service, exponential backoff, etc) when a downstream dependency is slow and/or down.
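As one small piece of that puzzle, here is a sketch of retries with capped exponential backoff that gives up after a fixed number of attempts. The function name and the default values are assumptions, not recommendations, and it deliberately ignores jitter, retry budgets, and which errors are actually safe to retry:

```python
import time


def call_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt (capped) and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # a real service would catch narrower errors
            last_error = exc
            if attempt == max_attempts - 1:
                break  # out of attempts: fail instead of sleeping again
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise last_error
```

The `sleep` parameter exists only so the delays can be swapped out in tests. Note that retries multiply load on a struggling dependency, which is exactly why this interacts with throttling and fail-fast decisions.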

Edit: Similar to your observation about structured errors, connection pooling is probably also worth talking about in this situation. It would change the stats you want: once the number of connections made isn't the same as the number of transactions, you would want to know both.
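A tiny sketch of what tracking both might look like. The `DownstreamStats` name and its fields are hypothetical; in practice these would be counters in whatever metrics library you already use:

```python
from dataclasses import dataclass


@dataclass
class DownstreamStats:
    """Separate counters for pooled connections and the transactions on them."""
    connections_opened: int = 0
    transactions_sent: int = 0
    transactions_failed: int = 0

    def transactions_per_connection(self) -> float:
        # With pooling this ratio is well above 1; a sudden drop toward 1
        # suggests connections are being torn down and reopened.
        if self.connections_opened == 0:
            return 0.0
        return self.transactions_sent / self.connections_opened
```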



