I agree that this is over-simplified. It also skips over the mess you can get when a downstream dependency is having issues.
If the "things calling you" can't be effectively throttled, you often run into resource limits: open sockets, file descriptors, receive-queue depth, threads, etc.
So, just saying "the downstream service is at fault" isn't really correct. Your service may also not be acting correctly in that situation. Those issues can also affect your logging and metrics.
It's not a trivial exercise to architect your service so that it always does the right thing when a downstream dependency is slow and/or down. You are choosing among throttling input, retries with exponential backoff, failing fast, priority queues, load balancing across multiple instances of the downstream service, etc.
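To make two of those options concrete, here is a rough sketch of failing fast behind an in-flight limit, plus capped exponential backoff with jitter for retries. The names (`BoundedCaller`, `Overloaded`, `backoff_delays`) and the default numbers are made up for illustration, not taken from any particular library.

```python
import random
import threading


class Overloaded(Exception):
    """Raised when the in-flight limit is reached (fail fast)."""


class BoundedCaller:
    """Caps concurrent downstream calls so a slow dependency cannot
    tie up every thread or socket in this service."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args):
        # Non-blocking acquire: shed load instead of queueing without bound.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many in-flight downstream calls")
        try:
            return fn(*args)
        finally:
            self._slots.release()


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield retry delays in seconds: exponential growth, capped,
    each scaled by a random factor in [0, 1) to spread out retries."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)
```

A caller would catch `Overloaded` and return an error to its own client right away, and sleep for each value from `backoff_delays()` between retries of a failed downstream call. Whether shedding or queueing is the right choice depends on the workload, which is sort of the point.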
Edit: Similar to your observation about structured errors, connection pooling is probably also worth talking about in this situation. It changes the stats you want: once the number of connections made isn't the same as the number of transactions, you would want to know both.
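As a toy illustration of why the two numbers diverge, here is a minimal pool that counts them separately. `CountingPool`, `PoolStats`, and the `connect` callable are hypothetical names for this sketch. It is not thread-safe and does no health checking, so treat it as a picture of the bookkeeping rather than a usable pool.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    connections_opened: int = 0  # new connections actually established
    transactions: int = 0        # units of work run over any connection


class CountingPool:
    def __init__(self, connect, max_idle=4):
        self._connect = connect  # caller-supplied factory (assumed)
        self._idle = []
        self._max_idle = max_idle
        self.stats = PoolStats()

    def run(self, fn):
        # Reuse an idle connection when possible; otherwise open one.
        if self._idle:
            conn = self._idle.pop()
        else:
            conn = self._connect()
            self.stats.connections_opened += 1
        self.stats.transactions += 1
        result = fn(conn)  # on error the connection is simply dropped
        if len(self._idle) < self._max_idle:
            self._idle.append(conn)
        return result
```

With this, three sequential `run()` calls against a healthy dependency open one connection and record three transactions. A rising ratio of connections opened to transactions would be one hint that connections are being dropped and re-established.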