
If I use Spinnaker and Chaos Monkey at prod scale (or even in a prod experiment) to create a circumstance where I can't perform a write of a resource because I couldn't achieve quorum in a replica set, and that led to an inconsistency between two data stores... is that meaningfully different from observing the same issue caused by incorrect VPC routing, or by insufficiently resourced test instances leading to slow node startup times or race conditions in a CI test environment?

I think there is overlap, and it does not have to be a choice between the two approaches.



The meaningful question isn't in observing the issue, but in what happens after. Chaos engineering is about making sure you still have enough of a chance of success in the face of failures. For CI, success means an error report that correctly captures whether the PR in question is breaking anything (... at least, anything we're testing). If your process means you can be sloppy about isolation and still get that, then I'd be okay with calling that an example of "chaos engineering". If being sloppy about isolation means you have failing tests in many CI runs that have nothing to do with the changes under consideration, that's not "chaos engineering" - it's just bad CI.
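
To make that distinction concrete, here is a minimal sketch of one way to check whether CI failures are actually attributable to the PR. It assumes each failing test can be rerun against the base commit in the same environment; the names `RunResult`, `classify`, and `signal_ratio` are made up for illustration and are not from any CI tool. It is only a heuristic, since a flaky test can pass on the base commit by chance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one test on the PR commit and on the base commit (hypothetical shape)."""
    name: str
    passed_on_pr: bool
    passed_on_base: bool


def classify(result: RunResult) -> str:
    """Label a result by whether the failure plausibly comes from the PR."""
    if result.passed_on_pr:
        return "ok"
    if result.passed_on_base:
        # Fails only with the change applied: plausibly caused by the PR.
        return "likely caused by PR"
    # Fails with and without the change: environment problem or flaky test.
    return "unrelated to PR"


def signal_ratio(results: list[RunResult]) -> float:
    """Fraction of PR failures that are plausibly attributable to the PR.

    Returns 1.0 when nothing failed, since there is no noise to report.
    """
    failures = [r for r in results if not r.passed_on_pr]
    if not failures:
        return 1.0
    attributable = sum(1 for r in failures if r.passed_on_base)
    return attributable / len(failures)


if __name__ == "__main__":
    run = [
        RunResult("test_write_quorum", passed_on_pr=False, passed_on_base=True),
        RunResult("test_node_startup", passed_on_pr=False, passed_on_base=False),
        RunResult("test_read_path", passed_on_pr=True, passed_on_base=True),
    ]
    for r in run:
        print(r.name, "->", classify(r))
    print("signal ratio:", signal_ratio(run))
```

If the ratio stays near 1 across many runs, sloppy isolation isn't costing you much. If it is routinely low, most red builds say nothing about the change under review, which is the "bad CI" case.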



