Resiliance in a system is a property that always needs to be thought about. It will either consciously be addressed, or it won’t, but it will be there regardless.

This is even more important when most every sytem can actually be thought of as a Distributed Systems.

Two fundamental approaches to designing a system:

  1. Optimize for mean time between failures (MTBF)
  2. Optimize for mean time to restore service (MTRS)

Both can work, but devops prioritizes on enabling the second.

The biggest enemy of resilient systems is “works of art.” Components of the system that have been hand-crafted over the years that, if failed, would be impossible to reproduce in a predictable time.