Resiliance in a system is a property that always needs to be thought about. It will either consciously be addressed, or it won’t, but it will be there regardless.
This is even more important when most every sytem can actually be thought of as a.
Two fundamental approaches to designing a system:
- Optimize for mean time between failures (MTBF)
- Optimize for mean time to restore service (MTRS)
Both can work, but devops prioritizes on enabling the second.
The biggest enemy of resilient systems is “works of art.” Components of the system that have been hand-crafted over the years that, if failed, would be impossible to reproduce in a predictable time.
Note, they can be difficult or near-impossible to reproduce as long as it can be done so in a predictable timeframe.
The emphasis on predictable timeframe suggests a known entropy, a known environment, a known context, etc. No hidden mysterious tinkerings.