Failure is not an option; it’s a requirement
Software engineering involves a complex balance of problem solving and risk management. In most solutions, you start by addressing the happy-path first and when the system finally works as expected, you switch your focus to buttoning up the failure scenarios. How deep do you go in your quest to dodge pitfalls?
Let’s consider a solution that requires a pipeline of messages to be exchanged between two systems, a Producer and a Consumer. If the collection of every single individual message is deemed as mission critical, the team might propose a near-real-time message queue based solution.
How should the system respond if the message pipeline fails to deliver a message? There are many options, so let’s create a contingency chain by listing options and repeatedly asking “and what if that fails?”. You should continue your contingency planning efforts until the answer of “we accept failure” is an acceptable option. It is important to go through the exercise of listing out the contingency chain, even with the smallest number of options as this allows you to gauge the inherent risk of the proposed solution at every step.
That is clearly a completely inadequate solution. One basic approach and no safety net. Let’s expand on it by adding another contingency option.
This still feels like an inadequate solution. Adding a weekly and monthly SFTP backfill process might serve to add more contingency options to the mix, but would leave 3 of the 4 items susceptible to a single SFTP server failure. Care should be taken to diversify the contingency options so that they don’t share the same failure points.
This seems like a much more thorough solution for a mission-critical system. Not every software development effort needs this sort of contingency planning, but for service based behaviors or critical middleware components, the drive to understand exactly when failure is an option is an important exercise.
- Create a contingency chain
- Repeatedly ask “and what if that fails?” until the answer is “we accept failure”
- Diversify the contingencies