Let it crash, let it crash, let it crash
In an age of risk assessment and increasing attempts to design out system failures, the idea of let it crash seems not only counter-intuitive but downright reckless. Surely, it’s just too risky to wait until things go wrong. Let’s look at the rationale behind it and see if it makes any sense at all.
Perhaps terrorism provides an explanatory metaphor. Where will the next gunman or suicide bomber strike in Europe? An airport, a shopping centre or maybe even a humble bookstore - it’s impossible to say. How should we prepare for an attack: install a million cameras in every street, put armed guards at every shop doorway? Even if we wanted to, we probably wouldn’t have the resources or the ability to coordinate such mass surveillance. And even then, the perpetrators would find a soft target that hadn’t been protected, and immense resources would have been expended to no avail, despite a whole city being in lockdown. Of course, that’s not to say that we just resign ourselves to things going wrong, only that we approach them from a different angle. For terrorism, that means better intelligence and better communication between services to share information. It also means rapid response units that can deal immediately with a hostage, siege or attack scenario unfolding in real time.
And so it is with programming. Even if it were possible to write 100 percent faultless and fault-tolerant code, excessive caution and attempts to prevent every negative eventuality at every node would lead to masses of redundant code - clogging up and slowing down the entire system, for what may be minimal or even non-existent threats. Better, then, to treat crashes as alarms and early warning systems.
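To make the contrast concrete, here is a minimal sketch in Python (not Erlang, and not taken from any particular codebase). The function names `parse_quantity_defensive`, `parse_quantity` and `handle_order` are hypothetical and exist only for illustration. The defensive version guards against every case and quietly returns a default, so the fault is hidden. The let it crash version handles only the expected case and relies on a single boundary to turn the failure into an alarm.

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("orders")  # hypothetical logger name


def parse_quantity_defensive(raw):
    """Defensive style: guard every case and hide the fault behind a default."""
    if raw is None:
        return 0
    if not isinstance(raw, str):
        return 0
    if not raw.strip().isdigit():
        return 0
    return int(raw.strip())


def parse_quantity(raw):
    """Let it crash style: handle only the expected case; bad input raises."""
    return int(raw)


def handle_order(raw):
    """One boundary isolates the crash and reports it as an alarm."""
    try:
        return ("ok", parse_quantity(raw))
    except (TypeError, ValueError) as exc:
        log.error("order rejected: %r (%s)", raw, exc)
        return ("crashed", type(exc).__name__)
```

With the defensive version, a malformed order silently becomes a quantity of zero. With the second version, the same input produces a logged error that someone (or something) can act on, and the rest of the program carries on.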
Thinking in Erlang
Much of the reasoning behind the let it crash philosophy has evolved out of the Erlang programming language and community. Erlang was originally developed in 1986 at Ericsson by Joe Armstrong, Robert Virding and Mike Williams. By a beautiful coincidence, its name, usually credited to Bjarne Däcker, honours the Danish mathematician Agner Krarup Erlang while also reading as an abbreviation of Ericsson Language. Some of Erlang’s key characteristics are:
- distributed - components on networked computers working independently while sharing messages towards a common goal
- fault tolerant - because processes are isolated from one another, a failure in one does not knock out the whole system
- soft real-time - working to programmed deadlines but with a degree of tolerance
- highly available with higher than average uptime - systems remain up and running, particularly important in, for example, hospitals
- hot swapping - code can be replaced while the system is still running
Taking these factors into account, letting small crashes occur in isolation, like a fire contained in a small corridor by fire doors, starts to look rational. The modular character of the code and the concurrency of the different system elements provide a robustness that could not be achieved through defensive programming alone. However, this is just the start of the let it crash journey, for we can now build intelligent systems that go far beyond just maintaining uptime during localised outages. Systems can be made self-healing. This is achieved by early fault detection, diagnosis and rectification, involving machine learning rather than the intervention of a technician or programmer. Thus, crashes are not just a temporary inconvenience to be bypassed; they provide feedback for turning weaknesses into strengths.
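The restart-in-isolation idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `Supervisor` and `make_flaky_worker` are hypothetical names, the worker runs in the same process rather than in a truly isolated one, and the recovery rule is a plain restart limit rather than anything involving machine learning. It is not how Erlang's own supervisors are implemented.

```python
class Supervisor:
    """Restarts a crashing worker up to max_restarts times (hypothetical helper)."""

    def __init__(self, worker, max_restarts=3):
        self.worker = worker
        self.max_restarts = max_restarts
        self.crash_log = []  # every crash is recorded as feedback, not swallowed

    def run(self, *args):
        # One initial attempt plus up to max_restarts restarts.
        for attempt in range(self.max_restarts + 1):
            try:
                return self.worker(*args)
            except Exception as exc:
                self.crash_log.append((attempt, repr(exc)))
        # Restart budget exhausted: escalate instead of looping forever.
        raise RuntimeError(f"worker gave up after {self.max_restarts} restarts")


def make_flaky_worker(failures):
    """Build a worker that crashes `failures` times, then succeeds (simulated fault)."""
    state = {"calls": 0}

    def worker(x):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError(f"simulated fault on call {state['calls']}")
        return x * 2

    return worker


supervisor = Supervisor(make_flaky_worker(failures=2))
result = supervisor.run(21)
print(result, len(supervisor.crash_log))
```

The worker itself contains no error handling at all. The supervisor absorbs the two simulated crashes, keeps a record of them, and still delivers the result. If the crashes exceed the restart limit, the failure is passed upwards rather than hidden.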
From coding to a new perspective
Let it crash implies carelessness, but in fact it is, in its own way, a damage limitation strategy. It can be used in fail-fast testing, which allows systems to crash at the first occurrence of an error while shutting off access and damage to the rest of the system. Of course, the goal is to identify vulnerabilities and fix them fast. But there’s a deeper philosophy that takes us back to the terrorism analogy. Living in fear and locking every door will not free us from anxiety or risk - we may be locking ourselves in with the fiend. Creating robust modular systems that detect intruders and isolate an area will allow us to pass freely from room to room, secure in the knowledge that the unexpected will be managed. Are your systems fault tolerant? Are you confident that you won’t experience embarrassing and costly downtime? Or maybe you’d like to revamp your code into something more resilient, flexible and adaptive - if so, we’re ready to offer our advice and work with you to find the best solution.