Cascading failures
These days in Spain we are wondering what happened to cause a power outage. Integration points are usually the cause of this kind of failure.
System failures start with a crack. That crack comes from some fundamental problem. Maybe there’s a latent bug that some environmental factor triggers. Or there could be a memory leak, or some component just gets overloaded. Things to slow or stop the crack are the topics of the next chapter. Absent those mechanisms, the crack can progress and even be amplified by some structural problems. A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.
Release it!, 2nd edition
Following this ENTSO-E publication about the Iberian power outage, we know the timeline of the events preceding the blackout (April 28, 2025):
12:03–12:07 CEST: The Continental European synchronous area experienced the first period of oscillations (power and frequency swings). These oscillations were detected and seemingly mitigated.
12:19–12:21 CEST: A second period of oscillations occurred within the same synchronous area. Again, these were detected and managed, and the grid appeared stable afterward with no further oscillations detected at this point.
12:32:57–12:33:23 CEST: A series of generation trips occurred in southern Spain. This resulted in an estimated loss of 2200 MW of generation capacity within approximately 20 seconds. Notably, no generation trips were observed in Portugal or France during this initial period.
The frequency of the Iberian Peninsula power system began to decrease, and a voltage increase was observed in Spain and Portugal as a consequence of these generation losses.
12:33:18–12:33:21 CEST: The grid frequency in the Iberian Peninsula continued to drop, reaching below 48.0 Hz. This low frequency triggered the automatic load shedding defense plans in both Spain and Portugal, where parts of the load are automatically disconnected to try and stabilize the system.
This is a crack in a system; it doesn’t matter whether the crack comes from an electricity grid or from a software system.
Do you remember the CrowdStrike incident?
On 19 July 2024, American cybersecurity company CrowdStrike distributed a faulty update to its Falcon Sensor security software that caused widespread problems with Microsoft Windows computers running the software. As a result, roughly 8.5 million systems crashed and were unable to properly restart in what has been called the largest outage in the history of information technology and “historic in scale”.
wikipedia
Can we do something to avoid cracks altogether?
Before the problem occurs
Any complex system has thousands of different states that can lead it to disaster.
In fact there is no way to prevent all of them; in software this is similar to the “halting problem”.
In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever. The halting problem is undecidable, meaning that no general algorithm exists that solves the halting problem for all possible program–input pairs. The problem comes up often in discussions of computability since it demonstrates that some functions are mathematically definable but not computable.
wikipedia
In fact, those states usually come from the integration points of the system. I usually say that the integration points of your system are like the seams of a pair of trousers: they are the most fragile part.
Following our example of a power network, the integration points are the interconnectors, the control systems and the communication networks.
Cascading effects propagate through the integration points because we didn’t create enough shortcut mechanisms to stop them.
Does that mean we are terrible at our job? Not necessarily: being too paranoid when designing a system ends up with a level of complexity that is too costly to operate.
And even if we wanted to pay that bill, we wouldn’t solve everything, because it is impossible to compute all the cracks your system can have, just as with the halting problem.
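Just to make the idea of a shortcut mechanism concrete, here is a minimal sketch in TypeScript of an integration-point call guarded by a timeout, so a slow dependency fails fast instead of dragging its callers down with it. The inventory service and the 2-second budget are made up for the example:

```typescript
// Sketch: wrap an integration-point call with a timeout so a slow
// dependency fails fast instead of propagating its slowness upstream.
// `fetchInventory` and the 2-second budget are illustrative assumptions.

function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Integration point timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function fetchInventory(sku: string): Promise<number> {
  // Hypothetical downstream service call.
  const res = await fetch(`https://inventory.internal/stock/${sku}`);
  if (!res.ok) throw new Error(`Inventory service returned ${res.status}`);
  return (await res.json()).available;
}

// The caller fails fast after 2 seconds instead of hanging while the
// downstream dependency is cracking.
withTimeout(fetchInventory("SKU-123"), 2_000)
  .then((stock) => console.log(`In stock: ${stock}`))
  .catch((err) => console.error(`Degrading gracefully: ${err.message}`));
```

The important part is not the number, it’s that the caller decides how long it is willing to wait instead of letting the cracking dependency decide for it.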
When the problem occurs
There is a list of things you have to keep in mind:
Calm down
Communicate frequently with your stakeholders about the current state
Understand the problem
Rollback if possible
Fix the problem, covering it with an automated test, and deploy to production
Evaluate how to manage the undesired effects created by the problem
Each one is described in detail here.
After the problem occurred
It’s time to gather data and run a postmortem, but be careful: postmortems can become a place to blame people.
You should try to run blameless postmortems; this is a great article explaining them.
The first part of the postmortem is a timeline of events like the one described above for the power outage.
To collect what happened in the system we need an observable system, and one that helps us understand incidents, not just some SaaS we bought hoping it would magically give us everything.
Teams need to create observable systems and build their own observability; in my opinion it’s key that the teams who build the system also run it in order to have good postmortems.
Once we can describe what happened in the system from an observability point of view, we need to enrich this timeline with all the actions we took (see the sketch after this list):
what actions we took at what time
what effects we observed
expectations we had
assumptions we made
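One possible way to capture those enriched entries as data, just as a sketch; the field names and the example values are assumptions, not a standard:

```typescript
// Sketch of an enriched incident-timeline entry: the observable fact plus
// the human side (action, expectation, assumption). Field names and the
// example values are illustrative.
interface TimelineEntry {
  at: Date;                 // when it happened or when we acted
  action?: string;          // what we did
  observedEffect?: string;  // what we saw afterwards
  expectation?: string;     // what we expected to happen
  assumption?: string;      // what we believed to be true at that moment
}

const entry: TimelineEntry = {
  at: new Date("2025-04-28T12:40:00+02:00"),
  action: "Rolled back the last deployment of the checkout service",
  observedEffect: "Error rate dropped from 30% to 2%",
  expectation: "Errors would disappear completely",
  assumption: "The deployment was the only change in that window",
};
```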
As I said, this is a blameless postmortem; we have to keep in mind that everyone did their best with the information they had at that moment.
Blame will only make people hide information, and we don’t want that.
When we have all the information in front of us, we need to start thinking about actions to avoid or mitigate this problem in the future. These actions usually mean improving:
our observability
our shortcut mechanisms for stopping cascading failures
our way of communicating problems
The blameless postmortem should also evaluate the actions we took to deal with the effects the incident created in production while it was ongoing.
Design for failure
Complex systems must minimize the effects of cracks, so we need to design them to maintain stability. But not at any cost: failure is inherent to systems, so we have to accept that our system can fail.
For me, the best way to design for failure is to have great observability and fitness functions that help you understand that something is happening early enough to react and minimize the cost.
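As a sketch of what I mean by a fitness function, here is a periodic check that turns a raw metric into an early warning; the metric reader, the 500 ms p99 budget and the 30-second interval are illustrative assumptions:

```typescript
// Sketch of a fitness function: a periodic check that turns a raw metric
// into an early warning. The metric source, the p99 budget and the way we
// react are illustrative assumptions.

type MetricReader = () => Promise<number>; // e.g. p99 latency in ms

async function latencyFitnessFunction(
  readP99LatencyMs: MetricReader,
  budgetMs = 500,
): Promise<void> {
  const p99 = await readP99LatencyMs();
  if (p99 > budgetMs) {
    // React while it is still a crack, not an outage: page the team,
    // open a circuit breaker, shed load, etc.
    console.warn(`Fitness function failed: p99 ${p99} ms > budget ${budgetMs} ms`);
  }
}

// Example wiring with a stubbed reader; in a real system this would query
// your metrics backend.
const stubbedReader: MetricReader = async () => 420;

setInterval(() => {
  latencyFitnessFunction(stubbedReader).catch((err) =>
    console.error("The fitness check itself failed", err),
  );
}, 30_000);
```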
Once everything is stable again, postmortems and actions will guide us to improve our system stability.
Useful stability patterns that have helped me in my life as a developer (a minimal circuit-breaker sketch follows the list):
Circuit breakers
Timeouts
Bulkheads
Fail fast
Let it crash
Back pressure
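To make the first pattern on that list concrete, here is a minimal circuit-breaker sketch; the thresholds and the protected call are illustrative, and a real implementation (or an existing library) would add a proper half-open state, metrics and concurrency handling:

```typescript
// Minimal circuit-breaker sketch. Thresholds, reset delay and the protected
// call are illustrative assumptions, not a recommendation.

type AsyncCall<T> = () => Promise<T>;

class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly call: AsyncCall<T>,
    private readonly maxFailures = 5,
    private readonly resetAfterMs = 30_000,
  ) {}

  async invoke(): Promise<T> {
    const open = this.failures >= this.maxFailures;
    const coolingDown = Date.now() - this.openedAt < this.resetAfterMs;
    if (open && coolingDown) {
      // Fail fast: don't even call the integration point while it's cracking.
      throw new Error("Circuit open: failing fast");
    }
    try {
      const result = await this.call();
      this.failures = 0; // a healthy response closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: protect a hypothetical downstream call.
const breaker = new CircuitBreaker(() => fetch("https://payments.internal/health"));
breaker.invoke().catch((err) => console.error(err.message));
```

When the circuit is open, the caller gets an immediate error it can degrade on instead of piling up requests on a dependency that is already cracking; that is exactly the kind of shortcut mechanism that stops a crack from cascading.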
Stability in software is a must, but it has a cost, and few companies really require a 99.9995% uptime SLA. We need to design our systems to be stable, but it’s much more important to improve over time than to try to avoid every failure, because that’s impossible.