Complex System Failure

Posted by Mia Kollia on January 31, 2023

Contributors: Marisa Bigelow, Kati Walker PhD, Ashley Wade, Glenn Hodges PhD, Mark Fogel, and Kent Rowand

Resilience: Responding to the Unexpected and Unknown

Aviation-related incidents have dominated the news in recent months: Southwest Airlines’ widespread delays and cancellations over the holidays, the shutdown of both the primary and backup NOTAM (Notice to Air Missions) systems, and several consecutive near-miss incidents at JFK airport. Were these incidents freak accidents, or were there underlying indicators that could have signaled the systems were at risk? Had the FAA and Southwest Airlines been more resilient, the scale of these failures might have been reduced, or the failures prevented altogether.

Defining Resilience

A number of concepts are associated with resilience: the ability to bounce back, the ability to absorb disturbances, the ability to stretch to handle surprises, and the ability to manage the capacity of multiple networked systems, whether human, technological, or procedural. The essence of resilience is the ability to deal with the unknown using a finite amount of resources. Resilience means recognizing that complex system failure is inevitable and that the outcome is determined by how people, teams, and organizations respond to failure. The response itself may consist of people, processes, and technology put in place to act before, during, and after an incident. A closer look at the circumstances surrounding the recent events in aviation, and at a parallel example from the banking industry, may reveal areas where the resilience of these systems could be improved.

Recognizing Resilience

What does reduced resilience look like in an organization or team? It tends to manifest in basic patterns, accompanied by indicators of corrosivity that can alert an organization to its declining resilience.

Detecting and recognizing such indicators can be difficult because, in successful times, they can be written off as measures of efficiency, seemingly indicating that resilience capacity is increasing rather than decreasing.

  • Leading indicators appear before an event has occurred and are harder to measure (e.g., maintenance on safety-critical systems and interdependent systems).
  • Lagging indicators appear after an event has occurred and tend to be easily measured (e.g., equipment downtime).

Detecting and recognizing leading indicators is critical for effective decision-making about system activities, and may make the difference between a healthy push for efficiency and a maladaptive trajectory toward failure.
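As a rough illustration of the distinction, the sketch below tracks both kinds of indicator and flags leading indicators that are trending worse, even while lagging indicators still look healthy. The indicator names, the readings, and the "higher means worse" convention are all assumptions made for the example, not measurements from any real organization.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    LEADING = "leading"  # appears before an event has occurred
    LAGGING = "lagging"  # appears after an event has occurred


@dataclass
class Indicator:
    name: str
    kind: Kind
    readings: list  # oldest first; higher means worse (assumed convention)


def is_worsening(indicator, window=3):
    """True if the last `window` readings are strictly increasing."""
    recent = indicator.readings[-window:]
    return len(recent) == window and all(a < b for a, b in zip(recent, recent[1:]))


def early_warnings(indicators):
    """Names of leading indicators that are trending worse."""
    return [i.name for i in indicators
            if i.kind is Kind.LEADING and is_worsening(i)]


# Hypothetical readings: downtime looks fine while the maintenance backlog grows.
indicators = [
    Indicator("overdue maintenance tasks on safety-critical systems", Kind.LEADING, [2, 4, 7]),
    Indicator("equipment downtime hours", Kind.LAGGING, [1, 1, 0]),
]

print(early_warnings(indicators))
```

In this made-up case the lagging indicator would suggest that all is well, which is the situation in which a worsening leading indicator is most easily written off as efficiency.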

Resilience in Practice

With these indicators of corrosivity in mind, let us re-evaluate a few of the recent airline incidents and one from the banking industry through the lens of resilience. As we inspect each incident, consider what indicators you may recognize and whether each incident was preventable with adequate resilience. What could each organization have done to bolster its resilience?

Notice to Air Missions (NOTAM) System Shutdowns

In an outage that began on January 10, US commercial air travel came to an abrupt halt for only the second time in American history. The result was thousands of delayed and canceled flights, millions of dollars in lost revenue, and considerable inconvenience to millions of travelers. A single failure point, a corrupted file, exposed a number of past design decisions that combined to create the conditions for the Notice to Air Missions (NOTAM) system failure. Further brittleness in FAA policies meant that the loss of the NOTAM system necessitated the abrupt halt of air travel.

The NOTAM system provides pilots with real-time information pertaining to the safe operation of their aircraft. The information in these notices ranges from closed airports and damaged runways to broken backup taxi lights and damaged perimeter fencing. In fact, the vast majority of NOTAMs fall into the latter category and bear on the safe conduct of a flight only in the most unlikely and contrived scenarios. In this instance, both the primary and backup systems failed, leaving no alternative for such a safety-critical system. According to the FAA, the systems were impacted by a contractor implementing an unspecified change. A failure on this scale is frequently preceded by leading or lagging indicators, and often by less damaging “near-misses,” suggesting that the system was not as resilient as previously thought.

Southwest Airlines Holiday Collapse: Leading Indicators in Motion

During this past holiday season, a heavy storm impacted multiple airlines across the United States, causing delays and cancellations. Southwest Airlines’ complex distributed scheduling systems, SkySolver and Crew Web Access, were uniquely positioned for a critical failure because of decades-old code components that had accumulated without ever being revisited or resolved, an issue known as technical debt.

Southwest’s system collapsed under the combined stress of weather delays, rebookings, and technical debt, stranding thousands of passengers while other airlines recovered with greater resilience.

The system outage reduced coordination and communication between Southwest Airlines’ flight crews and crew schedulers to telephone calls and long hold times, as flights were rebooked and information was gathered manually. Each delayed flight fed a growing domino effect, because every subsequent flight crew affected by a delay also had to wait on hold. The airline effectively lost the ability to visualize and direct its crews and planes, and the crisis grew with no graceful extensibility, that is, no capacity to stretch as demand exceeded what the system was built to handle.

The multi-faceted failure was not a surprise to Southwest, nor was the solution. The airline had a number of leading indicators suggesting it was operating under reduced resilience: previous cancellation crises and employee contract negotiations had highlighted the need for improved communication tools and procedures between crews and schedulers, and for modernization of the technologies critical to daily operations. The storm was the catalyst that caused these many weakened areas of resilience to collapse and cascade into problems the airline struggled to handle.
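The hold-queue domino effect described above can be illustrated with a little arithmetic. The sketch below is a toy model, not a reconstruction of Southwest's operations: the call volumes, the scheduler capacity, and the `knock_on` factor (the assumed share of waiting crews whose delay spills over into another crew's call in the next hour) are all hypothetical numbers chosen only to show the shape of the problem.

```python
def hold_queue_backlog(base_calls, capacity, knock_on, hours):
    """Crews still on hold at the end of each hour.

    base_calls: crews calling in per hour due to the original disruption
    capacity:   calls the schedulers can resolve per hour by phone
    knock_on:   extra calls generated per crew already waiting (assumed)
    """
    backlog = 0
    history = []
    for _ in range(hours):
        # Waiting crews delay later flights, which sends more crews to the phone.
        arrivals = base_calls + int(knock_on * backlog)
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history


# Demand below phone capacity: the queue never forms.
print(hold_queue_backlog(base_calls=40, capacity=50, knock_on=0.5, hours=5))
# → [0, 0, 0, 0, 0]

# Demand slightly above capacity: the backlog feeds itself and accelerates.
print(hold_queue_backlog(base_calls=60, capacity=50, knock_on=0.5, hours=5))
# → [10, 25, 47, 80, 130]
```

In this model, once demand exceeds what manual coordination can absorb, the backlog does not merely grow; it grows faster each hour, which is one way to picture a crisis expanding with no graceful extensibility.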

Knight Capital Runaway Trading: Interwoven Leading Indicators

Southwest Airlines is not alone in struggling with technical debt and keeping pace with new challenges. In 2012, the Knight Capital Group nearly went bankrupt after a stock purchasing program acted unexpectedly, purchasing around 150 different stocks in less than an hour at a cost of around $7 billion. A single server’s code revealed a web of hidden interdependencies, including old code, no longer intended for production, that was activated by new production code updates. Once activated, the old code ran rampant, executing over 4 million trades. Employees were not prepared to recognize or respond quickly to the issue, as there were no written procedures for reviewing the manual deployment of software. Additionally, an early email indicating an unusual error was not treated as a high priority, as was usual for such messages at Knight Capital, until the runaway automation began at the start of trading that morning.

The continual churn of updates without refactoring (updating or restructuring code to reduce technical debt), together with the lack of review procedures, led Knight Capital unknowingly closer and closer to the limit of its resilience.
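To make the idea of old code being activated by a new update more concrete, here is a minimal sketch of one way it can happen: a flag is given a new meaning in a release, but a server that missed the manual deployment still interprets it the old way. This is a hypothetical illustration, not Knight Capital's actual code; the `Server` class, the `REUSED_FLAG` name, and the three-server fleet are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    has_new_release: bool  # False if the manual deployment missed this server

    def handle(self, order_flags):
        if "REUSED_FLAG" in order_flags:
            if self.has_new_release:
                return "new routing logic"
            # Old build: the same flag still triggers the retired code path.
            return "legacy code path"
        return "default routing"


def servers_on_legacy_path(fleet, order_flags):
    """Names of servers where these flags would trigger the retired code."""
    return [s.name for s in fleet if s.handle(order_flags) == "legacy code path"]


def safe_to_enable(fleet):
    """A written pre-launch check: every server must carry the new release."""
    return all(s.has_new_release for s in fleet)


fleet = [
    Server("server-1", has_new_release=True),
    Server("server-2", has_new_release=True),
    Server("server-3", has_new_release=False),  # missed during deployment
]

print(safe_to_enable(fleet))                           # False
print(servers_on_legacy_path(fleet, {"REUSED_FLAG"}))  # ['server-3']
```

The point of `safe_to_enable` is that the check is cheap once it is written down. The interdependency stays hidden only as long as nobody is required to compare what each server is actually running before the new behavior is switched on.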

Building Resilience in a Changing World

If Southwest Airlines and Knight Capital had leading indicators of potential impending crises, why didn’t they take action? Complex distributed software systems naturally accumulate technical debt over time, especially in today’s rapidly changing landscape of evolving code frameworks and best practices. Technical debt comes from satisficing with “good-enough” code, which is quite reasonable most of the time given production pressures. Refactoring can clean up this technical debt and realign the code base with current assumptions and best practices.

Dark debt, however, is not so easily tamed. It is similar to technical debt in that parts of the system created in the present pose risks to future operations, but it is much harder to detect. Complex systems give rise to dark debt, which is revealed only through interactions between components and surfaces as anomalies. These vulnerabilities cannot be detected the way technical debt can, and they pose significant challenges to companies, as in the cases of Knight Capital and Southwest Airlines. Overdue code reviews and refactoring are leading indicators of falling behind in managing technical debt. They also point to stale mental models of the system, which make it particularly challenging to respond effectively when dark debt generates anomalies.

These events reveal common patterns in the leading indicators of reduced resilience and show that any complex system in any industry can be affected. It can be difficult to recognize these indicators in the moment, but it is vital to monitor, recognize, and respond to leading indicators when they surface. Resilience is not a state of being but a practice of continuously assessing current capabilities (people, processes, and technology), both individually and in combination, against known and unknown future challenges. Any clear picture or map you have of your organization quickly becomes outdated, which raises the question of where to spend your effort in order to stay up to date and maintain resilience.

Resilience at Mile Two

Mile Two specializes in resilience engineering at the company, team, and individual levels. This allows us to be responsive to changing customer needs, requirements, and environments. In the coming weeks, we will publish additional posts to highlight how organizations can become more resilient. We will review a number of case studies highlighting general patterns in how complex systems fail and how we might anticipate and mitigate these challenges. If you would like to discuss resilience with us, we would love to hear from you!
