Sustaining Resilience in Software and Systems
by Kelly Shortridge, with Aaron Rinehart
Resilience in Software and Systems
This book is a guide to designing, building, and operating systems that are more resilient to attacks.
Resilience is the ability of a system to adapt its functioning in response to changing conditions so that it can continue operating successfully.
Five key aspects of resilience:
- Critical functionality
- Safety boundaries
- Interactions across space-time
- Feedback loops and learning culture
- Flexibility and openness to change
Chapter Takeaways
Software systems are complex.
In complex systems, failure is inevitable.
Failure is never the result of one factor. Think of stressors (acute or chronic).
Resilience: ability to gracefully adapt. This is not robustness.
Security Chaos Engineering (SCE): embraces the idea that failure is inevitable and uses it as a learning opportunity.
Systems-oriented Security
A mental model is a “cognitive representation of external reality”. The more complex a system, the more difficult it is to keep an accurate and complete mental model of it. Attackers exploit our mental models: they attack where our assumptions are wrong (“this will always be true”). The only way to retain an understanding (a “feel”) of a complex system is through experimentation.
This is where we use “resilience stress testing”, or “chaos experimentation” as it is referred to for software systems.
“Making attacker math work for you”: think like an attacker to optimize your effort investment in security.
Use decision trees to model the actions an attacker will take, how our system (including humans in the loop) will respond, and which countermeasures we can implement to thwart attacks.
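As a rough illustration (the structure and names below are my own, not the book's), such a decision tree can be captured in code so it stays reviewable and versioned:

```python
# Minimal sketch of a threat-model decision tree (illustrative names only).
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttackNode:
    """One attacker action, our expected response, and possible follow-ups."""
    action: str                      # what the attacker does
    response: str                    # how the system/humans respond
    countermeasure: str              # mitigation we could invest in
    likelihood: float = 0.5          # rough estimate, refined by experiments
    children: List["AttackNode"] = field(default_factory=list)


def walk(node: AttackNode, depth: int = 0) -> None:
    """Print the tree so stakeholders can review the assumptions."""
    print("  " * depth + f"{node.action} -> {node.response} "
          f"(counter: {node.countermeasure}, p~{node.likelihood})")
    for child in node.children:
        walk(child, depth + 1)


root = AttackNode(
    action="Phish a developer for cloud credentials",
    response="MFA prompt plus anomalous-login alert",
    countermeasure="Hardware security keys",
    likelihood=0.6,
    children=[
        AttackNode(
            action="Use stolen token to list storage buckets",
            response="Access logged; no alert today (an assumption to test)",
            countermeasure="Alert on unusual API call patterns",
            likelihood=0.4,
        )
    ],
)
walk(root)
```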
Ross Ashby’s law of requisite variety: Any system that governs another, larger complex system must have a degree of complexity comparable to the system it is governing
Variability is needed in resilience/security programs so they can absorb the variability of the complex systems they protect.
Chapter Takeaways
To protect complex systems, we need systems thinking
Attackers take advantage of our incomplete mental models for the system
Find loopholes in your own mental model through resilience stress testing
The E&E (Evaluation and Experimentation) approach: incrementally transform toward resilience.
Not “fail-safe” but “safe-to-fail”
SCE - Security Chaos Engineering - prioritizes measurable success outcomes, with a decentralized approach
RAV Engineering: repeatability, accessibility, and variability to support resilience across the software delivery lifecycle
Architecting and Designing
Dijkstra: brainpower is by far our scarcest resource. The competent programmer is fully aware of the strictly limited size of his own skull.
Brainpower and time are scarce resources. Use the Effort Investment Portfolio to balance your portfolio of efforts to achieve your goals. Invest time to think through the why of the design and architecture. Take into account the “local context”: the current practices available and the legacy parts. Take into account the socio-technical context: e.g., will the software’s end users (operators) always be trained and skilled?
Choose opportunities where your organization has sufficient effort capital to allocate and could invest that effort quickly.
The four failure modes resulting from System Design:
- Tight coupling and Linear interactions: Consequential but predictable failures
- Loose coupling and Linear interactions: Limited risk of a major incident
- Tight coupling and complex interactions: DANGER ZONE = surprising and hard-to-control incidents
- Loose coupling and complex interactions: surprising but manageable failures
So there are two axes for a resilient design: coupling and complexity.
Coupling
As designers, we should invest in loose coupling and linearity.
One approach is to design to preserve possibilities. We look for sustained adaptability: the ability to adapt to future surprises as conditions continue to evolve. Loose coupling helps preserve possibilities. Tight coupling is often an undesired result of adding more (more features, etc.) or of striving to reduce complexity (e.g., industrial forests).
Loose coupling can be achieved with the DIE design pattern: Distributed, Immutable, Ephemeral.
Chaos experiments can reveal how much coupling exists in the system.
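As a minimal sketch of such an experiment, assuming hypothetical staging services and a fault-injection hook we control, we can disable one dependency and observe whether the caller degrades gracefully (loose coupling) or fails with it (tight coupling):

```python
# Hypothetical coupling experiment: fail one dependency, observe the caller.
import requests  # assumes both services expose plain HTTP endpoints

SERVICE_URL = "https://checkout.staging.example.com/health"      # system under test
FAULT_URL = "https://recommendations.staging.example.com/fault"  # injection hook we control


def inject_dependency_failure() -> None:
    # Ask our (hypothetical) fault-injection hook to return 503s for 5 minutes.
    requests.post(FAULT_URL, json={"mode": "unavailable", "duration_s": 300}, timeout=5)


def observe_caller() -> str:
    try:
        resp = requests.get(SERVICE_URL, timeout=2)
    except requests.RequestException:
        return "tightly coupled: caller is down with its dependency"
    if resp.status_code == 200:
        return "loosely coupled: caller degrades gracefully (e.g., serves defaults)"
    return f"partially coupled: caller returned {resp.status_code}"


if __name__ == "__main__":
    inject_dependency_failure()
    print(observe_caller())
```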
Complexity
There are two types of complexity: essential (the complexity of the system’s core function and its interactions with users) and accidental (complexity from unintended interactions).
Use chaos experiments to reveal accidental complexity by injecting failures into one component.
Complexity also appears when the system fails in a way that is impossible according to the designers’ mental models, as shown by the Apollo 13 incident.
To reduce complexity, we want to introduce Linearity.
Three ways to do it:
- Isolation: it reduces the interaction between a failing component and the rest of the system
- Choose boring technology (as attackers will do)
- Introduce functional diversity: it starts with functionally disparate components
Identity and Access Management: use permissions at every component level to control authorization. Netflix introduced a service that delegates permissions management to an architectural service instead of a custom implementation within each application.
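A minimal sketch of that idea, not Netflix's actual service: applications ask a shared (hypothetical) authorization endpoint instead of embedding their own permission logic.

```python
# Sketch: applications delegate authorization to a shared policy service.
# The endpoint and payload shape are assumptions for illustration.
import requests

AUTHZ_URL = "https://authz.internal.example.com/check"


def is_allowed(principal: str, action: str, resource: str) -> bool:
    """Ask the central policy service whether this call is permitted."""
    resp = requests.post(
        AUTHZ_URL,
        json={"principal": principal, "action": action, "resource": resource},
        timeout=1,
    )
    resp.raise_for_status()
    return resp.json().get("allowed", False)  # fail closed if the field is missing


# In an application, the check replaces any home-grown permission logic:
if is_allowed("svc-billing", "read", "customers/42/invoices"):
    print("serving invoices")
else:
    print("403 Forbidden")
```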
Make sure humans in the loop are not adding more tight coupling, e.g., through insufficient documentation or centralized management. Systems, as sociotechnical entities, must outlive and outgrow their designers. Continuously refine your understanding of the system through experimentation to avoid flawed mental models.
Chapter Takeaways
Systems are always evolving: from the simple to the complex
Responsibility: nurture your system so that it may recover from incidents and surprises
The Effort Investment Portfolio: to capture the need to balance our “effort capital”
Macro failure modes.
Invest in loose coupling to stay out of the Danger Zone
Invest in linearity: isolation, boring technology, functional diversity
Experiments generate evidence of how our systems behave in reality
Building and Delivering
Chapter Takeaways
Note: we think in terms of systems; therefore it is not only about developing or coding, but about building and delivering. Software has value only when it is delivered to production.
Who owns security? It should be owned by engineering teams in a decentralized way, the same transformation the DBA role went through: security should be owned by software/platform engineering teams.
Decide on the critical functionalities before building, and know which features can be “thrown into the airlock”
Good practices for Building and Delivering activities:
- thoughtful code review,
- choosing boring technologies,
- selecting and standardizing raw materials in software: choosing harmless technologies, such as memory-safe and thread-safe languages (Rust instead of C/C++),
- automating security checks with CI/CD
- continuous delivery (this is advanced!)
- standardization of patterns and tools: “paved roads”
- dependency analysis and vulnerability prioritization
- Infrastructure-as-Code and Configuration-as-Code, whose benefits include faster incident response, minimized environment drift, faster patching and security fixes, minimized misconfigurations, catching vulnerable configurations, autonomic policy enforcement, and stronger change control
- fault injection during development, crafting a thoughtful test strategy, and being cautious about the abstractions we create
Beware of premature and improper abstractions: someone needs to maintain them… An abstraction is tight coupling hiding complexity. Sometimes abstractions are useful, but sometimes they are introduced at too high a cost.
To foster feedback loops: test automation, documentation as a must-have (the why and the how), distributed tracing and logging, and refining how humans interact with our systems
To sustain resilience: look for modularity, use the strangler pattern for incremental, elegant transformation
Operating and Observing
Operating and Observing is all about feedback loops.
Security teams have a lot to learn from SRE teams. Observability is key to understanding systems, especially when they run hot.
SRE teams and security teams have overlapping objectives. The difference is that SRE teams look at overall quality metrics, while security teams focus on security.
Excellence in SRE practice can be evaluated with the DORA metrics (DevOps Research & Assessment): deployment frequency, lead time for changes, time to restore service, and change failure rate.
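A small illustration of how these four metrics could be computed from deployment records (the records and numbers are invented):

```python
# Illustrative computation of the four DORA metrics from deployment records.
from datetime import datetime, timedelta
from statistics import mean

deployments = [  # hypothetical records
    {"committed": datetime(2023, 6, 1, 9), "deployed": datetime(2023, 6, 1, 15),
     "failed": False, "restored_after": None},
    {"committed": datetime(2023, 6, 2, 10), "deployed": datetime(2023, 6, 3, 11),
     "failed": True, "restored_after": timedelta(hours=2)},
]

period_days = 30
deployment_frequency = len(deployments) / period_days                      # deploys/day
lead_time = mean((d["deployed"] - d["committed"]).total_seconds() / 3600
                 for d in deployments)                                      # hours
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
restores = [d["restored_after"] for d in deployments if d["restored_after"]]
time_to_restore = mean(r.total_seconds() / 3600 for r in restores) if restores else 0.0

print(f"deploys/day={deployment_frequency:.2f}, lead time={lead_time:.1f}h, "
      f"CFR={change_failure_rate:.0%}, MTTR={time_to_restore:.1f}h")
```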
SLOs (Service Level Objectives) are a means to measure service reliability and to communicate SLAs (Service Level Agreements) to end users. SLOs are usually derived from four types of signals: latency, traffic, errors, and saturation.
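For example, a 99.9% availability SLO built on the errors signal leaves a small, countable error budget; a rough sketch with invented numbers:

```python
# Rough sketch: availability SLO and remaining error budget from request counts.
slo_target = 0.999            # 99.9% of requests should succeed over the window
total_requests = 2_500_000    # hypothetical 30-day traffic
failed_requests = 1_800       # hypothetical errors observed so far

availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests   # failures we may "spend"
budget_remaining = error_budget - failed_requests

print(f"availability={availability:.4%}")
print(f"error budget={error_budget:.0f} requests, remaining={budget_remaining:.0f}")
print("SLO met" if availability >= slo_target else "SLO breached")
```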
Embrace confidence-based security: success is an active process, a continuous challenge where teams learn experiment after experiment.
Security teams have a bias toward detecting “badness” (misconfigurations, etc.). This is a limiting lens, causing a flood of “findings” that are hard to prioritize and treat correctly. It is better to think about full observability of the system: logs are the “blood” of the system.
Use safety thresholds to identify unhealthy states. These can even be early signs of an attack.
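A toy sketch of the idea, with invented metrics and limits: compare a few signals against safety thresholds and flag unhealthy states.

```python
# Toy example: flag metrics that cross safety thresholds (possible early attack signs).
thresholds = {                      # invented limits for illustration
    "failed_logins_per_min": 50,
    "outbound_gb_per_hour": 20,
    "p99_latency_ms": 800,
}

current = {"failed_logins_per_min": 240, "outbound_gb_per_hour": 3, "p99_latency_ms": 950}

for metric, limit in thresholds.items():
    value = current.get(metric)
    if value is not None and value > limit:
        print(f"UNHEALTHY: {metric}={value} exceeds safety threshold {limit}")
```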
Deception environments bring benefits: learning resilient system design, tracing attackers, and experimenting with your platform and tools. Add chaos experimentation to test your observability.
Scalable systems are inherently safer. Look for where you are missing slack (including in the human aspect).
To extend your team’s capability, automate anything that a computer can do much better than a human.
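One mundane example of such toil, sketched with invented hostnames: checking TLS certificate expiry across hosts so nobody has to do it by hand.

```python
# Example of automating toil: check TLS certificate expiry across hosts.
import socket
import ssl
from datetime import datetime, timezone

HOSTS = ["example.com", "api.example.com"]   # invented hostnames
WARN_DAYS = 30


def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days


for host in HOSTS:
    remaining = days_until_expiry(host)
    status = "OK" if remaining > WARN_DAYS else "RENEW SOON"
    print(f"{host}: {remaining} days left [{status}]")
```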
Chapter Takeaways
This is the phase when we can witness system behavior as it runs in production, which can reveal where our mental models are inaccurate
SRE and security objectives overlap, but SRE better understands the need for speed of execution
Attackers have an asymmetric advantage as they get immediate feedback (success or error)
Laggy recovery from disturbances (in the sociotechnical system) indicates erosion of adaptive capacity
A scalable system is a safer system
Apply the concept of toil (from SRE) to security: any task that a computer can perform better than a human should be automated
Responding and Recovering
Any complex system will suffer stressors and surprises.
Security is a subset of Resilience. Attacks are a type of unhealthy behavior, but not the only kind.
Pat Lagadec: the ability to deal with a crisis situation is largely dependent on the structures that have been developed before chaos arrives.
Preparation for incidents will pay off when they come. Account for it in your Effort Investment Portfolio. Practice recovery.
Action bias in incident response: actions may have adverse effects. Evaluate the cost of acting in a crisis vs. the cost of waiting. A handy heuristic is the null baseline: what do we gain by not taking this action?
Foster a blameless culture to enable organizational learning. Keep in mind that the “blunt end” (e.g., system design) is often more responsible than the “sharp end” (operators).
As in healthcare, root cause analysis far too often stops at the first cause found and blames human error.
More biases: hindsight bias (“I knew it!”) and outcome bias (“This was a good decision because it worked!”).
The Just World Hypothesis: we think the world is a just place, and by default everything will just work…
Humans make decisions within their local context (“local rationality”). Use neutral practitioner questions to untangle the chain of actions and the overall context of an incident.
Chapter Takeaways
Incidents: prepare for them!
You should prepare for incidents early in the software development lifecycle
Sometimes actions are taken too quickly; think “watchful waiting”
Recovering from incidents requires learning, not blaming!
Humans who interact directly with the systems are often the least responsible, yet the most blamed
During incident review: stay neutral, be curious and intellectually honest
Platform Resilience Engineering
Production pressures influence system behavior. Managers and workers striving for short-term efficiency and financial success push the system toward unsafe conditions.
Stakeholders need to apply quality and security pressures as a check against production pressure.
Platform Engineering: emerging discipline that treats the delivery of employee-enabling technology like a product.
The objective of “Platform Resilience Engineering” would be to build supporting technologies so that other teams can build and operate resilient, secure systems.
Five means to increase safety, from most effective to least effective:
- System design and redesign to eliminate hazards
- Substitute less hazardous methods or materials
- Incorporate safety devices and guards
- Provide warning and awareness systems
- Apply administrative controls (guidelines, training, etc.)
System design and redesign to eliminate hazards: do not depend on human behavior; provide complete separation of the user from the hazard.
Substitute less hazardous methods or materials: for example, “don’t roll your own crypto”; use “choice architecture” (nudges); design defaults (the principle of least resistance).
Incorporate safety devices and guards: e.g., rate limits, billing alerts; traditional cybersecurity starts here!
Provide warning and awareness systems: most cybersecurity tools (EDR, SAST, DAST, security dashboards, etc.)
Apply administrative controls (guidelines, training, etc.): policies, including the “infernal” “security obstructionism” policies: manual security reviews, password rotation, internal education programs, phishing exercises, vulnerability management outside of developer workflows, etc.
These policies can be useful as a last resort, but most of the time they are an annoyance, and end users will try to circumvent them.
Checklists have their usefulness, but they shouldn’t be necessary with platform engineering.
Don’t chase Security as a metric.
Chapter Takeaways
Sustain resilience through organizational practices: move from a siloed security program to a platform engineering model (“platform resilience engineering”)
We should build minimum viable products and pursue an iterative change model informed by user feedback
Design security solutions for use by engineering teams: see the SPACE framework for useful metrics on team productivity and well-being
Security Chaos Experiments
Running security chaos experiments is similar to running a scientific experiment. It is key to document in full detail the why and the how of the experiment: this is the experiment’s specification. To write the spec, we can use a (threat-model) decision tree. We need to involve stakeholders to get their buy-in. When starting with SCE, we run experiments in non-production environments only. We write down our assumptions. After the experiment, we collect metrics, data, results, and evidence to demonstrate whether our mental model of how the system works is accurate. Often it will not be, and the result of the experiment will be a set of security findings that we need to work on. We document the results and the lessons learned: this is our input for learning.
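A minimal sketch of what such a specification could look like in code (the fields and content are my own illustration, not a format from the book):

```python
# Minimal sketch of a security chaos experiment specification (invented content).
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentSpec:
    hypothesis: str          # what we believe the system will do
    method: str              # the failure/misconfiguration we inject
    environment: str         # start in non-production
    blast_radius: str        # what may be affected, and the rollback plan
    stakeholders: List[str]  # people who reviewed and bought in
    evidence: List[str] = field(default_factory=list)          # metrics/logs collected
    lessons_learned: List[str] = field(default_factory=list)


spec = ExperimentSpec(
    hypothesis="If an unauthorized port is opened, it is blocked or alerted within 5 minutes",
    method="Open port 8080 on one staging security group",
    environment="staging",
    blast_radius="One test security group; rollback: close the port immediately",
    stakeholders=["security", "platform engineering", "service owner"],
)

# After the run, we record what actually happened and what we learned.
spec.evidence.append("Port change logged, but no alert fired")
spec.lessons_learned.append("Alerting rule missing for security-group changes")
print(spec)
```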
Chapter Takeaways
Experimentation: cycle of discovery and learning, which is what drives scientific progress
Socialize the experiments with stakeholders
Document a precise, exact experiment design specification: you can use decision trees to build your experiments. Include machines and humans in the system.
Communicate experimental findings through release notes
Game-days are a more manual form of conducting chaos experiments
Security Chaos Engineering in the Wild
Experience Report at UnitedHealth Group: Aaron Rinehart explains how his team introduced the concept of Security Chaos Engineering for cloud security as the company was migrating workloads to AWS. They built an open source tool called ChaoSlingr (inspired by Netflix’s Chaos Monkey). They used this tool to inject security misconfigurations (e.g., opening a port) and to follow up on whether the change would be blocked, logged, and reported to the security team. They discovered that 60% of the time the misconfiguration “incidents” were not detected. They could use this tool to reinforce their security posture and their capacity to respond to incidents. Security is first and foremost not the ability to respond to highly complex incidents, but the ability to respond effectively to many mundane security incidents.
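A highly simplified, hypothetical sketch of that experiment loop (not ChaoSlingr’s actual code): inject a misconfiguration, wait, then record whether it was blocked, logged, or alerted on.

```python
# Hypothetical inject-and-verify loop in the spirit of ChaoSlingr (not its real code).
import time


def inject_open_port(security_group: str, port: int) -> None:
    """Placeholder: open a port on a test security group via your cloud API."""
    print(f"[inject] opening port {port} on {security_group}")


def was_blocked(security_group: str, port: int) -> bool:
    """Placeholder: query the cloud API to see if preventive controls reverted it."""
    return False


def was_logged(security_group: str, port: int) -> bool:
    """Placeholder: search the audit log pipeline for the change event."""
    return True


def alert_was_raised(security_group: str, port: int) -> bool:
    """Placeholder: check whether the security team received an alert."""
    return False


if __name__ == "__main__":
    sg, port = "sg-test-0123", 8080        # invented identifiers
    inject_open_port(sg, port)
    time.sleep(60)                          # give controls time to react
    results = {
        "blocked": was_blocked(sg, port),
        "logged": was_logged(sg, port),
        "alerted": alert_was_raised(sg, port),
    }
    print(results)   # feed these results into the experiment's documentation
```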
Experience Report at Verizon: At Verizon, SRE teams were struggling with a huge number of heterogeneous Kubernetes clusters to maintain in production. They introduced chaos engineering as a scientific method to validate their (partial) understanding of each production environment. The experiments enabled them to discover whether applications were deployed with good practices and good security. They used the same approach to optimize cost and performance. As they moved forward, they built a continuous “chaos engineering testing” approach. This scientific approach turned out to be the only way to move, slowly, from an initially painful, chaotic situation to a more controlled one.
Experience Report at OpenDoor: The SCE approach is used to discover gaps in the incident-response capability: for example, gaps in the logging and monitoring capability.
Experience Report at Cardinal Health: The SCE approach has “grown organically”; it is named “Applied Security” and uses continuous verification and validation. The mission of the Applied Security team is to discover unknown security risks, i.e., the security risks not already tracked by compliance-focused security teams.
Experience Report at Accenture: Accenture highlights the communication issues between SRE and security teams. Resolving these communication issues is key to avoiding more chaos during security incidents.
Experience Report at CapitalOne (by David Lavezzo): The SCE approach started with security testing scripts. It then evolved into a set of APIs (with endpoints) that application teams can use to self-test their workloads. The experiments were extended with content from threat intelligence TTPs (Tactics, Techniques, and Procedures): iteratively, take a new threat, experiment with it, and deploy defenses against it. Compliance-driven security is failure-averse; yet failure is necessary to better understand the real risks you face, and avoiding it does a disservice to the teams tasked with securing an enterprise. We need continuous security assessment to make sure security works consistently over time. The goal is to help teams become self-sufficient in identifying the security posture of their tooling.
Chapter Takeaways
Adopting SCE requires a cultural shift and change in mindset
Assert your hypothesis, validate your assumptions, and share what you have learned
Test your security before someone else does it for you
Begin with a culture of fearless experimentation to learn what you don’t know
Establish cross-coordination between SRE and Security teams
My Review
Highly recommended reading! Security Chaos Engineering is a rich book that draws great lessons for security and resilience from multiple fields (ecology, healthcare, cognitive science, etc.). This guide will open the eyes of many security practitioners to the fact that “cybersecurity” as we know it is “broken”. It is a call to action to transform it into Platform Resilience Engineering. I found countless gems inside the book. I enjoyed the focus on mental models and on modeling threats with decision trees. There were also multiple insights into how to get to more resilient software systems across design, build, operations, and incident-response activities. Security Chaos Engineering, with its experimental approach, is grounded in science and is a solid approach to the challenges of protecting and maintaining complex systems in production.