The recent outages caused by a failed CrowdStrike update for Windows hosts underscore the significant challenges enterprises face when deploying and managing software at scale. These disruptions highlight why edge computing requires a different architecture, one that meets the stringent resiliency requirements of distributed environments. As global operations grow increasingly interconnected and dependent on technology, software, and AI, the ramifications of such outages can be severe.
While more information will emerge in the coming days about the specific causes of the CrowdStrike outage, several key issues have already been identified:
- Dependency on the Underlying Operating System for Recovery
Traditional agent-based solutions rely on the operating system for recovery and resilience. This approach can be problematic. In edge computing, it’s crucial to isolate edge devices from the underlying operating system and from each other. This isolation improves resiliency and limits the impact of any single failed update.
- Inability to Automatically Roll Back to a Previous OS Version
A robust edge software stack must include the ability to test any update and, if necessary, revert to a prior version of the operating system autonomously. This capability prevents devices from becoming non-operational (or “bricked”) due to faulty updates and allows recovery without local intervention.
- Challenges with Remote Recovery at Scale
The sheer scale of edge deployments, coupled with remote locations that are often unsafe or hard to reach, necessitates the ability to perform all operations remotely. Limited IT resources in the field make this capability even more critical. Edge solutions must enable remote or autonomous recovery without the need for local IT intervention.
- Ability to Verify Edge Components Autonomously
Distributed edge nodes are vulnerable to attacks and misconfiguration with no local resources to enable recovery. This vulnerability can be mitigated by using measured boot and remote attestation leveraging the node’s TPM (Trusted Platform Module) to detect unauthorized changes and recover autonomously with automatic reversion of the OS.
- Creating Isolation Between Applications
Isolating applications so that a single faulty application does not bring down the entire system is a critical aspect of resilient edge architecture. This approach ensures that one faulty application cannot compromise the overall system, enhancing stability and reliability.
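To make the measured boot and remote attestation idea concrete, here is a minimal Python sketch. It is a simplified simulation, not EVE-OS code and not a real TPM API: the function names and the boot-chain contents are hypothetical, and a production attestation flow would also rely on a TPM-signed quote and a verifier-supplied nonce, both omitted here. What it does show is the hash-chaining rule a TPM uses to extend a PCR (Platform Configuration Register), which is why any change to any boot component yields a different final value.

```python
import hashlib
import hmac

PCR_SIZE = 32  # length of a SHA-256 digest in bytes


def extend(pcr: bytes, component: bytes) -> bytes:
    """Extend a PCR value: new = SHA256(old || SHA256(component))."""
    digest = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + digest).digest()


def measure_boot_chain(components: list) -> bytes:
    """Measure each boot component in order, starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        pcr = extend(pcr, component)
    return pcr


def attest(reported_pcr: bytes, known_good_components: list) -> bool:
    """Verifier side: recompute the expected PCR and compare in constant time."""
    expected = measure_boot_chain(known_good_components)
    return hmac.compare_digest(reported_pcr, expected)


# Hypothetical boot chain, used only for illustration.
golden = [b"firmware-1.0", b"bootloader-2.3", b"os-image-9.4"]
tampered = [b"firmware-1.0", b"bootloader-EVIL", b"os-image-9.4"]

print(attest(measure_boot_chain(golden), golden))    # True
print(attest(measure_boot_chain(tampered), golden))  # False
```

Because each value is chained onto the previous one, the order of components matters as well as their contents, so a node cannot hide a modified bootloader by measuring it at a different point in the sequence.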
Deploying edge software at scale demands ensuring resilience across all layers, and these are exactly the sort of challenges addressed by EVE-OS. Developed specifically for distributed environments, EVE-OS provides the robust architecture needed to maintain the resiliency of critical infrastructure amidst inevitable challenges. Each EVE-OS boot is verified at startup, ensuring secure access to application data and automatic reversion in case of unauthorized or incorrect updates. In light of recent events, EVE-OS’s ability to run Windows in a virtual machine is particularly relevant. This enables automated remote recovery and protects the base OS from any damaging bugs or improper upgrades without the need for manual intervention.
ZEDEDA’s cloud-native SaaS-based solution leverages EVE-OS to provide a purpose-built orchestration solution for the distributed edge, which includes management, security, visibility, and compliance. ZEDEDA is architected from the ground up to deal with exactly these types of issues. For our users, it will be relatively easy to remotely roll back to a prior version of Windows running in a container and remotely execute the local steps needed to recover the Windows operating system at scale.
In addition, EVE-OS is designed to detect a bad operating system update and revert to the last known working version autonomously, with no manual intervention, preventing widespread failure.
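The general pattern behind this kind of automatic reversion can be sketched in a few lines of Python. This is an illustrative model only, assuming a simple "test, then commit or revert" policy; the `Device` class, `apply_update` function, and `max_attempts` parameter are hypothetical names and do not describe how EVE-OS actually implements the feature.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Device:
    active: str                      # version the device currently trusts
    fallback: Optional[str] = None   # previous known-good version, if any


def apply_update(
    device: Device,
    new_version: str,
    boots_ok: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Trial-boot new_version; commit it only if a health check passes.

    boots_ok stands in for whatever health check the platform runs after
    booting the new image (for example, reaching the controller).
    """
    previous = device.active
    for _ in range(max_attempts):
        if boots_ok(new_version):
            # Health check passed: commit and keep the old image as fallback.
            device.fallback = previous
            device.active = new_version
            return "committed"
    # The new image never came up healthy: stay on the known-good version.
    device.active = previous
    return "rolled_back"


device = Device(active="v1")
print(apply_update(device, "v2", lambda v: True), device.active)   # committed v2
print(apply_update(device, "v3", lambda v: False), device.active)  # rolled_back v2
```

The key design choice is that the new version is never marked active until it has proven itself, so a faulty update leaves the device on its last working image without anyone having to visit the site.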
As edge deployments expand globally, it’s crucial that we rethink how to architect for resiliency, and leveraging technologies like EVE-OS is a must. Doing so keeps critical infrastructure operational even when issues arise and eliminates the need for physical access to remote devices to resolve software problems.
The CrowdStrike outage serves as a stark reminder of the difficulties in managing large-scale software deployments and relying on legacy operating systems in edge environments. It also highlights the need for a modern edge computing architecture that can withstand and recover from such disruptions. By adopting a comprehensive and robust edge operating system such as EVE-OS, along with a cloud-based edge orchestration system like ZEDEDA, organizations can mitigate the risks associated with software updates and maintain the seamless operation of their global infrastructure.
To learn more about ZEDEDA’s modern approach to security at the edge, download our security architecture whitepaper.