Navigating Cloud Outages
Here’s how to approach DR strategies in the cloud and protect your operations from events like today’s (Monday, 20th October 2025) outage. Any major infrastructure provider will face downtime at some point, and any business that relies on its infrastructure can face significant downtime and revenue loss as a result.
By Guy Ratcliffe, CTO, BOX3
10/20/2025 · 4 min read

I had a WhatsApp message this morning from a friend, cursing that AWS was down. Hell, I thought, what’s happened now? Off I went to check: I could access all my resources, and everything seemed fine. Ah, US-EAST-1 affected…
Today’s AWS outage(*), which impacted numerous services like Snapchat, Venmo, Ring, ChatGPT, Perplexity, Reddit, and many others, serves as a stark reminder of the importance of robust disaster recovery (DR) strategies, good design practice and business resilience planning, cloud or no cloud!
First, don’t expect AWS, or any other cloud provider or technology, to be infallible. Look at what happened last week with Vodafone(*). Any major infrastructure provider will face downtime at some point, and any business that relies on its infrastructure can face significant downtime and revenue loss as a result. However, with a well-designed plan, you can mitigate the impact of such outages and ensure business continuity.
Here’s how to approach DR strategies in the cloud and protect your operations from events like today’s outage.
DR and good design still matter
Cloud outages, whether due to technical failures like the DynamoDB endpoint issue reported today or other disasters, can disrupt critical services. A solid DR strategy prepares your business to recover quickly, minimizing data loss and downtime. The goal is to maintain availability even when a primary region or service fails.
Key DR Strategies
AWS and other cloud providers offer several DR approaches, each balancing cost, complexity, and recovery speed. Here are the primary strategies to consider, tailored to avoid the pitfalls seen in today’s outage:
1. Backup and Restore
This is the simplest and most cost-effective method: take regular backups of data and applications to a secure location, and restore from them after an outage. Think carefully about where the backups live, though. Backing up to S3 when the rest of your infrastructure is on AWS creates its own problem if you can’t get there! This method has a higher Recovery Time Objective (RTO) and Recovery Point Objective (RPO) than the others, meaning longer downtime and potential data loss, so it’s best suited to non-critical systems. To strengthen it, store backups in multiple regions or other locations to protect against regional failures like the one experienced today in the US-EAST-1 region.
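To make the RTO and RPO trade-off concrete, here is a minimal sketch of the arithmetic behind a Backup and Restore plan. The function names and every duration are illustrative assumptions, not measurements from any real environment: the worst-case RPO is simply the gap between backups, and a rough RTO is the time to notice the outage, rebuild the infrastructure and restore the data.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def estimated_rto(detect: timedelta, provision: timedelta, restore: timedelta) -> timedelta:
    """Rough recovery time: detect the outage, rebuild infrastructure, restore data."""
    return detect + provision + restore


# Hypothetical figures: nightly backups, 30 minutes to detect,
# 2 hours to provision replacement infrastructure, 4 hours to restore.
rpo = worst_case_rpo(timedelta(hours=24))
rto = estimated_rto(
    detect=timedelta(minutes=30),
    provision=timedelta(hours=2),
    restore=timedelta(hours=4),
)
print(f"Worst-case RPO: {rpo}; estimated RTO: {rto}")
```

If either number is more than the business can tolerate, that is the signal to shorten the backup interval or move up to one of the strategies below.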
2. Pilot Light
A step up, this strategy maintains a minimal environment in a secondary region, keeping critical components like databases active and synchronized. During an outage, you scale up this environment to full capacity. This offers a faster recovery than Backup and Restore, balancing cost and speed, and could well have helped businesses impacted today by enabling quicker failover. It also keeps costs low, but it’s key to understand that the data is usually the highest cost, as it needs to be in place before the process can start. Have the process tested and automated as much as possible to enable rapid implementation. Also understand what is needed to get back to primary (e.g. the old primary could now just be the pilot light).
3. Warm Standby
Here, a scaled-down but fully functional version of your production environment runs in another region. It can handle traffic at reduced capacity immediately and scales up during a disaster. This approach significantly reduces RTO compared to Pilot Light and could have mitigated downtime for services today, which saw persistent issues even after AWS’s initial fix.
4. Multi-Site Active/Active
The most resilient (and costly) strategy involves running full environments in multiple regions simultaneously, with traffic distributed across them. If one region fails, as seen today, the others seamlessly handle the load with near-zero RTO and RPO. This would have been ideal for mission-critical apps impacted by the outage, ensuring no disruption to users. Bear in mind, though, that even with this setup some failures at the cloud provider will still cause impact, even for global services.
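One way to compare the four strategies is to pick the cheapest one that still meets your recovery targets. The sketch below does that in a few lines. The RTO and RPO figures attached to each strategy are placeholder assumptions for illustration only (real values depend entirely on your workload and how well the plan is automated), and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # illustrative assumption, not a vendor figure
    typical_rpo: timedelta  # illustrative assumption, not a vendor figure


# Ordered from cheapest to most expensive, as in the list above.
STRATEGIES = [
    Strategy("Backup and Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    Strategy("Multi-Site Active/Active", timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_strategy(target_rto: timedelta, target_rpo: timedelta) -> Strategy:
    """Return the first (cheapest) strategy whose assumed RTO and RPO meet the targets."""
    for strategy in STRATEGIES:
        if strategy.typical_rto <= target_rto and strategy.typical_rpo <= target_rpo:
            return strategy
    # Nothing meets the targets on paper: fall back to the most resilient option.
    return STRATEGIES[-1]


# Example: the business can tolerate 4 hours of downtime and 1 hour of data loss.
choice = cheapest_strategy(timedelta(hours=4), timedelta(hours=1))
print(choice.name)
```

The point of the exercise is the ordering, not the numbers: start from what the business can tolerate and only pay for as much resilience as that requires.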
Designing an Effective DR Strategy
To shield your business from events like today’s AWS outage, consider these steps when designing your DR plan:
Assess Business Needs: Define your RTO (acceptable downtime) and RPO (acceptable data loss). For instance, if losing an hour of data is unacceptable, aim for near-zero RPO with continuous replication using tools like Amazon Aurora Global Database or DynamoDB Global Tables.
Leverage Multi-Region Architecture: Today’s outage in US-EAST-1 highlights the risk of single-region dependency (although US-WEST-1 also had some issues!). Deploy resources across multiple AWS regions using services like Amazon Route 53 for DNS failover to redirect traffic to unaffected regions.
Automate Failover Processes: Manual recovery delays response times. Use AWS Lambda, Auto Scaling, and CloudFormation to automate failover and resource provisioning, reducing recovery time during crises.
Regular Testing: Many businesses affected today might not have tested their DR plans recently, or might have fired their DR lead (speechless on this one…). Simulate failures regularly to validate your setup, ensuring it works under real-world conditions like those experienced this morning.
Secure Data and Access: Protect backups and DR resources with encryption (AWS KMS) and strict IAM policies to prevent unauthorized access during recovery.
Monitor and Alert: Use Amazon CloudWatch (or third-party monitoring tooling) to detect issues early and trigger automated responses, potentially averting full outages or preparing for failover before a crisis escalates.
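The multi-region, automation and monitoring steps above come together in the failover decision itself. The following is a minimal, provider-agnostic sketch of that logic, assuming health checks are fed in from whatever monitoring you use. The class `RegionMonitor`, its threshold of three consecutive failures and the region names are all hypothetical choices for illustration; this is not the API of Route 53, CloudWatch or any other service, although managed health checks work on a broadly similar failure-threshold idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegionMonitor:
    regions: List[str]          # priority order, primary first
    failure_threshold: int = 3  # consecutive failed checks before a region is unhealthy
    failures: Dict[str, int] = field(default_factory=dict, init=False)

    def record(self, region: str, healthy: bool) -> None:
        """Store one health-check result; a success resets the failure count."""
        self.failures[region] = 0 if healthy else self.failures.get(region, 0) + 1

    def is_healthy(self, region: str) -> bool:
        return self.failures.get(region, 0) < self.failure_threshold

    def active_region(self) -> Optional[str]:
        """Highest-priority healthy region, or None if every region is failing."""
        for region in self.regions:
            if self.is_healthy(region):
                return region
        return None


# Simulated outage: the primary fails three checks in a row, the secondary is fine.
monitor = RegionMonitor(["us-east-1", "eu-west-2"])
for _ in range(3):
    monitor.record("us-east-1", healthy=False)
monitor.record("eu-west-2", healthy=True)
active = monitor.active_region()
print(f"Routing traffic to: {active}")
```

Requiring several consecutive failures avoids flapping on a single bad check, and because a healthy result resets the count, traffic returns to the primary once it recovers. Running a simulation like this regularly is also a cheap way to exercise the decision logic as part of DR testing.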
Preparation yields results...
Today’s AWS outage underscores that no cloud provider is immune to disruptions. However, with the right DR strategy, whether it’s a cost-effective Backup and Restore or a robust Multi-Site Active/Active setup, you can ensure your business weathers such storms. By designing a multi-region, automated, and regularly tested DR plan, you can avoid the downtime and frustration faced by many today. Let’s prioritize resilience in the cloud - start reviewing or building your DR strategy now to safeguard your operations against the next outage.
At BOX3 we are here to help. We are expert-led and fiercely independent, so we focus on what's right for your business, not the cloud provider and not the software vendor.
#CloudComputing #DisasterRecovery #AWSOutage #BusinessContinuity #TechResilience
© 2025 Box3 Ltd. All rights reserved.
Registered in England & Wales – 15909135