Building for resilience: Runpod’s response to the AWS us-east-1 outage

An AWS us-east-1 outage degraded Runpod's control plane, but Pods kept running with no data loss, and within 72 hours we added multi-region failover.

Last week on Monday, AWS experienced a significant outage affecting multiple services in the us-east-1 region. The disruption impacted thousands of sites and millions of users across the internet, and Runpod was certainly not immune to these effects.

During the outage, Runpod console availability was impacted because our upstream provider, Vercel, depended on this region. Serverless endpoints continued to receive requests but couldn’t process them because our worker management microservice was affected. Some users experienced issues with Pod provisioning and access, while others encountered extended delays throughout the platform. Our payment processing system was also impacted during this period, and we took steps afterward to ensure customers were not charged for resources they could not use.

Understanding our dependencies

Some customers questioned why Runpod was impacted by an AWS outage at all. While Runpod has over 40 data centers designed for AI application development and deployment, we leverage AWS infrastructure to host critical portions of our control plane. This architecture has enabled us to scale our web application effectively, but it also means that AWS availability directly impacts our platform's operational status.

We want to emphasize, however, that Runpod's GPU compute resources are entirely independent of our control plane. Pod workloads remained operational during the AWS outage, and even when the Runpod UI was unavailable, your Pods, endpoints, and clusters remained intact and secure. Once connectivity was restored, these resources returned to their normal operational state without data loss or configuration changes. Similarly, for Serverless, as soon as the coordination microservice was back online, workers could resume processing requests as normal.

Immediate infrastructure improvements

Following the outage, our engineering team immediately began adding critical redundancy to our infrastructure. Within 72 hours of the incident, the team had deployed our core services across multiple AWS regions. If AWS suffers another outage like this, our platform is prepared to fail over to a healthy region and stay online.
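
To make the failover idea concrete, here is a minimal illustrative sketch of priority-ordered region selection. It is not Runpod's actual implementation: the `pick_region` helper, the `is_healthy` probe, and the second region name are hypothetical and chosen only for the example.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical usage: the primary region is down, so traffic moves to the next one.
# "us-west-2" is an assumed secondary region, not one named in this post.
unhealthy = {"us-east-1"}
active = pick_region(["us-east-1", "us-west-2"], lambda r: r not in unhealthy)
```

In practice the health probe is the hard part: it has to distinguish a real regional outage from a brief blip, so that traffic does not flap between regions.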

We also enhanced our Serverless platform's resilience to control plane disruptions. If necessary, workers can now use their cached configurations, allowing them to continue accepting and processing requests for an extended period, even if the central service is unavailable. When the connection is restored, the workers’ distributed state automatically synchronizes with the control plane. Distributing state this way reduces the blast radius if AWS or any other core internet service suffers another outage.
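
The pattern described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code our workers run: the `Worker` class, the `fetch_config` and `push_results` callables, and the `ControlPlaneUnavailable` exception are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the hypothetical control plane calls when it cannot be reached."""


@dataclass
class Worker:
    fetch_config: Callable[[], dict]        # hypothetical: reads config from the control plane
    push_results: Callable[[list], None]    # hypothetical: reports state to the control plane
    cached_config: Optional[dict] = None
    pending_results: list = field(default_factory=list)

    def current_config(self) -> dict:
        """Refresh the config if possible, otherwise fall back to the cached copy."""
        try:
            self.cached_config = self.fetch_config()
        except ControlPlaneUnavailable:
            if self.cached_config is None:
                raise  # nothing cached yet, so the worker cannot serve requests
        return self.cached_config

    def handle(self, request: str) -> str:
        """Process a request, buffering the result locally until it can be synced."""
        config = self.current_config()
        result = f"{config['model']}:{request}"
        self.pending_results.append(result)
        self.sync()
        return result

    def sync(self) -> None:
        """Push buffered results; keep them locally if the control plane is down."""
        try:
            self.push_results(list(self.pending_results))
        except ControlPlaneUnavailable:
            return  # retry on the next call
        self.pending_results.clear()
```

The key design choice is that a control plane failure changes where the worker reads its configuration and where it stores its state, but not whether it accepts requests. A real implementation would also need to bound how stale a cached configuration may become and how much state can be buffered locally.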

Our roadmap to resilience

This outage was a painful lesson for us, but a valuable one. While we’re proud of our core design, which separates your compute resources from the control plane to keep your workloads safe, we are far from satisfied. Our long-term roadmap includes transitioning to a partitioned multi-region deployment hosted entirely on Runpod's own provider network, with automated load balancing and failover capabilities.

While we cannot prevent external infrastructure failures, we are committed to building a platform that remains resilient throughout such events. We appreciate your patience during last week’s disruption and we’ll continue providing updates on our progress.
