How Scatter Lab Powers 1,000+ Inference Requests per Second with Runpod

2.1 million

cumulative active users

1,000+

requests per second

2.5hrs

daily user session duration

The Problem

GPU scarcity and high costs threatened rapid scaling during peak growth.

Scatter Lab's flagship platform, Zeta, is a place where people can become the main character in a story and talk to AI characters like they’re real. Available in Korea and Japan, Zeta doesn’t just simulate conversation—it sparks connection. And the numbers back that up: within its first year, Zeta grew to 2.1 million cumulative active users, with users spending over 2 hours per day on average chatting with AI personas. That kind of engagement isn’t just impressive—it’s rare, surpassing even platforms like TikTok and YouTube.

Behind every AI conversation on Zeta is a large language model processing inputs in real time. That means a massive LLM serving infrastructure, capable of handling thousands of simultaneous interactions with low latency and high reliability.

To make that happen, Scatter Lab needed hundreds of GPUs running live—every day, around the clock. But when they turned to the major cloud providers to expand, they hit a wall.

“Even with access to AWS, GCP, and Azure, we couldn’t get the GPUs we needed,” one team member shared. “It felt like hitting a ceiling just as we were trying to soar.”

It wasn’t just the scarcity of GPUs that posed a problem. The cost of available instances was often prohibitively high, putting the economics of their rapidly growing business at risk. At the exact moment they needed to scale their infrastructure to meet overwhelming user demand, they were stuck navigating quota limitations and hardware shortages.

And the clock was ticking.

The Solution

Multi-region GPU orchestration with dynamic scaling via APIs.

Scatter Lab responded with a thoughtful shift in strategy: rearchitecting their system to support a multi-region GPU deployment model. Instead of depending on one data center, they would draw resources from multiple sources—and orchestrate it all themselves.
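A multi-region model like this needs a policy for dividing capacity across providers and regions. The sketch below is a minimal, hypothetical illustration of that idea, not Scatter Lab's actual system: the region names and weights are invented, and the weight could stand in for anything from spot price to observed availability.

```python
# Hypothetical sketch: split a target GPU count across regions in
# proportion to a per-region weight (e.g. cost or availability score).
# Region names and weights are invented for illustration.

def split_capacity(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Allocate `total` GPUs across regions proportionally to `weights`."""
    weight_sum = sum(weights.values())
    alloc = {r: int(total * w / weight_sum) for r, w in weights.items()}
    # Integer truncation can leave a remainder; hand leftover GPUs to the
    # highest-weighted regions first so the total always matches.
    remainder = total - sum(alloc.values())
    for region in sorted(weights, key=weights.get, reverse=True)[:remainder]:
        alloc[region] += 1
    return alloc


fleet = split_capacity(100, {"runpod-us": 2.0, "runpod-eu": 1.0, "backup": 1.0})
```

An orchestrator could re-run an allocation like this whenever the target fleet size changes, then reconcile each region toward its share.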

A critical component of that shift was Runpod.

“By leveraging Runpod's APIs, we were able to dynamically scale the number of GPU cloud servers according to the live service load,” the team explained. “It allowed us to serve our large-scale infrastructure at nearly half the cost compared to major cloud providers.”

This wasn’t a backup plan—it was an upgrade. With Runpod’s stable and affordable GPU resources, Scatter Lab gained the ability to allocate exactly the right number of GPUs at exactly the right time, adapting their capacity in real time as user demand fluctuated.

Integration was seamless, thanks to Runpod’s developer-friendly APIs. Autoscaling was built into their workflow. What had once been a bottleneck became a flexible, reliable foundation for real-time inference at scale.
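The decision logic behind an autoscaler like this can be sketched in a few lines. The example below is illustrative only: the per-GPU throughput, fleet limits, and hysteresis threshold are invented numbers, and the function is not Runpod's API, just the kind of sizing rule a control loop might evaluate before calling a provider's scaling endpoints.

```python
# Hypothetical load-based sizing rule for a GPU fleet. All constants are
# invented for illustration; a real deployment would measure its own
# per-GPU throughput and tune the hysteresis margin.

import math

REQS_PER_GPU = 10.0        # assumed sustained requests/sec one GPU worker serves
MIN_GPUS, MAX_GPUS = 8, 400
SCALE_DOWN_HEADROOM = 0.8  # shrink only when load is well under capacity


def desired_gpu_count(reqs_per_sec: float, current: int) -> int:
    """Translate live request load into a target GPU count."""
    needed = math.ceil(reqs_per_sec / REQS_PER_GPU)
    # Hysteresis: hold the current fleet size when load is below capacity
    # but not comfortably so, to avoid scale-up/scale-down thrashing.
    if needed < current and reqs_per_sec > current * REQS_PER_GPU * SCALE_DOWN_HEADROOM:
        needed = current
    return max(MIN_GPUS, min(MAX_GPUS, needed))
```

A control loop would poll live traffic on a short interval, compare `desired_gpu_count` against the running fleet, and issue provider API calls to close the gap.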

The Results

1,000+ requests per second capacity with 50% cost reduction.

The impact was immediate. With Runpod in the mix, Scatter Lab’s LLM infrastructure can now handle over 1,000 requests per second—serving users at scale without compromising on speed or quality.

And the cost savings are real: compared to hyperscalers, Runpod cut infrastructure spend by nearly 50%, giving the team both the breathing room and the confidence to keep growing.

“Runpod gave us a way forward when everything else felt stuck,” one team member reflected. “It’s not just a vendor—it’s part of how we operate.”

Thanks to dynamic autoscaling, Scatter Lab doesn’t have to over-provision or guess. Their GPU fleet adjusts live, expanding and contracting with user demand. That means better performance, lower latency, and a more reliable experience for every user—even during peak usage windows.

Looking Ahead

Scalable infrastructure enables bigger dreams and seamless user experiences.

Scatter Lab isn’t just scaling conversations—they’re reimagining what digital interaction can feel like. As Zeta evolves, the stakes get higher: more users, more nuanced conversations, and new markets to reach.

With Runpod, they’ve built the infrastructure that lets them dream bigger.

“Without Runpod, sustaining our business would have been extremely difficult. We truly appreciate it.”

It’s a partnership built not just on compute power, but on shared values—agility, accessibility, and the belief that great infrastructure should empower people, not constrain them.

And while most users will never see the servers behind their conversations, they’ll feel the difference in every seamless, responsive moment with their favorite AI character.

About

Zeta by Scatter Lab is a place where people can become the main character in a story and talk to AI characters like they’re real.

Industry

Generative AI

Company size

Early-stage startup

Pain point

When they needed to scale their infrastructure to meet overwhelming user demand, they were stuck navigating quota limitations and hardware shortages.
