Run machine learning
inference
at scale.

Only pay for what you use — no idle costs, just unparalleled speed and scalability.

1
2
3
4
5

import runpod
def handler(job):

job_input = job["input"]
return "Running on Runpod!"

runpod.serverless.start({"handler":handler})

runpodctl -- zsh

Spend more time training your models.
Let us handle your
inference.

For your expected load, provision active workers running 24/7 with a 30% discount & flex workers to handle any sudden traffic.

Try it now

99.99% Uptime

Bring Your Container

Network Storage

9 Regions

Streaming

Webhooks

Autoscale in seconds

Respond to user demand in real time with GPU workers that
scale from 0 to 100s in seconds.

Flex

Workers

Active

Workers

10 GPUs

6:24AM

100 GPUs

11:34AM

20 GPUs

1:34PM

Active Workers

-30% DISCOUNT

Dedicated GPUs that handle consistent workloads 24/7.
Get them at a lower cost so you don't break the bank for stable usage.

Flex Workers

Flexible GPUs that cost nothing when idle.
Ready to scale up as soon as your launch goes viral.

Zero Cold-Starts with Active Workers

No cold-start time — because the workers are always
running. Get instant execution when speed is all that matters.

<250ms Cold-Start with Flashboot

Flashboot is an optimization layer for our container system to manage deployments and scale up workers in real time.

Handle more consistent workloads like fine-tuning

Scale workers by Queue Delay or Request count

Monitor your endpoint with

real-time analytics

Usage Analytics

Real-time usage analytics for your endpoint with metrics on completed and failed requests. Useful for endpoints that have fluctuating usage profiles throughout the day.

See the console

Active

Requests

Completed:

2,277

Retried:

Failed:

Execution Time

Total:

1,420s

P70:

P90:

19s

P98:

22s

Execution Time Analytics

Debug your endpoints with detailed metrics on execution time. Useful for hosting models that have varying execution times, like large language models. You can also monitor delay time, cold start time, cold start count, GPU utilization, and more.

See the console

Real-Time Logs

Get descriptive, real-time logs to show you exactly what's happening across your active and flex GPU workers at all times.

See the console

worker logs -- zsh

2024-03-15T19:56:00.8264895Z INFO | Started job db7c79
2024-03-15T19:56:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
12% |██ | 4/28 [00:00<00:01, 12.06it/s]
38% |████ | 12/28 [00:00<00:01, 12.14it/s]
77% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:56:04.7438407Z INFO | Completed job db7c79 in 2.9s
2024-03-15T19:57:00.8264895Z INFO | Started job ea1r14
2024-03-15T19:57:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
15% |██ | 4/28 [00:00<00:01, 12.06it/s]
41% |████ | 12/28 [00:00<00:01, 12.14it/s]
80% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:57:04.7438407Z INFO | Completed job ea1r14 in 2.9s
2024-03-15T19:58:00.8264895Z INFO | Started job gn3a25
2024-03-15T19:58:03.2667597Z
0% | | 0/28 [00:00<?, ?it/s]
18% |██ | 4/28 [00:00<00:01, 12.06it/s]
44% |████ | 12/28 [00:00<00:01, 12.14it/s]
83% |████████ | 22/28 [00:01<00:00, 12.14it/s]
100% |██████████| 28/28 [00:02<00:00, 12.13it/s]
2024-03-15T19:58:04.7438407Z INFO | Completed job gn3a25 in 2.9s

Cost effective
for every inference workload

Save 15% over other Serverless cloud providers on flex workers alone.

Create active workers and configure queue delay for even more savings.

Book a call

GPU Price Per

Flex

Active

141 GB

H200

PRO

$0.00155

$0.00124

Extreme throughput for huge models.

80 GB

A100

$0.00076

$0.00060

High throughput GPU, yet still very cost-effective.

80 GB

H100

PRO

$0.00116

$0.00093

Extreme throughput for big models.

48 GB

A6000, A40

$0.00034

$0.00024

A cost-effective option for running big models.

48 GB

L40, L40S, 6000 Ada

PRO

$0.00053

$0.00037

Extreme inference throughput on LLMs like Llama 3 7B.

24 GB

L4, A5000, 3090

$0.00019

$0.00013

Great for small-to-medium sized inference workloads.

24 GB

4090

PRO

$0.00031

$0.00021

Extreme throughput for small-to-medium models.

16 GB

A4000, A4500, RTX 4000

$0.00016

$0.00011

The most cost-effective for small models.

Thousands of GPUs across 9 Regions

Update your endpoint's region in two clicks. Scale up to 9 regions at a time. Global automated failover is supported out-of-the-box, so you won't have to worry about GPU errors interrupting your ML inference.

Pending Certifications

RunPod has obtained SOC2 Type 1 Certification as of February 2025. Our data center partners maintain leading compliance standards (including HIPAA, SOC2, and ISO 27001)

North America

UR-OR-1

CA-MTL-1

CA-MTL-2

European Union

EUR-IS-1

EUR-IS-2

EUR-NO-1

Europe

EU-NL-1

EU-RO-1

EU-SE-1

Serverless Pricing Calculator

Requests / Hour

Execution Time / Request

seconds

$ 42 /mo

72,000 requests per month

1. Cost estimation includes 50% of the requests using active price & running into 1s cold-start.

We're with you from seed to scale

Book a call with our sales team to learn more.

Gain

Additional Savings

with Reservations

Save more by committing to longer-term usage. Reserve discounted active and flex workers by speaking with our team.

Book a call

Are you an early-stage startup or ML researcher?

Get up to $25K in free compute credits with Runpod. These can be used towards on-demand GPUs and Serverless endpoints.

Apply

CTO, LOVO AI

Hara Kang

"It really shows that RunPod is made by developers. They know exactly what engineers really want and they ship those features in order of importance."

Hara Kang - CTO, LOVO AI

6,683,263,693

6,683,263,193

requests & 100k+ developers since launch

Join our Discord

Get started with RunPod

today.

We handle millions of serverless requests a day. Scale your machine learning inference while keeping costs low.

RunPod

Run machine learninginference at scale.

Only pay for what you use — no idle costs, just unparalleled speed and scalability.

Spend more time training your models.Let us handle your inference.

Cost effective for every inference workload

Run machine learning
inference
at scale.

Spend more time training your models.
Let us handle your
inference.

Cost effective
for every inference workload