Using Ollama to Serve Quantized Models from a GPU Container

TL;DR

For the quickest start, use the pre-defined Ollama template on Runpod.
Deploy a Runpod GPU Pod, expose HTTP port 11434, and set OLLAMA_HOST=0.0.0.0.
Run ollama serve, then ollama pull a quantized model (e.g. llama3.3, qwen3, gemma3).
Point OLLAMA_MODELS at a network volume so your downloads survive a restart.
Call it over Runpod's proxy URL using either Ollama's native /api/chat or its OpenAI-compatible /v1/chat/completions endpoint.

Ollama handles GPU memory, model loading, and serving for you. Because its models ship in the quantized GGUF format, a 7–8B model runs in ~5–6 GB of VRAM, so a modest GPU is plenty. This guide covers setting up the container, loading quantized model files, and getting the server running on a platform like Runpod.

What is Ollama and Why Quantized Models?

Ollama is a lightweight framework and CLI tool that lets you run and manage language models locally (or in any environment) through a simple interface and API. It supports models in GGUF, a compact model file format whose quantized variants significantly reduce memory usage with minimal impact on output quality. With quantization, even larger models that might normally require 30+ GB of VRAM can run on single-GPU setups. Ollama comes with a library of models and handles model serving details for you, such as unloading models from GPU memory when they’re idle to free up resources (which helps when running multiple models sequentially or conserving VRAM for other tasks).

In practice, using Ollama means you don’t have to write your own server or worry about GPU memory management for inference. You can load a model with a one-liner and start a persistent service to answer queries via command line or HTTP.

Step 1: Choose a GPU that fits your model

Ollama's default model tags are 4-bit quantized, so VRAM needs are modest. Use this as a rough guide. The figures include the model plus a typical KV cache, which grows with context length:

Approximate GPU VRAM for 4-bit (Q4) Ollama models
Model size (4-bit default)	Approx. VRAM in use	Recommended GPU VRAM	Example Runpod cards
3–4B (`llama3.2`, `gemma3:4b`)	~3–4 GB	8–12 GB	RTX A4000
7–8B (`llama3.1:8b`, `qwen3:8b`)	~5–6 GB	12–16 GB	RTX A4000, L4
13–14B	~9–10 GB	16–24 GB	RTX A5000, RTX 4090
27–32B (`gemma3:27b`, `qwen3:32b`)	~20–22 GB	24–48 GB	RTX 4090, L40S
70B (`llama3.3:70b`)	~42–45 GB	48–80 GB	L40S (48 GB), A100/H100 (80 GB)

Start small; most workloads are well served by a 7–14B model on a 16–24 GB card, and scale up later only if quality demands it. Compare options on Runpod's GPU comparison page, and see pricing for current hourly rates.
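The table's figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is a rough estimate and not how Ollama itself measures memory. It assumes an average of about 4.5 bits per weight for a 4-bit quant (the real figure varies by quant type) and an 8B Llama-style shape of 32 layers, 8 KV heads, and a head dimension of 128 with a 16-bit KV cache. The function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # 4.5 bits/weight is an assumed average for a 4-bit quant, not an exact value.
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB: keys plus values, per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_len / 1e9


# Assumed shape for an 8B Llama-style model at an 8192-token context.
weights = weights_gb(8)
cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, "
      f"total ~{weights + cache:.1f} GB")
# prints: weights ~4.5 GB, KV cache ~1.1 GB, total ~5.6 GB
```

Under these assumptions the total lands inside the table's ~5–6 GB band for 7–8B models. Doubling the context length doubles the cache term, which is why long contexts can push a model onto a larger card.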

Step 2: Deploy Ollama on a Runpod Pod

You can run the official ollama/ollama Docker image or install Ollama directly on the Pod.

Go to Pods then Deploy and pick a GPU from the table above.
Use the pre-defined official Ollama template if you want to jump in right away, or the PyTorch template if you would rather set things up from scratch.
Under Expose HTTP Ports, add 11434, which is Ollama's default API port.
Add an environment variable OLLAMA_HOST = 0.0.0.0 so the server listens on all interfaces, not just localhost. This is the single most common reason a Runpod Ollama endpoint is unreachable.
Deploy, then open the Web Terminal (or SSH in).

Start the server and confirm it's running:

ollama serve &
ollama --version
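To confirm from code that the server is answering requests, you can query Ollama's /api/version endpoint. The sketch below is a minimal check using only the standard library. It assumes the server is on the default port on localhost, and the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return the running server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Connection refused, timeout, or an unexpected body all count as "not up".
        return None
```

Calling server_version() inside the Pod returns a version string once ollama serve is up, and None otherwise.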

Step 3: Pull and run a quantized model

Ollama's library has 100+ ready-to-run models. Use current ones, not the 2023 defaults:

# Pull a quantized model (4-bit by default)
ollama pull llama3.3

# Chat interactively
ollama run llama3.3

# Or a one-shot completion
ollama run qwen3 "Explain quantization in one sentence."

Good current choices include llama3.3, qwen3, gemma3, mistral, phi4, and gpt-oss. To pick a specific size or quant, use a tag: ollama pull gemma3:27b or ollama pull qwen3:8b. Check what's loaded and whether it's on the GPU with:

ollama ps   # the PROCESSOR column shows GPU vs CPU
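The same information is available over HTTP from GET /api/ps, which reports each loaded model's total size and the portion held in VRAM. The sketch below is one way to turn that into a GPU share. The payload uses made-up sizes for illustration, and the helper names are hypothetical.

```python
def gpu_share(model_entry: dict) -> float:
    """Fraction of a loaded model that resides in VRAM (1.0 = fully on GPU)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


def summarize(ps_payload: dict) -> list[str]:
    """One human-readable line per loaded model."""
    return [f"{m['name']}: {gpu_share(m):.0%} on GPU"
            for m in ps_payload.get("models", [])]


# Illustrative payload with made-up sizes, shaped like a GET /api/ps response.
sample = {"models": [
    {"name": "llama3.3:latest", "size": 4_000, "size_vram": 4_000},
    {"name": "qwen3:32b", "size": 20_000, "size_vram": 15_000},
]}
print(summarize(sample))
# prints: ['llama3.3:latest: 100% on GPU', 'qwen3:32b: 75% on GPU']
```

A share below 100% means part of the model was offloaded to CPU, which usually shows up as much slower generation.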

If you have your own GGUF file, register it with a Modelfile (FROM ./my-model.gguf) and ollama create.
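As a sketch of that registration step, the script below writes a small Modelfile and hands it to ollama create. The model name, the temperature value, and the helper names are illustrative assumptions. Only FROM, PARAMETER, and SYSTEM are used from the Modelfile syntax.

```python
import subprocess
from pathlib import Path


def build_modelfile(gguf_path: str, temperature: float = 0.7, system: str = "") -> str:
    """Return Modelfile text that points at a local GGUF file."""
    lines = [f"FROM {gguf_path}", f"PARAMETER temperature {temperature}"]
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str, workdir: str = ".") -> None:
    """Write the Modelfile to workdir and register it with `ollama create`."""
    path = Path(workdir) / "Modelfile"
    path.write_text(modelfile_text)
    # Requires the ollama CLI on PATH and a running server.
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)
```

For example, create_model("my-model", build_modelfile("./my-model.gguf")) would register the file under the hypothetical name my-model, after which ollama run my-model works like any library model.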

Step 4: Call the API

Expose the Pod and you can reach Ollama at its proxy URL: https://<your-pod-id>-11434.proxy.runpod.net.

Native API: note that Ollama streams by default, so pass "stream": false for a single JSON response:

curl https://<your-pod-id>-11434.proxy.runpod.net/api/chat \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": false
  }'
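If you leave streaming on, the native API returns newline-delimited JSON: one object per chunk, with the final object marked "done": true. The sketch below shows one way to consume that from Python with the standard library. The function names are hypothetical, and base_url stands for either the proxy URL above or http://localhost:11434.

```python
import json
import urllib.request


def join_stream(lines) -> str:
    """Concatenate message chunks from newline-delimited JSON until done is true."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat_stream(base_url: str, model: str, prompt: str) -> str:
    """POST to /api/chat with streaming left on and return the full reply."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return join_stream(resp)  # iterating the response yields one line at a time
```

In a real UI you would print each chunk as it arrives instead of joining them, which is the main reason to keep streaming enabled.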

OpenAI-compatible API: point any OpenAI client at the /v1 base URL and most existing chat-completions code works unchanged:

from openai import OpenAI
client = OpenAI(
    base_url="https://<your-pod-id>-11434.proxy.runpod.net/v1",
    api_key="ollama",  # required by the client but ignored by Ollama
)
resp = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

The first request to a freshly pulled model is slower while it loads into VRAM; subsequent requests are fast. Not every OpenAI feature is supported: chat completions, function calling, and vision work with compatible models, but the Assistants and fine-tuning endpoints do not.
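One way to hide that first-request delay is to load the model ahead of time. A /api/generate call without a prompt loads the model into memory without generating anything, and the keep_alive field controls how long it stays resident. The sketch below assumes a 30-minute keep-alive, which is an arbitrary illustrative value, and the helper names are hypothetical.

```python
import json
import urllib.request


def preload_payload(model: str, keep_alive: str = "30m") -> bytes:
    """Body for a prompt-less /api/generate call, which only loads the model."""
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode()


def preload(base_url: str, model: str, keep_alive: str = "30m") -> bool:
    """Ask the server to load the model now; True if the server answered 200."""
    req = urllib.request.Request(f"{base_url}/api/generate",
                                 data=preload_payload(model, keep_alive),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200
```

Calling preload("http://localhost:11434", "llama3.3") right after the server starts means the first user request does not pay the load cost.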

Step 5: Persist your models across restarts

By default Ollama stores models in ~/.ollama/models (i.e. /root/.ollama/models in the container). On Runpod, anything outside a persistent volume is lost when the Pod is reset, and re-downloading a 40 GB model every session is painful, especially since a reset also discards any partial download that Ollama could otherwise resume.

Two fixes:

Point OLLAMA_MODELS at a network volume: set OLLAMA_MODELS=/workspace/ollama-models (or mount /workspace/ollama:/root/.ollama in Docker) so downloads land on persistent storage.
Stop, don't terminate. A stopped Pod keeps its volume disk and resumes in seconds; terminating deletes the Pod's volume disk and any downloads stored on it. Data on a network volume persists either way.
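Before pulling a large model, it can be worth checking where the downloads will actually land. The sketch below resolves the directory the same way the text describes (OLLAMA_MODELS if set, otherwise ~/.ollama/models) and checks whether it sits under the volume mount. It assumes the persistent volume is mounted at /workspace and does not resolve symlinks. The function names are hypothetical.

```python
import os
from pathlib import Path


def models_dir(env=None) -> Path:
    """Directory models are stored in: OLLAMA_MODELS if set, else ~/.ollama/models."""
    env = os.environ if env is None else env
    return Path(env.get("OLLAMA_MODELS") or Path.home() / ".ollama" / "models")


def is_persistent(path: Path, volume_root: str = "/workspace") -> bool:
    """True if path is the volume root or sits somewhere beneath it."""
    # /workspace is an assumed mount point; adjust it to match your Pod.
    root = Path(volume_root)
    return path == root or root in path.parents
```

Running is_persistent(models_dir()) in the same shell that starts ollama serve returns False when the variable is missing, which is a cheap warning before a 40 GB download goes to ephemeral storage.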

Ollama vs. vLLM vs. llama.cpp: which to use?

Ollama: easiest path to a working API. Best for prototyping, internal tools, single-GPU serving, and anyone who wants model management handled for them.
vLLM: highest throughput under heavy concurrency. Reach for it when you're serving many simultaneous users and want maximum tokens/second per GPU.
llama.cpp: the lowest-level, most portable option, and the engine Ollama is built on. Choose it when you want fine control over the build and flags. (Ollama is essentially a friendly wrapper around it.)

A common pattern: prototype with Ollama, then move to vLLM if you outgrow its throughput. For a deeper comparison, see Runpod's LLM inference optimization playbook.

Frequently asked questions

What models can I run with Ollama?
Any model in Ollama's library (100+), plus your own GGUF files via a Modelfile. Current popular options include Llama 3.3, Qwen 3, Gemma 3/4, Mistral, Phi-4, DeepSeek, and gpt-oss. Run ollama list to see what you've pulled.

Do I need an NVIDIA GPU?
No. Ollama supports NVIDIA (CUDA) and AMD (ROCm on Linux) GPUs, both of which we support on Runpod.

How does quantization affect quality and speed?
4-bit quantization slightly lowers precision, but the quality cost is usually small and the memory savings are large. Because the model fits in VRAM, inference is fast. If you need maximum fidelity, run a larger model or a higher-precision quant (Q5/Q6/Q8) on a bigger GPU.

Does Ollama work with OpenAI client libraries?
Yes. It exposes an OpenAI-compatible API at /v1 on port 11434, so you only need to point your client's base URL there. Chat completions, function calling, and vision work with compatible models; the Assistants and fine-tuning endpoints aren't supported.

Can I serve multiple models or concurrent users on one GPU?
Yes, within VRAM limits. Use OLLAMA_MAX_LOADED_MODELS to keep several models resident and OLLAMA_NUM_PARALLEL to handle simultaneous requests. Each adds VRAM pressure, so monitor with ollama ps and nvidia-smi.
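These settings are read from the server's environment at startup, so they have to be in place before ollama serve launches. The sketch below starts the server from Python with them set. The values 4 and 2 are illustrative assumptions, not recommendations, and the helper names are hypothetical.

```python
import os
import subprocess


def server_env(num_parallel: int = 4, max_loaded: int = 2, base=None) -> dict:
    """Copy the environment and add the Ollama concurrency settings."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_HOST"] = "0.0.0.0"
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)      # simultaneous requests per model
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded)   # models kept resident at once
    return env


def start_server(**kwargs) -> subprocess.Popen:
    """Launch `ollama serve` in the background with the settings above."""
    return subprocess.Popen(["ollama", "serve"], env=server_env(**kwargs))
```

On Runpod you can get the same effect by adding the two variables to the Pod's environment settings instead of launching the server from a script.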

How much does it cost to run Ollama on the cloud?
You pay for GPU time. A small card (e.g. RTX A4000) is enough for many 7–14B quantized models and costs a fraction of a high-end GPU; larger models on a 48–80 GB card cost more per hour. Idle models unload from VRAM, but the Pod keeps billing while it runs, so stop it between sessions to avoid paying while you're not serving. Check Runpod pricing for current rates.

My endpoint returns a 502 or isn't reachable. What's wrong?
Almost always one of: port 11434 isn't exposed, OLLAMA_HOST isn't set to 0.0.0.0, or the ollama serve process isn't running. Verify all three, then check the Pod logs. If a model fails to load, you're likely out of VRAM, so use a smaller model or a larger GPU.
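The three checks can be partly automated from inside the Pod. The sketch below is a rough triage helper, not an official diagnostic. It tests whether anything is listening on the local port and inspects OLLAMA_HOST in the environment passed to it, which may differ from the environment the server was actually started with. The function names are hypothetical.

```python
import socket


def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(env: dict, local_port_open: bool) -> list[str]:
    """Map the two local checks to the likely causes listed above."""
    hints = []
    if not local_port_open:
        hints.append("nothing is listening on 11434: start `ollama serve`")
    if env.get("OLLAMA_HOST", "") not in ("0.0.0.0", "0.0.0.0:11434"):
        hints.append("OLLAMA_HOST is not 0.0.0.0: the server may only accept localhost")
    if not hints:
        hints.append("server looks fine locally: check that port 11434 is exposed on the Pod")
    return hints
```

A typical call is triage(dict(os.environ), port_open()) after importing os. If both local checks pass, the remaining suspect is the port exposure setting on the Pod itself.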
