TL;DR
- For the quickest start, use the pre-defined Ollama template on Runpod.
- Deploy a Runpod GPU Pod, expose HTTP port 11434, and set OLLAMA_HOST=0.0.0.0.
- Run ollama serve, then ollama pull a quantized model (e.g. llama3.3, qwen3, gemma3).
- Point OLLAMA_MODELS at a network volume so your downloads survive a restart.
- Call it over Runpod's proxy URL using either Ollama's native /api/chat or its OpenAI-compatible /v1/chat/completions endpoint.
Ollama handles GPU memory, model loading, and serving for you. Quantized models in GGUF mean a 7–8B model runs in ~5–6 GB of VRAM, so a modest GPU is plenty. Here, we’ll cover setting up the container, loading quantized model files, and getting the system running on a platform like Runpod.
What is Ollama and Why Quantized Models?
Ollama is a lightweight framework and CLI tool that lets you run and manage language models locally (or in any environment) through a simple interface and API. It supports models in the quantized GGUF format, a compressed model format that significantly reduces memory usage with minimal impact on output quality. With quantization, even larger models that might normally require 30+ GB of VRAM can run on single-GPU setups. Ollama comes with a library of models and handles model serving details for you, such as unloading models from GPU memory when they’re idle to free up resources (which helps when running multiple models sequentially or conserving VRAM for other tasks).
In practice, using Ollama means you don’t have to write your own server or worry about GPU memory management for inference. You can load a model with a one-liner and start a persistent service to answer queries via command line or HTTP.
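To make the memory saving from quantization concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It only illustrates the general idea: the symmetric absmax scaling, the block of eight weights, and the function names are assumptions chosen for this example, not GGUF's actual storage layout or Ollama's code.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to small signed integers that share one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights, purely for illustration
weights = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print("largest rounding error:", round(max_error, 3))
```

Each weight now needs 4 bits plus a share of one per-block scale, instead of 16 bits in half precision. The rounding error is bounded by half the scale, which is the "slightly lower precision" trade-off.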
Step 1: Choose a GPU that fits your model
Ollama's default model tags are 4-bit quantized, so VRAM needs are modest. Use this as a rough guide. The figures include the model plus a typical KV cache, which grows with context length:
Start small: most workloads are well served by a 7–14B model on a 16–24 GB card, so scale up only if quality demands it. Compare options on Runpod's GPU comparison page, and see pricing for current hourly rates.
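As a rough cross-check on sizing, the arithmetic behind "model plus KV cache" can be sketched as follows. The figure of about 4.5 bits per weight (4-bit values plus per-block scales) and the 8B model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions for illustration. Real usage also includes runtime overhead that this estimate ignores.

```python
def estimate_vram_gb(params_billions, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes=2):
    """Rough VRAM estimate in GB: quantized weights plus KV cache."""
    weights = params_billions * 1e9 * bits_per_weight / 8
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9


# Assumed shape for an 8B Llama-style model at two context lengths
short_ctx = estimate_vram_gb(8, 4.5, 32, 8, 128, context_len=8192)
long_ctx = estimate_vram_gb(8, 4.5, 32, 8, 128, context_len=32768)
print(round(short_ctx, 1), "GB at 8k context")
print(round(long_ctx, 1), "GB at 32k context")
```

The weights are a fixed cost, while the cache term scales linearly with context length, so long-context workloads can need noticeably more VRAM than the model size alone suggests.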
Step 2: Deploy Ollama on a Runpod Pod
You can run the official ollama/ollama Docker image or install Ollama directly on the Pod.
- Go to Pods then Deploy and pick a GPU from the table above.
- Pick the pre-defined official Ollama template to get going right away, or the PyTorch template if you want to set things up from scratch.
- Under Expose HTTP Ports, add 11434, which is Ollama's default API port.
- Add an environment variable OLLAMA_HOST=0.0.0.0 so the server listens on all interfaces, not just localhost. This is the single most common reason a Runpod Ollama endpoint is unreachable.
- Deploy, then open the Web Terminal (or SSH in).
Start the server and confirm it's running:
ollama serve &
ollama --version
Step 3: Pull and run a quantized model
Ollama's library has 100+ ready-to-run models. Use current ones, not the 2023 defaults:
# Pull a quantized model (4-bit by default)
ollama pull llama3.3
# Chat interactively
ollama run llama3.3
# Or a one-shot completion
ollama run qwen3 "Explain quantization in one sentence."
Good current choices include llama3.3, qwen3, gemma3, mistral, phi4, and gpt-oss. To pick a specific size or quant, use a tag: ollama pull gemma3:27b or ollama pull qwen3:8b. Check what's loaded and whether it's on the GPU with:
ollama ps # the PROCESSOR column shows GPU vs CPU
If you have your own GGUF file, register it with a Modelfile (FROM ./my-model.gguf) and ollama create.
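The GPU-versus-CPU check can also be done over HTTP, which helps when you monitor a Pod from a script. The sketch below assumes Ollama's /api/ps endpoint returns a models list whose entries carry name, size, and size_vram fields. Verify the field names against the API docs for your version. The sample entry uses made-up numbers.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


def describe(model_entry):
    """One-line summary similar in spirit to the PROCESSOR column."""
    pct = round(gpu_fraction(model_entry) * 100)
    return f"{model_entry['name']}: {pct}% GPU"


def loaded_models(base_url="http://localhost:11434"):
    """Fetch currently loaded models from /api/ps (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])


# Offline demonstration with a hand-written entry (not real server output)
sample = {"name": "llama3.3:latest", "size": 4000, "size_vram": 3000}
print(describe(sample))
```

On the Pod itself, looping over loaded_models() and printing describe(entry) gives a live view. Anything below 100% means part of the model spilled to CPU memory and inference will be slower.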
Step 4: Call the API
Expose the Pod and you can reach Ollama at its proxy URL: https://<your-pod-id>-11434.proxy.runpod.net.
Native API: note that Ollama streams by default, so pass "stream": false for a single JSON response:
curl https://<your-pod-id>-11434.proxy.runpod.net/api/chat \
-d '{
"model": "llama3.3",
"messages": [{"role": "user", "content": "Why is the sky blue?"}],
"stream": false
}'
OpenAI-compatible API: point any OpenAI client at the /v1 base URL and existing code works unchanged:
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-pod-id>-11434.proxy.runpod.net/v1",
    api_key="ollama",  # required by the client but ignored by Ollama
)
resp = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
The first request to a freshly pulled model is slower while it loads into VRAM; subsequent requests are fast. (Not every OpenAI feature is supported; chat completions, function calling, and vision work with compatible models; the Assistants and fine-tuning endpoints do not.)
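Because the native API streams by default, a client that leaves "stream" unset has to read the response line by line. The sketch below assumes each line is a standalone JSON object with a message.content fragment and a done flag, which is how the streamed /api/chat output is commonly described. The helper names and the sample lines are hypothetical.

```python
import json
import urllib.request


def iter_chat_chunks(lines):
    """Yield text fragments from an iterable of JSON lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.strip():
            continue
        chunk = json.loads(line)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(base_url, model, prompt):
    """POST to /api/chat with streaming left on; yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_chat_chunks(resp)


# Offline demonstration with hand-written lines (not real model output)
sample_lines = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_chat_chunks(sample_lines)))
```

Against a live Pod, iterate over stream_chat(...) with your proxy URL and print each fragment as it arrives. Keeping the parser separate from the HTTP call makes it easy to test without a GPU.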
Step 5: Persist your models across restarts
By default Ollama stores models in ~/.ollama/models (i.e. /root/.ollama/models in the container). On Runpod, anything outside a persistent volume is lost when the Pod is reset, and re-downloading a 40 GB model every session is painful. A partially downloaded model on ephemeral storage is lost along with everything else.
Two fixes:
- Point OLLAMA_MODELS at a network volume: set OLLAMA_MODELS=/workspace/ollama-models (or mount /workspace/ollama:/root/.ollama in Docker) so downloads land on persistent storage.
- Stop, don't terminate. A stopped Pod keeps its volume and resumes in seconds; terminating wipes /workspace and your downloads with it.
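Before pulling a large model, it can be worth confirming that the model directory really resolves to persistent storage. This is a minimal sketch that assumes the volume is mounted at /workspace, as in the example above. The helper names are hypothetical, and the check compares paths only, without inspecting actual mounts.

```python
import os
from pathlib import Path


def models_dir(env=None):
    """Directory Ollama would use: OLLAMA_MODELS, else ~/.ollama/models."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA_MODELS")
    return Path(override) if override else Path.home() / ".ollama" / "models"


def is_on_volume(path, volume_root="/workspace"):
    """True if path is the volume root or sits somewhere beneath it."""
    path, root = Path(path), Path(volume_root)
    return path == root or root in path.parents


configured = models_dir({"OLLAMA_MODELS": "/workspace/ollama-models"})
print(configured, "persistent:", is_on_volume(configured))
print("/root/.ollama/models persistent:", is_on_volume("/root/.ollama/models"))
```

Calling models_dir() with no argument reads the real environment, so running it in the Pod's terminal shows where the next ollama pull will land.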
Ollama vs. vLLM vs. llama.cpp: which to use?
- Ollama: easiest path to a working API. Best for prototyping, internal tools, single-GPU serving, and anyone who wants model management handled for them.
- vLLM: highest throughput under heavy concurrency. Reach for it when you're serving many simultaneous users and want maximum tokens/second per GPU.
- llama.cpp: the lowest-level, most portable option, and the engine Ollama is built on. Choose it when you want fine control over the build and flags. (Ollama is essentially a friendly wrapper around it.)
A common pattern: prototype with Ollama, then move to vLLM if you outgrow its throughput. For a deeper comparison, see Runpod's LLM inference optimization playbook.
Frequently asked questions
What models can I run with Ollama?
Any model in Ollama's library (100+), plus your own GGUF files via a Modelfile. Current popular options include Llama 3.3, Qwen 3, Gemma 3/4, Mistral, Phi-4, DeepSeek, and gpt-oss. Run ollama list to see what you've pulled.
Do I need an NVIDIA GPU?
No. Ollama supports NVIDIA (CUDA) and AMD (ROCm on Linux), both of which we support on Runpod.
How does quantization affect quality and speed?
4-bit quantization slightly lowers precision, but the quality cost is usually small and the memory savings are large. Because the model fits in VRAM, inference is fast. If you need maximum fidelity, run a larger model or a higher-precision quant (Q5/Q6/Q8) on a bigger GPU.
Does Ollama work with OpenAI client libraries?
Yes. It exposes an OpenAI-compatible API at /v1 on port 11434; point your client's base URL there. Chat completions, function calling, and vision work with compatible models; the Assistants and fine-tuning endpoints aren't supported.
Can I serve multiple models or concurrent users on one GPU?
Yes, within VRAM limits. Use OLLAMA_MAX_LOADED_MODELS to keep several models resident and OLLAMA_NUM_PARALLEL to handle simultaneous requests. Each adds VRAM pressure, so monitor with ollama ps and nvidia-smi.
How much does it cost to run Ollama on the cloud?
You pay for GPU time. A small card (e.g. RTX A4000) is enough for many 7–14B quantized models and costs a fraction of a high-end GPU; larger models on a 48–80 GB card cost more per hour. Idle models unload from VRAM automatically, but a running Pod is billed whether or not it is serving, so stop the Pod between sessions to avoid paying for idle GPU time. Check Runpod pricing for current rates.
My endpoint returns a 502 or isn't reachable. What's wrong?
Almost always one of: port 11434 isn't exposed, OLLAMA_HOST isn't set to 0.0.0.0, or the ollama serve process isn't running. Verify all three, then check the Pod logs. If a model fails to load, you're likely out of VRAM, so use a smaller model or a larger GPU.
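A quick scripted probe can help separate "the proxy answered with an error" from "nothing answered at all" when debugging reachability. The sketch below is one way to do that with the standard library. The proxy URL pattern comes from this guide; the helper names and the wording of the diagnosis strings are assumptions.

```python
import urllib.error
import urllib.request


def proxy_url(pod_id, port=11434):
    """Build the Runpod proxy URL pattern used in this guide."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def check_endpoint(base_url, timeout=10):
    """Return a short diagnosis string for an Ollama base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
            return f"reachable (HTTP {resp.status}): {body[:60]}"
    except urllib.error.HTTPError as err:
        # Something answered, but with an error status such as 502.
        return (f"HTTP {err.code}: check that ollama serve is running "
                "and OLLAMA_HOST=0.0.0.0 is set")
    except (urllib.error.URLError, OSError) as err:
        # No usable answer: DNS failure, refused connection, or timeout.
        return f"unreachable: {err}"


print(proxy_url("<your-pod-id>"))
```

Calling check_endpoint(proxy_url("your-real-pod-id")) from your own machine tests the full path through the proxy. Running check_endpoint("http://localhost:11434") inside the Pod tests only the server, which narrows down where the problem sits.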
Related guides
- Deploying GPT4All in the Cloud Using Docker and a Minimal API
- How to Serve Phi-2 on a Cloud GPU with vLLM and FastAPI
- How to Run OpenChat on a Cloud GPU Using Docker
- The Fastest Way to Run Mixtral in a Docker Container with GPU Support
- How to Deploy RAG Pipelines with Faiss and LangChain on a Cloud GPU
- How to Deploy LLaMA.cpp on a Cloud GPU Without Hosting Headaches
