Ollama lets you run large language models (LLMs) like Llama, Mistral, and Gemma on your own hardware. No API keys, no per-token costs, no data leaving your server. You need a VPS with enough RAM and CPU to run the model you want.

Hardware Requirements

LLMs are memory-hungry. The model needs to fit entirely in RAM (or VRAM if using a GPU). Here’s what you need for CPU-only inference:

  Model           Parameters   RAM Required   Speed (CPU)      Use Case
  Gemma 2B        2B           4 GB           ~20 tokens/sec   Lightweight tasks, classification
  Phi-3 Mini      3.8B         6 GB           ~12 tokens/sec   Code assistance, summarization
  Llama 3.1 8B    8B           8 GB           ~8 tokens/sec    General chat, writing, analysis
  Mistral 7B      7B           8 GB           ~8 tokens/sec    General purpose, good quality/speed
  Llama 3.1 70B   70B          48 GB          ~1 token/sec     High quality, very slow on CPU

Recommended GoZen VPS specs:

  • Small models (2B-3B): 4 vCPU / 6 GB RAM is sufficient
  • Medium models (7B-8B): 8+ vCPU / 16 GB RAM recommended
  • Large models (70B+): Not practical on a CPU-only VPS; a GPU server is needed

GoZen’s Dedicated VPS plans with AMD EPYC processors provide the single-thread performance and RAM needed for 7B-8B models.
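
Before committing to a model, check what the VPS actually has to work with. A couple of standard Linux commands are enough:

  # vCPU count and total RAM
  nproc
  free -h

  # Free disk space for model downloads
  df -h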

Installing Ollama

SSH into your VPS and run the official install script:

  curl -fsSL https://ollama.com/install.sh | sh
  

Verify:

  ollama --version
  

Ollama runs as a systemd service automatically:

  sudo systemctl status ollama
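
If the service isn't active, the journal usually shows why, and it's also where model-load messages appear:

  # Follow Ollama's logs in real time
  journalctl -u ollama -f

  # Make sure the service starts on boot
  sudo systemctl enable ollama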
  

Running Your First Model

  # Download and run Llama 3.1 8B
  ollama run llama3.1

  # You'll see a chat prompt:
  # >>> Send a message (/? for help)

Type a prompt and press Enter. The first run downloads the model (4-5 GB for an 8B model); subsequent runs skip the download and start much faster.
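
If you'd rather fetch a model without opening the interactive prompt (for example, while provisioning a server), ollama pull does only the download:

  # Download the model without starting a chat session
  ollama pull llama3.1

Inside the chat prompt, /bye returns you to the shell. A few other models worth trying: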

  # Mistral 7B - fast, good quality
  ollama run mistral

  # Gemma 2B - very lightweight
  ollama run gemma2:2b

  # Code Llama - optimized for programming
  ollama run codellama

  # Phi-3 Mini - good for structured tasks
  ollama run phi3

List downloaded models:

  ollama list
  

Remove a model:

  ollama rm llama3.1
  

Using the API

Ollama exposes a REST API on port 11434. You can call it from any application:

  # Generate a completion
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1",
    "prompt": "Explain DNS in two sentences.",
    "stream": false
  }'

  # Chat format
  curl http://localhost:11434/api/chat -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "What is DKIM?"}
    ],
    "stream": false
  }'
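
If you leave out "stream": false, the API streams the answer back as newline-delimited JSON chunks, each carrying a piece of the text in its "response" field. A minimal sketch that reassembles the output, assuming jq is installed:

  # Stream a completion and join the chunks back into plain text
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.1",
    "prompt": "Explain DNS in two sentences."
  }' | jq -rj '.response'
  echo   # trailing newline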
  

From Python

  import requests

  # Ask the local Ollama server for a completion and print just the text
  response = requests.post("http://localhost:11434/api/generate", json={
      "model": "llama3.1",
      "prompt": "Write a SQL query to find duplicate emails",
      "stream": False
  })

  print(response.json()["response"])

From JavaScript/Node.js

  // Call the local Ollama API (Node.js 18+ has fetch built in)
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({
      model: "llama3.1",
      prompt: "Explain caching in one paragraph",
      stream: false,
    }),
  });

  const data = await response.json();
  console.log(data.response);

Adding a Web Interface

Open WebUI gives you a ChatGPT-like interface for Ollama:

  docker run -d \
    --name open-webui \
    --network host \
    -v open-webui:/app/backend/data \
    -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
    ghcr.io/open-webui/open-webui:main

Access it at http://your-server-ip:8080. Set up a reverse proxy to put it behind HTTPS.
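
Until the reverse proxy and HTTPS are in place, it's worth confirming the UI locally and keeping port 8080 closed to the public. A rough sketch, assuming ufw is your firewall:

  # Confirm the UI answers locally
  curl -I http://localhost:8080

  # Keep port 8080 off the public internet until HTTPS is set up
  sudo ufw deny 8080/tcp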

Exposing Ollama Securely

By default, Ollama only listens on localhost. If you need to access it from other machines:

  1. Edit the service file:

  sudo systemctl edit ollama

  2. Add the override:

  [Service]
  Environment="OLLAMA_HOST=0.0.0.0"

  3. Restart:

  sudo systemctl restart ollama

  4. Restrict access with a firewall rule so only trusted machines can reach the API:

  # Only allow access from specific IPs
  sudo ufw allow from YOUR.TRUSTED.IP to any port 11434
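
Once that's done, you can confirm the new bind address and sanity-check the API from the trusted machine. /api/tags simply lists the installed models, which makes it a convenient connectivity test:

  # Ollama should now be listening on 0.0.0.0:11434
  ss -tlnp | grep 11434

  # From the trusted machine, list the installed models over the network
  curl http://YOUR.SERVER.IP:11434/api/tags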
  

Performance Tuning

Adjusting Thread Count

By default, Ollama uses all available CPU cores. On a shared VPS, you may want to limit it:

  # Set thread count
  OLLAMA_NUM_THREADS=4 ollama serve
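
The thread count can also be set per request through the API's options field; num_thread is one of Ollama's model parameters. A quick sketch:

  # Limit a single request to 4 CPU threads via the options field
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1",
    "prompt": "Summarize what a reverse proxy does.",
    "stream": false,
    "options": {"num_thread": 4}
  }'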
  

Using Quantized Models

Quantized models use less RAM and run faster at a small quality cost:

  # Pull the Q4 quantized version (smaller, faster)
  ollama run llama3.1:8b-instruct-q4_0

Monitoring Resource Usage

  # Watch CPU and RAM in real time
  htop

  # Check which models are loaded and how much memory they use
  ollama ps

Troubleshooting

  Problem                              Fix
  “Out of memory” during model load    Model is too large for your RAM. Try a smaller model or quantized version
  Very slow inference                  CPU-only is inherently slow for large models. Use a smaller model (2B-3B) for acceptable speed
  API connection refused               Check ollama ps to verify it’s running. Restart with sudo systemctl restart ollama
  Port 11434 not reachable remotely    Ollama defaults to localhost only. Set OLLAMA_HOST=0.0.0.0 if you need remote access
  Model download fails                 Check disk space. 7B models need 4-5 GB. df -h to verify
