Running Local AI (Ollama) on a VPS
Run open-source AI models on your GoZen VPS with Ollama. Private, no API costs, full control.
Ollama lets you run large language models (LLMs) like Llama, Mistral, and Gemma on your own hardware. No API keys, no per-token costs, no data leaving your server. You need a VPS with enough RAM and CPU to run the model you want.
Hardware Requirements
LLMs are memory-hungry. The model needs to fit entirely in RAM (or VRAM if using a GPU). Here’s what you need for CPU-only inference:
| Model | Parameters | RAM Required | Speed (CPU) | Use Case |
|---|---|---|---|---|
| Gemma 2B | 2B | 4 GB | ~20 tokens/sec | Lightweight tasks, classification |
| Phi-3 Mini | 3.8B | 6 GB | ~12 tokens/sec | Code assistance, summarization |
| Llama 3.1 8B | 8B | 8 GB | ~8 tokens/sec | General chat, writing, analysis |
| Mistral 7B | 7B | 8 GB | ~8 tokens/sec | General purpose, good quality/speed |
| Llama 3.1 70B | 70B | 48 GB | ~1 token/sec | High quality, very slow on CPU |
Recommended GoZen VPS specs:
- Small models (2B-3B): 4 vCPU / 6 GB RAM is sufficient
- Medium models (7B-8B): 8+ vCPU / 16 GB RAM recommended
- Large models (70B+): Not practical on a CPU-only VPS; you need a GPU server
GoZen’s Dedicated VPS plans with AMD EPYC processors provide the single-thread performance and RAM needed for 7B-8B models.
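Before settling on a model, it's worth confirming what the server actually has; two standard Linux commands cover it:
# Total and available RAM
free -h
# Number of vCPUs
nproc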
Installing Ollama
SSH into your VPS:
curl -fsSL https://ollama.com/install.sh | sh
Verify:
ollama --version
The install script registers Ollama as a systemd service, so it starts automatically:
sudo systemctl status ollama
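You can also confirm the API itself is up; Ollama's root endpoint answers with a plain status string:
curl http://localhost:11434
# Should print: Ollama is running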
Running Your First Model
# Download and run Llama 3.1 8B
ollama run llama3.1
# You'll see a chat prompt:
# >>> Send a message (/? for help)
Type a prompt and press Enter. The first run downloads the model (about 4-5 GB for an 8B model); subsequent runs skip the download and start almost immediately.
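The interactive prompt isn't the only way in. ollama run also accepts a one-shot prompt or piped input, which is handy for scripting:
# One-shot prompt: prints the reply and exits
ollama run llama3.1 "Explain DNS in one sentence."
# Piped input works too
echo "Why is the sky blue?" | ollama run llama3.1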
Other Popular Models
# Mistral 7B - fast, good quality
ollama run mistral
# Gemma 2B - very lightweight
ollama run gemma2:2b
# Code Llama - optimized for programming
ollama run codellama
# Phi-3 Mini - good for structured tasks
ollama run phi3
List downloaded models:
ollama list
Remove a model:
ollama rm llama3.1
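To fetch a model's weights without opening a chat session (useful in provisioning scripts), pull instead of run:
# Download only; no interactive session
ollama pull mistral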
Using the API
Ollama exposes a REST API on port 11434. You can call it from any application:
# Generate a completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain DNS in two sentences.",
"stream": false
}'
# Chat format
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "What is DKIM?"}
],
"stream": false
}'
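Both calls set "stream": false to get one JSON object back. Leave streaming on (the default) and the API returns newline-delimited JSON instead, one object per generated fragment, with "done": true on the last line:
# Stream the reply as it's generated (NDJSON, one object per line)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain DNS in two sentences."
}'
For the non-streaming generate call, piping the output through jq -r '.response' (if jq is installed) prints just the generated text.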
From Python
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1",
"prompt": "Write a SQL query to find duplicate emails",
"stream": False
})
print(response.json()["response"])
From JavaScript/Node.js
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
body: JSON.stringify({
model: "llama3.1",
prompt: "Explain caching in one paragraph",
stream: false,
}),
});
const data = await response.json();
console.log(data.response);
Adding a Web Interface
Open WebUI gives you a ChatGPT-like interface for Ollama:
docker run -d \
--name open-webui \
--network host \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
ghcr.io/open-webui/open-webui:main
Access it at http://your-server-ip:8080. Set up a reverse proxy to put it behind HTTPS.
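If you want to try the interface before the proxy is in place, an SSH tunnel avoids opening port 8080 to the internet at all (substitute your own SSH user):
# Forward local port 8080 to the server, then browse http://localhost:8080
ssh -N -L 8080:127.0.0.1:8080 user@your-server-ip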
Exposing Ollama Securely
By default, Ollama only listens on localhost. If you need to access it from other machines:
- Edit the service (systemctl edit creates an override file):
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
- Restart:
sudo systemctl restart ollama
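Then confirm it is listening on all interfaces rather than just localhost:
# Should now show 0.0.0.0:11434 (or *:11434) rather than 127.0.0.1:11434
ss -tlnp | grep 11434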
Don’t expose Ollama to the internet without authentication. The API has no built-in auth. Anyone who can reach port 11434 can use your models and consume your resources. Use a reverse proxy with authentication, or restrict access via firewall rules.
# Only allow access from specific IPs
sudo ufw allow from YOUR.TRUSTED.IP to any port 11434
Performance Tuning
Adjusting Thread Count
By default, Ollama uses all available CPU cores. On a shared VPS you may want to cap this; thread count is set through the num_thread model option, passed per request (or baked into a Modelfile with PARAMETER num_thread):
# Limit this request to 4 threads
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "Hello", "stream": false, "options": {"num_thread": 4}}'
Using Quantized Models
Quantized models use less RAM and run faster at a small quality cost. Ollama's default tags are usually already 4-bit quantized, but you can request a specific quantization explicitly:
# Pull an explicitly Q4-quantized build (smaller, faster)
ollama run llama3.1:8b-instruct-q4_0
Monitoring Resource Usage
# Watch CPU and RAM in real-time
htop
# Check Ollama-specific usage
ollama ps
Troubleshooting
| Problem | Fix |
|---|---|
| “Out of memory” during model load | Model is too large for your RAM. Try a smaller model or quantized version |
| Very slow inference | CPU-only is inherently slow for large models. Use a smaller model (2B-3B) for acceptable speed |
| API connection refused | Check sudo systemctl status ollama to verify the service is running; restart with sudo systemctl restart ollama |
| Port 11434 not reachable remotely | Ollama defaults to localhost only. Set OLLAMA_HOST=0.0.0.0 if you need remote access |
| Model download fails | Check free disk space with df -h; 7B models need 4-5 GB |
What to Do Next
- Docker Basics for VPS Users - run Open WebUI and other tools
- Set Up a Reverse Proxy - put the web interface behind HTTPS
- GoZen VPS Plans - pick the right specs for your model
Last updated 07 Apr 2026, 00:00 +0200.