Running Local AI (Ollama) on a VPS
Run open-source AI models on your GoZen VPS with Ollama. Private, no API costs, full control.
Ollama lets you run large language models (LLMs) like Llama, Mistral, and Gemma on your own hardware. No API keys, no per-token costs, no data leaving your server. You need a VPS with enough RAM and CPU to run the model you want.
Hardware Requirements
LLMs are memory-hungry. The model needs to fit entirely in RAM (or VRAM if using a GPU). Here’s what you need for CPU-only inference:
| Model | Parameters | RAM Required | Speed (CPU) | Use Case |
|---|---|---|---|---|
| Gemma 2B | 2B | 4 GB | ~20 tokens/sec | Lightweight tasks, classification |
| Phi-3 Mini | 3.8B | 6 GB | ~12 tokens/sec | Code assistance, summarization |
| Llama 3.1 8B | 8B | 8 GB | ~8 tokens/sec | General chat, writing, analysis |
| Mistral 7B | 7B | 8 GB | ~8 tokens/sec | General purpose, good quality/speed |
| Llama 3.1 70B | 70B | 48 GB | ~1 token/sec | High quality, very slow on CPU |
Recommended GoZen VPS specs:
- Small models (2B-3B): 4 vCPU / 6 GB RAM is sufficient
- Medium models (7B-8B): 8+ vCPU / 16 GB RAM recommended
- Large models (70B+): Not practical on a CPU-only VPS; you need a GPU server
GoZen’s Dedicated VPS plans with AMD EPYC processors provide the single-thread performance and RAM needed for 7B-8B models.
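Before settling on a model, it's worth confirming what the server actually has; two standard Linux commands cover it:
# Total and available RAM
free -h
# Number of vCPUs
nproc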
Installing Ollama
SSH into your VPS:
curl -fsSL https://ollama.com/install.sh | sh
Verify:
ollama --version
The install script registers Ollama as a systemd service, so it starts automatically:
sudo systemctl status ollama
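You can also confirm the API itself is up; Ollama's root endpoint answers with a plain status string:
curl http://localhost:11434
# Should print: Ollama is running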
Running Your First Model
# Download and run Llama 3.1 8B
ollama run llama3.1
# You'll see a chat prompt:
# >>> Send a message (/? for help)
Type a prompt and press Enter. The first run downloads the model (about 4-5 GB for an 8B model); subsequent runs skip the download and start almost immediately.
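The interactive prompt isn't the only way in. ollama run also accepts a one-shot prompt or piped input, which is handy for scripting:
# One-shot prompt: prints the reply and exits
ollama run llama3.1 "Explain DNS in one sentence."
# Piped input works too
echo "Why is the sky blue?" | ollama run llama3.1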
Other Popular Models
# Mistral 7B - fast, good quality
ollama run mistral
# Gemma 2B - very lightweight
ollama run gemma2:2b
# Code Llama - optimized for programming
ollama run codellama
# Phi-3 Mini - good for structured tasks
ollama run phi3
List downloaded models:
ollama list
Remove a model:
ollama rm llama3.1
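To fetch a model's weights without opening a chat session (useful in provisioning scripts), pull instead of run:
# Download only; no interactive session
ollama pull mistral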
Using the API
Ollama exposes a REST API on port 11434. You can call it from any application:
# Generate a completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain DNS in two sentences.",
"stream": false
}'
# Chat format
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "What is DKIM?"}
],
"stream": false
}'
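Both calls set "stream": false to get one JSON object back. Leave streaming on (the default) and the API returns newline-delimited JSON instead, one object per generated fragment, with "done": true on the last line:
# Stream the reply as it's generated (NDJSON, one object per line)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain DNS in two sentences."
}'
For the non-streaming generate call, piping the output through jq -r '.response' (if jq is installed) prints just the generated text.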
From Python
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1",
"prompt": "Write a SQL query to find duplicate emails",
"stream": False
})
print(response.json()["response"])
From JavaScript/Node.js
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
body: JSON.stringify({
model: "llama3.1",
prompt: "Explain caching in one paragraph",
stream: false,
}),
});
const data = await response.json();
console.log(data.response);
Adding a Web Interface
Open WebUI gives you a ChatGPT-like interface for Ollama:
docker run -d \
--name open-webui \
--network host \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
ghcr.io/open-webui/open-webui:main
Access it at http://your-server-ip:8080. Set up a reverse proxy to put it behind HTTPS.
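If you want to try the interface before the proxy is in place, an SSH tunnel avoids opening port 8080 to the internet at all (substitute your own SSH user):
# Forward local port 8080 to the server, then browse http://localhost:8080
ssh -N -L 8080:127.0.0.1:8080 user@your-server-ip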
Exposing Ollama Securely
By default, Ollama only listens on localhost. If you need to access it from other machines:
- Edit the service (systemctl edit creates an override file):
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
- Restart:
sudo systemctl restart ollama
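Then confirm it is listening on all interfaces rather than just localhost:
# Should now show 0.0.0.0:11434 (or *:11434) rather than 127.0.0.1:11434
ss -tlnp | grep 11434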
Don’t expose Ollama to the internet without authentication. The API has no built-in auth. Anyone who can reach port 11434 can use your models and consume your resources. Use a reverse proxy with authentication, or restrict access via firewall rules.
# Only allow access from specific IPs
sudo ufw allow from YOUR.TRUSTED.IP to any port 11434
Performance Tuning
Adjusting Thread Count
By default, Ollama uses all available CPU cores. On a shared VPS you may want to cap this; thread count is set through the num_thread model option, passed per request (or baked into a Modelfile with PARAMETER num_thread):
# Limit this request to 4 threads
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "Hello", "stream": false, "options": {"num_thread": 4}}'
Using Quantized Models
Quantized models use less RAM and run faster at a small quality cost. Ollama's default tags are usually already 4-bit quantized, but you can request a specific quantization explicitly:
# Pull an explicitly Q4-quantized build (smaller, faster)
ollama run llama3.1:8b-instruct-q4_0
Monitoring Resource Usage
# Watch CPU and RAM in real-time
htop
# Check Ollama-specific usage
ollama ps
Troubleshooting
| Problem | Fix |
|---|---|
| “Out of memory” during model load | Model is too large for your RAM. Try a smaller model or quantized version |
| Very slow inference | CPU-only is inherently slow for large models. Use a smaller model (2B-3B) for acceptable speed |
| API connection refused | Check sudo systemctl status ollama to verify the service is running; restart with sudo systemctl restart ollama |
| Port 11434 not reachable remotely | Ollama defaults to localhost only. Set OLLAMA_HOST=0.0.0.0 if you need remote access |
| Model download fails | Check free disk space with df -h; 7B models need 4-5 GB |
What to Do Next
- Docker Basics for VPS Users - run Open WebUI and other tools
- Set Up a Reverse Proxy - put the web interface behind HTTPS
- GoZen VPS Plans - pick the right specs for your model
Last updated 07 Apr 2026, 00:00 +0200.