
Connect a Worker Machine

Follow these steps to connect a GPU machine to your ModelRelay deployment.


Prefer a desktop app?

Skip the CLI setup — download the ModelRelay desktop app for Windows, macOS, or Linux. It runs in your system tray and handles everything below automatically.

1 Choose your platform

Or continue with the CLI setup. Select the OS where you'll run inference:

macOS

Apple Silicon (M1/M2/M3/M4) — best experience. Models run on the unified GPU with no driver setup.
Intel Macs — CPU-only inference works but is much slower.

Check: 16 GB+ unified memory recommended. Open About This Mac to confirm your chip and RAM.

Windows

NVIDIA GPU — 8 GB+ VRAM recommended. Install the latest NVIDIA driver (CUDA is bundled).
AMD GPU — supported by some backends (Ollama, vLLM with ROCm). Check your backend's compatibility.
CPU-only — works but significantly slower.

Check: Open Task Manager → Performance → GPU to see your GPU and VRAM.

Linux

NVIDIA GPU — the standard choice. Install NVIDIA drivers + CUDA toolkit, then verify with nvidia-smi.
AMD GPU — use ROCm. Supported by vLLM and Ollama on recent cards.
CPU-only — fine for small models or testing.

Check: Run nvidia-smi (NVIDIA) or rocm-smi (AMD) to confirm your GPU is visible.

2 Set up your inference backend

ModelRelay connects to any OpenAI-compatible server running on your machine. Pick whichever you prefer:

LM Studio

Best for: beginners, desktop use, nice GUI for browsing models.

Download LM Studio →

Install, launch, and head to the Developer tab to start the local server. Runs on http://localhost:1234 by default.

Ollama

Best for: CLI users, quick model management, easy multi-model setups.

curl -fsSL https://ollama.ai/install.sh | sh

On macOS, download from ollama.ai. On Windows, use the Windows installer. Serves on http://localhost:11434.

llama.cpp

Best for: lightweight deployments, headless servers, GGUF models.

# Build from source (or download a release binary)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -t llama-server

Pre-built binaries are available on the llama.cpp releases page. The server listens on http://localhost:8080 by default (use --port to change it; the example in step 3 uses 8000).

vLLM

Best for: production throughput, continuous batching, HuggingFace models.

pip install vllm

Requires NVIDIA GPU with CUDA. Serves an OpenAI-compatible API on http://localhost:8000. See vLLM docs.

Already have a running backend? Skip to Download Worker →

3 Download and load a model

LM Studio

  1. Open the Discover tab and search for a model (e.g. llama-3.2-3b)
  2. Click Download and wait for it to complete
  3. Go to the Developer tab
  4. Select your model and click Start Server
  5. Confirm the server is running on http://localhost:1234

Ollama

Pull a model and start serving:

ollama pull llama3.2:3b
ollama serve

Browse models at ollama.ai/library. The server runs on http://localhost:11434.
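Ollama also exposes an OpenAI-compatible API under /v1, which is what the worker talks to. A quick sanity check, assuming the default port:

curl http://localhost:11434/v1/models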

llama.cpp

Download a GGUF model and start the server:

# Download a GGUF model (example: Llama 3.2 3B)
curl -L -o model.gguf https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Start the server
./build/bin/llama-server -m model.gguf --port 8000 --host 0.0.0.0

Find GGUF models on HuggingFace. The Q4_K_M quantization is a good balance of quality and speed.

vLLM

Start vLLM with a HuggingFace model:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --host 0.0.0.0

vLLM downloads from HuggingFace automatically. You may need huggingface-cli login for gated models.

Verify it's running: curl http://localhost:1234/v1/models (adjust the port to match your backend) should return a JSON list of available models.
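The exact fields vary by backend, but an OpenAI-compatible /v1/models response has roughly this shape (the model id here is illustrative):

{
  "object": "list",
  "data": [
    { "id": "llama-3.2-3b", "object": "model" }
  ]
}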

4 Download the worker binary

The ModelRelay worker runs alongside your model server and connects it to the relay.

curl -L -o modelrelay-worker \
  https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-linux-amd64
chmod +x modelrelay-worker

Or download from GitHub Releases. All platforms (Linux, macOS, Windows) and architectures (x86_64, arm64) are available.
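On macOS or Windows, grab the matching asset from the releases page instead. Assuming the assets follow the same naming convention as the Linux one above, an Apple Silicon Mac would use something like:

# asset name inferred from the Linux naming pattern; confirm on the releases page
curl -L -o modelrelay-worker \
  https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-darwin-arm64
chmod +x modelrelay-worker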

5 Configure the worker

Create a config.toml next to the worker binary:

proxy_url = ""                          # your ModelRelay server URL
worker_secret = "your-worker-secret"
worker_name = "my-gpu-box"
backend_url = "http://localhost:1234"
models = ["*"]
worker_secret — shared secret that must match the WORKER_SECRET on your ModelRelay server. It authenticates the worker connection.
worker_name — a label for this machine (e.g. "strix-halo-lmstudio", "rtx4090-desktop").
models = ["*"] — advertises all models from your backend. Replace with specific names to expose a subset.
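For example, to advertise only one model (the name is illustrative; use the ids your backend returns from /v1/models):

models = ["llama3.2:3b"]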
Prefer environment variables?
export PROXY_URL=""
export WORKER_SECRET="your-worker-secret"
export WORKER_NAME="my-gpu-box"
export BACKEND_URL="http://localhost:1234"
export MODELS="*"

CLI flags also work: --proxy-url, --worker-secret, --backend-url, --models.
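A flags-only invocation might look like this (all values are placeholders):

./modelrelay-worker \
  --proxy-url "http://your-proxy:8080" \
  --worker-secret "your-worker-secret" \
  --backend-url "http://localhost:1234" \
  --models "*"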

6 Start the worker

Run the worker from the directory with your config.toml:

./modelrelay-worker --config config.toml

The worker connects to your server over WebSocket, and the dashboard detects it automatically:

Waiting for worker to connect...

Live detection requires the admin token to be set on the dashboard; the page polls every 3 seconds.

7 Test inference

Send a request through the relay to verify the full pipeline works.

Test from the command line
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
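If the full pipeline works, you get back a standard OpenAI-style chat completion. Trimmed to the essentials, the response looks roughly like:

{
  "object": "chat.completion",
  "model": "your-model",
  "choices": [
    { "index": 0, "message": { "role": "assistant", "content": "Hello! How can I help you today?" } }
  ]
}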

8 Make it persistent

Your worker is running — now set it up as a system service so it starts on boot and restarts on crash.

Linux (systemd) — supports multiple workers per machine

1. Install binary and create service user

sudo install -m 755 modelrelay-worker /usr/local/bin/
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

2. Install the service file and configure

# Download the template unit
curl -L -o /tmp/modelrelay-worker@.service \
  https://raw.githubusercontent.com/ericflo/modelrelay/main/extras/modelrelay-worker%40.service
sudo cp /tmp/modelrelay-worker@.service /etc/systemd/system/

# Create per-instance env file
sudo tee /etc/modelrelay/worker-gpu0.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8000
MODELS=llama3.2:3b
EOF

3. Enable and start

sudo systemctl daemon-reload
sudo systemctl enable --now modelrelay-worker@gpu0

4. Verify

systemctl status modelrelay-worker@gpu0
journalctl -u modelrelay-worker@gpu0 -f

Add more workers: modelrelay-worker@gpu1, @gpu2, etc. Each gets its own env file.
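For example, a second worker instance pointing at a different backend gets its own env file and instance name (the URLs and model are placeholders, as above):

sudo tee /etc/modelrelay/worker-gpu1.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8001
MODELS=llama3.2:3b
EOF
sudo systemctl enable --now modelrelay-worker@gpu1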

🎉 Setup complete!

Your worker is connected, tested, and will start automatically on boot.

Go to Dashboard →
Add another machine →