
Connect a Worker Machine

Follow these steps to connect a GPU machine to your ModelRelay deployment.


Prefer a desktop app?

Skip the CLI setup — download the ModelRelay desktop app for Windows, macOS, or Linux. It runs in your system tray and handles everything below automatically.

1 Choose your platform

Or continue with the CLI setup. Select the OS where you'll run inference:

macOS

Apple Silicon (M1/M2/M3/M4) — best experience. Models run on the unified GPU with no driver setup.
Intel Macs — CPU-only inference works but is much slower.

Check: 16 GB+ unified memory recommended. Open About This Mac to confirm your chip and RAM.

Windows

NVIDIA GPU — 8 GB+ VRAM recommended. Install the latest NVIDIA driver (CUDA is bundled).
AMD GPU — supported by some backends (Ollama, vLLM with ROCm). Check your backend's compatibility.
CPU-only — works but significantly slower.

Check: Open Task Manager → Performance → GPU to see your GPU and VRAM.

Linux

NVIDIA GPU — the standard choice. Install NVIDIA drivers + CUDA toolkit, then verify with nvidia-smi.
AMD GPU — use ROCm. Supported by vLLM and Ollama on recent cards.
CPU-only — fine for small models or testing.

Check: Run nvidia-smi (NVIDIA) or rocm-smi (AMD) to confirm your GPU is visible.

2 Set up your inference backend

ModelRelay connects to any OpenAI-compatible server running on your machine. Pick whichever you prefer:

LM Studio

Best for: beginners, desktop use, nice GUI for browsing models.

Download LM Studio →

Install, launch, and head to the Developer tab to start the local server. Runs on http://localhost:1234 by default.

Ollama

Best for: CLI users, quick model management, easy multi-model setups.

curl -fsSL https://ollama.ai/install.sh | sh

On macOS, download from ollama.ai. On Windows, use the Windows installer. Serves on http://localhost:11434.

llama.cpp

Best for: lightweight deployments, headless servers, GGUF models.

# Build from source (or download a release binary)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -t llama-server

Pre-built binaries are available on the llama.cpp releases page. The server listens on http://localhost:8080 by default (use --port to change it; the example in step 3 uses 8000).

vLLM

Best for: production throughput, continuous batching, HuggingFace models.

pip install vllm

Requires NVIDIA GPU with CUDA. Serves an OpenAI-compatible API on http://localhost:8000. See vLLM docs.

Already have a running backend? Skip to Download Worker →

3 Download and load a model

LM Studio

  1. Open the Discover tab and search for a model (e.g. llama-3.2-3b)
  2. Click Download and wait for it to complete
  3. Go to the Developer tab
  4. Select your model and click Start Server
  5. Confirm the server is running on http://localhost:1234

Ollama

Pull a model and start serving:

ollama pull llama3.2:3b
ollama serve

Browse models at ollama.ai/library. The server runs on http://localhost:11434.
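Ollama also exposes an OpenAI-compatible API under /v1, which is what the worker talks to. A quick sanity check, assuming the default port:

curl http://localhost:11434/v1/models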

llama.cpp

Download a GGUF model and start the server:

# Download a GGUF model (example: Llama 3.2 3B)
curl -L -o model.gguf https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Start the server
./build/bin/llama-server -m model.gguf --port 8000 --host 0.0.0.0

Find GGUF models on HuggingFace. The Q4_K_M quantization is a good balance of quality and speed.

vLLM

Start vLLM with a HuggingFace model:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --host 0.0.0.0

vLLM downloads from HuggingFace automatically. You may need huggingface-cli login for gated models.

Verify it's running: curl http://localhost:1234/v1/models (adjust the port to match your backend) should return a JSON list of available models.
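The exact fields vary by backend, but an OpenAI-compatible /v1/models response has roughly this shape (the model id here is illustrative):

{
  "object": "list",
  "data": [
    { "id": "llama-3.2-3b", "object": "model" }
  ]
}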

4 Download the worker binary

The ModelRelay worker runs alongside your model server and connects it to the relay.

curl -L -o modelrelay-worker \
  https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-linux-amd64
chmod +x modelrelay-worker

Or download from GitHub Releases. All platforms (Linux, macOS, Windows) and architectures (x86_64, arm64) are available.
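On macOS or Windows, grab the matching asset from the releases page instead. Assuming the assets follow the same naming convention as the Linux one above, an Apple Silicon Mac would use something like:

# asset name inferred from the Linux naming pattern; confirm on the releases page
curl -L -o modelrelay-worker \
  https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-darwin-arm64
chmod +x modelrelay-worker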

5 Configure the worker

Create a config.toml next to the worker binary:

proxy_url = ""                          # your ModelRelay server URL
worker_secret = "your-worker-secret"
worker_name = "my-gpu-box"
backend_url = "http://localhost:1234"
models = ["*"]
worker_secret — shared secret that must match the WORKER_SECRET on your ModelRelay server. It authenticates the worker connection.
worker_name — a label for this machine (e.g. "strix-halo-lmstudio", "rtx4090-desktop").
models = ["*"] — advertises all models from your backend. Replace with specific names to expose a subset.
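For example, to advertise only one model (the name is illustrative; use the ids your backend returns from /v1/models):

models = ["llama3.2:3b"]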
Prefer environment variables?
export PROXY_URL=""
export WORKER_SECRET="your-worker-secret"
export WORKER_NAME="my-gpu-box"
export BACKEND_URL="http://localhost:1234"
export MODELS="*"

CLI flags also work: --proxy-url, --worker-secret, --backend-url, --models.
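A flags-only invocation might look like this (all values are placeholders):

./modelrelay-worker \
  --proxy-url "http://your-proxy:8080" \
  --worker-secret "your-worker-secret" \
  --backend-url "http://localhost:1234" \
  --models "*"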

6 Start the worker

Run the worker from the directory with your config.toml:

./modelrelay-worker --config config.toml

The worker connects to your server over WebSocket, and the dashboard detects it automatically:

Waiting for worker to connect...

Live detection requires the admin token to be set on the dashboard; the page polls every 3 seconds.

7 Test inference

Send a request through the relay to verify the full pipeline works.

Test from the command line
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
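If the full pipeline works, you get back a standard OpenAI-style chat completion. Trimmed to the essentials, the response looks roughly like:

{
  "object": "chat.completion",
  "model": "your-model",
  "choices": [
    { "index": 0, "message": { "role": "assistant", "content": "Hello! How can I help you today?" } }
  ]
}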

8 Make it persistent

Your worker is running — now set it up as a system service so it starts on boot and restarts on crash.

Linux (systemd) — supports multiple workers per machine

1. Install binary and create service user

sudo install -m 755 modelrelay-worker /usr/local/bin/
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

2. Install the service file and configure

# Download the template unit
curl -L -o /tmp/modelrelay-worker@.service \
  https://raw.githubusercontent.com/ericflo/modelrelay/main/extras/modelrelay-worker%40.service
sudo cp /tmp/modelrelay-worker@.service /etc/systemd/system/

# Create per-instance env file
sudo tee /etc/modelrelay/worker-gpu0.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8000
MODELS=llama3.2:3b
EOF

3. Enable and start

sudo systemctl daemon-reload
sudo systemctl enable --now modelrelay-worker@gpu0

4. Verify

systemctl status modelrelay-worker@gpu0
journalctl -u modelrelay-worker@gpu0 -f

Add more workers: modelrelay-worker@gpu1, @gpu2, etc. Each gets its own env file.
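For example, a second worker instance pointing at a different backend gets its own env file and instance name (the URLs and model are placeholders, as above):

sudo tee /etc/modelrelay/worker-gpu1.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8001
MODELS=llama3.2:3b
EOF
sudo systemctl enable --now modelrelay-worker@gpu1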

🎉 Setup complete!

Your worker is connected, tested, and will start automatically on boot.

Go to Dashboard →
Add another machine →