Connect a Worker Machine
Follow these steps to connect a GPU machine to your ModelRelay deployment.
Prefer a desktop app?
Skip the CLI setup — download the ModelRelay desktop app for Windows, macOS, or Linux. It runs in your system tray and handles everything below automatically.
1 Choose your platform
Or continue with the CLI setup. Select the OS where you'll run inference:
macOS
Intel Macs — CPU-only inference works but is much slower.
Check: 16 GB+ unified memory recommended. Open About This Mac to confirm your chip and RAM.
Windows
AMD GPU — supported by some backends (Ollama, vLLM with ROCm). Check your backend's compatibility.
CPU-only — works but significantly slower.
Check: Open Task Manager → Performance → GPU to see your GPU and VRAM, or run nvidia-smi.
Linux
AMD GPU — use ROCm. Supported by vLLM and Ollama on recent cards.
CPU-only — fine for small models or testing.
Check: Run nvidia-smi (NVIDIA) or rocm-smi (AMD) to confirm your GPU is visible.
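For example, on Linux with an NVIDIA card you can confirm the GPU and its VRAM in one line (these are standard nvidia-smi query flags):
# Show GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader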
2 Set up your inference backend
ModelRelay connects to any OpenAI-compatible server running on your machine. Pick whichever you prefer:
LM Studio
Best for: beginners, desktop use, nice GUI for browsing models.
Install, launch, and head to the Developer tab to start the local server. Runs on http://localhost:1234 by default.
Ollama
Best for: CLI users, quick model management, easy multi-model setups.
curl -fsSL https://ollama.ai/install.sh | sh
On macOS, download from ollama.ai. On Windows, use the Windows installer. Serves on http://localhost:11434.
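To confirm the install worked and the server is reachable, check the CLI version and hit the local API (the /api/tags endpoint lists the models you've pulled):
# Confirm the CLI is installed
ollama --version
# List locally pulled models via the HTTP API
curl http://localhost:11434/api/tags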
llama.cpp
Best for: lightweight deployments, headless servers, GGUF models.
# Build from source (or download a release binary)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -t llama-server
Pre-built binaries available on the llama.cpp releases page. Serves on http://localhost:8080 by default (use --port 8000 to change).
vLLM
Best for: production throughput, continuous batching, HuggingFace models.
pip install vllm
Requires NVIDIA GPU with CUDA. Serves an OpenAI-compatible API on http://localhost:8000. See vLLM docs.
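As an optional sanity check before serving, you can confirm CUDA is visible to PyTorch, which vLLM pulls in as a dependency; this isn't part of the official install steps:
# Should print True if the GPU and driver are set up correctly
python -c "import torch; print(torch.cuda.is_available())"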
Already have a running backend? Skip to Download Worker →
3 Download and load a model
LM Studio
- Open the Discover tab and search for a model (e.g. llama-3.2-3b)
- Click Download and wait for it to complete
- Go to the Developer tab
- Select your model and click Start Server
- Confirm the server is running on http://localhost:1234
Ollama
Pull a model and start serving:
ollama pull llama3.2:3b
ollama serve
Browse models at ollama.ai/library. The server runs on http://localhost:11434.
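Recent Ollama versions also expose an OpenAI-compatible endpoint, so you can sanity-check the model locally before wiring it to the relay:
# Quick local test against Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello!"}]}'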
llama.cpp
Download a GGUF model and start the server:
# Download a GGUF model (example: Llama 3.2 3B)
curl -L -o model.gguf https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Start the server
./build/bin/llama-server -m model.gguf --port 8000 --host 0.0.0.0
Find GGUF models on HuggingFace. The Q4_K_M quantization is a good balance of quality and speed.
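You can hit llama-server directly before connecting the worker; it exposes an OpenAI-compatible chat completions endpoint, and since only one model is loaded the model field is effectively informational:
# Quick local test against llama-server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"model.gguf","messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'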
vLLM
Start vLLM with a HuggingFace model:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--port 8000 \
--host 0.0.0.0
vLLM downloads from HuggingFace automatically. You may need huggingface-cli login for gated models.
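For gated models such as the Llama family, log in with the HuggingFace CLI before starting vLLM (the token comes from your HuggingFace account settings):
# Interactive login (paste your token when prompted)
huggingface-cli login
# Or non-interactive, assuming HF_TOKEN holds a valid token
huggingface-cli login --token "$HF_TOKEN"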
Whichever backend you chose, verify it's serving: curl http://localhost:1234/v1/models (adjust the port to match your backend) should return a JSON list of available models.
4 Download the worker binary
The ModelRelay worker runs alongside your model server and connects it to the relay.
curl -L -o modelrelay-worker https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-linux-amd64 && chmod +x modelrelay-worker
Or download from GitHub Releases. All platforms (Linux, macOS, Windows) and architectures (x86_64, arm64) are available.
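For other platforms the same pattern should apply; the asset name below is an assumption, so confirm the exact file names on the releases page:
# macOS (Apple Silicon); asset name assumed, check the releases page
curl -L -o modelrelay-worker https://github.com/ericflo/modelrelay/releases/latest/download/modelrelay-worker-darwin-arm64 && chmod +x modelrelay-worker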
5 Configure the worker
Create a config.toml next to the worker binary:
proxy_url = ""
worker_secret = "your-worker-secret"
worker_name = "my-gpu-box"
backend_url = "http://localhost:1234"
models = ["*"]
worker_secret — must match the WORKER_SECRET set on your ModelRelay server. It authenticates the worker connection.
worker_name — a label for this machine (e.g. "strix-halo-lmstudio", "rtx4090-desktop").
models = ["*"] — advertises all models from your backend. Replace with specific names to expose a subset.
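For example, to expose only specific models instead of everything, the models line might look like this (the names are placeholders; use the IDs your backend reports from its /v1/models endpoint):
models = ["llama-3.2-3b", "qwen2.5-7b"]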
Prefer environment variables?
export PROXY_URL=""
export WORKER_SECRET="your-worker-secret"
export WORKER_NAME="my-gpu-box"
export BACKEND_URL="http://localhost:1234"
export MODELS="*"
CLI flags also work: --proxy-url, --worker-secret, --backend-url, --models.
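The same settings passed as CLI flags look like this (values are placeholders):
./modelrelay-worker \
  --proxy-url "http://your-proxy:8080" \
  --worker-secret "your-worker-secret" \
  --backend-url "http://localhost:1234" \
  --models "*"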
6 Start the worker
Run the worker from the directory with your config.toml:
./modelrelay-worker --config config.toml
The worker connects to your server over WebSocket. If you have the dashboard open with an admin token set, it detects the new worker automatically (the dashboard polls every 3 seconds).
7 Test inference
Send a request through the relay to verify the full pipeline works.
Test from the command line
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"your-model","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
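If you have jq installed, you can pull just the assistant's reply out of the response (the response follows the standard OpenAI chat completions shape):
# Extract only the generated text from the response
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}' \
  | jq -r '.choices[0].message.content'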
8 Make it persistent
Your worker is running — now set it up as a system service so it starts on boot and restarts on crash.
systemd — supports multiple workers per machine
1. Install binary and create service user
sudo install -m 755 modelrelay-worker /usr/local/bin/
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay
2. Install the service file and configure
# Download the template unit
curl -L -o /tmp/modelrelay-worker@.service \
https://raw.githubusercontent.com/ericflo/modelrelay/main/extras/modelrelay-worker%40.service
sudo cp /tmp/modelrelay-worker@.service /etc/systemd/system/
# Create per-instance env file
sudo tee /etc/modelrelay/worker-gpu0.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8000
MODELS=llama3.2:3b
EOF
3. Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now modelrelay-worker@gpu0
4. Verify
systemctl status modelrelay-worker@gpu0
journalctl -u modelrelay-worker@gpu0 -f
Add more workers: modelrelay-worker@gpu1, @gpu2, etc. Each gets its own env file.
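For example, a second instance pointing at a backend on another port might look like this (the port and model name are placeholders; adjust to your setup):
# Env file for a second worker instance
sudo tee /etc/modelrelay/worker-gpu1.env > /dev/null <<'EOF'
PROXY_URL=http://your-proxy:8080
WORKER_SECRET=your-secret
BACKEND_URL=http://127.0.0.1:8001
MODELS=qwen2.5:7b
EOF
sudo systemctl enable --now modelrelay-worker@gpu1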
🎉 Setup complete!
Your worker is connected, tested, and will start automatically on boot.