Devstral or Nemotron on RunPod: the complete vLLM + Hermes setup guide

A step-by-step guide to self-hosting Devstral Small 2 or Nemotron 3 Nano with vLLM and Hermes.

Devstral Small 2 / Nemotron 3 Nano · vLLM · RTX PRO 6000 · Hermes Agent · WSL2 / Linux · April 2026

A 3-part series on self-hosting a coding model on RunPod.
Part 1: Running a coding model on RunPod: 8 things that will break
Part 2: Devstral or Nemotron on RunPod: the complete vLLM + Hermes setup guide — step-by-step setup (you are reading this)
Part 3: You’ve set up your coding model on RunPod. What’s next? — sampling, evaluation, ongoing operation

Devstral Small 2 is Mistral’s coding agent model, built for exploring codebases, editing multiple files and powering agentic software engineering workflows. Nemotron 3 Nano is NVIDIA’s hybrid Mamba-Transformer MoE reasoning model, built for agentic tool use and long-context workflows. Hermes is NousResearch’s local agent framework. It runs on your machine and calls an OpenAI-compatible endpoint for model inference. This guide connects either model to Hermes on your local machine (WSL2 or Linux) through a RunPod GPU pod running vLLM.

It assumes you have a RunPod account, a local Linux environment (native Linux or WSL2 on Windows) and Hermes installed locally. It does not assume you have done this before.

Why Devstral Small 2 as the primary example

Devstral Small 2 is 24B parameters shipped in native FP8, roughly 25 GB in VRAM. On a 96 GB card that leaves plenty of room for a 128K-token KV cache (Small 2’s native ceiling is 256K, but reaching it on a single 96 GB card means tighter KV cache settings and smaller batches). It scores 68% on SWE-bench Verified, which measures real GitHub bug fixes rather than toy benchmarks. Mistral built it for the kind of multi-file, tool-heavy agentic workflow Hermes runs.

It is Apache 2.0 licensed and fits a single workstation-class cloud GPU at its native precision, so there is no runtime quantization step and the setup is straightforward.

When self-hosting is the wrong choice

Self-hosting makes sense when you run long focused sessions (several hours at a time), want your code to stay on infrastructure you control, or want to use an open-weight model no hosted API offers. It stops making sense if:

Your sessions are short and bursty. Hosted APIs bill per request and do not charge between calls.
You need multi-region availability or high concurrency. A single pod serves one GPU to one user at a time.
You are sensitive to setup and maintenance overhead. The eight blind spots and the session loop are the real cost, not the GPU rental.
Inference latency under 1 second matters. Proxied RunPod endpoints add network hops a co-located hosted API avoids. Measured TTFT through the public RunPod proxy on this setup is around 1.4 seconds for typical coding prompts; most of that is network round-trip, not GPU work. See the latency section below for the full breakdown.

If any of those describe your use case, use Mistral’s API, Anthropic’s API, or a managed inference provider instead. The rest of this guide assumes you have chosen self-hosting for one of the reasons above.

What you need before starting

RunPod account with a payment method added
A HuggingFace account with a fine-grained token scoped to mistralai/Devstral-Small-2-24B-Instruct-2512
Linux or WSL2 with Hermes installed and working against any existing endpoint
A terminal with SSH access

Create your HuggingFace fine-grained token before creating the pod so the download does not fail with an auth error.

Step by step

1. Generate your two secrets (do this locally first)

You need two credentials before touching RunPod. Generate them locally so they never pass through a third party.

vLLM API key:

# Generate a strong random key
openssl rand -hex 32

# Example output (64 hex characters):
# a3f8c2d1e4b7091f6a2c5d8e3b4f1a7c9e2d5b8f3c6a1e4d7b0c3f9e2a5d8b1c

# Save it. You will use it on both sides of the connection.

HuggingFace fine-grained token:

Go to huggingface.co, then Settings, then Access Tokens, then Create new token, then Fine-grained. Add repository permission for mistralai/Devstral-Small-2-24B-Instruct-2512 set to Read. Name it something like runpod-devstral-readonly. Copy it. It is shown only once.

Why fine-grained. A broad Read token gives access to every repo on your account. A fine-grained token scoped to one repo cannot access anything else. If the pod is compromised, a scoped token limits the exposure to one model repo.
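
A dropped character during copy-paste is the usual way the key goes wrong between your machine and the pod. The sketch below checks that a value has the shape openssl rand -hex 32 produces: exactly 64 lowercase hex characters. The function name is_hex64 is made up for this example and is not part of any tool.

```shell
# Hypothetical helper: succeeds only if $1 is exactly 64 lowercase hex
# characters, which is what `openssl rand -hex 32` prints.
is_hex64() (
  key=$1
  [ "${#key}" -eq 64 ] || exit 1
  case $key in
    *[!0-9a-f]*) exit 1 ;;   # any character outside 0-9a-f disqualifies it
  esac
  exit 0
)

# A truncated paste fails the check.
if ! is_hex64 "a3f8c2d1"; then
  echo "key is not 64 hex characters; copy it again"
fi
```

With a real key you would run something like key=$(openssl rand -hex 32) followed by is_hex64 "$key" before saving it.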

2. Create the network volume

Storage, then Network Volumes, then New Volume. Do this before the pod so you can attach it at creation time.

Setting	Value
Size	100 GB (25 GB model + 75 GB cache headroom)
Region	Choose one. Must match your pod’s datacenter
Mount path	`/workspace`

Region lock. A network volume is fixed to one datacenter. Before creating the volume, confirm that the RTX PRO 6000 shows High availability in that region. You cannot move a volume after creation.

3. Create the pod

Setting	Value
GPU	RTX PRO 6000 · 96 GB VRAM
Template	RunPod PyTorch
Network Volume	Attach the volume from step 2
Mount path	`/workspace`
Pricing	On-Demand (not spot)
HTTP port	8000
SSH	Enabled
Jupyter	Off
Global Net.	Off unless you need it

On-demand is intentional. Spot instances can be interrupted mid-session, mid-compilation, mid-context. For a coding assistant you use for hours at a time, that interruption is not acceptable.

4. First-time setup (SSH in and run once)

Get the SSH command from the pod’s Connect tab. Export both secrets at the start of the session.

export HF_TOKEN=hf_your_fine_grained_token
export VLLM_API_KEY=your_generated_key

Save the launch script:

cat > /workspace/start_vllm.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

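# Prefer the key exported in your shell. The fallback below is only used when
# none is exported, so replace the placeholder if you rely on it.
export VLLM_API_KEY="${VLLM_API_KEY:-your_fixed_key_here}"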

export HF_HOME=/workspace/hf
export HUGGINGFACE_HUB_CACHE=/workspace/hf/hub
export XDG_CACHE_HOME=/workspace/.cache
export TMPDIR=/workspace/tmp
export TORCHINDUCTOR_CACHE_DIR=/workspace/torchinductor
export TRITON_CACHE_DIR=/workspace/triton

mkdir -p /workspace/hf/hub /workspace/.cache /workspace/tmp
mkdir -p /workspace/torchinductor /workspace/triton

# Reinstall if missing. Root disk wipes on every stop/start, not just GPU migration.
if ! python -c "import vllm" 2>/dev/null; then
  echo "vLLM not found, installing..."
  pip install -q hf_transfer
  pip install -q -U "vllm>=0.15.0" "transformers>=4.51.0"
fi

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "${VLLM_API_KEY}" \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.88 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
EOF

chmod +x /workspace/start_vllm.sh

The if ! python -c "import vllm" check means: if vLLM is present from a warm session, skip the install and start immediately. If it is missing (fresh pod after a restart wipes the root disk), install quietly then continue. No manual intervention on cold starts, no unnecessary pip installs on warm ones.

Run it:

bash /workspace/start_vllm.sh

First boot downloads ~25 GB from HuggingFace, then compiles graphs and runs vLLM’s profiling pass. Measured on RTX PRO 6000 Blackwell with vLLM 0.19, expect roughly 5-6 minutes for both cold and warm restarts. The folklore that “warm restart is 2-3 minutes” is wrong for current vLLM serving 24-30B models. The dominant cost is a profile-and-warmup pass (~4 minutes) that cannot be cached and runs every time. See Part 1, blind spot 5 for the full phase breakdown.

You are ready when you see:

INFO:     Application startup complete.
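
If you would rather not watch the log, you can poll the endpoint until it answers. This is a minimal sketch, not part of vLLM: the function name wait_for_vllm and its default of 60 attempts spaced 10 seconds apart are assumptions chosen to cover the startup times quoted above. It uses the same /v1/models request as the first test in this guide.

```shell
# Hypothetical helper: poll until the endpoint answers or attempts run out.
# Usage: wait_for_vllm BASE_URL API_KEY [ATTEMPTS] [SLEEP_SECONDS]
wait_for_vllm() (
  base_url=$1
  api_key=$2
  attempts=${3:-60}
  pause=${4:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (for example 401),
    # so only a successful response counts as ready.
    if curl -sf -o /dev/null \
         -H "Authorization: Bearer $api_key" "$base_url/models"; then
      echo "ready after $i attempt(s)"
      exit 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "not ready after $attempts attempts" >&2
  exit 1
)
```

From a second SSH session on the pod, the call would look like wait_for_vllm http://127.0.0.1:8000/v1 "$VLLM_API_KEY".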

One gotcha to know before you feed it a large session: the launch script caps --max-model-len at 131072 tokens (128K). If a request exceeds that, vLLM returns HTTP 400 with an “Input exceeds max length” error. Configure Hermes to drop the oldest messages as the session approaches the limit, or raise --max-model-len. If vLLM then reports that the KV cache is too small for that length, raise --gpu-memory-utilization: it sets the fraction of VRAM vLLM may use, so a higher value leaves more room for KV cache. Small 2’s model-side ceiling is 256K, but reaching it on a single 96 GB card means tighter KV cache settings than this guide’s defaults.
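
To see how much room the defaults leave, here is the back-of-envelope arithmetic as a sketch. vLLM may use the fraction of VRAM given by --gpu-memory-utilization, and what remains after the weights is roughly what the KV cache gets. The function name kv_budget_gb is hypothetical, and the estimate deliberately ignores activation memory, CUDA graphs and fragmentation, so treat the result as an upper bound.

```shell
# Hypothetical helper. Usage: kv_budget_gb VRAM_GB GPU_MEMORY_UTILIZATION WEIGHTS_GB
# Rough upper bound only: ignores activations, CUDA graphs and fragmentation.
kv_budget_gb() {
  awk -v vram="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", vram * util - weights }'
}

kv_budget_gb 96 0.88 25   # this guide's defaults for Devstral Small 2
kv_budget_gb 96 0.92 25   # a higher utilization setting, for comparison
```

The 25 GB weight figure is the one quoted earlier for Devstral Small 2; substitute 32.7 for the Nemotron FP8 checkpoint.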

Alternative: Nemotron 3 Nano instead of Devstral

NVIDIA’s Nemotron 3 Nano 30B-A3B is an alternative to Devstral in this setup. It is a hybrid Mamba-2 + Transformer MoE model with 31.6B total parameters and 3.2B active per forward pass (3.6B including embeddings). The FP8 checkpoint is 32.7 GB, which loads on a 96 GB card with over 60 GB to spare for KV cache. It supports context lengths up to 1M tokens and scores 68.25% on LiveCodeBench v6. NVIDIA trained it for agentic reasoning and multi-step tool use.

The setup differs from Devstral in several ways. Nemotron uses a custom architecture that requires --trust-remote-code. The FP8 variant needs --kv-cache-dtype fp8 and the VLLM_USE_FLASHINFER_MOE_FP8=1 environment variable for MoE kernel performance. Tool calling uses the qwen3_coder parser (the same parser Qwen models use). The model also has a built-in reasoning mode that uses vLLM’s nano_v3 reasoning parser, loaded from a plugin file you download separately from HuggingFace.

The license is also different: NVIDIA Open Model License, not Apache 2.0. It permits commercial use but has additional terms. Review it before deploying.

Step 1. Clear the Devstral cache

Check what is on the volume, then remove the Devstral weights to free space for the Nemotron download:

du -sh /workspace/hf/hub/*
rm -rf /workspace/hf/hub/models--mistralai--Devstral-Small-2-24B-Instruct-2512
df -h /workspace

Step 2. Download the custom reasoning parser

This is the step the HuggingFace model card buries. Without it, vLLM errors on --reasoning-parser nano_v3. Download it to /workspace so it persists:

wget -P /workspace \
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py

ls /workspace/nano_v3_reasoning_parser.py

Step 3. Update the HuggingFace token scope

The model is not gated (NVIDIA publishes it openly) but you still need an HF token to avoid rate limiting on a ~32 GB download. If your existing token was scoped only to mistralai/Devstral-Small-2-24B-Instruct-2512, update it now.

Go to huggingface.co, then Settings, then Access Tokens, then edit your token, then change the repository scope to nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8, then save.

Export it on the pod:

export HF_TOKEN=hf_your_token

Step 4. Rewrite the launch script

Replace /workspace/start_vllm.sh in place:

cat > /workspace/start_vllm.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Prefer the key exported in your shell; the placeholder is only a fallback.
export VLLM_API_KEY="${VLLM_API_KEY:-your_fixed_key_here}"

# Required for FP8 MoE performance on this model
export VLLM_USE_FLASHINFER_MOE_FP8=1

export HF_HOME=/workspace/hf
export HUGGINGFACE_HUB_CACHE=/workspace/hf/hub
export XDG_CACHE_HOME=/workspace/.cache
export TMPDIR=/workspace/tmp
export TORCHINDUCTOR_CACHE_DIR=/workspace/torchinductor
export TRITON_CACHE_DIR=/workspace/triton

mkdir -p /workspace/hf/hub /workspace/.cache /workspace/tmp
mkdir -p /workspace/torchinductor /workspace/triton

# Reinstall if missing. Root disk wipes on every stop/start, not just GPU migration.
if ! python -c "import vllm" 2>/dev/null; then
  echo "vLLM not found, installing..."
  pip install -q hf_transfer
  pip install -q -U "vllm>=0.15.0" "transformers>=4.51.0"
fi

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "${VLLM_API_KEY}" \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.88 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-num-seqs 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin /workspace/nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
EOF

chmod +x /workspace/start_vllm.sh

Step 5. Launch

bash /workspace/start_vllm.sh

First boot downloads ~32 GB from HuggingFace, then compiles. Expect 10-20 minutes. You are ready when you see INFO: Application startup complete.

The model name to use in Hermes for Nemotron is nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. Everything else stays the same: the test sequence, the Hermes configuration, the session loop and the security setup.

5. Test the endpoint

Test 1, inside the pod:

curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $VLLM_API_KEY"

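Expected: JSON showing mistralai/Devstral-Small-2-24B-Instruct-2512 (or nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 if you switched models). If the connection is refused, vLLM is not up. A 401 means the key you sent does not match the one vLLM started with.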
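
If you want the check to be scriptable rather than eyeballed, a small filter can confirm that the response lists the model you expect. This is a sketch: serves_model is a hypothetical name, and the sample JSON is hand-written to match the OpenAI-style list shape, not captured from a real pod.

```shell
# Hypothetical helper. Usage: <models JSON on stdin> | serves_model MODEL_ID
# Strips whitespace first so compact and pretty-printed JSON both match.
serves_model() {
  tr -d ' \n\t' | grep -Fq "\"id\":\"$1\""
}

# Hand-written sample in the OpenAI list format, for illustration only.
sample='{"object":"list","data":[{"id":"mistralai/Devstral-Small-2-24B-Instruct-2512","object":"model"}]}'

if printf '%s' "$sample" | serves_model "mistralai/Devstral-Small-2-24B-Instruct-2512"; then
  echo "model is being served"
fi
```

Against the live endpoint, pipe the curl command from Test 1 (with -s added) into serves_model instead of the sample.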

Test 2, from your local machine:

Get the proxy URL from the pod’s Connect tab, then HTTP Service, then port 8000.

export OPENAI_BASE_URL="https://YOUR_POD_ID-8000.proxy.runpod.net/v1"
export OPENAI_API_KEY="your_fixed_key"

curl "$OPENAI_BASE_URL/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY"

Test 3, tool calling from your local machine:

curl "$OPENAI_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "mistralai/Devstral-Small-2-24B-Instruct-2512",
    "messages": [{"role": "user", "content": "Write a Python function that merges overlapping intervals."}],
    "temperature": 0.15,
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}}}
      }
    }],
    "tool_choice": "auto"
  }'

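Expected: HTTP 200 with a chat completion in the body. If you get 400 here, the tool call parser is wrong or --enable-auto-tool-choice is missing from the launch script. If you are serving Nemotron, change the model field to match.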
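
Test 3 only proves the request is accepted. To check that the parser actually turns model output into a structured call, send a prompt that needs the tool (something like "Read the file README.md", which is an assumption of this example) and inspect the response. The sketch below looks for a non-empty tool_calls array. The name has_tool_call is hypothetical and the sample response is hand-written in the OpenAI chat completion shape, not real output.

```shell
# Hypothetical helper. Usage: <chat completion JSON on stdin> | has_tool_call
# Succeeds only when tool_calls holds at least one entry; an empty
# "tool_calls": [] does not count.
has_tool_call() {
  tr -d ' \n\t' | grep -Fq '"tool_calls":[{'
}

# Hand-written sample response, for illustration only.
sample='{"choices":[{"message":{"role":"assistant","content":null,"tool_calls":[{"id":"call_1","type":"function","function":{"name":"read_file","arguments":"{\"path\": \"README.md\"}"}}]}}]}'

if printf '%s' "$sample" | has_tool_call; then
  echo "structured tool call present"
fi
```

A model can still choose to answer in plain text, so one miss is not proof of a broken parser. Repeated misses on prompts that clearly need the tool are worth investigating.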

6. Point Hermes at the endpoint

Add to your ~/.bashrc or export before launching Hermes:

export OPENAI_BASE_URL="https://YOUR_POD_ID-8000.proxy.runpod.net/v1"
export OPENAI_API_KEY="your_fixed_key"

The model name to use in Hermes:

mistralai/Devstral-Small-2-24B-Instruct-2512

Hermes treats this endpoint as a drop-in OpenAI replacement. No other configuration changes are required.
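
The two mistakes that are easy to make here are a base URL without the /v1 suffix and a key that is still the placeholder. A pre-flight check along these lines catches both before you launch Hermes. It is a sketch only: check_endpoint_env is a hypothetical name, and it checks nothing beyond the two variables shown above.

```shell
# Hypothetical helper: fails with a reason if the two variables look wrong.
check_endpoint_env() (
  case ${OPENAI_BASE_URL:-} in
    http://*/v1|https://*/v1) ;;
    *) echo "OPENAI_BASE_URL must be an http(s) URL ending in /v1" >&2
       exit 1 ;;
  esac
  case ${OPENAI_API_KEY:-} in
    ""|your_fixed_key)
       echo "OPENAI_API_KEY is unset or still the placeholder" >&2
       exit 1 ;;
  esac
  exit 0
)

# Example with placeholder values: the URL passes, the key does not.
OPENAI_BASE_URL="https://YOUR_POD_ID-8000.proxy.runpod.net/v1"
OPENAI_API_KEY="your_fixed_key"
check_endpoint_env || echo "fix the variables before starting Hermes"
```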

7. The session loop (every working day)

# 1. Start the pod: RunPod dashboard -> find stopped pod -> play button
# 2. SSH in
bash /workspace/start_vllm.sh
# 3. Wait for: INFO: Application startup complete.
# 4. Verify with curl from your local machine (optional after first working session)
# 5. Use Hermes
# 6. Stop pod when done: dashboard -> stop button -> GPU billing stops

GPU billing stops the moment you stop the pod. The network volume continues at $0.07/GB/month under 1 TB ($0.05/GB above), roughly $7/month for 100 GB. Everything in /workspace persists including model weights, compiler cache and the launch script.

Latency: what to expect from the running endpoint

Once the pod is up, request latency is what your day actually feels like. Measured against Nemotron 3 Nano FP8 on RTX PRO 6000 Blackwell, hitting the public RunPod proxy from a residential network:

Prompt size	TTFT	Sustained decode
~25 tokens (small completion)	1443 ms	184 tok/s
~1830 tokens (medium context)	1541 ms	82 tok/s
~7230 tokens (long context)	1441 ms	68 tok/s

Two things to notice. First, TTFT is roughly flat across prompt sizes. At these context lengths on a Blackwell GPU, the proxy round-trip dominates the GPU’s prefill cost, so you pay the same ~1.4 second time-to-first-token whether you send 25 tokens or 7,000. The fix, if you want it lower, is to bypass the proxy with an SSH tunnel (ssh -L 8000:localhost:8000 root@your-pod-ip) and point Hermes at http://localhost:8000/v1. That removes the public proxy hop and brings TTFT closer to the raw GPU number.

Second, sustained decode degrades with context. Going from a short prompt to a 7K-token prompt drops you from 184 tok/s to 68 tok/s — about a 63% slowdown. This is the standard attention cost: longer context means each new token has to read a larger KV cache. For an agent loop that bounces off small prompts (1-4K tokens) you get the snappy end of this range. For long-context tasks like whole-file summarization at 16K+, you are at the slow end.

Practical implication: a coding agent doing typical 1-4K-token tool-call rounds will feel responsive (1.4s to first token, then 80-180 tok/s sustained). The same agent dragged into a 16K-token reasoning trace will feel noticeably slower. Context window discipline matters more than the model’s headline numbers.
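
To turn the table into wall-clock time, add the time to first token to the time spent decoding. The sketch below does that arithmetic with the measured figures. The function name response_seconds and the 500-token reply length are assumptions for illustration.

```shell
# Hypothetical helper.
# Usage: response_seconds TTFT_MS DECODE_TOKENS_PER_SECOND OUTPUT_TOKENS
response_seconds() {
  awk -v ttft="$1" -v rate="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", ttft / 1000 + n / rate }'
}

response_seconds 1443 184 500   # short prompt, 500-token reply
response_seconds 1441 68 500    # ~7K-token prompt, 500-token reply
```

The same 500-token reply takes roughly twice as long at the long-context decode rate, which is the difference you feel in a long session.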

What it actually costs

The decision whether to self-host is a cost question more than a technical one. Real RunPod prices for the RTX PRO 6000 Blackwell (96 GB) on Community Cloud as of April 2026:

Tier	Rate
On-demand	$1.69/hr
Spot (interruptible)	$1.19/hr

(Secure Cloud runs $3.07/hr on-demand, $1.65/hr spot. Use it only if you need datacenter-grade SLAs or pod-level isolation. Community Cloud is the realistic tier for a self-hoster.)

Network volume: $7/month for 100 GB.

Translated into monthly cost by usage pattern:

Usage	On-demand ($1.69/hr)	Spot ($1.19/hr)
24/7 always-on	$1,234/mo + $7 storage	$869/mo + $7 storage
8 h/day × 22 days (workday)	$297/mo + $7 storage	$209/mo + $7 storage
4 h/day × 22 days (light)	$149/mo + $7 storage	$105/mo + $7 storage

Compare to the alternatives:

Claude Max ($200/mo) or Pro+API top-ups: at $297/mo for 8 h/day on-demand, self-hosting is about $100 more than a single subscription, or roughly even with spot pricing. The trade-off becomes whether you would pay $100/month for full data control, no token limits, no rate limits and your choice of open-weight model.
Heavy API users ($400-800/mo): anyone burning through subscription quotas and topping up with API credits should at least price-check self-hosting. The break-even is roughly $250-300/month of API spend, which heavy daily users hit easily.
24/7 always-on: brutal. $1,234/mo on-demand or $869/mo on spot. Almost nobody should run it that way. Stop the pod when you are not using it. The session loop above is not optional discipline, it is what makes the math work.

The math works for a working software engineer who codes daily and runs sessions in 4-8 hour blocks. It does not work for the bursty “ask three questions and walk away” use case, which is what a hosted API is for.
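
To rerun the numbers for your own schedule, the arithmetic is hourly rate times hours per day times days per month, plus storage. A minimal sketch, with monthly_cost as a hypothetical name and results rounded to the nearest dollar:

```shell
# Hypothetical helper.
# Usage: monthly_cost RATE_PER_HOUR HOURS_PER_DAY DAYS_PER_MONTH [STORAGE_PER_MONTH]
monthly_cost() {
  awk -v rate="$1" -v hours="$2" -v days="$3" -v storage="${4:-0}" \
    'BEGIN { printf "%.0f\n", rate * hours * days + storage }'
}

monthly_cost 1.69 8 22 7   # workday pattern, on-demand, with 100 GB volume
monthly_cost 1.19 8 22 7   # same pattern at the spot rate
monthly_cost 1.69 4 22 7   # light pattern, on-demand
```

Swap in your own hours and the current rate from the RunPod console, since prices change.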

Securing the setup

A self-hosted endpoint on a cloud GPU is easy to get working and easy to leave too open. The risks are not exotic. They are the combination of a public-facing API, credentials that appear in logs and persistent storage that holds whatever you leave on it.

Treat the vLLM API key as a real secret

The key is the only gate between your RunPod endpoint and anyone on the internet. Use openssl rand -hex 32 to generate it. Do not reuse it elsewhere. Rotate it if it appears in logs, shell history, screenshots or pasted error dumps. It appears in vLLM startup logs in plaintext every time. Any log you paste anywhere is a rotation event. One important limitation: the --api-key flag only protects /v1 endpoints. Other endpoints vLLM exposes (including /metrics and /invocations) are unauthenticated. Anyone who can reach the pod’s public URL can access them.
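
You can see the gap for yourself by requesting paths without any key and comparing status codes. The sketch below is one way to do it. The function names are hypothetical, and the base URL you pass is your pod's proxy URL without the /v1 suffix. On a pod configured as in this guide you would expect /metrics to answer and /v1/models to be rejected, but check what your own pod returns.

```shell
# Hypothetical helper: print the HTTP status of an unauthenticated GET.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Hypothetical helper. Usage: report_open_paths BASE_URL_WITHOUT_V1
# Sends no Authorization header on purpose.
report_open_paths() (
  base=$1
  for path in /metrics /v1/models; do
    code=$(http_status "$base$path")
    echo "$path -> $code"
  done
)
```

Run it as report_open_paths "https://YOUR_POD_ID-8000.proxy.runpod.net". Any path that answers 200 here is readable by anyone who has the URL.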

Expose only port 8000

The pod only needs port 8000 exposed for this setup. Keep SSH on 22 only if you actively use it. Do not expose Jupyter notebook ports or any other service you do not specifically need. Each exposed port is an additional attack surface on a publicly routable IP. For a stronger setup, avoid exposing port 8000 to the public internet entirely. Use an SSH tunnel instead: ssh -L 8000:localhost:8000 root@your-pod-ip. This forwards the pod’s port 8000 to localhost on your machine. Set OPENAI_BASE_URL to http://localhost:8000/v1 and Hermes connects through the tunnel. No public proxy URL, no unauthenticated endpoints reachable from the internet. For multi-device or team access, Tailscale is a cleaner alternative but requires installing and persisting the client on the pod.
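
RunPod pods often listen for SSH on a non-default port, so the tunnel command usually needs a -p flag as well. The sketch below prints the full command for review instead of running it. tunnel_command is a hypothetical name, and the IP and port are placeholders to replace with the values from the pod's Connect tab. The -N flag tells ssh to forward the port without opening a remote shell.

```shell
# Hypothetical helper. Usage: tunnel_command POD_IP [SSH_PORT] [LOCAL_PORT]
# Prints the ssh command rather than running it, so you can check it first.
tunnel_command() {
  printf 'ssh -N -L %s:localhost:8000 -p %s root@%s\n' \
    "${3:-8000}" "${2:-22}" "$1"
}

tunnel_command your-pod-ip 22
```

Run the printed command in a separate terminal and leave it open for the length of the session.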

Use least-privilege tokens everywhere

The HuggingFace token on the pod should be fine-grained and scoped to the specific model repo. A broad Read token gives access to every repo on your account: private models, datasets, everything. If the pod is compromised, a scoped token limits the exposure to one public model.

Keep secrets out of /workspace

/workspace is persistent. That is exactly why it is the wrong place for plaintext secrets. Model weights, compiler cache and scripts belong there. .env files, request logs containing bearer tokens and auth credentials do not. What is in /workspace survives pod stops and is accessible to anyone with SSH access to the pod.
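
A periodic scan helps catch secrets that ended up on the volume by accident. The sketch below lists files containing something shaped like a HuggingFace token or a 64-character hex key. find_secret_files is a hypothetical name, the patterns are assumptions about what your secrets look like, and a match is a candidate to inspect rather than proof of a leak.

```shell
# Hypothetical helper. Usage: find_secret_files DIR
# Lists text files containing an hf_ token-like string or 64 hex characters.
# -I skips binary files, -l prints file names only, -r recurses.
find_secret_files() {
  grep -rIlE 'hf_[A-Za-z0-9]{20,}|[0-9a-f]{64}' "$1" 2>/dev/null
}

# Demonstration on a throwaway directory with a fake token.
demo=$(mktemp -d)
printf 'export HF_TOKEN=hf_abcdefghijklmnopqrstuvwxyz012345\n' > "$demo/.env"
find_secret_files "$demo"
rm -rf "$demo"
```

On the pod, point it at /workspace. Expect the scan to take a while on a volume that also holds model weights, and expect a launch script with a hardcoded key to show up in the results.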

Disable what you do not need

Jupyter, Global Networking and background services you did not explicitly enable are all off by default in a clean pod setup. Keep them off: extra services running on a pod with a public IP are unnecessary exposure. Enable Global Networking only if you specifically need cross-pod communication.

Stop the pod when you are done

A stopped pod serves no requests, accepts no sessions and cannot drift from ad hoc changes. Stopping the pod at the end of each session is both cost control and security hygiene. The two goals coincide exactly here.
