A working endpoint is the beginning, not the end. Here is what to do with it.

vLLM · Sampling · Evaluation · Devstral / Nemotron · Hermes Agent

A 3-part series on self-hosting a coding model on RunPod.
Part 1: Running a coding model on RunPod: 8 things that will break
Part 2: Devstral or Nemotron on RunPod: the complete vLLM + Hermes setup guide — step-by-step setup
Part 3: You’ve set up your coding model on RunPod. What’s next? — sampling, evaluation, ongoing operation (you are reading this)


The endpoint is live. Hermes connects. Your model responds to requests. The infrastructure problem is solved.

Once the obvious failures are behind you (the OOMs, the disk quotas, the 400 errors) the failures that remain are harder to diagnose:

  • The model wanders when the task needed discipline
  • The model varies too much across retries on the same stable task
  • The model overuses tools when plain generation would be better or underuses them when a tool was the right call
  • The model writes plausible but needlessly creative patches
  • The model turns a narrow edit request into exploratory prose about what it could do

These are behavior policy failures. The model running on defaults tuned for general chat, not for an agentic coding loop. The infrastructure did not cause them. The infrastructure being stable is what makes them visible.


First: find out what Hermes is actually sending

Before tuning anything at the vLLM level, answer one question: does Hermes override sampling parameters in its requests or does it leave them to the server?

This matters because if Hermes sets temperature explicitly in every request, any server-level sampling configuration is irrelevant. Request parameters win. If Hermes does not set them, the model’s generation_config.json defaults apply.

Enable request logging in vLLM temporarily, make a request through Hermes and inspect the JSON. Look for temperature, top_p, top_k in the request body. The answer determines where your tuning work should happen.

# Inspect what Hermes sends. Add to vLLM flags temporarily
--log-level debug

# Or intercept with a local proxy and inspect the JSON body

Why this matters. A common pattern: you adjust temperature at the vLLM server. Model behavior does not change. Hermes has been sending temperature=0.7 in every request the whole time.

Log the request first. Then decide where the control should live, server or client. Trying to control both without knowing which wins produces confusion that looks like model quality issues.


Second: fix the sampling defaults

Both Devstral Small 2 and Nemotron 3 Nano ship with generation_config.json defaults tuned for general chat:

ParameterDefaultRecommended for coding
temperature0.7start with 0.15
top_p0.81.0 (disabled)
top_k200 (disabled)
repetition_penalty1.051.0 (disabled)

The reasoning behind each change:

Temperature, the one that matters most

At 0.7, the model explores. The cost shows up two ways. The first is the obvious one: in an agent loop that runs retries, variation is drift. The model produces structurally different patches across retries, making it hard for the agent to distinguish a better answer from just a different one. The second is more subtle and more dangerous.

I tested this directly on a 30B reasoning model (Nemotron 3 Nano FP8). Three tasks (explain a module, find a real bug in a function, refactor for shared logic) run twice each at temperature 0.15 and twice at temperature 0.7. The interesting result was on the bug-finding task, where a real subtle bug existed in a Rust context-window manager: the function silently dropped the most-recent message when that message alone exceeded the budget.

SampleGot the bug right?
t=0.15, run 1Correct fix, added && !kept.is_empty() guard
t=0.15, run 2Correct fix, different code structure but same semantics
t=0.70, run 1Wrong bug. Proposed break → continue, which makes it worse: it silently drops the newest message instead of keeping it.
t=0.70, run 2Hallucinated a different “bug”: a fictional usize overflow hazard that cannot occur on token counts. Proposed a fix that does not address the real issue at all.

Both temperature-0.15 samples found the real bug. Both temperature-0.7 samples failed differently: one proposed a confidently-wrong fix, one fabricated a non-existent bug. Temperature did not just affect stability. It affected correctness. On precision tasks (find this bug, propose this patch, reason carefully about this code path), high temperature is not “creative”, it is wrong.

This finding does not show up in naive stability metrics. Counting line-diffs between samples at the same temperature, the temperature-0.15 outputs actually look less stable than the temperature-0.7 ones, because two correct fixes can legitimately be written several different ways while two wrong fixes happen to be similar in length and structure. Diff counts alone tell you nothing about correctness. Stability metrics need correctness scoring layered on top to be meaningful.

At 0.15, retries converge on correct answers. Diffs are meaningful.

Reasoning models eat token budgets fast. When I first ran this test with max_tokens=1500, the bug-finding and refactor samples came back nearly empty. Nemotron’s reasoning trace consumed the whole budget before the model reached its actual answer. Bumping to max_tokens=4000 fixed it. If you self-host a reasoning model for coding tasks, budget at least 3000-4000 max_tokens per call, not the 1000-2000 you would give a non-reasoning model. The reasoning trace is a sunk cost you have to pay for.

top_p and top_k: remove stacked constraints

Both further narrow the sampling distribution on top of temperature. For coding, one strong constraint is cleaner than three overlapping nudges. Disable both and let temperature do the work alone. If output quality is unsatisfactory, adjust temperature, not the others.

repetition_penalty: code is repetitive by design

Repetition penalties prevent looping in chat generation. Code has structural repetition that is correct and necessary: imports, interface patterns, test scaffolding, naming conventions. A penalty that helps chat can quietly push against code quality. At 1.0 it does nothing. Remove the unknown variable.

How to apply these. If Hermes does not set sampling params: configure vLLM defaults via the --generation-config flag or override generation_config.json.

If Hermes does set them: find the temperature setting in Hermes configuration and change it there. The server-level setting will be overridden anyway.

Start with temperature=0.15 only. Observe. Then evaluate whether you need to adjust further.

Validate the change with a side-by-side run

Do not take any of the above on faith. The practical way to confirm a sampling change actually helps is to run the same three or four tasks under both the shipped chat defaults and the coding profile from the table at the top of this section, then compare the outputs directly. Look at correctness, patch stability across retries, tool-call quality, verbosity and whether the model wanders or stays on task. The difference at temperature 0.7 vs 0.15 on a focused patch task is usually visible in the first run, and the bug-finding A/B I described earlier is what that comparison looks like in concentrated form. The “Fifth: build a simple evaluation loop” section below turns this into a repeatable practice; this subsection is the minimum viable version of it.


Third: evaluate tool-call quality, not just compatibility

Getting HTTP 200 on a tool-call request proves Hermes and your model can communicate about tools. It does not prove the model uses them well.

Tool-call compatibility and tool-call quality are different problems. Run a small set of real tasks and observe the model’s tool usage:

  • Does it call the right tool for the situation or use tools where plain generation would be better?
  • Does it avoid calling the same tool in a loop when the result does not change?
  • Are the arguments it passes to tools clean and correct, no hallucinated paths, no malformed JSON?
  • Does it know when not to call a tool and just answer directly?

These are harder to measure than a 200 status code, but they determine whether Hermes actually completes tasks or spins in tool loops.

A few concrete heuristics worth instrumenting in Hermes (or whatever agent loop you run on top of the endpoint):

  • Repeat-call loop: the same tool is called three or more times in a row with identical arguments. The model is not making progress. Abort the loop and surface the failure rather than letting it burn through the context window.
  • Hallucinated path: a tool argument references a file or directory that is not in the agent’s known working set. Return a structured error so the model can correct on the next turn instead of confidently repeating the bad call.
  • Malformed arguments: the arguments fail to parse as JSON or fail the tool’s schema. Retry once, then fail loudly. Repeated parse failures usually mean the --tool-call-parser flag is wrong for the model family (see blind spot 7 in the first post).
  • Unnecessary tool call: the model invokes a tool when the question could have been answered from conversation history. This is a system prompt problem, not a model problem. Tighten the prompt before tuning sampling.

Fourth: calibrate context length for your actual workload

The server runs at 128K. That does not make 128K the optimal operating point for your specific work. There is a real trade-off: longer context means the model can see more of a codebase in one pass, but also means longer prefill time, higher KV cache pressure and sometimes less focused outputs.

Run the same agentic tasks at different context settings and compare:

MetricWhat to look for
CorrectnessDoes the output do what was asked?
Patch stabilityDo retries produce consistent patches?
LatencyHow long does the first token take?
Tool call qualityDoes more context improve or distract tool selection?
VerbosityDoes the model ramble more with more context?

The cost of larger context is not abstract. Measured on a 30B FP8 model on a 96 GB Blackwell card, prefill scales roughly linearly above 16K tokens at about 27 ms per 1000 tokens of additional context:

Prompt tokensTTFT (prefill + dispatch)
1,00035 ms
4,00061 ms
16,000234 ms
32,000562 ms
64,0001,263 ms
96,0002,301 ms
128,0003,239 ms

Practical zones, framed by what the latency feels like inside an agent loop:

  • Up to 16K tokens: under 250 ms TTFT. Interactive. Cost is rounding error.
  • 32K tokens: about 560 ms TTFT. Snappy. The default sweet spot for most coding work.
  • 64K tokens: about 1.3 seconds TTFT per turn. Noticeable. In a 10-turn agent loop, that is 13 seconds of pure prefill latency added to the wall clock.
  • 128K tokens: about 3.2 seconds TTFT per turn. Slow, plus you start fighting KV cache pressure. Reserve for one-shot whole-codebase summarization, not iterative agent loops.

For most coding tasks (single file changes, focused bug fixes, small refactors) 32K is sufficient. The benefit of 128K appears in tasks that require reasoning across multiple files simultaneously. Start smaller and extend only when a real task requires it. To see how close you are to the ceiling during a session, vLLM exposes Prometheus metrics by default at /metrics. The one to watch is vllm:kv_cache_usage_perc. Run curl http://localhost:8000/metrics | grep kv_cache_usage from inside the pod to see what fraction of the KV cache is in use.

The first request after server startup is artificially slow. vLLM compiles its CUDA graphs lazily for new prompt-shape buckets. The first request that hits a previously-unseen length bucket can take an order of magnitude longer than the steady-state cost. One of my measurements showed 16 seconds at 16K tokens on the first hit, then 234 ms on every subsequent call at the same length. When you benchmark, always discard the first sample at each new shape. Otherwise you end up reporting compile cost as inference latency.

What 32K means in practice: enough for selective multi-file reasoning, enough for patch planning with surrounding context, enough for agent workflows that inspect only what they need. What it does not mean: dumping the whole repository into a single prompt and hoping for coherence. More context is not always better. An agent that reads selectively at 32K often outperforms one that dumps indiscriminately at 128K.

The practical constraint. The limit on context in a coding agent loop is rarely the model’s ceiling. It is what you choose to put in the context.

An agent that reads entire files when it needs three functions will hit context limits and produce unfocused outputs at any context length.

This is a Hermes configuration and prompt design question as much as a model question. Raise the context ceiling only after the agent is reading selectively.


Fifth: build a simple evaluation loop

Once sampling and context are calibrated, the work that converts “the endpoint is alive” into “this is a daily tool” is running a fixed set of real tasks regularly and tracking the results.

A minimal evaluation loop for a coding assistant:

  • Inspect a module and explain its structure. Tests comprehension and context use.
  • Propose a focused patch for a described bug. Tests reasoning and code generation.
  • Implement a narrow feature in an existing file. Tests instruction following.
  • Refactor a function for readability. Tests code quality judgment.
  • Locate and modify code across two related files. Tests multi-file tool use.

Run these tasks under your current configuration. Note correctness, patch stability across retries, tool call quality and whether outputs are appropriately scoped or tend to over-engineer.

A few rules to make the loop honest:

  • Sample size: three runs per task per configuration is the floor for a meaningful stability signal. Five is better. One run tells you nothing about variance.
  • Scoring: rate each task on two axes, correctness (0-2: broken, partial, correct) and stability (0-2: each retry produces a different shape, similar shape with wording changes, near-identical). Sum across tasks. A configuration that improves correctness by one point but loses two points of stability is a worse configuration overall, not a better one.
  • Do not use line-diff counts as a stand-alone stability metric. Two correct answers can legitimately differ a lot in formatting; two wrong answers can be misleadingly similar. I learned this the hard way: when I scored a temperature A/B by counting diff lines between samples, the temperature-0.15 outputs looked less stable than the temperature-0.7 ones — which would have been the opposite of the truth, because the temperature-0.15 outputs were correct and the temperature-0.7 outputs were both wrong (one harmful, one hallucinated). Diff metrics need correctness scoring on top of them. They are an input to evaluation, not a substitute for it.
  • Significance floor: on a five-task battery, treat any difference under roughly 10% of the total possible score as noise. Real wins show up bigger than that.
  • Version locking: record the model ID, vLLM version and full launch flag set with every evaluation run. When any of those change, re-run the battery before drawing conclusions. The same tasks under a different vLLM build are not the same evaluation.

When you change sampling settings, context length or upgrade to a new model, run the same tasks again and compare. This is the only reliable way to know whether a change improved or degraded your setup.


Sixth: manage the session overhead

The stop/start pod pattern works well for a coding assistant used in blocks, a few hours of focused work, then stopped. The session overhead is real but predictable: about 5 minutes from pod start to Hermes ready, dominated by vLLM’s profile-and-warmup pass (which cannot be cached) and the pip reinstall after the root disk wipes (which is mandatory on every restart). That is not 5 minutes of activity, it is 5 minutes of waiting once you have hit bash /workspace/start_vllm.sh. Plan around it, do not assume it away.

The clean operating loop that keeps everything reliable:

1. Start pod (RunPod dashboard, play button)
2. SSH in
3. bash /workspace/start_vllm.sh
4. Wait for: INFO: Application startup complete.
5. Test /v1/models from your local machine
6. Use Hermes
7. Stop pod when done

That loop keeps the GPU ephemeral and the useful state persistent. The only things that should be permanent are on the volume: model weights, compiler cache and the launch script. Everything else is recreated at start.

A few things that make the pattern more reliable over time:

  • Fix the VLLM_API_KEY in the launch script. Exporting it manually each session means a new SSH session will occasionally miss it and produce Unauthorized errors that look like endpoint failures.
  • Mirror the same key in ~/.bashrc as OPENAI_API_KEY. Both sides always in sync.
  • Rotate the key quarterly or whenever it appears in a pasted log. vLLM logs it in plaintext at every startup.
  • Keep /workspace clean: model weights and scripts belong there, secrets and log files do not.
  • Stop the pod when done. This is cost control and security hygiene at the same time. A stopped pod serves no requests and holds no active sessions.

When to consider upgrading the model

Devstral Small 2 and Nemotron 3 Nano are good models for this GPU. They will not be the best models for this GPU indefinitely. The open-weight coding model landscape moves fast.

The signal to watch is SWE-bench Verified scores for models in the 20-30B FP8 or BF16 range, small enough to fit a single 96 GB card with meaningful context headroom. When a model in that range significantly exceeds 68%, it is worth evaluating against your task battery. One model already worth benchmarking alongside Devstral is NVIDIA’s Nemotron 3 Nano 30B-A3B-FP8. It is a hybrid Mamba-2 + Transformer MoE model (31.6B total parameters, 3.2B active per forward pass) with a native FP8 checkpoint at 32.7 GB. It scores 68.25% on LiveCodeBench v6 and supports context up to 1M tokens. The setup guide for this model, including the modified launch script, is in the setup guide.

Before switching:

  • Check whether the new model requires authentication on HuggingFace. If so, create a new fine-grained token scoped to that repo.
  • Verify the BF16 download size fits your 100 GB volume, not the GGUF size, the safetensors download.
  • Check which --tool-call-parser value vLLM requires for the new model’s family. Each model family was fine-tuned with a different tool-call token format. The parser must match. See blind spot 7 in the first post for how to find the right value.
  • Run your evaluation task set against the new model before committing to the switch.

The infrastructure does not change when the model changes. The network volume needs a new download. The launch script needs a new model name and parser flag. Everything else stays the same.


Series navigation: