
Build your own AI coding assistant with Devstral Small 2

Want your own AI coding assistant that runs on your hardware, costs nothing after setup, and keeps your code private? Mistral's Devstral Small 2 makes that possible now. Almost.

This 24B-parameter model scores 68% on SWE-bench Verified, supports a context window of up to 256K tokens, and works with any OpenAI-compatible agentic tool. Whether you have a high-end workstation GPU or rent cloud compute for as little as $0.40/hr, you can have your own coding agent running in under an hour. Practical deployments typically use 32K-65K context windows for optimal performance on 48GB GPUs.

This guide shows you how to deploy Devstral directly with vLLM and connect it to Continue.

The economics of self-hosting

Before diving into deployment, let's examine whether self-hosting makes financial sense for your team.

Cost per developer

A single A40 GPU on Runpod ($0.40-0.50/hr) running Devstral Small 2 with vLLM can realistically support:

Light usage (occasional assistance):

  • 10-15 developers
  • Pattern: 5-10 queries per hour per developer
  • Cost: $0.03-0.05 per developer per hour
  • Use cases: Code explanations, quick reviews, occasional generation

Moderate usage (active pair programming):

  • 6-9 developers
  • Pattern: 15-25 queries per hour per developer
  • Cost: $0.04-0.08 per developer per hour
  • Use cases: Regular code generation, refactoring, test creation

Heavy usage (autonomous agents):

  • 2-5 developers
  • Pattern: 30+ queries per hour per developer
  • Cost: $0.08-0.25 per developer per hour
  • Use cases: Agentic workflows (Cline, Vibe CLI running continuously)

Comparison to commercial alternatives

Solution | Cost per developer (moderate use, 8 hrs/day) | Privacy | Context window
Self-hosted Devstral (A40) | $0.32-0.64/day | Full | 32K-256K
GitHub Copilot | $10-19/month ($0.45-0.86/day) | Partial | Limited
Cursor | $20/month ($0.90/day) | None | Limited
Claude Pro | $20/month ($0.90/day) | None | 200K

For a 6-9 developer team with moderate usage on a single A40 instance:

  • Self-hosted: ~$80-100/month ($0.40-0.50/hr × 200 hrs)
  • GitHub Copilot: $60-170/month
  • Cursor: $120-180/month
  • Claude Pro: $120-180/month

The economics improve dramatically with heavy usage or larger teams, where commercial solutions scale linearly with headcount while self-hosted infrastructure scales with concurrent load.
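
As a rough sanity check on that claim, here's a toy cost model. The hourly rate, the 200 working hours per month, and the $20/seat price are the estimates used in this article, and the GPU-count thresholds follow the scaling guidelines below; treat it as a sketch, not a quote:

# Toy cost model: per-seat pricing scales with headcount,
# a self-hosted GPU scales with concurrent load.
def self_hosted_monthly(gpu_hourly=0.45, hours=200, gpus=1):
    return gpu_hourly * hours * gpus

def per_seat_monthly(developers, seat_price=20):
    return developers * seat_price

for team in (5, 10, 20, 40):
    gpus = 1 if team <= 20 else 2  # rough capacity guideline from this article
    print(f"{team:>2} devs: self-hosted ${self_hosted_monthly(gpus=gpus):.0f}/mo "
          f"vs per-seat ${per_seat_monthly(team):.0f}/mo")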

Scaling considerations

Single A40 ($0.40-0.50/hr):

  • Optimal: 5-10 developers
  • Maximum: 15-20 developers with mixed usage

Dual A40 ($0.80-1.00/hr):

  • Optimal: 15-20 developers
  • Maximum: 30-40 developers with mixed usage

Signs you need more capacity:

  • Response times consistently exceed 30 seconds
  • Developers report waiting in queue
  • More than 5-8 concurrent active sessions
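
One way to watch for these signs before developers complain is to poll vLLM's Prometheus metrics endpoint. A minimal sketch, assuming the requests package is installed and the metric names used by recent vLLM releases (they may differ in your version):

# Check vLLM's running/waiting request counters to spot queueing early
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
        print(line)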

The key advantage of self-hosting is cost predictability and privacy. Whether your developers make 10 or 100 queries per hour, your infrastructure cost remains constant. Commercial solutions charge per seat regardless of usage, making self-hosting increasingly attractive for teams that rely heavily on AI assistance.

This article focuses on deployment. For the actual development experience, I'll save my detailed comparison for a follow-up piece. The short version: Devstral Small 2 is very impressive for a self-hosted model and the deployment is straightforward, but if privacy isn't your primary driver, Claude's output quality and reasoning still justify the per-seat cost for most teams.

Hardware requirements

Local deployment:

  • NVIDIA RTX 6000 Ada, A40 or A6000 (48GB VRAM)
  • Ubuntu 20.04+ or equivalent Linux distribution
  • Python 3.9-3.12
  • NVIDIA drivers (535+ recommended)
  • vLLM >= 0.8.5
  • mistral_common >= 1.5.5
  • 120GB storage (150GB recommended for comfortable operation)
  • 64GB RAM recommended

Cloud deployment:

  • Runpod GPU pod with A40 (48GB), A6000 (48GB), or RTX 6000 Ada (48GB) (from $0.40-1.20/hr)

Important: 48GB VRAM limitation

Devstral Small 2505 in full precision requires 47.3GB VRAM just to load the model weights. On a 48GB GPU, this leaves minimal memory for KV cache, batching, and inference overhead. The configurations below use GPTQ 4-bit quantization (31GB VRAM) for reliable operation on 48GB GPUs.
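
For intuition on why quantization is unavoidable here, a quick back-of-envelope calculation. The parameter count is approximate, and the split between weights and runtime overhead is my assumption rather than a measured breakdown:

# Rough VRAM arithmetic for a ~23.6B-parameter model (estimates, not measurements)
params = 23.6e9
print(f"bf16 weights : {params * 2 / 1e9:.0f} GB")    # ~47 GB, matching the full-precision figure above
print(f"4-bit weights: {params * 0.5 / 1e9:.0f} GB")  # ~12 GB for the weight tensors alone
# The ~31 GB quoted for the GPTQ deployment is the total serving footprint:
# weights plus KV cache, activations, and vLLM/CUDA overhead (my assumption).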

Security note: Third-party quantization

This guide uses a community-quantized model (mratsim/Devstral-Small-2505.w4a16-gptq) for tutorial simplicity. For production deployments where privacy is critical, you should quantize the official model yourself using AutoGPTQ or llama.cpp to ensure no tampering. Third-party quantized models could theoretically contain altered weights or malicious modifications.

Self-quantization is covered in the "Production considerations" section below.

Quick start

Install vLLM:

# Install vLLM (requires >= 0.8.5) and dependencies
pip install --upgrade vllm
pip install mistral_common hf_transfer

# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

Download the quantized model:

# Download GPTQ 4-bit quantized model (31GB VRAM, takes 10-15 minutes)
# This is a third-party quantization for tutorial simplicity
huggingface-cli download mratsim/Devstral-Small-2505.w4a16-gptq

# Verify the model is cached
ls -la ~/.cache/huggingface/hub/ | grep -i devstral

vLLM will automatically use this cached model when you start the server.
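
If you want a more detailed check than the ls above, huggingface_hub (installed alongside huggingface-cli) can report what is in the cache and how large each repo is:

# Inspect the Hugging Face cache and report per-repo size on disk
from huggingface_hub import scan_cache_dir

for repo in scan_cache_dir().repos:
    if "devstral" in repo.repo_id.lower():
        print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB, {repo.nb_files} files")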

Run Devstral:

# Run in foreground (for testing)
# Requires 48GB VRAM (A40/A6000/RTX 6000 Ada)
vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000

For persistent background execution:

# Install tmux or screen
apt-get update && apt-get install -y tmux
# OR: apt-get install -y screen

# Run in detached tmux session
tmux new-session -d -s vllm 'vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000'

# View logs
tmux attach -t vllm

# Detach: Ctrl+B then D

Why tool calling is disabled

While vLLM supports tool calling with --tool-call-parser mistral and --enable-auto-tool-choice, there are multiple unresolved bugs when using these flags with Mistral models:

  1. Streaming tool calls trigger JSONDecodeError (vLLM issue #21303 - closed as "not planned")
  2. Tool call validation errors with the index field (issue #17643)
  3. Various mistral_common compatibility issues

Other coding assistants (Cline, OpenCode, Kilo Code, Continue) work without formal tool calling, mostly reliably: they interact with Devstral via standard chat, which is the more stable path.

The server will:

  1. Load the cached model into GPU memory
  2. Start the OpenAI-compatible API server on port 8000

Wait for:

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000

Test the deployment:

curl http://localhost:8000/v1/models
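
Beyond listing models, it's worth confirming the server actually generates text. A minimal smoke test using the official openai Python client (pip install openai; the prompt and token limit are arbitrary):

# Generation smoke test against the vLLM OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="mratsim/Devstral-Small-2505.w4a16-gptq",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=200,
    temperature=0.2,
)
print(resp.choices[0].message.content)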

Integration with agentic coding tools

Now that you have Devstral running, integrate it with agentic coding assistants.

Context window configuration

While Devstral Small 2 supports a 256K context window at the model level, the actual context available to your tools is determined by vLLM's --max-model-len parameter. The configuration below uses 32768 tokens to match the recommended vLLM deployment settings. If you're using different performance tuning (high throughput: 65536, low memory: 16384), adjust the context length value accordingly.
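
If you're unsure which limit a running server was started with, you can ask it: recent vLLM versions report the value as max_model_len in the /v1/models response (if your version omits the field, fall back to the --max-model-len flag you passed). A quick check, assuming the requests package is installed:

# Print the context limit each served model actually exposes
import requests

data = requests.get("http://localhost:8000/v1/models", timeout=10).json()
for model in data.get("data", []):
    print(model["id"], "max_model_len =", model.get("max_model_len", "not reported"))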

Mistral Vibe CLI

Known issue with vLLM tool calling

Mistral Vibe CLI is Mistral's official terminal-based coding agent. However, when used with vLLM and tool calling enabled, it encounters a JSONDecodeError during streaming tool calls (vLLM issue #21303). The vLLM maintainers closed this issue as "not planned," and Vibe doesn't expose a configuration option to disable streaming. Until this is resolved, use other OpenAI-compatible clients instead.

Other coding assistants

Most OpenAI-compatible coding assistants work seamlessly with this setup, including Cline, Continue, Kilo Code, OpenCode, and others.

Example configuration for Continue (.continue/config.yaml):

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: devstral-small-2
    provider: openai
    model: mratsim/Devstral-Small-2505.w4a16-gptq
    apiBase: http://localhost:8000/v1
    apiKey: none

Note: For Runpod/remote deployments, replace http://localhost:8000/v1 with your pod's public endpoint (e.g., http://12.345.67.89:54321/v1).

Most coding assistants support similar configuration patterns with provider: openai or provider: openai-compatible, the model endpoint URL, and the model name. Consult your specific tool's documentation for exact configuration syntax.

Performance tuning

With GPTQ 4-bit quantization (31GB VRAM), you have plenty of headroom on 48GB GPUs for optimization.

High throughput (48GB VRAM, GPTQ):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 65536 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 128 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

Balanced (48GB VRAM, recommended):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

Low memory (32GB VRAM or heavy concurrent usage):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000
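
To compare these profiles on your own hardware, a rough single-request throughput check is enough. It assumes pip install openai; results vary with prompt length and concurrency, so treat the numbers as relative, not absolute:

# Rough tokens/second measurement for one request
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
resp = client.chat.completions.create(
    model="mratsim/Devstral-Small-2505.w4a16-gptq",
    messages=[{"role": "user", "content": "Explain Python list comprehensions with three examples."}],
    max_tokens=512,
)
elapsed = time.time() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} completion tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")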

Runpod deployment

For cloud deployment without local hardware.

Step 1: Deploy pod

  1. Create account at runpod.io
  2. Deploy pod:
    • GPU: A40 (48GB, $0.40-0.50/hr), A6000 (48GB, $0.80-1.00/hr), or RTX 6000 Ada (48GB, $0.80-1.20/hr)
    • Template: "RunPod Pytorch 2.4" or "RunPod Pytorch" (Python 3.10+ pre-installed)
    • Disk: 120GB minimum (150GB recommended for comfortable operation)
    • Pod type: Secure Cloud (recommended) or Community Cloud

Step 2: Access terminal

Click Connect > Start Web Terminal in the Runpod console. This opens a browser-based terminal with full access to your pod.

Step 3: Install and run vLLM

From here, the setup is nearly identical to the local deployment covered in the Quick start section above. Follow the same installation steps and vLLM commands. The only difference is the cache location (Runpod uses /workspace/.cache/huggingface/hub/ instead of ~/.cache/huggingface/hub/), but vLLM handles this automatically.

Step 4: Expose port

In the Runpod pod settings, expose TCP port 8000:

  1. Click Edit Pod
  2. Under Expose Ports, add port 8000
  3. Note the external port mapping (e.g., 12.345.67.89:54321)

Configure your agentic tool with the public endpoint:

{
  "baseURL": "http://12.345.67.89:54321/v1",
  "model": "mratsim/Devstral-Small-2505.w4a16-gptq"
}

Alternative: For local-only access, use SSH tunnel if you have SSH configured:

ssh -L 8000:localhost:8000 root@<pod-id>.runpod.io -p <ssh-port> -N

Cost optimization:

  • Use spot instances (50-70% cheaper)
  • Stop pods when idle
  • Enable auto-stop in settings

Troubleshooting

Out of memory:

If you get torch.OutOfMemoryError: CUDA out of memory with the GPTQ model:

  1. Check GPU VRAM: Verify you have 48GB VRAM:

    nvidia-smi
    # Look for "Memory-Usage" - should show 48GB total
    
  2. Check for existing processes: Another process may be consuming GPU memory:

    nvidia-smi
    # Look for other processes in the "Processes" section
    # Kill them with: kill -9 <PID>
    
  3. Reduce memory settings: Lower memory utilization and context window:

    --gpu-memory-utilization 0.75 \
    --max-model-len 16384 \
    --max-num-seqs 32
    
  4. Set PyTorch memory allocation: Prevent fragmentation:

    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    
  5. Insufficient VRAM: The GPTQ model requires 31GB VRAM. If you have less than 48GB, you need:

    • A40 (48GB) - $0.40-0.50/hr on Runpod (best value)
    • A6000 (48GB) - $0.80-1.00/hr on Runpod
    • RTX 6000 Ada (48GB) - $0.80-1.20/hr on Runpod

Note: 24GB GPUs (RTX 3090/4090/A10G) cannot run Devstral Small 2 with vLLM, even quantized. Consider using Ollama with GGUF quantization (Q4 requires ~15GB) for smaller GPUs.

Cannot find any model weights:

If you get RuntimeError: Cannot find any model weights when loading the GPTQ model:

  1. Remove --load-format parameter: The GPTQ model needs vLLM to auto-detect the format. Remove --load-format mistral from your command. vLLM will automatically detect GPTQ quantization from the model's quantize_config.json.

  2. Verify model download: Ensure the GPTQ model was downloaded correctly:

    ls -la ~/.cache/huggingface/hub/models--mratsim--Devstral-Small-2505.w4a16-gptq/
    # Should contain: *.safetensors, quantize_config.json, config.json
    
  3. Use correct command: For GPTQ models, use:

    vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
      --tokenizer-mode mistral \
      --config-format mistral
    # Note: NO --load-format parameter
    

Model loading error (config.json not found):

Ensure all dependencies are installed and the model is downloaded:

# Verify dependencies
pip show vllm mistral_common hf_transfer

# If missing, install them
pip install --upgrade vllm mistral_common hf_transfer

# Verify the model is cached (should see a directory with the model name)
ls -la ~/.cache/huggingface/hub/ | grep -i devstral

# If not cached, download it
huggingface-cli download mratsim/Devstral-Small-2505.w4a16-gptq

CUDA not available or initialization error:

If you see "CUDA unknown error" or "CUDA available: False", diagnose the issue:

# Check NVIDIA driver and GPU visibility
nvidia-smi

# Check CUDA environment variables
echo $CUDA_VISIBLE_DEVICES
echo $CUDA_HOME

# Check PyTorch CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'Device count: {torch.cuda.device_count()}')"

Common fixes:

  1. NVIDIA driver not loaded: If nvidia-smi fails, reinstall drivers:

    # Ubuntu/Debian
    sudo apt-get install nvidia-driver-535
    sudo reboot
    
  2. Wrong PyTorch version: Reinstall PyTorch with CUDA support:

    pip uninstall torch
    pip install torch --index-url https://download.pytorch.org/whl/cu121
    
  3. Environment variable conflict: Unset and retry:

    unset CUDA_VISIBLE_DEVICES
    python -c "import torch; print(torch.cuda.is_available())"
    
  4. Reinstall vLLM: If PyTorch works but vLLM doesn't:

    pip uninstall vllm
    pip install --upgrade vllm
    

Note: vLLM requires GPU and will not run on CPU-only systems.

Server won't start:

# Check if port 8000 is already in use
lsof -i :8000

# Kill existing process if needed
kill -9 $(lsof -t -i:8000)

# Check GPU is accessible
python -c "import torch; print(torch.cuda.device_count())"

Slow download:

Depending on connection speed, the model download can take 15-30 minutes (the full-precision weights are roughly 47GB; the GPTQ checkpoint is much smaller). Use huggingface-cli download with progress monitoring rather than letting vLLM download automatically.

Connection from agentic tools:

Ensure:

  • Server is running: curl http://localhost:8000/v1/models
  • No firewall blocking: sudo ufw allow 8000 (if needed)
  • Correct endpoint configured in your agentic tool

Cost comparison

Setup | Hardware cost | Ongoing cost | Throughput
RTX 6000 Ada local | $6000-7000 | Electricity only | 80-100 tok/s
A6000 local | $4000-5000 (used) | Electricity only | 70-85 tok/s
A40 local | $3000-4000 (used) | Electricity only | 60-75 tok/s
Runpod A40 | $0 | $0.40-0.50/hr | 60-75 tok/s
Runpod A6000 | $0 | $0.80-1.00/hr | 70-85 tok/s
Runpod RTX 6000 Ada | $0 | $0.80-1.20/hr | 80-100 tok/s
Runpod A100 (80GB) | $0 | $1.50-2.00/hr | 90-110 tok/s

For occasional use, Runpod is cheaper. The A40 at $0.40-0.50/hr offers the best value for cloud deployment. For daily use beyond 6-8 hours, local hardware pays off in 12-18 months.
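
To run the break-even math for your own usage pattern, here is a minimal sketch with illustrative inputs taken from the table above; plug in your own hardware price, hourly rate, and duty cycle:

# Months until buying the GPU beats renting the same card
def breakeven_months(hardware_cost, cloud_hourly, hours_per_day, days_per_month=30):
    return hardware_cost / (cloud_hourly * hours_per_day * days_per_month)

print(f"A40, always-on agent workloads : {breakeven_months(3500, 0.45, 24):.0f} months")
print(f"RTX 6000 Ada, ~12 hrs/day      : {breakeven_months(6500, 1.00, 12):.0f} months")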

References