How to run Claude Code with a local MiniMax-M2.1 model via vLLM

This article explains how to run Claude Code using a local MiniMax-M2.1 model via the vLLM Anthropic API endpoint. It details the required hardware, installation steps for vLLM and the model, and configuration of Claude Code to connect to the local server. Following the guide enables developers to leverage a high‑performance local model for AI‑assisted coding without relying on external services.

You can run Claude Code with your own local MiniMax-M2.1 model using vLLM's native Anthropic API endpoint support.

Hardware Used

Component      Specification
CPU            AMD Ryzen 9 7950X3D 16-Core Processor
Motherboard    ROG CROSSHAIR X670E HERO
GPU            Dual NVIDIA RTX Pro 6000 (96 GB VRAM each)
RAM            192 GB DDR5 5200 (the model fits entirely into VRAM, so system RAM is not used for inference)

Install vLLM Nightly

Prerequisites: Ubuntu 24.04 with up-to-date NVIDIA drivers installed.
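
You can confirm the drivers and both GPUs are visible before installing anything (nvidia-smi ships with the NVIDIA driver):

# Should list both RTX Pro 6000 cards and the driver/CUDA versions
nvidia-smi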

mkdir vllm-nightly
cd vllm-nightly
uv venv --python 3.12 --seed
source .venv/bin/activate

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
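
To verify the nightly build installed correctly, print the version from inside the activated venv (the exact version string will vary):

# Quick sanity check of the vLLM install
python -c "import vllm; print(vllm.__version__)"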

Download MiniMax-M2.1

Set up a separate environment for downloading models:

mkdir /models
cd /models
uv venv --python 3.12 --seed
source .venv/bin/activate

pip install huggingface_hub

Download the AWQ-quantized MiniMax-M2.1 model:

mkdir /models/awq
huggingface-cli download cyankiwi/MiniMax-M2.1-AWQ-4bit \
    --local-dir /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
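
Optionally confirm the weights landed where vLLM will look for them (the exact file list depends on the repository contents):

# List the downloaded files and check the total size on disk
ls /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit
du -sh /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit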

Start vLLM Server

From your vLLM environment, launch the server with the Anthropic-compatible endpoint:

cd ~/vllm-nightly
source .venv/bin/activate

vllm serve \
    /models/awq/cyankiwi-MiniMax-M2.1-AWQ-4bit \
    --served-model-name MiniMax-M2.1-AWQ \
    --max-num-seqs 10 \
    --max-model-len 128000 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

The server exposes /v1/messages (Anthropic-compatible) at http://localhost:8000.
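
Before pointing Claude Code at the server, you can send a minimal Anthropic-style request as a smoke test. This is a rough sketch; the exact headers and response fields depend on your vLLM build, and the dummy key is arbitrary since this local setup does not validate it:

# Minimal smoke test against the /v1/messages endpoint
curl http://localhost:8000/v1/messages \
    -H "content-type: application/json" \
    -H "x-api-key: dummy" \
    -d '{
          "model": "MiniMax-M2.1-AWQ",
          "max_tokens": 64,
          "messages": [{"role": "user", "content": "Say hello"}]
        }'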


Install Claude Code

Install Claude Code on macOS, Linux, or WSL:

curl -fsSL https://claude.ai/install.sh | bash

See the official Claude Code documentation for more details.
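
To confirm the install succeeded, check that the claude binary is on your PATH:

# Prints the installed Claude Code version
claude --version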


Configure Claude Code

Create settings.json

Create or edit ~/.claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "ANTHROPIC_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_SMALL_FAST_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "MiniMax-M2.1-AWQ",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "MiniMax-M2.1-AWQ"
  }
}
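
Before launching, it is worth making sure the file is valid JSON, since a typo here can cause the settings to be ignored. A quick check with jq (assuming jq is installed):

# Fails with a parse error if settings.json is malformed
jq . ~/.claude/settings.json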

Skip Onboarding (Workaround for Bug)

Due to a known bug in Claude Code 2.0.65+, fresh installs may ignore settings.json during onboarding. Add hasCompletedOnboarding to ~/.claude.json:

# If ~/.claude.json doesn't exist, create it:
echo '{"hasCompletedOnboarding": true}' > ~/.claude.json

# If it exists, add the field manually or use jq:
jq '. + {"hasCompletedOnboarding": true}' ~/.claude.json > tmp.json && mv tmp.json ~/.claude.json

Run Claude Code

With vLLM running in one terminal, open another and run:

claude

Claude Code will now use your local MiniMax-M2.1 model. If you also want to configure the Claude Code VS Code extension, refer to the official Claude Code documentation.
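
For a quick end-to-end check without entering the interactive session, recent Claude Code releases also support a non-interactive print mode:

# One-shot prompt routed through the local MiniMax-M2.1 server
claude -p "Write a bash one-liner that counts lines in every .py file"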


References

Source: Running MiniMax-M2.1 Locally with Claude Code on Dual RTX Pro 6000