Local AI Models Guide
This page contains comprehensive guides for running local AI models on various hardware configurations, including dual GPU setups with NVIDIA RTX 6000 Pro and RTX 5090 cards. Learn how to set up vLLM, configure Ollama, and optimize your local AI infrastructure for efficient large language model inference.
Local model execution commands for vLLM for dual NVIDIA RTX 6000 Pro
This guide provides detailed instructions for setting up and running large language models locally using vLLM on a system equipped with dual NVIDIA RTX 6000 Pro GPUs. It covers installation steps for both stable and nightly builds of vLLM, solutions to common GPU communication issues, and specific commands for various high-performance models including Devstral 2 123B, GLM-4.5-Air-FP8, and Qwen3-235B-A22B. It also offers practical insights into maximizing multi-GPU performance for efficient LLM inference, with tips on achieving optimal speed and memory utilization.
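The workflow the guide describes can be sketched roughly as follows. This is an illustrative outline only: the model ID is a placeholder, and the NCCL environment variable is a commonly used workaround for multi-GPU communication problems, not a setting taken from the guide itself.

```shell
# Install the stable vLLM build (the guide also covers nightly builds):
pip install vllm

# If the two GPUs fail to communicate, disabling peer-to-peer transfers
# via NCCL is a frequently suggested workaround (assumption, not from the guide):
export NCCL_P2P_DISABLE=1

# Serve a large model split across both RTX 6000 Pro cards.
# --tensor-parallel-size 2 shards each layer across the two GPUs;
# <model-id> stands in for one of the models listed above.
vllm serve <model-id> \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90
```

The exact flags and model identifiers for each model are given in the full guide.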
How to connect Ozeki AI Gateway to Ollama
This guide demonstrates how to install Ollama on your local system and connect it to Ozeki AI Gateway. You'll learn how to download and install Ollama, test it with a local AI model, and configure it as a provider in your gateway. The tutorial includes step-by-step instructions with screenshots and video guides covering provider configuration, API endpoint setup, and testing procedures. By following these steps, you can run AI models locally on your machine and access them through Ozeki AI Gateway's unified interface.
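The Ollama-side steps can be sketched like this. The gateway-side provider configuration is covered by the guide's screenshots; the model name below is just an example, and the install script URL and default port are Ollama's documented defaults.

```shell
# Install Ollama on Linux using the official convenience script:
curl -fsSL https://ollama.com/install.sh | sh

# Pull and test a local model (llama3 here is an arbitrary example):
ollama run llama3 "Say hello"

# Ollama serves its HTTP API on port 11434 by default; this is the
# endpoint you register as a provider in Ozeki AI Gateway:
curl http://localhost:11434/api/tags
```

Once the endpoint responds, the remaining configuration happens in the Ozeki AI Gateway interface as shown in the tutorial.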
How to run local AI Models on 64 GB VRAM, 2x RTX 5090
This guide explains how to run local AI models on a system with 64 GB of VRAM across dual RTX 5090 GPUs. It focuses on using the NVFP4-quantized GLM-4.7-FLASH model from HuggingFace with vLLM. The page provides the specific command with optimized parameters, including tensor-parallel-size 2, max-model-len configuration, and tool-call-parser settings, and notes that the context size can be raised to 80K, making the setup suitable for a range of inference tasks on high-end consumer hardware.
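The shape of that command looks roughly like the sketch below. The HuggingFace repository path and the parser name are placeholders, not values copied from the page; only the parameter names match those the summary lists.

```shell
# Sketch of a dual-GPU vLLM launch with the parameters the page mentions.
# <org>/<nvfp4-model> and <parser-name> are placeholders -- consult the
# page itself for the real HuggingFace ID and parser value.
vllm serve <org>/<nvfp4-model> \
    --tensor-parallel-size 2 \
    --max-model-len 80000 \
    --enable-auto-tool-choice \
    --tool-call-parser <parser-name>
```

Here --max-model-len 80000 reflects the page's note that the context size can go up to 80K.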
How to run local AI Models on 24 GB VRAM, RTX 3090 or RTX 4090