SlothLab Tools

AI Model Stack Builder

Calculate total VRAM for your multi-model AI setup.


Model Browser

DeepSeek R1 14B (Popular)
14B · 28.0 GB
DeepSeek R1 32B (Popular)
32B · 64.0 GB
DeepSeek R1 70B (Popular)
70B · 140.0 GB
DeepSeek R1 7B (Popular)
7B · 14.0 GB
DeepSeek R1 8B (Popular)
8B · 16.0 GB
Gemma 3 12B (Popular)
12B · 24.0 GB
Llama 3.1 70B (Popular)
70B · 140.0 GB
Llama 3.1 8B (Popular)
8B · 16.0 GB
Llama 3.3 70B (Popular)
70B · 140.0 GB
Llama 4 Scout 109B (Popular)
109B · 218.0 GB
Ministral 7B (Popular)
7B · 14.0 GB
Mistral 7B v0.3 (Popular)
7B · 14.0 GB
Mistral NeMo 12B (Popular)
12B · 24.0 GB
Phi-3 Medium 14B (Popular)
14B · 28.0 GB
Phi-4 14B (Popular)
14B · 28.0 GB
Qwen 2.5 14B (Popular)
14B · 28.0 GB
Qwen 2.5 7B (Popular)
7B · 14.0 GB
Qwen 3 14B (Popular)
14B · 28.0 GB
Qwen 3 32B (Popular)
32B · 64.0 GB
Qwen 3 8B (Popular)
8B · 16.0 GB
Command R 35B
35B · 70.0 GB
Command R+ 104B
104B · 208.0 GB
DeepSeek R1 1.5B
1.5B · 3.0 GB
DeepSeek V2 236B
236B · 472.0 GB
DeepSeek V2 Lite 16B
16B · 32.0 GB
DeepSeek V3 671B
671B · 1342.0 GB
Gemma 2 27B
27B · 54.0 GB
Gemma 2 2B
2B · 4.0 GB
Gemma 2 9B
9B · 18.0 GB
Gemma 3 1B
1B · 2.0 GB
Gemma 3 27B
27B · 54.0 GB
Gemma 3 4B
4B · 8.0 GB
Llama 3.1 405B
405B · 810.0 GB
Llama 3.2 1B
1B · 2.0 GB
Llama 3.2 3B
3B · 6.0 GB
Llama 4 Maverick 400B
400B · 800.0 GB
Ministral 3B
3B · 6.0 GB
Mistral Large 3 675B
675B · 1350.0 GB
Mistral Small 24B
24B · 48.0 GB
Mixtral 8x22B
141B · 282.0 GB
Mixtral 8x7B
46.7B · 93.4 GB
Phi-3 Mini 3.8B
3.8B · 7.6 GB
Phi-4 Mini 3.8B
3.8B · 7.6 GB
Qwen 2.5 0.5B
0.5B · 1.0 GB
Qwen 2.5 1.5B
1.5B · 3.0 GB
Qwen 2.5 32B
32B · 64.0 GB
Qwen 2.5 3B
3B · 6.0 GB
Qwen 2.5 72B
72B · 144.0 GB
Qwen 3 0.6B
0.6B · 1.2 GB
Qwen 3 1.7B
1.7B · 3.4 GB
Qwen 3 4B
4B · 8.0 GB
Qwen 3 MoE 235B-A22B
235B · 470.0 GB
Qwen 3 MoE 30B-A3B
30B · 60.0 GB


Your Hardware

Last Updated: March 16, 2026

How It Works

Add AI models from 6 categories to build your stack. Each model's VRAM is calculated based on its parameters and configuration. Models marked 'Always On' are loaded simultaneously, while 'On Demand' models only add the largest one to peak VRAM. Total includes 0.5 GB system overhead.
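The loading rule above reduces to a short formula. Here is a minimal Python sketch; the model names and sizes in the example stack are illustrative, not the tool's internal data:

```python
SYSTEM_OVERHEAD_GB = 0.5  # fixed system overhead included in the total

def peak_vram_gb(models):
    """models: list of (name, vram_gb, always_on) tuples."""
    always_on = sum(vram for _, vram, on in models if on)
    on_demand = [vram for _, vram, on in models if not on]
    # On-demand models are swapped in one at a time, so only the
    # largest one contributes to peak VRAM.
    largest_on_demand = max(on_demand, default=0.0)
    return always_on + largest_on_demand + SYSTEM_OVERHEAD_GB

stack = [
    ("Llama 3.1 8B (Q4)", 5.0, True),   # always on
    ("Whisper Medium",    1.5, True),   # always on
    ("SDXL",              6.5, False),  # on demand
    ("Mochi video",      10.0, False),  # on demand
]
print(peak_vram_gb(stack))  # 5.0 + 1.5 + 10.0 + 0.5 = 17.0
```

Note that the 6.5 GB on-demand model drops out of the peak entirely: it never needs to be resident at the same time as the larger 10 GB model.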

Why This Matters

The AI landscape has shifted from single-model deployments to complex multi-model pipelines. Modern AI applications routinely combine a large language model for reasoning, a speech-to-text model for audio input, a text-to-speech model for audio output, an image generation model for visual content, and an embedding model for retrieval-augmented generation. Each model competes for the same finite VRAM, and exceeding capacity means either the stack fails to load or models must be swapped in and out with significant latency penalties.

GPU hardware represents one of the largest capital expenditures in AI development. An NVIDIA A100 80GB costs $15,000-$20,000, and even consumer GPUs like the RTX 4090 (24 GB) run $1,600-$2,000. Making the wrong hardware purchase because you did not accurately estimate your stack's total VRAM requirement is an expensive mistake that can set a project back by weeks while you wait for replacement hardware. Cloud GPU costs compound this — renting an 80 GB A100 costs $2-4/hour, so overprovisioning wastes money while underprovisioning causes OOM crashes.

This calculator solves the multi-model planning problem that single-model VRAM calculators cannot. By letting you build your complete stack — with both always-on and on-demand models — it reveals whether your hardware can handle your intended pipeline before you commit to purchases or cloud contracts.

Real-World Examples

Scenario 1: The Local Voice Assistant — A developer building a privacy-first voice assistant needs Whisper Medium (1.5 GB) for speech recognition, a Q4 Llama 3 8B (5 GB) for conversation, and Piper TTS (0.3 GB) for speech synthesis. With 0.5 GB overhead, the total is 7.3 GB — fitting comfortably on a 12 GB RTX 4070. Before running the numbers, the developer was considering a 24 GB GPU; knowing 12 GB is sufficient saved about $600.

Scenario 2: The Creative Studio Setup — A freelance artist wants Stable Diffusion XL (6.5 GB at FP16), a Q4 Mistral 7B (4 GB) for prompt refinement, and CLIP ViT-L (1.5 GB) for image understanding, all on an RTX 4090 (24 GB). The calculator shows 12.5 GB peak usage — plenty of room. Adding a video generation model (Mochi, 10 GB) as on-demand brings the peak to 22.5 GB — tight but feasible.

Scenario 3: The RAG Pipeline — An enterprise team deploying a retrieval-augmented generation system needs an embedding model (BGE Large, 0.7 GB) running always-on for document indexing, plus a Q8 Llama 3 70B (70 GB) for generation. The calculator immediately shows they need at least an A100 80GB or dual RTX 4090s. This analysis saved weeks of debugging OOM errors on inadequate hardware.
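The scenario totals are straightforward sums under the always-on rule; a quick sanity check in Python, with numbers copied from the scenarios above and the 0.5 GB overhead applied:

```python
OVERHEAD_GB = 0.5

voice_assistant = [1.5, 5.0, 0.3]  # Whisper Medium, Q4 Llama 3 8B, Piper TTS
creative_studio = [6.5, 4.0, 1.5]  # SDXL, Q4 Mistral 7B, CLIP ViT-L
mochi_on_demand = 10.0             # largest (and only) on-demand model

print(sum(voice_assistant) + OVERHEAD_GB)                    # 7.3
print(sum(creative_studio) + OVERHEAD_GB)                    # 12.5
print(sum(creative_studio) + mochi_on_demand + OVERHEAD_GB)  # 22.5
```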

Methodology & Sources

This calculator aggregates VRAM requirements for multiple AI models running simultaneously, using the same per-model calculation methodology as our LLM VRAM Checker. For each model in the stack, VRAM is estimated based on parameter count, quantization level, and context length. The total VRAM calculation sums individual model requirements plus a shared overhead estimate for the inference framework and CUDA context. When models share the same GPU, there is typically 200-500 MB of additional fixed overhead per model instance beyond the model weights.

Model specifications are sourced from official model cards on Hugging Face and manufacturer documentation. GPU specifications come from NVIDIA, AMD, and Apple official data sheets. Categories covered: Large Language Models, Embedding Models, Image Generation (Stable Diffusion, DALL-E equivalents), Speech-to-Text (Whisper variants), Text-to-Speech, and Video Generation models.

Limitations: Running multiple models simultaneously on a single GPU requires VRAM for all models to be loaded at once. Some inference frameworks support dynamic model loading/unloading, which would reduce peak VRAM usage but increase latency. Actual memory usage varies by framework, batch size, and concurrent request volume.
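The per-model estimate can be sketched as parameter count times bytes per parameter, plus overheads. The bytes-per-parameter table and the 0.35 GB per-instance overhead below are assumptions chosen for illustration (the overhead falls within the 200-500 MB range noted above):

```python
# Approximate bytes per parameter at common precision levels
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def model_vram_gb(params_billion, quant="fp16",
                  context_overhead_gb=0.0, cuda_context_gb=0.35):
    """Rough estimate: weights + KV/context memory + fixed per-instance
    framework/CUDA overhead (0.35 GB assumed for this sketch)."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + context_overhead_gb + cuda_context_gb

# A 70B model: ~140 GB of weights at FP16, ~35 GB at Q4
print(round(model_vram_gb(70, "fp16"), 2))  # 140.35
print(round(model_vram_gb(70, "q4"), 2))    # 35.35
```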

Common Mistakes to Avoid

1. Forgetting CUDA context overhead — Each model loaded on a GPU consumes 200-500 MB of CUDA context memory beyond its weights. Loading 5 models can add 1-2.5 GB of overhead that is invisible in model-card specifications. This calculator includes per-model overhead in the total.

2. Confusing always-on and on-demand VRAM calculations — Some users add up all model sizes for peak VRAM. In practice, on-demand models are only loaded when needed. Peak VRAM equals all always-on models plus the single largest on-demand model — not all on-demand models simultaneously.

3. Ignoring quantization opportunities — Many users plan their stacks using FP16 (full precision) for all models. Applying Q4 quantization to LLMs reduces their VRAM by approximately 75% with only modest quality loss. A 70B parameter model drops from ~140 GB to ~35 GB at Q4 — the difference between needing an H100 and fitting on an RTX 4090.

4. Not accounting for batch size and concurrent requests — The VRAM calculations in this tool assume single-request inference. If your application serves multiple concurrent users, KV cache memory scales linearly with batch size. Serving 4 concurrent requests to a 7B LLM with 4K context can add 2-4 GB of additional VRAM requirement.

5. Overlooking system memory requirements — GPU VRAM is not the only bottleneck. System RAM must hold model weights during loading, plus intermediate tensors and the inference runtime. Ensure your system has at least 2x the total model size in system RAM, or loading will fail or thrash to swap.
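The KV-cache scaling in mistake 4 can be estimated from the model's shape. A sketch, assuming a Llama-3-8B-like configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and FP16 cache values — these numbers are illustrative assumptions, not measurements:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch_size,
                bytes_per_value=2):  # 2 bytes per value at FP16
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim
    * context length * batch size * bytes per value."""
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_len * batch_size * bytes_per_value)
    return total_bytes / 1024**3

# Llama-3-8B-like shape at 4K context:
single = kv_cache_gb(32, 8, 128, 4096, batch_size=1)  # 0.5 GB
four   = kv_cache_gb(32, 8, 128, 4096, batch_size=4)  # 2.0 GB, linear scaling
print(single, four)
```

Models without grouped-query attention (more KV heads) or with longer contexts grow this cache proportionally, which is why concurrent serving can add several GB on top of the weights.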

Frequently Asked Questions

What is an AI model stack?
An AI stack is a combination of multiple AI models running on the same GPU — for example, an LLM for text generation, a Stable Diffusion model for images, and Whisper for speech recognition. This tool helps you check if your GPU can handle them all.
What's the difference between Always On and On Demand?
Always On models stay loaded in VRAM at all times (e.g., your main LLM). On Demand models are loaded only when needed and unloaded after use. Peak VRAM = all Always On models + the single largest On Demand model.
How can I reduce my stack's VRAM usage?
Use Q4 quantization for LLMs (saves ~75% vs FP16). Set less-used models to On Demand mode. Use smaller model variants where quality is acceptable. Enable Low VRAM mode for TTS models that support it.
Can I run multiple models on a single GPU?
Yes, as long as the total VRAM requirement of all loaded models fits within your GPU's memory. Some inference frameworks like Ollama and text-generation-inference support loading multiple models simultaneously. Others can swap models in and out of VRAM on demand, which uses less peak memory but adds latency when switching between models. This calculator shows the peak VRAM needed when all selected models are loaded at once.
Does running multiple models slow down inference?
Running multiple models simultaneously can reduce per-model inference speed because they compete for GPU compute resources and memory bandwidth. The impact depends on whether models are being queried concurrently or sequentially. For sequential use (one model at a time), performance impact is minimal as long as all models fit in VRAM. For concurrent inference, expect some throughput reduction. Using separate GPUs for different models eliminates this contention.
What is the minimum GPU for a useful AI stack?
An NVIDIA RTX 4060 Ti (16 GB) can run a quantized 7B LLM (Q4, ~4.5 GB), Whisper Small (0.5 GB), and a TTS model (~1 GB) simultaneously with headroom to spare. For image generation, you will need at least 8 GB for SDXL. A 24 GB RTX 4090 or Apple M2 Ultra (64-192 GB unified) enables much more ambitious stacks including 13-70B LLMs alongside multimodal models.
How does Apple Silicon unified memory compare to dedicated GPU VRAM?
Apple Silicon (M1-M4) uses unified memory shared between CPU and GPU, meaning the full memory pool (16-192 GB) is available for model loading. However, bandwidth is lower than dedicated GPUs — M2 Ultra provides ~800 GB/s vs ~1,008 GB/s on an RTX 4090. For inference throughput, dedicated GPUs are faster per GB, but Apple Silicon allows loading much larger models without the VRAM ceiling.
Should I use one large GPU or two smaller GPUs for my stack?
One large GPU is generally simpler and avoids inter-GPU communication overhead. However, two GPUs allow isolating models — for example, your LLM on GPU 0 and Stable Diffusion on GPU 1, eliminating contention during concurrent use. Tensor parallelism across GPUs is complex to set up and adds latency. For most hobbyists, a single 24 GB GPU provides the best simplicity-to-capability ratio.
