
LLM VRAM Checker

Check if your GPU can run a specific LLM model.

Example result (24 GB GPU): Fits — 20.4 GB required / 24 GB available (85.1% VRAM usage).

Model Weights: 4.6 GB
KV Cache: 15.4 GB
Headroom: 3.6 GB

Tip: Q8 is the highest quality quantization that fits your GPU.

Last Updated: March 16, 2026

How It Works

VRAM is calculated using the formula: Model Weights (parameters × bytes per weight based on quantization) + KV Cache (scales with context length) + 0.5 GB overhead. Quantization reduces precision: Q4 uses ~0.57 bytes/param, Q8 uses 1 byte, FP16 uses 2 bytes, and FP32 uses 4 bytes. Lower quantization saves VRAM with a small quality trade-off. The KV cache component is computed from the model's layer count, number of attention heads, head dimension, and your configured context length — both key and value tensors are stored in FP16 format regardless of model quantization. For Mixture of Experts (MoE) architectures, all expert weights must reside in VRAM even though only a subset of experts activates per token, so the total parameter count determines memory rather than the active parameter count.
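The formula above can be sketched in Python. The bytes-per-parameter values and the 0.5 GB overhead come from the text; the function signature and the architecture numbers in the usage line (layers, KV heads, head dimension) are illustrative assumptions, not a real model config.

```python
# Sketch of the VRAM formula described above (byte values from the text;
# the architecture numbers used below are hypothetical).
BYTES_PER_PARAM = {"Q4": 0.57, "Q8": 1.0, "FP16": 2.0, "FP32": 4.0}

def estimate_vram_gb(params_b, quant, n_layers, n_kv_heads, head_dim, ctx_len):
    """Model weights + KV cache + 0.5 GB overhead, in GB."""
    weights = params_b * 1e9 * BYTES_PER_PARAM[quant]
    # KV cache: 2 tensors (K and V) x 2 bytes (FP16) per element,
    # per layer, per KV head, per head dimension, per token of context.
    kv_cache = 2 * 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return (weights + kv_cache) / 1e9 + 0.5

# Hypothetical 8B-class config at Q8 with an 8K context:
print(round(estimate_vram_gb(8, "Q8", 32, 8, 128, 8192), 1))  # → 9.6
```

Note that the KV cache is counted in FP16 regardless of the weight quantization, matching the text.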

Why This Matters

Understanding VRAM requirements is critical for anyone working with local LLMs, as GPU memory is the primary bottleneck determining which models you can run. With the rapid proliferation of open-source models — from 1B parameter lightweight models to 405B parameter behemoths like Llama 3.1 — the gap between model requirements and consumer hardware is significant. An RTX 4090 with 24 GB VRAM can comfortably run a 13B model at FP16 or a 70B model at Q4 quantization, but miscalculating these requirements leads to frustrating out-of-memory errors or unnecessarily degraded model quality. The financial stakes are substantial: enterprise GPU costs range from $10,000 (A6000 48GB) to $30,000+ (H100 80GB), while consumer GPUs span $300-$2,000. Choosing the right quantization level can mean the difference between needing a $500 RTX 4060 Ti 16GB and a $1,600 RTX 4090. For cloud deployments, VRAM directly translates to hourly costs — an A100 80GB instance costs roughly $3-4/hour versus $1-2/hour for a 40GB variant. This tool eliminates guesswork by providing precise VRAM calculations before you commit to hardware purchases or cloud instance selections, potentially saving hundreds or thousands of dollars in misallocated resources.

Real-World Examples

Scenario 1: Home AI Developer — A developer wants to run Llama 3 70B locally for coding assistance. Their RTX 3090 has 24 GB VRAM. At FP16 (140 GB required), it's impossible; even at Q4 quantization (~35 GB), it still doesn't fit. They discover they can run the 8B variant at Q8 (8.5 GB) with room for a 4K context window, or invest in two RTX 3090s with tensor parallelism for the 70B model.

Scenario 2: Startup Deployment — A startup needs to serve Mixtral 8x7B (46.7B total parameters) for their API. Although only 12.9B parameters are active per forward pass, all 46.7B must reside in VRAM. The calculator shows they need ~24 GB at Q4 or ~47 GB at Q8, helping them choose between a single A100 80GB and two A6000 48GB cards.

Scenario 3: Researcher on Apple Silicon — A researcher with an M3 Max (128 GB unified memory) wants to know the largest model they can run. The calculator shows that Llama 3 70B won't fit at FP16 (140 GB) but runs comfortably at Q8 (~70 GB) with ample room for a 32K context window, making Apple Silicon a cost-effective alternative to NVIDIA enterprise hardware.
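Scenario 2's numbers can be sanity-checked with a one-liner: for MoE models, the weight footprint depends on total parameters, never active ones. A minimal sketch (the 46.7B figure is from the text; the bytes-per-parameter values follow the Methodology section):

```python
def moe_weight_gb(total_params_b, bytes_per_param):
    # MoE: all expert weights stay resident, so TOTAL params determine memory.
    return total_params_b * bytes_per_param  # billions of params x bytes = GB

print(round(moe_weight_gb(46.7, 0.5), 1))  # Q4_K_M, ~0.5 bytes/param → 23.4
print(round(moe_weight_gb(46.7, 1.0), 1))  # Q8_0,   ~1 byte/param   → 46.7
```

These match the ~24 GB (Q4) and ~47 GB (Q8) figures once a small runtime overhead is added.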

Methodology & Sources

This calculator estimates VRAM requirements using the fundamental relationship between model parameters and memory. At full precision (FP32), each parameter requires 4 bytes; at half-precision (FP16/BF16), 2 bytes per parameter. Quantized formats reduce this further — Q8_0 uses approximately 1 byte per parameter, Q4_K_M roughly 0.5 bytes, and Q2_K approximately 0.25 bytes. These values are derived from the GGML quantization specification and validated against empirical measurements from llama.cpp memory profiling.

The KV-cache memory is estimated from the model's architecture: number of layers, attention heads, head dimension, and the configured context length. The formula accounts for both key and value tensors stored in FP16 format. For models using Grouped Query Attention (GQA), such as Llama 2 70B and Mistral, the KV cache is reduced proportionally to the GQA ratio (e.g., 8:1 GQA uses 1/8th the KV cache of standard multi-head attention).

Our GPU database includes specifications from NVIDIA (GeForce RTX 30/40/50 series, Quadro, Tesla, A100, H100, B200 series), AMD (Radeon RX 7000 series, Instinct MI250/MI300 series), and Apple Silicon (M1-M4 with unified memory). Specifications are sourced from official manufacturer data sheets and verified against hardware benchmarking databases.

Comparative methods: Alternative approaches to VRAM estimation include profiling-based measurement (running the model and observing nvidia-smi output) and framework-specific calculators. Our analytical approach offers the advantage of pre-deployment estimation but may differ from profiled values. For maximum accuracy, we recommend cross-referencing our estimates with empirical measurements from the Hugging Face Model Memory Calculator.

Limitations: Actual VRAM usage varies by inference framework (llama.cpp, vLLM, TGI, Ollama, ExLlamaV2), operating system overhead, CUDA context initialization (~300-800 MB), and specific model architecture details. Our estimates include a 10% overhead buffer, but real-world usage may still differ by 5-15%. Multi-GPU setups using tensor parallelism may have additional communication overhead not captured in these estimates. Flash Attention and Paged Attention optimizations (used in vLLM) can reduce KV cache memory by 50-70% compared to standard implementations; this calculator uses conservative estimates without these optimizations.
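The GQA reduction described above can be written as a short function. The FP16 KV assumption and the 8:1 example follow the text; the layer, head, and context numbers in the usage lines are illustrative, not a real model config.

```python
def kv_cache_gb(n_layers, n_heads, head_dim, ctx_len, gqa_ratio=1):
    """KV cache in GB: K and V tensors, each stored in FP16 (2 bytes).

    GQA shares one KV head across gqa_ratio query heads, shrinking the
    cache by that factor versus standard multi-head attention.
    """
    kv_heads = n_heads // gqa_ratio
    return 2 * 2 * n_layers * kv_heads * head_dim * ctx_len / 1e9

# Same hypothetical 70B-class architecture, MHA vs 8:1 GQA at 4K context:
mha = kv_cache_gb(80, 64, 128, 4096, gqa_ratio=1)
gqa = kv_cache_gb(80, 64, 128, 4096, gqa_ratio=8)
print(round(mha / gqa, 1))  # → 8.0
```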

Common Mistakes to Avoid

1. Ignoring KV cache memory: Many users account only for model weights and forget that context length dramatically affects VRAM. A 7B model at Q4 needs ~4 GB for weights, but a 128K context window can add 4-8 GB of KV cache on top. Always factor in your required context length.

2. Confusing active parameters with total parameters in MoE models: Mixtral 8x7B has 46.7B total parameters but only 12.9B active per token. Users often assume they only need VRAM for 12.9B parameters — this is wrong. All 46.7B must be loaded into memory.

3. Forgetting OS and framework overhead: Your GPU's advertised VRAM isn't entirely available. Windows reserves 300-500 MB, CUDA context takes another 300-800 MB, and the inference framework itself consumes 200-500 MB. On a 24 GB GPU, you realistically have 22-23 GB available.

4. Over-quantizing for marginal VRAM savings: Dropping from Q4 to Q2 roughly halves weight memory again but causes severe quality degradation — perplexity increases by 15-30% compared to Q4's modest 2-5% increase. Q4 is generally the sweet spot for quality-to-size ratio.

5. Not considering batch size for serving: Inference serving with batch sizes greater than 1 requires additional VRAM for concurrent KV caches; each request keeps its own, so serving 8 concurrent users can multiply the KV cache requirement several times over compared to single-user inference.
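Mistake 5 can be made concrete: each concurrent request keeps its own KV cache, so in the worst case (every user at full context) serving memory grows linearly with batch size. A sketch under the same FP16 KV assumption as the Methodology section (the architecture numbers are hypothetical):

```python
def serving_kv_gb(n_layers, kv_heads, head_dim, ctx_len, batch_size):
    # One independent FP16 K/V cache per concurrent request.
    per_user = 2 * 2 * n_layers * kv_heads * head_dim * ctx_len / 1e9
    return per_user * batch_size

one = serving_kv_gb(32, 8, 128, 8192, batch_size=1)
eight = serving_kv_gb(32, 8, 128, 8192, batch_size=8)
print(round(one, 2), round(eight, 2))  # → 1.07 8.59
```

In practice users rarely all sit at full context at once, but serving deployments should still budget far more KV headroom than single-user inference.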

Frequently Asked Questions

What is VRAM and why does it matter for LLMs?
VRAM (Video RAM) is the memory on your GPU. LLMs must load their weights into VRAM to run. If your GPU doesn't have enough VRAM, the model either won't load or will run extremely slowly with CPU offloading.
What is quantization and how does it affect quality?
Quantization reduces the precision of model weights to save memory. Q4 (4-bit) uses ~4x less VRAM than FP16 with a modest quality loss. Q8 (8-bit) is a good balance. FP16 is full quality. For most use cases, Q4 or Q8 is sufficient.
Can I run a model that exceeds my VRAM?
Partially — tools like llama.cpp support CPU offloading, where some layers run on system RAM instead of VRAM. This works but is significantly slower. Apple Silicon Macs can use unified memory, making large models more accessible.
How does context length affect VRAM usage?
Longer context windows require more KV cache memory. A 128K context model uses substantially more VRAM than a 4K context model. If you don't need long context, reducing it can save significant VRAM.
What does MoE (Mixture of Experts) mean for VRAM?
MoE models like Mixtral have many total parameters but only activate a fraction at once. However, all parameters must still be loaded into VRAM. The 'active parameters' only affect compute speed, not memory requirements.
How does Apple Silicon unified memory differ from discrete GPU VRAM?
Apple Silicon (M1-M4) uses unified memory shared between CPU and GPU, meaning all system RAM is accessible as VRAM. A 64 GB M2 Max can load models up to ~60 GB after OS overhead. While memory bandwidth is lower than NVIDIA GPUs (400 GB/s vs. 3,350 GB/s on H100), it enables running models that would otherwise require expensive enterprise hardware.
What is the difference between GGUF, GPTQ, and AWQ quantization formats?
GGUF (used by llama.cpp/Ollama) supports CPU+GPU hybrid inference and offers Q2 through Q8 variants. GPTQ and AWQ are GPU-only formats optimized for CUDA, often yielding slightly better quality at the same bit width. GPTQ uses calibration data for optimal weight rounding, while AWQ protects salient weights. For pure GPU inference, GPTQ/AWQ can be 10-15% faster than GGUF.
How much VRAM do I need for fine-tuning versus inference?
Fine-tuning requires significantly more VRAM than inference because optimizer states and gradients must be stored. Full fine-tuning of a 7B model in FP16 needs ~28 GB, while inference needs only ~14 GB. LoRA reduces fine-tuning overhead to roughly 1.2-1.5x inference VRAM by training only small adapter layers — typically 1-4% of total parameters.

Related Guides

Learn more about the concepts behind this tool