SlothLab Tools

AI Model Stack Builder

Calculate the total VRAM for a multi-model AI setup.

Quick Start:

Model Browser

DeepSeek R1 14B (Popular)
14B · 28.0 GB
DeepSeek R1 32B (Popular)
32B · 64.0 GB
DeepSeek R1 70B (Popular)
70B · 140.0 GB
DeepSeek R1 7B (Popular)
7B · 14.0 GB
DeepSeek R1 8B (Popular)
8B · 16.0 GB
Gemma 3 12B (Popular)
12B · 24.0 GB
Llama 3.1 70B (Popular)
70B · 140.0 GB
Llama 3.1 8B (Popular)
8B · 16.0 GB
Llama 3.3 70B (Popular)
70B · 140.0 GB
Llama 4 Scout 109B (Popular)
109B · 218.0 GB
Ministral 7B (Popular)
7B · 14.0 GB
Mistral 7B v0.3 (Popular)
7B · 14.0 GB
Mistral NeMo 12B (Popular)
12B · 24.0 GB
Phi-3 Medium 14B (Popular)
14B · 28.0 GB
Phi-4 14B (Popular)
14B · 28.0 GB
Qwen 2.5 14B (Popular)
14B · 28.0 GB
Qwen 2.5 7B (Popular)
7B · 14.0 GB
Qwen 3 14B (Popular)
14B · 28.0 GB
Qwen 3 32B (Popular)
32B · 64.0 GB
Qwen 3 8B (Popular)
8B · 16.0 GB
Command R 35B
35B · 70.0 GB
Command R+ 104B
104B · 208.0 GB
DeepSeek R1 1.5B
1.5B · 3.0 GB
DeepSeek V2 236B
236B · 472.0 GB
DeepSeek V2 Lite 16B
16B · 32.0 GB
DeepSeek V3 671B
671B · 1342.0 GB
Gemma 2 27B
27B · 54.0 GB
Gemma 2 2B
2B · 4.0 GB
Gemma 2 9B
9B · 18.0 GB
Gemma 3 1B
1B · 2.0 GB
Gemma 3 27B
27B · 54.0 GB
Gemma 3 4B
4B · 8.0 GB
Llama 3.1 405B
405B · 810.0 GB
Llama 3.2 1B
1B · 2.0 GB
Llama 3.2 3B
3B · 6.0 GB
Llama 4 Maverick 400B
400B · 800.0 GB
Ministral 3B
3B · 6.0 GB
Mistral Large 3 675B
675B · 1350.0 GB
Mistral Small 24B
24B · 48.0 GB
Mixtral 8x22B
141B · 282.0 GB
Mixtral 8x7B
46.7B · 93.4 GB
Phi-3 Mini 3.8B
3.8B · 7.6 GB
Phi-4 Mini 3.8B
3.8B · 7.6 GB
Qwen 2.5 0.5B
0.5B · 1.0 GB
Qwen 2.5 1.5B
1.5B · 3.0 GB
Qwen 2.5 32B
32B · 64.0 GB
Qwen 2.5 3B
3B · 6.0 GB
Qwen 2.5 72B
72B · 144.0 GB
Qwen 3 0.6B
0.6B · 1.2 GB
Qwen 3 1.7B
1.7B · 3.4 GB
Qwen 3 4B
4B · 8.0 GB
Qwen 3 MoE 235B-A22B
235B · 470.0 GB
Qwen 3 MoE 30B-A3B
30B · 60.0 GB

My Stack

Add models from the browser to start building your stack.

Hardware

Last updated: March 16, 2026

How It Works

Build your stack by adding AI models from six categories. Each model's VRAM is calculated from its parameter count and settings. 'Always loaded' models are resident in VRAM simultaneously; of the 'load on demand' models, only the largest one is added to peak VRAM. The total includes 0.5 GB of system overhead.
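The peak-VRAM rule above can be sketched in a few lines. This is a minimal illustration of the stated formula, not the tool's actual code; the function and variable names are made up for the example.

```python
# Peak VRAM = sum of always-loaded models
#           + largest on-demand model
#           + fixed 0.5 GB system overhead (per the rule above)
SYSTEM_OVERHEAD_GB = 0.5

def peak_vram_gb(always_loaded, on_demand):
    """always_loaded / on_demand: lists of per-model VRAM figures in GB."""
    base = sum(always_loaded)
    swap = max(on_demand, default=0.0)  # only the largest on-demand model counts
    return base + swap + SYSTEM_OVERHEAD_GB

# Example: a 16 GB LLM always loaded; Whisper (3 GB) and an image
# model (8 GB) on demand -> 16 + 8 + 0.5 = 24.5 GB peak
print(peak_vram_gb([16.0], [3.0, 8.0]))  # 24.5
```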

Methodology & Sources

This calculator aggregates VRAM requirements for multiple AI models running simultaneously, using the same per-model calculation methodology as our LLM VRAM Checker. For each model in the stack, VRAM is estimated based on parameter count, quantization level, and context length. The total VRAM calculation sums individual model requirements plus a shared overhead estimate for the inference framework and CUDA context. When models share the same GPU, there is typically 200-500 MB of additional fixed overhead per model instance beyond the model weights.

Model specifications are sourced from official model cards on Hugging Face and manufacturer documentation. GPU specifications come from NVIDIA, AMD, and Apple official data sheets. Categories covered: Large Language Models, Embedding Models, Image Generation (Stable Diffusion, DALL-E equivalents), Speech-to-Text (Whisper variants), Text-to-Speech, and Video Generation models.

Limitations: Running multiple models simultaneously on a single GPU requires VRAM for all models to be loaded at once. Some inference frameworks support dynamic model loading/unloading, which would reduce peak VRAM usage but increase latency. Actual memory usage varies by framework, batch size, and concurrent request volume.
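The per-model estimate described above can be sketched as follows. The constants here (bytes per parameter by quantization level, a fixed per-instance overhead in the 200-500 MB range) are illustrative assumptions, not SlothLab's exact methodology; note that the browser list appears to show weights-only FP16 figures (e.g., 7B → 14.0 GB).

```python
# Rough per-model VRAM estimate: weights from parameter count and
# quantization, plus a fixed framework/CUDA-context overhead.
# All constants are assumptions for illustration.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}
PER_INSTANCE_OVERHEAD_GB = 0.35  # within the 200-500 MB range noted above

def model_vram_gb(params_b, quant="fp16"):
    """params_b: parameter count in billions; quant: quantization level."""
    # 1B params at 1 byte/param ~ 1 GB, so GB = billions * bytes-per-param
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb + PER_INSTANCE_OVERHEAD_GB

print(round(model_vram_gb(7, "fp16"), 2))  # 14.35
print(round(model_vram_gb(70, "q4"), 2))   # 35.35
```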

Frequently Asked Questions

What is an AI model stack?
An AI stack is a combination of multiple AI models running on the same GPU, for example an LLM for text generation, Stable Diffusion for images, and Whisper for speech recognition. This tool shows whether your GPU can handle all of them.
What is the difference between 'always loaded' and 'load on demand'?
An 'always loaded' model stays resident in VRAM at all times (e.g., your main LLM). A 'load on demand' model is loaded only when needed and unloaded after use. Peak VRAM = all 'always loaded' models + the largest 'load on demand' model.
How can I reduce my stack's VRAM usage?
Use Q4 quantization for LLMs (roughly 75% savings versus FP16). Set rarely used models to 'load on demand'. Use smaller model variants where the quality is acceptable. Enable low-VRAM mode in TTS models that support it.
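The ~75% figure follows directly from bytes per parameter. A back-of-envelope check, assuming FP16 at 2 bytes/parameter and Q4 at roughly 0.5 bytes/parameter:

```python
# Weights-only comparison for a 70B model (e.g., Llama 3.3 70B)
fp16_gb = 70 * 2.0  # FP16: 2 bytes/param -> 140 GB
q4_gb   = 70 * 0.5  # Q4: ~0.5 bytes/param -> 35 GB
savings = 1 - q4_gb / fp16_gb
print(f"{savings:.0%}")  # 75%
```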
Can I run multiple models on a single GPU?
Yes, as long as the total VRAM requirement of all loaded models fits within your GPU's memory. Some inference frameworks like Ollama and text-generation-inference support loading multiple models simultaneously. Others can swap models in and out of VRAM on demand, which uses less peak memory but adds latency when switching between models. This calculator shows the peak VRAM needed when all selected models are loaded at once.
Does running multiple models slow down inference?
Running multiple models simultaneously can reduce per-model inference speed because they compete for GPU compute resources and memory bandwidth. The impact depends on whether models are being queried concurrently or sequentially. For sequential use (one model at a time), performance impact is minimal as long as all models fit in VRAM. For concurrent inference, expect some throughput reduction. Using separate GPUs for different models eliminates this contention.

Related Guides

Learn more about the concepts behind this tool.