AI Model Stack Builder
Calculate the total VRAM for a multi-model AI setup.
Quick start:
Model Browser
- DeepSeek R1 14B (Popular): 14B · 28.0 GB
- DeepSeek R1 32B (Popular): 32B · 64.0 GB
- DeepSeek R1 70B (Popular): 70B · 140.0 GB
- DeepSeek R1 7B (Popular): 7B · 14.0 GB
- DeepSeek R1 8B (Popular): 8B · 16.0 GB
- Gemma 3 12B (Popular): 12B · 24.0 GB
- Llama 3.1 70B (Popular): 70B · 140.0 GB
- Llama 3.1 8B (Popular): 8B · 16.0 GB
- Llama 3.3 70B (Popular): 70B · 140.0 GB
- Llama 4 Scout 109B (Popular): 109B · 218.0 GB
- Ministral 7B (Popular): 7B · 14.0 GB
- Mistral 7B v0.3 (Popular): 7B · 14.0 GB
- Mistral NeMo 12B (Popular): 12B · 24.0 GB
- Phi-3 Medium 14B (Popular): 14B · 28.0 GB
- Phi-4 14B (Popular): 14B · 28.0 GB
- Qwen 2.5 14B (Popular): 14B · 28.0 GB
- Qwen 2.5 7B (Popular): 7B · 14.0 GB
- Qwen 3 14B (Popular): 14B · 28.0 GB
- Qwen 3 32B (Popular): 32B · 64.0 GB
- Qwen 3 8B (Popular): 8B · 16.0 GB
- Command R 35B: 35B · 70.0 GB
- Command R+ 104B: 104B · 208.0 GB
- DeepSeek R1 1.5B: 1.5B · 3.0 GB
- DeepSeek V2 236B: 236B · 472.0 GB
- DeepSeek V2 Lite 16B: 16B · 32.0 GB
- DeepSeek V3 671B: 671B · 1342.0 GB
- Gemma 2 27B: 27B · 54.0 GB
- Gemma 2 2B: 2B · 4.0 GB
- Gemma 2 9B: 9B · 18.0 GB
- Gemma 3 1B: 1B · 2.0 GB
- Gemma 3 27B: 27B · 54.0 GB
- Gemma 3 4B: 4B · 8.0 GB
- Llama 3.1 405B: 405B · 810.0 GB
- Llama 3.2 1B: 1B · 2.0 GB
- Llama 3.2 3B: 3B · 6.0 GB
- Llama 4 Maverick 400B: 400B · 800.0 GB
- Ministral 3B: 3B · 6.0 GB
- Mistral Large 3 675B: 675B · 1350.0 GB
- Mistral Small 24B: 24B · 48.0 GB
- Mixtral 8x22B: 141B · 282.0 GB
- Mixtral 8x7B: 46.7B · 93.4 GB
- Phi-3 Mini 3.8B: 3.8B · 7.6 GB
- Phi-4 Mini 3.8B: 3.8B · 7.6 GB
- Qwen 2.5 0.5B: 0.5B · 1.0 GB
- Qwen 2.5 1.5B: 1.5B · 3.0 GB
- Qwen 2.5 32B: 32B · 64.0 GB
- Qwen 2.5 3B: 3B · 6.0 GB
- Qwen 2.5 72B: 72B · 144.0 GB
- Qwen 3 0.6B: 0.6B · 1.2 GB
- Qwen 3 1.7B: 1.7B · 3.4 GB
- Qwen 3 4B: 4B · 8.0 GB
- Qwen 3 MoE 235B-A22B: 235B · 470.0 GB
- Qwen 3 MoE 30B-A3B: 30B · 60.0 GB
My Stack
Add models from the browser to start building your stack.
Hardware
Last updated: March 16, 2026
How It Works
Build your stack by adding AI models from six categories. Each model's VRAM is calculated from its parameter count and settings. 'Always loaded' models are loaded simultaneously, while for 'load on demand' models only the largest one is added to peak VRAM. The total includes 0.5 GB of system overhead.
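The peak VRAM rule above can be sketched in a few lines of Python. This is an illustrative simplification of the calculation the tool describes, not its actual source; the model sizes in the example are assumptions taken from the browser list.

```python
SYSTEM_OVERHEAD_GB = 0.5  # fixed system overhead included in the total

def peak_vram_gb(always_loaded, on_demand):
    """Estimate peak VRAM in GB for a model stack.

    always_loaded / on_demand: lists of per-model VRAM figures in GB.
    """
    resident = sum(always_loaded)                  # all 'always loaded' models coexist
    largest_on_demand = max(on_demand, default=0)  # only the biggest on-demand model counts
    return resident + largest_on_demand + SYSTEM_OVERHEAD_GB

# Example: a 16 GB LLM always loaded, plus a 3 GB speech model and a
# 7 GB image model loaded on demand:
print(peak_vram_gb([16.0], [3.0, 7.0]))  # 23.5
```

Only the largest on-demand model is counted because on-demand models are assumed never to be resident at the same time.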
Methodology & Sources
This calculator aggregates VRAM requirements for multiple AI models running simultaneously, using the same per-model calculation methodology as our LLM VRAM Checker. For each model in the stack, VRAM is estimated based on parameter count, quantization level, and context length.
The total VRAM calculation sums individual model requirements plus a shared overhead estimate for the inference framework and CUDA context. When models share the same GPU, there is typically 200-500MB of additional fixed overhead per model instance beyond the model weights.
Model specifications are sourced from official model cards on Hugging Face and manufacturer documentation. GPU specifications come from NVIDIA, AMD, and Apple official data sheets.
Categories covered: Large Language Models, Embedding Models, Image Generation (Stable Diffusion, DALL-E equivalents), Speech-to-Text (Whisper variants), Text-to-Speech, and Video Generation models.
Limitations: Running multiple models simultaneously on a single GPU requires VRAM for all models to be loaded at once. Some inference frameworks support dynamic model loading/unloading, which would reduce peak VRAM usage but increase latency. Actual memory usage varies by framework, batch size, and concurrent request volume.
Frequently Asked Questions
What is an AI model stack?
An AI stack is a combination of multiple AI models running on the same GPU, for example an LLM for text generation, Stable Diffusion for images, and Whisper for speech recognition. This tool helps you check whether your GPU can handle all of them.
What is the difference between 'always loaded' and 'load on demand'?
'Always loaded' models stay resident in VRAM at all times (e.g. your main LLM). 'Load on demand' models are loaded only when needed and unloaded after use. Peak VRAM = all 'always loaded' models + the largest 'load on demand' model.
How can I reduce my stack's VRAM usage?
Use Q4 quantization for LLMs (roughly 75% savings versus FP16). Set infrequently used models to 'load on demand'. Use smaller model variants where quality allows. Enable low-VRAM mode on TTS models that support it.
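The quantization savings quoted above follow from bytes per parameter. A rough sketch, using approximate rule-of-thumb figures (the exact bytes per weight vary by quantization format):

```python
# Approximate bytes per parameter by quantization level (rule of thumb).
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def weights_gb(params_billions, quant="fp16"):
    """Rough weight-memory estimate in GB for a model of the given size."""
    return params_billions * BYTES_PER_PARAM[quant]

fp16 = weights_gb(14, "fp16")  # 28.0 GB, matching the 14B entries above
q4 = weights_gb(14, "q4")      # 7.0 GB
print(f"Q4 saves {1 - q4 / fp16:.0%} vs FP16")  # Q4 saves 75% vs FP16
```

Note this covers weights only; KV cache and activations add more on top, so real savings on total VRAM are somewhat lower.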
Can I run multiple models on a single GPU?
Yes, as long as the total VRAM requirement of all loaded models fits within your GPU's memory. Some inference frameworks like Ollama and text-generation-inference support loading multiple models simultaneously. Others can swap models in and out of VRAM on demand, which uses less peak memory but adds latency when switching between models. This calculator shows the peak VRAM needed when all selected models are loaded at once.
Does running multiple models slow down inference?
Running multiple models simultaneously can reduce per-model inference speed because they compete for GPU compute resources and memory bandwidth. The impact depends on whether models are being queried concurrently or sequentially. For sequential use (one model at a time), performance impact is minimal as long as all models fit in VRAM. For concurrent inference, expect some throughput reduction. Using separate GPUs for different models eliminates this contention.
Related Tools
Related Guides
Learn more about the concepts behind this tool.