Choosing between an RTX 4090, an RTX 5090 and an H100 SXM5 for self-hosted AI compute in 2026 is rarely about the headline TFLOPS number. The right GPU is the one whose VRAM, memory bandwidth and price per inference hour fit the model class and batch shape you actually run. This guide walks through the four GPU tiers ServPrivacy ships, the workloads each one is sized for, and how to read the throughput numbers on the chart.
The four tiers in one paragraph
RTX 4090 (GPU-S, $249-329/mo) ships 24 GB of GDDR6X at ~1 TB/s of memory bandwidth and ~83 TFLOPS FP16. It is the right pick for 7B-13B language models, FLUX.1 / SDXL image generation, Whisper transcription, and Bark text-to-speech. RTX 5090 (GPU-M, $399-519/mo) bumps to 32 GB GDDR7 at ~1.8 TB/s and ~104 TFLOPS FP16; the extra 8 GB and ~80% bandwidth uplift unlock 27B-32B models comfortably (Gemma-3-27B, Qwen3-32B, Mistral-Small-3) and let you finetune small Llamas. H100 SXM5 (GPU-L, $1699-1899/mo) is a different category — 80 GB HBM3 at ~3.35 TB/s, ~989 TFLOPS FP16 (Tensor-Core), with NVLink-class fabric available; it is sized for 70B-class language models, longer-context inference, and faster training. 2× H100 SXM5 (GPU-XL, $3199-3599/mo) is for full-precision 70B inference, multi-GPU training, and 100B+ models at Q4 / Q5.
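For the sizing sketches later in this piece, here is the same catalog as a plain Python structure; the numbers are the ones quoted above, with bandwidth in GB/s and the entry-level monthly prices.

```python
# The four tiers quoted above as plain data, for reuse in the sketches below.
# Bandwidth in GB/s, VRAM in GB, price is the monthly entry point.
TIERS = {
    "GPU-S":  {"gpu": "RTX 4090",     "vram_gb": 24,  "bw_gbs": 1000, "fp16_tflops": 83,   "usd_mo": 249},
    "GPU-M":  {"gpu": "RTX 5090",     "vram_gb": 32,  "bw_gbs": 1800, "fp16_tflops": 104,  "usd_mo": 399},
    "GPU-L":  {"gpu": "H100 SXM5",    "vram_gb": 80,  "bw_gbs": 3350, "fp16_tflops": 989,  "usd_mo": 1699},
    "GPU-XL": {"gpu": "2x H100 SXM5", "vram_gb": 160, "bw_gbs": 6700, "fp16_tflops": 1978, "usd_mo": 3199},
}
```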

Memory bandwidth dominates LLM inference
For decoder-only transformer inference at batch sizes up to roughly 16, the bottleneck is memory bandwidth, not raw FLOPS. Every generated token forces a full read of the model weights from VRAM (the K-V cache saves recomputing attention over earlier tokens, but the weight matrices must be streamed again for every new token). The H100's 3.35 TB/s of HBM3 is what makes it ~3x faster per token than a 4090 on a 70B-class model — not the higher TFLOPS number; the ratio tracks the bandwidth gap almost exactly. This is also why the RTX 5090's jump from GDDR6X to GDDR7 (~1.8 TB/s vs ~1 TB/s) matters more for inference than the raw FLOPS bump. If your workload is dominated by inference rather than training, prioritize bandwidth over TFLOPS.
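A back-of-the-envelope sketch of that roofline: tokens per second for single-stream decoding is capped by bandwidth divided by the bytes of weights read per token. The 42 GB figure is the Q4 70B size from the next section; the function ignores K-V cache traffic and kernel overhead, so treat its output as an upper bound rather than a benchmark.

```python
def decode_tok_per_sec_ceiling(weight_gb: float, bw_gbs: float) -> float:
    """Bandwidth roofline for batch-1 decoding: every new token streams
    the full weight set, so tok/s <= bandwidth / weight bytes.
    Real throughput is lower (K-V cache reads, kernel overhead)."""
    return bw_gbs / weight_gb

# A 70B model at Q4 is ~42 GB of weights. The 4090 cannot actually hold
# that much; its line is here only to show where the per-token gap comes from.
for name in ("GPU-S", "GPU-M", "GPU-L"):
    t = TIERS[name]
    print(f"{t['gpu']}: <= {decode_tok_per_sec_ceiling(42, t['bw_gbs']):.0f} tok/s")
# H100 / 4090 = 3350 / 1000 = ~3.35x: the gap tracks bandwidth, not TFLOPS.
```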
What fits in 24 GB / 32 GB / 80 GB
Quantization changes the picture. At Q4_K_M (a typical "good quality" quant): a 7B model needs ~4.5 GB, a 13B needs ~8 GB, a 27-32B needs ~20 GB, a 70B needs ~42 GB, a 100B needs ~60 GB. Add ~10-15% headroom for K-V cache and CUDA workspace. The practical fits: 24 GB = 7B-13B comfortable, 27-32B with offload pain, 70B not viable. 32 GB = 27-32B comfortable, 70B with CPU offload (slow). 80 GB = 70B comfortable at Q4-Q5, 100B with offload. 160 GB (dual H100) = 70B at FP16 / BF16, 100-180B at Q4. At FP16 / BF16 (no quantization) you are back to 2 bytes per parameter, roughly 3x the Q4 footprint: a 70B at FP16 needs ~140 GB, which is why 2× H100 is the entry point for full-precision flagship-model inference.
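The same arithmetic as a function, assuming ~0.6 bytes per parameter for Q4_K_M (which reproduces the ~42 GB figure for 70B) and 2 bytes for FP16, plus the 10-15% headroom mentioned above. These are rules of thumb, not exact quant file sizes.

```python
BYTES_PER_PARAM = {"q4_k_m": 0.6, "q5_k_m": 0.7, "fp16": 2.0}  # rough rules of thumb

def vram_needed_gb(params_b: float, quant: str = "q4_k_m", headroom: float = 0.12) -> float:
    """Weights plus ~10-15% headroom for K-V cache and CUDA workspace."""
    return params_b * BYTES_PER_PARAM[quant] * (1 + headroom)

for size in (7, 13, 32, 70, 100):
    print(f"{size}B @ Q4_K_M: ~{vram_needed_gb(size):.0f} GB")
# 70B @ FP16 comes out near ~157 GB with headroom, which is why 2x H100
# (160 GB) is the entry point for full-precision flagship inference.
print(f"70B @ FP16: ~{vram_needed_gb(70, 'fp16'):.0f} GB")
```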
When the RTX 5090 is the right answer
The RTX 5090's release in early 2025 created a new sweet spot. For the 27B-32B-class models that matter most in 2026 (Gemma-3-27B, Qwen3-32B, Mistral-Small-3, Phi-4, DeepSeek-R1-Distill-Qwen-32B), the 5090 delivers roughly 2.5x the throughput of a 4090 at half the cost of an H100 (the bandwidth gap alone is ~1.8x; the rest comes from the 4090 having to squeeze a ~20 GB quant into 24 GB, leaving little room for K-V cache). If your workload is "I need a really capable assistant model with reasoning, multilingual support, and a 32K context window, but I don't need 70B+", the GPU-M tier is where you should start. It also doubles as a generous image-generation rig — FLUX.1-dev runs comfortably with VRAM headroom for high-resolution batches.
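Using the vram_needed_gb sketch from the previous section, a quick fit check for this model class on the 5090's 32 GB:

```python
gpu_m_vram = TIERS["GPU-M"]["vram_gb"]  # 32 GB
for model, size_b in [("Gemma-3-27B", 27), ("Qwen3-32B", 32), ("Llama-3.3-70B", 70)]:
    need = vram_needed_gb(size_b)  # Q4_K_M plus headroom
    verdict = "fits" if need <= gpu_m_vram else "needs offload or a bigger tier"
    print(f"{model}: ~{need:.0f} GB -> {verdict} in {gpu_m_vram} GB")
# 27B-32B lands around 18-22 GB, leaving spare VRAM for long-context
# K-V cache; 70B overflows, matching the tier guidance above.
```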
When you want an H100, not a 4090
Three signals push the buying decision up to GPU-L (single H100): (1) you serve 70B-class models (Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama-70B) and want sub-second time-to-first-token at batch 1; (2) you run high-batch concurrent inference (vLLM with 16+ concurrent users) where the H100's memory bandwidth is the bottleneck-breaker; (3) you train or LoRA-finetune on datasets above ~10M tokens and want a production-grade FP8 training path (the 4090 / 5090 have FP8 tensor cores, but the Transformer Engine recipe was built around Hopper). The H100's FP8 Transformer Engine roughly doubles training throughput vs FP16, and the 80 GB of HBM3 is what makes QLoRA-finetuning a 70B Llama feasible on a single card.
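A minimal vLLM sketch of signal (2), serving a 4-bit 70B to concurrent users on a single H100. The AWQ checkpoint name is an illustrative assumption (swap in whatever quant you actually serve); the constructor arguments shown are standard vLLM options.

```python
from vllm import LLM, SamplingParams

# Illustrative setup: a 70B instruct model in 4-bit AWQ (~42 GB of weights)
# on one 80 GB H100. The model ID below is an example AWQ repo, not a mandate.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    gpu_memory_utilization=0.92,  # leave workspace headroom
    max_model_len=8192,           # K-V cache budget; raise if VRAM allows
)

# vLLM's continuous batching covers the "16+ concurrent users" case:
# submit many prompts at once and the scheduler packs them against
# the bandwidth roof instead of serializing them.
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    [f"User {i}: summarize our Q3 report." for i in range(16)], params
)
for out in outputs:
    print(out.outputs[0].text[:80])
```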
$/token economics
For high-volume workloads, the right comparison is dollars per million tokens at sustained throughput. On Llama-3.1-70B Q4, vLLM 0.7+, batch 16: an RTX 4090 cannot host the model without offload (CPU-RAM offload kills throughput by ~10x). An RTX 5090 with CPU offload sits between the two, but its $/token varies too much with quant and offload split to pin to one number. A single H100 SXM5 sits around $1.40-2.20 per 1M output tokens at our $1699/mo entry price. Compare to OpenAI GPT-4o output at ~$10 / 1M and Claude Sonnet at ~$15 / 1M: the strict break-even against GPT-4o pricing lands near 6M output tokens per day, so once your workload reaches roughly 30M tokens per day, self-hosting on a single H100 is comfortably cheaper than calling hosted APIs, and the privacy outcome is end-to-end. For lower volumes, hosted APIs win on cost.
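The arithmetic behind those numbers, as a sketch. The one real unknown is sustained throughput; the 300-470 tok/s range below is simply what the $1.40-2.20 per 1M figure implies at $1699/mo, not a measured benchmark.

```python
def usd_per_million_tokens(usd_per_month: float, tok_per_sec: float) -> float:
    """Sustained self-hosting cost per 1M output tokens."""
    tokens_per_month = tok_per_sec * 30 * 24 * 3600
    return usd_per_month / (tokens_per_month / 1e6)

# H100 at $1699/mo: the $1.40-2.20 / 1M range corresponds to roughly
# 300-470 sustained output tok/s (70B Q4, vLLM, batch 16).
for tps in (300, 470):
    print(f"{tps} tok/s -> ${usd_per_million_tokens(1699, tps):.2f} / 1M tokens")

# Break-even vs a hosted API at $10 / 1M output tokens (GPT-4o class):
breakeven_tokens_per_day = 1699 / 10 * 1e6 / 30
print(f"break-even: ~{breakeven_tokens_per_day / 1e6:.1f}M tokens/day")
# ~5.7M/day; the 30M/day figure in the text clears it with wide margin.
```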
Image, video and audio workloads
Image generation rarely needs more than a 4090 — FLUX.1-dev, SDXL, SD 3.5 all fit in 24 GB at production quality, and the RTX 4090's ~83 TFLOPS FP16 is plenty. Going to the 5090 / H100 mostly buys you batch-size headroom (more concurrent generations) rather than per-image speed. AI video (Wan-2.1, CogVideoX-5B, Runway-class workflows) is more demanding — GPU-M is the practical entry point, GPU-L for production-quality long-form. Whisper Large v3 ASR and Bark TTS both run comfortably on the 4090; the H100 is overkill for them. LoRA or QLoRA finetuning of 7B-13B models works on a 4090; finetuning 32B-70B realistically wants a 5090 at minimum, an H100 if you value your time.
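As a concrete example of the 24 GB image path, a minimal diffusers sketch for FLUX.1-dev. The checkpoint ID and settings follow the model card; enable_model_cpu_offload is what keeps the pipeline inside a 4090's VRAM by loading each sub-model onto the GPU only while it runs.

```python
import torch
from diffusers import FluxPipeline

# Standard FLUX.1-dev checkpoint in bf16; per-model CPU offload means the
# text encoders, transformer, and VAE never sit in 24 GB all at once.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # swap sub-models to GPU only when needed

image = pipe(
    "product photo of a mechanical keyboard, studio lighting",
    height=1024, width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,  # model-card default for FLUX.1-dev
).images[0]
image.save("keyboard.png")
```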
What about RTX 5090 vs RTX A6000 / A100?
If you have looked at GPU options outside the consumer-card line, you may have come across the RTX A6000 (48 GB, workstation card) or the A100 (40 / 80 GB, prior-generation HBM2e). Quick verdict: the Ampere-era A6000 trails the 4090 on both compute and bandwidth but carries twice the VRAM, useful only if VRAM rather than bandwidth is your bottleneck (rare); the A100 is a generation behind the H100 and now mostly available on the secondary market — if you can find it cheap it remains a credible 70B-inference card, but new builds in 2026 are typically H100. We do not currently offer A6000 or A100 tiers; the catalog jumps from RTX 5090 to H100.
What we ship and what to pick
To summarize the GPU buying decision in one sentence per workload: chatbot / coding-assistant under 32B → GPU-S (RTX 4090) for 7B-13B, GPU-M (RTX 5090) for 27B-32B; flagship 70B inference (Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama-70B) → GPU-L (H100 SXM5); full-precision 70B or multi-GPU training → GPU-XL (2× H100 SXM5); image / video / voice generation → GPU-S unless you need batch headroom, then GPU-M. All four tiers ship with CUDA 12.4 + cuDNN preinstalled and 1-click vLLM / Ollama / ComfyUI / Stable Diffusion templates. The full hardware spec is on /gpu.