Choosing between an RTX 4090, an RTX 5090 and an H100 SXM5 for self-hosted AI compute in 2026 is rarely about the headline TFLOPS number. The right GPU is the one whose VRAM, memory bandwidth and price per inference hour fit the model class and batch shape you actually run. This guide walks through the four GPU tiers ServPrivacy ships, the workloads each one is sized for, and how to read the throughput numbers on the chart.
The four tiers in one paragraph
RTX 4090 (GPU-S, $249-329/mo) ships 24 GB of GDDR6X at ~1 TB/s of memory bandwidth and ~83 TFLOPS FP16. It is the right pick for 7B-13B language models, FLUX.1 / SDXL image generation, Whisper transcription, and Bark text-to-speech. RTX 5090 (GPU-M, $399-519/mo) bumps to 32 GB GDDR7 at ~1.8 TB/s and ~104 TFLOPS FP16; the extra 8 GB and ~80% bandwidth uplift unlock 27B-32B models comfortably (Gemma-3-27B, Qwen3-32B, Mistral-Small-3) and let you finetune small Llamas. H100 SXM5 (GPU-L, $1699-1899/mo) is a different category — 80 GB HBM3 at ~3.35 TB/s, ~989 TFLOPS FP16 (Tensor-Core), with NVLink-class fabric available; it is sized for 70B-class language models, longer-context inference, and faster training. 2× H100 SXM5 (GPU-XL, $3199-3599/mo) is for full-precision 70B inference, multi-GPU training, and 100B+ models at Q4 / Q5.
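For the sizing sketches later in this piece, here is the same catalog as a plain Python structure; the numbers are the ones quoted above, with bandwidth in GB/s and the entry-level monthly prices.

```python
# The four tiers quoted above as plain data, for reuse in the sketches below.
# Bandwidth in GB/s, VRAM in GB, price is the monthly entry point.
TIERS = {
    "GPU-S":  {"gpu": "RTX 4090",     "vram_gb": 24,  "bw_gbs": 1000, "fp16_tflops": 83,   "usd_mo": 249},
    "GPU-M":  {"gpu": "RTX 5090",     "vram_gb": 32,  "bw_gbs": 1800, "fp16_tflops": 104,  "usd_mo": 399},
    "GPU-L":  {"gpu": "H100 SXM5",    "vram_gb": 80,  "bw_gbs": 3350, "fp16_tflops": 989,  "usd_mo": 1699},
    "GPU-XL": {"gpu": "2x H100 SXM5", "vram_gb": 160, "bw_gbs": 6700, "fp16_tflops": 1978, "usd_mo": 3199},
}
```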

Memory bandwidth dominates LLM inference
For decoder-only transformer inference at batch sizes up to roughly 16, the bottleneck is memory bandwidth, not raw FLOPS. Every generated token forces a full read of the model weights from VRAM (the K-V cache saves recomputing attention over earlier tokens, but the weight matrices must be streamed again for every new token). The H100's 3.35 TB/s of HBM3 is what makes it ~3x faster per token than a 4090 on a 70B-class model — not the higher TFLOPS number; the ratio tracks the bandwidth gap almost exactly. This is also why the RTX 5090's jump from GDDR6X to GDDR7 (~1.8 TB/s vs ~1 TB/s) matters more for inference than the raw FLOPS bump. If your workload is dominated by inference rather than training, prioritize bandwidth over TFLOPS.
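A back-of-the-envelope sketch of that roofline: tokens per second for single-stream decoding is capped by bandwidth divided by the bytes of weights read per token. The 42 GB figure is the Q4 70B size from the next section; the function ignores K-V cache traffic and kernel overhead, so treat its output as an upper bound rather than a benchmark.

```python
def decode_tok_per_sec_ceiling(weight_gb: float, bw_gbs: float) -> float:
    """Bandwidth roofline for batch-1 decoding: every new token streams
    the full weight set, so tok/s <= bandwidth / weight bytes.
    Real throughput is lower (K-V cache reads, kernel overhead)."""
    return bw_gbs / weight_gb

# A 70B model at Q4 is ~42 GB of weights. The 4090 cannot actually hold
# that much; its line is here only to show where the per-token gap comes from.
for name in ("GPU-S", "GPU-M", "GPU-L"):
    t = TIERS[name]
    print(f"{t['gpu']}: <= {decode_tok_per_sec_ceiling(42, t['bw_gbs']):.0f} tok/s")
# H100 / 4090 = 3350 / 1000 = ~3.35x: the gap tracks bandwidth, not TFLOPS.
```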
What fits in 24 GB / 32 GB / 80 GB
Quantization changes the picture. At Q4_K_M (a typical "good quality" quant): a 7B model needs ~4.5 GB, a 13B needs ~8 GB, a 27-32B needs ~20 GB, a 70B needs ~42 GB, a 100B needs ~60 GB. Add ~10-15% headroom for K-V cache and CUDA workspace. The practical fits: 24 GB = 7B-13B comfortable, 27-32B with offload pain, 70B not viable. 32 GB = 27-32B comfortable, 70B with CPU offload (slow). 80 GB = 70B comfortable at Q4-Q5, 100B with offload. 160 GB (dual H100) = 70B at FP16 / BF16, 100-180B at Q4. At FP16 / BF16 (no quantization) you are back to 2 bytes per parameter, roughly 3x the Q4 footprint: a 70B at FP16 needs ~140 GB, which is why 2× H100 is the entry point for full-precision flagship-model inference.
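The same arithmetic as a function, assuming ~0.6 bytes per parameter for Q4_K_M (which reproduces the ~42 GB figure for 70B) and 2 bytes for FP16, plus the 10-15% headroom mentioned above. These are rules of thumb, not exact quant file sizes.

```python
BYTES_PER_PARAM = {"q4_k_m": 0.6, "q5_k_m": 0.7, "fp16": 2.0}  # rough rules of thumb

def vram_needed_gb(params_b: float, quant: str = "q4_k_m", headroom: float = 0.12) -> float:
    """Weights plus ~10-15% headroom for K-V cache and CUDA workspace."""
    return params_b * BYTES_PER_PARAM[quant] * (1 + headroom)

for size in (7, 13, 32, 70, 100):
    print(f"{size}B @ Q4_K_M: ~{vram_needed_gb(size):.0f} GB")
# 70B @ FP16 comes out near ~157 GB with headroom, which is why 2x H100
# (160 GB) is the entry point for full-precision flagship inference.
print(f"70B @ FP16: ~{vram_needed_gb(70, 'fp16'):.0f} GB")
```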
When the RTX 5090 is the right answer
The RTX 5090's release in early 2025 created a new sweet spot. For the 27B-32B-class models that matter most in 2026 (Gemma-3-27B, Qwen3-32B, Mistral-Small-3, Phi-4, DeepSeek-R1-Distill-Qwen-32B), the 5090 delivers roughly 2.5x the throughput of a 4090 at half the cost of an H100 (the bandwidth gap alone is ~1.8x; the rest comes from the 4090 having to squeeze a ~20 GB quant into 24 GB, leaving little room for K-V cache). If your workload is "I need a really capable assistant model with reasoning, multilingual support, and a 32K context window, but I don't need 70B+", the GPU-M tier is where you should start. It also doubles as a generous image-generation rig — FLUX.1-dev runs comfortably with VRAM headroom for high-resolution batches.
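Using the vram_needed_gb sketch from the previous section, a quick fit check for this model class on the 5090's 32 GB:

```python
gpu_m_vram = TIERS["GPU-M"]["vram_gb"]  # 32 GB
for model, size_b in [("Gemma-3-27B", 27), ("Qwen3-32B", 32), ("Llama-3.3-70B", 70)]:
    need = vram_needed_gb(size_b)  # Q4_K_M plus headroom
    verdict = "fits" if need <= gpu_m_vram else "needs offload or a bigger tier"
    print(f"{model}: ~{need:.0f} GB -> {verdict} in {gpu_m_vram} GB")
# 27B-32B lands around 18-22 GB, leaving spare VRAM for long-context
# K-V cache; 70B overflows, matching the tier guidance above.
```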
When you want an H100, not a 4090
Three signals push the buying decision up to GPU-L (single H100): (1) you serve 70B-class models (Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama-70B) and want sub-second time-to-first-token at batch 1; (2) you run high-batch concurrent inference (vLLM with 16+ concurrent users) where the H100's memory bandwidth is the bottleneck-breaker; (3) you train or LoRA-finetune on datasets above ~10M tokens and want a production-grade FP8 training path (the 4090 / 5090 have FP8 tensor cores, but the Transformer Engine recipe was built around Hopper). The H100's FP8 Transformer Engine roughly doubles training throughput vs FP16, and the 80 GB of HBM3 is what makes QLoRA-finetuning a 70B Llama feasible on a single card.
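A minimal vLLM sketch of signal (2), serving a 4-bit 70B to concurrent users on a single H100. The AWQ checkpoint name is an illustrative assumption (swap in whatever quant you actually serve); the constructor arguments shown are standard vLLM options.

```python
from vllm import LLM, SamplingParams

# Illustrative setup: a 70B instruct model in 4-bit AWQ (~42 GB of weights)
# on one 80 GB H100. The model ID below is an example AWQ repo, not a mandate.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    gpu_memory_utilization=0.92,  # leave workspace headroom
    max_model_len=8192,           # K-V cache budget; raise if VRAM allows
)

# vLLM's continuous batching covers the "16+ concurrent users" case:
# submit many prompts at once and the scheduler packs them against
# the bandwidth roof instead of serializing them.
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    [f"User {i}: summarize our Q3 report." for i in range(16)], params
)
for out in outputs:
    print(out.outputs[0].text[:80])
```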
$/token economics
For high-volume workloads, the right comparison is dollars per million tokens at sustained throughput. On Llama-3.1-70B Q4, vLLM 0.7+, batch 16: an RTX 4090 cannot host the model without offload (CPU-RAM offload kills throughput by ~10x). An RTX 5090 with CPU offload sits between the two, but its $/token varies too much with quant and offload split to pin to one number. A single H100 SXM5 sits around $1.40-2.20 per 1M output tokens at our $1699/mo entry price. Compare to OpenAI GPT-4o output at ~$10 / 1M and Claude Sonnet at ~$15 / 1M: the strict break-even against GPT-4o pricing lands near 6M output tokens per day, so once your workload reaches roughly 30M tokens per day, self-hosting on a single H100 is comfortably cheaper than calling hosted APIs, and the privacy outcome is end-to-end. For lower volumes, hosted APIs win on cost.
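The arithmetic behind those numbers, as a sketch. The one real unknown is sustained throughput; the 300-470 tok/s range below is simply what the $1.40-2.20 per 1M figure implies at $1699/mo, not a measured benchmark.

```python
def usd_per_million_tokens(usd_per_month: float, tok_per_sec: float) -> float:
    """Sustained self-hosting cost per 1M output tokens."""
    tokens_per_month = tok_per_sec * 30 * 24 * 3600
    return usd_per_month / (tokens_per_month / 1e6)

# H100 at $1699/mo: the $1.40-2.20 / 1M range corresponds to roughly
# 300-470 sustained output tok/s (70B Q4, vLLM, batch 16).
for tps in (300, 470):
    print(f"{tps} tok/s -> ${usd_per_million_tokens(1699, tps):.2f} / 1M tokens")

# Break-even vs a hosted API at $10 / 1M output tokens (GPT-4o class):
breakeven_tokens_per_day = 1699 / 10 * 1e6 / 30
print(f"break-even: ~{breakeven_tokens_per_day / 1e6:.1f}M tokens/day")
# ~5.7M/day; the 30M/day figure in the text clears it with wide margin.
```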
Image, video and audio workloads
Image generation rarely needs more than a 4090 — FLUX.1-dev, SDXL, SD 3.5 all fit in 24 GB at production quality, and the RTX 4090's ~83 TFLOPS FP16 is plenty. Going to the 5090 / H100 mostly buys you batch-size headroom (more concurrent generations) rather than per-image speed. AI video (Wan-2.1, CogVideoX-5B, Runway-class workflows) is more demanding — GPU-M is the practical entry point, GPU-L for production-quality long-form. Whisper Large v3 ASR and Bark TTS both run comfortably on the 4090; the H100 is overkill for them. LoRA or QLoRA finetuning of 7B-13B models works on a 4090; finetuning 32B-70B realistically wants a 5090 at minimum, an H100 if you value your time.
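As a concrete example of the 24 GB image path, a minimal diffusers sketch for FLUX.1-dev. The checkpoint ID and settings follow the model card; enable_model_cpu_offload is what keeps the pipeline inside a 4090's VRAM by loading each sub-model onto the GPU only while it runs.

```python
import torch
from diffusers import FluxPipeline

# Standard FLUX.1-dev checkpoint in bf16; per-model CPU offload means the
# text encoders, transformer, and VAE never sit in 24 GB all at once.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # swap sub-models to GPU only when needed

image = pipe(
    "product photo of a mechanical keyboard, studio lighting",
    height=1024, width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,  # model-card default for FLUX.1-dev
).images[0]
image.save("keyboard.png")
```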
What about RTX 5090 vs RTX A6000 / A100?
If you have looked at GPU options outside the consumer-card line, you may have come across the RTX A6000 (48 GB, workstation card) or the A100 (40 / 80 GB, prior-generation HBM2e). Quick verdict: the Ampere-era A6000 trails the 4090 on both compute and bandwidth but carries twice the VRAM, useful only if VRAM rather than bandwidth is your bottleneck (rare); the A100 is a generation behind the H100 and now mostly available on the secondary market — if you can find it cheap it remains a credible 70B-inference card, but new builds in 2026 are typically H100. We do not currently offer A6000 or A100 tiers; the catalog jumps from RTX 5090 to H100.
What we ship and what to pick
To summarize the GPU buying decision in one sentence per workload: chatbot / coding-assistant under 32B → GPU-S (RTX 4090) for 7B-13B, GPU-M (RTX 5090) for 27B-32B; flagship 70B inference (Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama-70B) → GPU-L (H100 SXM5); full-precision 70B or multi-GPU training → GPU-XL (2× H100 SXM5); image / video / voice generation → GPU-S unless you need batch headroom, then GPU-M. All four tiers ship with CUDA 12.4 + cuDNN preinstalled and 1-click vLLM / Ollama / ComfyUI / Stable Diffusion templates. The full hardware spec is on /gpu.