Self-Hosted LLM Inference Costs Explained

A finance team looks at a growing API bill, sees the words "per token," and assumes that renting a GPU and running an open-weight model will be cheaper. Sometimes it is. Often it isn't, and the gap comes down to one number almost nobody measures before they commit: GPU utilization. Self-hosting an LLM turns your AI infrastructure into a fixed-cost asset, and a fixed cost only beats a variable one when you keep the asset busy.

That's the whole game. A rented H100 costs the same whether it serves one request per second or forty. The API provider, by contrast, charges you for exactly the tokens you used and amortizes idle capacity across thousands of other customers. So the real question isn't "which is cheaper per token" — it's "can we drive enough sustained traffic to a box we're paying for around the clock."

We've helped teams at Laxaar run both sides of this trade. This post walks through the cost structure honestly, including the parts that look small on a spreadsheet and then dominate the bill in production.

What you'll learn

Why GPU economics, not token prices, drive the decision
The full cost stack of a self-hosted deployment
How batching and utilization change everything
Quantization: trading precision for throughput
Self-hosted versus API: a break-even comparison
A back-of-envelope cost model you can reuse
Frequently Asked Questions

Why GPU economics, not token prices, drive the decision

Inference infrastructure cost is the total spend required to serve model responses: the accelerators, the memory, the supporting compute, and the engineering time to keep it running. For an API, that cost is hidden inside a published per-token rate. For self-hosting, you own every line item, and the dominant one is the GPU.

GPUs are billed by time, not by work done. Whether you rent on-demand from a cloud provider or buy hardware outright, you pay for the hours the card exists, not the tokens it generates. That single fact flips the usual SaaS intuition. With an API, low traffic means a low bill. With self-hosting, low traffic means an expensive idle machine.
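
To make the contrast concrete, here is a minimal sketch of the two billing shapes. The 2 USD/hr GPU rate matches the figure used later in this post; the 1 USD per million tokens API price is a placeholder assumption, not a quote from any provider, and both function names are illustrative.

```python
HOURS_PER_MONTH = 730  # approximate hours in one month

def monthly_api_bill(tokens_per_month, usd_per_million_tokens):
    """Variable cost: grows in proportion to tokens served."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def monthly_gpu_bill(gpu_hourly_usd, replicas=1):
    """Fixed cost: depends on hours and replica count, never on traffic."""
    return gpu_hourly_usd * HOURS_PER_MONTH * replicas

# Assumed prices: 1 USD per 1M tokens (hypothetical API), 2 USD/hr GPU.
for tokens in (10_000_000, 1_000_000_000, 10_000_000_000):
    api = monthly_api_bill(tokens, usd_per_million_tokens=1.0)
    gpu = monthly_gpu_bill(gpu_hourly_usd=2.0)
    print(f"{tokens:>14,} tokens/month  API ${api:>9,.2f}  GPU ${gpu:,.2f}")
```

Under these assumed prices the GPU bill stays at 1,460 USD per month while the API bill runs from 10 USD to 10,000 USD, so the ranking flips somewhere in between. This ignores every cost layer other than the accelerator.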

Here's the opinionated take: most teams that "save money" by self-hosting are actually subsidizing a poorly-used GPU with engineering salaries they forgot to count. The math can absolutely work, but it works through high, steady load and disciplined batching — not through the open-weight model being free to download.

The full cost stack of a self-hosted deployment

The GPU is the headline, but it's never the whole invoice. A self-hosted LLM serving stack has at least five cost layers, and skipping any of them produces a forecast that's wrong by a wide margin.

Accelerator time. The GPU itself, billed hourly or as depreciated capital. This is usually 60 to 80 percent of the run-rate.
Host compute and memory. vCPUs and system RAM to feed the GPU, run the serving framework, and handle tokenization and request routing.
Storage and network egress. Model weights can run tens of gigabytes; loading them, snapshotting them, and moving responses out all cost money.
Redundancy. One GPU is a single point of failure. Real availability means at least two, which roughly doubles the floor cost before you serve a single extra request.
Engineering and on-call. Someone has to patch drivers, tune the server, watch for OOM crashes, and wake up at 3am. This is the line item spreadsheets always omit.

The honest trade-off lives here. An API hides all five layers behind one number and a status page. Self-hosting hands you control and data residency, and in exchange you absorb operational risk that used to be someone else's problem. If your team can't staff that, the cheaper-per-token model is a mirage.
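
One way to keep all five layers in view is to write them down as a single structure and sum them. The sketch below is illustrative only: the class name, the field names, and every dollar figure are placeholder assumptions, and it assumes accelerator and host costs scale with replica count while storage, egress, and engineering do not.

```python
from dataclasses import dataclass

@dataclass
class MonthlyCostStack:
    accelerator_usd: float     # GPU time for one replica
    host_usd: float            # vCPUs and system RAM for one replica
    storage_egress_usd: float  # weights storage, snapshots, response egress
    replicas: int              # two or more for redundancy
    engineering_usd: float     # fractional headcount and on-call

    def total(self) -> float:
        # Assumption: only accelerator and host costs multiply per replica.
        per_replica = self.accelerator_usd + self.host_usd
        return per_replica * self.replicas + self.storage_egress_usd + self.engineering_usd

    def accelerator_share(self) -> float:
        return self.accelerator_usd * self.replicas / self.total()

# Placeholder figures, not measurements.
stack = MonthlyCostStack(
    accelerator_usd=1460.0,
    host_usd=200.0,
    storage_egress_usd=150.0,
    replicas=2,
    engineering_usd=1000.0,
)
print(f"total: ${stack.total():,.2f}")
print(f"accelerator share: {stack.accelerator_share():.0%}")
```

Dropping the engineering line or setting replicas to 1 shows how quickly a forecast drifts from the fully loaded number.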

How batching and utilization change everything

A GPU running one request at a time wastes most of its compute. LLM inference is memory-bandwidth bound during token generation, which means the card spends a lot of cycles waiting. Continuous batching — packing many in-flight requests through the model together — is what turns a $2-an-hour card into something that serves real volume.

Modern serving frameworks like vLLM and TGI implement continuous (also called in-flight) batching, where new requests join the batch as soon as a slot frees up instead of waiting for the whole batch to finish. The throughput difference between naive single-request serving and good batching is often 10x or more.
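
The scheduling idea can be illustrated with a toy step counter. This is not how vLLM or TGI are implemented internally. It is a simplified model that assumes every decode step takes the same time regardless of how many slots are occupied, that all requests are already queued at the start, and that prefill is free. The function names are made up for this sketch.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 110
```

In this toy workload the same eight requests finish in 110 steps instead of 200, because the second long request starts as soon as a short one frees a slot rather than waiting for the first long request to finish.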

Utilization is the multiplier on top. If your GPU sits at 15 percent average utilization because traffic is spiky and you provisioned for the peak, your effective cost per token is roughly seven times the theoretical best case of a fully utilized card (1 / 0.15 is about 6.7). Spiky workloads punish self-hosting hardest.

# Effective cost per 1M tokens for a self-hosted GPU.
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec_at_full_load, utilization):
    # Utilization is a fraction of full load, so it must fall in (0, 1].
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be greater than 0 and at most 1")
    effective_tps = tokens_per_sec_at_full_load * utilization
    tokens_per_hour = effective_tps * 3600
    cost_per_token = gpu_hourly_usd / tokens_per_hour
    return cost_per_token * 1_000_000

# A card at 2 USD/hr doing 2,500 tok/s flat-out, but only 20% utilized.
print(round(cost_per_million_tokens(2.0, 2500, 0.20), 3))  # ~1.111 USD
# Same card kept busy at 80% utilization.
print(round(cost_per_million_tokens(2.0, 2500, 0.80), 3))  # ~0.278 USD

That 4x swing comes purely from keeping the card busy. No model change, no quantization — just utilization. It's why we tell clients to measure their real traffic shape before pricing any self-hosted plan.

Quantization: trading precision for throughput

Quantization shrinks model weights from 16-bit floats down to 8-bit or 4-bit formats, such as INT8, FP8, INT4, or NF4. Smaller weights mean less memory bandwidth per token and a smaller memory footprint, so you can fit a bigger model on a smaller GPU or serve more concurrent requests on the same one. Both outcomes cut cost per token.

The catch is quality. Aggressive 4-bit quantization can degrade output on hard reasoning tasks, and the degradation is uneven across model families. You don't find out from a benchmark headline — you find out from your own evals on your own prompts. We never ship a quantized model to production without a regression suite that compares it against the full-precision baseline on real workloads.
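
A regression gate of that kind can be as small as the sketch below. It assumes you already have a per-prompt score for the baseline and the quantized model on the same prompts (for example 1 for pass, 0 for fail). The function name, the 0.02 tolerance, and the sample scores are all hypothetical, and a real suite would likely track per-task breakdowns as well.

```python
from statistics import mean

def quantization_regression(baseline_scores, quantized_scores, max_mean_drop=0.02):
    """Compare per-prompt scores and flag a drop larger than the tolerance."""
    if len(baseline_scores) != len(quantized_scores):
        raise ValueError("score lists must cover the same prompts")
    drop = mean(baseline_scores) - mean(quantized_scores)
    # Count prompts where the quantized model scored strictly lower.
    worse = sum(q < b for b, q in zip(baseline_scores, quantized_scores))
    return {"mean_drop": drop, "prompts_worse": worse, "passes": drop <= max_mean_drop}

# Hypothetical pass/fail scores on five prompts.
baseline = [1, 1, 0, 1, 1]
quantized = [1, 0, 0, 1, 1]
result = quantization_regression(baseline, quantized)
print(result["prompts_worse"], result["passes"])
```

The per-prompt count matters as much as the mean: an unchanged average can hide a handful of prompts that got worse while others got better.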

A rough guide to where each precision level lands:

Precision	Relative memory	Throughput gain	Quality risk
FP16 / BF16	1.0x (baseline)	baseline	none
INT8 / FP8	about 0.5x	roughly 1.5x to 2x	low, usually safe
INT4 / NF4	about 0.25x	roughly 2x to 3x	moderate, test carefully

The pragmatic default we reach for is 8-bit. It nearly halves memory with minimal quality loss for most tasks, which is often enough to drop down a GPU tier. Going to 4-bit is worth it for high-volume, latency-sensitive serving, but only after the evals say the quality holds.
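
The memory column in the table follows directly from bits per weight. As a rough sketch, weights-only memory is parameter count times bits divided by eight. The function name is illustrative, and the estimate deliberately leaves out KV cache, activations, and any per-block quantization metadata, so real footprints are larger.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weights-only memory in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8

for params in (8, 70):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: about {weight_memory_gb(params, bits):.0f} GB")
```

An 8 billion parameter model drops from about 16 GB to 8 GB at 8-bit, and a 70 billion parameter model drops from about 140 GB to 35 GB at 4-bit, which is the kind of shift that changes which GPU tier is viable.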

Self-hosted versus API: a break-even comparison

Put the two models side by side and the decision stops being ideological. Each column has a regime where it clearly wins.

Factor	Self-hosted LLM	Hosted API
Cost shape	Fixed (pay for time)	Variable (pay per token)
Best at	High, steady volume	Low or spiky volume
Cost at low utilization	Very high per token	Low, you only pay for use
Operational burden	You own it	Provider owns it
Data residency	Full control	Depends on provider terms
Time to first request	Days to weeks	Minutes
Scaling to a spike	Provision ahead or fail	Mostly automatic

The break-even point is where your monthly API spend equals the fully loaded cost of a self-hosted box kept at realistic utilization. Below that line, the API wins on both cost and effort. Above it, and only if your traffic is steady enough to keep the GPU busy, self-hosting starts to pay. For teams weighing this against a broader build, the sizing work in our AI development services usually starts with measuring that line before anyone provisions hardware.
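
The break-even line can be computed in two steps: the monthly token volume at which API spend matches the self-hosted bill, and the utilization the hardware would need to serve that volume. The sketch below uses placeholder inputs (a 4,000 USD fully loaded monthly cost, a 1 USD per million tokens API price, 2,500 tokens per second per replica, two active replicas); none of these are measured values, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730

def break_even_tokens_per_month(monthly_self_hosted_usd, api_usd_per_million):
    """Token volume at which API spend equals the self-hosted bill."""
    return monthly_self_hosted_usd / api_usd_per_million * 1_000_000

def required_utilization(tokens_per_month, tokens_per_sec_full_load, replicas=1):
    """Fraction of full-load capacity needed to serve the given volume."""
    capacity = tokens_per_sec_full_load * 3600 * HOURS_PER_MONTH * replicas
    return tokens_per_month / capacity

# Placeholder inputs, not benchmarks.
volume = break_even_tokens_per_month(4000.0, api_usd_per_million=1.0)
util = required_utilization(volume, tokens_per_sec_full_load=2500, replicas=2)
print(f"break-even volume: {volume:,.0f} tokens/month")
print(f"utilization needed: {util:.0%}")
```

With these inputs the break-even volume is 4 billion tokens per month, which needs about 30 percent average utilization across both replicas. If the required utilization comes out above what your traffic shape can sustain, the break-even point is out of reach regardless of price.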

There's also a middle path we use a lot: route the bulk of cheap, high-volume traffic to a self-hosted small model and spill the hard, rare requests to a frontier API. That hybrid often beats either pure strategy.
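
A minimal sketch of that routing decision follows. The keyword list, the length threshold, and the backend labels are all hypothetical stand-ins. A production router would more likely use a trained classifier or a confidence signal from the small model than keyword matching.

```python
def route(prompt, hard_keywords=("prove", "multi-step", "legal"), max_local_chars=2000):
    """Return which backend should serve the prompt (toy heuristic)."""
    text = prompt.lower()
    # Long prompts or prompts with "hard" markers spill to the frontier API.
    if len(prompt) > max_local_chars or any(k in text for k in hard_keywords):
        return "frontier_api"
    return "self_hosted"

prompts = [
    "Summarize this support ticket in two sentences.",
    "Prove that the scheduling algorithm terminates.",
    "Classify the sentiment of this review.",
]
for p in prompts:
    print(route(p), "<-", p)
```

The share of traffic that stays on the self-hosted path is what sets its utilization, so it is worth logging the routing decision and checking that share against the cost model.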

A back-of-envelope cost model you can reuse

Before any procurement conversation, run the numbers yourself. The inputs are knowable and the model is simple enough to keep in a spreadsheet.

Monthly self-hosted cost =
    (GPU hourly rate * 730 hours * replica count)
  + host + storage + egress (~15-25% of GPU cost)
  + engineering load (allocate fractional headcount honestly)

Effective tokens/month =
    tokens_per_sec_full_load * 3600 * 730 * average_utilization

Cost per 1M tokens (self-hosted) =
    Monthly self-hosted cost / (Effective tokens/month / 1,000,000)

Compare against:
    API list price per 1M tokens, blended across your input/output mix
    (input price * input share + output price * output share)

Two numbers decide the outcome: replica count (driven by your redundancy and peak needs) and average utilization (driven by traffic shape and batching quality). Get honest estimates for those and the rest falls out. If your utilization estimate is below roughly 30 percent, self-hosting almost never wins on cost alone.
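
For readers who prefer code to a spreadsheet, the same model can be sketched in Python. The formulas mirror the ones above. The function names, the 20 percent overhead fraction, the engineering figure, and the API prices are placeholder assumptions. It also assumes tokens_per_sec_full_load is the combined throughput of whatever replicas actively serve traffic.

```python
HOURS_PER_MONTH = 730

def monthly_self_hosted_cost(gpu_hourly_usd, replicas, overhead_fraction, engineering_usd):
    """GPU time plus host/storage/egress overhead plus engineering load."""
    gpu = gpu_hourly_usd * HOURS_PER_MONTH * replicas
    return gpu + gpu * overhead_fraction + engineering_usd

def effective_tokens_per_month(tokens_per_sec_full_load, average_utilization):
    """Tokens actually served in a month at the given average utilization."""
    return tokens_per_sec_full_load * 3600 * HOURS_PER_MONTH * average_utilization

def self_hosted_cost_per_million(monthly_cost_usd, tokens_per_month):
    return monthly_cost_usd / (tokens_per_month / 1_000_000)

def blended_api_price(input_usd_per_million, output_usd_per_million, input_share):
    """Weighted API price per 1M tokens for a given input/output mix."""
    return input_usd_per_million * input_share + output_usd_per_million * (1 - input_share)

# Placeholder inputs, not measurements or provider quotes.
cost = monthly_self_hosted_cost(2.0, replicas=2, overhead_fraction=0.20, engineering_usd=1000.0)
tokens = effective_tokens_per_month(2500, average_utilization=0.5)
self_hosted = self_hosted_cost_per_million(cost, tokens)
api = blended_api_price(0.5, 1.5, input_share=0.75)
print(f"self-hosted: ${self_hosted:.2f} per 1M tokens, API: ${api:.2f} per 1M tokens")
```

With these placeholder numbers the self-hosted path comes out near 1.37 USD per million tokens against a blended API price of 0.75 USD, so the API wins. Changing average_utilization and replicas is the quickest way to see how sensitive the result is to the two inputs that decide the outcome.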

When the model says self-hosting is close, we usually prototype both for a week with real traffic mirrored to each, then decide on measured numbers rather than the forecast. If you'd rather not run that experiment in-house, the Laxaar team does this evaluation as a fixed-scope engagement; you can start a conversation through our contact page or scope it directly via a project quote.

Frequently Asked Questions

Is self-hosting an LLM always cheaper than using an API?

No. Self-hosting only wins when you keep the GPU at high, steady utilization, because you pay for the card by the hour whether it's busy or idle. For low or spiky traffic, a per-token API is usually cheaper and far less work. The crossover depends on your volume, traffic shape, and how well your serving stack batches requests.

What GPU do you need to self-host a model?

It depends on model size and the chosen precision. A quantized 7-to-8 billion parameter model can run on a single mid-tier GPU, while a 70 billion parameter model at unquantized 16-bit precision needs multiple high-memory cards. Quantization is often the lever that lets you drop to a cheaper card, so size the GPU after you've decided on precision, not before.

How much can quantization actually save?

8-bit quantization roughly halves memory use with little quality loss for most tasks, which can move you down a GPU tier and cut cost meaningfully. 4-bit can shrink memory to about a quarter and boost throughput further, but it carries real quality risk on hard tasks. Always validate with your own eval suite before shipping a quantized model.

What's the most overlooked cost in self-hosting?

Engineering and on-call time. The GPU bill is visible on a cloud invoice, but the hours your team spends patching drivers, tuning the server, and recovering from out-of-memory crashes rarely make it into the comparison. That hidden labor is what turns a "cheaper" self-hosted plan into a more expensive one.

Can you mix self-hosting and APIs?

Yes, and it's often the smartest setup. Route high-volume, simple requests to a self-hosted small model to keep its GPU busy, and send rare, hard requests to a frontier API. This hybrid captures the cost advantage of self-hosting where utilization is high while avoiding the operational risk of running everything yourself.

Picking the right inference strategy is one of those decisions that's cheap to get right early and painful to unwind later. If you're staring at a rising API bill or planning a new LLM feature and want a clear-eyed cost model before you commit to hardware, explore our AI development services or tell us about your workload through the contact page. We'll help you find the line where the numbers actually work.
