
Inference Infrastructure Costs for Self-Hosted LLMs

A clear breakdown of AI infrastructure costs for self-hosted LLMs, covering GPU economics, batching, quantization, and the utilization math that decides ROI.

By Laxaar Engineering Team, Jun 10, 2026

A finance team looks at a growing API bill, sees the words "per token," and assumes that renting a GPU and running an open-weight model will be cheaper. Sometimes it is. Often it isn't, and the gap comes down to one number almost nobody measures before they commit: GPU utilization. Self-hosting an LLM turns your AI infrastructure into a fixed-cost asset, and a fixed cost only beats a variable one when you keep the asset busy.

That's the whole game. A rented H100 costs the same whether it serves one request per second or forty. The API provider, by contrast, charges you for exactly the tokens you used and amortizes idle capacity across thousands of other customers. So the real question isn't "which is cheaper per token" — it's "can we drive enough sustained traffic to a box we're paying for around the clock."

We've helped teams at Laxaar run both sides of this trade. This post walks through the cost structure honestly, including the parts that look small on a spreadsheet and then dominate the bill in production.


Why GPU economics, not token prices, drive the decision

Inference infrastructure cost is the total spend required to serve model responses: the accelerators, the memory, the supporting compute, and the engineering time to keep it running. For an API, that cost is hidden inside a published per-token rate. For self-hosting, you own every line item, and the dominant one is the GPU.

GPUs are billed by time, not by work done. Whether you rent on-demand from a cloud provider or buy hardware outright, you pay for the hours the card exists, not the tokens it generates. That single fact flips the usual SaaS intuition. With an API, low traffic means a low bill. With self-hosting, low traffic means an expensive idle machine.

Here's the opinionated take: most teams that "save money" by self-hosting are actually subsidizing a poorly-used GPU with engineering salaries they forgot to count. The math can absolutely work, but it works through high, steady load and disciplined batching — not through the open-weight model being free to download.

The full cost stack of a self-hosted deployment

The GPU is the headline, but it's never the whole invoice. A self-hosted LLM serving stack has at least five cost layers, and skipping any of them produces a forecast that's wrong by a wide margin.

  • Accelerator time. The GPU itself, billed hourly or as depreciated capital. This is usually 60 to 80 percent of the run-rate.
  • Host compute and memory. vCPUs and system RAM to feed the GPU, run the serving framework, and handle tokenization and request routing.
  • Storage and network egress. Model weights can run tens of gigabytes; loading them, snapshotting them, and moving responses out all cost money.
  • Redundancy. One GPU is a single point of failure. Real availability means at least two, which roughly doubles the floor cost before you serve a single extra request.
  • Engineering and on-call. Someone has to patch drivers, tune the server, watch for OOM crashes, and wake up at 3am. This is the line item spreadsheets always omit.

The honest trade-off lives here. An API hides all five layers behind one number and a status page. Self-hosting hands you control and data residency, and in exchange you absorb operational risk that used to be someone else's problem. If your team can't staff that, the cheaper-per-token model is a mirage.

How batching and utilization change everything

A GPU running one request at a time wastes most of its compute. LLM inference is memory-bandwidth bound during token generation, which means the card spends a lot of cycles waiting. Continuous batching — packing many in-flight requests through the model together — is what turns a $2-an-hour card into something that serves real volume.

Modern serving frameworks like vLLM and TGI implement continuous (also called in-flight) batching, where new requests join the batch as soon as a slot frees up instead of waiting for the whole batch to finish. The throughput difference between naive single-request serving and good batching is often 10x or more.
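To see why refilling slots matters, here is a toy simulation. It is a sketch of the scheduling idea only, not how vLLM or TGI are implemented. It assumes every decode step costs the same no matter how many slots are occupied, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member is done.
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (tokens) for eight requests, four slots.
lengths = [10, 200, 15, 20, 180, 12, 8, 25]
print(static_batching_steps(lengths, 4))      # 380
print(continuous_batching_steps(lengths, 4))  # 200
```

In this toy case the same work finishes in roughly half the steps, because short requests no longer wait behind the longest one in their batch. Real gains depend on your length distribution and on how step cost grows with batch size.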

Utilization is the multiplier on top. If your GPU sits at 15 percent average utilization because traffic is spiky and you provisioned for the peak, your effective cost per token is roughly seven times the theoretical best case. Spiky workloads punish self-hosting hardest.

# Effective cost per 1M tokens for a self-hosted GPU.
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec_at_full_load, utilization):
    # utilization is the average fraction of full-load throughput in use (0 < u <= 1).
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    effective_tps = tokens_per_sec_at_full_load * utilization
    tokens_per_hour = effective_tps * 3600
    cost_per_token = gpu_hourly_usd / tokens_per_hour
    return cost_per_token * 1_000_000

# A card at 2 USD/hr doing 2,500 tok/s flat-out, but only 20% utilized.
print(round(cost_per_million_tokens(2.0, 2500, 0.20), 3))  # ~1.111 USD
# Same card kept busy at 80% utilization.
print(round(cost_per_million_tokens(2.0, 2500, 0.80), 3))  # ~0.278 USD

That 4x swing comes purely from keeping the card busy. No model change, no quantization — just utilization. It's why we tell clients to measure their real traffic shape before pricing any self-hosted plan.

Quantization: trading precision for throughput

Quantization shrinks model weights from 16-bit floats down to 8-bit or 4-bit formats such as INT8, FP8, INT4, or NF4. Smaller weights mean less memory bandwidth per token and a smaller memory footprint, so you can fit a bigger model on a smaller GPU or serve more concurrent requests on the same one. Both outcomes cut cost per token.
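The footprint arithmetic is simple enough to check by hand. The sketch below estimates weight memory only, under the assumption that every parameter is stored at the stated bit width. It ignores the KV cache, activations, and any per-group quantization metadata, so treat the result as a floor. The function name is ours, not from any library.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    # billions of params * bits per param / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


print(weight_memory_gb(7, 16))   # 14.0
print(weight_memory_gb(7, 4))    # 3.5
print(weight_memory_gb(70, 16))  # 140.0
```

Leave real headroom above these numbers: the KV cache grows with batch size and context length, and it is usually what decides how many concurrent requests a card can hold.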

The catch is quality. Aggressive 4-bit quantization can degrade output on hard reasoning tasks, and the degradation is uneven across model families. You don't find out from a benchmark headline — you find out from your own evals on your own prompts. We never ship a quantized model to production without a regression suite that compares it against the full-precision baseline on real workloads.

A rough guide to where each precision level lands:

| Precision | Relative memory | Throughput gain | Quality risk |
|---|---|---|---|
| FP16 / BF16 | 1.0x (baseline) | baseline | none |
| INT8 / FP8 | about 0.5x | roughly 1.5x to 2x | low, usually safe |
| INT4 / NF4 | about 0.25x | roughly 2x to 3x | moderate, test carefully |

The pragmatic default we reach for is 8-bit. It nearly halves memory with minimal quality loss for most tasks, which is often enough to drop down a GPU tier. Going to 4-bit is worth it for high-volume, latency-sensitive serving, but only after the evals say the quality holds.
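A regression gate for that eval step can be very small. The sketch below assumes you already have a per-prompt score between 0 and 1 for both the full-precision baseline and the quantized model, in the same prompt order. The function name, the 0.02 tolerance, and the sample scores are all illustrative assumptions, and how you score each response is up to your own eval suite.

```python
from statistics import mean


def quantization_regression(baseline_scores, quantized_scores, max_mean_drop=0.02):
    """Compare per-prompt eval scores and flag a quality regression."""
    if not baseline_scores or len(baseline_scores) != len(quantized_scores):
        raise ValueError("score lists must be non-empty and the same length")
    mean_drop = mean(baseline_scores) - mean(quantized_scores)
    # The worst single-prompt drop surfaces uneven degradation the mean hides.
    worst_drop = max(b - q for b, q in zip(baseline_scores, quantized_scores))
    return {
        "mean_drop": mean_drop,
        "worst_drop": worst_drop,
        "passed": mean_drop <= max_mean_drop,
    }


# Hypothetical scores for four prompts.
baseline = [0.90, 0.80, 1.00, 0.70]
quantized = [0.90, 0.75, 1.00, 0.70]
report = quantization_regression(baseline, quantized)
print(report["passed"])  # True
```

Gating on the mean alone is a deliberate simplification. If a few hard prompts matter more than the average, add a threshold on the worst-case drop as well.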

Self-hosted versus API: a break-even comparison

Put the two models side by side and the decision stops being ideological. Each column has a regime where it clearly wins.

| Factor | Self-hosted LLM | Hosted API |
|---|---|---|
| Cost shape | Fixed (pay for time) | Variable (pay per token) |
| Best at | High, steady volume | Low or spiky volume |
| Cost at low utilization | Very high per token | Low, you only pay for use |
| Operational burden | You own it | Provider owns it |
| Data residency | Full control | Depends on provider terms |
| Time to first request | Days to weeks | Minutes |
| Scaling to a spike | Provision ahead or fail | Mostly automatic |

The break-even point is where your monthly API spend equals the fully loaded cost of a self-hosted box kept at realistic utilization. Below that line, the API wins on both cost and effort. Above it, and only if your traffic is steady enough to keep the GPU busy, self-hosting starts to pay. For teams weighing this against a broader build, the sizing work in our AI development services usually starts with measuring that line before anyone provisions hardware.
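One way to make that line concrete is to ask what average utilization the box would need before it matches the API bill. The sketch below assumes a single blended API price per 1M tokens and a fleet-wide full-load throughput. Every number in the example is a placeholder, not a quote from any provider.

```python
def break_even_utilization(self_hosted_monthly_usd, api_usd_per_million,
                           tokens_per_sec_full_load, hours_per_month=730):
    """Average utilization at which self-hosted cost equals API spend.

    A result above 1.0 means the deployment cannot break even,
    even if it runs flat-out all month.
    """
    if api_usd_per_million <= 0 or tokens_per_sec_full_load <= 0:
        raise ValueError("price and throughput must be positive")
    # Monthly tokens whose API cost equals the fixed self-hosted bill.
    break_even_tokens = self_hosted_monthly_usd / api_usd_per_million * 1_000_000
    # Monthly tokens the deployment could serve at 100% utilization.
    full_load_tokens = tokens_per_sec_full_load * 3600 * hours_per_month
    return break_even_tokens / full_load_tokens


# Hypothetical: 4,000 USD/month fully loaded, API at 1.00 USD per 1M tokens,
# 2,500 tok/s at full load.
print(round(break_even_utilization(4000, 1.00, 2500), 2))  # 0.61
```

With these placeholder inputs the box has to average about 61 percent utilization before it beats the API. Compare that against the utilization you actually measure.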

There's also a middle path we use a lot: route the bulk of cheap, high-volume traffic to a self-hosted small model and spill the hard, rare requests to a frontier API. That hybrid often beats either pure strategy.

A back-of-envelope cost model you can reuse

Before any procurement conversation, run the numbers yourself. The inputs are knowable and the model is simple enough to keep in a spreadsheet.

Monthly self-hosted cost =
    (GPU hourly rate * 730 hours * replica count)
  + host + storage + egress (~15-25% of GPU cost)
  + engineering load (allocate fractional headcount honestly)

Effective tokens/month =
    tokens_per_sec_full_load * 3600 * 730 * average_utilization

Cost per 1M tokens (self-hosted) =
    Monthly self-hosted cost / (Effective tokens/month / 1,000,000)

Compare against:
    API list price per 1M tokens * your blended input/output mix
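The same model translates directly into a few lines of Python if you would rather script it than keep a spreadsheet. This is a sketch under stated assumptions: the overhead fraction, engineering allocation, throughput, and API prices below are placeholders you should replace with your own numbers, and the throughput is treated as fleet-wide, matching the formula above.

```python
HOURS_PER_MONTH = 730


def monthly_self_hosted_cost(gpu_hourly_usd, replicas, overhead_fraction,
                             engineering_usd):
    """GPU time plus host/storage/egress overhead plus engineering load."""
    gpu_cost = gpu_hourly_usd * HOURS_PER_MONTH * replicas
    return gpu_cost + gpu_cost * overhead_fraction + engineering_usd


def effective_tokens_per_month(tokens_per_sec_full_load, average_utilization):
    """Tokens actually served per month at the given average utilization."""
    return tokens_per_sec_full_load * 3600 * HOURS_PER_MONTH * average_utilization


def self_hosted_cost_per_million(monthly_cost_usd, tokens_per_month):
    """Fully loaded self-hosted cost per 1M tokens."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return monthly_cost_usd / (tokens_per_month / 1_000_000)


def blended_api_price(input_usd_per_million, output_usd_per_million, input_share):
    """API price per 1M tokens, weighted by the input/output token mix."""
    return (input_usd_per_million * input_share
            + output_usd_per_million * (1 - input_share))


# Hypothetical inputs: 2 replicas at 2 USD/hr, 20% overhead,
# 3,000 USD/month of engineering time, 2,500 tok/s, 50% utilization.
cost = monthly_self_hosted_cost(2.0, 2, 0.20, 3000)
tokens = effective_tokens_per_month(2500, 0.50)
print(round(cost, 2))                                        # 6504.0
print(round(self_hosted_cost_per_million(cost, tokens), 2))  # 1.98
# Hypothetical API prices: 0.50 USD in, 1.50 USD out, 75% input tokens.
print(round(blended_api_price(0.50, 1.50, 0.75), 2))         # 0.75
```

In this made-up scenario the self-hosted path costs more than twice the blended API rate, mostly because the second replica and the engineering allocation are counted honestly.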

Two numbers decide the outcome: replica count (driven by your redundancy and peak needs) and average utilization (driven by traffic shape and batching quality). Get honest estimates for those and the rest falls out. If your utilization estimate is below roughly 30 percent, self-hosting almost never wins on cost alone.

When the model says self-hosting is close, we usually prototype both for a week with real traffic mirrored to each, then decide on measured numbers rather than the forecast. If you'd rather not run that experiment in-house, the Laxaar team does this evaluation as a fixed-scope engagement; you can start a conversation through our contact page or scope it directly via a project quote.

Frequently Asked Questions

Is self-hosting an LLM always cheaper than using an API?

No. Self-hosting only wins when you keep the GPU at high, steady utilization, because you pay for the card by the hour whether it's busy or idle. For low or spiky traffic, a per-token API is usually cheaper and far less work. The crossover depends on your volume, traffic shape, and how well your serving stack batches requests.

What GPU do you need to self-host a model?

It depends on model size and the chosen precision. A quantized 7-to-8-billion-parameter model can run on a single mid-tier GPU, while a 70-billion-parameter model at 16-bit precision (about 140 GB of weights) needs multiple high-memory cards. Quantization is often the lever that lets you drop to a cheaper card, so size the GPU after you've decided on precision, not before.

How much can quantization actually save?

8-bit quantization roughly halves memory use with little quality loss for most tasks, which can move you down a GPU tier and cut cost meaningfully. 4-bit can shrink memory to about a quarter and boost throughput further, but it carries real quality risk on hard tasks. Always validate with your own eval suite before shipping a quantized model.

What's the most overlooked cost in self-hosting?

Engineering and on-call time. The GPU bill is visible on a cloud invoice, but the hours your team spends patching drivers, tuning the server, and recovering from out-of-memory crashes rarely make it into the comparison. That hidden labor is what turns a "cheaper" self-hosted plan into a more expensive one.

Can you mix self-hosting and APIs?

Yes, and it's often the smartest setup. Route high-volume, simple requests to a self-hosted small model to keep its GPU busy, and send rare, hard requests to a frontier API. This hybrid captures the cost advantage of self-hosting where utilization is high while avoiding the operational risk of running everything yourself.

Picking the right inference strategy is one of those decisions that's cheap to get right early and painful to unwind later. If you're staring at a rising API bill or planning a new LLM feature and want a clear-eyed cost model before you commit to hardware, explore our AI development services or tell us about your workload through the contact page. We'll help you find the line where the numbers actually work.
