Multi-Tenant LoRA Serving: One Base Model, Hundreds of Fine-Tunes
The definitive 2026 guide to serving many LoRA fine-tunes on a shared base model: how LoRA works, S-LoRA and Punica architectures, vLLM and TGI multi-LoRA implementations, dynamic adapter loading, scheduling strategies, throughput math, hot-cold tiering, and the economics that make per-customer fine-tuning viable.
A 7B-parameter LLM costs ~14 GB of HBM at FP16 and tens of thousands of dollars per year to serve at production QPS. Standing up a separate instance for every customer who wants a fine-tuned model is unaffordable. The breakthrough — and it has become the dominant serving pattern in 2026 — is to keep one base model resident on the GPU and load thousands of small LoRA adapters on top of it dynamically, picking the right adapter per request. The math turns SaaS-style per-customer fine-tuning from "expensive enterprise feature" into "default product capability."
The take. A 7B base model with FP16 weights occupies ~14 GB; a typical LoRA adapter for the same model is 10–80 MB. You can hold hundreds of adapters in the same HBM you'd otherwise need for two separate instances. The serving stacks that matter — vLLM, SGLang, TGI, TensorRT-LLM — all ship multi-LoRA support, with S-LoRA-style (Sheng et al., arXiv:2311.03285) batched-heterogeneous kernels under the hood. The real engineering work is the adapter-management layer: hot/cold tiering, prefetch, scheduling requests with different adapters in the same batch, and accounting per-tenant for cost and quality. Punica and S-LoRA solved the kernel problem; the scheduler problem is where production teams still spend their week.
This guide covers the full multi-tenant LoRA stack: how LoRA actually works, the kernel-level innovations (Punica's segmented matrix multiplication, S-LoRA's unified paging), how vLLM and other stacks implement multi-LoRA in 2026, the scheduling decisions that determine throughput, hot/cold tiering for thousands of adapters, dynamic loading at request time, the cost model that decides when LoRA beats full fine-tuning, eval considerations for fleets of fine-tunes, and the production failure modes. Cross-links: post-training (RLHF, DPO), vLLM and PagedAttention, KV cache inference memory math, agent serving infrastructure, AI inference cost economics, quantization tradeoffs, RAG in production.
Table of contents
- Key takeaways
- Mental model: multi-tenant LoRA in one minute
- Why multi-tenant LoRA exists
- LoRA in 60 seconds
- The serving challenge
- Punica: batching heterogeneous LoRAs
- S-LoRA: scaling to thousands of adapters
- vLLM and TGI multi-LoRA in 2026
- Adapter management: hot, warm, cold
- Dynamic loading and prefetch
- Scheduling with adapters
- Throughput economics
- LoRA quality vs full fine-tuning
- Cost math: LoRA vs separate instances
- Multi-tenant operations
- Production failure modes
- Per-adapter math: rank, target modules, MB sizes
- Adapter quantization: LoRA on FP8, QLoRA serving, NVFP4
- Hot-swap dynamics: cold load, prefetch, cache eviction
- Per-tenant SLA isolation and fairness scheduling
- Customer-onboarding flow: training to serving
- Enterprise multi-tenant: RBAC, audit, compliance
- GPU-class economics for adapter serving
- LoRA vs FT vs RAG vs few-shot: decision matrix
- Real production deployments in 2026
- Customer onboarding deep dive: from upload to GA
- Deployment patterns: SaaS, private cloud, hybrid
- MoE bases and LoRA: where the pattern breaks down
- Catastrophic forgetting, overfit, and training pitfalls
- Adapter format, registry, and supply-chain hygiene
- Debugging multi-LoRA in production
- Security: adapters as attack vectors
- The bottom line
- FAQ
- Extended FAQ
- Eighteen-month outlook
- Glossary
- References
Key takeaways
- Multi-tenant LoRA means one base model on GPU + many small adapters loaded on-demand. Standard pattern for per-customer fine-tuning in 2026.
- A LoRA adapter for a 7B base is 10–80 MB. You can hold hundreds in the same HBM as one base model copy.
- Punica (Chen et al., arXiv:2310.18547) introduced the kernel pattern: segment-aware GEMM that batches requests with different adapters in one forward pass.
- S-LoRA (Sheng et al., arXiv:2311.03285) generalised it: unified paging for adapters and KV cache, thousands of concurrent LoRAs.
- All major serving stacks now support multi-LoRA: vLLM, SGLang, TGI, TensorRT-LLM, LMDeploy.
- Throughput hit from running multi-LoRA vs single-model is 10–20% in 2026; was 50%+ in 2023.
- Hot adapters live in HBM, warm on CPU, cold on disk. Migration is automated by adapter-aware schedulers.
- LoRA quality is typically 90–98% of full fine-tuning at 0.1–1% of the parameters. Sufficient for almost all customisation use cases.
- The economics: 200 customer fine-tunes on a single H200 cost ~$15k/month. Separate instances would cost ~$200k+.
- The production complexity moves up the stack: per-tenant eval, per-tenant cost accounting, adapter versioning, A/B testing across the fleet.
Mental model: multi-tenant LoRA in one minute
Name the problem first: the adapter-economics gap. Fine-tuning a separate full model per customer is unaffordable — a 7B FP16 instance is roughly 14 GB of HBM and $20k+/year of GPU. The only way per-customer fine-tuning works as a product is if you can pack many tenants onto one GPU. Multi-tenant LoRA closes that gap by keeping a single base model resident and hot-swapping tiny adapters per request.
Analogy: a base operating system with plugins. The OS (base weights) stays loaded; each tenant's plugin (LoRA adapter) is small, cheap to install, and can be activated per session. The kernel work — Punica's segmented GEMM, S-LoRA's unified paging — is the equivalent of letting different plugins run inside the same process without forking.
Side-by-side:
| Strategy | HBM per tenant | Cold-start | Quality vs full FT | Tenants per H200 |
|---|---|---|---|---|
| Separate full instance | ~14 GB (7B FP16) | minutes | 100% | 1–2 |
| Separate full FP8 instance | ~7 GB | minutes | ~99% | 2–4 |
| LoRA on shared base (hot) | 10–80 MB | ms | 90–98% | 200–1000+ |
| LoRA on shared base (cold disk) | 0 (paged in) | 50–200 ms | 90–98% | thousands |
| QLoRA | 10–80 MB | ms | 88–96% | 200–1000+ |
Pseudocode for the production hot path (vLLM-style):
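from vllm import LLM
from vllm.lora.request import LoRARequest
engine = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True, max_loras=64)
engine.generate(prompt, lora_request=LoRARequest("cust-123", 1, "/adapters/cust-123"))  # name, integer ID, local path (sync from object storage first)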
Sticky number to remember: S-LoRA and its descendants serve 1000+ concurrent adapters per GPU with a throughput drop that ranges from under 5% in the best large-batch cases to 10–20% in typical mixed workloads, relative to a single-model baseline. In 2026 the kernel cost of multi-LoRA is essentially priced in.
Why multi-tenant LoRA exists
Three trends converged in 2023–2024 to make multi-tenant LoRA the dominant serving pattern.
1. Per-customer fine-tuning became a product feature. SaaS companies wanted to offer "your data, your model" — a fine-tuned model per customer, trained on their tickets, their docs, their style. With one model per customer, even a few hundred customers meant hundreds of separate GPU instances at $20k+/year each. Unaffordable for products with revenue under $50/month per customer.
2. LoRA matured. Hu et al.'s original LoRA paper (arXiv:2106.09685) made low-rank fine-tuning practical in 2021. By 2023, LoRA fine-tuned models were within 1–2 points of fully fine-tuned models on most evals — and trained in hours on a single GPU instead of days on a cluster.
3. The kernel work caught up. The naive way to serve LoRA is to merge the adapter into the base weights at request time: fast for single-adapter inference, useless for multi-tenant serving because you can't batch requests with different adapters. Punica (2023) and S-LoRA (2023) solved this by computing the adapter contribution in a separate, batch-aware kernel that runs alongside the base model in one forward pass.
Once those three landed, the product economics flipped. A single H200 GPU serving a Llama-3.3 70B base model at FP8 (~70 GB of weights) can hold hundreds of LoRA adapters in HBM and route requests to the right one with single-digit-percent overhead vs serving the base model alone. SaaS per-customer fine-tuning became viable as a default tier, not an enterprise upcharge.
By 2026 most production LLM products that offer customisation use multi-tenant LoRA under the hood. The user pastes their style guide; the platform trains a LoRA adapter in 20 minutes; the adapter is registered with the multi-tenant serving cluster and the user's queries route to their adapter. No dedicated infrastructure per customer.
Products that visibly use multi-tenant LoRA
The pattern is invisible to end users but pervasive in 2026 in:
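- OpenAI fine-tuning API. Customers upload data, get a fine-tuned model ID, query it like any other model. OpenAI has not published the serving architecture, but adapter-style multi-tenancy on shared infrastructure is the widely assumed design.
- Anthropic fine-tuning (preview / enterprise). Same pattern from the customer's point of view.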
- Vertex AI Tuning. Google offers LoRA-based tuning on Gemini and PaLM models with shared serving.
- Predibase, OpenPipe, Together AI fine-tuning. Whole companies built around multi-tenant LoRA serving for open-weight models.
- Cohere fine-tuning. Customer-specific embedding and rerank fine-tunes.
- Customer-facing AI products (Intercom Fin, Salesforce Einstein, internal SaaS tools) that offer "your data, your model" features.
The cost economics that make these products possible are mostly invisible to buyers. The seller's choice between multi-tenant LoRA and dedicated instances changes their gross margin by 10-40 points; the buyer just sees a "fine-tune available" feature.
LoRA in 60 seconds
Skip if you already know this. A pragmatic summary if you don't.
A transformer model is a stack of layers; each layer contains matrices (query, key, value, output projections in attention; up and down projections in feed-forward). Full fine-tuning updates all these matrices — that's billions of parameters changing.
LoRA's insight: instead of updating each large matrix W, freeze W and learn a small additive update ΔW alongside it. The update is parameterised as ΔW = BA, where A and B are low-rank — A is (rank × in_features) and B is (out_features × rank). The product BA has the same shape as W, but only rank × (in + out) parameters versus in × out.
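Typical settings: rank r = 8–64, applied to query and value projections (and sometimes others). Adapter size grows linearly with rank and with the number of targeted matrices. For a 7B model, a rank-16 adapter on the query and value projections is ~10–20 MB at FP16; at rank 64 with a few more target modules it reaches ~80 MB. Compare to the base model: 14 GB at FP16.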
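The rank × (in + out) count above is easy to turn into a size estimate. The sketch below assumes a hypothetical 7B-class layout (32 layers, square 4096-wide query and value projections) and FP16 storage; the function names and the layout are illustrative, and real models with grouped-query attention or other target modules will give different totals.

```python
def lora_param_count(rank, target_shapes, num_layers):
    """Parameters in one adapter: rank * (in + out) per targeted matrix, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return per_layer * num_layers


def lora_size_mb(rank, target_shapes, num_layers, bytes_per_param=2):
    """Adapter size in MB (10^6 bytes); 2 bytes per parameter assumes FP16/BF16."""
    return lora_param_count(rank, target_shapes, num_layers) * bytes_per_param / 1e6


# Hypothetical layout: (in_features, out_features) for the Q and V projections.
QV_SHAPES = [(4096, 4096), (4096, 4096)]

for r in (8, 16, 64):
    print(f"rank {r:>2}: {lora_size_mb(r, QV_SHAPES, num_layers=32):.1f} MB")
```

Doubling the rank or the number of targeted matrices doubles the result, which is why the same base model can have adapters an order of magnitude apart in size.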
At inference time, the LoRA's contribution is BA · x, added to the frozen base's output W · x:
y = W · x + α · B · A · x
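where α is the effective scaling factor. In most implementations it is lora_alpha / r, and since lora_alpha is typically set to r or 2r, the effective scale is usually 1 or 2.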
This means a LoRA-served forward pass is structurally:
- Compute the base model forward pass (as usual).
- In parallel (or fused), compute the LoRA addition for each layer where an adapter is attached.
- Add the two together.
For single-adapter inference, you can merge the LoRA into the base (W' = W + α·BA) once at load time and serve it as if it were a fully fine-tuned model — zero overhead. For multi-tenant, you can't, because each request might need a different adapter. That's the kernel problem Punica and S-LoRA solve.
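A tiny numeric check makes the merge argument concrete. This is a minimal sketch with made-up 3×3 weights and a rank-1 adapter, using plain Python lists so it runs anywhere; the helper names matvec and matmul are illustrative, and alpha stands for the effective scale applied to the adapter term.

```python
def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(P, Q):
    """Matrix-matrix product for list-of-lists matrices."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen base weights
A = [[1.0, 2.0, 0.0]]                                     # (rank x in), rank = 1
B = [[0.5], [0.0], [1.0]]                                 # (out x rank)
alpha = 2.0
x = [1.0, 1.0, 1.0]

# Unmerged path: base output plus the low-rank correction, y = Wx + alpha * B(Ax).
delta = matvec(B, matvec(A, x))
y_unmerged = [b + alpha * d for b, d in zip(matvec(W, x), delta)]

# Merged path: fold the adapter into the weights once, W' = W + alpha * BA.
BA = matmul(B, A)
W_merged = [[w + alpha * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print(y_unmerged, y_merged)
```

Both paths give the same output. The merged path costs nothing extra per token but bakes in a single adapter, which is exactly what a multi-tenant batch cannot afford.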
Variants:
- QLoRA (Dettmers et al., arXiv:2305.14314) — quantize the base model to 4-bit, train LoRA on top. Dropped fine-tuning memory by 10×; the dominant fine-tuning pattern by mid-2024.
- DoRA (Liu et al., arXiv:2402.09353) — decomposes the update into magnitude + direction; small quality gain over plain LoRA.
- LoRA+ (Hayou et al., arXiv:2402.12354) — different learning rates for A and B; modest gain.
- VeRA, MoRA, AdaLoRA — research variants. Most production stays with plain LoRA + reasonable rank.
For serving, the variant matters less than people think — Punica and S-LoRA-style kernels handle them all with the same machinery.
LoRA size in megabytes — actual numbers
For common base models at common ranks, attention-only (Q+V) vs all-linear-layer (Q+K+V+O+FFN gate+up+down) targeting:
| Base model | Rank | Attention-only | All linear layers |
|---|---|---|---|
| Llama-3.1 8B | 8 | ~6 MB | ~25 MB |
| Llama-3.1 8B | 16 | ~12 MB | ~50 MB |
| Llama-3.1 8B | 32 | ~24 MB | ~100 MB |
| Llama-3.1 8B | 64 | ~48 MB | ~200 MB |
| Llama-3.3 70B | 8 | ~25 MB | ~100 MB |
| Llama-3.3 70B | 16 | ~50 MB | ~200 MB |
| Llama-3.3 70B | 32 | ~100 MB | ~400 MB |
| Qwen2.5 32B | 16 | ~30 MB | ~120 MB |
| Mistral Small 22B | 16 | ~20 MB | ~80 MB |
At rank 16 targeting all linear layers (a strong-quality production default), a 70B adapter is ~200 MB. Eighty of these fit in 16 GB of HBM, about a fifth of a single H100's 80 GB. Adapter memory is rarely the binding constraint; the kernel and scheduler are.
The serving challenge
If you've never thought about why multi-LoRA is hard, here's the constraint.
Batching is everything. Modern LLM serving processes many requests in one forward pass: the GEMMs are large, and the HBM read of the model weights is amortised across many tokens. At batch size 1, every decode step still reads the full set of weights to produce a single token, so the step is bandwidth-bound and most of the GPU's compute sits idle.
Different adapters break batching. If request A uses adapter Customer_42 and request B uses adapter Customer_177, you can't batch them naively. The adapter is part of the forward computation, and a single forward pass uses one set of weights. Either you don't batch (terrible utilization) or you batch and apply each adapter individually inside the kernel.
The naive approaches:
- Merge adapters at request time. For each request, copy base + adapter into a working buffer, compute, discard. Costs HBM bandwidth on every request. Wastes compute on the merge.
- Don't batch across adapters. Group requests by adapter; serve one adapter at a time. Each adapter gets a small batch. Decode utilization tanks.
- Replicate the model per adapter. Different GPU for each adapter. Defeats the purpose of multi-tenant.
Punica and S-LoRA do the right thing: batch requests with different adapters in one forward pass, computing each adapter's contribution with a separate, segment-aware GEMM.
The HBM-bandwidth view
LoRA's BA · x adds two small GEMMs per LoRA-attached layer per request. For a 70B model with adapters on all attention + FFN matrices, that's ~80 layers × 7 matrices × 2 GEMMs per token = ~1100 small GEMMs added to the forward pass. The catch: each of these GEMMs uses different A and B matrices per request in the batch. Without a fused, segment-aware kernel, this generates 1100 × batch_size separate kernel launches per token — orders of magnitude more than the base model's already-large kernel count.
The Punica/S-LoRA trick collapses these into one launched kernel per matrix per layer, with each thread block handling one request's adapter. The kernel reads the per-request adapter pointer from a small index buffer, then runs the matrix-vector multiply on the slice of the batch tensor that belongs to that request. Same bandwidth profile as the base model; minimal launch overhead.
Punica: batching heterogeneous LoRAs
Punica (Chen et al., 2023, arXiv:2310.18547) introduced the kernel pattern that made multi-LoRA practical.
The core idea. When a batch of N requests uses N different adapters (or even the same adapter), structure the LoRA computation as a segmented GEMM: each segment of the batch uses its own A and B matrices, but the whole operation runs in one CUDA kernel.
The math. For a batch of tokens X with shape [batch, seq, hidden], and adapters A_0, A_1, ..., A_N each of shape [rank, hidden] and B_0, B_1, ..., B_N each of shape [hidden, rank]:
For each request i with adapter k_i:
ΔX_i = X_i · A_{k_i}^T · B_{k_i}^T
output_i = base_output_i + α · ΔX_i
Punica's CUDA kernels compute all the ΔX_i operations in one launch, with each thread block handling a different adapter. The base model's GEMM is unchanged — it operates on the full batch as usual.
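The semantics of that single launch can be written as a plain reference loop. The sketch below is not Punica's kernel; it is a CPU-side illustration, in pure Python, of what the fused operation computes for a decode step with one token per request. The name bgmv_reference and the toy shapes (hidden = 2, rank = 1) are assumptions for the example.

```python
def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def bgmv_reference(xs, adapter_ids, A_stack, B_stack, alpha=1.0):
    """Per-request LoRA delta: alpha * B[k] (A[k] x), with k looked up per request.

    Each loop iteration stands in for the work one thread block does
    in the fused GPU kernel.
    """
    deltas = []
    for x, k in zip(xs, adapter_ids):
        low_rank = matvec(A_stack[k], x)      # shrink: hidden -> rank
        delta = matvec(B_stack[k], low_rank)  # expand: rank -> hidden
        deltas.append([alpha * d for d in delta])
    return deltas


# Two adapters resident in one stacked block, indexed by adapter ID.
A_stack = [[[1.0, 0.0]], [[0.0, 1.0]]]        # each A is [rank, hidden]
B_stack = [[[1.0], [1.0]], [[2.0], [0.0]]]    # each B is [hidden, rank]

# Three requests in one batch; the first and third share adapter 0.
xs = [[3.0, 4.0], [3.0, 4.0], [5.0, 6.0]]
adapter_ids = [0, 1, 0]

print(bgmv_reference(xs, adapter_ids, A_stack, B_stack))
```

The base model's GEMM never sees adapter_ids; only this side computation does, which is why the base weights can stay single-copy.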
Memory. All adapter matrices stay resident in HBM as a single tensor block, indexed by adapter ID per request. The base model is single-copy.
Throughput. Punica showed multi-LoRA serving with <10% overhead vs single-model serving when batch sizes are moderate, and <5% overhead at large batches. Compared to dedicated-instance-per-LoRA, throughput per adapter is 5–10× higher.
Limitations. Punica's original implementation required all adapters to have the same rank. S-LoRA generalised to heterogeneous ranks.
Punica's segmented BGMV kernel in plain English
The performance trick is that all adapters in a batch can be treated as one big block-diagonal matrix multiplication. Naive implementations launch one CUDA kernel per adapter, which is terrible for the GPU's scheduler because each launch has overhead and the per-adapter work is small. Punica instead packs the adapter operations into a single kernel that processes all of them in parallel: each thread block in the grid handles one request and its specific adapter. The kernel name in the codebase, BGMV for "Batched Gather Matrix-Vector multiplication", captures the pattern: gather each request's adapter weights by index, then multiply. The same idea generalises to SGMV (Segmented Gather Matrix-Vector multiplication) for prefill, where each request has many tokens and the adapter applies to all of them.
S-LoRA: scaling to thousands of adapters
S-LoRA (Sheng et al., 2023, arXiv:2311.03285) extends Punica in three important ways for production:
1. Unified paging. S-LoRA treats adapter weights and KV cache as participating in the same paged memory pool (extending vLLM's PagedAttention paradigm). Adapters and KV cache compete for HBM; the page allocator handles both. This means you can hold thousands of adapters in the same HBM that hot KV cache uses, with eviction policies that account for adapter usage frequency.
2. Heterogeneous ranks. Adapters can have different ranks (8, 16, 32, 64). S-LoRA's custom kernels operate directly on the non-contiguous, variable-rank adapter pages, so a mixed-rank batch does not have to be padded up to a uniform rank.
3. Tensor parallelism support. When the base model is sharded across GPUs (tensor parallel), the adapter computation is sharded the same way. No serialization point for the LoRA contribution.
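The unified-paging idea in point 1 can be illustrated with a toy allocator. This is a sketch of the concept only, not S-LoRA's implementation: the class name UnifiedPagePool, the (kind, name) owner keys, and the page counts are all hypothetical.

```python
class UnifiedPagePool:
    """Toy fixed-size page pool shared by adapter weights and KV cache."""

    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.owner_pages = {}  # (kind, name) -> list of page ids

    def allocate(self, owner, n_pages):
        """Give n_pages to owner; return False if an eviction is needed first."""
        if n_pages > len(self.free_pages):
            return False
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        self.owner_pages.setdefault(owner, []).extend(pages)
        return True

    def release(self, owner):
        """Return all of an owner's pages to the shared free list."""
        self.free_pages.extend(self.owner_pages.pop(owner, []))

    def pages_used(self, kind):
        """Pages currently held by one kind of owner ('adapter' or 'kv')."""
        return sum(len(p) for (k, _), p in self.owner_pages.items() if k == kind)


pool = UnifiedPagePool(total_pages=8)
pool.allocate(("adapter", "cust-42"), 3)
pool.allocate(("kv", "req-1"), 4)
print(pool.allocate(("adapter", "cust-177"), 2))  # pool is nearly full
pool.release(("kv", "req-1"))                     # a request finishes
print(pool.allocate(("adapter", "cust-177"), 2))  # its pages are reused
```

Because both kinds of owner draw from one free list, memory freed by a finished request is immediately available to a newly loaded adapter, and vice versa.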
Numbers. S-LoRA's paper showed serving 2,000 concurrent LoRA adapters on a single A100 with under 10% throughput loss vs the base model alone. Per-request latency increased <5%. The economics of multi-tenant LoRA changed once that paper landed.
Practical impact. Every major serving stack (vLLM, SGLang, TGI, TensorRT-LLM) shipped S-LoRA-style kernels by 2024–2025. The numbers are now typical across the field — multi-LoRA isn't a research benchmark, it's the production default for any platform offering customisation.
vLLM and TGI multi-LoRA in 2026
The serving stacks in production for multi-LoRA in 2026 and how they differ:
vLLM
- Multi-LoRA support added in 0.3.x; production-ready by 0.6.x.
- Configuration: --enable-lora --max-lora-rank 64 --max-loras N --max-cpu-loras M.
- Adapter discovery: launch with a list (--lora-modules name=path), or hot-add at runtime through the server's load and unload endpoints (/v1/load_lora_adapter and /v1/unload_lora_adapter, enabled by setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True).
- HBM-resident adapter limit set by --max-loras. Spillover goes to CPU memory, loaded back to HBM on demand.
- Tensor parallelism + multi-LoRA both supported and composable.
- Production scale: easily 50–200 hot adapters per H100, 200–500 across HBM + CPU.
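Hot-adding an adapter to a running vLLM server is a single HTTP call. In recent vLLM releases this is a POST to /v1/load_lora_adapter with lora_name and lora_path fields, active only when the server is started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; treat the endpoint and field names as assumptions to check against the installed version. The helper name, host, and adapter path below are hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.request


def build_load_adapter_request(base_url, lora_name, lora_path):
    """Build (but do not send) the POST that registers an adapter at runtime."""
    payload = json.dumps({"lora_name": lora_name, "lora_path": lora_path}).encode()
    return urllib.request.Request(
        url=f"{base_url}/v1/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_load_adapter_request(
    "http://localhost:8000", "cust-123", "/adapters/cust-123"
)
print(req.get_method(), req.full_url)
# With a server running, urllib.request.urlopen(req) would submit it.
```

Once loaded, the adapter is addressed by its lora_name in the model field of ordinary completion requests.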
SGLang
- Multi-LoRA in production since 0.3.x.
- RadixAttention prefix caching works with LoRA adapters — same prefix + same adapter = cache hit.
- Strong on throughput in mixed-adapter workloads.
TGI (Hugging Face Text Generation Inference)
- Multi-LoRA via the lora-adapters feature.
- Simpler operationally than vLLM if your inference is already on TGI.
- Smaller community than vLLM in 2026 but stable.
TensorRT-LLM
- NVIDIA's stack. Multi-LoRA via the lora-plugin.
- Best raw throughput on NVIDIA hardware; requires engine compilation per (base, max-LoRA-config) combination.
- Production fit: best when you have stable adapter configurations and want maximum performance.
LMDeploy (InternLM)
- Multi-LoRA support; strong on Qwen and InternLM base families.
Comparison. For most teams in 2026: vLLM is the safe default. If you're on NVIDIA-only hardware and want maximum throughput, TensorRT-LLM. If you're already on the Hugging Face stack, TGI is fine.
Serving stack feature matrix
| Feature | vLLM | SGLang | TGI | TensorRT-LLM | LMDeploy | LoRAX |
|---|---|---|---|---|---|---|
| Multi-LoRA | yes | yes | yes | yes | yes | yes (purpose-built) |
| Heterogeneous ranks in one batch | yes | yes | yes | yes | yes | yes |
| Dynamic adapter hot-add via API | yes | yes | yes | limited | yes | yes |
| Prefix caching with adapters | yes | yes (RadixAttention) | partial | yes | yes | partial |
| QLoRA / quantized base + LoRA | yes (AWQ/GPTQ/FP8) | yes | yes | yes (FP8/NVFP4) | yes | yes |
| Tensor parallelism | yes | yes | yes | yes | yes | yes |
| Speculative decoding | yes | yes | partial | yes | yes | limited |
| OpenAI-compatible API | yes | yes | yes | partial | yes | yes |
| Community size in 2026 | largest | growing fast | mid | NVIDIA-led | InternLM-led | Predibase-led |
LoRAX (Predibase) is worth a callout — it was the first stack purpose-built for multi-LoRA serving and remains the cleanest experience for "I just want to serve hundreds of adapters" workloads. vLLM caught up on functionality; LoRAX is still simpler operationally.
Adapter management: hot, warm, cold
A multi-tenant LoRA stack with thousands of adapters runs a three-tier memory hierarchy.
Hot (HBM). The adapters currently being served. Sized to fit the busiest N adapters. An 80 GB H100 with a 7B base (14 GB) leaves about 64 GB free, enough for ~640 rank-32 all-linear adapters at ~100 MB each, or ~6,400 rank-16 attention-only adapters at ~10 MB each. These are upper bounds, because the KV cache competes for the same space.
Warm (CPU RAM / system memory). Adapters that have been used recently but aren't currently in HBM. Loaded on demand by DMA from CPU RAM into HBM (50–500 ms transfer time depending on adapter size and PCIe / NVLink speed). A typical server has 256 GB–1 TB RAM; can hold tens of thousands of adapters.
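Cold (object storage / local disk). All adapters that ever existed for this base model. Loaded from S3 / GCS / local SSD when a request arrives for an adapter not in warm. Load time ranges from a few hundred milliseconds off local SSD to several seconds from object storage, and longer the first time an adapter is fetched and verified.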
Promotion / eviction. Adapter access patterns drive the migration:
- LRU is the baseline policy.
- LFU and frequency-aware policies work better when access is bursty per-tenant.
- Pre-warming by tenant: when a customer starts a session, prefetch their adapter to HBM.
- TTL-based eviction: drop adapters not used in N hours from HBM to free space for newcomers.
Most production stacks expose hooks for custom promotion policies. The default is fine for most workloads; tune when you have many cold-start latency spikes from cold-tier loads.
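The baseline LRU policy across the three tiers can be sketched in a few lines. This is an illustrative model, not any serving stack's actual cache: the class name TieredAdapterCache and the tiny capacities are assumptions, and real implementations also track adapter sizes and in-flight requests.

```python
from collections import OrderedDict


class TieredAdapterCache:
    """LRU promotion/eviction across hot (HBM) and warm (CPU RAM) tiers."""

    def __init__(self, hot_capacity, warm_capacity):
        self.hot = OrderedDict()   # adapter_id -> True; most recently used last
        self.warm = OrderedDict()
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    def access(self, adapter_id):
        """Return the tier the adapter was found in, then promote it to hot."""
        if adapter_id in self.hot:
            self.hot.move_to_end(adapter_id)
            return "hot"
        tier = "warm" if adapter_id in self.warm else "cold"
        self.warm.pop(adapter_id, None)
        self._insert_hot(adapter_id)
        return tier

    def _insert_hot(self, adapter_id):
        self.hot[adapter_id] = True
        if len(self.hot) > self.hot_capacity:
            demoted, _ = self.hot.popitem(last=False)  # LRU adapter leaves HBM
            self.warm[demoted] = True
            if len(self.warm) > self.warm_capacity:
                self.warm.popitem(last=False)          # now lives only in cold storage


cache = TieredAdapterCache(hot_capacity=2, warm_capacity=2)
trace = [cache.access(a) for a in ["a", "b", "a", "c", "b"]]
print(trace, list(cache.hot), list(cache.warm))
```

Swapping LRU for a frequency-aware policy only changes which entry popitem removes; the tier structure stays the same.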
Dynamic loading and prefetch
When a request arrives for an adapter not in HBM, the system has two choices: stall the request while the adapter loads, or batch around it with other in-HBM requests until the load completes.
Cold-load latency budget.
- Cold (S3): 1–5 seconds for the network round trip + decompress + load to RAM + DMA to HBM. Painful for interactive requests.
- Warm (CPU RAM): 50–500 ms DMA. Tolerable.
- Hot (HBM): zero. The goal.
Prefetch patterns:
- On session start. When a user logs in or begins a multi-turn conversation, pre-warm their adapter to HBM. Their first message hits hot.
- On request-pattern prediction. If a user typically sends a follow-up within 30 seconds of their previous request, keep their adapter hot for 60 seconds after each request.
- Bulk preload. For deployments with a stable adapter fleet, preload all adapters at server start. This lengthens server start-up, but every request then hits a hot adapter.
Cold-start handling:
- Queue and wait. Accept the latency hit. Acceptable for occasional requests.
- Fall back to the base model. Serve the request from the base model while the adapter loads, then switch on the next turn. Quality is sometimes acceptable, sometimes not — depends on how much the adapter changes the base behaviour.
- Reject and ask for retry. Send back a "model warming up, try again in 5s" response. UX is poor; rarely the right choice.
The right answer depends on tenant SLA and how often cold loads happen. Well-tuned production stacks see <1% of requests hitting cold; most teams over-engineer the cold path before having data showing it's a problem.
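The three cold-start options reduce to a small decision rule once a tenant's latency budget is known. The sketch below is one possible policy, not a recommendation from any serving stack: the function name, the action labels, and the use of the upper ends of the latency ranges above as worst-case load times are all assumptions.

```python
# Worst-case load times in milliseconds, taken from the upper ends of the
# latency budget listed above (warm: 500 ms, cold from S3: 5 s).
WORST_CASE_LOAD_MS = {"hot": 0, "warm": 500, "cold": 5000}


def cold_start_action(tier, latency_budget_ms, base_fallback_ok):
    """Pick how to handle a request given where its adapter currently lives."""
    load_ms = WORST_CASE_LOAD_MS[tier]
    if load_ms == 0:
        return "serve"                 # adapter already in HBM
    if load_ms <= latency_budget_ms:
        return "queue_and_wait"        # the load fits inside the tenant's SLA
    if base_fallback_ok:
        return "fallback_to_base"      # answer from the base model this turn
    return "reject_with_retry"         # last resort


print(cold_start_action("warm", latency_budget_ms=1000, base_fallback_ok=False))
print(cold_start_action("cold", latency_budget_ms=1000, base_fallback_ok=True))
```

Whether base_fallback_ok is true is a per-tenant product decision, since it depends on how far the adapter moves behaviour away from the base model.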
Scheduling with adapters
A multi-LoRA scheduler decides which requests to batch together when each may want a different adapter. Several patterns:
Adapter-mixed batching (the default in S-LoRA / vLLM / SGLang). Pull whatever requests are ready, regardless of adapter. The Punica/S-LoRA kernel handles the heterogeneous adapter case. Maximises GPU utilization; per-request latency is slightly affected by the segment-aware GEMM overhead.
Adapter-grouped batching. Wait briefly for requests with the same adapter to group; serve as a homogeneous batch. Maximises per-batch efficiency; introduces queuing latency. Used when many requests share an adapter and the workload has known periodicity.
Priority-aware. SLA-sensitive adapters (paid tier, real-time use cases) get scheduled ahead of background batch traffic.
KV-cache-aware. If two requests share a KV-cache prefix and both use the same adapter, scheduling them together can hit prefix-cache + adapter-cache together. SGLang's RadixAttention does this natively.
Fairness. When one tenant generates a flood of requests, naive scheduling can starve others. Token-bucket per tenant or weighted fair queuing prevents single-tenant starvation.
In practice: vLLM's default mixed-adapter batching is fine for most workloads. Tune the scheduler when you observe specific issues — long tail latency from one tenant, KV cache thrashing across adapters, etc.
Worked example: scheduling decision in a real cluster
A typical Saturday-morning workload on a customer-support SaaS:
- 8× H100 cluster, Llama 3.3 70B base at fp8.
- 1,200 customers, each with their own LoRA adapter.
- Traffic: 90% of requests go to 50 popular adapters; the other 1,150 share 10% of traffic.
- Burst: customer #7 (one of the top 5) suddenly sends 400 requests in 30 seconds (their product just got featured somewhere).
What the scheduler does well:
- Customer #7's adapter is hot in HBM (it's a top-50 adapter).
- Requests for customer #7 batch together via grouped batching, hitting the same KV-cache prefix optimisation.
- Other adapters' requests continue to be served in mixed batches; small per-request overhead from the segment-aware GEMM but no starvation.
What fails without a per-tenant rate limit:
- Customer #7's burst saturates the GPUs at the expense of the other 1,199 customers.
- The fix: a token bucket per tenant (e.g., 50 RPS sustained, 200 RPS burst) caps the bad-neighbour impact.
The lesson: schedulers handle the kernel level well by default; the fairness layer needs explicit policy.
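A per-tenant token bucket is short enough to show in full. This sketch reads "50 RPS sustained, 200 RPS burst" as a refill rate of 50 tokens per second and a bucket capacity of 200, which is one common interpretation and an assumption here; the class name and the arrival pattern in the usage lines are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Admit a request only if the tenant has a token left."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec      # sustained requests per second
        self.capacity = burst         # largest burst admitted at once
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=50, burst=200)

# Hypothetical spike: 400 requests land at the same instant.
admitted_at_spike = sum(bucket.allow(now=0.0) for _ in range(400))

# One second later the bucket has refilled by the sustained rate.
admitted_one_second_later = sum(bucket.allow(now=1.0) for _ in range(100))

print(admitted_at_spike, admitted_one_second_later)
```

Rejected requests would normally be queued or answered with a retry hint rather than dropped, and one bucket is kept per tenant ID in front of the scheduler.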
Throughput economics
What does the math actually look like?
A reference setup: Llama-3.3 70B base model at FP8, served on 4× H100 (4-way tensor parallel). Without any LoRA, this serves ~50–100 concurrent requests at moderate decode rate.
With multi-LoRA (S-LoRA-style):
- 200 LoRA adapters at rank 16, attention-only (~50 MB each) = 10 GB of adapter memory.
- Adds ~10–15% latency to each forward pass due to the segment-aware LoRA GEMM.
- Throughput per GPU drops 10–15%.
- Effective per-adapter cost: 1/200 × the cluster cost.
Vs separate instances:
- 200 separate Llama-70B FP8 instances = 200 × 4 × H100 = 800 H100s. Absurd.
- Even with consolidation (each instance shared across 5 adapters), you'd need 40 4-GPU clusters = 160 H100s.
The multi-LoRA approach costs 4 GPUs (one cluster). The consolidated separate-instance approach costs 160 GPUs, and fully separate instances cost 800. That 40× cost ratio (200× without consolidation) is what made multi-LoRA the default.
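The GPU counts in this comparison follow from one line of arithmetic. A minimal sketch, with the function name gpus_needed chosen for illustration and the inputs taken from the reference setup above:

```python
import math


def gpus_needed(num_adapters, gpus_per_instance, adapters_per_instance):
    """GPUs required when each instance can serve a fixed number of adapters."""
    instances = math.ceil(num_adapters / adapters_per_instance)
    return instances * gpus_per_instance


multi_lora = gpus_needed(200, gpus_per_instance=4, adapters_per_instance=200)
consolidated = gpus_needed(200, gpus_per_instance=4, adapters_per_instance=5)
separate = gpus_needed(200, gpus_per_instance=4, adapters_per_instance=1)

print(multi_lora, consolidated, separate)
print(consolidated // multi_lora, separate // multi_lora)
```

The ratio ignores the 10–15% throughput overhead of the LoRA kernels, which shifts the result by a small fraction rather than by an order of magnitude.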
At smaller scale: A single H200 with a Llama-3.1 8B base (~16 GB at FP16) can hold 500+ rank-16 adapters with KV cache to spare. Serving 50 concurrent users across those adapters at <50ms per-token latency is straightforward in 2026.
Throughput by hardware tier
| Hardware | Base model | Max hot adapters | Aggregate throughput (tokens/sec) | Cost ($/hr) |
|---|---|---|---|---|
| 1× L40S (48 GB) | 8B at FP16 | ~100 rank-16 | ~2k | ~$1.50 |
| 1× H100 80 GB | 8B at FP16 | ~600 rank-16 | ~6k | ~$4 |
| 1× H100 80 GB | 8B at FP8 + INT4 base | ~3000 rank-16 | ~7k | ~$4 |
| 1× H200 141 GB | 8B at FP16 | ~1500 rank-16 | ~8k | ~$6 |
| 4× H100 80 GB (TP=4) | 70B at FP8 | ~400 rank-16 | ~15k | ~$16 |
| 8× H100 80 GB (TP=8) | 70B at FP16 | ~600 rank-16 | ~25k | ~$32 |
| 8× H200 141 GB (TP=8) | 405B at FP8 | ~300 rank-16 | ~20k | ~$48 |
| 8× B200 192 GB (TP=8) | 405B at FP8 | ~600 rank-16 | ~35k | ~$60 |
Numbers are approximate and assume the workload doesn't bottleneck on cold-tier loads. Real-world throughput depends heavily on input/output length distributions.
The break-even point for going multi-tenant vs dedicated:
- Per adapter QPS < 1 request/sec average → multi-tenant is clearly better.
- Per adapter QPS > 50 request/sec average → dedicated might pay off (you saturate the GPU with one adapter anyway).
- In between → multi-tenant with adapter-grouped batching for the heavy hitters.
Few SaaS workloads have any individual tenant exceeding 50 req/sec sustained, so the multi-tenant pattern dominates.
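Written as a routing rule, the break-even thresholds look like this. The function name and return labels are illustrative, and the 1 and 50 requests/sec cut-offs are the rough figures from the list above rather than measured constants.

```python
def placement(avg_qps, low=1.0, high=50.0):
    """Map a tenant's sustained request rate to a serving strategy."""
    if avg_qps > high:
        return "consider_dedicated"
    if avg_qps < low:
        return "multi_tenant"
    return "multi_tenant_grouped_batching"


for qps in (0.2, 12.0, 80.0):
    print(qps, placement(qps))
```

In practice the thresholds would be re-derived from measured per-GPU throughput, since they move with model size and sequence lengths.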
LoRA quality vs full fine-tuning
A persistent question: does LoRA actually match full fine-tuning?
For most use cases: yes, within 1–3 points on most benchmarks.
The classic results, from the LoRA paper (Hu et al., 2021), the QLoRA paper (Dettmers et al., 2023), and many subsequent fine-tuning studies, show LoRA fine-tuned models within a point or two of fully fine-tuned models on:
- Instruction following.
- Domain adaptation (legal, medical, code).
- Stylistic alignment (specific tone, format).
- Task-specific classification or extraction.
LoRA underperforms full fine-tuning by larger margins (5–10+ points) on:
- Multi-turn complex reasoning that benefits from many parameter updates.
- Tasks requiring large distribution shifts from the base model (e.g., entirely new language families).
- Aggressive vocabulary / tokeniser changes.
Rank matters.
- Rank 4–8: very small, works for narrow style adaptation.
- Rank 16–32: the sweet spot. Most production fine-tunes.
- Rank 64–128: closer to full fine-tuning quality at moderate cost.
- Rank 256+: diminishing returns; if you need this, consider full fine-tuning instead.
Module targeting.
- Default: attention query and value projections (Q+V), as in the original LoRA paper.
- Stronger: include FFN projections (up, gate, down). 2–4× more parameters but better quality.
- Comprehensive: all linear layers. Approaches full fine-tuning quality.
The conventional wisdom in 2026: use rank 32, target all attention + FFN linear layers, and you'll be within 1–2 points of full fine-tuning for most customisation tasks. That saves on the order of 95% of the training cost.
Quality table: LoRA configurations on a typical fine-tune task
Rough numbers from internal evals at several teams running customer fine-tuning in 2026, on a domain-style-adaptation task (customer support tone):
| Configuration | Trainable params | Quality vs full FT | Training time (10k examples, 70B base) |
|---|---|---|---|
| Full fine-tuning | 100% | 100 (baseline) | 30 h on 8× H100 |
| LoRA rank 4, attention only | 0.05% | 91 | 30 min on 4× H100 |
| LoRA rank 16, attention only | 0.2% | 96 | 60 min on 4× H100 |
| LoRA rank 32, attention only | 0.4% | 97 | 90 min on 4× H100 |
| LoRA rank 16, all linear | 0.4% | 98 | 90 min on 4× H100 |
| LoRA rank 32, all linear | 0.8% | 99 | 2 h on 4× H100 |
| LoRA rank 64, all linear | 1.5% | 99.5 | 3 h on 4× H100 |
| DoRA rank 32, all linear | 0.9% | 99.5 | 2.5 h on 4× H100 |
| QLoRA (4-bit base) rank 32, all linear | 0.8% | 98.5 | 2.5 h on 4× H100 (12 GB peak memory) |
QLoRA gives up ~0.5 quality points for a large memory reduction during training: the 4-bit base weights of a 70B model take roughly 35 GB instead of 140 GB at FP16. For most teams that's the right trade, because the fine-tune fits on a single 80 GB GPU instead of needing 4.
Cost math: LoRA vs separate instances
A concrete pricing example. Llama-3.3 70B base, 200 customer fine-tunes, on AWS in 2026:
Multi-tenant LoRA setup:
- 1× p5.48xlarge (8× H100) at ~$50/hour = $36k/month.
- 200 rank-32 LoRAs at 50 MB each = 10 GB total, fits easily.
- Serves 5,000 RPS aggregate (peak), 1,500 RPS sustained.
- Effective cost per adapter: $180/month for unlimited inference.
Dedicated instances:
- 200 × p5.4xlarge (4× H100) = 200 instances × $25/hr = $3.6M/month.
- Most idle 99% of the time.
The hybrid optimisation. Real workloads have a few heavy-traffic tenants and many low-traffic. The 2026 pattern: multi-tenant for the long tail; consider dedicated only for tenants generating sustained >50 RPS. Even then, dedicated is rarely the right call — multi-tenant scales fine to higher per-adapter QPS than people think.
Training cost.
- LoRA training of a 70B base on a customer's 10k-example dataset: ~3 hours on 4× H100 (12 GPU-hours at ~$25 each) = $300 of GPU time, per customer per fine-tune.
- Full fine-tuning the same: ~30 hours on 8× H100 (240 GPU-hours) = $6,000 per customer per fine-tune.
- 20× cost reduction in training, in addition to the ~100× cost reduction in serving ($3.6M vs $36k per month in the example above).
For a SaaS offering customer fine-tunes, the total economics are:
- Per-customer training: $300 amortised over the customer's lifetime ≈ $5/month if customers stay 5 years.
- Per-customer serving share: ~$180/month for a moderately popular product.
- Total per-customer cost: ~$185/month for unlimited fine-tuned inference.
Charge $200/month for the customisation tier and you have a viable product. Without LoRA, the same product would cost roughly $18k/month per customer to deliver ($3.6M spread across 200 dedicated instances).
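The totals above reduce to a few multiplications. The sketch below uses only figures quoted in this section; the function names are illustrative, and the 720-hour month is an assumption implied by $50/hour ≈ $36k/month.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, implied by $50/h ~= $36k/month


def monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Monthly cost of running `instances` nodes around the clock."""
    return hourly_rate * HOURS_PER_MONTH * instances


def per_customer_cost(serving_monthly: float, customers: int,
                      training_cost: float, lifetime_months: int) -> float:
    """Serving share plus training cost amortised over the customer lifetime."""
    return serving_monthly / customers + training_cost / lifetime_months


shared = monthly_cost(50.0)                    # one 8x H100 node for everyone
dedicated = monthly_cost(25.0, instances=200)  # one 4x H100 node per customer
total = per_customer_cost(shared, customers=200,
                          training_cost=300.0, lifetime_months=60)

print(f"shared: ${shared:,.0f}/mo, dedicated: ${dedicated:,.0f}/mo")
print(f"per-customer total on the shared node: ${total:,.0f}/mo")
```

Swapping in your own hourly rate, tenant count, and expected customer lifetime shows quickly whether a given price point clears cost.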
Multi-tenant LoRA vs RAG vs prompt customisation
The three ways to serve per-customer behaviour, compared:
| Approach | Quality on style | Quality on knowledge | Setup cost per customer | Serving cost per customer | Iteration speed |
|---|---|---|---|---|---|
| LoRA fine-tune | High | Medium | $300 (3 h training) | ~$180/mo | Hours |
| RAG over customer docs | Low | High | ~$1 (embedding) | ~$5/mo + query cost | Seconds |
| Few-shot examples in system prompt | Medium | Low | None | Higher per-query (longer prompt) | Instant |
| LoRA + RAG hybrid | High | High | $300 + indexing | ~$185/mo + query cost | Hours |
LoRA is the right tool for style, tone, and format adaptation. RAG is the right tool for "the model needs to know facts that aren't in its training data." For most production products that want both, the answer is both: a LoRA adapter that shapes style, plus RAG over the customer's content. See RAG in production for the retrieval side.
Multi-tenant operations
The production complexity of multi-tenant LoRA isn't in the kernels — that's solved. It's in operations.
The operational team for a 1000-adapter platform
What a typical team looks like running a 1000-adapter multi-tenant LoRA platform:
- 1–2 ML platform engineers (serving stack, kernel debugging, performance tuning).
- 1 ML researcher (fine-tuning recipes, eval, adapter quality).
- 1 backend engineer (gateway, scheduling, billing integration).
- 1 SRE (on-call, monitoring, incident response).
- 0.5 product/PM (customer-facing tooling, onboarding flow).
- 0.5 compliance (SOC 2 audit, customer contracts).
Total: 5 FTEs. At fully loaded cost ($300k/FTE), $1.5M/year. The platform needs to generate $5M+/year in revenue for the team economics to work, which lands at roughly $500/customer/year average on 10,000 customers, or $50/customer/year on 100,000.
Per-tenant evaluation. Each adapter has its own quality. You can't run one eval suite and call the platform "good." Most teams build a per-tenant eval pipeline: each customer's adapter is evaluated against their own labelled data (collected via in-product feedback or a small ground-truth set).
Adapter versioning. Customers iterate on their fine-tunes. v1, v2, v3 of the same customer's adapter coexist; rollback when v3 regresses. Adapter versions are tagged, served, and evicted independently.
A/B testing. When you upgrade the base model, you need to fine-tune every existing adapter on the new base and validate quality before cutting over. Multi-tenant tooling has to support running two base models with two adapter sets simultaneously during migration.
Cost accounting. Per-tenant billing requires knowing each tenant's compute share. Token counting per adapter is straightforward; HBM occupancy attribution is fuzzier (one adapter resident in HBM "uses" the same HBM whether it serves 1 or 1000 requests/hour). Most platforms bill by tokens served, not by HBM occupancy, and amortise the cluster overhead.
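One way to do the token-based amortisation described above is a proportional split. This is a minimal sketch, not any particular billing system's method; the tenant names and token counts are made up for illustration.

```python
def allocate_cluster_cost(tokens_by_tenant: dict[str, int],
                          cluster_cost: float) -> dict[str, float]:
    """Split a cluster's cost across tenants in proportion to tokens served."""
    total_tokens = sum(tokens_by_tenant.values())
    if total_tokens == 0:
        # Nothing served this period: no usage-based charge for anyone.
        return {tenant: 0.0 for tenant in tokens_by_tenant}
    return {tenant: cluster_cost * tokens / total_tokens
            for tenant, tokens in tokens_by_tenant.items()}


# Hypothetical month on a $36k/month node.
usage = {"acme": 6_000_000, "globex": 3_000_000, "initech": 1_000_000}
for tenant, share in allocate_cluster_cost(usage, 36_000.0).items():
    print(f"{tenant}: ${share:,.0f}")
```

HBM occupancy never appears in the formula; it is folded into the cluster cost, which is the amortisation the paragraph describes.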
Adapter store. Object storage for cold-tier adapters, with versioning and integrity checks. Adapters are small but the catalogue grows quickly — 10k adapters × 50 MB = 500 GB of object storage. Cheap; do it well from the start.
Permissioning. Tenant A's adapter must not serve tenant B's requests. Trivial at the API gateway level (auth → tenant ID → adapter selection), but worth double-checking in code paths that touch adapter IDs.
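A sketch of the ownership check worth repeating below the gateway, assuming a simple adapter-ID to tenant-ID mapping. The `owners` registry and the exception name are hypothetical, not part of any serving stack.

```python
class AdapterAccessError(PermissionError):
    """Raised when a tenant requests an adapter it does not own."""


def resolve_adapter(tenant_id: str, adapter_id: str,
                    owners: dict[str, str]) -> str:
    """Return adapter_id only if the authenticated tenant owns it."""
    owner = owners.get(adapter_id)
    if owner != tenant_id:
        # Same error for "unknown adapter" and "someone else's adapter",
        # so the response does not reveal which adapter IDs exist.
        raise AdapterAccessError(
            f"tenant {tenant_id!r} may not use adapter {adapter_id!r}")
    return adapter_id


owners = {"acme/support-v3": "acme"}
print(resolve_adapter("acme", "acme/support-v3", owners))
```

Calling this in every code path that accepts an adapter ID (not just at the gateway) is the "double-checking" the paragraph recommends.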
Monitoring. Per-adapter latency, per-adapter error rate, per-adapter cold-load frequency, per-adapter cost. The dashboard with these four metrics catches 80% of production issues.
Adapter lifecycle: from upload to retirement
A typical adapter's lifecycle in a production multi-tenant system:
- Upload. Customer pushes training data (JSONL or similar). System validates schema and size.
- Validation. Schema check, PII scan, content policy filter. Reject obviously bad uploads.
- Training. LoRA training on a separate compute pool. Typically 30 min – 3 h depending on base size and dataset.
- Quality eval. Train/validation split; LLM-as-judge or task-specific metrics. Block promotion if quality regresses vs base.
- Canary. Adapter loaded to a small fraction of traffic, real-user feedback collected for 24–72 hours.
- Promotion. Adapter goes live for the customer. Version tag stored.
- Production. Adapter is hot/warm/cold tiered based on access patterns.
- Retraining. Triggered by base model upgrade, data drift, or customer request.
- Deprecation. Old versions kept for rollback for 30–90 days, then deleted from cold storage.
- Audit. Adapter file retained per compliance policy (often 1–7 years) even after deletion from serving.
Most platforms automate steps 2–4 and 7–9; steps 1, 5, 6, and 10 typically have human approval or compliance checkpoints depending on the regulated nature of the customer base.
Production failure modes
Cold-tier load thrashing. HBM is full, so each newly loaded adapter evicts another to CPU; requests for the evicted adapters then trigger reloads that evict still more. Symptom: tail-latency spikes during traffic shifts. Fix: increase --max-loras to hold more in HBM, prefetch on session start, increase the warm tier on CPU.
Adapter corruption. A bad adapter file (truncated, wrong shape, NaN weights) crashes the worker. Validate adapters on upload; canary new adapters on a small traffic slice before promotion.
Rank mismatch. Adapter trained at rank 64; serving stack configured for --max-lora-rank 32. Worker fails to load. Validate max-lora-rank matches your training pipeline.
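A rank check like this can run at upload time rather than at first load. The sketch assumes the adapter ships a PEFT-style config dictionary with the rank stored under the key "r", and that `max_lora_rank` mirrors whatever value the serving stack was started with; both are assumptions to adapt to your pipeline.

```python
def check_adapter_rank(adapter_config: dict, max_lora_rank: int) -> int:
    """Validate the adapter's rank against the serving limit; return the rank."""
    rank = adapter_config.get("r")  # assumed PEFT-style key
    if not isinstance(rank, int) or rank <= 0:
        raise ValueError(f"adapter config has no valid rank: {rank!r}")
    if rank > max_lora_rank:
        raise ValueError(
            f"adapter rank {rank} exceeds serving limit {max_lora_rank}")
    return rank


print(check_adapter_rank({"r": 32}, max_lora_rank=64))
```

Rejecting the upload with a clear message is cheaper than discovering the mismatch when a worker fails to load the adapter in production.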
Tensor-parallel sharding mismatches. Adapter sharded with TP=4 served on a TP=2 stack. Modern stacks handle this transparently but bugs exist.
Adapter drift across base versions. When the base model is updated, old adapters trained on the old base may behave incorrectly on the new base. Treat the (base, adapter) pair as the versioned artifact; never serve an adapter against a base version it wasn't trained on.
Bursty single-tenant traffic. One customer hammers the system, hot-adapter pressure starves others. Fix: weighted fair queuing per tenant, per-tenant rate limits.
Eval blind spots. Aggregate quality looks fine; one tenant's quality silently regressed. Per-tenant eval and per-tenant quality dashboards catch this.
Memory leak in the adapter pool. Old adapter versions not properly freed when replaced; HBM fills over time. Restart cadence catches this in practice; a real fix requires care in the adapter loader.
Cross-tenant cache pollution. KV-cache prefix caching is per-(adapter, prefix); a bug that ignores the adapter dimension would leak cached state across tenants. Test this; it has happened.
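The property to test is that the cache key changes whenever the adapter changes. A minimal sketch of such a key follows; it is a hypothetical scheme for illustration, not how any particular serving stack builds its prefix-cache keys.

```python
import hashlib


def prefix_cache_key(adapter_id: str, token_ids: list[int]) -> str:
    """Hash the adapter ID together with the token prefix."""
    h = hashlib.sha256()
    id_bytes = adapter_id.encode("utf-8")
    # Length-prefix the adapter ID so it cannot blur into the token bytes.
    h.update(len(id_bytes).to_bytes(4, "little"))
    h.update(id_bytes)
    for token in token_ids:
        h.update(token.to_bytes(4, "little"))
    return h.hexdigest()


prefix = [101, 2023, 2003]
print(prefix_cache_key("tenant-a", prefix) == prefix_cache_key("tenant-b", prefix))
```

A regression test asserting that two tenants with an identical prompt prefix never share a key is cheap insurance against this class of bug.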
LoRA over-fit on small datasets. Customers upload 50 examples; the adapter memorises them and parrots them back instead of generalising. Mitigations: minimum dataset size before allowing fine-tune (a few hundred examples), explicit dropout in the LoRA layers, validate against a held-out split before promoting.
Catastrophic forgetting on adjacent capabilities. Training a LoRA on legal text degrades the model's coding ability. Even though LoRA touches a small subset of parameters, aggressive training can shift the model's behaviour broadly. Mitigations: lower learning rate, include a small fraction of general-purpose data in the training mix, eval on capability benchmarks before deploying.
Tokeniser mismatch. Adapter trained with one tokeniser, served with another (this happens when teams switch from Llama 3.1 to Llama 3.3 without re-training). The adapter weights are nominally compatible but the embedding alignment shifts. Always tie the adapter to a specific base version.
Per-adapter math: rank, target modules, MB sizes
This section gives you the actual numbers — adapter sizes in MB across base models, ranks, and target-module choices. Useful for capacity planning and cost modeling.
Adapter size determines how many you can hold hot in HBM. The math is mechanical once you know the architecture's hidden dimensions.
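For a transformer layer with hidden size d and intermediate size d_ff (typically 4d for dense FFNs and about 8d/3 to 3.5d for SwiGLU FFNs; the Llama 3 models below use 3.25d–3.5d), the per-layer LoRA parameter counts are approximately:
- Attention Q, K, V, O projections at rank r: 4 × 2 × r × d parameters per layer.
- FFN gate, up, down projections at rank r: 3 × 2 × r × d_ff parameters per layer.
These are upper bounds. The exact count for a projection with input width d_in and output width d_out is r × (d_in + d_out), so the FFN term is really 3 × r × (d + d_ff), and grouped-query attention shrinks the K and V adapters further.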
For Llama-3.1 8B (d=4096, d_ff=14336, 32 layers), rank 16:
- Attention-only: 32 × 4 × 2 × 16 × 4096 = ~16.8M params × 2 bytes (BF16) = ~33 MB.
- All-linear: 16.8M + 32 × 3 × 2 × 16 × 14336 = 16.8M + 44M ≈ 60.8M params × 2 bytes = ~120 MB.
For Llama-3.3 70B (d=8192, d_ff=28672, 80 layers), rank 16:
- Attention-only: 80 × 4 × 2 × 16 × 8192 = ~84M × 2 = ~168 MB.
- All-linear: 84M + 80 × 3 × 2 × 16 × 28672 = 84M + 220M ≈ 304M params × 2 bytes = ~600 MB.
For Llama-3.1 405B (d=16384, d_ff=53248, 126 layers), rank 16:
- Attention-only: 126 × 4 × 2 × 16 × 16384 = ~264M params × 2 bytes = ~530 MB.
- All-linear: 264M + 126 × 3 × 2 × 16 × 53248 = 264M + 644M ≈ 908M params × 2 bytes = ~1.8 GB.
The 405B case is where multi-tenant strategy bends. On an 8× B200 node (192 GB HBM per GPU), the FP8 weights (~400 GB) shard to ~50 GB per GPU at TP=8, with per-GPU adapter shards on top. At ~1.8 GB per all-linear rank-16 adapter spread across 8 GPUs, each adapter costs ~230 MB per GPU. A budget of ~100 hot adapters per node (about 23 GB per GPU) is realistic once KV cache takes its share; thousands is not.
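The sizing formulas in this section are easy to script for capacity planning. The sketch below implements them as written (4 × 2 × r × d for attention, 3 × 2 × r × d_ff for the FFN) and assumes 2 bytes per parameter for BF16; the function names are illustrative.

```python
def lora_params(d: int, d_ff: int, layers: int, rank: int,
                include_ffn: bool) -> int:
    """Parameter count using this section's per-layer approximation."""
    attention = 4 * 2 * rank * d                      # Q, K, V, O
    ffn = 3 * 2 * rank * d_ff if include_ffn else 0   # gate, up, down
    return layers * (attention + ffn)


def size_mb(params: int, bytes_per_param: int = 2) -> float:
    """Adapter size in MB (10^6 bytes); 2 bytes per parameter for BF16."""
    return params * bytes_per_param / 1e6


MODELS = {
    "Llama-3.1 8B": dict(d=4096, d_ff=14336, layers=32),
    "Llama-3.3 70B": dict(d=8192, d_ff=28672, layers=80),
}

for name, shape in MODELS.items():
    attn_only = lora_params(rank=16, include_ffn=False, **shape)
    all_linear = lora_params(rank=16, include_ffn=True, **shape)
    print(f"{name}: attention-only {size_mb(attn_only):.0f} MB, "
          f"all-linear {size_mb(all_linear):.0f} MB")
```

This prints 34 and 122 MB for the 8B model and 168 and 608 MB for the 70B model, in line with the rounded figures above. An exact count would use r × (d_in + d_out) per projection, which comes out lower for the FFN projections and for grouped-query K and V, so treat these numbers as a conservative budget.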
Target-module choices and their trade-offs
| Module set | Quality vs full FT | Size multiplier |
|---|---|---|
| Q only | ~85% | 1.0× (baseline) |
| Q + V | ~90% | 2.0× |
| Q + K + V + O | ~95% | 4.0× |
| Q + V + FFN gate/up/down | ~97% | ~9× |
| All linear layers | ~98% | ~10× |
| All linear + embedding | ~99% (rarely worth it) | ~11× |
The default in 2026 is Q + V + all FFN — captures ~97% of quality at ~9× the smallest adapter. For storage-constrained edge deployments, Q + V only at rank 8 gives 88–92% quality at <30 MB per adapter on an 8B base.
Why rank doesn't matter as much as you think
Doubling rank doubles adapter size and trainable params, but quality gains beyond rank 32 are sub-linear. The published evidence: LoRA Tutor (Predibase), Hu et al.'s original ablations, and many subsequent studies all show diminishing returns above rank 64. The practical sweet spot is rank 16–32 for cost-conscious deployments, rank 64 only if eval shows a measurable lift.
Adapter quantization: LoRA on FP8, QLoRA serving, NVFP4
Quantization stacks on multi-tenant LoRA in non-obvious ways. The base model and the adapter are independently quantizable, with consequences for both quality and memory.
Base model at FP8 with FP16 adapter
The 2026 default. Base in FP8 (E4M3 or E5M2 via Transformer Engine on H100/H200/B200) reduces HBM use by 2× vs FP16. The LoRA adapter stays at BF16/FP16. At forward time:
- The base GEMM runs in FP8 with TensorCore acceleration.
- The LoRA GEMM runs in BF16/FP16 — small kernel, low overhead.
- The two outputs are added in FP32 accumulator, downcast to BF16 for the next layer's input.
Quality: within 0.5 points of FP16 base. Memory: ~half.
Base model at INT4 (AWQ/GPTQ) with FP16 adapter
QLoRA's serving equivalent. Base weights packed to INT4 in HBM, dequantized just-in-time per layer. LoRA stays at BF16.
- HBM footprint: 4× smaller than FP16 base (a 70B model takes ~35 GB vs ~140 GB).
- Throughput: comparable to FP8 base on H100; faster on hardware without FP8 (A100s).
- Quality: 1–2 points worse than FP16 base on hard reasoning; equivalent on most other tasks.
This is the dominant 2026 pattern for cost-constrained multi-tenant deployments on A100s and consumer-grade GPUs.
NVFP4 (Blackwell) with FP8 adapter
NVFP4 is NVIDIA's new 4-bit format introduced with Blackwell (B200/B300). Two new dimensions:
- Microscaling — each block of 16–32 elements has its own scale factor stored at FP8 precision.
- Direct TensorCore support — no dequantize step required for FP4 GEMM.
For multi-LoRA on Blackwell: base at NVFP4, adapter at FP8 or BF16. Memory savings vs an FP16 base: about 4× (slightly less once the block scale factors are counted). A 405B model at NVFP4 takes roughly 200–230 GB, so on an 8× B200 node (192 GB per GPU) it shards to under 30 GB per GPU, leaving room for hundreds of adapters and a large KV cache. This is what "multi-tenant 405B on a single B200 node" looks like in 2026.
Mixed precision for the adapter itself
Some research and production teams quantize LoRA adapters too — INT8 or INT4 — to fit more in HBM. The quality cost is real (5–10% degradation) because adapters are small and dense; quantization noise has nowhere to hide. Most production stacks keep adapters at FP16/BF16 and quantize only the base.
Quantization compatibility matrix
| Base format | LoRA format | Serving stack support | Quality vs FP16 base |
|---|---|---|---|
| FP16 | BF16 | All | 100% (baseline) |
| BF16 | BF16 | All | ~100% |
| FP8 (E4M3) | BF16 | vLLM, TRT-LLM, SGLang | ~99.5% |
| INT8 | BF16 | vLLM, TGI, TRT-LLM | ~99% |
| INT4 (AWQ) | BF16 | vLLM, TGI, SGLang | ~98% |
| INT4 (GPTQ) | BF16 | vLLM, TGI | ~98% |
| NVFP4 (Blackwell) | FP8 / BF16 | TRT-LLM, vLLM nightly | ~99% |
| FP8 | INT8 LoRA | Experimental | ~95% |
| INT4 | INT4 LoRA | Research only | ~90% |
See quantization tradeoffs for the underlying precision arithmetic.
Mixing adapter quantization with base quantization at serving
Some 2026 production stacks (vLLM nightly, TRT-LLM) support FP8 quantized adapters served on FP8 quantized bases. The quality cost is ~1 point on most evals; the memory gain is small for the adapter portion alone but stacks with the base savings.
A practical decision: keep adapters at BF16 unless your fleet has >10,000 adapters per worker and HBM is the bottleneck. The complexity isn't worth it for smaller fleets.
Choosing adapter format: safetensors vs PEFT vs custom
- HF safetensors with adapter_config.json: the de facto standard. All serving stacks accept this format.
- PEFT pickle (adapter_model.bin): deprecated due to security concerns. Don't use.
- Custom binary format: some platforms (Predibase historically) use proprietary formats for compression and faster loading. Lock-in cost is real; prefer standard.
For interoperability, train and store in HF safetensors even if your serving stack supports proprietary formats. You retain the option to switch stacks later.
Hot-swap dynamics: cold load, prefetch, cache eviction
The adapter-management layer's job is to keep the right adapters resident in HBM at the right time. Three sub-systems do the work.
Cold load timing breakdown
Loading an adapter from cold (object storage) to hot (HBM) involves:
- HTTP/HTTPS request to S3/GCS/MinIO: 20–200 ms (depending on region, file size, edge caching).
- TLS handshake (if not pooled): 10–50 ms.
- File download: typical 50 MB adapter at 100 MB/s = 500 ms.
- Decompression (if compressed): 50–200 ms.
- Deserialize (PyTorch/HF safetensors): 50–500 ms.
- CPU-to-GPU DMA over PCIe (Gen4: 32 GB/s, Gen5: 64 GB/s): a 50 MB adapter transfers in ~1.5 ms on Gen4 and under 1 ms on Gen5 at peak rates.
- Register adapter with serving stack: 5–20 ms.
End-to-end: 600–1500 ms for a typical 50 MB cold load. Multi-second tail latency for the request that triggered the load.
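The end-to-end range is just the sum of the stage ranges. The sketch below encodes the per-stage figures listed above (with the PCIe transfer bracketed at 1–2 ms) so you can substitute your own measurements; the dictionary keys are descriptive labels, not API names.

```python
# (best_ms, worst_ms) per stage, taken from the breakdown above.
COLD_LOAD_STAGES_MS = {
    "object-store request": (20, 200),
    "TLS handshake": (10, 50),
    "download, 50 MB at 100 MB/s": (500, 500),
    "decompression": (50, 200),
    "deserialize": (50, 500),
    "PCIe DMA": (1, 2),
    "register with serving stack": (5, 20),
}


def cold_load_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latencies across all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


best, worst = cold_load_range(COLD_LOAD_STAGES_MS)
print(f"cold load: {best}-{worst} ms")
```

The download and deserialize stages dominate both ends of the range, which is why the optimisations that follow target local caching and mmap rather than the PCIe hop.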
Optimisations:
- Pre-decompressed safetensors on local NVMe. Cuts download to local IO (200 MB/s for a 50 MB file = 250 ms).
- mmap from local SSD. Skips deserialize.
- Local NVMe adapter cache. Common pattern: 1 TB local SSD per worker holding ~20k adapters, S3 as backing store.
- Region-local object storage. S3 cross-region adds 100+ ms; same-region is fast.
Warm tier sizing
CPU RAM (typically 256–1024 GB on a serving node) holds adapters that have been used recently. A 1 TB node can hold 20,000 50-MB adapters in RAM. The warm-to-hot transition is dominated by PCIe DMA (roughly 1–2 ms for 50 MB over PCIe Gen4/Gen5).
Eviction policies
Standard LRU is the baseline. Variants:
- LRU-K — track access at multiple historical points; less susceptible to one-time accesses.
- TinyLFU — frequency-aware LRU; better hit rates for skewed workloads.
- 2Q (two-queue) — separates recently-added from frequently-accessed adapters; prevents one-shot loads from evicting hot items.
- Custom (tenant-priority) — paid-tier adapters are pinned; free-tier eligible for eviction.
Most production stacks use a 2Q variant with tenant-priority overlay.
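To make the 2Q-plus-pinning idea concrete, here is a deliberately simplified sketch: first-touch adapters sit in a probation FIFO, a second access promotes them to a protected LRU, and pinned adapters are never chosen as victims. The class and its fields are hypothetical, and real 2Q implementations also bound the probation queue and keep a ghost list, which this omits.

```python
from collections import OrderedDict


class TwoQueueAdapterCache:
    """Simplified 2Q with a tenant-priority pin set."""

    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)
        self.probation = OrderedDict()  # first-touch adapters, FIFO order
        self.protected = OrderedDict()  # re-accessed adapters, LRU order

    def __contains__(self, adapter_id):
        return adapter_id in self.probation or adapter_id in self.protected

    def access(self, adapter_id):
        """Record an access; return the adapters evicted to make room."""
        if adapter_id in self.protected:
            self.protected.move_to_end(adapter_id)  # refresh recency
            return []
        if adapter_id in self.probation:
            del self.probation[adapter_id]          # second touch: promote
            self.protected[adapter_id] = None
            return []
        self.probation[adapter_id] = None           # first touch
        return self._evict(keep=adapter_id)

    def _evict(self, keep):
        evicted = []
        while len(self.probation) + len(self.protected) > self.capacity:
            victim = self._pick_victim(keep)
            if victim is None:
                break  # everything left is pinned or just loaded
            evicted.append(victim)
        return evicted

    def _pick_victim(self, keep):
        # One-shot loads (probation) go first, then the coldest protected entry.
        for queue in (self.probation, self.protected):
            for adapter_id in queue:
                if adapter_id in self.pinned or adapter_id == keep:
                    continue
                del queue[adapter_id]
                return adapter_id
        return None


cache = TwoQueueAdapterCache(capacity=3, pinned={"paid"})
for adapter in ["paid", "hot", "hot", "one-shot-1", "one-shot-2"]:
    print(adapter, "-> evicted", cache.access(adapter))
```

In this run the second one-shot load evicts the first one-shot load, while the re-accessed adapter and the pinned paid-tier adapter both stay resident.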
Prefetch patterns in detail
- Session-start prefetch. On user login, prefetch their adapter from cold to warm. First message hits hot or warm.
- Predictive prefetch. If user A's messages follow a 30-second cadence, keep adapter A hot for 60 seconds after each response.
- Bulk preload at startup. For deployments with <1000 adapters, load all of them at server start. Trades 5-minute startup for 100% hot-rate.
- Geographic prefetch. A multi-region deployment can prefetch a tenant's adapter to the region they're connecting from before their first request.
Adapter prefetch under realistic traffic
Trace data from a real customer-support SaaS, 2026:
- 2,000 customer adapters in catalog.
- 95% of daily traffic concentrates on the top 100 adapters (Zipf-like distribution).
- Daily active adapter count: ~400 (some used once, many used briefly).
- p99 adapter latency: 240 ms when hot, 850 ms when warm (CPU→HBM load), 2.8 s on full cold fetch.
The prefetch strategy that worked:
- Hot tier sized for top 200 adapters (HBM budget).
- Warm tier holds top 2,000 in CPU RAM with LRU eviction (1 TB RAM, plenty of headroom).
- Session-start prefetch: when a session begins, the gateway warms the adapter (CPU→HBM) before the first request lands.
- Cold-fetch only happens for adapters used <1× per day, and represents <0.3% of requests.
Result: 99.7% of requests are served from hot or warm adapters, and p99 stays under 280 ms (lifted slightly by the warm-tier loads). The cold-fetch tail is under 0.3% of requests.
What happens during a hot-tier eviction storm
Pathological scenario: 50 new tenants log in simultaneously (e.g., a marketing campaign drove cohort onboarding). Each session-start triggers a prefetch. HBM is full; 50 existing hot adapters get evicted to make room. Meanwhile, requests for the evicted adapters generate cold-load demands, causing further eviction churn.
The mitigation: rate-limit the prefetch ingress. New session prefetches queue and execute at a rate that allows the displaced adapters to drain naturally (e.g., 5 prefetches/sec). Customers see a small first-request latency hit (warm load) but the cluster stays healthy.
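The simplest form of that rate limit is to space queued prefetches evenly. A minimal sketch, assuming a fixed rate and a queue that is drained in arrival order; the function name and adapter IDs are illustrative.

```python
def schedule_prefetches(adapter_ids, rate_per_sec, start=0.0):
    """Assign each queued prefetch a start time, spaced 1/rate seconds apart."""
    interval = 1.0 / rate_per_sec
    return [(start + i * interval, adapter_id)
            for i, adapter_id in enumerate(adapter_ids)]


# 50 simultaneous logins, drained at 5 prefetches per second.
burst = [f"tenant-{i}" for i in range(50)]
plan = schedule_prefetches(burst, rate_per_sec=5)
print(f"last prefetch starts at t={plan[-1][0]:.1f}s")
```

At 5 prefetches per second the 50-tenant burst drains in about 10 seconds, which bounds the eviction rate the hot tier has to absorb.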
A typical hot/warm/cold ratio
For a SaaS workload with 10,000 customer adapters:
- ~50 adapters hot in HBM (top 0.5% by recent activity).
- ~2,000 adapters warm in CPU RAM.
- ~10,000 adapters cold in object storage.
- Cold-load rate at steady state: <0.5% of requests, with caching tuned correctly.
Per-tenant SLA isolation and fairness scheduling
The hardest part of multi-tenant serving isn't the kernels — it's preventing one customer from ruining the experience for everyone else.
The noisy-neighbour problem
Customer A sends 1000 requests in 10 seconds. Without isolation:
- A's adapter saturates HBM caching.
- A's requests fill the inference queue.
- Other customers see 5–10× latency increase.
- p99 latency across all tenants spikes.
Fair scheduling techniques
Weighted fair queueing (WFQ). Each tenant has a weight; requests are dequeued in proportion to weights. Production-grade implementations exist as gateway middleware (Envoy, Kong, custom).
Token bucket per tenant. Each tenant has a refill rate and burst capacity. Requests exceeding the bucket are queued or rejected. Common pattern: 50 RPS sustained, 200 RPS burst for paid tier; 5 RPS sustained, 20 RPS burst for free tier.
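A token bucket fits in a few lines. This sketch takes the clock as an argument so it is easy to test, and it interprets the "burst" figure as the bucket capacity, which is an assumption about how the tiers above are defined.

```python
class TokenBucket:
    """Per-tenant token bucket: refills at `rate_per_sec`, holds up to `burst`."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start full
        self.updated = now

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


paid = TokenBucket(rate_per_sec=50, burst=200)
admitted = sum(paid.allow(now=0.0) for _ in range(1000))
print(f"admitted {admitted} of 1000 requests in the initial burst")
```

The gateway keeps one bucket per tenant and either queues or rejects whatever `allow` turns away.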
Quality-of-service tiers.
- Platinum: dedicated reserved compute, guaranteed sub-500ms p99.
- Gold: shared compute, fair-queue priority, sub-1s p99 best-effort.
- Silver: shared compute, lower priority, sub-3s p99 best-effort.
- Bronze: batch tier, no real-time SLA.
Admission control. When the queue depth exceeds a threshold, new requests from low-priority tenants are rejected with a 503. Prevents queue collapse.
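Admission control can be as small as a per-tier queue-depth threshold. In this sketch the thresholds are hypothetical, and tiers without a threshold are never shed.

```python
def admission_status(tier, queue_depth, shed_thresholds):
    """Return the HTTP status for a new request: 200 to admit, 503 to shed."""
    limit = shed_thresholds.get(tier)  # None means this tier is never shed
    if limit is not None and queue_depth > limit:
        return 503
    return 200


# Hypothetical policy: shed bronze first, then silver as the queue deepens.
thresholds = {"bronze": 100, "silver": 200}
for tier in ("bronze", "silver", "gold"):
    print(tier, admission_status(tier, queue_depth=150, shed_thresholds=thresholds))
```

Staggering the thresholds means the lowest tier absorbs overload first and higher tiers only feel it if the queue keeps growing.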
Latency-aware routing. A gateway monitors per-instance latency and routes around overloaded workers. Works well when you have spare capacity; fails when the whole fleet is hot.
A worked SLA isolation example
A platform with three tenant tiers, on 8× H100 (4 inference workers):
- Platinum tenants (10 customers): 50% of compute reserved, p99 < 500 ms guaranteed.
- Gold tenants (100 customers): 35% of compute, p99 < 1 s best-effort.
- Free tenants (10,000 customers): 15% of compute, p99 < 5 s best-effort, request rejection above threshold.
How it's enforced:
- Token bucket per tenant in the gateway.
- WFQ in the inference queue, with tier weights 10:3:1.
- Admission control at queue depth >100; free tenants rejected first.
- Per-tier dashboards: p99 latency, request success rate, queue depth.
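The weighted queue in this setup can be sketched with virtual finish times. The code below is a self-clocked approximation of WFQ, not a production scheduler: each request gets a finish tag of start + cost / weight, and the queue always serves the smallest tag. The tier weights are the 10:3:1 from the example; everything else is illustrative.

```python
import heapq
import itertools
from collections import Counter

TIER_WEIGHTS = {"platinum": 10, "gold": 3, "free": 1}


class WeightedFairQueue:
    """Self-clocked fair queue keyed on virtual finish time."""

    def __init__(self, weights):
        self.weights = weights
        self.last_finish = {tier: 0.0 for tier in weights}
        self.virtual_time = 0.0
        self.heap = []
        self.order = itertools.count()  # tie-breaker: arrival order

    def enqueue(self, tier, request, cost=1.0):
        # A backlogged tier continues from its previous finish tag;
        # an idle tier starts from the current virtual time.
        start = max(self.virtual_time, self.last_finish[tier])
        finish = start + cost / self.weights[tier]
        self.last_finish[tier] = finish
        heapq.heappush(self.heap, (finish, next(self.order), tier, request))

    def dequeue(self):
        finish, _, tier, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return tier, request


queue = WeightedFairQueue(TIER_WEIGHTS)
for tier in TIER_WEIGHTS:
    for i in range(20):                 # every tier is backlogged
        queue.enqueue(tier, f"{tier}-{i}")

served = Counter(queue.dequeue()[0] for _ in range(14))
print(dict(served))
```

With all three tiers backlogged, the first 14 dequeues split 10 platinum, 3 gold, 1 free, matching the weights. Passing a token count as `cost` makes long generations consume proportionally more of a tier's share.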
Worked SLA example: a hospital tenant on a healthcare AI platform
A healthcare AI platform serves 200 hospital customers via multi-tenant LoRA on Llama-3.3 70B (4× H200 cluster).
- Hospital A — Platinum tier, $20k/mo contract. SLA: p99 < 800 ms, 99.95% uptime, dedicated burst capacity for emergency-department surges.
- Hospital B — Gold tier, $5k/mo. SLA: p99 < 2 s, 99.9% uptime.
- Hospital C — Trial tier, free. No SLA; service-level "best effort."
When Hospital A's ED hits a major incident (3× normal traffic for 2 hours), the platform's response:
- Hospital A's token bucket allows the burst (paid tier capacity).
- WFQ moves their requests to front of queue.
- Free tier (Hospital C and others) sees increased latency, possibly admission-control rejections.
- Gold tier sees minor p99 lift (e.g., 1.4 s vs baseline 1.0 s) but stays within SLA.
- Hospital A's experience: no degradation — pays for the priority.
Implementation: WFQ in the gateway with tier weights 100:10:1, per-tenant token buckets, admission-control thresholds. Total dev cost for the tier system: 2 engineers × 6 weeks. The platform charges roughly 4× what the cost of dedicated infrastructure would imply, because priority and isolation are worth more to these customers than the underlying compute.
KV-cache fairness
Beyond compute fairness: KV cache memory is a finite resource per worker. One tenant with very long contexts can monopolise KV cache, starving others. Mitigations:
- Per-tenant KV cache budget (vLLM supports limits via --max-num-seqs and adjacent settings).
- Eviction policy that prefers evicting low-tier tenants' KV state first.
- Hard timeout on stalled requests (free tier's 10-minute idle gets evicted before paid tier's).
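A tier-aware eviction order can be expressed as a sort key: lowest tier first, and within a tier the longest-idle sequence first. The record fields and tier ranking below are hypothetical.

```python
TIER_RANK = {"free": 0, "gold": 1, "platinum": 2}  # lower rank is evicted first


def pick_kv_victim(sequences):
    """Choose which sequence's KV state to evict.

    Each sequence is a dict with 'id', 'tier', and 'idle_seconds'.
    """
    victim = min(sequences,
                 key=lambda s: (TIER_RANK[s["tier"]], -s["idle_seconds"]))
    return victim["id"]


live = [
    {"id": "a", "tier": "platinum", "idle_seconds": 900},
    {"id": "b", "tier": "free", "idle_seconds": 30},
    {"id": "c", "tier": "free", "idle_seconds": 600},
]
print(pick_kv_victim(live))
```

Here the free-tier sequence idle for 600 seconds goes first, even though a platinum sequence has been idle longer.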
Customer-onboarding flow: training to serving
The product flow for adding a new customer fine-tune. Six steps, each with operational nuance.
Step 1: data upload and validation
Customer uploads training data — typically JSONL with prompt/response pairs. Platform validates:
- File size and row count (minimum 100 examples; warn at <500; reject at >1M).
- Schema (correct fields, valid JSON).
- Token counts per example (no example exceeding context length).
- Content policy scan (no obvious PII unless customer is on a data-residency tier; no obviously toxic content per terms of service).
- Format consistency (mix of styles or formats hurts training; warn).
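The row-count and schema checks above are straightforward to script. This sketch assumes each line is a JSON object with string fields named "prompt" and "response" (the field names are an assumption); token-count, PII, and format-consistency checks are left out.

```python
import json

REQUIRED_FIELDS = ("prompt", "response")  # assumed field names


def validate_jsonl(lines, min_rows=100, warn_rows=500, max_rows=1_000_000):
    """Return (errors, warnings) for an iterable of JSONL lines."""
    errors, warnings = [], []
    rows = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        rows += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [f for f in REQUIRED_FIELDS
                   if not isinstance(record.get(f), str)]
        if missing:
            errors.append(f"line {lineno}: missing or non-string {missing}")
    if rows < min_rows:
        errors.append(f"only {rows} examples; minimum is {min_rows}")
    elif rows < warn_rows:
        warnings.append(f"{rows} examples; {warn_rows}+ is recommended")
    if rows > max_rows:
        errors.append(f"{rows} examples exceeds the {max_rows} limit")
    return errors, warnings


sample = [json.dumps({"prompt": f"q{i}", "response": f"a{i}"}) for i in range(120)]
print(validate_jsonl(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so large uploads are validated without loading them into memory.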
Step 2: training-config selection
Customer picks (or platform picks for them):
- Base model: usually fixed by platform, sometimes selectable from a small menu.
- Rank: default 16 or 32, adjustable up to 64.
- Target modules: default attention + FFN, adjustable.
- Epochs: typically 3.
- Learning rate: platform default with override allowed.
Most platforms hide all these behind "Quick" / "Standard" / "Premium" preset levels.
Step 3: training execution
A separate training compute pool (typically 1–4 H100s) handles the training:
- Spin up a worker on demand (or pull from a hot pool).
- Run training with the customer's data + chosen config.
- Periodic checkpoint to S3.
- Total wall time: 20 minutes – 4 hours depending on dataset size and rank.
Step 4: evaluation
Auto-eval against:
- Held-out customer split (20% of their data not seen during training).
- Platform-provided "general capability" eval (HumanEval, IFEval, etc. — make sure the adapter didn't break general capabilities).
- Customer-specific eval if provided.
If quality regresses vs the base, block promotion and notify customer.
Step 5: canary deployment
Adapter goes live for 1–10% of the customer's traffic. The remaining traffic uses the customer's previous adapter or the base model. Real-user metrics (latency, success rate, customer feedback if any) collected for 24–72 hours.
Step 5b: cost guardrails during training
A common failure: customer uploads a huge dataset and triggers a $2,000 training run. Surprise bills shred trust. The mitigation: a cost estimate before training starts, with explicit confirmation. The estimate is computable from dataset size, base model size, rank, and epoch count.
Example estimate display: "Training a rank-32 all-linear LoRA on Llama-3.3 70B over your 50,000 examples for 3 epochs is estimated to take 6 hours on 4× H100 at $25/hour each = $600. Confirm or adjust parameters." This sounds heavy but is a one-time fee; customers compare it to the recurring savings from a working fine-tune.
Most platforms also enforce hard caps: maximum 1M training tokens for free-tier accounts; maximum 100M tokens for standard; unlimited for enterprise. Above the cap, training is rejected with a "contact sales" link.
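A pre-flight estimate along these lines needs a throughput figure. The sketch below back-derives one from the example above (50,000 examples × 3 epochs in 6 hours is 25,000 example-passes per hour on 4× H100); treat that number as a placeholder to replace with measured throughput for your base model and rank. The cap table mirrors the tiers just described.

```python
TOKEN_CAPS = {"free": 1_000_000, "standard": 100_000_000, "enterprise": None}


def estimate_training(examples, epochs, passes_per_hour, gpus, rate_per_gpu_hour):
    """Return (wall_clock_hours, dollars) for a training run."""
    hours = examples * epochs / passes_per_hour
    return hours, hours * gpus * rate_per_gpu_hour


def within_token_cap(tier, training_tokens):
    """True if the run is allowed under the tier's hard cap (None = unlimited)."""
    cap = TOKEN_CAPS[tier]
    return cap is None or training_tokens <= cap


hours, cost = estimate_training(examples=50_000, epochs=3,
                                passes_per_hour=25_000,  # placeholder throughput
                                gpus=4, rate_per_gpu_hour=25.0)
print(f"estimated {hours:.0f} h, ${cost:,.0f}")
```

Showing the estimate and checking the cap before any GPU is allocated is what prevents the surprise-bill scenario.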
Step 6: full promotion
If canary passes, adapter goes to 100% traffic. Previous version retained for rollback for 30+ days. New adapter added to the hot/warm/cold tier rotation.
Total time from upload to live: 1–6 hours
Most platforms in 2026 land between 1 and 2 hours for a typical 5k-example LoRA on a 7B–70B base. Longer for larger bases, larger datasets, or stricter canary windows.
Enterprise multi-tenant: RBAC, audit, compliance
When the customers are enterprises rather than individual developers, the operational requirements multiply.
Role-based access control (RBAC)
Within a customer's workspace, multiple users with different permissions:
- Admin: create/delete adapters, manage billing, view all data.
- Trainer: upload data, train new adapters, view training metrics.
- User: query the deployed adapter, view their own usage.
- Auditor: read-only across the workspace including training data.
Enterprise platforms in 2026 (OpenAI Enterprise, Anthropic Enterprise, AWS Bedrock fine-tuning, Vertex AI Tuning) all expose role hierarchies that map to existing IAM (SSO via SAML/OIDC, group provisioning via SCIM).
Audit logs
Every meaningful action logged:
- Training data upload (who, when, file hash).
- Training run (start, end, hyperparameters, eval results).
- Adapter promotion (canary→prod, version, approver).
- Adapter query (who, when, prompt summary, response summary, tokens used).
- Adapter deletion or version rollback.
Retention typically 7+ years for SOC 2 / ISO 27001 compliance. Audit logs are themselves a security-sensitive dataset; access via separate role.
Data residency
Enterprise contracts often require training data and adapters to never leave a specified region (EU, US-only, in-country). Multi-tenant platforms support this by:
- Region-pinned training pools.
- Region-pinned object storage for adapter weights.
- Region-pinned inference fleets.
- Adapter portability disabled cross-region.
PII handling
For sensitive data (healthcare, financial), platforms offer:
- Pre-training PII scanning (Presidio, custom DLP).
- Differential-privacy LoRA training (lower utility, stronger guarantees against extraction attacks).
- Memorisation audits post-training (probe the model with training-data prefixes; check it doesn't complete them verbatim).
- BAA (Business Associate Agreement) for HIPAA; SOC 2 Type II; ISO 27001 / 27017 / 27018 attestations.
Multi-region adapter replication
Enterprise customers spanning regions often need their adapter accessible in multiple geographies. Patterns:
- Master-region adapter store with edge replication. Train in the customer's home region; replicate the adapter file to other regions for serving. Adapter loads stay local; quality is consistent (same weights).
- Per-region inference fleets. A customer's adapter is registered with multiple regional fleets. The gateway routes by user location.
- Cross-region failover. If the home region's inference fleet goes down, traffic fails over to a peer region. Adapter must already be present there; pre-replicate.
Compliance complications: a contract requiring "EU-only data" forbids replication to the US even for failover. Multi-region deployments need careful policy enforcement.
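One way to enforce this is to filter replication targets against the contract's residency policy before any copy starts. The policy names and the region lists below are illustrative only, not a statement of what any contract or cloud provider defines.

```python
# Hypothetical policy table: residency policy -> regions allowed to hold the adapter.
RESIDENCY_REGIONS = {
    "eu-only": {"eu-west-1", "eu-central-1"},
    "us-only": {"us-east-1", "us-west-2"},
}


def allowed_replica_regions(policy, candidate_regions):
    """Filter replication or failover targets by the tenant's residency policy."""
    if policy is None:
        return list(candidate_regions)  # no residency restriction
    allowed = RESIDENCY_REGIONS[policy]
    return [region for region in candidate_regions if region in allowed]


targets = ["eu-west-1", "us-east-1", "eu-central-1"]
print(allowed_replica_regions("eu-only", targets))
```

Running the same filter on failover routing as on replication keeps an outage from becoming a compliance incident.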
Tenant isolation at the kernel level
Beyond logical isolation (auth → tenant ID → adapter selection), enterprise platforms often offer cryptographic and physical separation:
- Tenant-specific encryption keys for adapter storage (customer-managed keys via KMS).
- Per-tenant inference workers in dedicated tiers (no shared GPU with other tenants).
- VPC-isolated deployments (the adapter never lives on shared infrastructure).
These cost more — dedicated inference at AWS Bedrock or Azure OpenAI's "Provisioned Throughput Units" is 5–10× the shared-tier price — but unlock regulated-industry contracts.
GPU-class economics for adapter serving
The right GPU for multi-tenant LoRA depends on base model size and adapter count.
Small bases (≤8B): L40S sweet spot
For Llama-3.1 8B and similar, the L40S (48 GB GDDR6, $1.50–$3/hour rental) is the cost leader. The 8B base at FP16 fits in ~16 GB, leaving 32 GB for adapters + KV cache. Throughput per dollar beats H100 for small-base multi-tenant workloads.
Tradeoffs:
- L40S has no NVLink — multi-GPU inference is bottlenecked by PCIe (~32 GB/s vs NVLink's 900 GB/s on H100).
- Memory bandwidth is GDDR6 (864 GB/s) vs HBM3 (3.35 TB/s on H100). Decode-heavy workloads bottleneck on bandwidth.
Mid bases (32B–70B): H100/H200 default
For Llama-3.3 70B, Qwen2.5 32B, Mistral Small 22B, the H100 SXM (80 GB) at $15–$25/hour is the workhorse. H200 (141 GB) is preferred for higher adapter density (more HBM = more hot adapters). The 2-GPU H100 tensor-parallel deployment at FP8 is the production standard.
Large bases (405B): H200/B200
Llama-3.1 405B at FP8 needs ~400 GB of HBM for weights alone. 8× H100 (640 GB total) handles it with thin KV cache margin. 8× H200 (1128 GB) is comfortable. B200 (1536 GB across 8 GPUs) is generous.
For multi-tenant 405B at scale: B200 is the right call. NVFP4 on B200 fits the 405B base in roughly 200–230 GB across the node (under 30 GB per GPU at TP=8), leaving 160+ GB per GPU for adapters and KV cache. Per-token serving cost approaches what you'd pay for 70B on H100.
GH200 / GB200 — when massive memory matters
GH200 (Grace Hopper) and GB200 (Grace Blackwell) pair a Grace ARM CPU with a Hopper GPU (GH200) or Blackwell GPUs (GB200) over NVLink-C2C (900 GB/s). The 480 GB LPDDR5X on the CPU side acts as extended GPU memory. For multi-tenant LoRA, this means:
- The warm-tier of adapters can live in Grace's LPDDR5X, accessible at memory speeds.
- Tens of thousands of adapters warm-accessible per node, not just thousands.
- Cold loads from S3 become rare; warm cache is huge.
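GB200 NVL72 racks (72× B200 + 36× Grace) take this to rack scale: roughly 31 TB of fast memory in a single NVLink domain (about 13.8 TB of HBM at 192 GB per B200, plus about 17 TB of LPDDR5X at 480 GB per Grace), enough to keep a very large fleet of adapters warm without touching object storage.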
Hardware cost vs adapter capacity table
| Hardware | Base size sweet spot | Hot adapters (rank-16, all-linear) | Hourly rental | Per-adapter $/month |
|---|---|---|---|---|
| 1× L40S | 8B | ~150 | $2 | $9.60 |
| 1× H100 80 GB | 8B | ~500 | $4 | $5.76 |
| 1× H100 80 GB | 32B | ~150 | $4 | $19.20 |
| 2× H100 (TP=2) | 70B | ~100 | $8 | $57.60 |
| 4× H100 (TP=4) | 70B | ~300 | $16 | $38.40 |
| 4× H200 (TP=4) | 70B | ~600 | $24 | $28.80 |
| 8× H100 (TP=8) | 405B | ~80 | $32 | $288 |
| 8× B200 (TP=8) | 405B at NVFP4 | ~400 | $60 | $108 |
| 1× GH200 | 8B with 10k warm | thousands warm | $4 | <$1 |
The right-most column — per-adapter $/month at 100% utilisation — is the production unit-economics number that determines pricing for fine-tuning tiers.
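The per-adapter column can be reproduced with one line of arithmetic. The sketch below assumes a 720-hour month (30 days × 24 hours), which is the convention the table's figures are consistent with; the function name and the row labels are illustrative only.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def per_adapter_monthly_cost(hourly_rental_usd: float, hot_adapters: int) -> float:
    """Monthly hardware rental split evenly across hot adapters at 100% utilisation."""
    if hot_adapters <= 0:
        raise ValueError("hot_adapters must be positive")
    return hourly_rental_usd * HOURS_PER_MONTH / hot_adapters


# Three rows taken from the table above: (label, hourly rental, hot adapters).
rows = [
    ("1x L40S, 8B base", 2, 150),
    ("1x H100, 8B base", 4, 500),
    ("8x B200, 405B at NVFP4", 60, 400),
]
for label, hourly, adapters in rows:
    cost = per_adapter_monthly_cost(hourly, adapters)
    print(f"{label}: ${cost:.2f} per adapter per month")
```

Because the figure assumes full utilisation, a fleet running at half load effectively doubles the per-adapter number.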
Cost-per-tenant break-even analysis
For a typical SaaS offering customer-specific fine-tunes, the unit economics:
| Scenario | GPU cost | Tenants served | Cost per tenant | Customer-price | Margin |
|---|---|---|---|---|---|
| 4× H100, 70B base, 100 tenants | $11,520/mo | 100 | $115 | $200 | 42% |
| 4× H200, 70B base, 300 tenants | $17,280/mo | 300 | $58 | $150 | 61% |
| 8× B200, 405B base, 200 tenants | $43,200/mo | 200 | $216 | $400 | 46% |
| 1× L40S, 8B base, 50 tenants | $1,440/mo | 50 | $29 | $50 | 42% |
The economics work because of the multiplier on tenant count. Below ~30 tenants per cluster, multi-tenant doesn't beat dedicated serving by enough to justify the operational complexity. Above 100 tenants per cluster, the per-tenant cost drops below most sensible price points and the business turns into a money printer.
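The break-even table can also be inverted to ask how many tenants a cluster needs before it reaches a target margin. A rough sketch using only the GPU cost and price figures from the table; `margin` and `min_tenants_for_margin` are hypothetical helper names, and costs other than GPU rental are ignored.

```python
import math


def margin(price_per_tenant: float, cost_per_tenant: float) -> float:
    """Gross margin as a fraction of the customer price."""
    return (price_per_tenant - cost_per_tenant) / price_per_tenant


def min_tenants_for_margin(monthly_gpu_cost: float, price_per_tenant: float,
                           target_margin: float) -> int:
    """Smallest tenant count at which the cluster reaches target_margin."""
    if not 0.0 <= target_margin < 1.0:
        raise ValueError("target_margin must be in [0, 1)")
    max_cost_per_tenant = price_per_tenant * (1.0 - target_margin)
    return math.ceil(monthly_gpu_cost / max_cost_per_tenant)


# 4x H100, 70B base: $11,520/month, $200 price, 100 tenants.
print(round(margin(200, 11_520 / 100), 2))          # 0.42
print(min_tenants_for_margin(11_520, 200, 0.42))    # 100
```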
How adapter density affects unit economics
The 2026 reality of adapter density across hardware tiers:
- 8B base, 50 adapters per L40S. Per-adapter cost: $29/mo at $2/hr. Viable for small SaaS at $50–$100/mo price.
- 70B base, 100 adapters per 2× H100. Per-adapter cost: $58/mo at $4/hr per GPU. Viable for mid-market at $200/mo price.
- 70B base, 600 adapters per 4× H200. Per-adapter cost: $29/mo. Viable for high-volume SaaS at $50/mo.
- 405B base, 100 adapters per 8× B200. Per-adapter cost: $432/mo. Only viable for enterprise customers at $1000+/mo.
The denser the adapter pool, the lower the per-adapter cost; this is what makes "your fine-tune at $50/mo" a viable product even on $50k/month GPU infrastructure.
Why serving cost doesn't scale linearly with rank
A subtle but important point: doubling adapter rank doesn't double the throughput hit on serving. The segment-aware LoRA GEMM is small compared to base-model GEMMs even at high rank. Relative serving cost is roughly 1 + (LoRA_FLOPs / Base_FLOPs); the overhead term is a few percent at rank 16 and still under 10% at rank 128.
So if eval shows rank 64 is meaningfully better than rank 16, take it — the serving cost difference is in the noise.
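The FLOP ratio behind that claim is easy to check for a single adapted projection. The sketch below counts multiply-adds per token for a d_in × d_out base matrix and for the two low-rank factors; the 4096 × 4096 shape is an assumed size typical of 7B–8B models, and the function name is illustrative.

```python
def lora_flop_overhead(d_in: int, d_out: int, rank: int) -> float:
    """LoRA FLOPs divided by base FLOPs for one linear layer, per token."""
    base_flops = 2 * d_in * d_out           # x @ W, one multiply-add per weight
    lora_flops = 2 * rank * (d_in + d_out)  # x @ A (d_in x r), then @ B (r x d_out)
    return lora_flops / base_flops


for rank in (16, 64, 128):
    overhead = lora_flop_overhead(4096, 4096, rank)
    print(f"rank {rank}: {overhead:.2%} extra FLOPs on a 4096x4096 projection")
```

This is a pure FLOP count. Small GEMMs use the GPU less efficiently than large ones, so measured overhead tends to sit above this ratio, but the scaling with rank stays mild.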
When to bump rank vs target more modules
Two ways to give a LoRA more capacity: increase rank (deeper adaptation per module) or include more modules (wider adaptation across the network). They're not equivalent.
- Bump rank when the task requires fine-grained adaptation of specific behaviours (e.g., specific writing style, specific reasoning patterns).
- Add modules (FFN, embeddings, output) when the task spans many capabilities (e.g., domain-wide adaptation, multi-task fine-tuning).
Empirically: for narrow style tasks, rank 64 attention-only beats rank 16 all-linear. For broad domain adaptation, rank 16 all-linear beats rank 64 attention-only. The eval set determines which axis matters.
LoRA vs FT vs RAG vs few-shot: decision matrix
When should you customise via what mechanism? A practical decision matrix.
| Need | Best tool | Why |
|---|---|---|
| Style/tone adaptation | LoRA | Captures linguistic style efficiently |
| Domain vocabulary | LoRA or fine-tuning | Bake in terminology |
| Up-to-date facts | RAG | Inject current info per query |
| Bounded narrow task | LoRA on small base | Saves cost vs prompting larger model |
| Multi-step reasoning | Few-shot or reasoning model | Hard to bake in via LoRA |
| Format compliance | LoRA + JSON-mode | Structural outputs |
| Brand voice + product docs | LoRA + RAG | Combine style and freshness |
| Compliance-driven refusals | System prompt + safety LoRA | Layered defence |
| Personalisation per user | Few-shot or thin per-user RAG | LoRA at user granularity rarely viable |
| Per-tenant customisation | LoRA + RAG | Production default |
When LoRA isn't enough
Three cases where full fine-tuning beats LoRA materially:
- New language families. Adding Vietnamese or Swahili capability to a model that wasn't trained on much of those languages. The token-level distribution shift exceeds LoRA's capacity at reasonable ranks.
- Massive task reorientation. Repurposing a chat model as a code reasoning model. Too much of the base behaviour needs to change.
- Distillation. Distilling a frontier model into a smaller one. Full fine-tuning of the small model on outputs from the large one beats LoRA.
For most other use cases, LoRA + adequate training data beats full fine-tuning on the cost-quality frontier.
Cost-quality crossover at different volumes
For a fixed quality target, the cheapest customization technique depends on monthly query volume:
| Volume | Cheapest path |
|---|---|
| <10k queries/month | Few-shot in system prompt |
| 10k–100k queries/month | RAG over customer corpus |
| 100k–1M queries/month | LoRA fine-tune + RAG |
| 1M–10M queries/month | LoRA + small base + RAG |
| >10M queries/month | Full fine-tune + RAG, or smaller distilled model |
The break-points shift with model price changes. As frontier prices drop, the volume where few-shot wins extends upward; as smaller-base fine-tuning gets easier, the LoRA threshold expands.
The hybrid pattern (LoRA + RAG)
In 2026 production, the most common pattern for per-customer products:
- LoRA shapes the model's style, tone, terminology, and basic behaviour. Trained once per customer (or per customer-segment), retrained occasionally.
- RAG provides current facts, customer-specific knowledge, and references. Updated as customer data changes.
- Few-shot examples in the system prompt handle edge cases not worth fine-tuning for.
This composes cleanly: LoRA changes the model; RAG changes the input; few-shot examples travel in the same prompt context as the retrieved passages. The serving stack handles all three uniformly.
Real production deployments in 2026
A walk through what multi-tenant LoRA looks like at the major commercial deployments running at scale in 2026.
OpenAI fine-tuning
OpenAI's fine-tuning API supports GPT-4o-mini, GPT-4o, GPT-3.5-turbo, and (in preview) GPT-5. The underlying architecture is plausibly multi-tenant LoRA: training is cheap per token and fine-tuned inference is billed per token with no dedicated-instance fee, which matches the economics described in this guide. OpenAI has not disclosed the technical details, so treat this as an inference from pricing rather than a confirmation.
Anthropic fine-tuning
Anthropic offers Claude fine-tuning to enterprise customers (Claude 3.5 Haiku, Claude 3.7 Sonnet as of 2026). The pricing is per-token at base-model rates, no separate fine-tune-instance cost — characteristic of multi-tenant LoRA serving.
AWS Bedrock model customisation
Bedrock supports fine-tuning for Llama 3.x, Titan, Cohere, and others. The customisation is LoRA-based; serving is either on-demand (multi-tenant) or through dedicated Provisioned Throughput.
Vertex AI Tuning
Google's Vertex AI Tuning supports LoRA fine-tuning of Gemini, PaLM, and select open-weight models. The serving infrastructure is multi-tenant (shared base, per-customer adapters), with optional dedicated endpoints for guaranteed throughput.
Together AI and Fireworks AI
Both offer LoRA fine-tuning on open-weight models (Llama, Qwen, Mistral, DeepSeek). Inference pricing per-token at base-model rates regardless of how many fine-tunes the customer maintains. Both companies have publicly described their multi-LoRA architectures in conference talks and blog posts.
Predibase
Founded specifically for multi-tenant LoRA serving with their LoRAX serving stack. Predibase customers (typically mid-market AI products) deploy thousands of customer-specific adapters via a managed multi-tenant cluster. Their published case studies describe deployments with 10,000+ adapters per cluster at sub-100ms p99 latency.
Cohere
Cohere offers fine-tuning of Command-R+ and Command-R models with multi-tenant serving. The Rerank model also supports per-customer fine-tunes.
RunPod, Replicate, Modal
Newer per-second-priced compute platforms increasingly offer multi-LoRA serving as a managed service. RunPod's "Serverless LoRA" pattern lets developers bring their own adapters to a shared base model fleet, paying only for the seconds of inference.
Hugging Face Inference Endpoints
Inference Endpoints now offer multi-LoRA serving as a managed option. Customers deploy a base model endpoint, then add LoRA adapters via the API. Pricing per-second of endpoint runtime regardless of adapter count. Good fit for smaller deployments (10–100 adapters).
Modal Labs
Modal's serverless GPU platform supports multi-LoRA via custom server functions. Developers bring a base model image and load adapters per request. Pricing per-second of GPU time; idle workers scale to zero. Sweet spot: variable workload with infrequent requests across many adapters.
Replicate
Replicate offers per-second GPU billing. Their multi-LoRA story is strongest for image bases (SDXL, FLUX.1), where their LoRA registry sees the heaviest community usage at consumer scale; LLM multi-LoRA on Replicate is supported but is less often the canonical path.
Mosaic / Databricks
Databricks Model Serving supports multi-LoRA for foundation models via their serving stack. Tight integration with Databricks Lakehouse — training data lives in Unity Catalog, adapters served from MLflow registry. Used heavily for internal enterprise fine-tunes.
Enterprise self-hosted
Larger enterprises (financial services, government, healthcare) deploy multi-LoRA stacks on their own GPUs. Common stacks: vLLM + custom adapter store + LiteLLM gateway. The work is the operational integration with internal IAM, audit, and compliance systems — not the serving stack itself.
Case study: image-generation multi-LoRA at scale
The pattern isn't limited to LLMs. Stable Diffusion XL and FLUX.1 LoRAs are also served multi-tenant via stacks like Replicate's, fal.ai's, and Civitai's. The economics are similar: keep one large base model resident on GPU; load small per-style adapters (often 10–50 MB) on demand.
A few differences from LLM multi-LoRA:
- Image LoRAs are smaller relative to the base (an SDXL LoRA is ~30 MB vs ~6 GB base).
- Image generation is compute-bound (long forward passes), so the per-request adapter overhead is proportionally smaller.
- Inference batches are smaller (1–4 images per request typical), so the segmented GEMM pattern matters less.
Production stacks like Diffusers (Hugging Face), ComfyUI workflows, and custom servers all support multi-LoRA for image models with similar mechanics to vLLM's for LLMs.
Internal vs external multi-tenant LoRA at large companies
Big tech companies running large LLM fleets internally often deploy multi-LoRA for organisation-level personalisation:
- One LoRA per major business unit (Marketing, Engineering, Sales).
- One LoRA per common task pattern (customer-email-reply, sales-call-summary).
- Shared base, hundreds of internal adapters.
The economics here are different — there's no per-customer billing, but the operational discipline is the same. Internal multi-tenant tends to be sloppier on monitoring (because failures don't lose revenue directly), tighter on integration with internal IAM, and looser on quality eval.
Customer onboarding deep dive: from upload to GA
The product flow from a customer signing up to a fully promoted, generally-available fine-tune touches a surprising number of subsystems. The reference flow used by mature platforms in 2026:
Step 0: account and base-model selection
The customer creates a workspace and picks a base model. Most platforms offer a small menu (e.g., Llama-3.1 8B, Llama-3.3 70B, Qwen-2.5 32B). Picking too large a base costs more and is rarely worth it for narrow customizations; the UI should nudge toward smaller models with a "you can upgrade later" affordance.
Step 1: training data upload UI
A reasonable upload UI accepts JSONL with explicit prompt/response fields, validates schema in the browser before the upload completes, and surfaces three numbers immediately: total examples, total tokens, estimated training cost. Estimated cost is computed deterministically from dataset tokens, base size, rank, and epochs — there's no excuse for surprise bills if the estimate is shown up front.
Step 2: preflight validation
Server-side checks:
- Schema and JSON validity.
- Token counts per example (warn if any exceed context length minus a margin).
- Distribution checks: warn if response lengths are extremely bimodal or if the dataset contains <50 unique prompts.
- PII scan (Presidio or equivalent); offer to redact or warn if PII is present in unexpected places.
- Content-policy scan; block obviously prohibited training material.
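The structural checks in this list (schema, token counts, prompt diversity) fit in a few dozen lines. The sketch below is a minimal version under stated assumptions: whitespace splitting stands in for the base model's tokenizer, the 8192-token context and 256-token margin are placeholder values, and the PII and content-policy scans are left out because they depend on external tools.

```python
import json
from collections import Counter


def preflight(jsonl_text: str, context_length: int = 8192, margin: int = 256,
              min_unique_prompts: int = 50) -> dict:
    """Run schema, length, and diversity checks over a JSONL training upload."""
    errors, warnings = [], []
    prompts = Counter()
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict) or not all(
            isinstance(row.get(key), str) for key in ("prompt", "response")
        ):
            errors.append(f"line {lineno}: need string 'prompt' and 'response' fields")
            continue
        # Assumption: whitespace splitting approximates the real tokenizer.
        n_tokens = len(row["prompt"].split()) + len(row["response"].split())
        if n_tokens > context_length - margin:
            warnings.append(f"line {lineno}: ~{n_tokens} tokens is near the context limit")
        prompts[row["prompt"]] += 1
    if len(prompts) < min_unique_prompts:
        warnings.append(f"only {len(prompts)} unique prompts (minimum {min_unique_prompts})")
    return {"ok": not errors, "examples": sum(prompts.values()),
            "errors": errors, "warnings": warnings}


report = preflight('{"prompt": "hi", "response": "hello"}\nnot json')
print(report["ok"], report["errors"], report["warnings"])
```

Errors block the job; warnings are surfaced to the customer, which mirrors the warn-versus-block split in the list above.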
Step 3: training config preview
Customer sees the chosen recipe (rank, target modules, learning rate, epochs) with explanations. Advanced users can override; default flow keeps these hidden.
Step 4: training execution and progress
Training runs in a separate compute pool. The customer sees a progress bar, the running validation loss curve, and ETA. Cancel-and-refund is supported up to a checkpoint.
Step 5: automated eval
Post-training, the adapter is evaluated against:
- The customer's held-out split (20% reserved at upload).
- A platform-provided general-capability eval (IFEval or similar, ~100 prompts, fast).
- A platform-provided safety eval (refusals on prohibited content, prompt-injection resistance).
- Optional: customer-provided eval set.
Results are shown as a scorecard. Quality regressions vs base are highlighted; safety regressions block promotion.
Step 6: A/B framework (canary)
The adapter is deployed to 1–10% of the customer's traffic. The remaining traffic uses the previous adapter or base. Production metrics (latency, success rate, customer-feedback signals if available) collected for 24–72 hours.
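One common way to split traffic for a canary like this is to hash a stable request key into a bucket, so the same user or session consistently sees the same adapter version. A minimal sketch; the function name, the key format, and the 10,000-bucket granularity are assumptions, not a description of any particular platform's router.

```python
import hashlib

BUCKETS = 10_000  # assumption: 0.01% routing granularity


def route_version(tenant_id: str, request_key: str, canary_fraction: float) -> str:
    """Deterministically map a request key to 'canary' or 'stable'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be in [0, 1]")
    digest = hashlib.sha256(f"{tenant_id}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "canary" if bucket < round(canary_fraction * BUCKETS) else "stable"


# Roughly 5% of distinct session keys should land on the canary adapter.
hits = sum(route_version("acme", f"session-{i}", 0.05) == "canary" for i in range(10_000))
print(f"canary share: {hits / 10_000:.1%}")
```

Keying on a session or user ID rather than a per-request ID keeps a conversation on one adapter version for the whole canary window.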
Step 7: full promotion (GA)
If canary metrics are healthy, the customer promotes to 100%. Old version retained for rollback for 30 days. New adapter is registered in the hot/warm/cold tier and traffic shifts.
Step 8: ongoing monitoring
The customer's adapter is tracked in the per-tenant quality dashboard:
- Daily eval against their held-out set.
- Drift detection on production inputs vs training inputs.
- Customer-facing feedback aggregation (thumbs-up/down, escalation rate).
- Cost and usage trends.
Alerts trigger on quality regression, drift past threshold, or sudden cost spike.
Step 9: retrain or deprecate
When eval shows quality degradation past a threshold, or the base model is being upgraded, the platform offers an automatic retrain using cached training data plus optionally recent production data. The customer approves; the cycle starts over at Step 4.
Operational cost of the onboarding flow
For a platform running this end-to-end with mostly automated handoffs, the cost per customer-onboarding is in the $50–$500 range (training compute dominates). Customer-success time is the biggest variable cost for enterprise customers who need help shaping their training data.
Deployment patterns: SaaS multi-tenant vs private cloud vs hybrid
Three deployment patterns dominate multi-tenant LoRA in 2026, each with different operational profiles.
Pattern A: pure multi-tenant SaaS
A single shared fleet serves all customers' adapters. Each request is routed by (auth, adapter_id) to the right base and adapter. This is the default for OpenAI, Anthropic, Together AI, Fireworks, Predibase, and most commercial fine-tuning platforms.
Operational properties:
- Highest density and lowest per-customer cost; viable at $50–$200/month price points.
- Shared blast radius — a kernel bug or HBM eviction storm affects many customers.
- Compliance ceiling — pure shared infrastructure rarely satisfies regulated-industry contracts that demand cryptographic isolation.
Pattern B: single-tenant private cloud
The platform deploys a dedicated cluster per customer (or per customer-segment). Same base, same adapter management software, but no other tenants share the workers. Used for healthcare, financial services, government, and large-enterprise customers that demand dedicated infrastructure.
Operational properties:
- Adapter density is much lower — one customer's 5 adapters do not justify an 8× H100 cluster on their own.
- Per-customer cost is 5–20× higher than shared.
- Compliance is straightforward — physical and logical isolation, no shared state.
- Common revenue tier: $5k–$50k/month per customer.
Pattern C: hybrid
Most customers run on the shared multi-tenant fleet. A handful of regulated or high-spend customers get dedicated clusters using the same software, deployed in the customer's VPC or in a regional isolation zone. The same control plane (adapter registry, training orchestration, eval) operates both.
Operational properties:
- Best of both worlds; standard for platforms serving both SMB and enterprise.
- Control-plane code paths must be tenant-isolation-aware; bugs in this layer cause cross-deployment issues.
- Most common in 2026 for ambitious mid-market platforms (Predibase, Cohere, Together AI for enterprise tier).
The hybrid pattern is what wins when a platform tries to be both broadly accessible and enterprise-compliant. The engineering investment in the control plane is meaningful — typically 3–6 months of platform work to get tenant-aware deployment, audit, and key management right.
Reference architecture for a hybrid platform
A reference deployment for a 5000-customer SaaS plus 50 enterprise clusters:
- Shared multi-tenant fleet: 64× H200 across 8 nodes, serving Llama-3.3 70B FP8 with 4000+ adapters total (200–600 hot per worker).
- Per-enterprise clusters: 4× H100 or 4× H200 per dedicated cluster, same software image, deployed in customer VPC.
- Control plane: Regional control planes (US-east, US-west, EU-west, AP-south) with replicated adapter store and global tenant directory.
- Training fleet: Shared pool of 32× H100 for training jobs, scheduled by priority across customers.
- Eval fleet: Smaller pool of GPUs running per-adapter eval on a rolling cadence.
- Gateway: LiteLLM-style proxy with per-tenant rate limits, WFQ, and routing to the right deployment for each customer.
Total infrastructure cost at 2026 prices: roughly $200k/month for the shared fleet and gateway, plus per-enterprise cluster costs (~$20–60k/month each for dedicated 4× H100/H200). Revenue at typical pricing supports this comfortably once customer count passes a few hundred shared plus ten or twenty enterprise.
MoE bases and LoRA: where the pattern breaks down
Mixture-of-experts bases (Mixtral 8x22B, DeepSeek-V3 at 671B with ~37B activated, Qwen MoE variants, the rumored MoE structure of GPT-4o-class models) make multi-tenant LoRA materially harder. The trouble is structural: a dense model routes every token through the same matrices, so a single LoRA on Q/V projections applies to every token uniformly. An MoE model routes each token to a subset of experts (typically 2–8 of 8–256), so a "single" LoRA on, say, expert 17's down projection only affects tokens that happened to be routed to expert 17. The adapter capacity-per-token is much smaller than the nominal parameter count suggests.
The 2026 landscape on this:
- Per-expert LoRA. Train one rank-r LoRA per expert. Adapter size scales linearly with expert count, so a 64-expert MoE adapter is roughly 64× larger than a dense-equivalent LoRA. For DeepSeek-V3 (256 routed experts), per-expert LoRA at rank 16 weighs in at multi-GB per adapter — close to full-fine-tune territory for the activated subset.
- Shared LoRA on attention only. Cheaper: apply LoRA only to attention QKVO (which is shared across experts) and skip FFN/expert matrices. Quality cost is real because most parameter mass in MoE lives in the experts, but for style or tone adaptation this can be enough.
- Routing-aware LoRA (MoLA-style). Research-stage approach: factor the adapter into a "router LoRA" plus a small expert-specific term. Reduces parameter count vs per-expert at modest quality cost. Cited examples include MoLA (Mixture of LoRA Experts) variants and routing-aware adapter papers from 2024–2025.
- Full fine-tune the gating network only. Some teams find that keeping experts frozen but fine-tuning the router + a small attention LoRA recovers most of the benefit for narrow customizations.
In 2026 production, MoE adapter serving is rare outside research deployments. Companies offering MoE-based fine-tuning (DeepSeek, Mistral with their MoE family, hyperscalers wrapping these models) typically either route fine-tunes to a dense sibling model or use dedicated full-fine-tune instances at higher price points. Multi-tenant LoRA on MoE will probably become standard by 2027–2028 as the kernels and recipes mature.
MoE-LoRA size comparison
For a 70B dense vs MoE-equivalent base, rank-16 all-linear LoRA:
| Base type | Adapter approach | Adapter size | Quality vs full FT |
|---|---|---|---|
| Llama-3.3 70B dense | Standard LoRA, all linear | ~200 MB | ~98% |
| Mixtral 8x22B (~141B total) | LoRA on attention only | ~60 MB | ~85% |
| Mixtral 8x22B | Per-expert LoRA, rank 16 | ~1.5 GB | ~95% |
| DeepSeek-V3 (671B / 37B activated) | LoRA on attention only | ~80 MB | ~80% |
| DeepSeek-V3 | Per-expert LoRA, rank 8 | ~6 GB | ~92% |
The adapter-size column is why dense bases dominate multi-tenant LoRA in 2026: a 70B dense base at 200 MB/adapter packs roughly 7× to 30× more adapters into the same HBM than a per-expert MoE adapter (1.5–6 GB each) on a similarly-priced cluster.
Catastrophic forgetting, overfit, and training pitfalls
The serving stack is the boring half of multi-tenant LoRA. The interesting half — and where most quality regressions originate — is the training pipeline. Three failure patterns dominate, plus a handful of subtler ones.
Overfit on tiny datasets
Customer uploads 80 examples of "the way we write support replies." A rank-32 LoRA at default learning rate and 3 epochs will memorize them. The adapter scores 100% on the training set, regurgitates verbatim phrases from those 80 examples for the next 6 weeks, and fails on every prompt that isn't structurally similar to a training example.
Production defenses:
- Gate training jobs on a minimum dataset size (typical: reject below 100 examples, warn between 100 and 500, pass without a warning at 500 or more).
- Force a held-out validation split (10–20%) and block promotion if validation loss diverges from training loss past a threshold.
- Default to lower rank (4–8) and one or two epochs when the dataset is small; the customer can opt into higher rank if their eval supports it.
- Add explicit LoRA dropout (0.05–0.1) which materially helps small-dataset regimes.
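The first two defenses reduce to simple gates in the training pipeline. The sketch below takes one reading of the size thresholds (reject under 100 examples, warn under 500) and flags divergence when validation loss exceeds training loss by more than a configurable ratio; the 25% default and both function names are illustrative assumptions.

```python
def dataset_size_gate(n_examples: int, reject_below: int = 100,
                      warn_below: int = 500) -> str:
    """Classify a training job by dataset size: 'reject', 'warn', or 'pass'."""
    if n_examples < reject_below:
        return "reject"
    if n_examples < warn_below:
        return "warn"
    return "pass"


def validation_diverged(train_loss: float, val_loss: float,
                        max_gap_ratio: float = 0.25) -> bool:
    """True when validation loss exceeds training loss by more than max_gap_ratio."""
    return val_loss > train_loss * (1.0 + max_gap_ratio)


print(dataset_size_gate(80))            # the 80-example upload from the text
print(validation_diverged(0.40, 1.20))  # memorised training set, poor validation
```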
Catastrophic forgetting of adjacent capabilities
A LoRA trained on contract-review text degrades the model's ability to write Python. A medical-coding LoRA loses general instruction following. The adapter doesn't touch most of the base model parameters, but it shifts the distribution enough that adjacent capabilities visibly suffer.
Mitigations:
- Replay data. Mix 5–20% general-purpose data (instruction-following examples, code, math) into the customer's training set. Reduces forgetting at modest quality cost on the target task.
- Lower learning rate. Customers with strong opinions about quality often want higher learning rates and more epochs; the platform default should be conservative.
- DoRA over LoRA. DoRA's magnitude-direction decomposition empirically forgets less. Slightly slower training; usually worth it for production.
- Capability eval gates. Before promoting any adapter, run it against a small general-capability eval (HumanEval, IFEval, MMLU subset). If any capability drops more than a configured threshold (5%) vs the base, require a human approval.
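A capability gate of the kind described in the last bullet can be sketched as a comparison of per-suite scores. This version assumes the threshold is a relative drop against the base model's score (the text does not say whether 5% is relative or absolute), and the suite names and scores are made up for illustration.

```python
def capability_regressions(base_scores: dict, adapter_scores: dict,
                           max_relative_drop: float = 0.05) -> dict:
    """Return {suite: relative_drop} for every suite that regressed past the threshold."""
    regressions = {}
    for suite, base in base_scores.items():
        if base <= 0:
            continue  # nothing to regress from
        drop = (base - adapter_scores[suite]) / base
        if drop > max_relative_drop:
            regressions[suite] = drop
    return regressions


# Hypothetical scores for a contract-review adapter.
base = {"ifeval": 0.80, "humaneval": 0.60}
adapter = {"ifeval": 0.79, "humaneval": 0.50}
flagged = capability_regressions(base, adapter)
print("needs human approval" if flagged else "auto-promote", flagged)
```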
Distribution shift between training and serving
The customer trains on prompts shaped like "Hey, please draft a reply to this email: [email]". In production, their app sends prompts shaped like "Email: [email]\n\nReply:". The adapter was trained on one prompt template; it's being served another. Quality looks fine on the customer's eval set (which uses the training template) but is mediocre in real use.
Mitigations:
- The training pipeline should accept and apply the same chat template the customer's app will use at serving time.
- The platform should document the chat-template requirement and validate that the customer's training data is in the expected format.
- Eval should include some adversarially-formatted prompts to catch template fragility.
Tokenizer drift across base versions
Covered again in the debugging section below, but worth emphasizing here: a LoRA trained against the Llama-3.1 tokenizer is not portable to Llama-3.3 even if the model is "compatible." The token IDs may align but the embedding-row semantics shift subtly. Always retrain on base upgrade.
Tokenizer drift across forks
Customer trains on a community-finetuned base (e.g., Nous-Hermes-Llama-3) with a slightly modified vocab. They expect to serve the adapter on stock Llama-3. Token IDs partially overlap; the adapter behaves erratically. Production platforms either pin adapters to specific base hashes or reject loads on tokenizer hash mismatch.
Hyperparameter heuristics that work in 2026
For mid-sized customer datasets (1k–10k examples) on Llama 3.x or Qwen 2.5 base, a reasonable default recipe:
- Rank: 16 (32 if eval supports it).
- Target modules: Q, K, V, O, FFN gate, up, down.
- LoRA alpha: 32 (or 2× rank).
- Learning rate: 1e-4 to 2e-4 (lower for larger bases; 5e-5 for 70B+).
- Epochs: 2–4 (more risks overfit on small data).
- Weight decay: 0.01.
- Warmup: 3–5% of training steps.
- Optimizer: 8-bit AdamW (paged for QLoRA).
- Dropout: 0.05 on LoRA layers.
These defaults are loosely consistent with what Predibase, Together AI, Fireworks, and Unsloth publish as their starting recipes. Customers who want to tune further usually need a held-out eval to justify the change.
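The recipe above, together with the small-dataset advice from the overfit section, can be captured as a default-selection function. This is a sketch, not any platform's actual defaults: the 1,000-example cutoff for "small" is an assumption, and the module names follow the common Hugging Face naming for Llama-style models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRecipe:
    rank: int
    alpha: int
    learning_rate: float
    epochs: int
    dropout: float = 0.05
    weight_decay: float = 0.01
    warmup_ratio: float = 0.03
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj",
                             "gate_proj", "up_proj", "down_proj")


def default_recipe(base_params_billion: float, n_examples: int) -> LoraRecipe:
    """Pick conservative defaults from base size and dataset size."""
    small = n_examples < 1_000                 # assumption: cutoff for "small data"
    rank = 8 if small else 16                  # lower rank when data is scarce
    learning_rate = 5e-5 if base_params_billion >= 70 else 1e-4
    epochs = 2 if small else 3
    return LoraRecipe(rank=rank, alpha=2 * rank,
                      learning_rate=learning_rate, epochs=epochs)


print(default_recipe(base_params_billion=8, n_examples=5_000))
print(default_recipe(base_params_billion=70, n_examples=300))
```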
When to retrain
Adapters age. The triggers for retraining a customer's adapter:
- Base model upgrade. Always retrain; never serve cross-version.
- Customer's data drifted significantly. Eval shows quality drop on recent inputs; collect new training data and refresh.
- Customer reports degradation. Track per-tenant satisfaction; trigger retrain proactively if signals turn negative.
- Routine cadence. Many platforms retrain every 90 days regardless, to incorporate accumulated new customer data and to align with base-model patch cadence.
Worked example: a customer-support adapter going bad over 6 months
Month 0: customer trains on 4,000 historical tickets. LoRA performs well; eval score 92%.
Month 3: customer's product changes (new features, new SKU naming). New tickets reference things the training data never saw. Eval on recent tickets drops to 84%. Customer notices but doesn't flag.
Month 5: a few public reviews complain about "AI bot being out of date." Customer flags. Platform's per-tenant quality dashboard had been showing a slow decline since month 3 but wasn't yet alarming.
Month 5.5: customer retrains with their last 90 days of tickets added. Eval back to 91%. Issue closed.
The lesson: automatic eval on a rolling window of recent inputs catches data drift before customers do; platforms that don't do this rely on customer complaints, which lag by 1–3 months.
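A rolling-window check of this kind can be very small. The sketch below keeps the last few eval scores for one adapter and alerts when their mean falls more than a set amount below the score recorded at promotion; the window length, the 0.05 threshold, and the class name are illustrative assumptions, and the scores echo the worked example.

```python
from collections import deque


class RollingEvalMonitor:
    """Alert when the rolling mean eval score drops too far below the baseline."""

    def __init__(self, baseline_score: float, window: int = 7, max_drop: float = 0.05):
        self.baseline = baseline_score
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add one eval result; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough history yet
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.max_drop


monitor = RollingEvalMonitor(baseline_score=0.92, window=3)
for score in (0.91, 0.90, 0.89, 0.84, 0.84):
    if monitor.record(score):
        print(f"drift alert at score {score}")
```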
Adapter format, registry, and supply-chain hygiene
By 2026 the de-facto adapter format is the Hugging Face PEFT layout: an adapter_model.safetensors weights file plus an adapter_config.json. The format question is mostly settled. The registry and supply-chain question is not.
What a production adapter store should track per adapter
| Field | Purpose |
|---|---|
| adapter_id | Unique per (customer, version) |
| customer_id | Owner; access control key |
| base_model_name | e.g., meta-llama/Llama-3.3-70B-Instruct |
| base_model_hash | SHA-256 of the base weights; reject load on mismatch |
| tokenizer_hash | SHA-256 of the tokenizer; reject load on mismatch |
| rank | LoRA rank |
| target_modules | Which projections the adapter touches |
| alpha, dropout | LoRA hyperparameters |
| training_data_hash | Reproducibility; right-to-be-forgotten tracking |
| training_config_hash | Reproducibility of training run |
| trained_at, promoted_at | Provenance timestamps |
| eval_scores | Per-eval-suite scores at promotion |
| quality_baseline_version | Adapter version that this one was diffed against in promotion eval |
| signed_by | Cryptographic signature of the admin who promoted |
| retention_policy | When the adapter is eligible for deletion |
| compliance_flags | HIPAA, SOC 2 scope, GDPR data origin |
Object-storage path convention used by several platforms in 2026: s3://adapters/{customer_id}/{adapter_name}/{version}/adapter_model.safetensors. Versioning is at the path level; the latest pointer is a separate manifest.
Cryptographic signing and verification
For regulated industries:
- Every adapter promotion is signed by an authorized admin using a per-customer (or per-platform) signing key.
- The signature is stored separately from the adapter file — same bucket but different prefix, or a separate signing service.
- Workers verify the signature on adapter load. Unsigned or invalid-signature adapters are rejected.
- Key rotation is a documented process; old signatures remain valid until adapters are re-signed.
Right-to-be-forgotten and GDPR data deletion
When a customer requests deletion of their training data, the operational cost is non-trivial because the trained adapter is a derivative of that data:
- Delete raw training data from object storage (easy).
- Delete the adapter file from cold storage (easy).
- Evict the adapter from hot/warm tiers across all workers (needs coordination).
- Delete training logs that contain training-data echoes (depends on log granularity).
- Delete derivative analytics (eval scores tied to specific examples, etc.).
- Document the deletion in the audit log.
Most platforms commit to a 30-day deletion SLA from request to "all artifacts removed." Beyond that, surfaces like backups and disaster-recovery snapshots may still hold copies; this typically requires a longer window (90 days) and is documented in the privacy policy. GDPR's right to erasure requires action without undue delay; how far that obligation extends to adapters trained on the deleted data is not settled, so the conservative reading is to treat the adapter as in scope and to work within the same 30–90 day window.
Adapter integrity at load
Workers verify several invariants at load:
- File-level SHA-256 matches the registry record.
- Cryptographic signature verifies against the admin signing key.
- Adapter's declared base_model_hash matches the running base.
- Adapter's declared tokenizer_hash matches the running tokenizer.
- Adapter's rank ≤ the worker's --max-lora-rank.
- Adapter's target modules are a subset of the worker's supported targets.
Failure on any check rejects the load and emits an alert. A handful of regressions in production multi-LoRA stacks over 2024–2025 came from skipping one of these checks under time pressure; the cost in incident recovery far exceeded the engineering cost of strict validation up front.
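The six checks can be collected into one function that returns every failure rather than stopping at the first, which makes the resulting alert more useful. This is a sketch under clear assumptions: the record and worker types are hypothetical, and HMAC-SHA256 with a shared key stands in for the signature step only because it is in the standard library. A production system would more likely verify an asymmetric signature against the admin's public key.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterRecord:
    file_sha256: str
    signature: str            # hex HMAC over file_sha256 (stand-in for a real signature)
    base_model_hash: str
    tokenizer_hash: str
    rank: int
    target_modules: frozenset


@dataclass(frozen=True)
class Worker:
    base_model_hash: str
    tokenizer_hash: str
    max_lora_rank: int
    supported_targets: frozenset
    signing_key: bytes


def sign(file_sha256: str, key: bytes) -> str:
    """Stand-in signer: HMAC-SHA256 over the file hash."""
    return hmac.new(key, file_sha256.encode("ascii"), hashlib.sha256).hexdigest()


def load_failures(adapter_bytes: bytes, record: AdapterRecord, worker: Worker) -> list:
    """Return the list of failed invariants; an empty list means the load may proceed."""
    failures = []
    if hashlib.sha256(adapter_bytes).hexdigest() != record.file_sha256:
        failures.append("file hash mismatch")
    if not hmac.compare_digest(sign(record.file_sha256, worker.signing_key), record.signature):
        failures.append("invalid signature")
    if record.base_model_hash != worker.base_model_hash:
        failures.append("base model hash mismatch")
    if record.tokenizer_hash != worker.tokenizer_hash:
        failures.append("tokenizer hash mismatch")
    if record.rank > worker.max_lora_rank:
        failures.append("rank exceeds max_lora_rank")
    if not record.target_modules <= worker.supported_targets:
        failures.append("unsupported target modules")
    return failures
```

A caller would reject the load and emit an alert whenever the returned list is non-empty.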
Cross-region replication strategy
For multi-region SaaS:
- Master adapter store in the customer's home region (configurable; defaults to the region required by the customer's data-residency terms).
- Async replication to peer regions for low-latency reads — but bounded by the customer's data-residency contract.
- Replicated adapters get a region-tagged hash so loaders can verify region origin matches policy.
- Cross-region failover is opt-in per customer; some customers require strict single-region service even at the cost of availability during regional outages.
Debugging multi-LoRA in production
Nine failure modes that frequently bite production multi-LoRA deployments, and how to diagnose them.
Adapter / base version mismatch
Symptom. Adapter trained on Llama-3.1 70B; served on Llama-3.3 70B. Outputs are slightly off: sometimes garbled, sometimes subtly wrong. Quality eval flags regressions vs the previous deployment.
Diagnosis. Compare the adapter's metadata (base model name + version) to the serving stack's loaded base. Adapters saved with Hugging Face PEFT record the base in adapter_config.json (base_model_name_or_path); that is a name rather than a weight hash, so also store a base_model_hash in your own registry.
Fix. Re-train the adapter on the current base, or pin the serving stack to the matching base version.
Tokenizer drift
Symptom. Adapter trained with Llama 3.1 tokenizer; served with a slightly different tokenizer (a fork, a custom merge). Token IDs don't align; embeddings are read from the wrong vocabulary slot.
Diagnosis. Hash the tokenizer's vocab file on training and serving. Compare hashes.
Fix. Always include the exact tokenizer with adapter exports. Reject loads on hash mismatch.
Training / serving precision mismatch
Symptom. Adapter trained in BF16; served in FP16. Outputs subtly different from training-time validation. Hard to spot in eval; shows up in user feedback.
Diagnosis. Check the adapter file's dtype. Compare to the serving stack's compute dtype.
Fix. Match dtype between training and serving, or document the expected drift in your eval pipeline.
Adapter overfit
Symptom. Adapter performs great on training-distribution prompts; degenerates outside it. Especially common with small datasets (<500 examples).
Diagnosis. Run the adapter against a held-out general-capability eval (HumanEval, IFEval). Compare to the base model.
Fix. Lower rank, add dropout to LoRA layers, mix general-purpose data into training, or simply gather more customer data.
Catastrophic forgetting
Symptom. Fine-tuned a LoRA for legal text generation; the model's general coding ability degrades. Customer complains "it used to write Python, now it can't."
Diagnosis. Capability eval before and after training across multiple domains.
Fix. Mix general-purpose data (5–20%) into training. Use lower learning rates. Train fewer epochs. Consider DoRA which is more conservative.
KV cache poisoning
Symptom. With prefix caching enabled, requests from tenant A occasionally return responses that look like tenant B's behaviour. Cross-tenant pollution.
Diagnosis. Check the prefix-cache key construction. It must include the adapter ID.
Fix. Patch the cache key to (tenant_id, adapter_id, prefix_hash). Audit the cache infrastructure for any code path that drops the adapter ID.
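A minimal sketch of a key built from (tenant_id, adapter_id, prefix_hash), assuming the prefix is available as a list of non-negative token IDs. The function name and field layout are illustrative, not taken from any serving stack.

```python
import hashlib


def prefix_cache_key(tenant_id: str, adapter_id: str, prefix_token_ids: list) -> str:
    """Build a cache key that cannot collide across tenants or adapters."""
    h = hashlib.sha256()
    # Length-prefix each string field so ("ab", "c") and ("a", "bc") hash differently.
    for field in (tenant_id, adapter_id):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    # Fixed-width token encoding keeps the prefix unambiguous as well.
    for tok in prefix_token_ids:
        h.update(tok.to_bytes(4, "big"))
    return h.hexdigest()
```

The same prefix under two different adapters produces two different keys, which is the property the fix depends on.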
Memory growth over time
Symptom. Worker HBM usage grows over hours of operation; eventually OOM kills the worker.
Diagnosis. Track adapter pool size over time. Look for adapters not being properly freed when replaced or evicted.
Fix. Identify the leaking code path (often in custom adapter loader). Restart cadence (rolling restarts every 24–48 hours) is a workable band-aid while you fix the root cause.
Performance regression after upgrade
Symptom. After upgrading vLLM from 0.6.x to 0.7.x (or any major serving stack upgrade), throughput drops 20% with no other changes.
Diagnosis. Compare kernel auto-tuning artifacts; some serving stack upgrades reset cached kernel selections. Newer kernels can be slower on certain shapes during initial profiling.
Fix. Re-warm autotuning, pin kernel selections, or roll back to the previous version. Document the regression and report it upstream with a reproducible benchmark.
Adapter quality silently degrades on context-length variation
Symptom. Adapter performs well on short prompts (training distribution), but quality collapses on long prompts (10k+ tokens) that the training data didn't cover.
Diagnosis. Eval the adapter on multiple context-length buckets (1k, 4k, 16k, 64k). Compare to base model at the same buckets.
Fix. Include long-context examples in training data, or set an explicit "max context for this adapter" parameter that routes longer prompts to the base model.
Monitoring patterns
The production dashboard for a multi-tenant LoRA service has five panels:
- Per-adapter p99 latency. Catches noisy-neighbour issues.
- Adapter cold-load rate. Catches tier-sizing problems.
- HBM utilization by adapter. Catches memory pressure.
- Adapter quality regression alerts. Catches training problems.
- Per-tenant cost. Catches billing anomalies and runaway customers.
Worked debugging session
A real-world debugging session for a multi-LoRA p99 latency spike from 240 ms to 1.4 s overnight, no deployments in between.
Step 1: Check monitoring. P99 latency is elevated only on workers 3 and 5 of 8. Workers 1, 2, 4, 6, 7, 8 are normal.
Step 2: SSH to worker 3. nvidia-smi shows 99% GPU memory utilization. Other workers show 80%.
Step 3: Query adapter pool state via the vLLM API. Worker 3 is cycling through 80 active adapters against a baseline of 60 and a --max-loras cap of 64, so adapters are being evicted and reloaded continuously.
Step 4: Check adapter version log. A traffic spike onboarded 25 new customers, prefetching their adapters to the busiest workers. Workers 3 and 5 got hit because of shard locality.
Step 5: Raise --max-loras from 64 to 80 and perform a rolling restart. Workers stabilize at 80 hot adapters; p99 returns to baseline within 30 minutes.
Total debugging time: ~45 minutes once the team was on it. Catching it in monitoring before customer complaints: depends on whether you trended hot-adapter count per worker (most production teams should).
Failure-mode runbook
The 24/7 on-call playbook for a multi-tenant LoRA service:
- Symptom: rising p99 latency, normal aggregate throughput. → Check hot-adapter count per worker; look for eviction churn. Increase --max-loras or pin hot adapters.
- Symptom: aggregate throughput drop. → Check GPU utilization. If low, check queue depth and admission control. If high, check for kernel autotuning regression after upgrade.
- Symptom: one tenant complaining about quality. → Check adapter version. Roll back if recent upgrade. Run per-tenant eval to confirm.
- Symptom: cold-load rate spike. → Check warm-tier cache hit rate. May need to bump CPU RAM or improve prefetch.
- Symptom: OOM on workers. → Check for adapter pool memory leak. Restart and capture for analysis.
- Symptom: cross-tenant data anomaly. → Suspect KV cache key bug. Take affected workers out of rotation immediately; investigate cache poisoning.
Security: adapters as attack vectors
A LoRA adapter is a file uploaded by an untrusted source (the customer). It runs in your model's forward pass. The attack surface is non-trivial.
Training-data extraction
A LoRA trained on sensitive data can leak that data through carefully-crafted prompts. Research has shown verbatim training-text extraction from LoRA-fine-tuned models with completion prompts that prefix the target training example.
Mitigations:
- Differential-privacy LoRA training (DP-LoRA).
- Memorisation audits post-training — probe with prefixes of training examples; ensure the model doesn't complete them verbatim.
- Treat adapters as classified at the data sensitivity level of their training corpus.
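The memorisation audit can be sketched as a prefix probe. This is an illustrative sketch under simple assumptions: generate is a hypothetical callable wrapping base + adapter inference with greedy decoding, and the character-based split and exact-match test stand in for whatever tokenisation and similarity measure your audit actually uses.

```python
from typing import Callable, Iterable


def memorisation_rate(
    generate: Callable[[str], str],
    training_examples: Iterable[str],
    prefix_chars: int = 200,
    match_chars: int = 50,
) -> float:
    """Fraction of probed examples whose held-back continuation is reproduced verbatim."""
    probed = leaked = 0
    for text in training_examples:
        if len(text) < prefix_chars + match_chars:
            continue  # too short to split into a prefix and a held-back suffix
        prefix = text[:prefix_chars]
        expected = text[prefix_chars:prefix_chars + match_chars]
        probed += 1
        if generate(prefix).startswith(expected):
            leaked += 1
    return leaked / probed if probed else 0.0
```

A promotion gate might compare this rate against the same probe run on the base model alone, since the base can already complete widely published text.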
Backdoor injection
A malicious customer trains an adapter with hidden triggers — specific token sequences that cause the model to misbehave (leak data, output harmful content, bypass safety).
Mitigations:
- Behavioural eval against the platform's safety suite before promoting any adapter.
- Anomaly detection on adapter weights — most adapters fall within a statistical band; weights far outside the band warrant manual review.
- Customer attestations / contractual restrictions.
- Don't allow adapters to bypass higher-priority system prompts.
Adversarial prompts crossing tenants
Even without backdoors, malicious prompts to one tenant's adapter could theoretically affect the shared base or KV cache state in ways that leak information. Mitigations: rigorous KV cache keying by (tenant, adapter, prefix); no shared state across tenants in the serving stack beyond the read-only base weights.
Adapter exfiltration
A customer who has access to query their adapter can, with enough queries, sometimes reconstruct portions of the adapter weights. The economics rarely favour the attacker, but for high-IP adapters (large companies' proprietary tunings) this is real.
Mitigations:
- Rate limit per-customer queries.
- Watermark adapter outputs.
- For ultra-sensitive cases, serve via confidential computing (NVIDIA Confidential Computing on H100, AWS Nitro Enclaves, Azure confidential VMs).
Supply-chain attacks
A compromised adapter file in the object store could be served to other tenants if the access controls fail. Mitigations: per-adapter integrity hashes verified at load time, write-once storage policy, separate IAM roles for adapter writes vs reads.
Cross-tenant prompt injection
Multi-tenant systems where adapter outputs feed into agentic workflows that share tools across tenants create a new attack surface: a malicious adapter could emit tool calls that, when interpreted in another tenant's context, leak data.
Mitigations:
- Tenant-scoped tool registries — adapter X can only call tools registered to tenant X.
- Adapter outputs cannot rename the calling tenant or escape tenant context.
- Auditing of tool-call patterns; alerts on anomalies.
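A tenant-scoped registry can be sketched in a few lines. The ToolRegistry class is hypothetical. The design point it illustrates is that the tenant ID used for lookup comes from the authenticated request context, never from anything the model or adapter emitted.

```python
class ToolRegistry:
    """Maps (tenant_id, tool_name) to a callable; lookups never cross tenants."""

    def __init__(self):
        self._tools = {}

    def register(self, tenant_id, name, fn):
        self._tools[(tenant_id, name)] = fn

    def call(self, tenant_id, name, *args, **kwargs):
        # tenant_id must be taken from the authenticated request, not from model output.
        try:
            fn = self._tools[(tenant_id, name)]
        except KeyError:
            raise PermissionError(
                f"tool {name!r} is not registered for tenant {tenant_id!r}"
            ) from None
        return fn(*args, **kwargs)
```

Raising PermissionError for both unknown tools and cross-tenant attempts gives the auditing layer a single event type to alert on.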
Pre-release adapter audit
Before promoting a customer-trained adapter to production, run an automated audit:
- Behavioural eval on platform safety suite (refusals, jailbreaks).
- Memorisation probe (training-data extraction attempts).
- Output distribution comparison vs base model — flag adapters that produce wildly different outputs on neutral prompts.
- Statistical weight-norm analysis — flag adapters whose weight magnitudes are anomalously large (potential backdoor signal).
Adapters that fail the audit don't auto-promote; they go to a human reviewer with the failed metrics surfaced.
Adversarial adapter detection in practice
A growing area in 2026: automated detection of malicious adapters before they reach production. Three techniques worth knowing:
Weight-norm anomaly detection. Most LoRA adapters have weight magnitudes within a narrow band (a few standard deviations from a baseline distribution). Adapters with extreme norms (very large or very small) get flagged for manual review.
Activation-pattern probing. Run the candidate adapter on a fixed probe set of prompts. Compare its activation patterns to the base model and to other adapters trained on similar data. Outliers warrant inspection.
Behavioural red-team. Run an automated adversarial probe — known jailbreak prompts, sensitive-content prompts, refused-category prompts. Compare the adapter's responses to the base model's. Adapters that bypass safety filters more than the base get rejected.
These detections are imperfect — sophisticated backdoors can evade them — but they catch the bulk of accidental and casual-attack cases.
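As one illustration of the weight-norm check, the sketch below flags outliers with a modified z-score (median and median absolute deviation). It assumes you have already computed one norm per adapter, for example the Frobenius norm of the combined update. The 3.5 threshold is a common rule of thumb rather than a value from this guide.

```python
import statistics


def flag_norm_outliers(norms: dict, threshold: float = 3.5) -> list:
    """Return adapter IDs whose norm is a robust outlier relative to the fleet.

    The modified z-score uses the median and MAD, so the outliers being
    hunted do not distort the baseline the way they would a mean/stddev.
    """
    values = list(norms.values())
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # norms are (nearly) identical; nothing to compare against
    return sorted(
        adapter_id
        for adapter_id, v in norms.items()
        if 0.6745 * abs(v - median) / mad > threshold
    )
```

Both very large and very small norms are flagged, matching the "extreme norms" criterion above. Flagged adapters go to manual review rather than automatic rejection.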
Audit-trail requirements for regulated industries
Healthcare, financial services, and government deployments often require:
- Every adapter query logged with hash of input + hash of output. Retained 7+ years.
- Adapter version history with cryptographic signing. Each promotion signed by an authorised admin; signatures stored separately from the adapter.
- Reproducibility of any past prediction. Given a (timestamp, customer_id, query_hash), the system must be able to reproduce the exact prediction. Requires snapshotting model versions and inference parameters.
- Right-to-be-forgotten support. If a customer requests data deletion, training data, the adapter trained on it, and any cached predictions all must be removable. Operationally expensive; design for it from day one.
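A minimal sketch of the per-query audit record, assuming SHA-256 for the input and output hashes. The field names are hypothetical; the point is that the record stores hashes plus everything needed to replay the request (adapter version and sampling parameters) without storing the raw text.

```python
import hashlib
import json


def audit_record(timestamp: str, customer_id: str, adapter_version: str,
                 prompt: str, output: str, sampling: dict) -> dict:
    """Build one audit-log entry keyed by (timestamp, customer_id, query_hash)."""
    def digest(s: str) -> str:
        return hashlib.sha256(s.encode("utf-8")).hexdigest()

    return {
        "timestamp": timestamp,
        "customer_id": customer_id,
        "adapter_version": adapter_version,
        "query_hash": digest(prompt),
        "output_hash": digest(output),
        "sampling": sampling,  # e.g. temperature and seed, needed to reproduce the output
        # Canonical JSON so identical parameters always produce the same hash.
        "sampling_hash": digest(json.dumps(sampling, sort_keys=True)),
    }
```

Reproducing a past prediction then means loading the recorded adapter version, replaying the stored prompt from the customer's own archive, and checking the result against output_hash.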
Compliance frameworks
- SOC 2 Type II. Most multi-tenant LoRA platforms have this. Specifies controls around access, auditability, and incident response.
- ISO 27001 / 27017 / 27018. Cloud-specific information security standards. Common for enterprise platforms.
- HIPAA / HITRUST. Healthcare data. Requires BAA, additional controls.
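- EU AI Act (entered into force August 2024; obligations phase in from 2025 through 2027). Risk classification follows the use case, so fine-tuned models deployed in high-risk applications fall in scope. Customisation pipelines need documentation, eval, and incident reporting.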
The bottom line
The adapter-economics gap is what made per-customer fine-tuning a viable product feature. By 2026 the kernel problem is solved: Punica and S-LoRA cut the multi-LoRA throughput tax from 50% or more in 2023 to roughly 10–15% today, and the work has moved up the stack into scheduling, tiering, and per-tenant operations. The biggest lever is the adapter-management layer: which adapters stay in HBM, which migrate to CPU or disk, and how the scheduler co-batches requests across tenants.
Operational takeaways:
- Keep one base model per GPU; treat adapters as cache lines, not as instances.
- Tier hot/warm/cold by recent QPS; prefetch on tenant-activity signals.
- Pin adapters to a specific base version; never silently re-target across base upgrades.
- Eval per tenant — fleet-level metrics hide individual-tenant regressions.
- Default to LoRA over full fine-tuning unless eval shows a >2% quality gap that matters.
Pair this with vLLM and PagedAttention for the underlying batching mechanics and AI inference cost economics for the per-tenant unit-economics model.
FAQ
Is LoRA really as good as full fine-tuning? Within 1–3 points on most benchmarks at rank 32 targeting attention + FFN. For instruction tuning, domain adaptation, and style adaptation: yes. For tasks requiring large distribution shifts (new languages, very different domains): no.
Can I use QLoRA at serving time? Yes. Quantize the base model to 4-bit and serve the LoRA on top; vLLM supports this (--quantization awq or --quantization gptq plus --enable-lora). Reduces HBM use by ~3× at small quality cost.
How many adapters can I fit on one GPU? For a 7B base on an H100: 200–1000 rank-32 adapters in HBM, more on CPU. For a 70B base on 8× H100: 200–500 rank-32 in HBM.
What's the latency overhead vs single-model? 10–15% in 2026 with mixed-adapter batching. Down from 50%+ in 2023.
Can I mix adapter ranks? Yes, S-LoRA and modern vLLM support heterogeneous ranks per adapter. Set --max-lora-rank to the largest rank you'll use.
Hot-add adapters without restart? Yes. vLLM supports dynamic adapter loading via API. Adapter discovery from a directory at startup is also common.
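As a sketch of what hot-adding looks like against vLLM's OpenAI-compatible server: recent vLLM releases document a POST /v1/load_lora_adapter endpoint that takes lora_name and lora_path, enabled by setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True on the server. Treat the endpoint and field names as something to verify against the version you run. The server URL, adapter name, and path used below are placeholders.

```python
import json
import urllib.request


def build_load_request(server: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request that asks the server to load an adapter."""
    body = json.dumps({"lora_name": lora_name, "lora_path": lora_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of passing the result to urllib.request.urlopen, for example with build_load_request("http://localhost:8000", "acme-support-v3", "/adapters/acme/v3"). Keeping construction separate from sending makes the call easy to unit-test and to wrap with the integrity checks a production loader needs.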
LoRA on top of a quantized base? Standard 2026 pattern. Base at INT8, INT4 (AWQ, GPTQ), or NVFP4; LoRA at FP16 or BF16. Total memory savings of 3–6×.
Should I use LoRA or full fine-tuning? LoRA for: customisation, multi-tenant, when you have <1M training examples, when you need fast iteration. Full fine-tuning for: distillation, new languages, tasks that demonstrably need more than LoRA can provide. Default to LoRA.
How do I train LoRA adapters? Axolotl, Hugging Face PEFT + Trainer, Unsloth (faster), LLaMA-Factory. All work. Unsloth is the speed leader on small-to-mid models in 2026.
DoRA, LoRA+, AdaLoRA — worth it? Small quality gains; serving stacks support them via the standard LoRA interface. Use if you're already extracting the last percentage points of quality; not necessary for most workloads.
Prefix caching with LoRA? Yes — keyed by (adapter, prefix). SGLang's RadixAttention handles this natively; vLLM supports it. Same-prefix-same-adapter requests share KV cache.
Speculative decoding with LoRA? Possible. The draft model and target model both need LoRA support if the speculation uses the adapter context. EAGLE-2-style speculation is the most LoRA-compatible.
Multi-base-model multi-tenant? Yes. A single cluster can serve adapters for Llama-3.1-8B, Llama-3.3-70B, Qwen-2.5-7B, etc., each with their own adapter pool. Manage as separate model families.
What about MoE base models? LoRA on MoE is harder — each expert has its own matrices, and adapter design must account for which experts are active. Open research; production rare in 2026. Most MoE serving stays full-model.
Per-customer fine-tuning at <1k examples? Usually OK with LoRA, sometimes overfits. Use a small rank (4–8), low learning rate, and validate on a held-out customer set. Below 100 examples, customisation via in-context examples (few-shot) often beats fine-tuning anyway.
What's the minimum dataset size to bother with LoRA? 500–2000 examples is the sweet spot for most customisation tasks. Below 100, few-shot in the system prompt beats LoRA. Between 100–500, it's task-dependent — for narrow style adaptation, LoRA at rank 4 can work; for anything requiring real generalisation, gather more data first.
How long should a LoRA adapter live before retraining? Until the base model changes, or the customer's data drifts noticeably. Most adapters in 2026 live 3–12 months between retrains. The cadence is usually triggered by base-model upgrades (new Llama version, new Qwen version) rather than data drift.
Do all LoRA layers need to use the same alpha? The standard is alpha = rank or alpha = 2 × rank across all layers; some recipes (LoRA+) use different learning rates for A and B matrices. The vast majority of production fine-tunes use uniform alpha; the gains from tuning per-layer alpha are small and rarely worth the experimental overhead.
Can I serve LoRA adapters for an MoE base? Difficult. The expert routing changes which matrices a token sees, so the adapter has to apply to many experts or pre-route. Some MoE-aware LoRA techniques exist in research (MoLA, MoE-LoRA) but production support is thin in 2026. Most MoE serving stays full-model. See mixture-of-experts serving.
How does multi-LoRA interact with continuous batching? Cleanly. vLLM, SGLang, and TGI all integrate multi-LoRA with their continuous-batching schedulers. Requests with different adapters batch together; the segment-aware GEMM handles the heterogeneity. The visible effect is 10–15% throughput overhead vs single-adapter serving, plus a small impact on prefill / decode timing.
Can I use LoRA on a fp8 or NVFP4 quantized base? Yes. The standard pattern in 2026 is base at fp8 (or NVFP4 on B200/Blackwell) with the LoRA at bf16. The LoRA's contribution is computed in higher precision and added back into the quantized base's output. vLLM, TensorRT-LLM, and SGLang all support this. See quantization tradeoffs and FP8 training tradeoffs.
Speculative decoding with LoRA: how does it interact? The draft model and target model both need to apply the LoRA. EAGLE-2 and Medusa-style speculation, which use a small head appended to the target model, work cleanly with multi-LoRA because the draft uses the same base + adapter as the target. Independent-draft speculation (a separate small model) is harder; the draft doesn't have the adapter, which can produce more rejection.
How do I A/B test a new adapter version? Tag adapter versions; route a small fraction of traffic to v2 while keeping v1 as the default. Log eval metrics per version. Promote v2 if it passes the eval and rollback if it regresses. Most multi-LoRA platforms (LoRAX, vLLM with custom routing) support this natively or with a thin gateway layer.
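The traffic split in a thin gateway can be sketched with a deterministic hash, so a given user or session always lands on the same version. The function and the version labels are illustrative and not part of LoRAX or vLLM.

```python
import hashlib


def pick_version(request_key: str, v2_fraction: float, v1: str = "v1", v2: str = "v2") -> str:
    """Route a stable fraction of keys to v2; the same key always gets the same answer."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return v2 if bucket < v2_fraction else v1
```

Keying on a user or session ID rather than a request ID keeps each user's experience consistent during the experiment. Because the split is a pure function of the key, promotion means raising v2_fraction to 1.0 and rollback means setting it to 0.0.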
Can adapters from different teams collide in HBM? Only as a memory-budgeting problem, not a correctness one. Each adapter is identified by name and used only when explicitly requested. The HBM allocator might evict adapter A to make room for adapter B; the next request for A will reload it from CPU or disk. Avoid this thrashing by sizing --max-loras to your hot working set.
Is there a privacy concern with adapters trained on customer data? Yes. The adapter is a small artifact derived from the customer's data; in theory it can leak information about training examples (membership inference, training-data extraction attacks). For sensitive data (medical, financial, PII), add differential privacy to the LoRA training, audit the adapter for memorisation, and treat the adapter file with the same access controls as the source data.
How much QPS can I push through one adapter? Depends on the base model and hardware. For a 70B base on 4×H100 with multi-LoRA enabled, one adapter can absorb 50–200 RPS sustained before the kernel-level segment-aware GEMM starts to bottleneck on that adapter specifically. Above that, the GPU is saturated regardless of how many adapters you have.
What's the right monitoring dashboard for a multi-LoRA service? Five metrics: per-adapter QPS, per-adapter p99 latency, per-adapter cold-load rate, HBM occupancy by adapter, aggregate throughput. Alerts on: cold-load rate above threshold, p99 latency 3× baseline, HBM occupancy >90%. Most production issues show up in these five before anything else.
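The three alert conditions can be sketched as a pure function over one worker's metrics. The WorkerSample fields are hypothetical, and the 1% cold-load threshold is an assumed placeholder, since the right value depends on your tiering.

```python
from dataclasses import dataclass


@dataclass
class WorkerSample:
    p99_ms: float
    baseline_p99_ms: float
    cold_load_rate: float   # fraction of requests that triggered a cold load
    hbm_occupancy: float    # fraction of HBM in use, 0.0 to 1.0


def fired_alerts(s: WorkerSample, cold_load_threshold: float = 0.01) -> list:
    """Return the names of the alert rules this sample trips."""
    fired = []
    if s.cold_load_rate > cold_load_threshold:
        fired.append("cold_load_rate")
    if s.p99_ms > 3 * s.baseline_p99_ms:
        fired.append("p99_latency")
    if s.hbm_occupancy > 0.90:
        fired.append("hbm_occupancy")
    return fired
```

Applied to the worked debugging session earlier in this guide (p99 of 1.4 s against a 240 ms baseline, 99% memory use), both the latency and occupancy rules would have fired on worker 3.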
Can I combine LoRA with RAG? Yes, and this is the dominant production pattern in 2026 for per-customer products. LoRA shapes style and tone; RAG provides facts. They compose cleanly because LoRA modifies the model's behaviour while RAG modifies the input. See RAG in production.
Does multi-LoRA work for embedding models? Yes, with the same mechanics. Multi-LoRA over embedding models lets you serve domain-specialised embeddings (legal, medical, code) from one base model. Less common in 2026 than generation LoRAs because embedding fine-tuning gains are smaller and the operational overhead similar.
Extended FAQ
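How many adapters can I realistically fit in HBM on a B200? A B200 with 192 GB HBM, serving a 70B FP8 base (70 GB), has ~120 GB free before KV cache. At rank-32 all-linear adapters (600 MB each), that's 200 hot adapters. With NVFP4 base (40 GB), ~250 hot adapters. KV cache and activations come out of the same budget, so the practical number is lower. The warm tier (system memory) extends this to a few thousand adapters of this size, and to tens of thousands for small attention-only adapters.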
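What's the throughput penalty for using rank-64 instead of rank-16? The LoRA portion costs roughly 4× more per token, because LoRA FLOPs scale linearly with rank. In raw FLOPs the LoRA path is well under 1% of the base at rank 16, so rank 64 still adds only a low-single-digit percentage to total compute. Measured overhead runs higher than the FLOP count suggests because the LoRA matmuls are skinny and memory-bound, but an added 2–5% of total is typical. Negligible in practice.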
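The size and compute arithmetic behind the last two answers can be worked through directly. The sketch below assumes Llama-70B-style dimensions (hidden size 8192, grouped-query KV dimension 1024, FFN width 28672, 80 layers) and FP16/BF16 adapter weights. It counts FLOPs only, so it says nothing about memory-bound kernel overhead.

```python
def lora_params(rank: int, shapes: list, n_layers: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank), summed over targets and layers."""
    return n_layers * sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Llama-70B-style projection shapes, as (d_in, d_out).
ATTENTION = [(8192, 8192), (8192, 1024), (8192, 1024), (8192, 8192)]  # q, k, v, o
FFN = [(8192, 28672), (8192, 28672), (28672, 8192)]                   # gate, up, down
ALL_LINEAR = ATTENTION + FFN
N_LAYERS = 80

# Parameters in the base model's linear layers that the adapters sit beside.
base_params = N_LAYERS * sum(d_in * d_out for d_in, d_out in ALL_LINEAR)

for rank in (16, 32, 64):
    p = lora_params(rank, ALL_LINEAR, N_LAYERS)
    size_mb = p * 2 / 1e6          # 2 bytes per parameter at FP16/BF16
    flop_ratio = p / base_params   # matmul FLOPs per token scale with parameter count
    print(f"rank {rank}: {size_mb:.0f} MB, {flop_ratio:.2%} of base linear FLOPs")
```

Under these assumptions an all-linear adapter comes to roughly 0.4 GB at rank 16 and 0.8 GB at rank 32, and the LoRA path is about 0.3% of base linear FLOPs at rank 16 and about 1.2% at rank 64. Targeting only some projections shrinks both numbers proportionally, which is why adapter counts per GPU vary so widely between configurations.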
Can I train a LoRA adapter on a base model and serve on a quantized version? Yes. Train on FP16 base, serve on FP8 or INT4 quantized base. The adapter stays at FP16/BF16. Most production stacks (vLLM, SGLang, TGI) support this; quality cost is typically 0.5–1 point.
Should I use safetensors or PyTorch .pt files for adapters? Safetensors. Faster to load, no arbitrary code execution risk (PyTorch .pt files can deserialize arbitrary objects). All modern training stacks emit safetensors by default; serving stacks prefer them.
How do I version adapters cleanly? Tag each adapter with (customer_id, version_number, base_model_hash, training_data_hash, train_timestamp). Store in object storage with the version in the path. Keep at least the last 3 versions for rollback. Reject loads where the base_model_hash doesn't match the running base.
What's the minimum dataset size for a useful LoRA? 500–1000 examples for a narrow style task; 5000+ for general domain adaptation. Below 100 examples, few-shot prompting in the system prompt is usually equivalent or better.
Do reasoning models need different LoRA hyperparameters? Yes: typically lower learning rates and fewer epochs, because reasoning traces are long and contain a lot of signal. Aggressive training collapses reasoning depth. Where the base-model vendor publishes fine-tuning guidance for its reasoning models, follow it closely.
Can multi-LoRA reduce my cold-start latency to zero? No, but it can make cold starts rare. With proper hot/warm/cold tiering and prefetch, <0.5% of requests hit a cold load. That fraction has 1–3s latency; the rest hit hot with no overhead.
What's the right way to charge customers for fine-tuning? Standard pattern: training is a one-time charge per run (e.g., $50–$500 depending on base size and dataset size); inference is the same per-token price as the base model. Customers see fine-tuning as a feature, not a different product.
Can I have two adapters active in one request? Possible (adapter stacking, sometimes called "merge inference") but rarely supported in production stacks. The semantics get weird: which adapter wins on overlapping target modules? Most teams stack at training time (train a multi-task adapter) rather than at inference time.
How does multi-LoRA interact with FP8 attention? Cleanly. The LoRA contribution is computed in BF16/FP16, the base attention in FP8, both added in FP32 accumulator. The mixed precision is handled by the kernel. vLLM and TRT-LLM both support this on H100/H200/B200.
What's the difference between Punica and S-LoRA in 2026 production stacks? Punica's BGMV/SGMV kernels were a foundational contribution and are still used directly in some stacks. S-LoRA extended them with heterogeneous ranks, unified paging, and tensor parallelism. In 2026 vLLM and SGLang, the kernels are derivative of both; most production users don't think about which is underneath.
How do I migrate a customer's adapter to a new base model version? Three options. (a) Re-train from scratch on the new base — clean but expensive. (b) Continue training the existing adapter on the new base with a low learning rate — sometimes works, sometimes overfits or underfits. (c) "Distill" the old adapter's behaviour by generating outputs with the old (base + adapter), then training a new adapter on those outputs using the new base. The right choice depends on dataset size and quality requirements.
Can I share an adapter across multiple customers? Yes, if it's a "platform adapter" (e.g., a customer-support-tone adapter shipped by the platform vendor). Treat it as a separate tenancy class with its own version control. Be explicit about which adapters are customer-private vs platform-shared.
Do adapters help with safety / refusal behaviour? Yes. A small "safety LoRA" trained on examples of correctly-refused queries can be combined with a customer's style LoRA at inference time. This is a research pattern in 2026; production deployments increasingly use this for compliance-driven customisations.
What's "MoLA" or "MoE-LoRA" for MoE bases? Variants of LoRA designed for mixture-of-experts bases (Mixtral, DeepSeek V3, and reportedly GPT-4). They attach a per-expert LoRA or a routing-aware LoRA so the adapter applies meaningfully despite the sparse expert activation. Research-stage in 2026; production support thin. See mixture-of-experts serving.
How do I tell if a customer's adapter is underperforming vs being misused? Per-tenant quality dashboards split by adapter version and by input distribution. If the adapter regressed vs the previous version, it's a training problem. If quality is fine but the customer's prompts shifted (new use cases), they need to update their adapter. The product surface for both can be the same "your fine-tune may need retraining" recommendation.
Can I run multi-LoRA on consumer GPUs (RTX 4090, 5090)? Yes, for small bases. An RTX 4090 (24 GB) holds an 8B base at INT4 (~5 GB) plus hundreds of small LoRAs. llama.cpp has basic multi-LoRA support, and vLLM's CUDA path also works on consumer cards; MLX offers similar support on Apple silicon. Use cases: prosumer products, edge devices, on-prem deployments with small fleets.
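What's the cost of a cold-load to my P99 latency? A 500 ms cold load adds 500 ms to one in N requests, where N is the inverse of your cold-load rate. At a 0.5% cold-load rate, every 200th request is affected; because that is under 1% of traffic, the penalty lands beyond p99 (in p99.5 and p99.9) rather than in p99 itself. Once the cold-load rate exceeds 1%, p99 becomes base latency plus the cold-load time. If your P99 latency budget is 1.5 s and base latency is 500 ms, cold loads fit either way; with a tighter budget, you need to reduce the cold-load rate via better prefetch.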
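The relationship between cold-load rate and tail latency can be checked with a deliberately simple two-point model: every request takes the base latency, and a fixed fraction pays the cold-load time on top. Real latency distributions have their own tail, so treat this as a lower bound, not a forecast. The function name is illustrative.

```python
def tail_latency_ms(quantile: float, base_ms: float, cold_load_ms: float,
                    cold_load_rate: float) -> float:
    """Latency at `quantile` when `cold_load_rate` of requests pay `cold_load_ms` extra."""
    slow_fraction_needed = 1.0 - quantile
    # The quantile sees the cold-load penalty only when cold loads are more
    # common than the slice of traffic that lies above that quantile.
    if cold_load_rate > slow_fraction_needed:
        return base_ms + cold_load_ms
    return base_ms
```

With a 500 ms base and a 500 ms cold load, tail_latency_ms(0.99, 500, 500, 0.005) stays at the base latency while tail_latency_ms(0.999, 500, 500, 0.005) includes the penalty. This is why a cold-load rate that looks harmless on a p99 dashboard can still break a p99.9 SLA.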
How do I evaluate adapters at scale across many customers? Build a per-tenant eval pipeline: each tenant has a small held-out test set; on adapter promotion, the pipeline runs the new adapter against the test set and compares to the previous version. Aggregate quality dashboards across all tenants surface fleet-wide regressions. Most platforms in 2026 automate this with eval frameworks like RAGAS, OpenAI Evals, Braintrust, or custom test runners.
Should I use B200 or H200 for a new multi-tenant LoRA cluster in 2026? For 70B-class bases with hundreds of adapters, H200 is usually the better value: ample HBM, mature kernels, and per-hour pricing has settled into a stable band. B200 wins when you need NVFP4 for 405B-class bases or when you want maximum adapter density per node. The pragmatic 2026 default: H200 for general workloads, B200 for frontier-base or extreme-density requirements.
What's the right way to handle adapter versioning when a customer iterates rapidly? Cap the number of retained versions per customer (typically 5–10) and apply LRU on top of that. Each version gets an immutable hash; the customer's "active" adapter is a movable pointer. Rollback capability requires keeping the previous N versions in cold storage even if they're not loaded.
How do I think about cold start vs first-token latency budget for paid tiers? A paid tier with a 500 ms p99 budget cannot tolerate any cold loads; the platform must pin the customer's adapter to hot HBM during their active sessions. Free tiers absorb cold-load tail latency. Tiered pinning policy is one of the simplest ways to convert SLA promises into operational reality.
Does LoRA compose with prefix tuning, prompt tuning, or other PEFT methods? Mostly: LoRA stacks cleanly with prompt tuning (the prompt embeddings are independent of LoRA matrices). LoRA + prefix tuning has overlapping target spaces and gets weird in practice. In 2026 production, plain LoRA dominates; the other PEFT methods see niche use.
Can I deploy a multi-LoRA cluster across heterogeneous GPUs (mixing H100, H200, B200)? Yes operationally — each node runs its own base model with its own adapter pool. The platform's gateway routes requests to the right node. Quality is consistent (same base weights, same adapter files); throughput per node varies. Common at companies that accumulated mixed GPU fleets across procurement cycles.
How do I migrate from Llama-3.1 to Llama-3.3 across thousands of customer adapters? Long migration: announce a window (typically 30–60 days). For each customer adapter, automatically retrain on the new base using the cached training data. Stagger rollout to limit training capacity needed. Customers without cached training data are flagged; offer them either an automated retrain on a recent input sample or a manual escalation. Maintain the old base in service during the migration window for rollback.
What's the realistic ceiling on adapter count per worker in 2026? With H200 (141 GB) serving a 70B FP8 base: 1000+ rank-16 adapters hot at once is achievable when each targets only a couple of attention projections (tens of MB per adapter), more in CPU warm tier; all-linear adapters at hundreds of MB each fit far fewer. With B200 (192 GB) and NVFP4 base: 2000+ hot adapters of the same small size. With GH200 / GB200 unified memory: tens of thousands of warm adapters per node. The real ceiling in 2026 production is set by scheduler complexity and per-adapter eval, not raw memory.
How do I detect prompt injection that comes via the customer's fine-tuning data? Scan before training: pattern-match training examples against known injection corpora. Then run a behavioural red-team of the trained adapter against the platform's safety suite before promotion. This catches the bulk of accidental cases; sophisticated backdoors require more elaborate detection (activation-pattern analysis, weight-norm anomaly detection).
Is multi-tenant LoRA appropriate for life-or-death applications (medical diagnosis, legal advice)? Multi-tenant infrastructure is fine; the question is the eval and governance layer. For high-stakes deployments, the adapter promotion gate is much stricter — multiple expert reviewers, formal eval suites, sign-off requirements. The technology supports this; the policy layer carries most of the weight.
Are there any open-source multi-tenant LoRA reference platforms I can fork? LoRAX (Predibase, Apache 2.0) is the cleanest reference for a production multi-tenant LoRA server. vLLM's multi-LoRA support is more general-purpose but requires more glue to be a full platform. Many of the YC-era ML infra companies that exited in 2024–2025 left behind open-source kernels and adapters that are still useful starting points.
Do I need a dedicated control plane separate from the inference fleet? For anything beyond hundreds of adapters: yes. The control plane handles adapter registration, training orchestration, eval orchestration, promotion gating, and per-tenant accounting. The inference fleet handles requests. Coupling them creates operational fragility — control-plane mistakes can take down inference. Most teams that scale past 1000 adapters split these into separate services.
Can I serve different LoRA adapters across regions while keeping consistent quality? Yes if you replicate the adapters consistently. Quality risk comes from differing base model versions across regions, not from the adapter layer. Pin base versions globally; replicate adapters using the same version control as your inference code.
How does multi-LoRA interact with structured outputs (JSON mode, grammars)? Cleanly. The structured output layer (XGrammar, Outlines, vLLM's grammar support) operates on the logits regardless of how those logits were produced. A LoRA adapter just shifts the distribution; the grammar constraint applies on top.
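A toy illustration of why the two layers compose: the grammar mask only sees logits, so it does not matter whether an adapter shaped them. The numbers are made up, and modelling the adapter as an additive shift on the logits is a simplification for the example (a real adapter changes hidden states upstream of the logits).

```python
import math


def apply_grammar_mask(logits, allowed_token_ids):
    """Set the logits of tokens the grammar forbids to -inf."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 4 tokens with illustrative values.
base_logits = [2.0, 1.0, 0.5, 0.0]
lora_shift = [-1.0, 0.5, 1.5, 0.0]  # stand-in for the adapter's effect
adapted_logits = [b + d for b, d in zip(base_logits, lora_shift)]

# Suppose the grammar permits only tokens 1 and 3 at this decoding step.
probs = softmax(apply_grammar_mask(adapted_logits, [1, 3]))
```

Forbidden tokens end up with zero probability whatever the adapter preferred, while the adapter still decides the relative weight among the tokens the grammar allows.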
What's the right team size for a 5000-adapter multi-tenant LoRA platform? Around 8–12 engineers when stable: 2–3 on the serving stack, 2 on training pipeline, 1–2 on eval / quality, 1–2 on platform/control-plane, 1 on SRE, 1 on security/compliance. Smaller during early growth; larger when serving regulated industries.
Can a speculative-decoding draft model also use a LoRA? Yes. In EAGLE-style speculation, where the draft is a small head appended to the target, the draft naturally inherits the target's LoRA. In independent-draft speculation (a separate small model), the draft side doesn't get the adapter, which leads to higher rejection rates and weaker speedup.
Glossary
- Adapter — a small set of additional parameters layered on top of a base model. LoRA, DoRA, prefix tuning are all adapter techniques.
- A and B matrices — the two low-rank matrices that compose a LoRA update (ΔW = BA).
- Base model — the underlying frozen model that adapters modify.
- Hot / warm / cold tier — adapter residence in HBM / CPU RAM / object storage.
- LoRA — low-rank adaptation; the canonical PEFT technique.
- PEFT — parameter-efficient fine-tuning; the umbrella term covering LoRA and its relatives.
- Punica / S-LoRA — the kernel patterns that made multi-adapter batching efficient.
- QLoRA — LoRA trained on top of a quantized base.
- Rank — the inner dimension of the LoRA matrices; controls adapter capacity.
- Segmented GEMM — GEMM kernel that handles batch slices with different operand matrices.
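To make the ΔW = BA and segmented GEMM entries concrete, here is a pure-Python reference of the computation a segment-aware kernel performs. It is a readability sketch, not the Punica or S-LoRA kernel: `multi_lora_forward` and its arguments are names introduced here, and a real implementation runs each segment as a batched GPU GEMM.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def multi_lora_forward(X, W, adapters, adapter_ids, scale=1.0):
    """Compute y_i = W x_i + scale * B_k (A_k x_i), with k = adapter_ids[i].

    Rows sharing an adapter are grouped into one segment so each (A, B)
    pair is fetched once per segment. An id of None means base model only.
    """
    segments = {}
    for row, k in enumerate(adapter_ids):
        segments.setdefault(k, []).append(row)

    Y = [None] * len(X)
    for k, rows in segments.items():
        for row in rows:
            y = matvec(W, X[row])  # shared base projection
            if k is not None:
                A, B = adapters[k]  # A: r x d_in, B: d_out x r
                # Apply A then B; the full d_out x d_in product BA is
                # never materialized.
                delta = matvec(B, matvec(A, X[row]))
                y = [yi + scale * di for yi, di in zip(y, delta)]
            Y[row] = y
    return Y


W = [[1, 0], [0, 1]]  # identity base weight for easy checking
adapters = {
    "tenant-1": ([[1, 0]], [[0], [1]]),  # rank-1 A and B
    "tenant-2": ([[0, 1]], [[2], [0]]),
}
X = [[1, 2], [3, 4], [5, 6]]
Y = multi_lora_forward(X, W, adapters, ["tenant-1", None, "tenant-2"])
```

One batch carries three requests with three different adapter choices, and the base projection is shared by all of them; only the small low-rank correction differs per segment.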
Eighteen-month outlook
The kernel and serving stacks for multi-tenant LoRA are mature in 2026. The next eighteen months are about scale and surface area:
- More adapters per GPU. B200's 192 GB HBM and the upcoming GB300 generation push the practical hot-adapter ceiling into the 10k+ range per node. The kernels are ready; the schedulers and adapter stores are the next bottleneck.
- Cross-base-version adapter migration. Tools that take a LoRA trained on Llama 3.3 and "port" it to Llama 4 without full retraining. Research-stage in 2026; production deployments are still a "retrain on the new base" affair.
- Multi-LoRA for reasoning models. Fine-tuning reasoning models (o3, Claude with extended thinking, DeepSeek-R1) for per-customer behaviour is harder because reasoning traces depend on long chains of thought. Multi-tenant serving works, but training recipes are still developing.
- LoRA + MoE. Per-expert LoRA, per-routing LoRA, and other techniques to make adapters work on MoE bases without quality collapse. Rare in production today; likely standard by 2028.
- Adapter compression and sharing. Common substructures across many customer adapters can be factored out (a "base adapter" plus per-customer deltas). Cuts storage and HBM costs for large fleets.
- Edge multi-LoRA. Small base models with hundreds of adapters running on consumer GPUs and Apple Silicon. Already feasible with MLX and llama.cpp; productionising for consumer apps is the next step.
The architectural skeleton — base + adapters, segment-aware GEMM, hot/warm/cold tiering — is unlikely to change. What's changing is how big the fleets get and how cheap the marginal customer becomes.
Adapter marketplaces and shared LoRA registries
A 2026 trend worth watching: public LoRA registries (Hugging Face Hub, Civitai for image LoRAs, smaller hubs for LLM adapters) are converging on standard manifest formats. A small but real economy of "platform-shipped" LoRAs (a customer-support-tone LoRA, a legal-summarization LoRA, a code-reviewer LoRA) is forming, monetized by adapter authors or bundled into platform tiers. Multi-tenant infrastructure makes this viable — adding one more adapter to a pool of 1000 is essentially free; charging $10/month for it is pure margin.
Agentic LoRA
A nascent pattern: per-tool LoRA adapters in agentic stacks. An agent that calls many tools (search, code, browser, calculator) might use a small LoRA adapter conditioned on which tool is about to be called. Early experiments from agentic frameworks in 2026 suggest meaningful quality lifts on tool-specific formatting and behavior, at modest serving cost. Most production agentic systems in 2026 still use a single base + system prompt rather than per-tool LoRAs; the math is changing as multi-LoRA overhead approaches zero.
On-device multi-LoRA
Apple's MLX framework, llama.cpp, and a handful of mobile LLM stacks now support multi-LoRA on consumer hardware. A 7B base at INT4 on an Apple Silicon Mac (24 GB unified memory) can hold dozens of LoRAs and switch between them per request. The 2027 implication: per-user fine-tunes that live entirely on a user's device, with cloud sync only for the adapter file. Privacy and latency wins are obvious; the operational question is how platforms keep customer adapters consistent across devices.
Cross-checking the math for 2027
Extrapolating HBM growth across H100 → H200 → B200 → B300, and assuming kernels mature in parallel, a single Blackwell-class node should comfortably host 10,000+ hot adapters on a 70B base by 2027. The unit cost of per-customer fine-tuning likely drops below $5/month/customer for plans priced at $1k/month; tiers below $10/month become viable for the first time. Whether the market wants 10x cheaper per-customer fine-tunes is a separate question.
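The per-customer figure is a division of node cost across resident adapters, which is easy to rerun with your own inputs. The helper below is a hypothetical sketch; it covers serving share only and ignores training, storage, and eval costs.

```python
def serving_cost_per_adapter(node_cost_per_month, hot_adapters_per_node,
                             utilization=1.0):
    """Share of a node's monthly cost attributable to one sold adapter slot.

    `utilization` is the fraction of adapter slots actually sold.
    """
    sold = hot_adapters_per_node * utilization
    if sold <= 0:
        raise ValueError("need at least one sold adapter slot")
    return node_cost_per_month / sold


# Placeholder inputs, not measured prices: substitute your own node cost.
example = serving_cost_per_adapter(
    node_cost_per_month=30_000, hot_adapters_per_node=10_000, utilization=0.5
)
```

Utilization matters as much as raw capacity here: a node that can host 10,000 adapters but has sold half the slots carries twice the per-customer cost.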
References
- LoRA — Hu et al., 2021. arXiv:2106.09685. The original LoRA paper.
- QLoRA — Dettmers et al., 2023. arXiv:2305.14314. 4-bit quantized base + LoRA training.
- Punica — Chen et al., 2023. arXiv:2310.18547. Segment-aware GEMM for multi-LoRA serving.
- S-LoRA — Sheng et al., 2023. arXiv:2311.03285. Thousands-of-adapter serving with unified paging.
- DoRA — Liu et al., 2024. arXiv:2402.09353. Magnitude-direction decomposition for adapters.
- LoRA+ — Hayou et al., 2024. arXiv:2402.12354. Differential learning rates.
- AdaLoRA — Zhang et al., 2023. arXiv:2303.10512. Adaptive-rank LoRA.
- VeRA — Kopiczko et al., 2024. arXiv:2310.11454. Vector-based adapter parameters.
- vLLM multi-LoRA — docs.vllm.ai/en/latest/features/lora.html.
- TGI Multi-LoRA — github.com/huggingface/text-generation-inference.