How to run LLMs on-prem without sending data to OpenAI
A practical reference architecture for deploying large language models entirely inside your network — so the compliance review is a 20-minute conversation, not a six-month project.
Every regulated enterprise we work with eventually hits the same wall: their lawyers and security team won’t let customer data flow to a hosted LLM API. The engineering team has already built a prototype. It works. It dies on the compliance review.
The fix isn’t a better privacy policy. It’s an architecture where data cannot leave the network, because the model runs inside it. Here is how we build that.
The four layers of a private LLM stack
Treat your on-prem LLM deployment as four concerns, scoped separately:
- Model serving. The GPU and the inference engine.
- Retrieval. Where your documents live and how they’re searched.
- Orchestration. The glue that turns a user question into a grounded, cited answer.
- Observability. Audit log, eval harness, drift detection.
Mixing these up is the most common mistake. If your team doesn’t know which layer owns prompt injection defense (hint: orchestration, not the model), the whole stack becomes hard to reason about.
Layer 1: Model serving
Your realistic options in 2026 for self-hosted open-weight models:
- vLLM for most production workloads — strong throughput, solid API surface, good memory management with PagedAttention.
- Text Generation Inference (TGI) from Hugging Face — similar niche, good ops tooling.
- llama.cpp for CPU-only or low-resource edge deployments.
- TensorRT-LLM if you’re deep in NVIDIA tooling and need the last 20% of throughput.
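One practical consequence of picking vLLM or TGI: both expose an OpenAI-compatible HTTP endpoint, so your application code stays engine-agnostic. A minimal stdlib-only sketch of building such a request (the `llm.internal` hostname and model name are placeholder assumptions, not real endpoints):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, messages: list[dict]) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for a local
    inference server (vLLM serves this endpoint shape; recent TGI does too)."""
    payload = {"model": model, "messages": messages, "temperature": 0.0}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical in-network server -- nothing here leaves your VPC.
req = chat_request("http://llm.internal:8000",
                   "meta-llama/Meta-Llama-3-8B-Instruct",
                   [{"role": "user", "content": "Summarise policy 4.2"}])
```

Because the endpoint shape matches the hosted APIs, swapping engines later is a config change, not a rewrite.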
Model selection: Llama 3 family or Mistral/Mixtral are the sensible defaults for general reasoning. For domain-specific needs (biomedical, legal, financial), check if a purpose-trained open-weight model exists before you pay to fine-tune Llama yourself.
Hardware: the calculation is not "how powerful is the GPU" but "how many concurrent sessions do I need, at what latency, at what context length?" A single 80 GB H100 can serve a 70B-parameter model only if the weights are quantized to roughly 4 bits; at 16-bit precision a 70B model needs about 140 GB of weights alone, so two or more GPUs. For 7B models, an A10 or L40S is often enough. Do not buy hardware before you have a latency target and a concurrency target.
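A back-of-envelope sizing sketch makes the GPU math concrete. Weight memory is parameters times bytes per parameter; the KV cache grows with context length and concurrency. The layer/head numbers below are the published Llama-3-70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128); treat the result as an estimate, not a capacity plan:

```python
def weight_memory_gb(params_b: float, bits_per_param: int) -> float:
    """Memory for the model weights alone: parameters x bytes per parameter."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, concurrent: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per session."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * concurrent / 1e9

print(round(weight_memory_gb(70, 16), 1))  # 140.0 GB -- fp16 does not fit one 80 GB H100
print(round(weight_memory_gb(70, 4), 1))   # 35.0 GB -- 4-bit quantized fits comfortably
print(round(kv_cache_gb(80, 8, 128, 8192, 16), 1))  # 42.9 GB for 16 sessions at 8k context
```

Note that at 16 concurrent 8k-context sessions the KV cache rivals the quantized weights themselves, which is why concurrency and context length belong in the sizing conversation from day one.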
Layer 2: Retrieval
For regulated content, RAG (retrieval-augmented generation) is not optional. The model cites its sources; the auditor follows the citation. Without RAG, you have a black box making claims about your data.
Practical stack:
- Vector store: Qdrant, Weaviate, or pgvector. All three run fully on-prem. Pick pgvector if you already run Postgres and need simple.
- Embeddings: run a local embedding model (BGE, E5, or nomic-embed). Do not use the OpenAI embeddings API — that defeats the whole point.
- Hybrid search: BM25 + dense. BM25 catches keyword matches that dense search misses; dense catches semantic matches BM25 misses. Almost every production system we’ve shipped uses both.
- Reranker: a local cross-encoder (BGE-reranker or a similar open-weight model) re-orders the top-50 results to pick the top-5 for the model. Adds ~100ms, meaningfully improves accuracy.
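Before the reranker sees anything, the BM25 and dense result lists have to be merged. Reciprocal rank fusion is a common, tuning-free way to do that; a minimal sketch, assuming you already have two ranked lists of document ids (the ids here are hypothetical):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by score descending."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc7", "doc2", "doc9"]   # keyword ranking
dense_hits = ["doc2", "doc5", "doc7"]  # semantic ranking
print(rrf_fuse([bm25_hits, dense_hits]))
# -> ['doc2', 'doc7', 'doc5', 'doc9']
```

Documents that both retrievers agree on float to the top; the fused top-50 then goes to the cross-encoder for final ordering.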
Layer 3: Orchestration
Do not start with LangChain. Start with 50 lines of your own Python. LangChain and similar frameworks are fine eventually, but they obscure the structure while you’re still figuring it out.
Your orchestrator does three things:
- Receives a user query, retrieves relevant chunks.
- Formats a prompt with strict output schema (JSON with a citation field).
- Validates the output — drops any answer without a valid citation, retries or escalates.
The validation step is what turns "LLM demo" into "production system." An answer without a citation never reaches the user. If the model can’t produce a citation, that’s a signal it doesn’t have grounded information — escalate to a human or say "I don’t know."
Layer 4: Observability
Bare minimum for a regulated deployment:
- Every request logged: user, query, retrieved chunks, model output, citations. This is your audit trail.
- Evaluation harness: a few hundred golden-question/golden-answer pairs that you run nightly. When accuracy drops, you find out before your users do.
- PII redaction before logs hit storage. Build this in from day one, not after the first incident.
- Drift detection on embedding distributions — if your retrieval starts surfacing different clusters, something upstream changed.
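The PII-redaction step can start as a simple pass over every log line before it is written. A sketch with two illustrative patterns; real deployments need a far more exhaustive rule set (or a dedicated PII model), so treat these regexes as placeholders:

```python
import re

# Illustrative patterns only -- extend per your data and jurisdiction.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace obvious PII with tokens before a log line reaches storage."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789"))
# -> Contact [EMAIL], SSN [SSN]
```

Running this in the logging path, rather than as a batch job over stored logs, is what "built in from day one" means in practice: un-redacted data never touches disk.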
The compliance conversation, shortened
Here is what a 20-minute compliance review looks like when the architecture is right:
- "Where does the model run?" → Inside your VPC/datacenter, on hardware you own or rent.
- "Does data leave the network?" → No external API calls. Egress rules block the model pod entirely if desired.
- "How do we audit decisions?" → Every query-response pair is logged with citations. Pull any record by request ID.
- "What happens if the model hallucinates?" → Outputs without valid citations are rejected before they reach the user.
- "How do we patch the model?" → Version-pinned deployments; new model versions go through your existing change-management process.
That’s the conversation. No vendor questionnaires, no data processing addenda, no cross-border transfer issues. You run the stack the same way you run the rest of your infrastructure.
Common mistakes to avoid
- Buying GPUs before measuring latency. Benchmark with a borrowed cloud GPU first. Most teams over-provision 2-3×.
- Skipping evaluation. Without a nightly eval harness, you cannot detect regressions when you change anything.
- Letting the model answer without retrieval. If there’s no grounding, there’s no citation, and the answer shouldn’t ship.
- Treating prompt engineering as the whole game. Retrieval quality dwarfs prompt quality. Fix retrieval first.
- Building on a closed-weight "self-hosted" vendor. If you can’t inspect or fine-tune the weights, you’ve just moved the vendor lock-in behind a firewall.
If you’re starting from scratch
Our recommended first build for a regulated enterprise exploring offline LLMs:
1. Pick one well-defined use case (document Q&A on a specific policy corpus is ideal).
2. Deploy Llama-3-8B or Mistral-7B on a single GPU with vLLM. Measure latency.
3. Set up pgvector + BGE embeddings over your documents.
4. Write 100 golden questions with expected answers. Build the eval harness.
5. Ship an orchestrator that requires citations in outputs.
6. Pass the compliance review. Then — and only then — scale up.
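The eval harness in that plan can begin as a loop over the golden set that scores each case by whether the pipeline cited the expected source. A sketch; `answer_fn` stands in for your real orchestrator, and the golden cases and document ids are hypothetical:

```python
def run_eval(golden: list[dict], answer_fn) -> float:
    """Score the pipeline against golden question/answer pairs.
    A case passes when the answer cites the expected source document."""
    passed = 0
    for case in golden:
        result = answer_fn(case["question"])
        if case["expected_source"] in result.get("citations", []):
            passed += 1
    return passed / len(golden)

golden = [
    {"question": "What is the retention period?", "expected_source": "policy-4.2.pdf"},
    {"question": "Who approves exceptions?", "expected_source": "handbook-2025.md"},
]
# Stub pipeline that always cites the same document:
stub = lambda q: {"answer": "...", "citations": ["policy-4.2.pdf"]}
print(run_eval(golden, stub))  # -> 0.5: one citation matches, one does not
```

Run it nightly, chart the score, and alert on any drop: that single number is what lets you change models, prompts, or chunking without flying blind.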
Most teams can reach step 6 in six to ten weeks. That’s the shape of our Secure AI Build engagement.