Deploying AI in air-gapped environments
Running modern AI inside a classified or fully disconnected network is a solved problem. Most of the difficulty is logistics, not research. Here’s what the end-to-end deployment actually looks like.
"Air-gapped" means different things to different organizations. For our purposes: a network with no internet egress, where everything installed has to be carried across the boundary as a vetted artifact, and where every update is a controlled event.
Defense, intelligence, critical infrastructure, and some regulated research environments operate this way. AI can work there — it just requires a specific build-and-deploy discipline most teams aren’t used to.
The six things you can’t rely on
- Model downloads. No `huggingface-cli login`. Weights must be staged on removable media and hash-verified.
- Package managers. `pip install` and `apt update` won’t reach anything. You need a private mirror or offline bundles.
- Container registries. No Docker Hub. Every image has to be re-hosted internally.
- Telemetry. No Sentry, Datadog, or Segment calling home. All observability must be local.
- Auto-updates. Nothing updates itself. Updates are scheduled events.
- External API fallbacks. No "if local model fails, call OpenAI." There is no fallback across the boundary.
The deployment pipeline
A clean air-gapped build looks like this:
- Build on a connected network. Pull model weights, package dependencies, container images, Python wheels. Generate a Bill of Materials.
- Vulnerability scan. Scan every dependency and image before packaging. Nothing unscanned crosses the boundary.
- Sign & hash. Every artifact gets a SHA-256 hash and (where policy requires) a detached signature.
- Transfer. Approved removable media, or a one-way data diode if available, with a documented chain of custody.
- Verify on arrival. Re-hash everything, verify signatures, reject anything that doesn’t match.
- Install to private registry. Push images to internal registry, wheels to internal PyPI mirror, weights to artifact store.
- Deploy. Normal MLOps deployment from there, but pinned to your internal artifacts only.
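The hash-and-verify steps in the pipeline above can be scripted with nothing but the standard library, which matters when the verification side has no package manager. A minimal sketch (the manifest layout and function names are illustrative, not a standard):

```python
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB model weights never load into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(artifact_dir: Path, manifest_path: Path) -> None:
    """Connected side: record a hash for every artifact crossing the boundary."""
    manifest = {
        str(p.relative_to(artifact_dir)): sha256_file(p)
        for p in sorted(artifact_dir.rglob("*")) if p.is_file()
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))

def verify_manifest(artifact_dir: Path, manifest_path: Path) -> list[str]:
    """On arrival: re-hash everything; return the paths that fail verification."""
    manifest = json.loads(manifest_path.read_text())
    failures = []
    for rel_path, expected in manifest.items():
        target = artifact_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            failures.append(rel_path)
    return failures
```

Anything `verify_manifest` returns gets rejected at the boundary, per step five. Detached signatures on the manifest itself are handled by your signing tooling (GPG, Sigstore, or whatever policy dictates), not sketched here.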
Model choices that actually ship on air-gapped networks
- Llama 3 family (Meta license review required per deployment)
- Mistral / Mixtral (Apache 2.0 — easiest license for government)
- Qwen (check policy on Chinese model sources; some environments prohibit)
- Phi (Microsoft; MIT license)
- BGE / E5 / Nomic for embeddings
Avoid anything that phones home by default. Some commercial "self-hosted" models have license-check callbacks. Read the model card carefully.
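Reading the model card is necessary but not sufficient; callbacks hide in transitive dependencies too. One belt-and-suspenders pattern is an in-process egress guard installed before any model-serving code is imported, so a phone-home attempt fails loudly in testing rather than silently in production. A sketch, assuming a loopback-only allowlist is appropriate for your topology:

```python
import socket

_ALLOWED_HOSTS = {"127.0.0.1", "::1", "localhost"}  # loopback only
_original_connect = socket.socket.connect

def _guarded_connect(self, address):
    """Refuse any connection that isn't local; fail loudly, not silently."""
    # Unix domain sockets (a str/bytes path) are local by definition.
    if isinstance(address, (str, bytes)):
        return _original_connect(self, address)
    host = address[0]
    if host not in _ALLOWED_HOSTS:
        raise ConnectionRefusedError(f"egress blocked: connection to {host!r}")
    return _original_connect(self, address)

def install_egress_guard():
    """Call once at process start, before importing model-serving code."""
    socket.socket.connect = _guarded_connect
```

This only catches Python-level connections in the same process; network policy at the host and switch level remains the real enforcement layer. Treat the guard as a tripwire for CI, not a substitute for the air gap.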
Observability without external tools
Prometheus + Grafana on-prem for metrics. Loki or OpenSearch for logs. All of them run entirely offline; note that Prometheus and OpenSearch are Apache 2.0 while Grafana and Loki are AGPL, so confirm the license terms against your deployment policy.
Build your eval harness as a cron job running against golden Q/A pairs. Report metrics into Prometheus. Dashboards surface regressions before users do.
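That cron-driven harness stays stdlib-only if you emit metrics in the Prometheus text exposition format and let node_exporter's textfile collector scrape the file. A sketch (the metric names, `golden_pairs` shape, and exact-match scoring are assumptions; swap in your task's scorer):

```python
import time
from pathlib import Path

def run_eval(golden_pairs, answer_fn):
    """Score the deployed model against golden Q/A pairs.
    Exact match here; replace with whatever scoring your task needs."""
    correct = sum(1 for q, expected in golden_pairs
                  if answer_fn(q).strip() == expected)
    return correct / len(golden_pairs)

def write_textfile_metrics(accuracy: float, out_path: Path) -> None:
    """Emit Prometheus text exposition format for the node_exporter
    textfile collector. Write to a temp file and rename, so the
    collector never reads a half-written file."""
    body = (
        "# HELP llm_eval_accuracy Fraction of golden Q/A pairs answered correctly.\n"
        "# TYPE llm_eval_accuracy gauge\n"
        f"llm_eval_accuracy {accuracy}\n"
        "# TYPE llm_eval_last_run_timestamp_seconds gauge\n"
        f"llm_eval_last_run_timestamp_seconds {int(time.time())}\n"
    )
    tmp = out_path.with_suffix(".tmp")
    tmp.write_text(body)
    tmp.replace(out_path)
```

A Grafana alert on `llm_eval_accuracy` dropping below your baseline, or on a stale `llm_eval_last_run_timestamp_seconds`, is the "surface regressions before users do" part.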
The update cadence question
A common objection: "but models improve every quarter and we can’t pull updates." True, and usually fine. New model releases post headline gains on broad benchmarks, but those gains rarely transfer to a narrow production use case. What matters is that your deployed model is well-evaluated for your task.
Schedule updates quarterly. Evaluate each candidate before deployment. Keep the previous version available for rollback. A disciplined once-a-quarter upgrade cycle produces more stable systems than a continuous one.
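Keeping the previous version available is easiest if deployments never overwrite anything: each model version gets its own directory, and a `current` symlink is swapped atomically. Rollback is then the same operation with the old version string. A sketch assuming a POSIX filesystem (the directory layout and function name are illustrative):

```python
import os
from pathlib import Path

def activate_version(models_root: Path, version: str) -> None:
    """Point the 'current' symlink at a versioned model directory.
    The rename is atomic on POSIX, so readers always see either the
    old version or the new one, never a missing link."""
    target = models_root / version
    if not target.is_dir():
        raise FileNotFoundError(f"no such model version: {version}")
    tmp_link = models_root / "current.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(target, target_is_directory=True)
    os.replace(tmp_link, models_root / "current")
```

The serving stack only ever loads from `current`, so the quarterly upgrade and the emergency rollback are the same one-line, reversible operation.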
The hardest part (people, not tech)
The pattern we see in failed air-gapped AI projects isn’t a technical gap. It’s a coordination gap between the AI team (wants to iterate fast), the security team (won’t approve iteration), and operations (wants to know what "done" looks like).
Three things fix it:
- Get security involved at architecture review, not at deployment. Their sign-off speed collapses when they feel blindsided.
- Publish a quarterly release cadence. Predictability is a form of control.
- Run a tabletop exercise on the first update before you need it. The dry run reveals the process gaps.
What a typical engagement looks like
For air-gapped deployments we usually run a longer engagement than standard — 10-14 weeks rather than 6-10. The extra time is almost entirely about the deploy-verify-approve loop being slower. Budget accordingly.