Why this matters beyond the homelab: GPU virtualization and workload isolation translate directly to cloud ML infrastructure and compute resource management.
The Goal
Run local LLM inference at production speed, 50+ tokens/second on 70B parameter models, without sending a single byte of data to an external API. The hardware: an NVIDIA RTX 4000 Ada Generation GPU in a Proxmox hypervisor. The method: VFIO/IOMMU passthrough to a dedicated VM.
The Hardware
Host: Node-A (FCM2250 / Millennium Falcon)
CPU: Intel Core Ultra 9 285K
RAM: 64GB DDR5
GPU: NVIDIA RTX 4000 Ada Generation (16GB VRAM)
Storage: NVMe (LVM-thin)
The RTX 4000 Ada sits in Node-A’s PCIe slot. The goal is to pass the entire GPU through to a single VM (Tantive-III) so it has exclusive, bare-metal-equivalent access to the hardware.
IOMMU and VFIO Prerequisites
Enable IOMMU in BIOS
For Intel CPUs, enable VT-d (Intel Virtualization Technology for Directed I/O) in the BIOS. This is the hardware feature that allows the hypervisor to map PCIe devices to specific VMs.
Kernel Boot Parameters
Edit /etc/default/grub on the Proxmox host:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"
- intel_iommu=on: Enables IOMMU
- iommu=pt: Passthrough mode for better performance
- pcie_aspm=off: Disables PCIe Active State Power Management (mitigation for the lockup described below)
- pci=noaer: Disables Advanced Error Reporting (prevents kernel log spam from GPU power state transitions)
update-grub
reboot
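After the reboot it is worth confirming that the parameters actually reached the running kernel. The following is a minimal sketch, not part of the original setup: missing_params is a hypothetical helper name, and the parameter list simply mirrors the GRUB line above.

```shell
#!/usr/bin/env bash
# Print the expected passthrough parameters that are absent from a kernel
# command line. missing_params is a hypothetical helper, not a Proxmox tool.
missing_params() {
    local cmdline=" $1 "   # pad with spaces so each parameter matches as a whole word
    local p missing=""
    for p in intel_iommu=on iommu=pt pcie_aspm=off pci=noaer; do
        case "$cmdline" in
            *" $p "*) ;;                   # present
            *) missing="$missing $p" ;;    # absent
        esac
    done
    echo "${missing# }"
}

# On the Proxmox host, after the reboot:
#   missing_params "$(cat /proc/cmdline)"
```

An empty result means all four parameters are active. Anything printed is a parameter that did not make it into the boot entry.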
Verify IOMMU Groups
# Count IOMMU groups; a result of 0 means IOMMU is not active
find /sys/kernel/iommu_groups/ -mindepth 1 -maxdepth 1 -type d | wc -l
The GPU should appear in its own IOMMU group. If it shares a group with other devices, you’ll need ACS override patches or a different PCIe slot.
# Find the GPU's IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
echo "IOMMU Group $(basename "$(dirname "$(dirname "$d")")"): $(lspci -nns "$(basename "$d")")"
done | grep -i nvidia
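To check isolation directly, it can help to list everything that shares a group with the GPU. This is a sketch under assumptions: group_members is a hypothetical helper, 0000:02:00.0 is the address used elsewhere in this post, and the optional second argument exists only so the function can be exercised against a fake sysfs tree.

```shell
#!/usr/bin/env bash
# List every device that shares an IOMMU group with the given PCI address.
# $1: full PCI address, e.g. 0000:02:00.0
# $2: sysfs root, defaults to /sys
group_members() {
    local addr="$1" root="${2:-/sys}"
    local dev
    for dev in "$root"/kernel/iommu_groups/*/devices/"$addr"; do
        [ -e "$dev" ] || continue
        # Every entry in the same devices/ directory belongs to the same group
        ls "$(dirname "$dev")"
    done
}

# On the Proxmox host:
#   group_members 0000:02:00.0
```

If the output lists anything other than the GPU's own functions (typically the VGA function and its audio function), the group is shared and the caveat above applies.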
Load VFIO Modules
echo "vfio" >> /etc/modules
echo "vfio_iommu_type1" >> /etc/modules
echo "vfio_pci" >> /etc/modules
# vfio_virqfd was merged into vfio in kernel 6.2; on newer kernels this line is unnecessary
echo "vfio_virqfd" >> /etc/modules
Blacklist the NVIDIA drivers on the host (the GPU belongs to the VM, not the hypervisor):
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
Bind the GPU to VFIO:
# Get the GPU's PCI IDs
lspci -nn | grep -i NVIDIA
# Output: 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation [10de:xxxx] (rev a1)
echo "options vfio-pci ids=10de:xxxx,10de:yyyy" >> /etc/modprobe.d/vfio.conf
Replace xxxx and yyyy with your GPU and audio device IDs.
update-initramfs -u
reboot
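Once the host is back up, the binding can be confirmed by checking which kernel driver claimed each GPU function. The sketch below assumes the usual lspci -nnk output format, where an indented "Kernel driver in use:" line follows each device header; bound_drivers is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `lspci -nnk` output on stdin and print "<address> <driver>" per function.
bound_drivers() {
    awk '
        /^[0-9a-f:]+\.[0-9a-f] / { dev = $1 }        # device header line
        /Kernel driver in use:/   { print dev, $NF } # indented driver line
    '
}

# On the Proxmox host:
#   lspci -nnk -s 02:00 | bound_drivers
```

Every function of the GPU should report vfio-pci. A function that reports nouveau or nvidia, or that is missing from the output entirely, was not claimed by VFIO.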
VM Creation: Tantive-III
The Debian ISO Gotcha
ProxMenux created the VM shell but the Debian ISO URL was wrong:
wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.0.0-amd64-netinst.iso
# 404 Not Found
The filename changes with point releases. Find the current one:
wget -qO- https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/ | grep -oP 'debian-[0-9.]+-amd64-netinst\.iso' | head -1
# debian-13.3.0-amd64-netinst.iso
Download the correct ISO:
wget -O /var/lib/vz/template/iso/debian-13.3.0-amd64-netinst.iso \
https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
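Because the filename changes with every point release, the lookup and the download can be combined so that nothing is hardcoded. This is one possible sketch: latest_netinst is a hypothetical helper, and it relies on GNU sort's -V option, which is available on Debian-based hosts.

```shell
#!/usr/bin/env bash
# Read a directory listing on stdin and print the newest netinst ISO filename.
latest_netinst() {
    grep -oE 'debian-[0-9.]+-amd64-netinst\.iso' | sort -uV | tail -n 1
}

# On the Proxmox host:
#   base=https://cdimage.debian.org/debian-cd/current/amd64/iso-cd
#   iso=$(wget -qO- "$base/" | latest_netinst)
#   wget -O "/var/lib/vz/template/iso/$iso" "$base/$iso"
```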
VM Configuration
VM ID: 201
Name: Tantive-III
CPU: 8 cores (host type)
RAM: 32GB
Disk 1: 128GB NVMe (fast-lvm, scsi0)
Disk 2: 128GB NVMe (fast-lvm, scsi1)
GPU: PCIe passthrough (hostpci0: 0000:02:00, pcie=1)
Network: virtio, bridge=vmbr0
The VM config in /etc/pve/qemu-server/201.conf:
hostpci0: 0000:02:00,pcie=1
scsi0: fast-lvm:vm-201-disk-1,discard=on,size=128G,ssd=1
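Proxmox only supports pcie=1 on q35 machine types, so a quick sanity check of the config file can catch a mismatch before the VM fails to start. The excerpt above does not show the machine line, so treat this as a sketch under that assumption; check_vm_conf is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Read a Proxmox VM config on stdin and check the settings PCIe passthrough relies on.
check_vm_conf() {
    awk -F': ' '
        /^\[/            { exit }              # stop at the first snapshot section
        $1 == "machine"  { machine = $2 }
        $1 == "hostpci0" { hostpci = $2 }
        END {
            if (hostpci == "") { print "no hostpci0 entry"; exit 1 }
            if (hostpci ~ /pcie=1/ && machine !~ /q35/) {
                print "pcie=1 is set but machine is not q35"; exit 1
            }
            print "ok: " hostpci
        }
    '
}

# On the Proxmox host:
#   check_vm_conf < /etc/pve/qemu-server/201.conf
```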
Partition Expansion
The VM booted with only 3GB usable despite 128GB allocated. The full fix is documented in Post 011; it involves kpartx, a GPT repair with parted, and resize2fs, all run from the Proxmox host.
After expansion:
df -h /
# /dev/sda1 126G 2.4G 118G 2% /
NVIDIA Driver Installation
Inside the Tantive-III VM:
apt update && apt upgrade -y
apt install -y nvidia-driver firmware-misc-nonfree
reboot
After reboot:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx                 Driver Version: 550.xx         CUDA Version: 12.x     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 Ada Gen        Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8              9W /  130W |       0MiB /  16376MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
GPU visible, driver loaded, 16GB VRAM available.
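For scripted checks, the query mode of nvidia-smi is easier to parse than the table. The sketch below assumes the csv,noheader,nounits format, which prints one "name, MiB" line per GPU; has_vram is a hypothetical helper and the threshold is an arbitrary example.

```shell
#!/usr/bin/env bash
# Read one "name, total_mib" CSV line on stdin and succeed if the GPU
# reports at least $1 MiB of VRAM.
has_vram() {
    local want="$1" line name total
    IFS= read -r line || return 1
    name=${line%%,*}      # text before the first comma
    total=${line##*, }    # number after the last comma
    [ "$total" -ge "$want" ] && echo "$name: $total MiB"
}

# Inside the VM:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | has_vram 16000
```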
AI Stack Deployment
With the GPU accessible, deploy the AI services:
Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
ollama pull llama3:70b
ollama pull codellama:34b
Performance: ~50 tokens/second on the 70B model with the RTX 4000 Ada.
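Throughput figures like this can be reproduced from Ollama's own timing data. A non-streaming /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds), and the rate is one divided by the other. The sketch below assumes jq is installed; tokens_per_second is a hypothetical helper and the prompt is arbitrary.

```shell
#!/usr/bin/env bash
# tokens/second = eval_count / (eval_duration in nanoseconds / 1e9)
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Inside the VM, with Ollama running on its default port:
#   resp=$(curl -s http://localhost:11434/api/generate \
#       -d '{"model": "llama3:8b", "prompt": "Explain IOMMU groups.", "stream": false}')
#   tokens_per_second "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

Running the same prompt against each pulled model gives comparable numbers, since the rate depends heavily on model size and on how much of the model fits in VRAM.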
OpenWebUI
docker run -d --name openwebui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
AnythingLLM
For RAG (Retrieval-Augmented Generation) over local documents:
docker run -d --name anythingllm \
-p 3001:3001 \
-v /home/user/anythingllm:/app/server/storage \
mintplexlabs/anythingllm
AnythingLLM connects to the Ollama instance for both embedding and inference. 500+ documents have been ingested into the RAG pipeline.
ComfyUI
For image generation workloads:
# Clone and install ComfyUI with GPU support
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188
The Lockup and the Kernel Parameters
Two weeks after deployment, Node-A hard-locked. No kernel panic, no crash dump, no journal entries. The full investigation is in Post 007, but the short version:
Root cause: PCIe bus stall induced by the NVIDIA GPU under VFIO passthrough. The GPU entered a power state transition that caused the PCIe bus to hang, which cascaded into a full system lockup.
Mitigation:
# Added to GRUB_CMDLINE_LINUX_DEFAULT
pcie_aspm=off # Disable PCIe Active State Power Management
pci=noaer # Disable Advanced Error Reporting
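pcie_aspm=off stops the kernel from managing PCIe link power states, which avoids the transitions that trigger the stall, while pci=noaer only silences the Advanced Error Reporting log noise. The system has been stable since.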
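Whether ASPM is really off on the GPU's link can be read from the device's link control register. This sketch assumes the LnkCtl line format printed by recent pciutils versions ("LnkCtl: ASPM Disabled; ..."); aspm_state is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Read `lspci -vv` output on stdin and print the ASPM state from the LnkCtl line.
aspm_state() {
    sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p' | head -n 1
}

# On the Proxmox host, as root (capability details are hidden otherwise):
#   lspci -vv -s 02:00.0 | aspm_state
```

A result of "Disabled" matches the intent of the mitigation. Anything else, such as "L1 Enabled", suggests the firmware configured ASPM on its own and that a BIOS setting may also need changing.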
Current State
Service      Port   Status
Ollama       11434  Running, 6 models pulled
OpenWebUI    3000   Running, Authentik SSO
AnythingLLM  3001   Running, 500+ doc RAG pipeline
ComfyUI      8188   Running, text-to-image workflows
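A quick way to re-check this table from the command line is to probe each port. The sketch below uses bash's /dev/tcp redirection and the coreutils timeout command; port_open and check_stack are hypothetical helpers, and an open port only shows that something is listening, not that the service is healthy.

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to host $1, port $2 opens within 2 seconds.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print one "name up|down" line per service listed in the table above.
check_stack() {
    local host="$1" entry
    for entry in ollama:11434 openwebui:3000 anythingllm:3001 comfyui:8188; do
        if port_open "$host" "${entry#*:}"; then
            echo "${entry%%:*} up"
        else
            echo "${entry%%:*} down"
        fi
    done
}

# Inside the VM:
#   check_stack 127.0.0.1
```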
GPU metrics flow through Telegraf → InfluxDB → Grafana with real-time temperature, utilization, VRAM, and power monitoring.
All inference happens locally. Zero data egress. The entire AI stack is air-gapped from external APIs.
Related: Post 007 - VFIO Forensic Postmortem | Post 011 - Partition Expansion | Post 017 - TIG Stack Observability