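I went with a manual install from the .run file. I basically never upgrade or maintain this server, and I hold to the principle that once you start tinkering, something is guaranteed to break, so I've given up on tinkering ((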

I'll skip how to download the .run file; there are plenty of guides online, and it is right there on NVIDIA's official site.

To install the .run file on the PVE host, just run it directly.

chmod +x NVIDIA-Linux-x86_64-550.142.run
./NVIDIA-Linux-x86_64-550.142.run

After that, reboot the server and check that the driver is working.

root@pve:~# lspci -v | grep -i nv
06:00.0 Non-Volatile memory controller: Intel Corporation NVMe Optane Memory Series (prog-if 02 [NVM Express])
	Kernel driver in use: nvme
	Kernel modules: nvme
07:00.0 Non-Volatile memory controller: Intel Corporation NVMe Optane Memory Series (prog-if 02 [NVM Express])
	Kernel driver in use: nvme
	Kernel modules: nvme
81:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] (rev a1) (prog-if 00 [VGA controller])
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
81:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)
81:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
81:00.3 Serial bus controller: NVIDIA Corporation TU116 USB Type-C UCSI Controller (rev a1)
	Kernel driver in use: nvidia-gpu
	Kernel modules: i2c_nvidia_gpu
root@pve:~# ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root 80 Jan 5 12:40 by-path
crw-rw---- 1 root video 226, 0 Jan 5 12:40 card0
crw-rw---- 1 root render 226, 128 Jan 5 12:40 renderD128
root@pve:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Jan 1 18:18 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 1 18:18 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jan 1 18:18 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Jan 1 18:18 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Jan 1 18:18 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Jan 1 18:18 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Jan 1 18:18 nvidia-cap2
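
The major numbers in this listing (195, 508 and 511 on my machine) are what the container's cgroup device rules need to match, and they may differ on another host. As a rough sketch, a small helper can turn the ls -l output into allow rules. The function name nvidia_cgroup_allow_lines is made up for this example, and it assumes the usual ls -l layout where the fifth field is the major number followed by a comma.

```shell
#!/bin/sh
# Hypothetical helper: read `ls -l` output on stdin and print one
# cgroup2 allow rule per distinct character-device major number.
nvidia_cgroup_allow_lines() {
    # Keep only character devices (mode string starts with "c"),
    # take field 5 ("195,") and strip the trailing comma.
    awk '$1 ~ /^c/ { major = $5; sub(/,$/, "", major); print major }' \
        | sort -n -u \
        | while read -r major; do
            printf 'lxc.cgroup2.devices.allow: c %s:* rwm\n' "$major"
        done
}
```

On the host it could be used as: ls -l /dev/nvidia* | nvidia_cgroup_allow_lines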

Once the host is working properly, create a new LXC config file or edit an existing one, and add these entries.

lxc.mount.auto: "proc:rw sys:rw"
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap1 dev/nvidia-caps/nvidia-cap1 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps/nvidia-cap2 dev/nvidia-caps/nvidia-cap2 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.apparmor.profile: unconfined
lxc.cap.drop:
lxc.cgroup2.devices.allow: a
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 508:* rwm
lxc.cgroup2.devices.allow: c 511:* rwm

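These lines bind-mount the device files into the LXC container and grant access to them. The major numbers in the allow rules (195, 226, 508, 511) come from the ls -l output on my host, so check them against your own. Also note that PVE 8 already uses cgroup2, while some tutorials online still use lxc.cgroup.devices instead of lxc.cgroup2.devices, so pay close attention to this when reading other people's material.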
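
If you do end up copying lines from an older tutorial, the key rename can be done mechanically. This is only a minimal sketch, assuming the legacy lines use the usual lxc.cgroup.devices.* form; the function name to_cgroup2_keys is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite legacy cgroup v1 device keys to their
# cgroup2 spelling. Reads config lines on stdin, writes to stdout.
# Lines that already use lxc.cgroup2.devices, or any other key,
# pass through unchanged.
to_cgroup2_keys() {
    sed 's/^lxc\.cgroup\.devices\./lxc.cgroup2.devices./'
}
```

For example: to_cgroup2_keys < old-snippet.conf (the file name here is just a placeholder).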

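Next, start the LXC container and install the same .run file inside it, but with one extra flag: because the container shares the host's kernel, there is no need to install the kernel module (DKMS) inside the container.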

./NVIDIA-Linux-x86_64-550.142.run --no-kernel-module
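
Since the container only gets the userspace half of the driver, its version has to match the kernel module already loaded on the host. A quick sanity check before installing might look like the sketch below. The function names are invented for this example, and it assumes the installer keeps the NVIDIA-Linux-x86_64-VERSION.run naming and that /proc/driver/nvidia/version has an "NVRM version" line containing the version number.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the installer version with the
# kernel module version that is currently loaded.

# Extract "550.142" from a name like NVIDIA-Linux-x86_64-550.142.run
installer_version() {
    name="${1##*/}"              # drop any leading directory
    name="${name%.run}"          # drop the .run suffix
    printf '%s\n' "${name##*-}"  # keep what follows the last dash
}

# Extract the module version from the text of
# /proc/driver/nvidia/version, read on stdin.
kernel_module_version() {
    grep '^NVRM version' | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Succeeds when both versions are identical. The second argument is
# the version file and defaults to /proc/driver/nvidia/version.
versions_match() {
    [ "$(installer_version "$1")" = \
      "$(kernel_module_version < "${2:-/proc/driver/nvidia/version}")" ]
}
```

Inside the container it could be used as: versions_match ./NVIDIA-Linux-x86_64-550.142.run && echo "versions match"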

Once the install finishes, restart the LXC container, and then you can run nvidia-smi to test it.

root@k8s:~# nvidia-smi
Sun Jan 5 12:50:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:81:00.0 Off |                  N/A |
|  0%   42C    P0             N/A /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+