Linux & NVIDIA Drivers
This tutorial assumes that you have already deployed a GPU server on the TensorDock platform.
NVIDIA drivers automatically update. Once the drivers update, they require a reboot for the GPUs to become usable again. By default, our templates lock your driver image to the version they were built with so that the GPUs never become unusable.
To unlock the driver version, run the following command:
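A minimal sketch of the unlock step, assuming the templates pin the driver with apt-mark hold. The helper name unhold_nvidia is hypothetical; it reads the held package names from apt-mark showhold instead of guessing them.

```shell
# Hypothetical helper: release every held package whose name mentions "nvidia".
# Assumes the driver lock was created with `apt-mark hold`.
unhold_nvidia() {
  held=$(apt-mark showhold | grep -i nvidia || true)
  if [ -z "$held" ]; then
    echo "No held NVIDIA packages found."
    return 0
  fi
  # $held is intentionally unquoted so each package name becomes its own argument.
  sudo apt-mark unhold $held
}
```

Calling unhold_nvidia with no arguments releases the lock; running apt-mark showhold afterwards should no longer list the NVIDIA packages.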
Once you upgrade to a new driver version, you can lock it to prevent the driver from updating automatically in the future. Run the following command as the root user to do this.
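One way to lock the newly installed version, sketched under the assumption that the lock is an apt-mark hold on every installed NVIDIA package. The helper name hold_nvidia is hypothetical, and the package list is read from dpkg -l so that no package names need to be hard-coded.

```shell
# Hypothetical helper: hold every installed package whose name mentions "nvidia".
# Run as root, as described above; prefix apt-mark with sudo otherwise.
hold_nvidia() {
  # `dpkg -l` marks installed packages with "ii" in the first column.
  pkgs=$(dpkg -l | awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }')
  if [ -z "$pkgs" ]; then
    echo "No installed NVIDIA packages found."
    return 0
  fi
  # $pkgs is intentionally unquoted so each package name becomes its own argument.
  apt-mark hold $pkgs
}
```

After calling hold_nvidia, apt-mark showhold should list the packages that are now pinned.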
Our NVIDIA H100 SXM5 servers require the nvidia-fabricmanager-535 package so that the GPU driver can properly use the installed NVSwitch fabric. If you do not install this package, CUDA will NOT work properly, even if all GPUs are deployed.
Our TensorML operating system templates include this package, but our base templates do not.
First, unhold the default drivers included with our operating system templates.
Then, install the NVSwitch FabricManager package.
Finally, upgrade all packages and reboot, which brings the GPU driver up to date with the FabricManager package.
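The three steps can be sketched as a single helper. This is an illustration under assumptions, not the exact commands from the original page: the name install_fabricmanager is hypothetical, the unhold step assumes the drivers were pinned with apt-mark hold, and the apt-get update call is added here only to refresh the package lists before installing.

```shell
# Hypothetical helper: unhold the drivers, install FabricManager, upgrade, reboot.
install_fabricmanager() {
  # Step 1: release any held NVIDIA packages (assumes an apt-mark hold lock).
  held=$(apt-mark showhold | grep -i nvidia || true)
  if [ -n "$held" ]; then
    sudo apt-mark unhold $held || return 1
  fi
  # Step 2: install the NVSwitch FabricManager package named in this guide.
  sudo apt-get update || return 1
  sudo apt-get install -y nvidia-fabricmanager-535 || return 1
  # Step 3: upgrade everything so the GPU driver matches FabricManager, then reboot.
  sudo apt-get upgrade -y || return 1
  sudo reboot
}
```

Each step stops the helper if it fails, so the server does not reboot after a half-finished upgrade.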
After the reboot, nvidia-smi -q should show Fabric State = Completed. This indicates the GPUs are ready for use!
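To check this without reading the whole report, a small filter can help. The sketch below assumes that the word Completed appears within a few lines of the Fabric heading in the nvidia-smi -q report; the exact layout can differ between driver versions, and the helper name fabric_ready is hypothetical.

```shell
# Hypothetical helper: succeed only when nvidia-smi reports a completed fabric state.
# Assumes "Completed" appears within three lines after a line mentioning "Fabric".
fabric_ready() {
  nvidia-smi -q | grep -i -A 3 'fabric' | grep -q 'Completed'
}
```

For example, fabric_ready && echo "GPUs ready" prints a message only on success. The check passes as soon as one GPU reports Completed, so inspect the full nvidia-smi -q report if you need to confirm every GPU.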
First, search for your GPU on NVIDIA's driver download page so that you can copy the link to the NVIDIA driver.
For instance, for a GeForce 4090:
Product Type: GeForce
Product Series: GeForce RTX 40 Series
Product: NVIDIA GeForce RTX 4090
Operating System: Linux 64-bit
Download Type: Production Branch
Language: English (US)
Once you get redirected to the driver, click on the "Download" button. Don't worry; it won't actually initiate a download. It will simply redirect you to a page where you'll confirm NVIDIA's EULA.
Now, you can copy the link to the actual driver.
Next, connect to your server over SSH. Use the port forwarded into port 22 as your SSH port.
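As a sketch, the connection command has the following shape. The helper name connect_gpu_server and its three arguments (server IP, forwarded port, login user) are placeholders; substitute the values from your own deployment.

```shell
# Hypothetical helper: open an SSH session through the forwarded port.
# $1: server IP address, $2: external port forwarded into port 22, $3: login user
connect_gpu_server() {
  ssh -p "$2" "$3@$1"
}
```

The -p option tells ssh to use the forwarded port instead of the default port 22.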
Run wget followed by the driver's URL. This will save the driver in whatever directory you're in.
Run chmod +x followed by the file name to make the installer executable.
Run sudo ./[DRIVER_FILENAME] to start the installer.
Complete the questionnaire, and then run sudo reboot to reboot your virtual machine!
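The download and install steps above can be sketched as one helper. The name install_driver is hypothetical, and the installer file name is assumed to be the last path component of the URL you copied.

```shell
# Hypothetical helper: download an NVIDIA installer and launch it.
# $1 is the driver URL copied from NVIDIA's download page.
install_driver() {
  url=$1
  file=$(basename "$url")       # installer file name, taken from the URL
  wget "$url" || return 1       # saves the installer into the current directory
  chmod +x "$file" || return 1  # make the installer executable
  sudo "./$file"                # starts the interactive installer
}
```

Call it as install_driver followed by the copied URL, answer the installer's questions, and then run sudo reboot.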
Now, nvidia-smi should work!
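For a quick confirmation, the following sketch prints each GPU's name and driver version. The helper name check_driver is hypothetical; the --query-gpu and --format options are standard nvidia-smi query options.

```shell
# Hypothetical helper: report GPU name and driver version, or fail if
# nvidia-smi is not installed.
check_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; the driver may not be installed." >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

If the driver loaded correctly, check_driver prints one line per GPU.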
If you're still facing issues, email us at support@tensordock.com. For reference, these were the commands we ran while making this tutorial: