# Linux & NVIDIA Drivers

This tutorial assumes that you have already deployed a GPU server on the TensorDock platform:

{% embed url="<https://marketplace.tensordock.com/order_list>" fullWidth="false" %}

## Important Note - Holding & Unholding NVIDIA driver versions

NVIDIA drivers update automatically. After an update, the GPUs are unusable until the server is rebooted. By default, our templates lock the driver to the version they were built with so that an automatic update never makes the GPUs unusable.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2Fc7hpQiojZHSD8F8wmfpV%2Fimage.png?alt=media&#x26;token=797c744a-a20c-41d3-9260-57748be9bc5a" alt=""><figcaption><p>When NVIDIA drivers automatically update, the GPUs become unusable. Thus, you should always lock a working driver version.</p></figcaption></figure>

To unlock the driver version, run the following command:

```
sudo apt-mark unhold 'nvidia*' 'libnvidia*'
```

Once you upgrade to a new driver version, you can lock it to prevent the driver from updating automatically in the future. Run the following command as the `root` user to do this.

<pre><code><strong>dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | grep -E 'nvidia.*-[0-9]+$' | xargs -r -L 1 apt-mark hold
</strong></code></pre>
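To see which packages the pipeline above would select before holding anything, its filtering stages can be split out and run on their own. This is a minimal sketch for bash; `select_versioned_nvidia` is a hypothetical helper name, not part of apt or the NVIDIA tooling.

```shell
# Hypothetical helper: reads "package status" lines on stdin (the format
# produced by dpkg-query -W --showformat='${Package} ${Status}\n'), drops
# entries marked deinstall, and prints package names that contain "nvidia"
# and end in -<digits>, such as a driver branch number.
select_versioned_nvidia() {
  grep -v deinstall | awk '{ print $1 }' | grep -E 'nvidia.*-[0-9]+$'
}

# Dry run on a real system (prints names only, changes nothing):
#   dpkg-query -W --showformat='${Package} ${Status}\n' | select_versioned_nvidia
# List the packages that are currently held:
#   apt-mark showhold
```

Running the dry run first and `apt-mark showhold` afterwards is one way to confirm that the hold covered the packages you expected.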

## Important Note - NVIDIA H100 SXM5

Our NVIDIA H100 SXM5 servers require the `nvidia-fabricmanager-535` package so that the GPU driver can properly use the installed NVSwitch fabric. **NVLink is only enabled for 8x H100 VMs. If you do not install this package, CUDA will NOT work properly.**

Our TensorML operating system packages include this package, but our base templates do not.

First, we'll need to unhold the default drivers included with our operating system templates:

```
sudo apt-mark unhold 'nvidia*' 'libnvidia*'
```

Then, we'll need to install the NVSwitch FabricManager package:

```
sudo apt update
sudo apt install nvidia-fabricmanager-535
```

Finally, we'll upgrade all of our packages before rebooting, which will bring the GPU driver up to date with the FabricManager package.

```
sudo apt upgrade -y
sudo reboot
```

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FSyIkiEmeguVmATeHgM2y%2Fimage.png?alt=media&#x26;token=23a8a98b-754c-4cfe-8da2-741704e7940d" alt=""><figcaption></figcaption></figure>

As pictured, `nvidia-smi -q` should show Fabric State = Completed after the reboot. This indicates that the GPUs are ready for use!
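The same check can be scripted instead of read off the screen. The sketch below is for bash and assumes that `nvidia-smi -q` prints a `Fabric` heading followed by an indented `State : ...` line for each GPU, as in the screenshot; `fabric_states` and `fabric_ready` are hypothetical helper names.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin and prints the
# value of the first "State" field that follows each "Fabric" heading.
fabric_states() {
  awk '
    /^[ \t]*Fabric[ \t]*$/ { in_fabric = 1; next }
    in_fabric && /^[ \t]*State[ \t]*:/ {
      sub(/^[^:]*:[ \t]*/, "")   # drop the field name and separator
      sub(/[ \t]+$/, "")         # drop trailing whitespace
      print
      in_fabric = 0
    }
  '
}

# Succeeds only if at least one state was found and every one is "Completed".
fabric_ready() {
  local states
  states=$(fabric_states)
  [ -n "$states" ] && ! printf '%s\n' "$states" | grep -qvx 'Completed'
}

# Usage on the server:
#   nvidia-smi -q | fabric_ready && echo "Fabric is ready"
```

If the output layout differs on your driver version, adjust the two patterns rather than relying on this sketch as written.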

## Installing a new driver

### 1. Search for your NVIDIA driver

First, search for your GPU through the link below and copy the link to the NVIDIA driver.

For instance, for a GeForce RTX 4090:

* Product Type: GeForce
* Product Series: GeForce RTX 40 Series
* Product: NVIDIA GeForce RTX 4090
* Operating System: Linux 64-bit
* Download Type: Production Branch
* Language: English (US)

{% embed url="<https://www.nvidia.com/download/index.aspx>" %}
Click on this link to search for the NVIDIA driver for your graphics card
{% endembed %}

### 2. Visit the downloads page

Once you are redirected to the driver page, click the "Download" button. Don't worry; this won't start a download yet. It simply takes you to a page where you confirm NVIDIA's EULA.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FPAju88GjsCwLDcZOYd7H%2Fimage.png?alt=media&#x26;token=ce3b2f84-18de-49fc-bd48-5136c1b7cec0" alt=""><figcaption></figcaption></figure>

### 3. Copy the driver download link

Now, you can copy the link to the actual driver.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FjILUTGA8vo4ZaSJsQq4s%2Fimage.png?alt=media&#x26;token=f9e9fb30-53a7-4850-859a-04c7bb472547" alt=""><figcaption></figcaption></figure>

### 4. SSH onto your TensorDock instance

Connect over SSH using the external port that is forwarded to port 22 on your instance. You should see something like the following:

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FjhwvnUJ7lM2zeLGAR3s3%2Fimage.png?alt=media&#x26;token=cd4581e2-620b-4105-8ac5-70bccbe55322" alt=""><figcaption><p>Whoops, nvidia-smi doesn't work! Downloading new drivers will fix that...</p></figcaption></figure>
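As a sketch of what the connection command looks like with a non-default port, the snippet below assembles it from placeholder values. The address, port number, and login name are all assumptions for illustration; substitute the details shown for your own instance.

```shell
# Placeholder values (assumptions): replace them with your instance's details.
SERVER_IP="203.0.113.10"  # example address reserved for documentation
SSH_PORT="20022"          # hypothetical external port forwarded to port 22
SSH_USER="user"           # assumed login name; use the one you set at deployment

# Print the command rather than running it, so it can be checked first.
# The -p option tells ssh which port to connect to.
ssh_command() {
  printf 'ssh -p %s %s@%s\n' "$SSH_PORT" "$SSH_USER" "$SERVER_IP"
}

ssh_command
```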

### 5. Download the driver onto your VM

Run `wget` followed by the driver's URL. This saves the installer in your current directory.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FogteZRl4qrbZqIzcv1zd%2Fimage.png?alt=media&#x26;token=3beadaee-e426-42c7-920e-d0107dfcba41" alt=""><figcaption></figcaption></figure>
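A minimal sketch of this step, using the driver URL from the reference commands at the end of this tutorial as a stand-in for the link you copied. `download_driver` is a hypothetical helper name; the snippet only defines it, so nothing is downloaded until you call it.

```shell
# Stand-in URL: replace it with the link you copied in step 3.
DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/525.60.11/NVIDIA-Linux-x86_64-525.60.11.run"

# wget names the saved file after the last path component of the URL.
DRIVER_FILE=$(basename "$DRIVER_URL")

# Hypothetical helper: downloads the installer into the current directory.
# -c resumes a partial download if the connection dropped earlier.
download_driver() {
  wget -c "$DRIVER_URL"
}

echo "Installer will be saved as: $DRIVER_FILE"
```

Keeping the file name in `DRIVER_FILE` means the later `chmod` and installer steps can reuse it instead of retyping the name.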

### 6. Enable execution permissions for the driver installer you just downloaded

Run `chmod +x` followed by the file name.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FdQT1hExQcJfxDMGqaCFW%2Fimage.png?alt=media&#x26;token=4c38bc0c-83ae-4c33-864e-f6f0c127dce0" alt=""><figcaption></figcaption></figure>

### 7. Run the driver installer

Run `sudo ./[DRIVER_FILENAME]`, replacing `[DRIVER_FILENAME]` with the name of the file you downloaded.

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2FyRSmQPGrGYbMMzNIwOUj%2Fimage.png?alt=media&#x26;token=80ec09ef-e376-4a83-aca8-a1be2477e596" alt=""><figcaption></figcaption></figure>

### 8. Reboot!

Complete the installer's questionnaire, and then run `sudo reboot` to reboot your virtual machine!

### 9. Confirm everything is working

Now, `nvidia-smi` should work!

<figure><img src="https://276866638-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FScYFYZoiZXazILi6lxJ5%2Fuploads%2Fi0ezCtsJVvj0I6XoTHsa%2Fimage.png?alt=media&#x26;token=617b08e8-ce54-4fde-9304-91430e7a7a54" alt=""><figcaption></figcaption></figure>
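Beyond checking that the command runs, you may want to confirm that every GPU reports the driver version you just installed. The sketch below is for bash; `all_gpus_on_version` is a hypothetical helper that reads the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which prints one version per GPU.

```shell
# Hypothetical helper: reads one driver version per line on stdin and succeeds
# only if there is at least one line and every line equals the expected version.
all_gpus_on_version() {
  local expected=$1 seen=0 line
  while IFS= read -r line; do
    seen=1
    [ "$line" = "$expected" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Usage on the server (525.60.11 is the version used in this tutorial):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader \
#     | all_gpus_on_version 525.60.11 && echo "Driver version OK"
```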

### Issues

If you're still facing issues, email us at <support@tensordock.com>. For reference, these were the commands we ran while making this tutorial:

```
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/525.60.11/NVIDIA-Linux-x86_64-525.60.11.run
chmod +x NVIDIA-Linux-x86_64-525.60.11.run
sudo ./NVIDIA-Linux-x86_64-525.60.11.run
sudo reboot
```
