# Linux & NVIDIA Drivers

This tutorial assumes that you have already deployed a GPU server on the TensorDock platform:

{% embed url="<https://marketplace.tensordock.com/order_list>" fullWidth="false" %}

## Important Note - Holding & Unholding NVIDIA driver versions

NVIDIA driver packages update automatically. After an update, the GPUs stay unusable until the VM is rebooted. By default, our templates hold the driver packages at the version they were built with, so an automatic update never leaves the GPUs unusable.

<figure><img src="/files/T6OAHKg8dz6khaUuiong" alt=""><figcaption><p>When NVIDIA drivers automatically update, the GPUs become unusable. Thus, you should always lock a working driver version.</p></figcaption></figure>

To unlock the driver version, run the following command:

```
sudo apt-mark unhold nvidia* libnvidia*
```

Once you upgrade to a new driver version, you can lock it to prevent the driver from updating automatically in the future. To do this, run the following command as the `root` user:

<pre><code><strong>dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | grep -E 'nvidia.*-[0-9]+$' | xargs -r -L 1 apt-mark hold
</strong></code></pre>
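To see which packages the hold command above would select before you run it, the filtering step can be tried on its own. This is a minimal sketch: the function name `select_versioned_nvidia` and the sample package names are hypothetical, and only the regular expression is taken from the command above.

```shell
# Filter package names the same way the hold pipeline does:
# keep names that contain "nvidia" and end in "-<digits>".
select_versioned_nvidia() {
  grep -E 'nvidia.*-[0-9]+$'
}

# Hypothetical package names, for illustration only.
printf '%s\n' nvidia-driver-535 libnvidia-gl-535 nvidia-settings nvidia-utils-535 \
  | select_versioned_nvidia
```

With this sample input, every name except `nvidia-settings` is printed, because that name does not end in a version number. After holding packages on a real VM, `apt-mark showhold` lists what is currently held.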

## Important Note - NVIDIA H100 SXM5

Our NVIDIA H100 SXM5 servers require the `nvidia-fabricmanager-535` package so that the GPU driver can properly use the NVSwitch fabric. **NVLink is only enabled for 8x H100 VMs. If you do not install this package, CUDA will NOT work properly.**

Our TensorML operating system packages include this package, but our base templates do not.

First, we'll need to unhold the default drivers included with our operating system templates:

```
sudo apt-mark unhold nvidia* libnvidia*
```

Then, we'll need to install the NVSwitch FabricManager package:

```
sudo apt update
sudo apt install nvidia-fabricmanager-535
```

Finally, we'll upgrade all of our packages before rebooting, which will bring the GPU driver up to date with the FabricManager package.

```
sudo apt upgrade -y 
sudo reboot
```

<figure><img src="/files/FETYarcOsyrGfT2KzMhd" alt=""><figcaption></figcaption></figure>

As pictured, `nvidia-smi -q` should report a Fabric State of `Completed` after the reboot. This indicates the GPUs are ready for use!
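On an 8x H100 VM, the `nvidia-smi -q` output is long, so it can help to pull out just the Fabric State values. The sketch below assumes each GPU's section contains a `Fabric` header line followed by a `State : <value>` line, as in the screenshot; the helper name `fabric_states` and the sample text are hypothetical, and the layout may differ between driver versions.

```shell
# Read `nvidia-smi -q` output on stdin and print each Fabric State value.
# Assumes a "Fabric" header line followed by a "State : <value>" line.
fabric_states() {
  awk '
    /^[ \t]*Fabric[ \t]*$/ { in_fabric = 1; next }
    in_fabric && /State/   { sub(/^[^:]*:[ \t]*/, ""); print; in_fabric = 0 }
  '
}

# Hypothetical sample input; on a real VM run: nvidia-smi -q | fabric_states
printf '    Fabric\n        State : Completed\n        Status : Success\n' | fabric_states
```

On a healthy VM, every printed line should read `Completed`.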

## Installing a new driver

### 1. Search for your NVIDIA driver

First, search for your GPU through the link below and open the page for the matching NVIDIA driver.

For instance, for a GeForce RTX 4090:

* Product Type: GeForce
* Product Series: GeForce RTX 40 Series
* Product: NVIDIA GeForce RTX 4090
* Operating System: Linux 64-bit
* Download Type: Production Branch
* Language: English (US)

{% embed url="<https://www.nvidia.com/download/index.aspx>" %}
Click on this link to search for the NVIDIA driver for your graphics card
{% endembed %}

### 2. Visit the downloads page

Once you are redirected to the driver page, click on the "Download" button. Don't worry; it won't actually initiate a download. It will simply redirect you to a page where you'll confirm NVIDIA's EULA.

<figure><img src="/files/iXzTFRnqHn8b9lHoPBX9" alt=""><figcaption></figcaption></figure>

### 3. Copy the driver download link

Now, you can copy the link to the actual driver.

<figure><img src="/files/8hDQ3AveACcKWGB0nFwP" alt=""><figcaption></figcaption></figure>

### 4. SSH onto your TensorDock instance

Connect using the external port that is forwarded to port 22 on your VM as your SSH port. Once logged in, you should see something like the following:

<figure><img src="/files/GYZz39SnNYCXnehekqFy" alt=""><figcaption><p>Whoops, nvidia-smi doesn't work! Downloading new drivers will fix that...</p></figcaption></figure>
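As a rough illustration of the connection step, the command can be assembled from three values. Everything below is a placeholder: the IP address, the forwarded port, and the username are hypothetical and should be replaced with the values shown for your own VM.

```shell
# Hypothetical values: replace with the details shown for your VM.
VM_IP="203.0.113.10"   # placeholder address from the documentation range
SSH_PORT="20022"       # assumed external port forwarded to port 22
VM_USER="user"         # assumed username

# Build the SSH command and print it so it can be checked before running.
SSH_COMMAND="ssh -p ${SSH_PORT} ${VM_USER}@${VM_IP}"
echo "$SSH_COMMAND"
```

The `-p` flag tells `ssh` to use the forwarded port instead of the default port 22.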

### 5. Download the driver onto your VM

Run `wget` followed by the driver's URL. This saves the installer in your current directory.

<figure><img src="/files/oaJ6liSumqG7ezoBzLDW" alt=""><figcaption></figcaption></figure>
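The later steps need the installer's file name, which is simply the last part of the URL. The sketch below uses the URL from the reference commands at the end of this tutorial; the variable names and the `download_driver` helper are hypothetical, and you should substitute the link you copied in step 3.

```shell
# URL from this tutorial's reference commands; substitute the link you copied.
DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/525.60.11/NVIDIA-Linux-x86_64-525.60.11.run"

# wget saves the file under the last path component of the URL.
DRIVER_FILE="$(basename "$DRIVER_URL")"
echo "Installer file name: $DRIVER_FILE"

# Download helper; call it as: download_driver "$DRIVER_URL"
download_driver() {
  wget "$1"
}
```

Keeping the file name in a variable avoids retyping it for the `chmod` and installer steps that follow.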

### 6. Enable execution permissions for the driver installer you just downloaded

Run `chmod +x` followed by the file name.

<figure><img src="/files/Ka475lAjb5IxJMtReXUI" alt=""><figcaption></figcaption></figure>

### 7. Run the driver installer

Run `sudo ./[DRIVER_FILENAME]`

<figure><img src="/files/CHj1NtjLrcqWOi8mHEqQ" alt=""><figcaption></figcaption></figure>

### 8. Reboot!

Complete the installer's questionnaire, and then run `sudo reboot` to reboot your virtual machine!

### 9. Confirm everything is working

Now, `nvidia-smi` should work!

<figure><img src="/files/sElvEA3G44HYxtoomhdj" alt=""><figcaption></figcaption></figure>
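For a more compact check than the full `nvidia-smi` table, the tool's query mode can print just the GPU name and driver version. This is a sketch rather than an official procedure: the `gpu_summary` helper and the `NVIDIA_SMI` override variable are hypothetical names introduced here.

```shell
# NVIDIA_SMI is a hypothetical override; it defaults to the real tool.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

gpu_summary() {
  # Fail early with a readable message if the tool is missing.
  if ! command -v "$NVIDIA_SMI" >/dev/null 2>&1; then
    echo "$NVIDIA_SMI not found; the driver is probably not installed" >&2
    return 1
  fi
  # Print one CSV line per GPU: model name and driver version.
  "$NVIDIA_SMI" --query-gpu=name,driver_version --format=csv,noheader
}
```

Run `gpu_summary` after the reboot. The reported driver version should match the installer you ran.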

### Issues

If you're still facing issues, email us at <support@tensordock.com>. For reference, these were the commands we ran while making this tutorial:

```
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/525.60.11/NVIDIA-Linux-x86_64-525.60.11.run
chmod +x NVIDIA-Linux-x86_64-525.60.11.run
sudo ./NVIDIA-Linux-x86_64-525.60.11.run
sudo reboot
```

