Managed Containers

Learn to deploy Docker containers on TensorDock in under 5 minutes

Introduction

TensorDock Marketplace allows you to easily deploy Docker containers on cloud GPUs. Features include:

  • Load distribution: Container replicas are deployed on multiple hostnodes so that requests to containers can be processed in parallel. This reduces the workload on each machine and allows you to make more requests to the container per minute.

  • Scalability: You can change the number of deployed replicas at any time. Our system will automatically scale the number of replicas depending on GPU usage, so you don't have to worry about handling this logic if you don't want to.

  • API: Containers can be deployed, managed, and terminated via the TensorDock API (see the sketch after this list).
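
For illustration, a programmatic deployment might look like the minimal sketch below. This is an assumption-laden sketch only: the endpoint path, authentication fields, and payload keys are hypothetical placeholders, so consult the TensorDock API reference for the actual schema.

import requests

# Hypothetical sketch: the endpoint path, auth fields, and payload keys below
# are placeholders, not the documented TensorDock API schema.
API_BASE = "https://marketplace.tensordock.com/api"  # assumed base URL

def deploy_container(api_key: str, api_token: str) -> dict:
    """Ask TensorDock to deploy a new container group (illustrative payload only)."""
    payload = {
        "api_key": api_key,
        "api_token": api_token,
        "name": "vllm-demo",
        "image": "vllm/vllm-openai:latest",
        "replicas": 3,
    }
    response = requests.post(f"{API_BASE}/containers/deploy", json=payload, timeout=30)
    response.raise_for_status()
    return response.json()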

You will need to set an SSH public key in your organization settings before deploying a container.

Example: Deploy a Docker LLM API

In this example, we'll be deploying a vLLM container. Once it's deployed, we will be able to generate endlessly fascinating text by using the container's API.

vLLM has hardware requirements that must be met for the container to run properly. If you are deploying a vLLM container, allocate at least 2 GPUs per replica and 8 GB of RAM, and make sure the GPU types you select support quantization.

  1. Configure basic details about your container, such as its name and Docker image source.

  2. Configure how your container should be deployed on TensorDock's hostnodes (physical servers). Since resource pricing varies by hostnode, you can set a limit on the hourly rate of each replica. In this example, we'll make sure that the running cost of each replica does not exceed $1.00/hour. If we deploy three replicas, the maximum we can expect to be charged is $3.00/hour. Additionally, you can select which GPU types your container should be deployed on; choices beyond your first are used as fallbacks in case your first choice of GPU is out of stock.

  3. Customize the runtime configuration of your Docker container. The configuration below is equivalent to the Docker command:

docker run --runtime nvidia \
    --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=MY_TOKEN" \
    -p 8000:8000 vllm/vllm-openai:latest \
    --model TheBloke/Llama-2-13B-chat-AWQ \
    --dtype half \
    --quantization awq \
    --max-model-len 4096

  4. Deploy your container. It can take roughly 5-10 minutes for the container to boot, but the duration ultimately depends on the container's size.

  5. Test the deployed API via one of the container URLs (provided on the container management page). In this example, the URL is:

https://1f5f01f6-d480-4f11-8123-04c88b2e6d41.tensordock.app

The path is forwarded to the container's API, so we can test the model using the /v1/completions route. The full request URL looks like this:

https://1f5f01f6-d480-4f11-8123-04c88b2e6d41.tensordock.app/v1/completions
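
As a quick smoke test, you can send a completion request to that URL. The sketch below assumes the container URL from this example and an illustrative prompt; the request body follows vLLM's OpenAI-compatible completions schema, and the model name must match the --model flag used at deploy time.

import requests

# Container URL from this example; substitute your own container's URL.
BASE_URL = "https://1f5f01f6-d480-4f11-8123-04c88b2e6d41.tensordock.app"

payload = {
    "model": "TheBloke/Llama-2-13B-chat-AWQ",  # must match the --model flag passed to vLLM
    "prompt": "San Francisco is a",            # illustrative prompt
    "max_tokens": 64,
    "temperature": 0.7,
}

# The path is forwarded to the container, so /v1/completions reaches vLLM directly.
response = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])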

Congratulations, you've just deployed your first Dockerized LLM on TensorDock!

Scaling Containers

Once your container is live, you may want to scale it depending on usage. TensorDock automatically scales containers based on GPU utilization. Our auto-scaler uses the following logic:

  • If GPU utilization exceeds 80% on any replica, add a replica to the container group

  • If GPU utilization falls below 10% on any replica, scale down the container group by one replica

All container groups maintain a minimum of three running replicas (unless you choose to deploy fewer than three). If you want to disable autoscaling and implement your own scaling system, you can toggle the autoscaling option off on the dashboard; a minimal sketch of what that decision logic might look like follows below.
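
If you do implement your own scaling system, the decision step can mirror the thresholds above. The sketch below is illustrative only: it encodes the same 80%/10% rule, leaves the way you poll utilization or resize the group (for example, via the TensorDock API) as a placeholder, and arbitrarily lets scale-up take precedence if both conditions hold at once.

def desired_replica_change(gpu_utilizations: list[float],
                           scale_up_threshold: float = 0.80,
                           scale_down_threshold: float = 0.10) -> int:
    """Return +1, -1, or 0 using the same rule as TensorDock's auto-scaler:
    add a replica if any replica exceeds 80% GPU utilization, remove one if
    any replica falls below 10%. Utilization is expressed as a fraction in
    [0, 1]; scale-up wins if both conditions hold."""
    if any(u > scale_up_threshold for u in gpu_utilizations):
        return 1
    if any(u < scale_down_threshold for u in gpu_utilizations):
        return -1
    return 0


# Example: one saturated replica triggers a scale-up.
print(desired_replica_change([0.95, 0.40, 0.35]))  # -> 1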

Accessing Replicas

The container management page displays all running virtual machines on which your Docker image has been deployed, along with their GPU specifications, hourly rates, and connection details in case you want to SSH into a replica. This can be useful for debugging a faulty deployment.
