Running a TensorFlow test on a Vast.ai host

Vast.ai is a cloud computing, matchmaking and aggregation service focused on lowering the price of compute-intensive workloads.

We have been hosting a few systems on the vast.ai platform, and we have found it vital to test each system before making it live.

To get the most out of vast.ai and your hardware investment, it is essential to get verified by vast.ai. This tutorial is intended to help you test your system so that it has the best chance of passing the verification process.

We assume that you have a host listed on vast.ai and that you can find it under the CLIENT Create section.

This test runs a TensorFlow workload that is typical of what clients deploy on vast.ai. You might think a crypto mining test would be adequate to test your system, but sadly it is grossly inadequate for this purpose.

Step 1: Configure the docker OS image

Use the vastai/tensorflow image and select the latest version. For the Launch Mode, select "Run a jupyter-python notebook".

The jupyter-python notebook is by far the easiest way to get access to your running docker instance. You can also connect with an SSH terminal program, but that is beyond the scope of this guide.

Step 2: Find and rent the host you want to test

Modify the filters to find your system. In this case, we will be renting a system with 8 RTX 2070S GPUs.

If you can't find it, make sure you tick both the Unavailable Offers and the Unverified Machines options.

Although the button shows YOU RENTED and is greyed out, you can still click on it to create your instance.

Step 3: Connect to your system

Go to the CLIENT->Instances section and scroll down to find the new instance.

Once the connect button is available, click on it to open the jupyter-python notebook tab in your browser. It might take a few minutes to work so give it time to load and for the servers to update.

Step 4: Open a terminal

Click on New and then Terminal to open another browser tab with a CLI terminal that you can use to access the running docker instance.

Step 5: Run the stress test

Before pasting the command below, change --num_gpus=8 to match the number of GPUs in the system. In this case, it was 8.
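If you are not sure how many GPUs the container can actually see, a quick check in the same terminal may help. The sketch below assumes nvidia-smi is available inside the instance and relies on `nvidia-smi -L` printing one line per GPU; the helper name count_gpu_lines and the variable NUM_GPUS are made up for this example.

```shell
# Count GPU entries in `nvidia-smi -L` style output read from stdin.
# Each GPU is listed on its own line starting with "GPU <index>:".
count_gpu_lines() {
    # grep -c exits non-zero when the count is 0, so `|| true` keeps the exit status clean
    grep -c '^GPU ' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS="$(nvidia-smi -L 2>/dev/null | count_gpu_lines)" || NUM_GPUS=0
else
    # Assumption: no NVIDIA tools in the container means no usable GPUs
    NUM_GPUS=0
fi

echo "Detected ${NUM_GPUS} GPU(s); set --num_gpus=${NUM_GPUS} in the benchmark command"
```

If the number printed here is lower than the number of cards installed in the host, that is worth investigating before running the benchmark.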

bash -c 'apt -y update; apt install -y git; git clone https://github.com/tensorflow/benchmarks.git; cd benchmarks; git checkout 7d578b912c16c138e819ea9bf40113f8c7ae6811; cd scripts/tf_cnn_benchmarks/; python3 tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --model=resnet50 --optimizer=momentum --variable_update=replicated --nodistortions --gradient_repacking=8 --num_gpus=8 --num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 --train_dir=${CKPT_DIR} --allow_growth=True'

The output should look like this after about 2-5 min.

It is recommended to run this test for a few hours to ensure your system is capable of handling this type of workload. This test might not reveal all the reasons why you could fail verification, but it is a big improvement over using a cryptocurrency miner.
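During a run of several hours, it can be useful to keep a record of how the GPUs behave. The following is a minimal sketch, assuming nvidia-smi is available in the instance and that you open a second terminal tab for it. The function name log_gpu_stats, the 60-second interval and the gpu_stats.csv file name are illustrative choices, not vast.ai requirements.

```shell
# Append one CSV row per GPU to a log file at a fixed interval.
# Both arguments are illustrative defaults, not values from vast.ai.
log_gpu_stats() {
    interval_seconds="${1:-60}"
    log_file="${2:-gpu_stats.csv}"
    # -l repeats the query every N seconds until the command is interrupted
    nvidia-smi \
        --query-gpu=timestamp,index,temperature.gpu,utilization.gpu,power.draw \
        --format=csv \
        -l "${interval_seconds}" >> "${log_file}"
}

# Example: sample every 60 seconds until interrupted with Ctrl+C
# log_gpu_stats 60 gpu_stats.csv
```

Reviewing the log afterwards may show whether any card overheated, throttled or dropped out partway through the test.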