ROCm for "Old" AMD GPU

Some Back Story

I recently recycled a few RX580s I bought during the height of the 2017-2018 coin-mining craze. Back then RX580s sold for around C$600 and hashed at around 30-ish MH/s. At that rate, we figured we'd break even with the ETH price above $800. But maintaining many GPUs and keeping them running stably enough is troublesome. We set off fire alarms, and there were a few instances where a PCIe riser or MOLEX power cable/connector just melted. For the sake of personal safety, we stopped mining.

Time flies, and 2021 arrives. We sold the cards at a surprisingly high price (~C$200) thanks to another shortage, courtesy of the trade war, COVID, and coin boom round 2. A few cards we couldn't sell because they weren't in great condition. To be fair, a few cards were never stable enough to mine, and those were probably the culprits. They seemed okay enough to pass FurMark after we reverted their BIOS to factory settings. One card, though, still refuses to run some later titles (GTA5, Cities: Skylines, etc.) in full screen. Another only passes FurMark "most of the time". Point being: be very careful buying mining cards, even if they DO boot and DO pass FurMark.

Since GCP (Google Cloud) is kind enough to run out of GPU resources frequently enough, and it is just impossible to get hold of an RTX 30xx card at a non-ridiculous price at the moment, we figured it'd be a good idea to look at AMD's ROCm offering. Deep learning is a lot of fun, but it is not fun if we can't get GPUs. Hopefully this will do for now.

Issues

As of March 2021, ROCm support for PyTorch is pretty good. PyTorch has even added a beta build for ROCm. But AMD magically decided to cut off ROCm support at Vega. That means none of the pre-Vega cards (Polaris/Ellesmere/Baffin, i.e. the RX 400/500 series) work post ROCm 4.0. I also remember seeing posts about support being iffy from ROCm 3.8 onwards. BUT ROCm is also NOT supported on the very latest GPUs (Big Navi, at the time of writing). AMD, please spend more time and money on supporting the deep learning community, so researchers are not left at the mercy of NVIDIA.

If you follow the official AMD guide for ROCm installation, you will run into issues installing PyTorch, along the lines of "some binary cannot be found for the current device", which refers to the discontinued support. One such post is here. Here, gfx803 is the RX580.
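If you want to confirm what target your own card reports, rocminfo (installed under /opt/rocm as part of the setup below) prints the gfx name for each GPU agent; a quick check might look like this:

```bash
# After the ROCm install described below, confirm the card's gfx target.
# Polaris cards (RX 470/480/570/580) report gfx803.
/opt/rocm/bin/rocminfo | grep -i gfx
```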

What I did, for Ubuntu 20.04

DO NOT INSTALL ANY DRIVER. NOTHING ELSE IS NEEDED; RUN THESE AS SOON AS YOU BOOT.
So, do read the official guide, then run the steps below. BUT these are just to get your local system ready; you are NOT going to run things locally. Instead, we want to use Docker. (Rough sketches of the commands are collected after the step list.)

For your local system


Permission for groups.
Update the system
Include the latest ROCm toolkit
Install the ROCm toolkit normally. If you cannot install the rocm-dkms package due to this issue, it probably means you have a previous AMD driver installation. Find a way to run "amdgpu-pro-uninstall" or "amdgpu-uninstall" to remove the driver, or just do a fresh install. It takes no more than 15 minutes using a USB 3.0 stick onto an NVMe SSD.
Change group permissions, due to a ROCm requirement on Ubuntu 20.04.
Install a bunch of ROCm toolkits.
These commands let you inspect the condition of the GPUs. They should return no errors when you run them. If you see group permission issues, make sure you added yourself (or the user) to the video and render groups.
Install some other stuff. Not sure if it is useful though, welp.
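For reference, the local setup boiled down to commands roughly like the following. The repo line and package names follow AMD's ROCm apt instructions for Ubuntu 20.04 as of early 2021, so double-check the official guide for the current versions; the "bunch of ROCm toolkits" line is my guess at which meta-packages are meant:

```bash
# Update the system.
sudo apt update && sudo apt dist-upgrade -y
sudo apt install -y libnuma-dev

# Add AMD's ROCm apt repository (key + source list).
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ xenial main' | \
    sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update

# Install the ROCm toolkit / kernel driver. If this fails, uninstall any old
# amdgpu-pro driver first (amdgpu-pro-uninstall / amdgpu-uninstall).
sudo apt install -y rocm-dkms

# Group permissions required on Ubuntu 20.04.
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME

# A bunch of ROCm toolkits (guess: the usual meta-packages).
sudo apt install -y rocm-dev rocm-libs rocm-utils

sudo reboot
```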
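And the GPU-inspection commands mentioned above; after a reboot, these should list every card (the RX580 shows up as gfx803) without permission errors:

```bash
# Lists the CPU and each GPU agent; the RX580 shows up as gfx803.
/opt/rocm/bin/rocminfo

# OpenCL view of the same devices.
/opt/rocm/opencl/bin/clinfo

# Temperatures, clocks, fan speeds, and VRAM usage per card.
/opt/rocm/bin/rocm-smi
```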

For Docker

DO NOT try to work directly on your local system. Your life will be miserable. If you do figure it out though, let me know *(^_^)*.


Install Docker, because running these things in Docker is much easier. (Rough sketches of the Docker commands are collected after this list.)
Pull the Docker image. This actually took me a while to figure out. There is a ROCm guide in PyTorch, but at the time of writing it contains inconsistent information. That is, the Docker image it points to is for the latest ROCm, which won't support the RX580. Note that the following images are uploaded by rocmdev, not jeffdaily. A few of jeffdaily's uploads on the Docker registry do not have PyTorch compiled with the ROCm compile flag.
The guide will say to run some tests, but it seems like the tests actually fail. Not sure if this is bad, welp... Also, this version of torch does not support the --verbose flag.
torchvision is always needed.
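A sketch of the Docker steps, assuming Ubuntu's docker.io package is good enough; the image name and tag are placeholders for whichever ROCm 3.x PyTorch image you settle on (a ROCm 4.x image will not work on gfx803), and the device/group flags are the standard ones from the ROCm container instructions:

```bash
# Install Docker (the Ubuntu repo package is the simplest route).
sudo apt install -y docker.io
sudo usermod -a -G docker $LOGNAME   # log out and back in afterwards

# Placeholder image: substitute the ROCm 3.x PyTorch build mentioned above.
IMG="rocm/pytorch:SOME_ROCM3_TAG"

docker pull "$IMG"

# Pass the GPUs through to the container.
docker run -it \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined \
    --ipc=host \
    "$IMG"
```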
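For torchvision, one route that avoids pip dragging in a CUDA-built torch as a dependency is to build it from source inside the container, against the preinstalled ROCm torch; something along these lines:

```bash
# Inside the container: build torchvision against the ROCm-built torch.
git clone https://github.com/pytorch/vision.git
cd vision
python3 setup.py install
```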

Checking our Torch Installation

Launch Python. Then run the snippets collected after this list.
List the "CUDA" devices.
Very simple stuff.
Run a model. Oh, first move to some other directory, not /pytorch. Also notice that you are root on that Docker image, so do whatever you feel comfortable with. This runs for a dozen or so epochs, and the final accuracy should be "SOTA", i.e. around 99%.
Note this only runs on 1 GPU. See the info on how to change the visible devices. But that doesn't seem to work as intended, so I just changed to cuda:1 or cuda:0 to test my other GPUs.
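Listing the "CUDA" devices and the very simple stuff, in one Python session (the ROCm build exposes the AMD GPUs through the regular torch.cuda API):

```python
import torch

# The ROCm build reports the AMD GPUs through the CUDA API.
print(torch.cuda.is_available())        # should be True
print(torch.cuda.device_count())        # number of cards detected
print(torch.cuda.get_device_name(0))    # the RX580

# Very simple stuff: a matmul on the GPU.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum())
```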
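For "run a model", the description (a dozen or so epochs, ~99% accuracy) matches the MNIST example from the pytorch/examples repo, so that is what I assume here; clone it somewhere other than /pytorch:

```bash
# Grab the PyTorch examples repo outside /pytorch and run MNIST.
cd ~
git clone https://github.com/pytorch/examples.git
cd examples/mnist
python3 main.py    # a dozen or so epochs, ends around 99% test accuracy
```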
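And to hit the second card without relying on the visible-devices environment variable (which didn't seem to behave for me), just address the device index directly:

```python
import torch

# Pick the card by index instead of masking devices via env vars.
dev = torch.device("cuda:1")    # second GPU; use "cuda:0" for the first
x = torch.randn(1024, 1024, device=dev)
print(torch.cuda.get_device_name(dev), (x @ x).sum().item())
```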