ROCm for "Old" AMD GPU

Some Back Story

I recently recycled a few RX580s I bought during the height of the 2017-2018 coin-mining craze. Back then RX580s sold for around C$600 and hashed at around 30-ish MH/s. At that rate, we figured we'd break even with the ETH price above $800. But maintaining many GPUs and keeping them running stably enough is troublesome. We set off fire alarms, and there were a few instances where a PCIe riser or MOLEX power cable/connector just melted. For the sake of personal safety, we stopped mining.

Time flies, and 2021 arrives. We sold the cards at a surprisingly high price (~C$200) thanks to another shortage, courtesy of the trade war, COVID, and coin boom round 2. A few cards we couldn't sell because they weren't in great condition. To be fair, a few cards were never stable enough to mine, and those were probably the culprits. They seemed okay enough to pass FurMark after we reverted their BIOS to factory settings. One card, though, still refuses to run some later titles (GTA5, Cities: Skylines, etc.) in full screen. Another only passes FurMark "most of the time". Point being: be very careful buying mining cards, even if they DO boot and DO pass FurMark.

Since GCP (Google Cloud) is kind enough to run out of GPU resources frequently enough, and it is just impossible to get hold of an RTX 30xx card at a non-ridiculous price at the moment, we figured it'd be a good idea to look at AMD's ROCm offering. Deep learning is a lot of fun, but it is not fun if we can't get GPUs. Hopefully this will do for now.

Issues

As of March 2021, ROCm support for PyTorch is pretty good. PyTorch has even added a beta build for ROCm. But AMD magically decided to cut off ROCm support at Vega. That means none of the pre-Vega cards (Polaris/Ellesmere/Baffin, i.e. the RX 400/500 series) work post ROCm 4.0. I also remember seeing posts about support being iffy from ROCm 3.8 onwards. BUT ROCm is also NOT supported on the very latest GPUs (Big Navi, at the time of writing). AMD, please spend more time and money on supporting the deep learning community, so researchers are not left at the mercy of NVIDIA.

If you follow the official AMD guide for ROCm installation, you will run into issues installing PyTorch, along the lines of "some binary cannot be found for the current device", which refers to the discontinued support. One such post is here. Here, gfx803 is the RX580.
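If you want to confirm what target your own card reports, rocminfo (installed under /opt/rocm as part of the setup below) prints the gfx name for each GPU agent; a quick check might look like this:

```bash
# After the ROCm install described below, confirm the card's gfx target.
# Polaris cards (RX 470/480/570/580) report gfx803.
/opt/rocm/bin/rocminfo | grep -i gfx
```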

What I did, for Ubuntu 20.04

DO NOT INSTALL ANY DRIVER. NOTHING ELSE IS NEEDED; RUN THESE AS SOON AS YOU BOOT.
So, do read the official guide, then run the steps below. BUT these are just to get your local system ready; you are NOT going to run things locally. Instead, we want to use Docker. (Rough sketches of the commands are collected after the step list.)

For your local system


Permission for groups.
Update the system
Include the latest ROCm toolkit
Install the ROCm toolkit normally. If you cannot install the rocm-dkms package due to this issue, it probably means you have a previous AMD driver installation. Find a way to run "amdgpu-pro-uninstall" or "amdgpu-uninstall" to remove the driver, or just do a fresh install. It takes no more than 15 minutes using a USB 3.0 stick onto an NVMe SSD.
Change group permissions, due to a ROCm requirement on Ubuntu 20.04.
Install a bunch of ROCm toolkits.
These commands let you inspect the condition of the GPUs. They should return no errors when you run them. If you see group permission issues, make sure you added yourself (or the user) to the video and render groups.
Install some other stuff. Not sure if it is useful though, welp.
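For reference, the local setup boiled down to commands roughly like the following. The repo line and package names follow AMD's ROCm apt instructions for Ubuntu 20.04 as of early 2021, so double-check the official guide for the current versions; the "bunch of ROCm toolkits" line is my guess at which meta-packages are meant:

```bash
# Update the system.
sudo apt update && sudo apt dist-upgrade -y
sudo apt install -y libnuma-dev

# Add AMD's ROCm apt repository (key + source list).
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ xenial main' | \
    sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update

# Install the ROCm toolkit / kernel driver. If this fails, uninstall any old
# amdgpu-pro driver first (amdgpu-pro-uninstall / amdgpu-uninstall).
sudo apt install -y rocm-dkms

# Group permissions required on Ubuntu 20.04.
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME

# A bunch of ROCm toolkits (guess: the usual meta-packages).
sudo apt install -y rocm-dev rocm-libs rocm-utils

sudo reboot
```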
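And the GPU-inspection commands mentioned above; after a reboot, these should list every card (the RX580 shows up as gfx803) without permission errors:

```bash
# Lists the CPU and each GPU agent; the RX580 shows up as gfx803.
/opt/rocm/bin/rocminfo

# OpenCL view of the same devices.
/opt/rocm/opencl/bin/clinfo

# Temperatures, clocks, fan speeds, and VRAM usage per card.
/opt/rocm/bin/rocm-smi
```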

For Docker

DO NOT try to work directly on your local system. Your life will be miserable. If you do figure it out though, let me know *(^_^)*.


Install Docker, because running these things in Docker is much easier. (Rough sketches of the Docker commands are collected after this list.)
Pull the Docker image. This actually took me a while to figure out. There is a ROCm guide in PyTorch, but at the time of writing it contains inconsistent information. That is, the Docker image it points to is for the latest ROCm, which won't support the RX580. Note that the following images are uploaded by rocmdev, not jeffdaily. A few of jeffdaily's uploads on the Docker registry do not have PyTorch compiled with the ROCm compile flag.
The guide will say to run some tests, but it seems like the tests actually fail. Not sure if this is bad, welp... Also, this version of torch does not support the --verbose flag.
torchvision is always needed.
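A sketch of the Docker steps, assuming Ubuntu's docker.io package is good enough; the image name and tag are placeholders for whichever ROCm 3.x PyTorch image you settle on (a ROCm 4.x image will not work on gfx803), and the device/group flags are the standard ones from the ROCm container instructions:

```bash
# Install Docker (the Ubuntu repo package is the simplest route).
sudo apt install -y docker.io
sudo usermod -a -G docker $LOGNAME   # log out and back in afterwards

# Placeholder image: substitute the ROCm 3.x PyTorch build mentioned above.
IMG="rocm/pytorch:SOME_ROCM3_TAG"

docker pull "$IMG"

# Pass the GPUs through to the container.
docker run -it \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined \
    --ipc=host \
    "$IMG"
```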
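For torchvision, one route that avoids pip dragging in a CUDA-built torch as a dependency is to build it from source inside the container, against the preinstalled ROCm torch; something along these lines:

```bash
# Inside the container: build torchvision against the ROCm-built torch.
git clone https://github.com/pytorch/vision.git
cd vision
python3 setup.py install
```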

Checking our Torch Installation

Launch Python. Then run the snippets collected after this list.
List the "CUDA" devices.
Very simple stuff.
Run a model. Oh, first move to some other directory, not /pytorch. Also notice that you are root on that Docker image, so do whatever you feel comfortable with. This runs for a dozen or so epochs, and the final accuracy should be "SOTA", i.e. around 99%.
Note this only runs on 1 GPU. See the info on how to change the visible devices. But that doesn't seem to work as intended, so I just changed to cuda:1 or cuda:0 to test my other GPUs.
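Listing the "CUDA" devices and the very simple stuff, in one Python session (the ROCm build exposes the AMD GPUs through the regular torch.cuda API):

```python
import torch

# The ROCm build reports the AMD GPUs through the CUDA API.
print(torch.cuda.is_available())        # should be True
print(torch.cuda.device_count())        # number of cards detected
print(torch.cuda.get_device_name(0))    # the RX580

# Very simple stuff: a matmul on the GPU.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum())
```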
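For "run a model", the description (a dozen or so epochs, ~99% accuracy) matches the MNIST example from the pytorch/examples repo, so that is what I assume here; clone it somewhere other than /pytorch:

```bash
# Grab the PyTorch examples repo outside /pytorch and run MNIST.
cd ~
git clone https://github.com/pytorch/examples.git
cd examples/mnist
python3 main.py    # a dozen or so epochs, ends around 99% test accuracy
```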
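And to hit the second card without relying on the visible-devices environment variable (which didn't seem to behave for me), just address the device index directly:

```python
import torch

# Pick the card by index instead of masking devices via env vars.
dev = torch.device("cuda:1")    # second GPU; use "cuda:0" for the first
x = torch.randn(1024, 1024, device=dev)
print(torch.cuda.get_device_name(dev), (x @ x).sum().item())
```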