ROCm for "Old" AMD GPU
Some Back Story
I recently recycled a few RX580s I bought during the height of the 2017-2018
coin mining craziness. Back then, RX580s sold for around C$600 and hashed
at around 30-ish MH/s. At that rate, we figured we'd break even with ETH
priced above $800. But maintaining many GPUs and keeping them running
stably enough is troublesome. We even set off fire alarms, and there were a
few instances where a PCIe riser or MOLEX power cable/connector just melted.
For the sake of personal safety, we stopped mining.
Time flies, and 2021 arrives. We sold the cards at a surprisingly high price
(~C$200) thanks to another shortage, driven by the trade war, COVID, and coin
boom round 2. We couldn't sell a few cards because they weren't in great condition.
To be fair, a few cards were never stable enough to mine, and those were
probably the culprits. They seemed okay enough to pass FurMark after
we reverted their BIOS to factory settings. One card still refuses
to run some newer titles full screen (GTA 5, Cities: Skylines, etc.).
Another only passes FurMark "most of the time". Point being: be very careful
buying mining cards, even if they DO boot and DO pass FurMark.
Since GCP (Google Cloud) is kind enough to run out of GPU resources often
enough, and it is just impossible to get a hand on an RTX 30xx card at a
non-ridiculous price at the moment, we figured it'd be a good idea to look at
AMD's ROCm offering. Deep learning is a lot of fun, but it is not fun if we
can't get GPUs. Hopefully this will do for now.
Issues
As of March 2021, ROCm support for PyTorch is pretty good. PyTorch has even
added a beta build for ROCm. But AMD magically decided to cut off ROCm support
below Vega. That means pre-Vega cards (the Polaris family: Ellesmere, Baffin, etc.)
definitely do not work past ROCm 4.0. I also remember seeing posts about support
being iffy as of ROCm 3.8+. BUT ROCm is also NOT supported on the latest GPUs
(Big Navi, at the time of writing). AMD, please spend more time and money
on supporting the deep learning community, so researchers are not left
at the mercy of NVIDIA.
If you follow the official AMD guide for ROCm installation,
you will see issues installing PyTorch, along the lines of "some binary cannot
be found for the current device", referring to the discontinued support. One such
post is here. Here,
gfx803 is the RX580.
What I did, for Ubuntu 20.04
DO NOT INSTALL ANY DRIVER. NOTHING ELSE IS NEEDED; RUN THESE AS SOON AS YOU BOOT.
So, do read the official guide, then run these commands. BUT these are just
to get your local system ready. You are NOT running things locally. Instead,
we want to use Docker.
For your local system
Permission for groups.
- echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf
- echo 'EXTRA_GROUPS=video' | sudo tee -a /etc/adduser.conf
- echo 'EXTRA_GROUPS=render' | sudo tee -a /etc/adduser.conf
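If you want to double-check what got appended (just my own sanity check, not part of the official guide), the end of the file should now contain the three lines above:
- tail -n 3 /etc/adduser.conf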
Update the system
- sudo apt update
- sudo apt dist-upgrade
- sudo apt install libnuma-dev
Add the latest ROCm apt repository.
- wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
- echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
Install the ROCm toolkit normally. If you cannot install the rocm-dkms package
due to this
issue, it probably means you have a previous installation of the AMD driver. Find
a way to run "amdgpu-pro-uninstall" or "amdgpu-uninstall" to remove the driver,
or just do a fresh install of Ubuntu. That takes no more than 15 minutes using
a USB 3.0 stick onto an NVMe SSD.
- sudo apt update
- sudo apt install rocm-dkms && sudo reboot
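After the reboot, this is how I'd confirm the DKMS module actually built and the kernel driver is loaded (my own checks, not from the official guide):
- dkms status
- lsmod | grep amdgpu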
Change group permissions, due to a ROCm requirement on Ubuntu 20.04.
- sudo usermod -a -G video $USER
- sudo usermod -a -G render $USER
Install clinfo so we can inspect the GPUs (the ROCm libraries themselves come a
few steps later), and make sure the render group is set.
- sudo apt-get install clinfo
- sudo usermod -a -G render $USER
These commands let you inspect the condition of your GPUs. They should return no
errors when you run them. If you see group permission issues, make sure you
added yourself (or the user) to the video and render groups.
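For reference, these are the inspection commands I mean. The full paths are an assumption based on the default /opt/rocm install location, and `groups` just confirms the group memberships took effect (you may need to log out and back in first):
- clinfo
- /opt/rocm/bin/rocminfo
- /opt/rocm/bin/rocm-smi
- groups $USER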
Install some other stuff. Not sure if they are useful though, welp.
- sudo apt-get install rocm-dev rocblas
- sudo apt install rocm-libs miopen-hip cxlactivitylogger rccl
For Docker
DO NOT try to work directly on your local system. Your life will be miserable.
If you do figure it out though, let me know *(^_^)*.
Install Docker, because running all of this in Docker is much easier.
- sudo apt install docker.io
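Optionally, a quick smoke test that the Docker daemon is up (not required, just how I'd verify):
- sudo docker run --rm hello-world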
Pull the Docker image. This actually took me a while to figure out. There
is a ROCm guide in PyTorch
, but at the time of writing it contains inconsistent information. That is,
the Docker image it points to is for the latest ROCm, which won't support the RX580.
Note that the following image is uploaded by rocmdev, not jeffdaily. A few of
jeffdaily's uploads on the Docker registry do not have PyTorch compiled with the
ROCm compile flags.
- docker pull rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
- sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
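Once inside the container, I'd quickly confirm the devices got passed through before bothering with PyTorch. These are my own checks; rocm-smi should be on the image's PATH, but if not it lives under /opt/rocm/bin:
- ls -l /dev/kfd /dev/dri
- rocm-smi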
The guide says to run some tests, but it seems like the tests actually fail.
Not sure if this is bad, welp... Also, this version of Torch does not
support the --verbose flag.
- PYTORCH_TEST_WITH_ROCM=1 python3.6 test/run_test.py
Torchvision is always needed, so install it inside the container.
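One way to get it (a sketch, not necessarily the exact steps used here): build it from source inside the container so it links against the ROCm build of torch instead of pulling a CUDA wheel from PyPI. Treat the branch/tag as an assumption and check out one that matches the torch version in the image:
- git clone https://github.com/pytorch/vision.git
- cd vision
- python3.6 setup.py install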
Checking our Torch Installation
Launch Python (python3.6 inside the container), then run the following.
List the "CUDA" devices.
- import torch
- torch.cuda.is_available()
- torch.cuda.current_device()
- torch.cuda.get_device_name(0)
Very simple stuff
- shape = torch.Size((300, 300))
- x = torch.cuda.FloatTensor(shape)
- torch.randn(shape, out=x)
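As an extra sanity check (my own, not from any guide), a small matmul on that same tensor forces an actual compute kernel to run on the card rather than just an allocation:
- y = torch.matmul(x, x)
- torch.cuda.synchronize()
- print(y.shape, y.device)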
Run a model. Oh, first cd to some other directory, not /pytorch. Also notice
that you are root on that Docker image, so do whatever you feel comfortable with.
This runs for a dozen or so epochs, and the final accuracy should be "SOTA", i.e. 99%.
Note this only runs on 1 GPU. See info
on how to change visible devices, but that doesn't seem to work as intended.
I just changed the device to cuda:1 or cuda:0 to test my other GPUs (see the
snippet after the commands below).
- wget https://raw.githubusercontent.com/pytorch/examples/master/mnist/main.py
- python main.py
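For the record, this is the kind of device selection I mean when I say I switched to cuda:0 / cuda:1. It is a minimal sketch; the MNIST script itself just asks for "cuda" (i.e. device 0), so to hit another card you would either edit the script or poke at the device in a separate Python session like this:
- import torch
- dev = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cuda:0')
- x = torch.randn(64, 64, device=dev)
- print(x.device)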