My little amdgpu bughunt
Where to begin. I'm a little surprised no one else seems to have run across this issue.
I'm playing on a current Archlinux with an Intel CPU and AMD RX560 graphics card,
mesa with no amdgpu-pro or other fancy stuff as drivers.

On newer nightly builds and RC19, my system would regularly freeze a few seconds into any mission, almost instantly with SMAA enabled.
FSO logs didn't show anything wrong, sound worked, and I could pause/unpause, but I couldn't even
switch to a tty to see what was wrong - my display was just completely frozen!
A hard reboot later I looked into journalctl and found a mass of red lines about a GPU fault 146 or 147,
pretty similar to these bug reports: https://bugs.freedesktop.org/show_bug.cgi?id=107152

Now by running builds one after another from working to not working,
I've found everything started with 3.8.1-20181107 and this commit:
https://github.com/scp-fs2open/fs2open.github.com/commit/bb6c00a5164b89d3d51275992b25ea5313669282#diff-3a82fccc233b03c40a01981f092bc2a0
Cloned current master branch, commented out the two lines, built.
Was able to play through BtA:Operation Templar Mission 2 with smaa enabled (fxaa disabled), which had previously run 10 seconds tops.
So I'm assuming this "fixed" it for me.
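
The build-by-build search above is effectively a bisection. With builds ordered oldest to newest, and assuming every build after the first broken one is also broken, a binary search finds the culprit in a logarithmic number of test runs. A minimal sketch in Python; all build names except 3.8.1-20181107 and the `freezes` callback are hypothetical stand-ins for launching a mission by hand and watching for the freeze.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the oldest build for which is_bad() is true.

    Assumes builds are sorted oldest to newest, the first one is good,
    the last one is bad, and breakage never disappears again in between.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]


# Hypothetical nightly names around the one identified in this thread.
NIGHTLIES = [
    "3.8.1-20181101",
    "3.8.1-20181103",
    "3.8.1-20181105",
    "3.8.1-20181107",
    "3.8.1-20181109",
    "3.8.1-20181111",
]


def freezes(build: str) -> bool:
    # Stand-in for "run a mission with this build and watch for the freeze".
    return build >= "3.8.1-20181107"


print(first_bad_build(NIGHTLIES, freezes))
```

On a source checkout, `git bisect` automates the same search at commit granularity.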

Now to my questions: The commit message is pretty cryptic ("Update gropengltnl.cpp"). What was this change meant to accomplish?
Has it caused problems anywhere else? What intended improvements am I foregoing? Is there another workaround known?

I mean, I'm happy everything seems to be working now, and I know AMDGPU isn't all that stable,
but I'd like to know WHY it works now, and understanding OpenGL is a little beyond a small-time hobbyist coder.

Thanks for any help!

 

Offline niffiwan

Re: My little amdgpu bughunt
good detective work!

The pull request that included that commit has more info; basically, it was removing errors when creating a shadow framebuffer. Seems that it was tested on Nvidia & Intel but not AMD!  :nervous: With the lines removed, do you see the other errors listed in the pull request? If not, then maybe some sort of conditional statement is needed for AMD vs Nvidia/Intel.

 
Re: My little amdgpu bughunt
You mean in the fs debug log, right?
Haven't created one since everything worked, but I'll do so soon.

 
Re: My little amdgpu bughunt
https://fsnebula.org/log/5dd10b02cb0d3322ec684742
Here it is. Error present on line 144.
I'm getting a bad feeling about this...
Trying out something.

EDIT:
OK, that's what I was afraid of.
RC 19 works if I disable shadows.

From what I was able to gather from the graphics card board, RX560 should have more than enough power
to draw shadows. Does anyone have experience with this card, maybe on windows or with another driver?
Might AMDGPU-PRO help here?

Anyway, the problem doesn't seem to lie with FSO, but my system.
Detective work on a case that doesn't exist, it appears.

next EDIT:
Official builds work with AMDGPU-PRO. GNOME on Wayland doesn't - Xwayland segfaults. So I'm using X11.
FSO on any X11 DE has an issue I experienced some time ago:
Everything is dark. From the start screen to the missions, everything looks like it's behind a black layer of about 50% transparency.
I can work around that with -fullscreen-window. Is this behaviour known? Is there a better fix?
Otherwise, since this setup seems to be working, and I'm getting generally good framerates, I'll be sticking with it.
« Last Edit: November 17, 2019, 04:08:02 am by themaddin »

 

Offline m!m

Re: My little amdgpu bughunt
Great work tracking this down! The changes were introduced in this pull request: https://github.com/scp-fs2open/fs2open.github.com/pull/1924/files They are correct as far as I can tell so I would guess that this is a Mesa bug. I have an RX480 and also use that with the Mesa drivers and had some GPU hang issues some time ago but not anymore. However, I remember disabling shadows at some point and never having issues after that so maybe this is the same bug.

Could you record an OpenGL trace of an instance of this bug with apitrace? Maybe that can be sent to the Mesa devs to resolve this.

 
Re: My little amdgpu bughunt
Did that. I've got a 2GB trace file sitting in my fs folder now. Not sure if it helps, though,
because this time, graphics didn't freeze for longer than half a second - everything worked normally after that!
Could apitrace be restarting whatever crashes in my card?

AMDGPU-PRO is giving me real bad framerates now and again, btw. So if this can't be solved easily,
I'll play without shadows on the open driver and wayland in the future.

Edit: Just had a freeze trying to replay the trace. About ready to give up on this and play without shadows.
« Last Edit: November 17, 2019, 03:01:04 pm by themaddin »

 
Re: My little amdgpu bughunt
Got to thinking about this again.
What was most strange about it is this:
There was no freeze while running in apitrace,
only a slight hang - as if apitrace "made" my GPU "go on normally"
 - I don't know how to express this technically - after the cause of the freeze.

I booted with amdgpu.gpu_recovery=1. No freeze this time, not even an appreciable
FPS drop. I still get lots of red lines in journalctl, though:
amdgpu 0000:01:00.0: GPU fault detected: 146 0x078a4414 for process fs2_open_19_0_0 pid 1477 thread fs2_open_1:cs0 pid 1497
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001070F1
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05044014
amdgpu 0000:01:00.0: VM fault (0x14, vmid 2, pasid 32772) at page 1077489, write from 'TC1' (0x54433100) (68)

(and dozens more like that)
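
With dozens of these lines, it can help to pull the fault code and offending process out mechanically, for instance when attaching a summary to a bug report. A minimal sketch in Python; the regular expression is modelled only on the line quoted above (other kernel versions may word the message differently), and `parse_faults` is a hypothetical helper name. The input could be saved kernel log text, e.g. from `journalctl -k`.

```python
import re
from collections import Counter
from typing import Dict, List

# Pattern modelled on the "GPU fault detected" line quoted above; other
# kernel versions may word the message differently.
FAULT_RE = re.compile(
    r"GPU fault detected: (?P<code>\d+) (?P<status>0x[0-9a-fA-F]+) "
    r"for process (?P<process>\S+) pid (?P<pid>\d+)"
)


def parse_faults(log_text: str) -> List[Dict[str, str]]:
    """Extract one dict per 'GPU fault detected' line in log_text."""
    return [m.groupdict() for m in FAULT_RE.finditer(log_text)]


SAMPLE = """\
amdgpu 0000:01:00.0: GPU fault detected: 146 0x078a4414 for process fs2_open_19_0_0 pid 1477 thread fs2_open_1:cs0 pid 1497
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001070F1
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05044014
"""

faults = parse_faults(SAMPLE)
print(faults)
# Count faults per process name to see which program triggers them.
print(Counter(f["process"] for f in faults))
```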

I'm fairly certain it's an AMDGPU bug, but as long as I can work around it ...

@m!m: Maybe you could verify this / test if shadows work with that parameter set?

EDIT: Ah hell. On some missions, this helps, on some it doesn't.
« Last Edit: December 01, 2019, 02:39:04 pm by themaddin »

 
Re: My little amdgpu bughunt
Sorry to engage in some forum necromancy, but I'm hitting this exact problem. The description in OP matches my experience almost exactly.

Given that this thread is a couple of years old at this point, what are my options? Have these problems been fixed, or is there some option I can tweak to keep from getting these crashes?

 
Re: My little amdgpu bughunt
I got a new GPU in the meantime, but the only thing that worked for me on Polaris cards was using AMDGPU-PRO and/or lowering shadow quality. I can give you hints on how to do it on Arch when I'm back at my machine. I've also opened an issue with Mesa, link to follow.

 
Re: My little amdgpu bughunt
Thanks! I'm on Arch as well, using an AMD Vega GPU, which is also based on Polaris.

amdgpu: hwmgr_sw_init smu backed is vegam_smu

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 XL [Radeon RX Vega M GL] (rev c0)

 
Re: My little amdgpu bughunt
Didn't know that chip; it seems to be an odd Vega/Polaris mix or just a rebrand. On a discrete Vega 64 and, afaik, on "true" integrated Vega (2400G), I no longer have the issue. But since yours is a Polaris by code path, well.

Anyway, here's what I did. Expect to trade performance for stability.
Build amdgpu-pro-libgl from the AUR but do not install it; installing it will mess up other applications, including, the last time I tried, Mutter. Instead I extracted the built package to /opt/fs2_open and launched FSO with the following environment set:

LD_LIBRARY_PATH="/opt/fs2_open/amdgpu-pro-libgl-21.10_1247438-1-x86_64.pkg/usr/lib/amdgpu-pro:${LD_LIBRARY_PATH}"
LIBGL_DRIVERS_PATH="/opt/fs2_open/amdgpu-pro-libgl-21.10_1247438-1-x86_64.pkg/usr/lib/dri/"
dri_driver="amdgpu"


I did this by prepending env... to the Exec= line of knossos.desktop; FSO processes inherit its environment. Adapt this to your launcher of choice. High shadow qualities in particular will lead to noticeably bad framerates.
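
Where editing a .desktop file is awkward, the same three variables can be set by a small wrapper instead. A minimal sketch in Python, assuming the package was extracted to the path given above; `fso_environment` and `launch` are hypothetical helper names, and the command passed to `launch` is whatever starts your launcher.

```python
import os
from typing import Dict, List, Mapping

# Directory the package was extracted to, as given above.
PKG_ROOT = "/opt/fs2_open/amdgpu-pro-libgl-21.10_1247438-1-x86_64.pkg"


def fso_environment(base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base with the three driver variables applied."""
    env = dict(base)
    # Prepend the PRO libraries, keeping whatever was already on the path
    # (mirrors the ${LD_LIBRARY_PATH} suffix in the lines above).
    env["LD_LIBRARY_PATH"] = (
        f"{PKG_ROOT}/usr/lib/amdgpu-pro:{base.get('LD_LIBRARY_PATH', '')}"
    )
    env["LIBGL_DRIVERS_PATH"] = f"{PKG_ROOT}/usr/lib/dri/"
    env["dri_driver"] = "amdgpu"
    return env


def launch(command: List[str]) -> None:
    """Replace this process with command, run under the modified environment."""
    os.execvpe(command[0], command, fso_environment(os.environ))


if __name__ == "__main__":
    # Print the values that would be handed to the launched program.
    preview = fso_environment(os.environ)
    for key in ("LD_LIBRARY_PATH", "LIBGL_DRIVERS_PATH", "dri_driver"):
        print(f"{key}={preview[key]}")
```

Calling `launch` with your launcher's command line replaces the wrapper process, so anything started from there inherits the variables just as it does with the env prefix.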

Here's the issue I opened with Mesa: https://gitlab.freedesktop.org/mesa/mesa/-/issues/4530
Maybe an additional bug report will draw some new attention to it. Best of luck!

  
Re: My little amdgpu bughunt
It's a weird chip for sure. It's an AMD GPU integrated into an Intel CPU.

vendor_id   : GenuineIntel
cpu family   : 6
model      : 158
model name   : Intel(R) Core(TM) i7-8705G CPU @ 3.10GHz

A bit of an archaeological curiosity at this point now that Intel have their own high-performance GPUs and don't have to buy them from a competitor anymore.

 
Re: My little amdgpu bughunt
Thanks to the insistence of noobspace, who picked up my old issue referenced above, this has been fixed in the Mesa 21.3.4 release of yesterday.

All shadow qualities should work on Polaris GPUs on any Linux distro using Mesa >= 21.3.4.
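
To check whether a given system already has the fix, the Mesa version can be compared against 21.3.4. A minimal sketch in Python, assuming the OpenGL version string contains "Mesa x.y.z" as it typically does in `glxinfo` output; the example strings and the helper names are illustrative assumptions, not taken from FSO.

```python
import re
from typing import Optional, Tuple

FIXED_IN = (21, 3, 4)  # first Mesa release with the fix, per the post above


def mesa_version(gl_version_string: str) -> Optional[Tuple[int, ...]]:
    """Pull the Mesa version out of an OpenGL version string, if present."""
    match = re.search(r"Mesa (\d+)\.(\d+)\.(\d+)", gl_version_string)
    if match is None:
        return None  # not Mesa (e.g. a proprietary driver) or unknown format
    return tuple(int(part) for part in match.groups())


def has_shadow_fix(gl_version_string: str) -> bool:
    """True if the string names a Mesa version at or above FIXED_IN."""
    version = mesa_version(gl_version_string)
    return version is not None and version >= FIXED_IN


# Example strings in the style of glxinfo output; exact wording varies
# between drivers and profiles.
print(has_shadow_fix("4.6 (Compatibility Profile) Mesa 21.3.4"))  # True
print(has_shadow_fix("4.6 (Compatibility Profile) Mesa 21.2.5"))  # False
```
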
« Last Edit: January 14, 2022, 06:14:07 am by themaddin »

 
Re: My little amdgpu bughunt
Teamwork makes the dream work  :D