Replace rocprofilerv1 with rocprofiler-sdk (v3) for rocmon #716

Merged: ipatix merged 32 commits into master from rocprofiler-sdk on Jan 28, 2026

ipatix commented Jan 12, 2026

This PR introduces support for device counting with rocprofiler-sdk. Device counting allows GPU counters to be read out of band. The counting itself is currently still performed within the target process, although that is not strictly necessary.

The first few tests on MI210 look good so far, although they still need verification against AMD's own rocprofilerv3. That, however, is a bit difficult, since I don't think the rocprofilerv3 frontend application actually supports device counting. I did some rudimentary scripting to sum up the per-kernel values and compare them against the device-counted ones. The differences were about 5%, which means we're at least in the same ballpark.
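The comparison described above could look roughly like the following. This is only a sketch of the idea: the function name `relative_difference` is hypothetical, and the numbers are made up for illustration, not taken from any measurement in this PR.

```python
def relative_difference(per_kernel_values, device_value):
    """Relative deviation of the summed per-kernel values from the device-wide value."""
    total = sum(per_kernel_values)
    return abs(total - device_value) / device_value


# Illustrative placeholder numbers only (not measurements from this PR).
per_kernel = [100, 250, 600]
device_counted = 1000
print(f"{relative_difference(per_kernel, device_counted):.1%}")
```

A deviation in the single-digit percent range is what the text above calls "the same ballpark".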

MI210 (gfx90a) example:

./likwid-perfctr --rocmgroup MEM ~/Projects/BabelStream/build/hip-stream -s $((1024 * 1024 * 256))
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 7543 32-Core Processor                
CPU type:       AMD K19 (Zen3) architecture
CPU clock:      2.79 GHz
--------------------------------------------------------------------------------
W20260112 16:49:21.788848 16184536227712 ioctl.cpp:68] Device 19359 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
BabelStream
Version: 5.0
Implementation: HIP
Running  Classic kernels 100 times
Number of elements: 268435456
Precision: double
Array size: 2147.5 MB
Total size: 6442.5 MB
Using HIP device AMD Instinct MI210
Driver: 70152802
Memory: DEFAULT
Init: 0.131138 s (=49127.179665 MB/s)
Read: 0.322360 s (=19985.295624 MB/s)
Function    MB/s        Min (sec)   Max         Average     
Copy        1430086.008 0.00300     0.00304     0.00301     
Mul         1397712.776 0.00307     0.00317     0.00312     
Add         1284599.650 0.00502     0.00524     0.00515     
Triad       1246393.739 0.00517     0.01277     0.00528     
Dot         1307159.299 0.00329     0.00347     0.00343     
--------------------------------------------------------------------------------
+-------------+---------+
| Region Info |  main-0 |
+-------------+---------+
| Runtime [s] | 7.10727 |
|  call count |       1 |
+-------------+---------+

+----------------------+---------+----------------+
|         Event        | Counter |      GPU 0     |
+----------------------+---------+----------------+
|    ROCP_TA_TA_BUSY   |  ROCM0  | 238884935114.0 |
| ROCP_GRBM_GUI_ACTIVE |  ROCM1  |   3470060345.0 |
|      ROCP_SE_NUM     |  ROCM2  |            8.0 |
+----------------------+---------+----------------+

+------------------------+-----------------+
|         Metric         |      GPU 0      |
+------------------------+-----------------+
| GPU memory utilization | 860.52154488541 |
+------------------------+-----------------+

While the "utilization" in percent doesn't make much sense (being >100%), the event values appear to match what rocprofilerv3 measured. So perhaps LIKWID's MEM metric formula is wrong?
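To see where the value above 100% comes from, the metric can be recomputed by hand. The ROCMON debug message in the RX 6900 XT run further down prints the formula as `100*max(ROCM0,16)/ROCM1/ROCM2`; the sketch below assumes that the same formula is used for the MI210 run and plugs in the event values from the table above. The function name `mem_utilization` is hypothetical.

```python
def mem_utilization(ta_busy, gui_active, se_num):
    # Formula as printed in the ROCMON debug output:
    # 100*max(ROCM0,16)/ROCM1/ROCM2
    return 100 * max(ta_busy, 16) / gui_active / se_num


# Event values from the MI210 table above (ROCM0, ROCM1, ROCM2).
mi210 = mem_utilization(238884935114.0, 3470060345.0, 8.0)
print(mi210)
```

The result agrees with the reported 860.52, so the metric follows directly from the event values. With this formula the result exceeds 100 whenever ROCM0 is larger than ROCM1 * ROCM2, which is the case here.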

4x MI300A (gfx942) example:

./likwid-perfctr --rocmgroup MEM ~/Projects/BabelStream/build/hip-stream -s $((1024 * 1024 * 256))
--------------------------------------------------------------------------------
CPU name:       AMD Instinct MI300A Accelerator                
CPU type:       nil
CPU clock:      3.69 GHz
--------------------------------------------------------------------------------
W20260112 16:56:23.465165 17098584899456 ioctl.cpp:68] Device 5998 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260112 16:56:23.537232 17098584899456 ioctl.cpp:68] Device 5999 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260112 16:56:23.547630 17098584899456 ioctl.cpp:68] Device 6000 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260112 16:56:23.557986 17098584899456 ioctl.cpp:68] Device 6001 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
BabelStream
Version: 5.0
Implementation: HIP
Running  Classic kernels 100 times
Number of elements: 268435456
Precision: double
Array size: 2147.5 MB
Total size: 6442.5 MB
Using HIP device AMD Instinct MI300A
Driver: 70152802
Memory: DEFAULT
Init: 0.535835 s (=12023.194714 MB/s)
Read: 0.622011 s (=10357.462530 MB/s)
Function    MB/s        Min (sec)   Max         Average     
Copy        3727781.598 0.00115     0.00179     0.00145     
Mul         3509651.243 0.00122     0.00148     0.00138     
Add         3650328.939 0.00176     0.00199     0.00185     
Triad       3783149.979 0.00170     0.05108     0.00237     
Dot         3068899.129 0.00140     0.00213     0.00200     
--------------------------------------------------------------------------------
+-------------+----------+
| Region Info |  main-0  |
+-------------+----------+
| Runtime [s] | 7.234938 |
|  call count |        1 |
+-------------+----------+

+----------------------+---------+----------------+-------------+-------------+-------------+
|         Event        | Counter |      GPU 0     |    GPU 1    |    GPU 2    |    GPU 3    |
+----------------------+---------+----------------+-------------+-------------+-------------+
|    ROCP_TA_TA_BUSY   |  ROCM0  | 120435247544.0 |         0.0 |         0.0 |         0.0 |
| ROCP_GRBM_GUI_ACTIVE |  ROCM1  |   5302032889.0 | 212066636.0 | 328843541.0 | 359586414.0 |
|      ROCP_SE_NUM     |  ROCM2  |           96.0 |        96.0 |        96.0 |        96.0 |
+----------------------+---------+----------------+-------------+-------------+-------------+

+---------------------------+--------------+-----------+--------------+--------------+
|           Event           |      Sum     |    Min    |      Max     |      Avg     |
+---------------------------+--------------+-----------+--------------+--------------+
|    ROCP_TA_TA_BUSY STAT   | 120435247544 |         0 | 120435247544 | 2.408705e+10 |
| ROCP_GRBM_GUI_ACTIVE STAT |   6202529480 | 212066636 |   5302032889 |   1240505896 |
|      ROCP_SE_NUM STAT     |          384 |        96 |           96 |      76.8000 |
+---------------------------+--------------+-----------+--------------+--------------+

+------------------------+-----------------+--------------------+--------------------+--------------------+
|         Metric         |      GPU 0      |        GPU 1       |        GPU 2       |        GPU 3       |
+------------------------+-----------------+--------------------+--------------------+--------------------+
| GPU memory utilization | 23.661373945569 | 7.859164921476e-08 | 5.068266390753e-08 | 4.634954497104e-08 |
+------------------------+-----------------+--------------------+--------------------+--------------------+

+-----------------------------+---------+--------------+---------+--------+
|            Metric           |   Sum   |      Min     |   Max   |   Avg  |
+-----------------------------+---------+--------------+---------+--------+
| GPU memory utilization STAT | 23.6614 | 4.634954e-08 | 23.6614 | 5.9153 |
+-----------------------------+---------+--------------+---------+--------+
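As a sanity check on the STAT tables above, the statistics can be recomputed from the per-GPU event values. This is a sketch and not part of the PR; the helper name `stats` is hypothetical. Sum, Min and Max reproduce the table. The plain arithmetic mean over the four GPUs does not: for ROCP_SE_NUM it is 96, whereas the table lists 76.8, which equals 384/5. The reason for that is not stated in this PR.

```python
def stats(values):
    """Sum, minimum, maximum and arithmetic mean over the per-GPU values."""
    total = sum(values)
    return {
        "sum": total,
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
    }


# Per-GPU event values from the MI300A table above.
gui_active = [5302032889.0, 212066636.0, 328843541.0, 359586414.0]
se_num = [96.0, 96.0, 96.0, 96.0]

print(stats(gui_active))
print(stats(se_num))
```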

The RX 6900 XT (gfx1030) appears to read zeroes only. But I vaguely remember that GPU didn't function properly with AMD tools either. Still needs testing:

./likwid-perfctr --rocmgroup MEM ~/Projects/BabelStream/build/hip-stream -s $((1024 * 1024 * 256))
--------------------------------------------------------------------------------                                                                                                                                                               
CPU name:       Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz                                                                                                                                                                                      
CPU type:       Intel Xeon Broadwell EN/EP/EX processor                                                                                                                                                                                        
CPU clock:      2.10 GHz                                                                                                                                                                                                                       
--------------------------------------------------------------------------------                                                                                                                                                               
W20260112 16:25:24.519724 19995440996224 ioctl.cpp:68] Device 18026 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.      
BabelStream                                                                                                                                                                                                                                    
Version: 5.0                                                                                                                                                                                                                                   
Implementation: HIP                                        
Running  Classic kernels 100 times                         
Number of elements: 268435456                              
Precision: double                                          
Array size: 2147.5 MB                                      
Total size: 6442.5 MB                                      
Using HIP device AMD Radeon RX 6900 XT                     
Driver: 70152802                                           
Memory: DEFAULT                                            
Init: 0.188190 s (=34233.733216 MB/s)                      
Read: 0.729806 s (=8827.626619 MB/s)                       
Function    MB/s        Min (sec)   Max         Average                                                                
Copy        464843.424  0.00924     0.01131     0.00927                                                                
Mul         457383.639  0.00939     0.00945     0.00940                                                                
Add         428179.706  0.01505     0.01511     0.01507                                                                
Triad       428623.538  0.01503     0.01506     0.01505                                                                
Dot         488885.181  0.00879     0.01232     0.00883                                                                
--------------------------------------------------------------------------------                                       
ROCMON DEBUG - [rocmon_markerGetRegionCounters:823] Cannot calculate formula: 100*max(ROCM0,16)/ROCM1/ROCM2                                                                                                                                    
+-------------+-----------+                                
| Region Info |   main-0  |                                
+-------------+-----------+                                
| Runtime [s] | 14.069782 |                                
|  call count |         1 |                                
+-------------+-----------+                                

+----------------------+---------+-------+                 
|         Event        | Counter | GPU 0 |                 
+----------------------+---------+-------+                 
|    ROCP_TA_TA_BUSY   |  ROCM0  |   0.0 |                 
| ROCP_GRBM_GUI_ACTIVE |  ROCM1  |   0.0 |                 
|      ROCP_SE_NUM     |  ROCM2  |   4.0 |                 
+----------------------+---------+-------+                 

+------------------------+-------+                         
|         Metric         | GPU 0 |                         
+------------------------+-------+                         
| GPU memory utilization |   -   |                         
+------------------------+-------+
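One plausible reading of the "Cannot calculate formula" message is a division by zero, since ROCP_GRBM_GUI_ACTIVE (ROCM1) is 0 in this run. That is an assumption; the PR does not state the cause. The sketch below shows the formula from the debug message with such a guard; the function name `mem_utilization_or_none` is hypothetical and not LIKWID's actual code.

```python
def mem_utilization_or_none(ta_busy, gui_active, se_num):
    # 100*max(ROCM0,16)/ROCM1/ROCM2 is undefined if either divisor is zero.
    if gui_active == 0 or se_num == 0:
        return None
    return 100 * max(ta_busy, 16) / gui_active / se_num


# Event values from the RX 6900 XT table above (ROCM0, ROCM1, ROCM2).
print(mem_utilization_or_none(0.0, 0.0, 4.0))
```

Returning `None` here corresponds to the `-` shown in the metric table.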

MI300X unfortunately wasn't available for testing yet.

ipatix added 30 commits January 7, 2026 16:54
The new rocmon will use rocprofiler-sdk.

Apparently for some very weird reasons, rocprofiler_start_context will
fail on the first call with ROCPROFILER_STATUS_ERROR_HSA_NOT_LOADED.
However, it will return ROCPROFILER_STATUS_SUCCESS when simply calling
it a second time. Let's see if that's not causing issues.
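The "call it a second time" workaround from this commit message can be sketched as a small retry wrapper. This is an illustration only, not the code from this PR: `start_with_retry` and `make_flaky_start` are hypothetical names, and the status values are plain strings standing in for the rocprofiler-sdk status codes.

```python
def start_with_retry(start, retry_on="HSA_NOT_LOADED", attempts=2):
    """Call start() and repeat it only while it returns the retryable status."""
    status = start()
    for _ in range(attempts - 1):
        if status != retry_on:
            break
        status = start()
    return status


def make_flaky_start():
    """Hypothetical stand-in that fails on the first call and succeeds afterwards."""
    calls = {"n": 0}

    def start():
        calls["n"] += 1
        return "HSA_NOT_LOADED" if calls["n"] == 1 else "SUCCESS"

    return start


print(start_with_retry(make_flaky_start()))
```

Retrying only on the one known status keeps other errors visible instead of masking them behind a second call.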

Turns out this is completely unnecessary. We can just use
PTHREAD_MUTEX_INITIALIZER instead.

Using HIP before that makes rocprofiler-sdk not initialize properly and
it will complain that HSA is not initialized.

Unfortunately this requires the code to become more ugly, since we have
to filter out the "unwanted" (aka HIP_VISIBLE_DEVICES) GPUs at a later time.
Accordingly we keep track of all devices, but mark them whether they are
used or not, and which HIP ID they have (if applicable).
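The bookkeeping described in this commit message could be sketched as follows. This is a simplified model, not the data structure from the PR: the names `Device` and `build_device_table` are hypothetical, and the sketch assumes that HIP_VISIBLE_DEVICES is a well-formed comma-separated list of integer device indices whose order determines the HIP IDs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    index: int                    # position in the full device enumeration
    used: bool = False            # True if the device is visible to HIP
    hip_id: Optional[int] = None  # HIP device ID, if applicable


def build_device_table(num_devices, visible=None):
    """Track all devices; mark the visible ones and record their HIP IDs."""
    devices = [Device(index=i) for i in range(num_devices)]
    if visible is None:
        # Variable not set: every device is visible in enumeration order.
        order = list(range(num_devices))
    else:
        order = [int(tok) for tok in visible.split(",") if tok.strip()]
    for hip_id, index in enumerate(order):
        devices[index].used = True
        devices[index].hip_id = hip_id
    return devices


# Example: four devices, HIP_VISIBLE_DEVICES="2,0"
for dev in build_device_table(4, "2,0"):
    print(dev)
```

Keeping the unused devices in the table means the full enumeration stays intact, and the filtering can happen later, as the commit message describes.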

We use the rocmon marker API, because it requires less code and it's no
longer necessary to have duplicate logic for writing and reading result
files.

This never deleted a "file". Instead it used to destroy in memory data
from the parsed marker file, which is now automatically destroyed via
Lua's garbage collector.

This may very well be an expected case and shouldn't terminate
execution.

ipatix commented Jan 13, 2026

Currently still fails on MI300X (gfx942):

rocprofiler_configure_device_counting_service failed: 'Function invoked with one or more invalid arguments'

ipatix commented Jan 13, 2026

That's looking better now. Not sure how the MI300A worked before ("how could multi GPU work at all before?"):

MI300X (gfx942):

./likwid-perfctr --rocmgroup MEM ~/Projects/BabelStream/build/hip-stream -s $((1024 * 1024 * 256))
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 9474F 48-Core Processor               
CPU type:       AMD K19 (Zen4) architecture
CPU clock:      3.59 GHz
--------------------------------------------------------------------------------
W20260113 17:42:31.705578 21315449151360 ioctl.cpp:68] Device 10700 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.799656 21315449151360 ioctl.cpp:68] Device 10697 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.815874 21315449151360 ioctl.cpp:68] Device 10696 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.831372 21315449151360 ioctl.cpp:68] Device 10699 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.846944 21315449151360 ioctl.cpp:68] Device 10702 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.862567 21315449151360 ioctl.cpp:68] Device 10703 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.877939 21315449151360 ioctl.cpp:68] Device 10701 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
W20260113 17:42:31.893371 21315449151360 ioctl.cpp:68] Device 10698 could not be locked for profiling due to lack of permissions (capability SYS_PERFMON). PMC Counters may be inaccurate and System Counter Collection will be degraded.
BabelStream
Version: 5.0
Implementation: HIP
Running  Classic kernels 100 times
Number of elements: 268435456
Precision: double
Array size: 2147.5 MB
Total size: 6442.5 MB
Using HIP device AMD Instinct MI300X
Driver: 70152802
Memory: DEFAULT
Init: 0.293001 s (=21987.797804 MB/s)
Read: 0.302377 s (=21306.039876 MB/s)
Function    MB/s        Min (sec)   Max         Average     
Copy        4077803.852 0.00105     0.00121     0.00111     
Mul         4106499.501 0.00105     0.00113     0.00108     
Add         3916264.517 0.00165     0.01246     0.00177     
Triad       3972836.472 0.00162     0.00183     0.00165     
Dot         3510308.122 0.00122     0.01206     0.00159     
--------------------------------------------------------------------------------
+-------------+----------+
| Region Info |  main-0  |
+-------------+----------+
| Runtime [s] | 6.897542 |
|  call count |        1 |
+-------------+----------+

+----------------------+---------+----------------+-------------+--------------+-------------+-------------+--------------+-------------+--------------+
|         Event        | Counter |      GPU 0     |    GPU 1    |     GPU 2    |    GPU 3    |    GPU 4    |     GPU 5    |    GPU 6    |     GPU 7    |
+----------------------+---------+----------------+-------------+--------------+-------------+-------------+--------------+-------------+--------------+
|    ROCP_TA_TA_BUSY   |  ROCM0  | 468017431161.0 |         0.0 |          0.0 |         0.0 |         0.0 |          0.0 |         0.0 |          0.0 |
| ROCP_GRBM_GUI_ACTIVE |  ROCM1  |  30459941723.0 | 529817184.0 | 1416308231.0 | 865736504.0 | 358845989.0 | 1263198121.0 | 937608016.0 | 1103709785.0 |
|      ROCP_SE_NUM     |  ROCM2  |          256.0 |       256.0 |        256.0 |       256.0 |       256.0 |        256.0 |       256.0 |        256.0 |
+----------------------+---------+----------------+-------------+--------------+-------------+-------------+--------------+-------------+--------------+

+---------------------------+--------------+-----------+--------------+--------------+
|           Event           |      Sum     |    Min    |      Max     |      Avg     |
+---------------------------+--------------+-----------+--------------+--------------+
|    ROCP_TA_TA_BUSY STAT   | 468017431161 |         0 | 468017431161 | 5.200194e+10 |
| ROCP_GRBM_GUI_ACTIVE STAT |  36935165553 | 358845989 |  30459941723 | 4.103907e+09 |
|      ROCP_SE_NUM STAT     |         2048 |       256 |          256 |     227.5556 |
+---------------------------+--------------+-----------+--------------+--------------+

+------------------------+----------------+--------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+
|         Metric         |      GPU 0     |        GPU 1       |       GPU 2       |       GPU 3       |        GPU 4       |       GPU 5       |       GPU 6       |       GPU 7       |
+------------------------+----------------+--------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+
| GPU memory utilization | 6.001958595647 | 1.179652187348e-08 | 4.41288122402e-09 | 7.21928666647e-09 | 1.741694262047e-08 | 4.94775910136e-09 | 6.66589864138e-09 | 5.66272047683e-09 |
+------------------------+----------------+--------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+

+-----------------------------+--------+--------------+--------+--------+
|            Metric           |   Sum  |      Min     |   Max  |   Avg  |
+-----------------------------+--------+--------------+--------+--------+
| GPU memory utilization STAT | 6.0020 | 4.412881e-09 | 6.0020 | 0.7502 |
+-----------------------------+--------+--------------+--------+--------+

ipatix commented Jan 28, 2026

While the groups probably still need fixing, I'll merge this for now. It'll probably be best to open separate issues for that.

ipatix merged commit 15c91f0 into master on Jan 28, 2026
3 checks passed