Users/todavis/aocl host blas #3314

Draft
tony-davis wants to merge 31 commits into main from users/todavis/aocl-host-blas

Conversation

tony-davis commented Feb 9, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

tony-davis and others added 26 commits January 22, 2026 18:02
…ting

Adds AOCL-BLAS 5.2 (BLIS/BLAS/CBLAS) as an alternative to OpenBLAS for
rocBLAS client testing. AOCL-BLAS provides ILP64 support needed for
large-scale stress tests and serves as a complementary CPU BLAS provider.

- Add host-aocl-blas artifact (Linux-only, enabled with THEROCK_BUILD_TESTING)
- Build AOCL 5.2 BLAS component from source with ILP64 and multithreading
- Install to lib/host-math/ alongside OpenBLAS
- Enable LINK_BLIS in rocBLAS when testing is enabled
- Add therock_test_validate_static_lib() for static library validation

AOCL's build system ignores CMAKE_INSTALL_INCLUDEDIR for headers,
causing them to install to dist/include/ instead of the expected
dist/lib/host-math/include/ location. This commit adds a custom
CMake command to copy headers to the correct location after staging.

Also fixes dependency ordering by moving therock-aocl-blas from
RUNTIME_DEPS to BUILD_DEPS for rocBLAS, since it's a static library
that must be built before rocBLAS links against it.

Changes:
- Add custom command to copy AOCL headers after staging
- Set CMAKE_INSTALL_LIBDIR/BINDIR/INCLUDEDIR for AOCL build
- Move therock-aocl-blas to BUILD_DEPS for proper build ordering
- Update library validation path to match new location

During rocBLAS builds, pip installs Tensile from the source directory,
creating egg-info and build artifacts that can cause permission errors
on subsequent builds. Add automatic cleanup of these artifacts during
rocBLAS+expunge to prevent build failures.

Cleans:
- Tensile.egg-info/ - Package metadata
- build/ - In-tree build directory
- dist/ - Distribution directory
- .eggs/ - Egg cache directory
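
The cleanup described above can be sketched in Python. This is a minimal illustration, not the CMake logic the commit adds: the function name `clean_tensile_artifacts` and the idea of passing the Tensile source directory as an argument are assumptions made for the example. Only the four directory names come from the commit message.

```python
import shutil
from pathlib import Path

# Artifact directories named in the commit message.
TENSILE_ARTIFACTS = ("Tensile.egg-info", "build", "dist", ".eggs")


def clean_tensile_artifacts(tensile_source_dir: Path) -> list:
    """Remove pip-generated directories and return the names that were removed.

    Hypothetical helper: directories that do not exist are skipped, so the
    cleanup is safe to run before the first build as well.
    """
    removed = []
    for name in TENSILE_ARTIFACTS:
        target = tensile_source_dir / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

Skipping missing directories instead of failing keeps the step idempotent, which matters when it runs as part of every expunge.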
…ting

Adds AOCL-BLAS 5.2 (BLIS/BLAS/CBLAS) as an alternative to OpenBLAS for
rocBLAS client testing. AOCL-BLAS provides ILP64 support needed for
large-scale stress tests and serves as a complementary CPU BLAS provider.

- Add host-aocl-blas artifact (Linux-only, enabled with THEROCK_BUILD_TESTING)
- Build AOCL 5.2 BLAS component from source with ILP64 and multithreading
- Install to lib/host-math/ alongside OpenBLAS
- Enable LINK_BLIS in rocBLAS when testing is enabled
- Add therock_test_validate_static_lib() for static library validation

AOCL headers now install to lib/host-math/include/aocl/ subdirectory,
matching the OpenBLAS pattern (lib/host-math/include/openblas/).

Changes:
- Replace broken custom_command/custom_target with install(CODE) script
- Headers copied during install phase, not as separate build step
- Updated CMake package config to point to include/aocl/ subdirectory
- Preserves directory structure for Au/, Capi/, alci/ subdirectories

Verified:
- rocBLAS configure finds headers successfully
- rocblas-test and rocblas-bench link against AOCL
- Binaries show "Using reference library .../libaocl.a"

Pre-commit hook fixes:
- Remove trailing whitespace from aocl-config.cmake.in
- Apply black formatting to validate_static_library.py

Fixes all 8 issues raised by Copilot PR review:

1. validate_static_library.py: Filter empty strings when counting object files
2. validate_static_library.py: Fail validation if archive is empty (0 objects) or 0 MB
3. therock_testing.cmake: Skip static lib validation in sanitizer builds (matches shared lib)
4. aocl-config.cmake.in: Clarify OpenMP is needed for multithreaded AOCL-BLAS
5. aocl/CMakeLists.txt: Replace local path with public GitHub URL reference
6. BLAS/CMakeLists.txt: Add Tensile cleanup to clean target (not just expunge)
7. aocl/CMakeLists.txt: Clarify OpenMP comment - AOCL manages discovery internally
8. aocl/CMakeLists.txt: Add verification for install-time GLOB, fail if no headers found
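
Items 1 and 2 can be illustrated with a short Python sketch. It is not the contents of validate_static_library.py: the function names are hypothetical, and it assumes the archive member listing arrives as text with one member per line (as `ar t` prints it). Splitting an empty listing on newlines yields a single empty string, so an unfiltered count would report one object file for an empty archive. The zero-byte size check is an assumption standing in for the "0 MB" condition.

```python
def count_object_files(ar_listing: str) -> int:
    """Count archive members, ignoring empty strings left by the split."""
    return len([line for line in ar_listing.split("\n") if line.strip()])


def validate_archive(ar_listing: str, size_bytes: int) -> None:
    """Raise ValueError if the archive has no size or no object files."""
    if size_bytes <= 0:
        raise ValueError("static library has zero size")
    if count_object_files(ar_listing) == 0:
        raise ValueError("static library contains no object files")
```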

The +clean target isn't available at this point in CMakeLists.txt.
Only +expunge is explicitly created and can be used as a dependency.

Fixes CMake configuration error:
  Cannot add target-level dependencies to non-existent target "rocBLAS+clean"

- Add validation for critical AOCL headers (blis.h, cblas.h, blis64.h) in install script
- Remove unused @PACKAGE_INIT@ from aocl-config.cmake.in to match OpenBLAS pattern
- Note: rocBLAS+expunge-tensile already only depends on rocBLAS+expunge (not +clean)
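
The header validation in the first bullet lives in a CMake install script. The same check is sketched here in Python for illustration only. The helper name `missing_headers` and the assumption that the three headers sit directly in the include directory are not taken from the PR.

```python
from pathlib import Path

# Header names listed in the commit message.
REQUIRED_HEADERS = ("blis.h", "cblas.h", "blis64.h")


def missing_headers(include_dir: Path) -> list:
    """Return the required headers that are absent from include_dir."""
    return [
        name for name in REQUIRED_HEADERS
        if not (include_dir / name).is_file()
    ]
```

An install step built on this would fail when the returned list is non-empty and report the missing names.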

Create a symlink from libaocl.a to libcblas.a so that rocBLAS's
find_library(NAMES cblas ...) can discover TheRock's AOCL through
the hinted library paths. This allows the develop branch of
rocm-libraries to work without modification.

This is a temporary workaround until rocBLAS is updated to use
find_package(aocl) for CMake package discovery.

AOCL-BLAS is disabled on Windows (disable_platforms = ["windows"])
so we should not add it as a dependency for rocBLAS on Windows.

Windows builds will continue to use only OpenBLAS as the CPU
reference BLAS for rocBLAS clients/tests.

Instead of symlinks, copy AOCL library and headers to the location
that rocBLAS's develop branch searches:
  ${BUILD_DIR}/deps/aocl/install_package/

This allows rocBLAS to find TheRock's AOCL 5.2 without any changes
to rocm-libraries, matching the existing search logic in
rocblas/clients/CMakeLists.txt.

Benefits:
- No symlinks (better Windows compatibility)
- No changes needed to rocm-libraries develop branch
- Works with existing rocBLAS AOCL discovery logic

Removes the previous libcblas.a symlink approach which didn't
handle headers.

Move the copy logic from rocBLAS CMakeLists (configure-time) to
AOCL CMakeLists (install-time). This ensures the copy happens
after AOCL builds but before rocBLAS configures, making it work
in a single build pass.

Timeline now:
1. Top-level configure
2. Build phase:
   - AOCL builds and installs (copies to rocBLAS deps/ location)
   - rocBLAS configures (finds AOCL in deps/ location) ✓
   - rocBLAS builds

This matches rocBLAS's existing search logic in
rocblas/clients/CMakeLists.txt without requiring any changes
to rocm-libraries.

Automatically set THEROCK_ENABLE_HOST_MATH=ON when THEROCK_BUILD_TESTING=ON
to ensure host math libraries (OpenBLAS, AOCL-BLAS, SuiteSparse) are built
for rocBLAS clients and tests.

Without this, AOCL-BLAS was declared but never built on CI because its
feature group (HOST_MATH) was disabled by default, causing rocBLAS
configure to fail with "Could not find any BLAS library".

Fixes CI failure where rocBLAS clients couldn't find BLAS library.

Pass -DBUILD_DIR to rocBLAS CMake configuration, pointing to its build
directory. This custom variable is used by rocBLAS clients/CMakeLists.txt
to locate bundled dependencies in ${BUILD_DIR}/deps/.

Without this, rocBLAS couldn't find TheRock's AOCL at
${BUILD_DIR}/deps/aocl/install_package/ and would fall back to system
AOCL installations, even though the files were correctly copied there
during AOCL's install phase.

Now rocBLAS will find: build/math-libs/BLAS/rocBLAS/build/deps/aocl/...

AOCL's CMakeLists.txt defaults CMAKE_CONFIGURATION_TYPES to Debug if
not explicitly set. Even though we pass CMAKE_BUILD_TYPE=Release,
the Debug configuration type was causing BLIS to build without
optimizations, resulting in 100-450x slowdowns on triangular
operations (trsm, trmm) and test timeouts.

Add -DCMAKE_CONFIGURATION_TYPES=Release to explicitly force Release
mode for all AOCL components.

This fixes CI test timeouts where rocBLAS smoke tests failed to
complete in 15 minutes due to Debug-mode AOCL performance.

CI diagnostics revealed we cannot detect actual CPU allocation:
- multiprocessing.cpu_count() returns 256 (all system cores)
- SLURM variables not available
- cgroup limits not detectable via standard paths

Without this fix, rocBLAS sets 254 OpenMP threads on containers
with only ~48-64 allocated cores, causing 60-100x AOCL performance
degradation due to thread oversubscription.

Conservative value of 48 threads assumes typical CI allocation of
50-64 cores, leaving headroom for system threads as recommended
by AOCL team (use allocated_cores - 4).

This should reduce rocBLAS test time from ~8.9 min to ~2.5 min.
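
The commit itself hard-codes 48 threads. As a sketch of how that cap relates to the "allocated_cores - 4" guidance, the hypothetical helpers below apply the headroom rule and then the cap. The names `pick_omp_threads` and `build_test_env` and their default values are assumptions for illustration, not code from test_rocblas.py.

```python
import os


def pick_omp_threads(allocated_cores: int, cap: int = 48, headroom: int = 4) -> int:
    """Leave `headroom` cores for system threads and never exceed `cap`."""
    return max(1, min(cap, allocated_cores - headroom))


def build_test_env(allocated_cores: int) -> dict:
    """Copy the current environment and set OMP_NUM_THREADS for the test run."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(pick_omp_threads(allocated_cores))
    return env
```

With the reported 256 visible cores the cap applies and the result is 48. With a correctly detected 52-core allocation the headroom rule gives the same value, which is why a fixed 48 works as a conservative stand-in when the real allocation cannot be detected.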
…vention

Follow CMake best practices for package naming and target namespaces:
- Change package name from 'aocl' to 'AOCL' (matches BLAS/LAPACK pattern)
- Change target from 'AOCL::aocl' to 'AOCL::AOCL' (namespace matches package)
- Rename config file to 'AOCLConfig.cmake' (standard for uppercase packages)
- Update install path to 'cmake/AOCL'

This follows Kitware's recommendation that package names and target
namespaces should match exactly, which will be enforced in CMake 3.31+
via Common Package Specification (CPS).

References:
- https://www.kitware.com/psa-your-package-name-and-target-namespace-should-match/
- Standard CMake modules: BLAS::BLAS, LAPACK::LAPACK, ZLIB::ZLIB

This commit enables AOCL-BLAS to build on Windows alongside Linux,
supporting the AOCL team's deliverable improvements.

Key changes:
- BUILD_TOPOLOGY: Remove Windows platform restriction for AOCL
- therock_subproject: Preserve Windows SDK environment for nested CMake
- BLAS/CMakeLists: Enable AOCL-BLAS for rocBLAS testing on Windows
- aocl/CMakeLists: Major refactoring for cross-platform support
  - Use tony-davis fork with Windows/CMake fixes (temporary)
  - Add VS Clang-CL toolchain support for Windows
  - Cross-platform OpenMP configuration (MSVC vs GCC flags)
  - Disable AOCL_UTILS on Windows (MSVC incompatibility)
  - Modern CMake paths with GNUInstallDirs
  - Platform-specific library naming (aocl.lib vs libaocl.a)
  - Remove custom package config (AOCL now provides its own)

Result: AOCL-BLAS builds successfully on both Windows and Linux,
enabling rocBLAS testing with CPU reference BLAS on Windows.

Co-authored-by: Cursor <cursoragent@cursor.com>

[RIPE FOR OWN PR]

After amdsmi was moved from base/ to core/ (d3bb45a), it is only built
when THEROCK_ENABLE_CORE_AMDSMI=ON. math-libs/BLAS/CMakeLists.txt still
added the amdsmi target to optional deps on non-Windows unconditionally,
so configure failed with "non-existent target amdsmi" when building
rocBLAS/hipBLAS/hipBLASLt without CORE_AMDSMI.

rocBLAS and hipBLASLt treat amdsmi as optional (GPU monitoring); they
can build and run without it. Only add amdsmi to the optional deps when
THEROCK_ENABLE_CORE_AMDSMI is ON. rocm_smi_lib remains in the list when
not Windows (that target is always present).

Co-authored-by: Cursor <cursoragent@cursor.com>

tony-davis and others added 2 commits February 9, 2026 18:46
…LAS dep

- third-party/aocl: Use SOURCE_DIR for fetch layout; install to lib/host-math/
  with ENABLE_AOCL_UTILS=OFF; provide_package at lib/host-math/lib/cmake/AOCL.
  Remove legacy install(CODE) workaround; downstream uses find_package(AOCL CONFIG).
- therock_subproject: Resolve build deps from THEROCK_STAGE_DIR (not DIST_DIR)
  so subprojects find package configs from each dep's stage install tree (fixes
  AOCL and other stage-installed packages).
- math-libs/BLAS: Treat AOCL-BLAS as runtime-only optional dep for rocBLAS
  (mirror OpenBLAS); remove rocBLAS build dep on therock-aocl-blas.

Co-authored-by: Cursor <cursoragent@cursor.com>

tony-davis and others added 3 commits February 9, 2026 19:43
Reduce test_rocblas.py to minimal AOCL change (cap OMP_NUM_THREADS=48) and single log line; remove CPU allocation diagnostics block.


skip-checks: true
Co-authored-by: Cursor <cursoragent@cursor.com>

Keep Windows env preservation block only; resolve build deps from dist again to avoid global behavior change.


skip-checks: true
Co-authored-by: Cursor <cursoragent@cursor.com>

Drop rocBLAS+expunge-tensile; unrelated to AOCL integration.


skip-checks: true
Co-authored-by: Cursor <cursoragent@cursor.com>