`omb`

The OpenMP benchmark suite.

This small utility program enables one to benchmark various OpenMP things.

Currently, it expands on the STREAM implementation. Allowing many more configurations, some changes, but also porting it to FORTRAN.

The FORTRAN implementation is pretty standard, but a little verbose due to using pre-processing via fypp.

Installation

Installing this small utility requires:

A FORTRAN compiler, currently auto-detecting GNU or Intel
- if support is needed for other compilers, please open an issue.
A functional OpenMP implementation (>=4.5).
- if support is needed for older OpenMP implementations, please open an issue.

That's it!

Building is straightforward, using standardized CMake variables. CMAKE_INSTALL_PREFIX is obeyed.

cmake -S. -Bobj
cmake --build $obj -v

which builds an executable omb + drivers for running omb in unique thread-placement combinations.

Please ensure that all optimizations are enabled to enable a full benchmark (compiling with -O1 will generally show worse performance than expected).

Controlling allocation behavior

There are certain CMake flags that allows one to compile omb in various formats (e.g. cmake ... -DOMB_INT_KIND=...).

OMB_INT_KIND Controls the integer kind for the loop and size counters. It defaults to the 64-byte integer to allow large arrays.
OMB_ALLOC_TYPE Fortran allows 3 different ways of allocating memory. In omb all allocations and array declarations are done in a single subroutine. Hence, the influence of the array declarations can impact performance.
- stack
```
real :: a(N)
```
  Be sure to set an unlimited stack size before running!
- allocatable
```
real, allocatable :: a(:)
allocate(a(N))
```
- pointer
```
real, pointer :: a(:) => null()
allocate(a(N))
```
OMB_TIMING By default, omb will use the OpenMP library timing routines. However, one can selectively use the fortran intrinsic (system_clock).

Please compile omb, then run omb -env and check the timing precision of both, they are typically the same precision. Select the one with the highest precision (smallest number).
- omp (default)
  
  Uses the OpenMP library omp_get_wtime
- systemclock
  
  Use the fortran intrinsic system_clock.

Running

This benchmark system has quite a bit of options. For the most up to date options, please refer to the help of the program: omb --help.

Here are the most commonly used options for omb:

Flag	Behaviour
`-help`	Show an extensive help text!
`-n <size>`	Specify the full size of allocated arrays.
	E.g. `-n 2MB` (`kB`, `MB`, `GB`, `TB` are allowed).
`-it <count>`	Take the minimum timing out of this many iterations.
`-dtype 32\|64\|128`	Use the data-type with this many bits per element.
`-kernel <name>`	Specify the OpenMP construct used in the benchmark. Please see `omb --help` for available kernels.
`-method <name>`	Specify the benchmark to run, see `omb --help` for available methods.

Besides the optional flags, the benchmark includes a large number of different methods. The default method to run the benchmark is the triad method.

Here dtypes/elem is the number of basic dtype byte sizes for each element. E.g. a -dtype 64 equals 8 bytes per element. Hence, dtypes/elem=3 results in 3*8=24 bytes per element operation (equivalent to STREAMS way of counting).

Method	Operation	dtypes/elem	FLOP/elem	MOP/elem
`triad`	`a = b + c*2`	3	2	3
`tetrad`	`a = b + c*d`	4	2	4
`pentad`	`a = bc + de`	5	3	5
`axpy`	`a = a + b*2`	2	2	3
`scale`	`a = b*2`	2	1	2
`add`	`a = b + c`	3	1	3
`fill`	`a = 2.`	1	0	1
`sum`	`res = sum(a)`	1	1	1
`read`	`dummy(a)`	1	0	1
`copy`	`a = b`	2	0	2

As an example lets invoke omb with

the triad method,
using the parallel do simd construct
20MB allocated memory
taking the best timing out of 10 iterations
use 2 threads, and the two cores, as specified by the runtime.

$> OMP_NUM_THREADS=2 OMP_PLACES=cores(2) omb -m triad -kernel do:simd -it 10 -n 20MB
 triad do:simd 1 8   1.99999924E+01   5.18938001E-04   6.45935400E-04   1.15988655E-08   8.63896001E-04  37.63694799E+00   3.36769710E+00

The columns are described in this small box, omb --help will also show this information.

Short name	Description
`METHOD`	name of the method running
`KERNEL`	which kernel used in `METHOD`
`FIRST_TOUCH`	0 for master thread first-touch, 1 for distributed first-touch
`ELEM_B`	number of bytes per element in the array [B]
`MEM_MB`	size of all allocated arrays, in [MB]
`TIME_MIN`	minimum runtime of iterations [s]
`TIME_AVG`	average runtime of iterations [s]
`TIME_STD`	Bessel corrected standard deviation of runtime [s]
`TIME_MAX`	maximum runtime of iterations [s]
`BANDWIDTH_GBS`	maximum bandwidth using `TIME_MIN` [GB/s]
`GFLOPS`	maximum FLOPS using `TIME_MIN` [G/s]

Kernels

OpenMP allows several ways to utilize parallelism.

Kernel	OpenMP construct
`serial`
`do`	`!$omp parallel do`
`do:simd`	`!$omp parallel do simd`
`do:simd+nontemporal`	`!$omp parallel do simd nontemporal`
`manual`	`!$omp parallel`
`manual:simd`	`!$omp parallel` `!$omp simd`
`manual:simd+nontemporal`	`!$omp parallel` `!$omp simd nontemporal`
`loop`	`!$omp parallel loop`
`workshare`	`!$omp parallel workshare`
`taskloop`	`!$omp parallel` `!$omp single` `!$omp taskloop`
`taskloop:simd`	`!$omp parallel` `!$omp single` `!$omp taskloop simd`
`teams:manual`	`!$omp teams`
`teams:distribute`	`!$omp teams` `!$omp distribute`
`teams:distribute:do`	`!$omp teams` `!$omp distribute parallel do`
`teams:distribute:manual`	`!$omp teams` `!$omp distribute parallel`
`teams:parallel:do`	`!$omp teams` `!$omp parallel do`
`teams:parallel:manual`	`!$omp teams` `!$omp parallel`
`teams:parallel:loop`	`!$omp teams` `!$omp parallel loop`

The teams construct was mainly introduced in OpenMP to perform distributed computations on GPU's due to its multi-level parallelism. For those teams constructs where there is no subsequent distribution on individual teams, there can happen incorrect timing results because those constructs does not have synchronization through OpenMP. I.e. barrier calls are not available in a teams construct, only in a teams ... parallel where all threads of a team is participating. We have it here to showcase how teams can be abused for CPU's as well. There will be a wide spread of performance depending on the compiler used.

Specialized methods

Some methods are not meant for showcasing performance of the system, but rather some bad usage of OpenMP constructs.

There are some false-sharing methods available which can highlight the performance hit by using false-sharing access patterns (bad cache usage).

Method	Operation
`fs:triad`	`a = b + c*2`
`fs:tetrad`	`a = b + c*d`

And there is a limited amount of kernels allowed because of the way it accesses the memory elements. One can compare the triad with the fs:triad using the do kernel. And then any performance hit will be due to the false-sharing access pattern.

Running the driver

Together with the omb executable there is a driver that enables easy test of a set of places. The driver executable is named omb-driver which is a simple bash script.

It is a shortcut driver for testing combinations of places of threads. It is best shown by an example:

$> OMP_NUM_THREADS=3 OMP_PLACES=0,{1,2},4,5,10 omb-driver
  0 1,2   4  triad do 1 8   3.07200000E+03   9.40218100E-02   9.44088364E-02   2.60900939E-07   9.56927400E-02  31.90749040E+00   2.85503391E+00
  0 1,2   5  triad do 1 8   3.07200000E+03   9.40824550E-02   9.59592584E-02   1.36111673E-05   1.06213985E-01  31.88692302E+00   2.85319357E+00
  0 1,2  10  triad do 1 8   3.07200000E+03   9.41941960E-02   9.46193575E-02   4.79454067E-08   9.49652460E-02  31.84909610E+00   2.84980888E+00
  0   4   5  triad do 1 8   3.07200000E+03   1.01350198E-01   1.06201720E-01   6.39561026E-06   1.08226976E-01  29.60033684E+00   2.64859331E+00
  0   4  10  triad do 1 8   3.07200000E+03   7.81628790E-02   7.99320207E-02   2.31608466E-06   8.19245620E-02  38.38139074E+00   3.43430871E+00
  0   5  10  triad do 1 8   3.07200000E+03   7.77361840E-02   8.02890019E-02   2.61882319E-06   8.24161840E-02  38.59206673E+00   3.45315968E+00
1,2   4   5  triad do 1 8   3.07200000E+03   1.06380118E-01   1.07770293E-01   2.96683275E-07   1.08236085E-01  28.20075834E+00   2.52336114E+00
1,2   4  10  triad do 1 8   3.07200000E+03   7.79563530E-02   8.13598654E-02   1.34486237E-05   8.89212030E-02  38.48307270E+00   3.44340706E+00
1,2   5  10  triad do 1 8   3.07200000E+03   7.77943770E-02   7.87795150E-02   7.42053940E-07   7.98704600E-02  38.56319847E+00   3.45057659E+00
  4   5  10  triad do 1 8   3.07200000E+03   1.08133645E-01   1.13019056E-01   3.07201868E-06   1.14062300E-01  27.74344655E+00   2.48244157E+00

will run several tests all with only 3 threads (note example output just after).
This will be equivalent to running all the upper triangular part of the product combination of placements:

export OMP_NUM_THREADS=3

# Example: prefix output will be
#  0 1,2 4 $(omb)
OMP_PLACES=0,{1,2},4 omb
OMP_PLACES=0,{1,2},5 omb
OMP_PLACES=0,{1,2},10 omb
OMP_PLACES=0,4,5 omb
OMP_PLACES=0,4,10 omb
OMP_PLACES=0,5,10 omb
OMP_PLACES={1,2},4,5 omb
OMP_PLACES={1,2},4,10 omb
OMP_PLACES={1,2},5,10 omb
OMP_PLACES=4,5,10 omb

Note that by default, the omb-driver prepends the placement of each thread by adding OMP_NUM_THREADS columns. These first OMP_NUM_THREADS describes each thread placement.

If one does not wish to prefix with the placement id's of each thread on can use the option omb-driver -Dwithout-place-info to omit the first OMP_NUM_THREADS placement columns.

By default, if OMP_PLACES is unset, omb-driver will set OMP_PLACES=cores.

# CPU 2 hw-threads x 4 cores
# The below 3 invocations are equivalent:
OMP_PLACES={0:2}:4:2 omb-driver
OMP_PLACES={0,1},{2,3},{4,5},{6,7} omb-driver
omb-driver

If one wishes to distinguish hardware-threads as thread domains, simply call it with omb-driver -Ddomains threads which will make all threads a place.

Implementation remarks

The omb benchmark program is an extension/rewrite of the STREAM program. However, it has a different goal than just showing the memory bandwidth limit of the system.

It, however, takes some different approaches:

omb encapsulates only the algorithm in the timing. STREAM does a timing around the omp parallel pragmas/constructs. Hence, STREAM also times spawning of threads, which may, or may not, be desired.

In particular, omb, enables cache discovery for small memory footprints. In these cases it is vital to not time thread-spawning.
omb has command-line arguments to change internal allocated array sizes and iteration parameters etc. (no recompiling needed)
omb has different kernels for each method implemented.
omb has more methods implemented. Note the STREAM comments about which methods are applicable to benchmark bandwidth problems. The same applies to omb.

License

It is released under the MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
.github		.github
check_compiler		check_compiler
cmake		cmake
driver		driver
src		src
utils		utils
.pre-commit-config.yaml		.pre-commit-config.yaml
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`omb`

Installation

Controlling allocation behavior

Running

Kernels

Specialized methods

Running the driver

Implementation remarks

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

omb

Installation

Controlling allocation behavior

Running

Kernels

Specialized methods

Running the driver

Implementation remarks

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`omb`

Packages