A6: HPC

500 - 750 points

CMake target: a6

cmake --build .build --target a6

Have you always been bragging about how large your memory bandwidth is, without actually knowing how it compares (and fearing it is just average)? Well, don't you worry, because benchmarks have you covered!

The STREAM memory benchmark is a synthetic benchmark primarily used to measure memory bandwidth (in MB/s) using simple array operations (kernels). Benchmarks like it are widely used in the field of High-Performance Computing (HPC) to test the memory system of a computer - a crucial component for performance in scientific and engineering applications.

Considering A, B, and C are arrays of length N and q is a scalar, the four STREAM kernels are the following:

  • COPY: A[i] = B[i]

  • SCALE: A[i] = q × B[i]

  • ADD: A[i] = B[i] + C[i]

  • TRIAD: A[i] = B[i] + q × C[i]

for 0 ≤ i < N.
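For reference, the four kernels can be sketched in C (the language of the original STREAM implementation) - the assignment asks for the same loops in assembly. The element type here is int64_t to match the 64-bit values this assignment uses; the reference benchmark uses doubles. Function names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* The four STREAM kernels as plain loops.
   A, B, C are arrays of length n; q is a scalar. */
void stream_copy(int64_t *a, const int64_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i];
}
void stream_scale(int64_t *a, const int64_t *b, int64_t q, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = q * b[i];
}
void stream_add(int64_t *a, const int64_t *b, const int64_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i] + c[i];
}
void stream_triad(int64_t *a, const int64_t *b, const int64_t *c,
                  int64_t q, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i] + q * c[i];
}
```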

The arrays are filled with 64-bit integer values (but are not initialized to any particular value). Each kernel is run for NTIMES iterations, and the duration of each iteration is measured. For the results, the average, minimum, and maximum times are reported for each kernel. Lastly, the effective memory bandwidth (in MB/s) for the best run is computed and reported as well. The output of an example run of the benchmark is shown below:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13784.1     0.097001     0.087057     0.104715
Scale:          14221.0     0.094185     0.084382     0.104606
Add:            18931.8     0.099827     0.095078     0.106978
Triad:          18229.5     0.105672     0.098741     0.118283

Assignment

Implement the (single-core) STREAM benchmark, with all four array operations. Each operation should be run for NTIMES iterations and the execution time should be measured (for which you can use the clock_gettime library function).

Caution: the STREAM benchmark uses its own way of counting bytes. Have a look at the documentation so that your implementation follows the same approach.
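Concretely, the reference STREAM credits each kernel only with the bytes its arrays logically move - two arrays' worth per element for COPY and SCALE, three for ADD and TRIAD - and counts 1 MB as 10^6 bytes. A small sketch of that accounting (the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Bandwidth in MB/s following STREAM's accounting: COPY/SCALE touch
   two arrays per element, ADD/TRIAD three, and 1 MB = 10^6 bytes.
   Pass the minimum (best) time across the NTIMES runs. */
double stream_mbps(int arrays_touched, size_t n, double best_time_sec) {
    double bytes = (double)arrays_touched * (double)sizeof(int64_t) * (double)n;
    return 1.0e-6 * bytes / best_time_sec;
}
```

For example, stream_mbps(2, n, min_copy_time) for COPY and stream_mbps(3, n, min_triad_time) for TRIAD.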

When you have a working implementation, run the benchmark for different array sizes and note the results. Use your results to try to determine the (L1/L2) cache size(s) of your machine. Include your findings, calculations, and L1/L2 cache size(s) in a PDF document.

Submission Checklist

Before you enter the submission queue, make sure that you have completed all of the steps below. If any step is not (sufficiently) completed, your submission will be rejected during hand-in.


Useful Information

Floating Point Numbers

To calculate the results, you will need to work with floating-point (double-precision) numbers. There are dedicated instructions and registers for these (on x86-64, the scalar SSE instructions operating on the XMM registers) that you will need to use.

Printing

The a6-hpc.S file already contains some handy format strings that you may use for formatting your output. You can also use your own formats, but try to stay as close as possible to the format of the original STREAM benchmark. If used correctly, the given format strings should produce an output similar to the one shown below:

Array size = 75000000 (elements).
Each kernel will be executed 20 times.
---------------------------------------------------------------
Function  Best Rate MB/s     Avg time     Min time     Max time
Copy:            13784.1     0.097001     0.087057     0.104715
Scale:           14221.0     0.094185     0.084382     0.104606
Add:             18931.8     0.099827     0.095078     0.106978
Triad:           18229.5     0.105672     0.098741     0.118283

Automated testing

Due to the nature of the benchmark, automated testing of the functionality is not (easily) possible. CodeGrade mainly serves as a platform for a TA to quickly see the results of your benchmark - "passing" the automated tests does not imply that your solution is functioning correctly. Correctness will be evaluated by the TA handling your submission.


Bonuses

Bonus A - Vectorization (250 points)

Vectorize your implementation using Intel AVX, or a SIMD instruction set of your choice. Compare the results of the vectorized benchmark with your previous results. Does the new implementation perform better? Why?
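As a starting point, TRIAD with AVX intrinsics might look like the following C sketch. It uses GCC/Clang-specific features (the target attribute and __builtin_cpu_supports) and doubles, as in the reference STREAM - adapt the element type if your arrays hold 64-bit integers:

```c
#include <stddef.h>
#include <immintrin.h>

/* TRIAD with AVX: process 4 doubles per 256-bit register.
   The target attribute lets this compile without -mavx (GCC/Clang). */
__attribute__((target("avx")))
static void triad_avx(double *a, const double *b, const double *c,
                      double q, size_t n) {
    __m256d vq = _mm256_set1_pd(q);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(a + i, _mm256_add_pd(vb, _mm256_mul_pd(vq, vc)));
    }
    for (; i < n; i++)                      /* scalar remainder */
        a[i] = b[i] + q * c[i];
}

/* Dispatch at run time so the sketch also runs on CPUs without AVX. */
void triad(double *a, const double *b, const double *c, double q, size_t n) {
    if (__builtin_cpu_supports("avx"))
        triad_avx(a, b, c, q, n);
    else
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];
}
```

When interpreting your comparison, keep in mind that wider registers only help while the working set fits in cache; once the kernel is memory-bound, SIMD cannot move more bytes than the memory system supplies.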

Note for Apple Silicon users: your assignment runs through the Rosetta 2 compatibility layer (to execute the x86_64-compiled binary on your ARM processor). This compatibility layer does not support the AVX instruction set, so if you want to attempt this bonus, you will need to run your program on another (cloud) machine (or rewrite the assignment in ARM assembly and use the native SIMD instructions).


Bonus B - Cloud Benchmark (250 points)

Many cloud providers offer compute instances (in the form of VMs) at no cost in their free tier. But how do the different offers stack up in terms of memory bandwidth? What are the cache sizes of the processors used?

Use your own STREAM implementation to benchmark the memory of at least 3 different cloud providers (by running your program on their cloud machines and evaluating the output). Compare the results to your local measurements as well as between the different providers, and present your findings (along with some information about how you performed the experiment) to the TA handling your submission.

Some example providers/services are Amazon AWS, Microsoft Azure (e.g., through GitHub Codespaces), Google Cloud Platform, Oracle Cloud, ...
