A8: HPC
500 - 750 points
Last updated
Have you always been bragging about how large your memory bandwidth is without actually knowing how it compares (and secretly fearing it is just average)? Well, don't you worry, because benchmarks have you covered!
The STREAM memory benchmark is a simple, synthetic benchmark tool primarily used to measure memory bandwidth (in MB/s) using simple array operations (kernels). Benchmarks like it are widely used in the field of High-Performance Computing (HPC) to test the memory system of a computer - a crucial component for performance in scientific and engineering applications.
Considering that a, b, and c are arrays of length N and that q is a scalar, the four STREAM kernels are the following:

COPY: c(i) = a(i)
SCALE: b(i) = q * c(i)
ADD: c(i) = a(i) + b(i)
TRIAD: a(i) = b(i) + q * c(i)

for i = 0, ..., N-1.
The arrays hold 64-bit integer values (but are not initialized to any particular value). Each kernel is run for NTIMES iterations, and the duration of each iteration is measured. For the results, the average, minimum, and maximum times are reported for each kernel. Lastly, the effective memory bandwidth (in MB/s) of the best run is computed and reported as well. The output of an example run of the benchmark is shown below:
Implement the (single-core) STREAM benchmark, with all four array operations. Each operation should be run for NTIMES iterations and the execution time should be measured (for which you can use the clock_gettime library function).
Caution: the STREAM benchmark uses its own way of counting bytes. Have a look at the documentation so that your implementation follows the same approach.
Make sure you use the STREAM_ARRAY_SIZE and NTIMES constants in your implementation instead of hard-coding the values.
When you have a working implementation, run the benchmark for different array sizes and note the results. Use your results to try to determine the (L1/L2) cache size(s) of your machine. You will need to report those findings to the TA handling your submission - make sure to come prepared with the results and your interpretations!
Due to the nature of the benchmark, automated testing of the functionality is not (easily) possible. CodeGrade mainly serves as a platform for a TA to quickly see the results of your benchmark - "passing" the automated tests does not imply that your solution is functioning correctly. Correctness will be evaluated by the TA handling your submission.
Before you enter the submission queue, make sure that you have done all of the below-mentioned steps. If any of the steps are not (sufficiently) completed, your submission will be rejected during hand-in.
Implemented all four STREAM kernels, including calculation (using the same methods as the original STREAM implementation) and printing of the results.
Ran the benchmark (your own version) with different array sizes to determine the (L1/L2) cache size(s) of your machine.
Uploaded the code to CodeGrade.
Prepared an explanation of the implementation and a discussion of the results, to be presented to a TA.
To calculate the results, you will need to work with floating point (double-precision) numbers. There are special instructions and registers for those that you will need to use.
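On x86-64 these are the SSE2 scalar-double instructions (movsd, addsd, mulsd, divsd, and cvtsi2sd for integer-to-double conversion) operating on the xmm registers. A rough sketch in AT&T syntax (the symbols bytes, scale, and best_time are hypothetical data labels, not part of the provided file):

```asm
        # xmm0 = 1.0e-6 * bytes / best_time   (all double-precision)
        movsd   bytes(%rip), %xmm0       # load bytes into xmm0
        mulsd   scale(%rip), %xmm0       # multiply by 1.0e-6
        divsd   best_time(%rip), %xmm0   # divide by the best run time
        # when calling printf with a double in xmm0, %al must hold the
        # number of vector registers used by the variadic call
        mov     $1, %al
```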
The a8-hpc.S file already contains some handy format strings that you may use for formatting your output. You can also use your own formats, but try to stay as close as possible to the format of the original STREAM benchmark. If used correctly, the given format strings should produce an output similar to the one shown below:
You can modify the array sizes and/or number of iterations (without changing your code) by giving the desired values as part of the make command:
with <n> and <i> replaced with the desired values.
If you want to experiment with different values, make sure to always include the -B flag. This flag tells the build system to rebuild the targets even if their sources haven't changed.
You can use the same options for running/testing the original STREAM implementation:
Note: the official STREAM benchmark seems to fail for array sizes larger than ~88M. Your solution should not have that limitation.
You can receive points for either bonus A or bonus B, but not both!
Vectorize the previously designed code using Intel AVX, or your SIMD instruction set of choice. Compare the results of the vectorized benchmark with the previous results. Does the new implementation perform better? Why?
Note for Apple Silicon users: Your assignments are running through the Rosetta 2 compatibility layer (to execute the x86_64 compiled binary on your ARM processor). This compatibility layer does not support the AVX instruction set. So, if you want to do this bonus, you will need to run your program on another (cloud) machine (or rewrite the assignment in ARM Assembly and use the native SIMD instructions).
Many cloud providers offer compute instances (in the form of VMs) without a cost in their free tier. But how do the different offers stack up in terms of their memory bandwidth? What are the cache sizes on the used processors?
Use your own STREAM implementation to benchmark the memory of at least 3 different cloud providers (by running your program on their cloud machines and evaluating the output). Compare the results to those from your local machine as well as between the different providers, and present your findings (along with some information about how you performed the experiment) to the TA handling your submission.
Some example providers/services are Amazon AWS, Microsoft Azure (e.g., through GitHub Codespaces), Google Cloud Platform, Oracle Cloud, ...