A6: HPC
500 - 750 points
You've always been telling everyone how large your memory bandwidth is, but you don't actually know how it compares (and fear it is just average)? Well, don't you worry, because benchmarks have you covered!
The STREAM memory benchmark is a simple, synthetic benchmark tool primarily used to measure memory bandwidth (in MB/s) using simple array operations (kernels). Benchmarks like it are widely used in the field of High-Performance Computing (HPC) to test the memory system of a computer - a crucial component for performance in scientific and engineering applications.
Considering that a, b, and c are arrays of length N and q is a scalar, the four STREAM kernels are the following:

COPY:  c[i] = a[i]
SCALE: b[i] = q * c[i]
ADD:   c[i] = a[i] + b[i]
TRIAD: a[i] = b[i] + q * c[i]

for i = 0, ..., N - 1.
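In C, the four kernels can be sketched as follows (a sketch for reference, not the required assembly implementation; double elements are assumed here, as in the original STREAM code):

```c
#include <stddef.h>

#define N 1000000  /* illustrative length; the assignment uses STREAM_ARRAY_SIZE */

static double a[N], b[N], c[N];

/* COPY:  c[i] = a[i] */
void copy(void)      { for (size_t i = 0; i < N; i++) c[i] = a[i]; }
/* SCALE: b[i] = q * c[i] */
void scale(double q) { for (size_t i = 0; i < N; i++) b[i] = q * c[i]; }
/* ADD:   c[i] = a[i] + b[i] */
void add(void)       { for (size_t i = 0; i < N; i++) c[i] = a[i] + b[i]; }
/* TRIAD: a[i] = b[i] + q * c[i] */
void triad(double q) { for (size_t i = 0; i < N; i++) a[i] = b[i] + q * c[i]; }
```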
The arrays contain 64-bit integer values, but they do not need to be initialized to any particular value. Each kernel is run for NTIMES iterations, and the duration of each iteration is measured. For the results, the average, minimum, and maximum times are reported for each kernel. Lastly, the effective memory bandwidth (in MB/s) for the best run is computed and reported as well. The output of an example run of the benchmark is shown below:
Function Best Rate MB/s Avg time Min time Max time
Copy: 13784.1 0.097001 0.087057 0.104715
Scale: 14221.0 0.094185 0.084382 0.104606
Add: 18931.8 0.099827 0.095078 0.106978
Triad: 18229.5 0.105672 0.098741 0.118283
Assignment
Implement the (single-core) STREAM benchmark with all four array operations. Each operation should be run for NTIMES iterations, and the execution time of each iteration should be measured (for which you can use the clock_gettime library function).
Caution: the STREAM benchmark uses its own way of counting bytes. Have a look at the documentation to make sure your implementation follows the same approach.
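As a sketch of that convention (assuming 8-byte elements): COPY and SCALE each touch two arrays per iteration, while ADD and TRIAD touch three, and the best-rate bandwidth divides those bytes by the minimum measured time (STREAM counts 1 MB as 1e6 bytes):

```c
#define STREAM_ARRAY_SIZE 75000000

/* Bytes moved per iteration, following the original STREAM convention:
 * COPY and SCALE read one array and write one (arrays_touched = 2);
 * ADD and TRIAD read two arrays and write one (arrays_touched = 3). */
double kernel_bytes(int arrays_touched)
{
    return (double)arrays_touched * 8 * STREAM_ARRAY_SIZE;
}

/* Best-rate bandwidth in MB/s, from the minimum per-iteration time. */
double bandwidth_mbs(int arrays_touched, double min_time)
{
    return kernel_bytes(arrays_touched) / min_time / 1e6;
}
```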
Make sure you use the STREAM_ARRAY_SIZE and NTIMES constants in your implementation instead of hard-coding the values.
You may modify these constants for testing purposes, but the CodeGrade environment will automatically override their values.
#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE <MODIFY_ME>
#endif
#ifndef NTIMES
#define NTIMES <MODIFY_ME>
#endif
When you have a working implementation, run the benchmark for different array sizes and note the results. Use your results to try to determine the (L1/L2) cache size(s) of your machine. Include your findings, calculations, and L1/L2 cache size(s) in a PDF document.
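One way to reason about the sweep (a hedged sketch; the sizes and array counts below are illustrative, not measured): compute the working set each kernel touches and look for the array size at which the best rate drops sharply, which hints that the working set no longer fits a given cache level.

```c
#include <stdio.h>

/* Working-set size in KiB for a kernel touching `arrays` arrays of
 * `n` 8-byte elements. If the measured bandwidth drops sharply once
 * this exceeds a cache level's capacity, that suggests the cache size. */
double footprint_kib(long n, int arrays)
{
    return (double)n * 8 * arrays / 1024.0;
}

/* Example: COPY (2 arrays) with n = 1024 gives a 16 KiB working set;
 * sweep n upward and note where the best rate falls off. */
```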
Submission Checklist
Before you enter the submission queue, make sure that you have done all of the below-mentioned steps. If any of the steps are not (sufficiently) completed, your submission will be rejected during hand-in.
Implemented all four STREAM kernels, including calculation (using the same methods as the original STREAM implementation) and printing of the results.
Ran the benchmark (your own version) with different array sizes to determine the (L1/L2) cache size(s) of your machine.
Uploaded the code to CodeGrade.
Uploaded a PDF of your L1/L2 cache findings and calculations to CodeGrade.
Prepared an explanation of the implementation and a discussion of the results to be presented to a TA.
Useful Information
Floating Point Numbers
To calculate the results, you will need to work with (double-precision) floating-point numbers. There are dedicated instructions and registers for these that you will need to use.
Printing
The a6-hpc.S file already contains some handy format strings that you may use for formatting your output. You can also use your own formats, but try to stay as close as possible to the format of the original STREAM benchmark. If used correctly, the given format strings should produce output similar to that shown below:
Array size = 75000000 (elements).
Each kernel will be executed 20 times.
---------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 13784.1 0.097001 0.087057 0.104715
Scale: 14221.0 0.094185 0.084382 0.104606
Add: 18931.8 0.099827 0.095078 0.106978
Triad: 18229.5 0.105672 0.098741 0.118283
Automated testing
Due to the nature of the benchmark, automated testing of the functionality is not (easily) possible. CodeGrade mainly serves as a platform for a TA to quickly see the results of your benchmark - "passing" the automated tests does not imply that your solution is functioning correctly. Correctness will be evaluated by the TA handling your submission.
Bonuses
You can receive points for either bonus A or bonus B, but not both!
You must include a PDF document in your CodeGrade submission for either of the bonuses. In this document, describe your findings and the process used to arrive at the conclusion.
Bonus A - Vectorization (250 points)
Vectorize the previously designed code using Intel AVX, or your SIMD instruction set of choice. Compare the results of the vectorized benchmark with the previous results. Does the new implementation perform better? Why?
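As a sketch of what vectorizing one kernel could look like, here is TRIAD with AVX intrinsics (written in C rather than raw assembly; assumes the length is a multiple of 4 and the CPU supports AVX):

```c
#include <immintrin.h>
#include <stddef.h>

/* TRIAD with AVX: a[i] = b[i] + q * c[i], four doubles at a time.
 * Assumes n is a multiple of 4; a scalar tail loop would handle
 * any remainder in the general case. */
__attribute__((target("avx")))
void triad_avx(double *a, const double *b, const double *c, double q, size_t n)
{
    __m256d vq = _mm256_set1_pd(q);
    for (size_t i = 0; i < n; i += 4) {
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_loadu_pd(&c[i]);
        _mm256_storeu_pd(&a[i], _mm256_add_pd(vb, _mm256_mul_pd(vq, vc)));
    }
}
```

Keep in mind that for arrays much larger than the caches, a memory-bound kernel may gain little from vectorization, since the bottleneck is bandwidth rather than arithmetic throughput.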
Bonus B - Cloud Benchmark (250 points)
Many cloud providers offer compute instances (in the form of VMs) at no cost in their free tier. But how do the different offers stack up in terms of memory bandwidth? And what are the cache sizes of the processors they use?
Use your own STREAM implementation to benchmark the memory of at least 3 different cloud providers (by running your program on their cloud machines and evaluating the output). Compare the results to those from your local machine, as well as between the different providers, and present your findings (along with some information about how you performed the experiment) to the TA handling your submission.
Some example providers/services are Amazon AWS, Microsoft Azure (e.g., through GitHub Codespaces), Google Cloud Platform, Oracle Cloud, ...