A6: HPC
500 - 750 points
Have you always been telling everyone how large your memory bandwidth is, while not actually knowing how it compares (and fearing it is just average)? Well, don't you worry, because benchmarks have you covered!
The STREAM memory benchmark is a simple, synthetic benchmark primarily used to measure memory bandwidth (in MB/s) using a set of simple array operations (kernels). Benchmarks like it are widely used in the field of High-Performance Computing (HPC) to test the memory system of a computer - a crucial component for performance in scientific and engineering applications.
Considering a, b, and c are arrays of length N and q is a scalar, the four STREAM kernels are the following:
COPY: c[i] = a[i]
SCALE: b[i] = q * c[i]
ADD: c[i] = a[i] + b[i]
TRIAD: a[i] = b[i] + q * c[i]
for i = 0, ..., N - 1.
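As a rough illustration only (your implementation goes into a6-hpc.S in assembly), the four kernels correspond to C loops along these lines; the array names follow the assignment text, while the array size value and the scalar q are just placeholders:

    /* Sketch of the four STREAM kernels; a, b, c hold 64-bit elements. */
    #define STREAM_ARRAY_SIZE 10000000  /* placeholder; CodeGrade overrides this */
    static long a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];
    static long q = 3;                  /* example scalar for SCALE and TRIAD */

    void copy_kernel(void)  { for (long i = 0; i < STREAM_ARRAY_SIZE; i++) c[i] = a[i]; }
    void scale_kernel(void) { for (long i = 0; i < STREAM_ARRAY_SIZE; i++) b[i] = q * c[i]; }
    void add_kernel(void)   { for (long i = 0; i < STREAM_ARRAY_SIZE; i++) c[i] = a[i] + b[i]; }
    void triad_kernel(void) { for (long i = 0; i < STREAM_ARRAY_SIZE; i++) a[i] = b[i] + q * c[i]; }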
The arrays are filled with 64-bit integer values (but not initialized to any particular value). Each kernel is run for NTIMES iterations, and the duration of each iteration is measured. For the results, the average, minimum, and maximum times are reported for each kernel. Lastly, the effective memory bandwidth (in MB/s) for the best run is computed and reported as well. The output of an example run of the benchmark is shown below:
Implement the (single-core) STREAM benchmark, with all four array operations. Each operation should be run for NTIMES iterations, and the execution time should be measured (for which you can use the clock_gettime library function).
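A minimal sketch of timing one kernel run with clock_gettime could look as follows (using the monotonic clock and the copy_kernel from the sketch above; in assembly you would make the equivalent library call and do the same arithmetic):

    #include <time.h>

    /* Returns the elapsed wall-clock time of one COPY run, in seconds. */
    double time_copy(void) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        copy_kernel();
        clock_gettime(CLOCK_MONOTONIC, &end);
        return (double)(end.tv_sec - start.tv_sec)
             + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }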
Make sure you use the STREAM_ARRAY_SIZE and NTIMES constants in your implementation instead of hard-coding the values.
You may modify these constants for testing purposes, but the CodeGrade environment will automatically override their values.
When you have a working implementation, run the benchmark for different array sizes and note the results. Use your results to try to determine the (L1/L2) cache size(s) of your machine. Include your findings, calculations, and L1/L2 cache size(s) in a PDF document.
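To give an idea of the kind of reasoning expected (with purely hypothetical numbers): if the measured bandwidth drops sharply once STREAM_ARRAY_SIZE exceeds roughly 1,300 elements, the combined working set of the three arrays at that point is about 3 x 1,300 x 8 B, or roughly 31 KB, which would suggest a 32 KB L1 data cache; a second drop at a larger array size hints at the L2 size in the same way.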
Before you enter the submission queue, make sure that you have done all of the below-mentioned steps. If any of the steps are not (sufficiently) completed, your submission will be rejected during hand-in.
Implemented all four STREAM kernels, including calculation (using the same methods as the original STREAM implementation) and printing of the results.
Ran the benchmark (your own version) with different array sizes to determine the (L1/L2) cache size(s) of your machine.
Uploaded the code to CodeGrade.
Uploaded a PDF of your L1/L2 cache findings and calculations to CodeGrade.
Prepared an explanation of the implementation and a discussion of the results to be presented to a TA.
To calculate the results, you will need to work with floating-point (double-precision) numbers. There are special instructions and registers for these, which you will need to use.
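As a sketch of the calculations involved (in assembly, these would be done with the scalar double-precision instructions and registers mentioned above), one hypothetical way to aggregate the per-iteration timings of a kernel and derive the best-run bandwidth is shown below; the byte count per iteration is discussed under the byte-counting caution further down:

    #include <float.h>
    #include <stdio.h>

    /* Hypothetical aggregation for one kernel: times[k] is the duration of
       iteration k in seconds, bytes is the data moved per iteration.
       The first iteration is skipped, as in the original STREAM code. */
    void report_kernel(const char *name, double times[], int ntimes, double bytes) {
        double avg = 0.0, min = DBL_MAX, max = 0.0;
        for (int k = 1; k < ntimes; k++) {
            avg += times[k];
            if (times[k] < min) min = times[k];
            if (times[k] > max) max = times[k];
        }
        avg /= (double)(ntimes - 1);
        /* Best-rate bandwidth in MB/s; STREAM counts 1 MB as 10^6 bytes. */
        double mbps = 1.0e-6 * bytes / min;
        /* Placeholder format; the format strings in a6-hpc.S produce STREAM-like output. */
        printf("%s %12.1f %12.6f %12.6f %12.6f\n", name, mbps, avg, min, max);
    }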
The a6-hpc.S file already contains some handy format strings that you may use for formatting your output. You can also use your own formats, but try to stay as close as possible to the format of the original STREAM benchmark. If used correctly, the given format strings should produce an output similar to the one shown below:
Due to the nature of the benchmark, automated testing of the functionality is not (easily) possible. CodeGrade mainly serves as a platform for a TA to quickly see the results of your benchmark - "passing" the automated tests does not imply that your solution is functioning correctly. Correctness will be evaluated by the TA handling your submission.
You can receive points for either bonus A or bonus B, but not both!
You must include a PDF document in your CodeGrade submission for either of the bonuses. In this document, describe your findings and the process used to arrive at the conclusion.
Vectorize the previously designed code using Intel AVX, or your SIMD instruction set of choice. Compare the results of the vectorized benchmark with the previous results. Does the new implementation perform better? Why?
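As a starting point, a hedged sketch of what a vectorized COPY kernel could look like with AVX intrinsics in C is shown below (it reuses the arrays and constant from the earlier sketch and assumes STREAM_ARRAY_SIZE is a multiple of four; in assembly, the corresponding vmovdqu instructions on the ymm registers would be used):

    #include <immintrin.h>

    /* Vectorized COPY: moves four 64-bit elements per iteration (256-bit AVX). */
    void copy_kernel_avx(void) {
        for (long i = 0; i < STREAM_ARRAY_SIZE; i += 4) {
            __m256i v = _mm256_loadu_si256((const __m256i *)&a[i]);
            _mm256_storeu_si256((__m256i *)&c[i], v);
        }
    }

When interpreting the comparison, keep in mind that STREAM is memory-bound: for arrays much larger than the caches, the scalar version may already saturate the available memory bandwidth, so vectorization does not necessarily translate into higher reported numbers.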
Many cloud providers offer compute instances (in the form of VMs) without a cost in their free tier. But how do the different offers stack up in terms of their memory bandwidth? What are the cache sizes on the used processors?
Use your own STREAM implementation to benchmark the memory of at least 3 different cloud providers (by running your program on their cloud machines and evaluating the output). Compare the results to your local implementation as well as between the different providers and present your findings (along with some information about how you performed the experiment) to the TA handling your submission.
Caution: the STREAM benchmark uses its own way of counting bytes. Have a look at the original STREAM implementation so that your own implementation follows the same approach.
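For reference, the original STREAM code counts two arrays' worth of bytes per iteration for COPY and SCALE and three arrays' worth for ADD and TRIAD, along the lines of the following sketch (element size shown as 8 bytes to match the 64-bit arrays):

    /* Bytes moved per iteration of each kernel, following the original
       STREAM approach of counting only the explicitly read/written arrays. */
    static const double bytes_per_iter[4] = {
        2.0 * 8 * STREAM_ARRAY_SIZE,   /* COPY:  reads a, writes c        */
        2.0 * 8 * STREAM_ARRAY_SIZE,   /* SCALE: reads c, writes b        */
        3.0 * 8 * STREAM_ARRAY_SIZE,   /* ADD:   reads a and b, writes c  */
        3.0 * 8 * STREAM_ARRAY_SIZE    /* TRIAD: reads b and c, writes a  */
    };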
Some example providers/services to consider are the free tiers of the major cloud platforms (e.g., accessible through GitHub), among others.