CO Lab Manual

A6: HPC

500 - 750 points


Last updated 4 months ago

CMake target: a6

cmake --build .build --target a6

Have you always been telling everyone how large your memory bandwidth is, without actually knowing how it compares (and fearing it is just average)? Well, don't you worry, because benchmarks have you covered!

The memory benchmark is a simple, synthetic benchmark tool primarily used to measure memory bandwidth (in MB/s) using simple array operations (kernels). Benchmarks like it are widely used in the field of High-Performance Computing (HPC) to test the memory system of a computer - a crucial component for performance in scientific and engineering applications.

Considering that A, B, and C are arrays of length N and q is a scalar, the four STREAM kernels are the following:

  • COPY: A[i] = B[i]

  • SCALE: A[i] = q × B[i]

  • ADD: A[i] = B[i] + C[i]

  • TRIAD: A[i] = B[i] + q × C[i]

for i ≤ N.
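Written out as plain loops, the four kernels amount to very little code. The sketch below is in C purely for illustration (the assignment itself is written in assembly); the array names and the 64-bit integer element type follow the text, while the fixed array size is an arbitrary choice for the example:

```c
#include <stdint.h>

#define N 1000  /* illustration only; the real benchmark uses STREAM_ARRAY_SIZE */

static int64_t A[N], B[N], C[N];

/* The four STREAM kernels as simple loops over 64-bit integer arrays. */
static void copy(void)        { for (long i = 0; i < N; i++) A[i] = B[i]; }
static void scale(int64_t q)  { for (long i = 0; i < N; i++) A[i] = q * B[i]; }
static void add(void)         { for (long i = 0; i < N; i++) A[i] = B[i] + C[i]; }
static void triad(int64_t q)  { for (long i = 0; i < N; i++) A[i] = B[i] + q * C[i]; }
```

Each kernel reads its input array(s) and writes one output array once per iteration, which is what makes the loops useful as a memory-bandwidth probe.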

The arrays are filled with 64-bit integer values (but not initialized to any particular value). Each kernel is run for NTIMES iterations, and the duration of each iteration is measured. For the results, the average, minimum, and maximum times are reported for each kernel. Lastly, the effective memory bandwidth (in MB/s) for the best run is computed and reported as well. The output of an example run of the benchmark is shown below:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           13784.1     0.097001     0.087057     0.104715
Scale:          14221.0     0.094185     0.084382     0.104606
Add:            18931.8     0.099827     0.095078     0.106978
Triad:          18229.5     0.105672     0.098741     0.118283

Assignment

Implement the (single-core) STREAM benchmark, with all four array operations. Each operation should be run for NTIMES iterations and the execution time should be measured (for which you can use the clock_gettime library function).
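One way to use clock_gettime for the per-iteration timing is sketched below in C (in assembly you call the same C library function; the helper names and the choice of CLOCK_MONOTONIC are illustrative, not required):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Returns the current time in seconds from a monotonic clock, so the
   measurement is not affected by system clock adjustments. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Times a single run of one kernel (COPY/SCALE/ADD/TRIAD). */
static double time_one_run(void (*kernel)(void)) {
    double start = now_seconds();
    kernel();
    return now_seconds() - start;
}
```

Calling this NTIMES per kernel and keeping a running sum, minimum, and maximum gives you everything needed for the reported statistics.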

Make sure you use the STREAM_ARRAY_SIZE and NTIMES constants in your implementation instead of hard coding the values.

You may modify these constants for testing purposes, but the CodeGrade environment will automatically override their values.

#ifndef STREAM_ARRAY_SIZE
    #define STREAM_ARRAY_SIZE <MODIFY_ME>
#endif

#ifndef NTIMES
    #define NTIMES <MODIFY_ME>
#endif

When you have a working implementation, run the benchmark for different array sizes and note the results. Use your results to try to determine the (L1/L2) cache size(s) of your machine. Include your findings, calculations, and L1/L2 cache size(s) in a PDF document.

Submission Checklist

Before you enter the submission queue, make sure that you have completed all of the steps below. If any of the steps are not (sufficiently) completed, your submission will be rejected during hand-in.

  • Implemented all four STREAM kernels, including calculation (using the same methods as the original STREAM implementation) and printing of the results.

  • Ran the benchmark (your own version) with different array sizes to determine the (L1/L2) cache size(s) of your machine.

  • Uploaded the code to CodeGrade.

  • Uploaded a PDF of your L1/L2 cache findings and calculations to CodeGrade.

  • Prepared an explanation of the implementation and a discussion of the results to be presented to a TA.


Useful Information

Floating Point Numbers

To calculate the results, you will need to work with floating-point (double-precision) numbers. There are dedicated instructions and registers for these (on x86-64, the SSE instructions and the xmm registers) that you will need to use.
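For the bandwidth figure itself, the original STREAM counts the bytes moved per iteration (two arrays for COPY and SCALE, three for ADD and TRIAD) and divides by the minimum measured time, with 1 MB taken as 10^6 bytes. A C sketch of that calculation, with illustrative helper names:

```c
#include <stdint.h>

/* Bytes moved per kernel iteration, counted the way the original STREAM
   does: COPY/SCALE touch 2 arrays, ADD/TRIAD touch 3. */
static double kernel_bytes(int num_arrays, long n) {
    return (double)num_arrays * sizeof(int64_t) * (double)n;
}

/* Best-rate bandwidth in MB/s (STREAM uses 1 MB = 10^6 bytes),
   based on the fastest of the NTIMES runs. */
static double best_rate_mbps(int num_arrays, long n, double min_time) {
    return 1.0e-6 * kernel_bytes(num_arrays, n) / min_time;
}
```

For example, a COPY over 1,000,000 int64 elements completing its best run in 1 ms corresponds to 16,000 MB/s.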

Printing

The a6-hpc.S file already contains some handy format strings that you may use for formatting your output. You can also use your own formats, but try to stay as close as possible to the format of the original STREAM benchmark. If used correctly, the given format strings should produce an output similar to the one shown below:

Array size = 75000000 (elements).
Each kernel will be executed 20 times.
---------------------------------------------------------------
Function  Best Rate MB/s     Avg time     Min time     Max time
Copy:            13784.1     0.097001     0.087057     0.104715
Scale:           14221.0     0.094185     0.084382     0.104606
Add:             18931.8     0.099827     0.095078     0.106978
Triad:           18229.5     0.105672     0.098741     0.118283
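In assembly you would pass the provided format strings to printf. As a C illustration, specifiers along the following lines reproduce the table layout above; the exact field widths here are an assumption, and the strings shipped in a6-hpc.S may differ slightly:

```c
#include <stdio.h>

/* Formats one result row (kernel name, best rate, avg/min/max times)
   into buf, roughly matching the sample output layout. The field widths
   are an assumption; use the format strings from a6-hpc.S if possible. */
static void format_row(char *buf, size_t size, const char *name,
                       double mbps, double avg, double min, double max) {
    snprintf(buf, size, "%-8s%16.1f %12.6f %12.6f %12.6f",
             name, mbps, avg, min, max);
}
```

Printing the header line once and then one such row per kernel yields the four-row table shown in the example run.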

Automated testing

Due to the nature of the benchmark, automated testing of the functionality is not (easily) possible. CodeGrade mainly serves as a platform for a TA to quickly see the results of your benchmark - "passing" the automated tests does not imply that your solution is functioning correctly. Correctness will be evaluated by the TA handling your submission.


Bonuses

You can receive points for either bonus A or bonus B, but not both!

You must include a PDF document in your CodeGrade submission for either of the bonuses. In this document, describe your findings and the process used to arrive at the conclusion.

Bonus A - Vectorization (250 points)

Vectorize the previously designed code using Intel AVX, or your SIMD instruction set of choice. Compare the results of the vectorized benchmark with the previous results. Does the new implementation perform better? Why?

Note for Apple Silicon users: Your assignments are running through the Rosetta 2 compatibility layer (to execute the x86_64 compiled binary on your ARM processor). This compatibility layer does not support the AVX instruction set. So, if you want to do this bonus, you will need to run your program on another (cloud) machine (or rewrite the assignment in ARM Assembly and use the native SIMD instructions).
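To give an idea of what vectorizing a kernel looks like, here is a C sketch of the ADD kernel using AVX2 intrinsics (four 64-bit additions per instruction); in assembly the corresponding instruction is vpaddq. The runtime-dispatch helper is purely illustrative, so the example also runs on CPUs without AVX2:

```c
#include <stdint.h>
#include <immintrin.h>

/* Scalar ADD kernel, used as a fallback. */
static void add_scalar(int64_t *a, const int64_t *b, const int64_t *c, long n) {
    for (long i = 0; i < n; i++) a[i] = b[i] + c[i];
}

/* ADD kernel vectorized with AVX2: 4 x 64-bit adds per instruction. */
__attribute__((target("avx2")))
static void add_avx2(int64_t *a, const int64_t *b, const int64_t *c, long n) {
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i vc = _mm256_loadu_si256((const __m256i *)(c + i));
        _mm256_storeu_si256((__m256i *)(a + i), _mm256_add_epi64(vb, vc));
    }
    for (; i < n; i++) a[i] = b[i] + c[i];  /* scalar tail */
}

/* Pick the AVX2 version at run time when the CPU supports it. */
static void add_kernel(int64_t *a, const int64_t *b, const int64_t *c, long n) {
    if (__builtin_cpu_supports("avx2")) add_avx2(a, b, c, n);
    else                                add_scalar(a, b, c, n);
}
```

Note that a streaming kernel like this is typically memory-bound, so measuring whether (and for which array sizes) the vectorized version is actually faster is exactly the point of the comparison asked for above.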


Bonus B - Cloud Benchmark (250 points)

Many cloud providers offer compute instances (in the form of VMs) at no cost in their free tier. But how do the different offers stack up in terms of memory bandwidth? What are the cache sizes of the processors used?

Use your own STREAM implementation to benchmark the memory of at least 3 different cloud providers (by running your program on their cloud machines and evaluating the output). Compare the results to those from your local machine as well as between the different providers, and present your findings (along with some information about how you performed the experiment) to the TA handling your submission.

Caution: the STREAM benchmark uses its own way of counting bytes. Have a look at the STREAM documentation such that your implementation follows the same approach.

Some example providers/services are Amazon AWS, Microsoft Azure (e.g., through GitHub Codespaces), Google Cloud Platform, Oracle Cloud, ...