# Lecture 08: Locality and Shortcuts

## Up Now

• Lab 02: Computing Shortcuts
• HPC cluster instructions

## Last Time: Cost of Random Access

Linear Sum:

	float total = 0;
	for (int i = 0; i < size; ++i) {
	    int idx = linearIndex[i];
	    total += values[idx];
	}


Random Sum:

	float total = 0;
	for (int i = 0; i < size; ++i) {
	    int idx = randomIndex[i];
	    total += values[idx];
	}
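
The two loops differ only in the order in which `values` is visited. The following is a minimal sketch for timing them side by side. It assumes that `linearIndex` holds 0, 1, ..., size - 1 and that `randomIndex` is a random permutation of those indices; the class name `AccessPatternDemo`, the helper methods, and the array size are hypothetical choices made for this illustration, not part of the lab code.

```java
final class AccessPatternDemo {
    // Sums values[index[i]] in the order given by index.
    static float indexedSum(float[] values, int[] index) {
        float total = 0;
        for (int i = 0; i < index.length; ++i) {
            int idx = index[i];
            total += values[idx];
        }
        return total;
    }

    // Assumed meaning of linearIndex: the identity order 0, 1, ..., size - 1.
    static int[] linearIndex(int size) {
        int[] index = new int[size];
        for (int i = 0; i < size; ++i) {
            index[i] = i;
        }
        return index;
    }

    // Assumed meaning of randomIndex: a Fisher-Yates shuffle of the identity order.
    static int[] randomIndex(int size, long seed) {
        int[] index = linearIndex(size);
        java.util.Random rng = new java.util.Random(seed);
        for (int i = size - 1; i > 0; --i) {
            int j = rng.nextInt(i + 1);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
        }
        return index;
    }

    public static void main(String[] args) {
        int size = 1 << 23; // chosen so that values is larger than a typical cache
        float[] values = new float[size];
        java.util.Arrays.fill(values, 1.0f);
        int[] linear = linearIndex(size);
        int[] random = randomIndex(size, 42L);

        long t0 = System.nanoTime();
        float linearTotal = indexedSum(values, linear);
        long t1 = System.nanoTime();
        float randomTotal = indexedSum(values, random);
        long t2 = System.nanoTime();

        System.out.printf("linear sum: %d ms (total %.0f)%n", (t1 - t0) / 1_000_000, linearTotal);
        System.out.printf("random sum: %d ms (total %.0f)%n", (t2 - t1) / 1_000_000, randomTotal);
    }
}
```

This is a single-run timing without JIT warm-up, so the numbers are only a rough indication; the ratio between the two lines depends on the machine and on how `size` compares to its cache sizes.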


## What Your Computer (Probably) Does

Let arr be a large array.

On a read or write of arr[i], the hardware searches for arr[i] successively in:

• L1 cache
• L2 cache
• L3 cache
• main memory

Once found, arr[i] and its surrounding values are copied to the L1 cache

• usually arr[i-a],...,arr[i+b] ends up in L1

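This process is called caching, and the block of neighboring values copied together is a cache line. (Paging is the analogous mechanism between main memory and disk.)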
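
To see why copying the surrounding values matters, here is a toy model rather than a description of real hardware. It assumes values are fetched in fixed-size blocks of 16 floats (64 bytes, a common but not universal block size), that arr[0] starts on a block boundary, and that only the most recently fetched block is remembered. The class `CacheLineModel` and its method are hypothetical names for this sketch.

```java
final class CacheLineModel {
    // Assumption: 64-byte blocks and 4-byte floats, so 16 floats per block.
    static final int FLOATS_PER_LINE = 16;

    // Counts how many accesses land in a different block than the previous access.
    // Toy model: the "cache" remembers only the single most recent block.
    static int countLineLoads(int[] accessOrder) {
        int loads = 0;
        int currentLine = -1;
        for (int idx : accessOrder) {
            int line = idx / FLOATS_PER_LINE;
            if (line != currentLine) {
                ++loads;
                currentLine = line;
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        int[] sequential = new int[64];
        for (int i = 0; i < 64; ++i) {
            sequential[i] = i;
        }
        int[] strided = {0, 16, 32, 48};
        System.out.println(countLineLoads(sequential)); // 4 loads for 64 accesses
        System.out.println(countLineLoads(strided));    // 4 loads for 4 accesses
    }
}
```

In this model, sequential access pays for one block fetch per 16 reads, while the strided pattern pays for a fetch on every read. Real caches hold many blocks at once and prefetch, so this only conveys the trend.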

## Performance Tuning

Be aware of your program's memory access pattern

• reading values sequentially can be tens of times faster than reading them in random order or jumping around

# Lab 02: Computing Shortcuts

## In Code

	// shortcuts[i][j] = min over k of (matrix[i][k] + matrix[k][j])
	float[][] shortcuts = new float[size][size];
	for (int i = 0; i < size; ++i) {
	    for (int j = 0; j < size; ++j) {
	        float min = Float.MAX_VALUE;
	        for (int k = 0; k < size; ++k) {
	            float x = matrix[i][k];
	            float y = matrix[k][j];
	            float z = x + y;
	            if (z < min)
	                min = z;
	        }
	        shortcuts[i][j] = min;
	    }
	}
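
As a small worked example of what the loops compute, the sketch below wraps them in a method and runs them on a 3 x 3 input. It assumes `matrix` is a square `float[][]` with `size` rows and columns; the class name `ShortcutExample` and the sample values are made up for illustration.

```java
final class ShortcutExample {
    // Same triple loop as above: shortcuts[i][j] = min over k of matrix[i][k] + matrix[k][j].
    static float[][] computeShortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] shortcuts = new float[size][size];
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                float min = Float.MAX_VALUE;
                for (int k = 0; k < size; ++k) {
                    float z = matrix[i][k] + matrix[k][j];
                    if (z < min) {
                        min = z;
                    }
                }
                shortcuts[i][j] = min;
            }
        }
        return shortcuts;
    }

    public static void main(String[] args) {
        float[][] matrix = {
            {0, 2, 9},
            {2, 0, 3},
            {9, 3, 0}
        };
        for (float[] row : computeShortcuts(matrix)) {
            System.out.println(java.util.Arrays.toString(row));
        }
        // prints:
        // [0.0, 2.0, 5.0]
        // [2.0, 0.0, 3.0]
        // [5.0, 3.0, 0.0]
    }
}
```

The entry at row 0, column 2 drops from 9 to 5 because going through k = 1 costs 2 + 3, which is cheaper than the direct value.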


## Activity/Discussion

Questions.

1. Which accesses to matrix are sequential? Which are not?
2. How could we make all memory accesses sequential?
3. Which operations can be (easily) parallelized?

## Question 1

Which accesses to matrix are sequential? Which are not?

	// shortcuts[i][j] = min over k of (matrix[i][k] + matrix[k][j])
	float[][] shortcuts = new float[size][size];
	for (int i = 0; i < size; ++i) {
	    for (int j = 0; j < size; ++j) {
	        float min = Float.MAX_VALUE;
	        for (int k = 0; k < size; ++k) {
	            float x = matrix[i][k];
	            float y = matrix[k][j];
	            float z = x + y;
	            if (z < min)
	                min = z;
	        }
	        shortcuts[i][j] = min;
	    }
	}


## Question 2

How could we make all memory accesses sequential?

## Code, Again

	// shortcuts[i][j] = min over k of (matrix[i][k] + matrix[k][j])
	float[][] shortcuts = new float[size][size];
	for (int i = 0; i < size; ++i) {
	    for (int j = 0; j < size; ++j) {
	        float min = Float.MAX_VALUE;
	        for (int k = 0; k < size; ++k) {
	            float x = matrix[i][k];
	            float y = matrix[k][j];
	            float z = x + y;
	            if (z < min)
	                min = z;
	        }
	        shortcuts[i][j] = min;
	    }
	}
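
One possible direction for Question 2, offered as a sketch and not as the intended lab solution: copy the input into a transposed array once, so that the innermost loop walks along a row of both arrays. The class `TransposedShortcuts` and the helper `transpose` are hypothetical names, and the code assumes `matrix` is square.

```java
final class TransposedShortcuts {
    // Returns t with t[j][k] == matrix[k][j]. Assumes matrix is square.
    static float[][] transpose(float[][] matrix) {
        int size = matrix.length;
        float[][] t = new float[size][size];
        for (int k = 0; k < size; ++k) {
            for (int j = 0; j < size; ++j) {
                t[j][k] = matrix[k][j];
            }
        }
        return t;
    }

    // Same result as the original triple loop, but the inner loop reads
    // row[k] and col[k] for increasing k, both of which are row accesses.
    static float[][] computeShortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] transposed = transpose(matrix);
        float[][] shortcuts = new float[size][size];
        for (int i = 0; i < size; ++i) {
            float[] row = matrix[i];
            for (int j = 0; j < size; ++j) {
                float[] col = transposed[j]; // column j of matrix, stored as a row
                float min = Float.MAX_VALUE;
                for (int k = 0; k < size; ++k) {
                    float z = row[k] + col[k];
                    if (z < min) {
                        min = z;
                    }
                }
                shortcuts[i][j] = min;
            }
        }
        return shortcuts;
    }
}
```

The transpose itself still jumps between rows when writing, but it touches each entry once, whereas the triple loop reads each entry `size` times. Whether the extra copy pays off for a given `size` is something to measure.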


## Question 3

Which operations can be (easily) parallelized?

	// shortcuts[i][j] = min over k of (matrix[i][k] + matrix[k][j])
	float[][] shortcuts = new float[size][size];
	for (int i = 0; i < size; ++i) {
	    for (int j = 0; j < size; ++j) {
	        float min = Float.MAX_VALUE;
	        for (int k = 0; k < size; ++k) {
	            float x = matrix[i][k];
	            float y = matrix[k][j];
	            float z = x + y;
	            if (z < min)
	                min = z;
	        }
	        shortcuts[i][j] = min;
	    }
	}
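
One observation relevant to Question 3 is that each entry shortcuts[i][j] depends only on the read-only input, so different rows i can be computed independently. The sketch below splits the outer loop across threads using Java's parallel streams. That mechanism is an assumption made for brevity here; the lab may expect explicit threads or another approach, and `ParallelShortcuts` is a hypothetical class name.

```java
final class ParallelShortcuts {
    // Computes each row i of shortcuts in parallel. Every task writes only to
    // its own row and only reads matrix, so the tasks do not interfere.
    static float[][] computeShortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] shortcuts = new float[size][size];
        java.util.stream.IntStream.range(0, size).parallel().forEach(i -> {
            for (int j = 0; j < size; ++j) {
                float min = Float.MAX_VALUE;
                for (int k = 0; k < size; ++k) {
                    float z = matrix[i][k] + matrix[k][j];
                    if (z < min) {
                        min = z;
                    }
                }
                shortcuts[i][j] = min;
            }
        });
        return shortcuts;
    }
}
```

Parallelizing the innermost loop over k would be less straightforward, because all iterations update the same `min` variable.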


## Assignment Challenges

1. Optimize loops for linear memory access

## Suggestions

1. Get working solution on your computer first
2. Then test on the HPC cluster

My Benchmark (HPC cluster):

	[wrosenbaum@hpc-login1 lab02-shortcuts]$ cat shortcutTest.out
	|------|------------------|-------------|------------------|---------|
	| size | avg runtime (ms) | improvement | iteration per us | passed? |
	|------|------------------|-------------|------------------|---------|
	|  128 |              184 |        0.05 |               11 |     yes |
	|  256 |               56 |        0.82 |              294 |     yes |
	|  512 |               19 |        9.22 |             6972 |     yes |
	| 1024 |               85 |       33.15 |            12497 |     yes |
	| 2048 |              257 |       88.33 |            33317 |     yes |
	| 4096 |             1124 |      324.66 |            61095 |     yes |
	|------|------------------|-------------|------------------|---------|