
Linear Sum:
	float total = 0;
	for (int i = 0; i < size; ++i) {
	    int idx = linearIndex[i];
	    total += values[idx];
	}
	return total;
Random Sum:
	float total = 0;
	for (int i = 0; i < size; ++i) {
	    int idx = randomIndex[i];
	    total += values[idx];
	}
	return total;
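To try the two sums side by side, here is a minimal runnable sketch. The class name AccessPatternDemo, the helpers linearIndex and randomIndex, the seed, and the array size are illustrative assumptions, not taken from the original. It assumes linearIndex holds 0, 1, ..., size-1 in order and randomIndex holds a shuffle of the same indices, so both loops do the same arithmetic and differ only in access order.

```java
import java.util.Arrays;
import java.util.Random;

final class AccessPatternDemo {
    // The loop from the text: sums values[index[i]] for every i.
    static float sum(float[] values, int[] index) {
        float total = 0;
        for (int i = 0; i < index.length; ++i) {
            int idx = index[i];
            total += values[idx];
        }
        return total;
    }

    // Assumed layout of linearIndex: 0, 1, ..., size-1.
    static int[] linearIndex(int size) {
        int[] idx = new int[size];
        for (int i = 0; i < size; ++i) {
            idx[i] = i;
        }
        return idx;
    }

    // Assumed layout of randomIndex: a Fisher-Yates shuffle of 0..size-1.
    static int[] randomIndex(int size, long seed) {
        int[] idx = linearIndex(size);
        Random rng = new Random(seed);
        for (int i = size - 1; i > 0; --i) {
            int j = rng.nextInt(i + 1);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return idx;
    }

    public static void main(String[] args) {
        int size = 1 << 23; // hypothetical size, chosen to exceed typical caches
        float[] values = new float[size];
        Arrays.fill(values, 1.0f);
        int[] linear = linearIndex(size);
        int[] random = randomIndex(size, 42L);

        long t0 = System.nanoTime();
        float a = sum(values, linear);
        long t1 = System.nanoTime();
        float b = sum(values, random);
        long t2 = System.nanoTime();

        System.out.println("linear: total = " + a + ", " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("random: total = " + b + ", " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

A single timed run like this is only a rough indication; JIT warm-up and machine load affect the numbers, so repeated runs give a fairer comparison.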
arr: a large array
On read/write arr[i], search for arr[i] successively in each level of the memory hierarchy (caches first, then main memory)
Copy arr[i] and surrounding values to L1 cache
arr[i-a],...,arr[i+b] ends up in L1
This process is called paging
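One way to observe that neighbors of arr[i] arrive in the cache together is to read the same array with different strides. The sketch below is an illustration under assumptions: the class name StrideDemo, the method sumWithStride, the array size, and the stride values are all made up for this example. Every call visits each element exactly once, so only the order of accesses changes.

```java
import java.util.Arrays;

final class StrideDemo {
    // Visits every element exactly once, in `stride` passes:
    // pass p touches indices p, p + stride, p + 2 * stride, ...
    static long sumWithStride(int[] arr, int stride) {
        long total = 0;
        for (int start = 0; start < stride; ++start) {
            for (int i = start; i < arr.length; i += stride) {
                total += arr[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] arr = new int[1 << 24]; // hypothetical size
        Arrays.fill(arr, 1);
        for (int stride : new int[] {1, 16, 1024}) {
            long t0 = System.nanoTime();
            long total = sumWithStride(arr, stride);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("stride " + stride + ": total = " + total + ", " + ms + " ms");
        }
    }
}
```

With stride 1 each access can reuse values that were copied in alongside the previous one; with a large stride, consecutive accesses are far apart in memory, so fewer of them are likely to be served from L1.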

Be aware of your program’s memory access pattern
	// shortcuts[i][j] = minimum over k of matrix[i][k] + matrix[k][j]
	float[][] shortcuts = new float[size][size];
	for (int i = 0; i < size; ++i) {
	    for (int j = 0; j < size; ++j) {
		float min = Float.MAX_VALUE;
		for (int k = 0; k < size; ++k) {
		    float x = matrix[i][k];
		    float y = matrix[k][j];
		    float z = x + y;
		    if (z < min)
			min = z;
		}
		shortcuts[i][j] = min;
	    }
	}
Questions.
Which accesses to matrix are sequential? Which are not?
How could we make all memory accesses sequential?
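One possible answer, sketched under assumptions: in the triple loop, matrix[i][k] walks along a row as k increases, while matrix[k][j] jumps to a different row on every step. Copying the matrix into its transpose first turns the second access into a row walk as well. The class name TransposedShortcuts and the helper transpose are illustrative names, not from the original, and the input is assumed to be square.

```java
final class TransposedShortcuts {
    // Returns t with t[j][k] == matrix[k][j]; assumes a square matrix.
    static float[][] transpose(float[][] matrix) {
        int size = matrix.length;
        float[][] t = new float[size][size];
        for (int k = 0; k < size; ++k) {
            for (int j = 0; j < size; ++j) {
                t[j][k] = matrix[k][j];
            }
        }
        return t;
    }

    // Same result as the triple loop in the text, but the inner loop
    // reads both operands from consecutive positions within a row.
    static float[][] shortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] transposed = transpose(matrix);
        float[][] shortcuts = new float[size][size];
        for (int i = 0; i < size; ++i) {
            float[] row = matrix[i];
            for (int j = 0; j < size; ++j) {
                float[] col = transposed[j]; // column j of matrix, stored as a row
                float min = Float.MAX_VALUE;
                for (int k = 0; k < size; ++k) {
                    float z = row[k] + col[k];
                    if (z < min)
                        min = z;
                }
                shortcuts[i][j] = min;
            }
        }
        return shortcuts;
    }

    public static void main(String[] args) {
        float[][] matrix = {{0f, 1f, 5f}, {2f, 0f, 1f}, {4f, 3f, 0f}};
        for (float[] row : shortcuts(matrix)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The transpose costs one extra pass over the matrix and a second size-by-size array, which is small next to the cubic work of the main loop.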
Which operations can be (easily) parallelized?
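One possible answer, sketched under assumptions: each entry shortcuts[i][j] depends only on matrix, which is never modified, so different rows i can be computed independently and written without conflicts. The sketch below hands rows to a parallel stream from the Java standard library; this is a choice made for brevity, and explicit threads would work as well. The class name ParallelShortcuts is illustrative, not from the original.

```java
import java.util.stream.IntStream;

final class ParallelShortcuts {
    // Computes the same values as the triple loop in the text, with the
    // outer loop over i distributed across worker threads.
    static float[][] shortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] shortcuts = new float[size][size];
        IntStream.range(0, size).parallel().forEach(i -> {
            // Each task writes only to row i, so no two tasks touch
            // the same output element.
            for (int j = 0; j < size; ++j) {
                float min = Float.MAX_VALUE;
                for (int k = 0; k < size; ++k) {
                    float z = matrix[i][k] + matrix[k][j];
                    if (z < min)
                        min = z;
                }
                shortcuts[i][j] = min;
            }
        });
        return shortcuts;
    }

    public static void main(String[] args) {
        float[][] matrix = {{0f, 1f, 5f}, {2f, 0f, 1f}, {4f, 3f, 0f}};
        for (float[] row : shortcuts(matrix)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The inner loop over k is harder to split because every iteration updates the same min, so parallelizing it would require combining partial minima.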
My Benchmark (HPC cluster):
[wrosenbaum@hpc-login1 lab02-shortcuts]$ cat shortcutTest.out 
|------|------------------|-------------|------------------|---------|
| size | avg runtime (ms) | improvement | iteration per us | passed? |
|------|------------------|-------------|------------------|---------|
|  128 |              184 |        0.05 |               11 |     yes |
|  256 |               56 |        0.82 |              294 |     yes |
|  512 |               19 |        9.22 |             6972 |     yes |
| 1024 |               85 |       33.15 |            12497 |     yes |
| 2048 |              257 |       88.33 |            33317 |     yes |
| 4096 |             1124 |      324.66 |            61095 |     yes |
|------|------------------|-------------|------------------|---------|