# Lecture 07: Locality of Reference

## Coming Soon!

• Lab 02: Computing Shortcuts
• HPC cluster instructions

## Outline

1. Activity: Locality of Reference
• download lec07-locality-of-reference.zip from website
2. Computer Architecture, Oversimplified
3. Computing Shortcuts

## Two Stories

• embarrassingly parallel computation
• e.g., estimating $\pi$
• e.g., Counter example
• mutual exclusion (continued next week)

Today:

• locality of reference
• LocalAdder.java

## LocalAdder Class

Task. Create an array of random (float) values and compute their sum.

Two Solutions.

1. Sum elements in sequential (linear) order
• linearIndex = [0, 1,...,size-1]
2. Sum elements in random order
• randomIndex stores shuffled indices
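One way the two index arrays might be built (a sketch, not the provided LocalAdder code — the class name, the Fisher-Yates shuffle, and the seed parameter are assumptions):

```java
import java.util.Random;

public class IndexArrays {
    // linearIndex is simply 0, 1, ..., size-1 in order.
    public static int[] linearIndex(int size) {
        int[] idx = new int[size];
        for (int i = 0; i < size; i++) {
            idx[i] = i;
        }
        return idx;
    }

    // randomIndex is a uniformly shuffled permutation of the same indices
    // (Fisher-Yates shuffle, seeded for reproducibility).
    public static int[] randomIndex(int size, long seed) {
        int[] idx = linearIndex(size);
        Random rng = new Random(seed);
        for (int i = size - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);   // uniform in [0, i]
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return idx;
    }
}
```

Both arrays contain exactly the indices 0 through size-1; only the order differs, which isolates the memory access pattern as the sole variable in the experiment.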

## Two Implementations

Linear Sum:

    float total = 0;
    for (int i = 0; i < size; ++i) {
        int idx = linearIndex[i];
        total += values[idx];
    }


Random Sum:

    float total = 0;
    for (int i = 0; i < size; ++i) {
        int idx = randomIndex[i];
        total += values[idx];
    }


## Tester

AdderTester:

• computes linear sum
• computes random sum
• compares running times

Parameters:

• STEP the step size between array tests
• START starting size value
• MAX maximum size value
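A minimal sketch of what such a tester might do for a single array size (the real AdderTester ships with the lab zip; the class and method names here are hypothetical):

```java
import java.util.Random;

public class AdderTesterSketch {
    // Sum values in the order given by the index array. The same routine
    // serves both strategies: pass a linear or a shuffled index array.
    public static float sumByIndex(float[] values, int[] index) {
        float total = 0;
        for (int i = 0; i < index.length; i++) {
            total += values[index[i]];
        }
        return total;
    }

    // Time both strategies for one array size and print the results.
    public static void compare(int size) {
        Random rng = new Random(0);
        float[] values = new float[size];
        int[] linear = new int[size];
        int[] random = new int[size];
        for (int i = 0; i < size; i++) {
            values[i] = rng.nextFloat();
            linear[i] = i;
            random[i] = i;
        }
        // Fisher-Yates shuffle of the random index array.
        for (int i = size - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = random[i]; random[i] = random[j]; random[j] = t;
        }
        long t0 = System.nanoTime();
        sumByIndex(values, linear);
        long linNs = System.nanoTime() - t0;
        long t1 = System.nanoTime();
        sumByIndex(values, random);
        long rndNs = System.nanoTime() - t1;
        System.out.printf("size=%d linear=%dns random=%dns%n", size, linNs, rndNs);
    }
}
```

A full run would call compare(size) for size = START, START + STEP, ..., MAX.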

## Activity

Run AdderTester for a wide range of sizes:

• 1,000 – 10,000
• 10,000 – 100,000
• 100,000 – 1,000,000
• 1,000,000 – 10,000,000
• 10,000,000 – 100,000,000

Questions.

1. How do running times compare between linear/random access for smaller arrays? What about larger arrays?

2. How does running time scale with linear/random access?

3. Did you expect to see the trend you see?

# Architecture, Less Oversimplified

## Unfortunately

Computer architecture is not so simple!

• Accessing main memory (RAM) directly is costly
• ~100 CPU cycles to read/write a value!
• Use hierarchy of smaller, faster memory locations:
• caching
• different levels of cache: L1, L2, L3
• cache memory integrated into CPU $\implies$ faster access

## How Memory is Accessed

• Look for symbol (variable) in successively deeper memory locations
• L1, L2, L3, main memory
• Fetch symbol/value into L1 cache and do manipulations here
• When a cache becomes full, push its contents to a deeper level
• Periodically push changes down the hierarchy

## Why Caching Is Efficient

Heuristic:

• Most programs read/write to a relatively small number of memory locations often
• These values remain in low levels of the hierarchy
• Most commonly performed operations are performed efficiently

## Why Caching is Problematic

Cache (in)consistency

• L1, L2 cache for each core
• Multiple cores modify same variable concurrently
• Only the version stored in the local cache is modified quickly
• Same variable has multiple values simultaneously!

It takes time to propagate changes to values

• Shared changes only occur periodically!
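In Java, this propagation delay surfaces as the visibility problem: one thread's write may sit in a core-local cache where another thread cannot see it. A small sketch (VisibilityDemo is a hypothetical class): declaring the flag volatile forces reads and writes through shared memory, so the spinning thread is guaranteed to see the update.

```java
public class VisibilityDemo {
    // Without 'volatile', the reader thread may keep its cached copy of
    // 'done' and spin forever after the writer sets it. 'volatile' forces
    // every read/write of the field to go through shared memory.
    private static volatile boolean done = false;

    // Returns true if the reader thread observed the write within the timeout.
    public static boolean runDemo() throws InterruptedException {
        done = false;
        Thread reader = new Thread(() -> {
            while (!done) { }  // spin until the write to 'done' becomes visible
        });
        reader.start();
        Thread.sleep(50);  // give the reader time to start spinning
        done = true;       // publish the change
        reader.join(2000); // with volatile, this returns promptly
        return !reader.isAlive();
    }
}
```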

## What Your Computer (Probably) Does

Let arr be a large array.

On a read/write of arr[i], search for arr[i] successively in

• L1 cache
• L2 cache
• L3 cache
• main memory

Copy arr[i] and surrounding values to L1 cache

• usually arr[i-a], ..., arr[i+a] end up in L1

This process is called paging

## Performance Tuning

Be aware of your program’s memory access pattern

• reading values sequentially can be 10s of times faster than reading randomly or jumping around
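An illustration of that advice (a sketch with hypothetical names): both methods below compute the same sum over a 2-D array, but rowMajor walks each row's backing array sequentially, while colMajor jumps between rows on every access and defeats the cache's prefetching of neighboring elements.

```java
public class AccessPattern {
    // Sequential access: the inner loop walks one row's backing array in order.
    public static float rowMajor(float[][] a) {
        float total = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                total += a[i][j];
        return total;
    }

    // Strided access: same arithmetic, but each step jumps to a different row.
    public static float colMajor(float[][] a) {
        float total = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                total += a[i][j];
        return total;
    }
}
```

Both return identical results; only the traversal order, and hence the running time on large arrays, differs.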

# Lab 02: Computing Shortcuts

## Network

• nodes and edges between nodes
• nodes labeled $0, 1, \ldots, n$
• directed edges $(i, j)$ from $i$ to $j$ for each $i \neq j$
• edges $(i, j)$ have associated weight, $w(i, j) \geq 0$
• weight indicates cost or distance to move from $i$ to $j$

## Shortcuts

What is cheapest path from 0 to 2?

## A Problem

Given a network as above, for all $i \neq j$, find cheapest path of length (at most) 2 from $i$ to $j$

• weight of a path is sum of weight of edges
• convention: $w(i, i) = 0$
• a shortcut from $i$ to $j$ is a path $i \to k \to j$ where $w(i, k) + w(k, j) < w(i, j)$

## Representing Input

• $D = \left( \begin{array}{ccc} 0 & 2 & 6\\ 1 & 0 & 3\\ 4 & 5 & 0 \end{array} \right)$

## Computing Output

• $D = (d_{ij})$
• Output $R = (r_{ij})$
• $r_{ij}$ = shortcut distance from $i$ to $j$
• computed by $$r_{ij} = \min_k \left( d_{ik} + d_{kj} \right)$$
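The formula translates directly into a triple loop. A sketch using a raw float[][] rather than the lab's SquareMatrix class (the class and method names here are hypothetical):

```java
public class Shortcuts {
    // Direct translation of r_ij = min_k (d_ik + d_kj): for every pair (i, j),
    // try each intermediate node k. Because w(i, i) = 0, the cases k == i and
    // k == j recover the direct edge d_ij itself, so no special-casing is needed.
    public static float[][] shortcutMatrix(float[][] d) {
        int n = d.length;
        float[][] r = new float[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float best = d[i][j];
                for (int k = 0; k < n; k++) {
                    float via = d[i][k] + d[k][j];
                    if (via < best) best = via;
                }
                r[i][j] = best;
            }
        }
        return r;
    }
}
```

On the example matrix below, this reproduces R: e.g., r_02 = min(0+6, 2+3, 6+0) = 5.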

## Example

• $D = \left( \begin{array}{ccc} 0 & 2 & 6\\ 1 & 0 & 3\\ 4 & 5 & 0 \end{array} \right)$
• $R = \left( \begin{array}{ccc} 0 & 2 & 5\\ 1 & 0 & 3\\ 4 & 5 & 0 \end{array} \right).$

## In Code

• Create a SquareMatrix object
• SquareMatrix stores a 2d array of floats called matrix
• matrix[i][j] stores $w(i, j)$

Write a program that computes shortcut matrix as quickly as possible!

• You’ll be given
• getShortcutMatrixBaseline()
• Your assignment is to write an optimized version
• getShortcutMatrixOptimized()

## Assignment Challenges

1. Optimize memory access pattern for operations
• make access pattern linear, when possible
2. Apply multithreading to get further speedup
• partition the problem into smaller parts

Payoff: optimized program will be 10s of times faster on your computer, 100s of times faster on HPC cluster!
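One common way to tackle the multithreading challenge (a sketch, not the required design: the class name, NUM_THREADS, and the row-cyclic split are assumptions) is to give each thread a disjoint set of output rows, so no synchronization is needed beyond the final join:

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelShortcuts {
    static final int NUM_THREADS = 4;  // tune to your core count

    // Thread t computes rows t, t + NUM_THREADS, t + 2*NUM_THREADS, ...
    // Each thread writes a disjoint set of rows of r, so no locking is needed.
    public static float[][] shortcutMatrixParallel(float[][] d) throws InterruptedException {
        int n = d.length;
        float[][] r = new float[n][n];
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < NUM_THREADS; t++) {
            final int offset = t;
            Thread thread = new Thread(() -> {
                for (int i = offset; i < n; i += NUM_THREADS) {
                    for (int j = 0; j < n; j++) {
                        float best = d[i][j];
                        for (int k = 0; k < n; k++) {
                            float via = d[i][k] + d[k][j];
                            if (via < best) best = via;
                        }
                        r[i][j] = best;
                    }
                }
            });
            thread.start();
            threads.add(thread);
        }
        for (Thread th : threads) th.join();  // wait before reading r
        return r;
    }
}
```

Note the inner loops still read d sequentially by row, so this keeps the linear access pattern from challenge 1 while adding the parallelism of challenge 2.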