Lecture 07: Locality of Reference
COSC 273: Parallel and Distributed Computing
Spring 2023
Coming Soon!
 Lab 02: Computing Shortcuts
 HPC cluster instructions
Outline
 Activity: Locality of Reference
 download lec07localityofreference.zip from website
 Computer Architecture, Oversimplified
 Computing Shortcuts
Two Stories
 Multithreaded performance
 embarrassingly parallel computation
 e.g., estimating $\pi$
 Multithreaded correctness
 e.g.,
Counter
example
 mutual exclusion (continued next week)
Today:
 Single-threaded performance!
 locality of reference
LocalAdder.java
The LocalAdder Class
Task. Create an array of random float values and compute their sum.
Two Solutions.
 Sum elements in sequential (linear) order
 linearIndex = [0, 1, ..., size-1]
 Sum elements in random order
 randomIndex stores shuffled indices
Two Implementations
Linear Sum:
float total = 0;
for (int i = 0; i < size; ++i) {
int idx = linearIndex[i];
total += values[idx];
}
return total;
Random Sum:
float total = 0;
for (int i = 0; i < size; ++i) {
int idx = randomIndex[i];
total += values[idx];
}
return total;
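The two loops above differ only in which index array they read through. A self-contained sketch of the whole experiment is below; the class name, shuffling code, and timing harness here are illustrative stand-ins, not the actual LocalAdder code from the zip.

```java
import java.util.Random;

// Hypothetical harness comparing linear vs. random summation order.
public class LocalitySketch {
    public static void main(String[] args) {
        int size = 10_000_000;
        Random rnd = new Random(42);

        float[] values = new float[size];
        int[] linearIndex = new int[size];
        int[] randomIndex = new int[size];
        for (int i = 0; i < size; i++) {
            values[i] = rnd.nextFloat();
            linearIndex[i] = i;
            randomIndex[i] = i;
        }
        // Fisher-Yates shuffle to produce a random access order.
        for (int i = size - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = randomIndex[i];
            randomIndex[i] = randomIndex[j];
            randomIndex[j] = tmp;
        }

        long start = System.nanoTime();
        float linearSum = sum(values, linearIndex);
        long linearTime = System.nanoTime() - start;

        start = System.nanoTime();
        float randomSum = sum(values, randomIndex);
        long randomTime = System.nanoTime() - start;

        // Both orders add up the same values; only the timings differ.
        System.out.printf("linear: %d ms, random: %d ms%n",
                linearTime / 1_000_000, randomTime / 1_000_000);
    }

    // One loop body serves both orders: the index array decides the pattern.
    static float sum(float[] values, int[] index) {
        float total = 0;
        for (int i = 0; i < index.length; i++) {
            total += values[index[i]];
        }
        return total;
    }
}
```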
Tester
AdderTester:
 computes linear sum
 computes random sum
 compares running times
Parameters:

 STEP: the step size between array sizes tested
 START: starting size value
 MAX: maximum size value
Activity
Run AdderTester for a wide range of sizes:
 1,000 – 10,000
 10,000 – 100,000
 100,000 – 1,000,000
 1,000,000 – 10,000,000
 10,000,000 – 100,000,000
Questions.

How do running times compare between linear/random access for smaller arrays? What about larger arrays?

How does running time scale with linear/random access?

Did you expect to see the trend you see?
How do running times compare?
Can you explain the trend?
Architecture, Less Oversimplified
Idealized Picture
Unfortunately
Computer architecture is not so simple!
 Accessing main memory (RAM) directly is costly
 ~100 CPU cycles to read/write a value!
 Use hierarchy of smaller, faster memory locations:
 caching
 different levels of cache: L1, L2, L3
 cache memory integrated into CPU $\implies$ faster access
A More Accurate Picture
How Memory is Accessed
When reading or writing:
 Look for symbol (variable) in successively deeper memory locations
 Fetch symbol/value into L1 cache and do manipulations here
 When a cache becomes full, push its contents to a deeper level
 Periodically push changes down the hierarchy
Memory Access Illustrated
Why Is Caching Done? Efficiency!
Why Caching Is Efficient
Heuristic:
 Most programs read/write to a relatively small number of memory locations often
 These values remain in low levels of the hierarchy
 Most commonly performed operations are performed efficiently
Why Caching is Problematic
Cache (in)consistency
 L1, L2 cache for each core
 Multiple cores modify same variable concurrently
 Only version stored in local cache modified quickly
 Same variable has multiple values simultaneously!
Takes time to propagate changes to values
 Shared changes only occur periodically!
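In Java, this cache (in)consistency is governed by the memory model: unless a shared variable is declared volatile (or accessed under synchronization), one core's write may sit in its local cache and never become visible to a thread spinning on another core. A minimal illustrative sketch (not from the lecture code):

```java
// Sketch: publishing a result across threads via a volatile flag.
public class VisibilitySketch {
    static volatile boolean done = false; // without `volatile`, the
    static int answer = 0;                // reader may spin forever

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            answer = 42;
            done = true; // volatile write also publishes `answer`
        });
        writer.start();
        while (!done) { /* spin until the write becomes visible */ }
        System.out.println(answer);
        writer.join();
    }
}
```

With `volatile` removed, the JVM is allowed to keep `done` cached per core, so the loop may never terminate; this is exactly the "same variable, multiple values" problem above.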
What Your Computer (Probably) Does
arr is a large array
On read/write of arr[i], search for arr[i] successively in
 L1 cache
 L2 cache
 L3 cache
 main memory
Copy arr[i] and surrounding values to L1 cache
 usually arr[i-a], ..., arr[i+a] ends up in L1
This block-copying process is called paging (for CPU caches, the copied blocks are cache lines)
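Because neighbors of arr[i] arrive in cache for free, stride-1 access reuses every fetched block, while a large stride wastes it. A small benchmark sketch (illustrative, not from the lab) that performs the same number of reads either way:

```java
import java.util.Arrays;

// Sketch: same total work, different access stride.
public class StrideSketch {
    // Visit every element, grouped by residue class mod `stride`,
    // so both patterns perform exactly arr.length reads.
    static long touch(int[] arr, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < arr.length; i += stride) {
                sum += arr[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] arr = new int[1 << 24]; // 16M ints = 64 MB, far bigger than cache
        Arrays.fill(arr, 1);

        // 16 ints = 64 bytes, a common cache-line size.
        for (int stride : new int[] {1, 16}) {
            long t0 = System.nanoTime();
            long sum = touch(arr, stride);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.printf("stride %2d: sum=%d, %d ms%n", stride, sum, ms);
        }
    }
}
```

On typical hardware the stride-16 pass is noticeably slower: each read pulls in a fresh cache line but uses only one value from it.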
Lab 02: Computing Shortcuts
A Network
A network consists of nodes and edges between nodes:
 nodes labeled $0, 1, \ldots, n$
 directed edges $(i, j)$ from $i$ to $j$ for each $i \neq j$
 edges $(i, j)$ have associated weight, $w(i, j) \geq 0$
 weight indicates cost or distance to move from $i$ to $j$
Shortcuts
What is cheapest path from 0 to 2?
A Problem
Given a network as above, for all $i \neq j$, find cheapest path of length (at most) 2 from $i$ to $j$
 weight of a path is sum of weight of edges
 convention: $w(i, i) = 0$
 a shortcut from $i$ to $j$ is a path $i \to k \to j$ where $w(i, k) + w(k, j) < w(i, j)$
Shortcut Distances
Computing Output
 Input $D = (d_{ij})$, where $d_{ij} = w(i, j)$
 Output $R = (r_{ij})$
 $r_{ij}$ = shortcut distance from $i$ to $j$
 computed by \(r_{ij} = \min_k (d_{ik} + d_{kj})\)
Example

\[D =
\left(
\begin{array}{ccc}
0 & 2 & 6\\
1 & 0 & 3\\
4 & 5 & 0
\end{array}
\right)\]

\[R =
\left(
\begin{array}{ccc}
0 & 2 & 5\\
1 & 0 & 3\\
4 & 5 & 0
\end{array}
\right).\]
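The formula \(r_{ij} = \min_k (d_{ik} + d_{kj})\) translates directly into a triple loop. The sketch below uses a raw 2d array rather than the lab's SquareMatrix class, so names and types here are illustrative:

```java
import java.util.Arrays;

// Sketch of the baseline shortcut computation over a raw 2d array.
public class ShortcutSketch {
    static float[][] shortcuts(float[][] d) {
        int n = d.length;
        float[][] r = new float[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // r[i][j] = min over all k of d[i][k] + d[k][j];
                // k = j recovers the direct edge since w(j, j) = 0.
                float min = Float.MAX_VALUE;
                for (int k = 0; k < n; k++) {
                    min = Math.min(min, d[i][k] + d[k][j]);
                }
                r[i][j] = min;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // The example matrix D from above.
        float[][] d = {
            {0, 2, 6},
            {1, 0, 3},
            {4, 5, 0}
        };
        // Only entry (0, 2) improves: 6 -> 5 via the shortcut 0 -> 1 -> 2.
        System.out.println(Arrays.deepToString(shortcuts(d)));
    }
}
```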
In Code
 Create a SquareMatrix object
 SquareMatrix stores a 2d array of floats called matrix
 matrix[i][j] stores \(w(i, j)\)
Your Assignment
Write a program that computes the shortcut matrix as quickly as possible!
 You’ll be given getShortcutMatrixBaseline()
 Your assignment is to optimize the code to write getShortcutMatrixOptimized()
Assignment Challenges
 Optimize memory access pattern for operations
 make access pattern linear, when possible
 Apply multithreading to get further speedup
 partition the problem into smaller parts
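One natural partition: since every entry r[i][j] is computed independently, each thread can own a contiguous band of rows with no locking needed. A hedged sketch (names are illustrative; adapt to the lab's SquareMatrix API):

```java
// Sketch: row-partitioned multithreaded shortcut computation.
public class ParallelShortcutSketch {
    static float[][] shortcuts(float[][] d, int numThreads) {
        int n = d.length;
        float[][] r = new float[n][n];
        Thread[] threads = new Thread[numThreads];
        int rowsPer = (n + numThreads - 1) / numThreads; // ceiling division

        for (int t = 0; t < numThreads; t++) {
            final int lo = t * rowsPer;
            final int hi = Math.min(n, lo + rowsPer);
            threads[t] = new Thread(() -> {
                // Each thread writes only rows lo..hi-1, so threads
                // never touch the same entry: no synchronization needed.
                for (int i = lo; i < hi; i++) {
                    for (int j = 0; j < n; j++) {
                        float min = Float.MAX_VALUE;
                        for (int k = 0; k < n; k++) {
                            min = Math.min(min, d[i][k] + d[k][j]);
                        }
                        r[i][j] = min;
                    }
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return r;
    }
}
```

Contiguous row bands (rather than interleaved rows) also keep each thread's reads and writes linear in memory, combining both optimizations the lab asks for.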
Payoff: optimized program will be 10s of times faster on your computer, 100s of times faster on HPC cluster!