- Programming Assignment 01 posted
    - ignore the HPC cluster part of the assignment for Friday
    - accounts are registered, but there is no documentation yet
    - visit hpc.amherst.edu
    - ssh access: `[amherstid]@hpc.amherst.edu`

- First written assignment next Friday
    - posted this weekend

- Office Hours
    - TA (Mary Kate) Office Hours: Wednesday 7–9pm, SCCE C109
    - My individual OH: Thursday 1:00–2:30

- Lecture 03 Activity
- Parallelism vs Concurrency
- Embarrassingly Parallel Problem
- Limitations of Parallelism

```
void increment(int[] a) {
    int i = 0;
    while (i < a.length) {
        a[i] = a[i] + 1;
        i = i + 1;
    }
}
```

If `a = [0, 0, 0, 0]` and two threads each call `increment(a)` concurrently, what are the possible outcomes?

If `a = [0, 0, 0, 0]` and $k$ threads each call `increment(a)` concurrently, what are the possible outcomes?
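One way to explore the question experimentally is sketched below. The class name `IncrementRace` and the helper `runWithThreads` are hypothetical, introduced only for illustration; `increment` is the method from the notes. Because `a[i] = a[i] + 1` is a separate read and write, an update can be lost when threads interleave, so each entry should end up somewhere between 1 and $k$. A given run may or may not exhibit a lost update.

```java
import java.util.Arrays;

class IncrementRace {
    // Same method as in the notes: the update is a read followed by a write.
    static void increment(int[] a) {
        int i = 0;
        while (i < a.length) {
            a[i] = a[i] + 1;
            i = i + 1;
        }
    }

    // Hypothetical helper: k threads all call increment on one shared array.
    static int[] runWithThreads(int k, int length) {
        int[] a = new int[length];
        Thread[] threads = new Thread[k];
        for (int t = 0; t < k; t++) {
            threads[t] = new Thread(() -> increment(a));
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // wait for every thread to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return a;
    }

    public static void main(String[] args) {
        // Output can differ from run to run.
        System.out.println(Arrays.toString(runWithThreads(2, 4)));
    }
}
```

With an array this short, the threads often finish without overlapping, so repeated runs (or a much longer array) may be needed to see an entry smaller than $k$.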

**Concurrency**: performing multiple tasks that occupy overlapping time intervals

- E.g., I teach COSC 225 and COSC 273 concurrently

**Parallelism**: making progress on multiple tasks at the same time

- E.g., COSC 273 and MATH 410 are taught in parallel (MWF 10-10:50)
- parallel $\implies$ concurrent

Parallelism can give a performance boost

- performance is one focus of this class

Concurrency is necessary for basic functionality of computers

- cannot execute multiple programs without concurrency
- operating system typically handles issues of concurrency
- this is why you probably haven’t had to deal with concurrency directly before

Issues of nondeterminism exist for *concurrent* programs, not just parallel ones

```
public void increment() {
    ++count;
}
```

How could we **fix** the problem of mis-counting?

- Want every increment to count!

Each thread stores own private count!

- run threads until they’re done
- aggregate local counts when threads terminate
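A minimal sketch of the private-count idea follows. The class name `PrivateCounts`, the method `countWithPrivateCounters`, and its parameters are assumptions made for illustration, not code from the course. Each thread increments only its own local variable, writes its result to its own array slot once, and the main thread sums the slots after joining.

```java
class PrivateCounts {
    // Hypothetical helper: each thread keeps a private count; totals are
    // aggregated only after the threads have terminated.
    static long countWithPrivateCounters(int numThreads, int incrementsPerThread) {
        long[] localCounts = new long[numThreads];
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                long local = 0; // visible to this thread only
                for (int j = 0; j < incrementsPerThread; j++) {
                    ++local;
                }
                localCounts[id] = local; // single write to this thread's own slot
            });
            threads[t].start();
        }
        long total = 0;
        for (int t = 0; t < numThreads; t++) {
            try {
                threads[t].join(); // after join, the thread's write is visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += localCounts[t];
        }
        return total;
    }
}
```

No increment is lost because no two threads ever write the same variable. The cost is that the total is only available once every thread has finished.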

When might “easy” solution not be sufficient?

We’ll revisit this next week

A computational problem is **embarrassingly parallel** if it can be broken into many **simple** computations, (almost) all of which can be performed in parallel.

Area of a disk: $A = \pi r^2$

Pick a random point inside the framed region.

The *probability* the point lies in the disk is proportional to the disk’s area.

- area of disk is $\pi r^2$
- area of surrounding square is $(2 r)^2 = 4 r^2$
- the probability that a (uniformly) random point in the square lies in the disk is: $\frac{\text{area of disk}}{\text{area of square}} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}$

so…

…to estimate $\pi$, it suffices to estimate the probability that a random point in the square lies inside the disk:

- pick a bunch of random points
- see how many lie in disk
- $p = $ proportion of points that do
- $\pi \approx 4 p$
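The steps above can be sketched as a small multithreaded program. The names `MonteCarloPi`, `countHits`, and `estimatePi` are hypothetical, and the choice of the square $[-1, 1]^2$ with the unit disk ($r = 1$) is an assumption made for concreteness. Each thread samples its own points independently and reports a private hit count, which is why no coordination is needed until the final sum.

```java
import java.util.concurrent.ThreadLocalRandom;

class MonteCarloPi {
    // Counts how many of `samples` uniform points in [-1, 1]^2 land in the unit disk.
    static long countHits(long samples) {
        ThreadLocalRandom rng = ThreadLocalRandom.current(); // per-thread generator
        long hits = 0;
        for (long s = 0; s < samples; s++) {
            double x = rng.nextDouble(-1.0, 1.0);
            double y = rng.nextDouble(-1.0, 1.0);
            if (x * x + y * y <= 1.0) {
                hits++;
            }
        }
        return hits;
    }

    // Splits the sampling across numThreads threads and returns 4 * (hits / samples).
    static double estimatePi(int numThreads, long samplesPerThread) {
        long[] hits = new long[numThreads];
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> hits[id] = countHits(samplesPerThread));
            threads[t].start();
        }
        long totalHits = 0;
        for (int t = 0; t < numThreads; t++) {
            try {
                threads[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            totalHits += hits[t];
        }
        double totalSamples = (double) numThreads * samplesPerThread;
        return 4.0 * totalHits / totalSamples;
    }

    public static void main(String[] args) {
        // The estimate is random, so the printed value varies between runs.
        System.out.println(estimatePi(4, 1_000_000));
    }
}
```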

Example of **Monte Carlo method**

Why is Monte Carlo estimation embarrassingly parallel?

How much performance increase with $k$ cores?

- What if $k \approx$ number of samples taken?

Dependencies?

```
a1 = b1 + c1;
a2 = b2 + c2;
d = a1 * a2;
```

Dependency relation: directed acyclic graph (DAG)
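As a rough illustration of scheduling by dependencies, the sketch below treats `a1` and `a2` as two independent sums and `d` as their product, which depends on both. The class name `DependencyDag` and method `compute` are hypothetical; `CompletableFuture` is just one of several ways to express this in Java.

```java
import java.util.concurrent.CompletableFuture;

class DependencyDag {
    // a1 and a2 have no dependency on each other, so they may run in parallel;
    // d has edges from both a1 and a2, so it runs only after both complete.
    static int compute(int b1, int c1, int b2, int c2) {
        CompletableFuture<Integer> a1 = CompletableFuture.supplyAsync(() -> b1 + c1);
        CompletableFuture<Integer> a2 = CompletableFuture.supplyAsync(() -> b2 + c2);
        CompletableFuture<Integer> d = a1.thenCombine(a2, (x, y) -> x * y);
        return d.join(); // block until the final node of the DAG is done
    }
}
```

For work this small, the overhead of creating tasks far exceeds any benefit; the point is only the shape of the dependency graph.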

Consider a program that requires

- $N$ elementary operations
- $T$ time to run sequentially

Suppose

- a $p$-fraction of operations can be performed in parallel
- $1-p$ fraction must be performed sequentially

Question: how long could program take with $n$ parallel machines?

With $n$ parallel machines:

- perform the $p$-fraction of parallelizable ops in parallel on all $n$ machines
    - total time $\frac{T \cdot p}{n}$
- perform the remaining ops sequentially on a single machine
    - total time $T \cdot (1 - p)$

Total time: $T \cdot (1 - p) + T \cdot \frac{p}{n} = T \cdot \left(1 - p + \frac p n\right)$

The **speedup** is the ratio of the original time $T$ to the parallel time $T \cdot \left(1 - p + \frac p n\right)$:

- $S = \frac{1}{1 - p + \frac p n}$

This relation is called **Amdahl’s Law**

This is the best performance improvement possible **in principle**

- may not be achievable in practice!
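The two formulas above translate directly into code. This is a small sketch; the class name `Amdahl` and the method names are assumptions chosen for illustration.

```java
class Amdahl {
    // Parallel running time: T * (1 - p + p / n).
    static double parallelTime(double t, double p, int n) {
        return t * ((1.0 - p) + p / n);
    }

    // Speedup S = 1 / (1 - p + p / n); independent of T.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }
}
```

Two sanity checks on the formula: with $p = 0$ the speedup is 1 for every $n$, and with $p = 1$ the speedup is exactly $n$.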

1 person can chop 1 onion per minute

Recipe calls for:

- chop 6 onions
- saute onions for 4 minutes

Note:

- chopping onions can be done in parallel
- sauteing
    - takes 4 minutes no matter what
    - must be accomplished after chopping

How much can the cooking process be sped up by $n$ cooks?

- For one cook, $T = 6 + 4 = 10$ minutes
- Only chopping onions is parallelizable, so $p = 6 / 10 = 0.6$
- Amdahl’s Law:
    - $S = \frac{1}{1 - p + \frac{p}{n}} = \frac{1}{0.4 + \frac{0.6}{n}}$

- So:
    - $n = 2 \implies S \approx 1.43$
    - $n = 3 \implies S \approx 1.67$
    - $n = 6 \implies S = 2$

- Always have $S < 1 / (1 - p) = 2.5$, since $\frac{p}{n} > 0$ for every $n$

Improvement in speedup from each additional processor, relative to having one fewer:

- Second processor: 43%
- Third processor: 17%
- Fourth processor: 9%
- Fifth processor: 6%
- Sixth processor: 4%
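These percentages can be reproduced by comparing the speedup with $n$ processors to the speedup with $n - 1$, using $p = 0.6$ from the onion example. The class `MarginalGain` and its method names are hypothetical, introduced only for this sketch.

```java
class MarginalGain {
    // Amdahl's Law: S = 1 / (1 - p + p / n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Percent improvement in speedup when going from n - 1 to n processors.
    static double marginalGainPercent(double p, int n) {
        return 100.0 * (speedup(p, n) / speedup(p, n - 1) - 1.0);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 6; n++) {
            System.out.printf("processor %d: +%.0f%%%n", n, marginalGainPercent(0.6, n));
        }
    }
}
```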

How does latency $T$ scale with $n$?

- Adding more processors has *declining marginal utility*:
    - each additional processor has a smaller effect on total performance
    - at some point, adding more processors to a computation is wasteful
- Another consideration:
    - after parallel ops have been performed, extra processors are idle (potentially wasteful!)

The proportion of parallelizable operations $p$ is not always obvious from the problem statement

- Amdahl’s Law is a valuable heuristic for general phenomena:
    - an $n$-fold increase in parallel processing power does not typically give an $n$-fold speedup in computations
    - adding new parallel processors becomes less helpful the more parallel processors you already have

- Often helpful to think about scheduling subtasks (not individual operations)
- May have relationships between tasks (e.g., one must be performed before another)

Start **Mutual Exclusion**

- How can we fix our `Counter` to work as intended if we need to maintain a running count that can be accessed by multiple threads?