Vector API
In Java, int and float values are 32 bits long.
In modern CPUs, registers are larger.
int a = 573842;
int b = 3847253;
int c = a + b;
int a1 = 573842;
int b1 = 3847253;
int c1 = a1 + b1;
int a2 = 38657548;
int b2 = 438573;
int c2 = a2 + b2;
int[] a = new int[n];
int[] b = new int[n];
int[] c = new int[n];
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
Suppose we can load step values into each register:
int[] a = new int[n];
int[] b = new int[n];
int[] c = new int[n];
for (int i = 0; i < n; i += step) {
    c[i] = a[i] + b[i];
    c[i+1] = a[i+1] + b[i+1];
    // ... one line per remaining value in the register ...
    c[i+step-1] = a[i+step-1] + b[i+step-1];
}
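The blocked loop above silently assumes that n is a multiple of step. Below is a minimal runnable sketch of the same idea in plain Java, with a cleanup loop for the leftover entries. The class name BlockedAdd, the method blockedAdd, and the choice step = 4 are illustrative assumptions, and the inner loop is ordinary scalar code standing in for what a single vector instruction would do.

```java
import java.util.Arrays;

public class BlockedAdd {
    // Adds a and b entry-wise, processing `step` entries per outer iteration.
    // Assumes a and b have the same length and step >= 1.
    static int[] blockedAdd(int[] a, int[] b, int step) {
        int n = a.length;
        int[] c = new int[n];
        int bound = n - (n % step); // largest multiple of step that is <= n
        int i = 0;
        for (; i < bound; i += step) {
            // One "register's worth" of additions. With SIMD hardware these
            // `step` additions could be carried out by a single instruction.
            for (int lane = 0; lane < step; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Cleanup: the fewer-than-step entries that did not fill a block.
        for (; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        // step = 4: two full blocks (indices 0..7), then 2 cleanup entries.
        System.out.println(Arrays.toString(blockedAdd(a, b, 4)));
        // prints [11, 22, 33, 44, 55, 66, 77, 88, 99, 110]
    }
}
```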
float min = Float.MAX_VALUE;
for (int k = 0; k < size; ++k) {
    float x = matrix[i][k];
    float y = matrix[k][j];
    float z = x + y;
    if (z < min) {
        min = z;
    }
}
shortcuts[i][j] = min;
Question. How could we (maybe) speed this up with SIMD parallelism?
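One possible direction, offered as a sketch rather than the intended answer: the inner loop reads matrix[i][k] along a row but matrix[k][j] down a column, and vector loads want contiguous data. If the matrix is transposed first, both operands become rows, and each entry reduces to "minimum of a[k] + b[k] over two arrays". The code below is plain Java with no vector instructions. It only shows the data rearrangement. The names MinPlusRows, transpose, minPlus and shortcuts are hypothetical, and a square matrix is assumed.

```java
public class MinPlusRows {
    // Returns the transpose of a square matrix: column j of m becomes row j.
    static float[][] transpose(float[][] m) {
        int size = m.length;
        float[][] t = new float[size][size];
        for (int r = 0; r < size; r++) {
            for (int c = 0; c < size; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }

    // Minimum of a[k] + b[k] over all k; both arrays are scanned contiguously.
    static float minPlus(float[] a, float[] b) {
        float min = Float.MAX_VALUE;
        for (int k = 0; k < a.length; k++) {
            float z = a[k] + b[k];
            if (z < min) {
                min = z;
            }
        }
        return min;
    }

    // Computes result[i][j] = min over k of matrix[i][k] + matrix[k][j].
    static float[][] shortcuts(float[][] matrix) {
        int size = matrix.length;
        float[][] t = transpose(matrix); // t[j] is column j of matrix
        float[][] result = new float[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                result[i][j] = minPlus(matrix[i], t[j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] m = {
            {0f, 10f, 1f},
            {10f, 0f, 1f},
            {1f, 1f, 0f}
        };
        // Going from 0 to 1 via 2 costs 1 + 1 = 2, cheaper than the direct 10.
        System.out.println(shortcuts(m)[0][1]); // prints 2.0
    }
}
```

With the data laid out this way, minPlus is the only method that would need to change to use vector operations.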
The Vector API allows us to specify Vector objects.
A Vector is like a fixed-size array.
The Vector (bit) size can be set to match the hardware registers.
Notes:
The Vector API in Java 19 is available as an “incubator” feature.
Find entry-wise maximum of arrays:
VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
...
public static float[] vectorMax(float[] a, float[] b) {
    float[] c = new float[a.length];
    int step = SPECIES.length();
    int bound = SPECIES.loopBound(a.length);
    ...
}
Find entry-wise maximum of arrays:
...
    int i = 0;
    for (; i < bound; i += step) {
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.max(vb);
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = Math.max(a[i], b[i]);
    }
    return c;
}
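For reference, the two halves of vectorMax can be assembled into one compilable file. This is a sketch under assumptions: the class name VectorMaxDemo, the scalar baseline simpleMax, the array length and the random seed are all illustrative, and a and b are assumed to have the same length. It needs a JDK that ships the jdk.incubator.vector module and must be compiled and run with --add-modules jdk.incubator.vector.

```java
import java.util.Arrays;
import java.util.Random;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorMaxDemo {
    // Vector shape preferred by the current hardware.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Scalar baseline: entry-wise maximum, one element at a time.
    static float[] simpleMax(float[] a, float[] b) {
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    // Vector version: SPECIES.length() elements per iteration, then cleanup.
    static float[] vectorMax(float[] a, float[] b) {
        float[] c = new float[a.length];
        int step = SPECIES.length();
        int bound = SPECIES.loopBound(a.length); // multiple of step, <= length
        int i = 0;
        for (; i < bound; i += step) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            va.max(vb).intoArray(c, i);
        }
        for (; i < a.length; i++) { // leftover entries, fewer than step
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 1_000_003; // odd, so the cleanup loop is exercised
        Random random = new Random(42);
        float[] a = new float[n];
        float[] b = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = random.nextFloat();
            b[i] = random.nextFloat();
        }
        System.out.println("The FloatVector has " + SPECIES.length() + " lanes.");
        boolean same = Arrays.equals(simpleMax(a, b), vectorMax(a, b));
        System.out.println(same ? "The arrays are equal!" : "The arrays differ!");
    }
}
```

The lane count printed depends on the machine, so it is not fixed here.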
Hello, vectors!
The FloatVector has 8 lanes.
Computing max array with simple methods...
That took 625 ms.
Computing max array with vector methods...
That took 174 ms.
The arrays are equal!
Hello, vectors!
The FloatVector has 8 lanes.
Computing max array with simple methods...
That took 518 ms.
Computing max array with vector methods...
That took 66 ms.
The arrays are equal!
The Java Vector API is still an “incubator” feature.
Vector API
To use Vectors on your computer you must:
check your Java version: run javac --version from the command line to see the compiler version, and java --version to see the JRE version
include the Vector package in your program:
import jdk.incubator.vector.*;
compile and run, telling Java you’re using incubator features:
> javac --add-modules jdk.incubator.vector [files to compile]
> java --add-modules jdk.incubator.vector [program to run]
Vector API on HPC
Must load a module with the correct version of Java:
> module load amh-java/19.0.1
> javac --add-modules jdk.incubator.vector [files to compile]
> java --add-modules jdk.incubator.vector [program to run]
Better still: use sbatch, as in the homework assignments, with all of these commands in the test script!
To give an “accurate” measure of efficiency:
Input: float[] a of size n, float[] b of size n
Output: the minimum of a[i] + b[i] over i = 0 to n - 1
float min = Float.MAX_VALUE;
for (int i = 0; i < a.length; i++) {
    float x = a[i];
    float y = b[i];
    float z = x + y;
    if (z < min) {
        min = z;
    }
}
return min;
int step = SPECIES.length();
int bound = SPECIES.loopBound(a.length);
var mv = FloatVector.broadcast(SPECIES, Float.MAX_VALUE);
int i = 0;
for (; i < bound; i += step) {
    var va = FloatVector.fromArray(SPECIES, a, i);
    var vb = FloatVector.fromArray(SPECIES, b, i);
    mv = mv.min(va.add(vb));
}
float min = mv.reduceLanes(VectorOperators.MIN);
...
Cleanup:
float min = mv.reduceLanes(VectorOperators.MIN);
for (; i < a.length; i++) {
    float x = a[i];
    float y = b[i];
    float z = x + y;
    if (z < min) {
        min = z;
    }
}
return min;
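Put together, the vector loop and the cleanup loop can be checked against the simple version in a single file. This is a sketch under assumptions: the class name MinPlusDemo, the method names simpleMinPlus and vectorMinPlus, the array length and the random seed are illustrative, and a and b are assumed to have the same length and contain no NaN values. As with the other Vector API code, it must be compiled and run with --add-modules jdk.incubator.vector.

```java
import java.util.Random;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MinPlusDemo {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Scalar version: minimum of a[i] + b[i] over all i.
    static float simpleMinPlus(float[] a, float[] b) {
        float min = Float.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            float z = a[i] + b[i];
            if (z < min) {
                min = z;
            }
        }
        return min;
    }

    // Vector version: keeps one running minimum per lane, then combines them.
    static float vectorMinPlus(float[] a, float[] b) {
        int step = SPECIES.length();
        int bound = SPECIES.loopBound(a.length);
        // Every lane starts at Float.MAX_VALUE, like `min` in the scalar loop.
        FloatVector mv = FloatVector.broadcast(SPECIES, Float.MAX_VALUE);
        int i = 0;
        for (; i < bound; i += step) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            mv = mv.min(va.add(vb)); // lane-wise min of running min and a + b
        }
        // Collapse the per-lane minima into a single float.
        float min = mv.reduceLanes(VectorOperators.MIN);
        // Cleanup: entries past the last full vector.
        for (; i < a.length; i++) {
            float z = a[i] + b[i];
            if (z < min) {
                min = z;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        int n = 1_000_003; // odd, so the cleanup loop is exercised
        Random random = new Random(7);
        float[] a = new float[n];
        float[] b = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = random.nextFloat();
            b[i] = random.nextFloat();
        }
        float c = simpleMinPlus(a, b);
        float d = vectorMinPlus(a, b);
        System.out.println("c = " + c);
        System.out.println("d = " + d);
        System.out.println(c == d ? "The values are equal!" : "The values differ!");
    }
}
```

Each sum a[i] + b[i] is computed the same way in both versions, and taking a minimum does not round, so the two results should match exactly even though the vector version visits the sums in a different order.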
Vanilla vs Vector on HPC
The FloatVector has 8 lanes.
Computing min-plus with simple methods...
That took 654 ms.
Computing min-plus with vector methods...
That took 254 ms.
c = 0.0054750443
d = 0.0054750443
The values are equal!
Add vector instructions to your shortcut program!