Vector API
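In Java, int and float values are 32 bits long.
In modern CPUs, vector (SIMD) registers are larger, typically 128, 256, or 512 bits, so one register can hold several ints or floats at once.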
int a = 573842;
int b = 3847253;
int c = a + b;
int a1 = 573842;
int b1 = 3847253;
int c1 = a1 + b1;
int a2 = 38657548;
int b2 = 438573;
int c2 = a2 + b2;
int[] a = new int[n];
int[] b = new int[n];
int[] c = new int[n];
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Suppose we can load step values into each register
int[] a = new int[n];
int[] b = new int[n];
int[] c = new int[n];
for (int i = 0; i < n; i += step) {
c[i] = a[i] + b[i];
c[i+1] = a[i+1] + b[i+1];
...
c[i+step-1] = a[i+step-1] + b[i+step-1];
}
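The chunked loop above only touches every index when n is a multiple of step. A minimal scalar sketch of the same access pattern, with a cleanup loop for the leftover entries, might look like the following. The names ChunkedAddSketch and addInChunks and the explicit step parameter are illustrative assumptions, and the inner loop merely stands in for what a single SIMD instruction would do on real hardware.

```java
import java.util.Arrays;

class ChunkedAddSketch {
    // Adds a and b entry-wise, handling `step` entries per outer iteration.
    // Assumes a.length == b.length and step >= 1 (hypothetical helper).
    static int[] addInChunks(int[] a, int[] b, int step) {
        int n = a.length;
        int[] c = new int[n];
        // Largest multiple of step that does not exceed n.
        int bound = n - (n % step);
        int i = 0;
        for (; i < bound; i += step) {
            // Stand-in for one SIMD add over `step` lanes.
            for (int lane = 0; lane < step; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Cleanup: the fewer-than-step entries that did not fill a chunk.
        for (; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        int[] b = {10, 20, 30, 40, 50, 60, 70};
        // prints [11, 22, 33, 44, 55, 66, 77]
        System.out.println(Arrays.toString(addInChunks(a, b, 4)));
    }
}
```

With 7 entries and step 4, the chunked loop handles indices 0 to 3 and the cleanup loop handles indices 4 to 6.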
float min = Float.MAX_VALUE;
for (int k = 0; k < size; ++k) {
float x = matrix[i][k];
float y = matrix[k][j];
float z = x + y;
if (z < min) {
min = z;
}
}
shortcuts[i][j] = min;
Question. How could we (maybe) speed this up with SIMD parallelism?
The Vector API allows us to specify Vector objects
A Vector is like a fixed-size array
Vector (bit) size can be set to the same size as the hardware registers
Notes:
Vector API in Java 19, available as an “incubator” feature
Find entry-wise maximum of arrays:
VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
...
public static float[] vectorMax(float[] a, float[] b) {
float[] c = new float[a.length];
int step = SPECIES.length();
int bound = SPECIES.loopBound(a.length);
...
}
Find entry-wise maximum of arrays (continued):
...
int i = 0;
for (; i < bound; i += step) {
var va = FloatVector.fromArray(SPECIES, a, i);
var vb = FloatVector.fromArray(SPECIES, b, i);
var vc = va.max(vb);
vc.intoArray(c, i);
}
for (; i < a.length; i++) {
c[i] = Math.max(a[i], b[i]);
}
return c;
}
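The two fragments above can be assembled into one file for experimentation. The sketch below is one possible assembly, not the course's reference program: the class name VectorMaxSketch, the scalar helper simpleMax, and the test data in main are assumptions added for illustration. It has to be compiled and run with --add-modules jdk.incubator.vector.

```java
import java.util.Arrays;
import java.util.Random;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class VectorMaxSketch {
    // Species whose lane count matches the widest vectors the CPU supports.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Entry-wise maximum using vector operations; assumes a.length == b.length.
    static float[] vectorMax(float[] a, float[] b) {
        float[] c = new float[a.length];
        int step = SPECIES.length();
        // Largest multiple of step that is <= a.length.
        int bound = SPECIES.loopBound(a.length);
        int i = 0;
        for (; i < bound; i += step) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            va.max(vb).intoArray(c, i);
        }
        // Scalar cleanup for the entries past the last full vector.
        for (; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    // Scalar reference version (hypothetical helper used for checking).
    static float[] simpleMax(float[] a, float[] b) {
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 1_000_003; // deliberately not a multiple of the lane count
        float[] a = new float[n];
        float[] b = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = rng.nextFloat();
            b[i] = rng.nextFloat();
        }
        System.out.println("Lanes: " + SPECIES.length());
        System.out.println("Equal: " + Arrays.equals(simpleMax(a, b), vectorMax(a, b)));
    }
}
```

Choosing an array length that is not a multiple of the lane count exercises both the vector loop and the scalar cleanup loop.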
Hello, vectors!
The FloatVector has 8 lanes.
Computing max array with simple methods...
That took 625 ms.
Computing max array with vector methods...
That took 174 ms.
The arrays are equal!
Hello, vectors!
The FloatVector has 8 lanes.
Computing max array with simple methods...
That took 518 ms.
Computing max array with vector methods...
That took 66 ms.
The arrays are equal!
Java Vector API is still an “incubator” feature
Vector API
To use Vectors on your computer you must:
run javac --version from the command line to see the compiler version
run java --version to see the JRE version
include the Vector package in your program:
import jdk.incubator.vector.*;
compile and run telling Java you’re using incubator features:
> javac --add-modules jdk.incubator.vector [files to compile]
> java --add-modules jdk.incubator.vector [program to run]
Vector API on HPC
Must load a module with the correct version of Java:
> module load amh-java/19.0.1
> javac --add-modules jdk.incubator.vector [files to compile]
> java --add-modules jdk.incubator.vector [program to run]
Better still:
sbatch as in homework assignments with all of these commands in the test script!
To give “accurate” measure of efficiency:
Input
float[] a, size n
float[] b, size n
Output
minimum of a[i] + b[i] over i = 0 to n - 1
float min = Float.MAX_VALUE;
for (int i = 0; i < a.length; i++) {
float x = a[i];
float y = b[i];
float z = x + y;
if (z < min) {
min = z;
}
}
return min;
int step = SPECIES.length();
int bound = SPECIES.loopBound(a.length);
var mv = FloatVector.broadcast(SPECIES, Float.MAX_VALUE);
int i = 0;
for (; i < bound; i += step) {
var va = FloatVector.fromArray(SPECIES, a, i);
var vb = FloatVector.fromArray(SPECIES, b, i);
mv = mv.min(va.add(vb));
}
float min = mv.reduceLanes(VectorOperators.MIN);
...
Cleanup:
float min = mv.reduceLanes(VectorOperators.MIN);
for (; i < a.length; i++) {
float x = a[i];
float y = b[i];
float z = x + y;
if (z < min) {
min = z;
}
}
return min;
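For reference, the scalar loop, the vector loop, and the cleanup step can be put together in a single file. This is a sketch under assumptions rather than the program that produced the timings in these slides: the class name MinPlusSketch, the method names simpleMinPlus and vectorMinPlus, and the data in main are all hypothetical. Compile and run it with --add-modules jdk.incubator.vector.

```java
import java.util.Random;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class MinPlusSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Scalar version: minimum of a[i] + b[i]; assumes a.length == b.length.
    static float simpleMinPlus(float[] a, float[] b) {
        float min = Float.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            float z = a[i] + b[i];
            if (z < min) {
                min = z;
            }
        }
        return min;
    }

    // Vector version: keeps one running minimum per lane, then reduces.
    static float vectorMinPlus(float[] a, float[] b) {
        int step = SPECIES.length();
        int bound = SPECIES.loopBound(a.length);
        // Every lane starts at MAX_VALUE, mirroring the scalar initial value.
        var mv = FloatVector.broadcast(SPECIES, Float.MAX_VALUE);
        int i = 0;
        for (; i < bound; i += step) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            mv = mv.min(va.add(vb));
        }
        // Collapse the per-lane minima into a single float.
        float min = mv.reduceLanes(VectorOperators.MIN);
        // Scalar cleanup for entries past the last full vector.
        for (; i < a.length; i++) {
            float z = a[i] + b[i];
            if (z < min) {
                min = z;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        int n = 1_000_003; // not a multiple of the lane count, so cleanup runs
        float[] a = new float[n];
        float[] b = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = rng.nextFloat();
            b[i] = rng.nextFloat();
        }
        float c = simpleMinPlus(a, b);
        float d = vectorMinPlus(a, b);
        System.out.println("c = " + c);
        System.out.println("d = " + d);
        System.out.println("Equal: " + (c == d));
    }
}
```

Because each sum a[i] + b[i] is computed the same way in both versions and taking a minimum involves no rounding, the two results are expected to match exactly for ordinary (non-NaN) inputs.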
Vanilla vs Vector on HPC
The FloatVector has 8 lanes.
Computing min-plus with simple methods...
That took 654 ms.
Computing min-plus with vector methods...
That took 254 ms.
c = 0.0054750443
d = 0.0054750443
The values are equal!
Add vector instructions to your shortcut program!