Behavioral Non-Portability in Scientific Numeric Computing

The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware, even if these are fully compliant with the widely adopted IEEE 754 floating-point standard. Platform dependencies infringe on the highly desirable goal of software portability (which is in fact promised by heterogeneous computing frameworks like OpenCL): the same program run on the same inputs on different platforms can produce different results. Serious doubts on the portability of numeric applications arise when these differences are behavioral, i.e. when they lead to changes in the control flow of a program.

In this research we design an algorithm that takes a numeric procedure and determines an input that may lead to different branching decisions depending on how the arithmetic in the procedure is compiled. Our algorithm proceeds in two phases. The first phase determines a candidate input, such that minor variations of the value of exactly one of the input variables causes different branching decisions. In the second phase, the algorithm efficiently searches though all possible platform parameters such as decisions made by the compiler about evaluation order, and use or non-use of FMA hardware features. The goal is to find two sets of platform parameters that lead to different branching decisions.

Our implementation of the algorithm requires minimal intervention by the user. We illustrate its operation on a diverse set of examples, characteristic of scientific numeric computing, where control flow divergence actually occurs across different execution platforms.

Experiments

We test our algorithm on three procedures: (1) ray tracing, in the raySphere function there is a branching decision over the D variable (D> 0). The inputs are s, r, which represent the coordinates of the sphere origin and the ray, and radiusSq, a scalar value giving the square of the radius of sphere; (2) molecular dynamics, this open source code calculates the Lennard Jones potential of a molecular system. The energy is calculated based on pairwise distances of atoms and their velocity and acceleration. In the ComputeAccel function a decision branch occurs over (rr , rrCut). rr is the pairwise distance between atoms j1 and j2 and rrCut is the threshold for pairwise distances. The calculation of rr is directly effected by a density variable of the atom using a face centered cubic model; (3) long summation of 32 numbers, we compare serial C code which accumulates a value to a register, and a reduction kernel, written in OpenCL which is the common way to implement long summation on a parallel architecture. The result of the sum is compared to zero.

1. Generating Unstable Inputs

To test our algorithm, we first apply transformations to each procedure: we attach a main function that calls the tested procedure; we annotate the symbolic variables and volatile expressions by calling klee_make_symbolic_with_sort and klee_tag_reorderable. Following shows the ray tracing code after transformation.


//Dot Product 3-Vectors
float dot3(float *a, float *b){     
  float r = a[0] * b[0] + a[1] * b[1] 
           + a[2] * b[2];
  klee_tag_reorderable(&r, 0, 2);
  return r;
}

int raySphere(float *r, float *s, float radiusSq) {
  float A = dot3(r,r);                       
  float B = -2.0 * dot3(s,r);               
  float C = dot3(s,s) - radiusSq;          
  float D = B*B - 4*A*C;                       
  if (D > 0)
    return 0;
} 

int main() {
  float r[3], s[3];
  float radiusSq;
  klee_make_symbolic_with_sort(&r, sizeof(r), "r", 8, 32);
  klee_make_symbolic_with_sort(&s, sizeof(s), "s", 8, 32);
  klee_make_symbolic_with_sort(&radiusSq, sizeof(radiusSq), "radiusSq", 0, 32);
  return raySphere(r, s, radius);
}

For a conditional q in the execution path, there may exist multiple inputs that exhibit control flow instability. In our experiments we split the domian of one input variable into subintervals, and run our algorithm on each of these subintervals. In the table below it shows the experiment results.

Procedure	Variable	Domain	Subinterval length	#Unstable Inputs
ray tracing	s[0]	[-50, 50]	0.01	408
molecular dynamics	r_j1[0]	[0, 1]	0.1	12
long summation	a[0]	[0, 1]	0.01	100

2. Control Flow Divergence

To show control flow divergence, we took the generated sets of inputs and tested them on six different architectures. We convert the serial c code into OpenCL code so that we can run the same code and compare the results on these heterogeneous architectures. Below is the list of these architectures.

List of Machines
	type	Manufacture	Description	Year	FMA?
1	CPU	Intel	E5 2650	2012	*
2	CPU	AMD	A8-3850	2011	N
3	GPU	NVIDIA	GF108 Quadro 600	2010	N
4	GPU	NVIDIA	Tesla C2075	2011	Y
5	GPU	NVIDIA	Tesla K20	2013	Y
6	GPU	AMD	Radeon HD 65550D	2011	N

2.1 Ray Tracing Results

For the generated 408 sets of inputs. 45 of these produce results on different sides of zero for D on different architectures. In the table below some of these inputs are shown. D1, D2 and D3 show the value of D for D1: machine 1,2, D2: machine 3,4,5 and D3: machine 6.

s[0]	s[1]	s[2]	r[0]	r[1]	r[2]	radiusSq	D1	D2	D3
-33.3704719543457	-53	-52.0185546875	-33.999900817871	-54	-53	2.9802322E-8	4.54933548	-3.562728882	0.0
30.1333904266357	-41	-40.0234375	30.870059967041	-42	-41	1.907348633E-6	3.874450684	-7.142879486	0.0
40.0934867858886	-53	-52.0185546875	40.8501510620117	-54	-53	2.9802322E-8	7.814369202	-4.13470459	0.0
40.073875427246	-53	-52.0185546875	40.8301696777343	-53	-54	2.9802322E-8	-6.409370422	7.924546242	0.0
-27.4816207885742	-53	-52.0185546875	-27.9999465942382	-54	-53	2.9802322E-8	6.105501175	-6.985953331	0.0

2.2 Molecular Dynamics Results

Following table shows some of the results. R is the difference in pairwise distances of the atoms j1 and j2 to rrCut (rr-rrCut). R1,R2 shows result for machine 1,2,6 and machine 3,4,5 respectively.

r_j1[0]	r_j1[1]	r_j1[2]	r_j2[0]	r_j2[1]	r_j2[2]	Density	rrCut	R1	R2
0.50000011920928955	23.71043968200683594	22.29841232299804688	3.81469726562E-6	3.81469726562E-6	3.81469726562E-6	0.3	2.25	0.0	-2.38418579101562E-7
0.5078125	0.12553329765796661	7.8125E-3	23.9375	23.75	22.25	0.3	2.25	-2.38418579101562E-7	0.0
0.4000244140625	0.41066083312034607	26.125	6.25E-2	1.5625E-2	6.25E-2	0.2	1.44	5.72204590376657E-8	-6.19888305131155E-8
0.10009765625	27.23913192749023438	4.8828125E-4	27.21875	0.125	25.9453125	0.2	1.44	-6.19888305131155E-8	5.72204590376657E-8
0.2000005841255188	34.19741058349609375	1.81899E-12	0.18798828125	7.4505806E-9	33.69966888427734375	0.1	0.25	-1.49011611938476E-8	0.0
0.3125	6.25E-2	33.701171875	0.300048828125	2.464281208813E-2	6.103515625E-5	0.1	0.25	0.0	-1.49011611938476E-08

2.3 Long summation and Reduction Results

The 32 input numbers have a range between [-1.05,1]. Sum (1,2,3,5,6) shows the reduction results for machine 1,2,3,5,6 and sum 4 shows the reduction results for machine 4. Our code compared the result of the summation to 0. The reduction kernel and serial summation results on the inputs generated produce results on either side of zero.

Sum 1,2,3,5,6	Sum 4	Serial Sum
-4.65661287E-8	-8.7544322E-8	4.49363142E-8
-5.96046448E-8	-8.7544322E-8	1.054722816E-7
-5.96046448E-8	-8.7544322E-8	9.1502443E-8

People:

Faculty: Miriam Leeser, Thomas Wahl
Research assistant: Mahsa Bayati, Jaideep Ramachandran, Yijia Gu

Publications:

Miriam Leeser, Jaideep Ramachandran, Thomas Wahl, and Devon Yablonski. OpenCL floating point software on heterogeneous architectures- portable or not? In Workshop on Numerical Software Verification (NSV), 2012. [pdf ]
Miriam Leeser, Saoni Mukherjee, Jaideep Ramachandran, and Thomas Wahl. Make it real: Effective floating-point reasoning via exact arithmetic. In Design Automation and Testing in Europe (DATE), 2014.
Yijia Gu, Thomas Wahl, Mahsa Bayati, and Miriam Leeser. Behavioral non-portability in scientific numeric computing. In Parallel and Distributed Computing (EURO-PAR), pages 558--569, 2015. [pdf]

Sponsorship:

This material is based upon work supported by the National Science Foundation under Grant Number CCF-1218075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.