Behavioral Non-Portability in Scientific Numeric Computing


The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware, even if these are fully compliant with the widely adopted IEEE 754 floating-point standard. Platform dependencies infringe on the highly desirable goal of software portability (which is in fact promised by heterogeneous computing frameworks like OpenCL): the same program run on the same inputs on different platforms can produce different results. Serious doubts on the portability of numeric applications arise when these differences are behavioral, i.e. when they lead to changes in the control flow of a program.

In this research we design an algorithm that takes a numeric procedure and determines an input that may lead to different branching decisions depending on how the arithmetic in the procedure is compiled. Our algorithm proceeds in two phases. The first phase determines a candidate input, such that minor variations of the value of exactly one of the input variables causes different branching decisions. In the second phase, the algorithm efficiently searches though all possible platform parameters such as decisions made by the compiler about evaluation order, and use or non-use of FMA hardware features. The goal is to find two sets of platform parameters that lead to different branching decisions.

Our implementation of the algorithm requires minimal intervention by the user. We illustrate its operation on a diverse set of examples, characteristic of scientific numeric computing, where control flow divergence actually occurs across different execution platforms.

Experiments

We test our algorithm on three procedures: (1) ray tracing, in the raySphere function there is a branching decision over the D variable (D> 0). The inputs are s, r, which represent the coordinates of the sphere origin and the ray, and radiusSq, a scalar value giving the square of the radius of sphere; (2) molecular dynamics, this open source code calculates the Lennard Jones potential of a molecular system. The energy is calculated based on pairwise distances of atoms and their velocity and acceleration. In the ComputeAccel function a decision branch occurs over (rr , rrCut). rr is the pairwise distance between atoms j1 and j2 and rrCut is the threshold for pairwise distances. The calculation of rr is directly effected by a density variable of the atom using a face centered cubic model; (3) long summation of 32 numbers, we compare serial C code which accumulates a value to a register, and a reduction kernel, written in OpenCL which is the common way to implement long summation on a parallel architecture. The result of the sum is compared to zero.

1. Generating Unstable Inputs

To test our algorithm, we first apply transformations to each procedure: we attach a main function that calls the tested procedure; we annotate the symbolic variables and volatile expressions by calling klee_make_symbolic_with_sort and klee_tag_reorderable. Following shows the ray tracing code after transformation.


//Dot Product 3-Vectors
float dot3(float *a, float *b){     
  float r = a[0] * b[0] + a[1] * b[1] 
           + a[2] * b[2];
  klee_tag_reorderable(&r, 0, 2);
  return r;
}

int raySphere(float *r, float *s, float radiusSq) {
  float A = dot3(r,r);                       
  float B = -2.0 * dot3(s,r);               
  float C = dot3(s,s) - radiusSq;          
  float D = B*B - 4*A*C;                       
  if (D > 0)
    return 0;
} 

int main() {
  float r[3], s[3];
  float radiusSq;
  klee_make_symbolic_with_sort(&r, sizeof(r), "r", 8, 32);
  klee_make_symbolic_with_sort(&s, sizeof(s), "s", 8, 32);
  klee_make_symbolic_with_sort(&radiusSq, sizeof(radiusSq), "radiusSq", 0, 32);
  return raySphere(r, s, radius);
} 

For a conditional q in the execution path, there may exist multiple inputs that exhibit control flow instability. In our experiments we split the domian of one input variable into subintervals, and run our algorithm on each of these subintervals. In the table below it shows the experiment results.

Procedure Variable Domain Subinterval length #Unstable Inputs
ray tracing s[0] [-50, 50] 0.01 408
molecular dynamics r_j1[0] [0, 1] 0.1 12
long summation a[0] [0, 1] 0.01 100

2. Control Flow Divergence

To show control flow divergence, we took the generated sets of inputs and tested them on six different architectures. We convert the serial c code into OpenCL code so that we can run the same code and compare the results on these heterogeneous architectures. Below is the list of these architectures.

type Manufacture Description Year FMA?
1 CPU Intel E5 2650 2012 *
2 CPU AMD A8-3850 2011 N
3 GPU NVIDIA GF108 Quadro 600 2010 N
4 GPU NVIDIA Tesla C2075 2011 Y
5 GPU NVIDIA Tesla K20 2013 Y
6 GPU AMD Radeon HD 65550D 2011 N
List of Machines

2.1 Ray Tracing Results

For the generated 408 sets of inputs. 45 of these produce results on different sides of zero for D on different architectures. In the table below some of these inputs are shown. D1, D2 and D3 show the value of D for D1: machine 1,2, D2: machine 3,4,5 and D3: machine 6.


s[0] s[1] s[2] r[0] r[1] r[2] radiusSq D1 D2 D3
-33.3704719543457 -53 -52.0185546875 -33.999900817871 -54 -53 2.9802322E-8 4.54933548 -3.562728882 0.0
30.1333904266357 -41 -40.0234375 30.870059967041 -42 -41 1.907348633E-6 3.874450684 -7.142879486 0.0
40.0934867858886 -53 -52.0185546875 40.8501510620117 -54 -53 2.9802322E-8 7.814369202 -4.13470459 0.0
40.073875427246 -53 -52.0185546875 40.8301696777343 -53 -54 2.9802322E-8 -6.409370422 7.924546242 0.0
-27.4816207885742 -53 -52.0185546875 -27.9999465942382 -54 -53 2.9802322E-8 6.105501175 -6.985953331 0.0

2.2 Molecular Dynamics Results

Following table shows some of the results. R is the difference in pairwise distances of the atoms j1 and j2 to rrCut (rr-rrCut). R1,R2 shows result for machine 1,2,6 and machine 3,4,5 respectively.

r_j1[0] r_j1[1] r_j1[2] r_j2[0] r_j2[1] r_j2[2] Density rrCut R1 R2
0.50000011920928955 23.71043968200683594 22.29841232299804688 3.81469726562E-6 3.81469726562E-6 3.81469726562E-6 0.3 2.25 0.0 -2.38418579101562E-7
0.5078125 0.12553329765796661 7.8125E-3 23.9375 23.75 22.25 0.3 2.25 -2.38418579101562E-7 0.0
0.4000244140625 0.41066083312034607 26.125 6.25E-2 1.5625E-2 6.25E-2 0.2 1.44 5.72204590376657E-8 -6.19888305131155E-8
0.10009765625 27.23913192749023438 4.8828125E-4 27.21875 0.125 25.9453125 0.2 1.44 -6.19888305131155E-8 5.72204590376657E-8
0.2000005841255188 34.19741058349609375 1.81899E-12 0.18798828125 7.4505806E-9 33.69966888427734375 0.1 0.25 -1.49011611938476E-8 0.0
0.3125 6.25E-2 33.701171875 0.300048828125 2.464281208813E-2 6.103515625E-5 0.1 0.25 0.0 -1.49011611938476E-08

2.3 Long summation and Reduction Results

The 32 input numbers have a range between [-1.05,1]. Sum (1,2,3,5,6) shows the reduction results for machine 1,2,3,5,6 and sum 4 shows the reduction results for machine 4. Our code compared the result of the summation to 0. The reduction kernel and serial summation results on the inputs generated produce results on either side of zero.

Sum 1,2,3,5,6 Sum 4 Serial Sum
-4.65661287E-8 -8.7544322E-8 4.49363142E-8
-5.96046448E-8 -8.7544322E-8 1.054722816E-7
-5.96046448E-8 -8.7544322E-8 9.1502443E-8

People:

Publications:

Sponsorship:

This material is based upon work supported by the National Science Foundation under Grant Number CCF-1218075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
<National Science Foundation logo>