Testing in the 1970s and 80s

Presented by Mike Machenry
Transcribed by Richard Cobbe

Program testing can approximate a proof of program correctness, although as Dijkstra famously said, "program testing can be used to show the presence of bugs, but never to show their absence." While this statement is certainly true in the general case, it has led some to conclude (in error) that software testing is useless. Furthermore, if we have an "ideal test," that is, a small subset of the program's input domain on which correct behavior guarantees correct behavior on the entire domain, we can prove a program correct through testing. The difficulty, of course, is finding the ideal test and proving it ideal.

Huang, in 1975, started reasoning about testing strategies by defining the decision-to-decision path. In a flowchart representation of a program, a decision-to-decision path is a path of execution that begins just after one decision node (or at the program's entry), ends at the next decision node (or at the exit), and passes through no decisions in between. He defines a set of test inputs to be "minimally thorough" if it covers every decision-to-decision path in the program. We can further use the conditionals at the start of these paths to write constraints that the test data must satisfy, thus helping us to write test cases.
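
To make this concrete, here is a small sketch in Python; the function, the inputs, and the derived constraints are my own illustration, not an example from Huang's paper.

    # A toy function with two decisions.  Covering all of its
    # decision-to-decision paths requires driving execution down
    # both branches of both conditionals.
    def classify(x, y):
        if x > 0:              # decision 1
            y = y + x
        if y % 2 == 0:         # decision 2
            return "even"
        return "odd"

    # Reading constraints off the conditionals (note that the first
    # branch rewrites y, so the second constraint mentions y + x):
    #   x > 0  and (y + x) % 2 == 0    e.g. (x, y) = (2, 2)
    #   x > 0  and (y + x) % 2 != 0    e.g. (x, y) = (2, 1)
    #   x <= 0 and y % 2 == 0          e.g. (x, y) = (0, 2)
    #   x <= 0 and y % 2 != 0          e.g. (x, y) = (0, 1)
    for x, y in [(2, 2), (2, 1), (0, 2), (0, 1)]:
        print((x, y), classify(x, y))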

There are, of course, some difficulties with this approach. First, some decisions may not be reachable, which results in unsolvable constraints. Second, assignment statements before conditionals complicate the process of solving constraints, since the constraint is over the current values of the variables rather than over the program's inputs. Jeff Palm suggested that if the code is written in SSA form (or, equivalently, we use a purely functional language), we no longer have this problem. Matthias responded that we can use this technique to find errors in programs written in the earlier teaching language levels in DrScheme, which are purely functional.
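
A small sketch of the complication, using my own example rather than one from the discussion:

    # To reach the "then" branch we must solve a constraint on the
    # current value of x, which is no longer the input value:
    def f(y):
        x = y + 1          # assignment before the test
        if x > 0:          # input constraint is y + 1 > 0, not x > 0
            return "pos"
        return "nonpos"

    # In SSA form every variable is assigned exactly once, so each
    # conditional reads directly as a constraint over prior definitions:
    #   x1 = y0 + 1
    #   branch on x1 > 0   ==>   constraint: y0 + 1 > 0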

We concluded our discussion of branch coverage with an example that demonstrates the method's incompleteness: writing test cases that exercise both branches of an if expression does not guarantee that the logic of the condition itself is correct. Someone expressed surprise that people ever believed branch-coverage testing to be complete, and Matthias clarified: researchers knew such holes existed and sought them out for the specific purpose of improving the methods.
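
A reconstruction of the kind of example in question; the function and its tests are my own illustration:

    # Both branches are exercised and both tests pass, yet the
    # condition is wrong: the specification calls for  x >= 100.
    def bonus(x):
        if x > 100:        # BUG: the boundary case x == 100 takes
            return 50      # the wrong branch
        return 0

    assert bonus(150) == 50    # exercises the true branch
    assert bonus(10) == 0      # exercises the false branch
    # bonus(100) returns 0 where the specification demands 50;
    # full branch coverage never forced us to test the boundary.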

Also in 1975, John Goodenough and Susan Gerhart attempt to formalize the notion of an `ideal' test suite. An ideal test T is a subset of the program's input domain such that if the program runs correctly on every element of T, then it runs correctly on every element of the entire domain. Given this definition, our main goal is to find an ideal test T for a given program. They do not give an algorithm for finding one, but they do attempt to characterize such a set and its properties.
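
In standard notation (my reconstruction, not the paper's own formalism), writing D for the input domain and OK(d) for "the program behaves correctly on input d", the definition is:

    \[ T \subseteq D \text{ is ideal} \iff
       \bigl[\, (\forall t \in T.\; OK(t)) \implies (\forall d \in D.\; OK(d)) \,\bigr] \]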

They define several predicates relating test suites T and sets of completeness criteria C (constraints much like those discussed by Huang above). Intuitively, these appear to be defined as follows:

Completeness
    T is complete against C if every constraint in C is satisfied by at least one test in T. Further, we require that every test in T satisfy at least one constraint in C, so as to prune out completely useless tests.
Successfulness
    T is successful if the program performs correctly on all tests in T.
Reliability
    The constraint set C is reliable if, for all test suites T1 and T2 that are complete against C, the program behaves the same on T1 as on T2: either both suites are successful or both reveal an error. (Note that in the original statement, the authors required that both T1 and T2 be successful, which is too strong a requirement.)
Validity
    A constraint set C is valid if every bug in the program is revealed by some test suite that is complete against C.
I stress that these are somewhat intuitive definitions; there was quite a lot of discussion about the paper's formalisms, and we finally concluded that all of them were opaque, and many were broken.

They conclude with what they call the Fundamental Theorem (stated without proof!): if there exists a test suite T and a set of constraints C such that T is complete against C, C is reliable, C is valid, and T is successful, then the program works correctly for all inputs in the domain.
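
In the notation introduced above, the theorem reads (again my reconstruction):

    \[ \bigl( \exists T \subseteq D.\; \exists C.\;
         COMPLETE(T, C) \wedge RELIABLE(C) \wedge VALID(C) \wedge SUCCESSFUL(T) \bigr)
       \implies \forall d \in D.\; OK(d) \]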

This leaves one core problem: how does one write these constraints on the test data?

In 1980, Budd et al. develop the idea of mutation testing, which is an attempt to mechanize the selection of these testing criteria. They define several `mutation operators' that perturb the program in small, well-defined ways: changing a variable reference to refer to a different variable, replacing one arithmetic operator with another, and so on. A test suite is complete, in their view, if it finds bugs in every such mutant, that is, if at least one test fails on each small perturbation of the original program.
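
Here is a minimal sketch of the idea in Python; the program under test, the test suite, and the single mutation operator are my own illustration, not the authors' system.

    import ast

    SRC = """
    def price(qty, unit):
        total = qty * unit
        if qty > 10:
            total = total - 5    # bulk discount
        return total
    """

    class SubToAdd(ast.NodeTransformer):
        # One mutation operator: replace the first '-' with '+'.
        def __init__(self):
            self.done = False

        def visit_BinOp(self, node):
            self.generic_visit(node)
            if isinstance(node.op, ast.Sub) and not self.done:
                self.done = True
                node.op = ast.Add()
            return node

    def suite_passes(price):
        try:
            assert price(1, 3) == 3      # below the discount threshold
            assert price(20, 1) == 15    # discount applies
            return True
        except AssertionError:
            return False

    for label, tree in [("original", ast.parse(SRC)),
                        ("mutant  ", SubToAdd().visit(ast.parse(SRC)))]:
        ast.fix_missing_locations(tree)
        env = {}
        exec(compile(tree, "<mutant>", "exec"), env)
        print(label, "survives" if suite_passes(env["price"]) else "killed")

A real system would enumerate many operators and many application sites, producing one mutant per (operator, site) pair.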

Unfortunately, there are two problems with this approach. First, it requires executing a large number of mutated programs; this is obviously not practical for testing large software systems. Second, it is predicated on the Competent Programmer Hypothesis, which (roughly speaking) requires that the program being tested is within epsilon of being correct. Errors much more complicated than straightforward typos are not likely to be caught by this method.

Finally, in 1981, Howden takes this one step further by defining classes of program operations in an attempt to reduce the number of mutations that we need to try. So, for instance, if we have a relational expression comparing an integer-valued expression with a constant, then we need only try at most two perturbations, and we can be sure that these will expose any bug in that expression.
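
As an illustration of the flavor of this result (my own example, not Howden's notation): for the predicate x < 10, the two test points x = 10 and x = 11 already distinguish < from every other relational operator.

    import operator

    ops = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
           ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

    C = 10
    for name, op in ops.items():
        print(name, tuple(op(x, C) for x in (C, C + 1)))
    # Only "<" yields (False, False); every other operator differs
    # from the original on at least one of the two points, so these
    # two inputs detect any substitution of the relational operator.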