Whitebox Testing Is Overrated


Okay, so there is quite a lot of content on the internet about why you should strive for 100% code coverage – or about why that’s actually a bad thing. But I don’t see many people talk about where these ideas actually come from.

So for this blog post, I would like to give you a brief introduction to what blackbox and whitebox testing are, what they are trying to achieve, where their limits are, and why it might be a bad idea to rely solely on code coverage to determine the quality of a test suite.

The Scope of This Article

The whole subject of testing is incredibly diverse, so for this article I want to focus specifically on blackbox and whitebox testing – otherwise it would become a book ^^. It will certainly not be a comprehensive introduction to testing, but I will explain the basics that are necessary to follow along.

Just to get a few things out of the way: Most of what I’m going to talk about is only really relevant for dynamic testing (= the artifact is executed; i.e. not static code analysis, model checking, code reviews, …). Specifically, it only applies to scripted tests – as opposed to ad-hoc and exploratory tests. Regarding test levels, we are mainly on the unit test level, because that’s where whitebox testing is most interesting.

The 7 Testing Principles

The International Software Testing Qualifications Board (ISTQB) defines 7 principles about testing that are very useful when reasoning about tests (see the ISTQB CTFL Syllabus).

1. Testing shows the presence of defects, not their absence.

The point of tests is to find bugs. But tests alone cannot prove the correctness of a program. This differentiates testing from formal verification (e.g. Hoare calculus).

2. Exhaustive testing is impossible.

Testing exhaustively would mean testing every single input combination. It should be immediately obvious that this is not practically possible (except for some trivial cases). In fact, it’s not even theoretically possible: Let’s say we have a Python function that is supposed to calculate the absolute value of an integer. In Python, integers have arbitrary precision, so there is no defined upper bound for their values (see What’s New in Python 3.0).
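
To make this concrete, here is a tiny sketch (my_abs is just a made-up implementation for illustration): even a brute-force loop over two million inputs doesn’t get us anywhere near exhaustive, because there is always a bigger integer.

def my_abs(x: int) -> int:
    # Implementation under test.
    return x if x >= 0 else -x

# Two million checks - and still only a vanishing slice of the input
# space, since Python integers have no upper bound.
for x in range(-1_000_000, 1_000_001):
    assert my_abs(x) == abs(x)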

This is the key to why we need methods like blackbox and whitebox testing in the first place. We’ll talk more about this in the next section.

3. Early testing saves time and money.

The earlier we detect a bug, the less we need to change later on. As an extreme example, let’s assume something went wrong during requirements engineering – the defect is already in the specification. And because there was no feedback mechanism, it’s only detected during acceptance testing with the customer. In the worst case, we’d essentially have to start from scratch.

4. Defects cluster together.

The frequency of defects in a module often corresponds with the module’s complexity. Thus, if a module contains bugs, that’s probably a symptom of its complexity – and there are likely more bugs to discover there.

5. Beware of the pesticide paradox.

Essentially, repeating the same tests on the same artifact will at some point stop finding new defects. So instead of repeating tests, it might be better to come up with new ones. This applies especially to manual testing, since professional QA comes with a serious investment of time and money. That said, there is still value in repeating tests to prevent regressions.

6. Testing is context-dependent.

Different scenarios require a different approach to testing. Testing a web shop as if it’s a control system for a commercial airplane is a waste of resources. Similarly, different project management models might require different approaches (ISTQB, for example, offers special certifications for agile testers since the requirements are so different).

7. Absence-of-errors is a fallacy.

Even assuming the system is free of defects (which it probably is not – see principles 1 and 2) and was implemented perfectly according to a perfect specification, it might not be usable for the target audience. Just because we cannot find any issues does not necessarily mean the customer is happy.

How much is enough?

Okay, so especially principles 1 and 2 raise the question: How much testing is enough testing? We cannot test exhaustively anyway. So at some point we need to stop testing – unless we just want to flush money down the drain.

This is where the idea of blackbox and whitebox testing comes in. What these methods essentially do is provide us with metrics we can use to determine how many tests we need – and, more specifically: which tests we need. The objective is to cover as much of the potential behavior of the code as possible, while minimizing the number of test cases.

Blackbox Testing

As the name implies, blackbox testing assumes we know nothing about the implementation – it’s a black box. What we do know is how it should behave. Either from the specification, from acceptance criteria (maybe even some Gherkin scenarios in case we are doing Behavior-Driven Development – BDD), from a Doxygen description, or even just a function name. The ISO/IEC/IEEE 29119-4 standard calls this concept “specification-based test design”.

The main technique used in blackbox testing is Equivalence Partitioning. We divide the space of possible inputs into classes that we expect to behave similarly. Classes for regular behavior are called “positive”, special cases and errors are called “negative”. By dividing the input space into partitions that behave similarly, we only need one test case per partition. In fact, we can often combine multiple positive classes into one test case.

Let’s illustrate how this works: Assume we want to build address validation for our web shop.

Our address object has the following fields (this also serves as the specification for our example):

  • Address line 1, must always be present.
  • Address line 2, may be missing.
  • City, must be present.
  • Postal code, must be present.
  • Country code, must be present and must be a valid ISO country code.

Now we can go through the input fields and think about what kinds of cases can occur.

Class | positive/negative | Input description
------|-------------------|----------------------------
1     | positive          | Address line 1 is present.
2     | negative          | Address line 1 is missing.
3     | positive          | Address line 2 is present.
4     | positive          | Address line 2 is missing.
5     | positive          | City is present.
6     | negative          | City is missing.
7     | positive          | Postal code is present.
8     | negative          | Postal code is missing.
9     | positive          | Country code valid.
10    | negative          | Country code missing.
11    | negative          | Country code invalid.

Notice: Class 4 is positive, since address line 2 is not mandatory. Theoretically, we could skip this case, since both partitions of address line 2 behave the same. However, I’d argue it makes sense to include it, since it’s explicitly stated in our specification.

Now that we have our classes, we can think about what the test cases could look like. As stated earlier, we can combine positive classes as long as there are no contradictory inputs. In this example, we need two positive test cases, since classes 3 and 4 are mutually exclusive.

Here is how the test cases could look:

Case | Equivalence Classes | pos/neg  | Description
-----|---------------------|----------|------------------------------------------------------
1    | 1, 3, 5, 7, 9       | positive | All inputs present and valid.
2    | 1, 4, 5, 7, 9       | positive | All inputs except address line 2 present and valid.
3    | 2, 3, 5, 7, 9       | negative | Everything is valid, but address line 1 is missing.
4    | 1, 3, 6, 7, 9       | negative | Everything is valid, but city is missing.
5    | 1, 3, 5, 8, 9       | negative | Everything is valid, but postal code is missing.
6    | 1, 3, 5, 7, 10      | negative | Everything is valid, but country code is missing.
7    | 1, 3, 5, 7, 11      | negative | Everything is valid, except country code.

The resulting test cases were generated purely from the specification, without looking at the implementation at all. Still, we can be fairly confident that most, if not all, of the possible behavior of the system has been tested.
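
Translated into code, the suite could look roughly like this – a sketch in Python/pytest, with a minimal stand-in validator so the example is self-contained (the real implementation would live elsewhere):

import pytest

# Stand-in implementation, invented for illustration.
VALID_COUNTRY_CODES = {"US", "DE", "FR"}  # stand-in for the full ISO list

def validate_address(addr: dict) -> None:
    for field in ("address_line_1", "city", "postal_code", "country_code"):
        if not addr.get(field):
            raise ValueError(f"{field} is missing")
    if addr["country_code"] not in VALID_COUNTRY_CODES:
        raise ValueError("invalid country code")

VALID = {
    "address_line_1": "123 Main St",
    "address_line_2": "Apt 4",
    "city": "Springfield",
    "postal_code": "12345",
    "country_code": "US",
}

def test_case_1_all_valid():                      # classes 1, 3, 5, 7, 9
    validate_address(VALID)

def test_case_2_no_address_line_2():              # classes 1, 4, 5, 7, 9
    validate_address({**VALID, "address_line_2": None})

@pytest.mark.parametrize("field", [
    "address_line_1",  # case 3, class 2
    "city",            # case 4, class 6
    "postal_code",     # case 5, class 8
    "country_code",    # case 6, class 10
])
def test_mandatory_field_missing(field):
    with pytest.raises(ValueError):
        validate_address({**VALID, field: None})

def test_case_7_invalid_country_code():           # classes 1, 3, 5, 7, 11
    with pytest.raises(ValueError):
        validate_address({**VALID, "country_code": "XX"})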

Dealing With Edge Cases

Boundary Value Analysis is an extension of Equivalence Partitioning. The idea is to choose test cases on the boundaries between classes, to test if edge cases are handled correctly. If there is more than one edge case, we just add additional test cases.

Let’s look at another example: We have a function that calculates the reciprocal of a double value x.

The value at 0 is undefined, so we have two classes: A negative one for x == 0, and a positive one for x != 0.

When looking at the edge cases, we can see that there are actually two different edge cases for the positive class: x = Double.MIN_VALUE and x = -Double.MIN_VALUE. Therefore, we have three resulting test cases.

It should be noted that we could just as well have chosen three classes: x > 0, x < 0 and x == 0. In that case, there would have been only one test case per class – so, still three in total.
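
In test code, the three cases might look like this (a Python sketch; sys.float_info.min stands in for the smallest positive double here):

import sys
import pytest

def reciprocal(x: float) -> float:
    # Minimal implementation under test.
    if x == 0:
        raise ValueError("unable to calculate reciprocal of 0")
    return 1 / x

EPS = sys.float_info.min  # smallest normalized positive double

def test_boundary_zero():
    with pytest.raises(ValueError):
        reciprocal(0.0)

def test_boundary_smallest_positive():
    assert reciprocal(EPS) == 1 / EPS

def test_boundary_smallest_negative():
    assert reciprocal(-EPS) == 1 / -EPS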

Limits of Blackbox Testing

Blackbox testing entirely ignores the source code of the program. It assumes the code is built “optimally” – in the sense that there is no logic beyond what is actually specified.

Take a look at the following implementation of the reciprocal function from earlier:

func reciprocal(x float64) (float64, error) {
   if x == 0 {
       return 0, errors.New("unable to calculate reciprocal of 0")
   }

   if x == 42 {
       // 42 is already the answer
       return x, nil
   } else {
       return 1 / x, nil
   }
}

In this case, blackbox testing will likely not be able to find the “bug”, because it’s outside the specification.
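
A spec-derived suite would pass this implementation without complaint – none of the partitions or boundaries has any reason to single out exactly 42. Here is a sketch of that (mirroring the Go code above in Python):

import pytest

def reciprocal(x: float) -> float:
    # Python mirror of the Go implementation above, including the "bug".
    if x == 0:
        raise ValueError("unable to calculate reciprocal of 0")
    if x == 42:
        return x  # 42 is already the answer
    return 1 / x

def test_zero_partition():
    with pytest.raises(ValueError):
        reciprocal(0.0)

def test_regular_partition():
    assert reciprocal(2.0) == 0.5
    assert reciprocal(-4.0) == -0.25
    # x == 42 is never chosen, so the hidden special case survives.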

Whitebox Testing

By contrast, whitebox testing selects the test cases solely based on the source code (of course, we still need to verify the outputs against the specification – otherwise the results would be meaningless). The ISO/IEC/IEEE 29119-4 standard calls this concept “structure-based test design”.

The main method for deriving test cases is the use of (code) coverage criteria: We add test cases to “cover” previously “uncovered” parts of the code, while trying to minimize the number of test cases. What exactly that means depends on the criterion used. (According to the ISO/IEC/IEEE definition, “coverage” can also relate to blackbox testing – how much of the equivalence partitions is tested. But in everyday language, “coverage” means whitebox testing. See ISO/IEC/IEEE 29119-4.)

Control-Flow-Based Coverage

The most common coverage criteria are concerned with the control flow – which parts of the program have been visited by the test suite?

  • Statement Coverage (also sometimes called Line Coverage) is satisfied if every statement is executed at least once. This is what most coverage tools will report, and what people usually mean when they say “100% test coverage”.
  • Decision Coverage (also Branch Coverage) is satisfied when each branch of the program is taken at least once. It’s similar to statement coverage; the difference is that statement coverage ignores empty branches (as they don’t contain any statements), while decision coverage does not (see the sketch after this list).
  • Condition Coverage (sometimes Predicate Coverage) is satisfied when each condition – think atomic boolean expression – evaluates to false at least once and to true at least once. It’s kind of like decision coverage on steroids: we have to test multiple ways of entering a branch. There is even another layer on top, called Modified Condition/Decision Coverage, which additionally requires each condition to independently affect the outcome of its decision.
  • Path Coverage is satisfied when all possible paths through the program are visited. This is only possible for trivial programs, since loops potentially create an infinite number of possible paths.
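
The gap between the first two criteria is easy to demonstrate with a made-up example: a single test can execute every statement while never exercising the empty “else” branch.

def clamp_discount(percent: int) -> int:
    if percent > 100:
        percent = 100
    # (implicit empty "else" branch)
    return percent

def test_clamp_above_limit():
    # Executes every statement -> 100% statement coverage.
    # But the decision "percent > 100" is never evaluated to False,
    # so decision coverage is NOT satisfied.
    assert clamp_discount(150) == 100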

Data-Flow-Based Coverage

Another, less commonly used approach is to use the flow of data through the program as the basis for coverage.

First, we need to categorize the operations we perform on data in a more generic way: Variable assignments are called “defs”. Whenever an assigned value is used to calculate a new value, that calculation is called a “c-use” of the “def”. Similarly, when it is used in a predicate, the predicate is called a “p-use”.

Here are some common data-flow-based coverage criteria:

  • all-defs: Every variable assignment is used at least once.
  • all-c-uses: For every variable assignment, each of its c-uses is visited at least once.
  • all-p-uses: For every variable assignment, each of its p-uses is visited at least once.
  • all-c-some-p-uses: For every variable assignment, each of its c-uses is visited at least once. If there is none, at least one p-use is visited.
  • all-p-some-c-uses: For every variable assignment, each of its p-uses is visited at least once. If there is none, at least one c-use is visited.

(There are, of course, even more, but for the purposes of this article, these will do.)

All this might sound complicated, but it actually isn’t. Let’s look at an example:

//                      def-1
//                        |
unsigned fibonacci(unsigned target) {
   unsigned current = 1; // def-2
   unsigned last = 1;    // def-3

   //      def-4       p-use-1  def-5 & c-use-1
   //        |            |          |
   for(unsigned i = 2; i <= target; i++) {

      unsigned tmp = current;   // def-6 & c-use-2
      current = current + last; // def-7 & c-use-3
      last = tmp;               // def-8 & c-use-4
   }

   return current;              // c-use-5
}

It’s important to note that we always think in terms of individual variable assignments. In this example, p-use-1 is actually a p-use of def-1 and def-4, but also of def-5. So, if we want all-p-uses coverage, we need at least one test case that iterates the loop – otherwise def-5 is never used. Similarly, c-use-3 is a c-use of def-2 and def-3, but also of def-7 and def-8.

Direct assignments (c-use-2, c-use-4) and return statements (c-use-5) act as c-uses.
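
A test suite targeting all-p-uses therefore needs at least one input that skips the loop and one that iterates it – here is a sketch against a Python port of the function above:

def fibonacci(target: int) -> int:
    # Python port of the C++ function above.
    current, last = 1, 1
    i = 2                   # def-4
    while i <= target:      # p-use of def-1, def-4 and def-5
        current, last = current + last, current
        i += 1              # def-5
    return current

def test_loop_skipped():
    # target < 2: the condition is False right away - only def-4
    # reaches the p-use; def-5 never executes.
    assert fibonacci(1) == 1

def test_loop_iterated():
    # target >= 2: i += 1 (def-5) executes, and its value reaches the
    # loop condition (p-use-1) on the next check.
    assert fibonacci(4) == 5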

Limits of Whitebox Testing

The problem we had with blackbox testing was that the code might do things that are not specified, and therefore not tested. A similar problem occurs with whitebox testing: We derive our test cases purely from the implementation. So, if parts of the specification are not implemented, we are probably not going to test them.

Additionally, depending on which coverage criterion we use, some obvious bugs might not be found.

To give an example: Let’s assume we want to implement the power function for integers. There are two parameters: base and exponent. If the base and the exponent are both 0, the result is undefined – we return an error. And, since we are only using integers, we also return an error when the exponent is negative.

Let’s say, we want to use the following coverage criteria:

  • Statement Coverage
  • Decision Coverage
  • all-defs
  • all-c-some-p-uses

Our code looks like this:

//         def-1        def-2
//           |            |
fn power(base: i64, exponent: i64) -> Result<i64, &'static str> {
   //      p-use-1        p-use-2       p-use-3
   //         |              |             |
   if  exponent < 0 || (base == 0 && exponent != 0) {
      return Err("invalid arguments");
   }
   // empty branch for decision coverage
   // else {}
    
   // def-3
   let mut result = 1;
    
   // p-use-4
   for _ in 0..exponent {
      // def-4 c-use-1 c-use-2
      //  |       |       |
      result = result * base;
   }
   // empty branch for decision coverage
   // "else" { } 
    
   // c-use-3
   return Ok(result)
}

Let’s start with the following test suite:

  • base = 2, exponent = 10
  • base = 1, exponent = -1

Looking at Statement Coverage: The first test case covers everything except the first return statement, including the body of the for-loop. The second test case covers the first return statement. Therefore, Statement Coverage is satisfied.

How about Decision Coverage? The first test case follows the virtual “else” branch of the if-condition, and later enters the loop. The second test case enters the if-branch and immediately returns. There is, however, another branch we have not yet visited: The virtual “else” branch of the for-loop. (In this case, “else” means the loop is never entered.)

Following the whitebox testing method, we would now add a new test case to satisfy Decision Coverage. Let’s choose: base = 10, exponent = 0

This third test case enters the first “else” branch, and does not enter the loop – it takes the loop’s “else” branch instead. Decision Coverage is satisfied.

Let’s take a look at all-defs: In the first test case, def-1 is used in p-use-2 and c-use-2, def-2 is used in p-use-1 and p-use-4, def-3 is used in c-use-1, and def-4 is used in c-use-3. So, all-defs is satisfied.

What about all-c-some-p-uses? In the first test case, def-1 is used in c-use-2. def-2 doesn’t have any c-uses (theoretically, it has an implicit c-use in the for-loop, but for this example that doesn’t matter), but it is used in p-use-4. def-3 is used in c-use-1, and def-4 is used in c-use-3. Thus, all-c-some-p-uses is also satisfied.

Great. We used a whole array of coverage criteria, both control-flow-based and data-flow-based – 100% coverage. Surely we tested enough. The program has to be free of defects, right?

Well, here’s the problem: There is an obvious bug in the program. The last comparison in the if-condition should be == instead of !=.

Now, to be fair: I selected both the example and the criteria specifically to illustrate this pitfall. Using Condition Coverage or all-p-uses would have caught the bug. But the thing is: So would blackbox testing – without complex analysis, and potentially with fewer test cases on top of that.
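
To drive the point home, here is the whole walkthrough as a runnable sketch (a Python mirror of the Rust code): the three coverage-derived tests all pass, while a single spec-derived negative test for 0^0 exposes the bug.

import pytest

def power(base: int, exponent: int) -> int:
    # Python mirror of the Rust code above, including the != bug.
    if exponent < 0 or (base == 0 and exponent != 0):
        raise ValueError("invalid arguments")
    result = 1
    for _ in range(exponent):
        result = result * base
    return result

# The three whitebox test cases - all of them pass:
def test_regular():
    assert power(2, 10) == 1024

def test_negative_exponent():
    with pytest.raises(ValueError):
        power(1, -1)

def test_exponent_zero():
    assert power(10, 0) == 1

# A blackbox tester derives "0^0 is undefined" straight from the
# specification - and this test fails on the buggy implementation:
def test_zero_to_the_zero_is_an_error():
    with pytest.raises(ValueError):
        power(0, 0)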

Test-Coupling

In the OOP world, encapsulation means hiding the internal structure from the outside. The idea is to reduce coupling and make refactoring easier.

With whitebox testing, we run into a very similar problem: The tests are extremely tightly coupled to the code. If we refactor the code – even without changing the interfaces – we might, in the worst case, need to rebuild the test suite from scratch. The alternative is to add more test cases to bring the coverage back up. Which means there might end up being a bunch of completely redundant test cases in the suite that don’t actually test anything specific.

(Remember principle 5: Beware of the pesticide paradox.)

I’m not certain if there’s already a term in use for this – I’m not aware of any. So, I’ll coin one now: “Test Coupling”.
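
Here is what test coupling can look like in practice – a contrived Python sketch (the Shipping class and its _lookup_zone helper are invented for illustration):

from unittest.mock import patch

class Shipping:
    # Contrived class-under-test.
    def _lookup_zone(self, country: str) -> str:
        return "EU" if country in ("DE", "FR") else "WORLD"

    def shipping_cost(self, country: str, weight_kg: float) -> float:
        zone = self._lookup_zone(country)
        return 5.0 * weight_kg if zone == "EU" else 12.0 * weight_kg

def test_cost_uses_zone_lookup():
    # Whitebox-style: this test pins down HOW the cost is computed.
    shipping = Shipping()
    with patch.object(shipping, "_lookup_zone", return_value="EU") as lookup:
        shipping.shipping_cost("DE", weight_kg=1.0)
        lookup.assert_called_once_with("DE")

# If _lookup_zone is inlined, renamed, or split during a refactoring,
# this test breaks even though shipping_cost still behaves exactly as
# specified. A blackbox test would only assert on the returned cost.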

Regarding 100% Code Coverage

Some people claim that code coverage of 100% should be the goal – usually this means statement coverage. There are also standards, especially in aeronautics and astronautics, such as ECSS-E-ST-40C, that require 100% on certain coverage metrics, depending on the criticality of the measured software.

The problem with this is that it’s at odds with the goal of unit testing – which is where whitebox testing is most interesting. During unit testing, we focus on single units, not on the glue code between units – that is checked at the integration test level. In other words: Unit tests focus only on business logic. Requiring 100% coverage, however, would force us to test things that are not business logic.

As an example: Let’s say we have a value class in Java that represents immutable objects. We have getters for all the fields (*sigh* – see People Don’t Understand OOP). These getters don’t contain any logic at all. However, to reach 100% statement coverage, we’d need to write tests for them. Which doesn’t provide any real benefit – on the contrary, it increases coupling.
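
For illustration, here is the same situation as a Python sketch (a hypothetical Money value class): the getters contain no logic, so a test for them merely restates language semantics.

class Money:
    """Immutable value object - no business logic anywhere."""

    def __init__(self, amount: int, currency: str):
        self._amount = amount
        self._currency = currency

    @property
    def amount(self) -> int:
        return self._amount

    @property
    def currency(self) -> str:
        return self._currency

# Needed only to push statement coverage to 100%:
def test_getters():
    money = Money(100, "EUR")
    assert money.amount == 100      # verifies... that assignment works
    assert money.currency == "EUR"  # and couples the test to field names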

One solution for this is to simply ignore certain parts of the code when calculating code coverage. The ISO/IEC/IEEE 29119-4 standard explicitly defines coverage in a way that allows “impracticable” test items to be excluded. If we are a bit generous regarding the definition of “impracticable”, this might actually be okay (not really, but I won’t tell if you don’t). ^^

In fact, a lot of coverage tools and quality gates already provide features out of the box (dadum tsss) to ignore specific files when calculating code coverage. Not a clean solution, but meh, good enough.

Code Coverage in Blackbox Testing

An approach that I’ve been recommending for some time now is to use blackbox testing, but to also incorporate some sort of coverage metric (any will do). The reasoning is as follows:

The test suite is derived from the specification. Assuming the structure of the code is reasonable, we can expect pretty high code coverage as well. If all equivalence partitions are covered, code coverage should automatically be close to 100%. If it is not, something is probably wrong – let’s investigate what is going on.
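
As a rough sketch of that workflow (assuming Python’s coverage package and a spec-derived pytest suite under tests/ – paths and threshold are made up):

import coverage
import pytest

cov = coverage.Coverage(source=["shop"])  # measure only our own code
cov.start()
pytest.main(["tests/"])                   # run the blackbox test suite
cov.stop()

# report() returns the total coverage percentage. Uncovered lines are a
# signal to investigate: dead code, or logic that isn't in the spec
# (like the hidden x == 42 branch from earlier).
percent = cov.report(show_missing=True)
if percent < 95.0:                        # hypothetical quality gate
    raise SystemExit("coverage gap - investigate before shipping")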

This approach might look similar to whitebox testing, but it isn’t. In whitebox testing, we derive the test cases from the code – coverage is a measure of the quality of the test suite. Here, by contrast, we derive the tests from the specification – coverage is, in a way, a measure of the quality of the implementation. (This is not greybox testing, btw.)

An approach like this could have detected the special case for 42 in the example I used earlier to illustrate the limits of blackbox testing.

Conclusion

Both blackbox and whitebox testing solve the same problem. However, in my opinion, blackbox testing is far superior, since whitebox testing has far more disadvantages: It can miss obvious bugs, it leads to test coupling, and 100% code coverage is impracticable in most cases. Blackbox testing, by contrast, only has the disadvantage that it cannot detect additional logic beyond the specification. That said, by combining blackbox testing with whitebox techniques, like coverage metrics, even those cases can be detected.


I noticed recently that my blog articles are quite diverse in “genre”: There are lengthier educational articles like this one, posts where I rant/nerd out on language design like The Language Nightmares Are Programmed In, but also less serious/more fun stuff like HTML Is A Programming Language.

So, I’m thinking of maybe creating some proper taxonomies for them on WordPress, so I can have separate feeds for them. Maybe I’ll also add a “Today I Learned” category for short posts that don’t require weeks to write. Let’s see.

In any case: I hope you enjoyed this article, or maybe even learned something.

See you soon,
Sigma

