This is a section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC.
The structure of a good test
A commonly used scheme for writing a test is "given/when/then":
- given some particular situation as background
- when something happens (such as a particular input)
- then something else should happen (such as a particular output or exception)
Importantly, a test should only test one thing at a time. This doesn't mean that the test should necessarily only test for one specific error at a time; rather, it means that the test should assess a specific situation ("given/when"), and then assess all of the possible outcomes that are necessary to ensure that the component functions properly ("then"). You can see this in the test for zero standard deviation that we generated in the earlier example, which actually tested for two conditions (the intended value being present in the list, and the list having a length of one) that together define the condition that we are interested in testing for.
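To make the given/when/then structure concrete, here is a minimal sketch of a test written in that style; the `standardize()` function and its zero-standard-deviation behavior are assumptions made for this illustration, not the actual function from the earlier example:

```python
import numpy as np
import pytest

def standardize(values):
    """Hypothetical function: z-score a list of values, raising an error
    if the standard deviation is zero (assumed behavior for illustration)."""
    values = np.asarray(values, dtype=float)
    if values.std() == 0:
        raise ValueError("standard deviation is zero")
    return (values - values.mean()) / values.std()

def test_standardize_zero_std():
    # given: an input whose standard deviation is zero
    values = [3.0, 3.0, 3.0]
    # when the function is called, then it should raise a ValueError
    with pytest.raises(ValueError):
        standardize(values)
```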
How do we test that the output of a function is correct given the input? There are different answers for different situations:
- commonly known answer: Sometimes we possess inputs for which the output is known. For example, if we were creating a function that computes the circumference of a circle, then we know that the output for an input radius of 1 should be 2 * pi. This is generally only the case for very simple functions. (Minimal sketches of this approach and of the behavioral approach appear after this list.)
- reference implementation: In other cases we may have a standard implementation of an algorithm that we can compare against. While in general it's not a good idea to reimplement code that already exists in a standard library, in some cases we may want to extend existing code but also check that the basic version still works as planned.
- parallel implementation: Sometimes we don't have a reference implementation, but we can code up another parallel implementation to compare our code to. It's important that this isn't just a copy of the code used in the function; in that case, it's really not a test at all!
- behavioral test: Sometimes the best we can do is to run the code repeatedly and ensure that it behaves as expected on average. For example, if a function outputs a numerical value and we know the expected distribution of that value given a particular input, we can ensure that the result matches that distribution with high probability. Such probabilistic tests are not optimal in the sense that they can occasionally fail even when the code is correct, but they are sometimes the best we can do.
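As a sketch of the first and last of these approaches, consider the following tests; the `circle_circumference()` function, the random-number generator, and the tolerances are all assumptions made for this example:

```python
import math
import numpy as np

def circle_circumference(radius):
    return 2 * math.pi * radius

def test_circumference_known_answer():
    # known answer: a radius of 1 should give a circumference of 2 * pi
    assert math.isclose(circle_circumference(1), 2 * math.pi)

def test_standard_normal_behavioral():
    # behavioral test: draw many samples and check that their mean and standard
    # deviation match the expected distribution; this can occasionally fail by
    # chance even when the code is correct
    rng = np.random.default_rng(42)
    samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
    assert abs(samples.mean()) < 0.05
    assert abs(samples.std() - 1.0) < 0.05
```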
Test against the interface, not the implementation
A good test shouldn't know about the internal implementation details of the function that it is testing, and changes in the internal code that do not modify the input-output relationship should not affect the test. That is, from the standpoint of the test, a function should be a "black box".
The most common way in which a test can violate this principle is by accessing the internal variables of a class that it is testing. For example, we might generate a class that performs a scaling operation on a numpy matrix:
```python
import numpy as np

class SimpleScaler:
    def __init__(self):
        self.transformed_ = None

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)

    def transform(self, X):
        self.transformed_ = (X - self.mean_) / self.std_
        return self.transformed_

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)
```
We could write a test that checks the values returned by the `fit_transform()` method, treating the class as a black box:
```python
def test_simple_scaler_interface():
    X = np.array([[1, 2], [3, 4], [5, 6]])
    scaler = SimpleScaler()
    # Test the interface without accessing internals
    transformed_X = scaler.fit_transform(X)
    assert np.allclose(transformed_X.mean(axis=0), np.array([0, 0]))
    assert np.allclose(transformed_X.std(axis=0), np.array([1, 1]))
```
Alternatively, one might use knowledge of the internals of the class to test the transformed value:
```python
def test_simple_scaler_internals():
    X = np.array([[1, 2], [3, 4], [5, 6]])
    scaler = SimpleScaler()
    _ = scaler.fit_transform(X)
    # Test that the transformed data is correct using the internal variable
    assert np.allclose(scaler.transformed_.mean(axis=0), np.array([0, 0]))
    assert np.allclose(scaler.transformed_.std(axis=0), np.array([1, 1]))
```
Both of these tests pass against the class definition shown above. However, if we were to change the way that the transformation is performed (for example, if we decide to use the `StandardScaler` class from scikit-learn instead of writing our own), then the implementation-aware test is likely to fail unless the same internal variable names are used. In general, we should only interact with a function or class via its explicit interfaces.
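For instance, here is a sketch of how the class might be rewritten to delegate to scikit-learn's `StandardScaler` (this rewrite is illustrative and not part of the original example). The interface test above would still pass, but the internals test would fail because the `transformed_` attribute no longer exists:

```python
from sklearn.preprocessing import StandardScaler

class SimpleScaler:
    def __init__(self):
        # Delegate the actual computation to scikit-learn
        self._scaler = StandardScaler()

    def fit(self, X):
        self._scaler.fit(X)

    def transform(self, X):
        # No transformed_ attribute is stored, so the internals-based test breaks
        return self._scaler.transform(X)

    def fit_transform(self, X):
        return self._scaler.fit_transform(X)
```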
Tests should be independent
In scientific computing it's common to compose many different operations into a workflow. If we want to test the workflow, then the tests of later steps in the workflow must necessarily rely upon earlier steps. We could in theory write a set of tests that operate on a shared object, but the tests would fail if executed in an incorrect order, even if the code was correct. Similarly, a failure on an early test would cause cascading failures in later tests, even if their code was correct. The use of ordered tests also prevents the parallel execution of tests, which may slow down testing for complex projects. For these reasons, we should always aim to create tests that can be executed independently.
Here is an example where coupling between tests could cause failures. First we generate two functions that make changes in place to a data frame:
```python
import pandas as pd

def split_names(df):
    df['firstname'] = df['name'].apply(lambda x: x.split()[0])
    df['lastname'] = df['name'].apply(lambda x: x.split()[1])

def get_initials(df):
    df['initials'] = df['firstname'].str[0] + df['lastname'].str[0]
```
In this case, the `get_initials()` function relies upon the `split_names()` function having been run, since otherwise the necessary columns won't exist in the data frame. We can then create tests for each of these, and a data frame that they can both use:
```python
people_df = pd.DataFrame({'name': ['Alice Smith', 'Bob Howard', 'Charlie Ashe']})

def test_split_names():
    split_names(people_df)
    assert people_df['firstname'].tolist() == ['Alice', 'Bob', 'Charlie']
    assert people_df['lastname'].tolist() == ['Smith', 'Howard', 'Ashe']

def test_get_initials():
    get_initials(people_df)
    assert people_df['initials'].tolist() == ['AS', 'BH', 'CA']
```
These tests run correctly, but the same tests fail if we change their order such that `test_get_initials()` runs first, because the necessary columns (`firstname` and `lastname`) have not yet been created.
One simple way to deal with this is to set up all of the necessary structure locally within each test:
```python
def get_people_df():
    return pd.DataFrame({'name': ['Alice Smith', 'Bob Howard', 'Charlie Ashe']})

def test_split_names_fullsetup():
    local_people_df = get_people_df()
    split_names(local_people_df)
    assert local_people_df['firstname'].tolist() == ['Alice', 'Bob', 'Charlie']
    assert local_people_df['lastname'].tolist() == ['Smith', 'Howard', 'Ashe']

def test_get_initials_fullsetup():
    local_people_df = get_people_df()
    split_names(local_people_df)
    get_initials(local_people_df)
    assert local_people_df['initials'].tolist() == ['AS', 'BH', 'CA']
```
For simple functions like these, this would not cause too much computational overhead, but for computationally intensive functions we would like to be able to reuse the results from the first time each function is run. In a later section we will discuss the use of fixtures, which allow this kind of reuse across tests while avoiding the ordering problems that we saw above when using a global variable across tests.
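As a brief preview, here is a minimal sketch of how a pytest fixture could provide each test with its own fresh data frame; it assumes the `split_names()` function defined above is in scope, and the details are deferred to that later section:

```python
import pandas as pd
import pytest

@pytest.fixture
def people_df():
    # Each test receives a freshly constructed data frame, so in-place
    # modifications made by one test cannot affect another
    return pd.DataFrame({'name': ['Alice Smith', 'Bob Howard', 'Charlie Ashe']})

def test_split_names_fixture(people_df):
    split_names(people_df)
    assert people_df['firstname'].tolist() == ['Alice', 'Bob', 'Charlie']
```

Fixture scopes (for example, `scope="session"`) control whether this setup is re-run for every test or computed once and shared, which is how expensive results can be reused without depending on test order.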