Testing frameworks
One could write tests without the help of any specialized packages, but we generally use a testing framework to help automate the activities of testing. There are several testing frameworks for Python; we will focus on the popular and flexible pytest framework.
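In pytest, a test is just a Python function whose name starts with `test` and that checks its expectations with plain `assert` statements; if every assertion holds, the test passes. As a minimal, purely hypothetical example (the file and function names here are illustrative):

```python
# contents of a hypothetical file named test_addition.py

def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def test_add():
    # pytest collects this function because its name starts with "test"
    assert add(2, 2) == 4
```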
We will start with a very simple example: a function that computes the Euclidean distance between two points. Copilot generates the following for us based on the prompt in the comment:
```python
# generate a function that calculates the distance between two points
# where each point is defined as a tuple of two numbers
import math


def distance(p1, p2):
    """Calculate the distance between two points"""
    x1, y1 = p1
    x2, y2 = p2
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
```
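Before asking for tests, it is worth sanity-checking the function against a value we can verify by hand, such as the 3-4-5 right triangle (a quick sketch, assuming the function above has been defined):

```python
# the 3-4-5 right triangle gives a distance that is easy to verify by hand
print(distance((0, 0), (3, 4)))  # expected output: 5.0
```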
Now we would like to generate some tests for this code to make sure that it works properly. If we ask Copilot to generate some tests, it does a seemingly decent job:
```python
import math

from BetterCodeBetterScience.distance_testing.distance import distance  # adjust to match where distance() is defined

def test_distance_zero():
    assert distance((0, 0), (0, 0)) == 0

def test_distance_positive_coordinates():
    assert distance((1, 2), (4, 6)) == 5

def test_distance_negative_coordinates():
    assert distance((-1, -2), (-4, -6)) == 5

def test_distance_mixed_coordinates():
    assert distance((1, -2), (-4, 6)) == math.sqrt(125)

def test_distance_same_x():
    assert distance((3, 4), (3, 8)) == 4

def test_distance_same_y():
    assert distance((3, 4), (7, 4)) == 4
```
Now that we have our tests (stored in `src/BetterCodeBetterScience/distance_testing/test_distance.py`), we can run them using the `pytest` command:

```
pytest src/BetterCodeBetterScience/distance_testing
```
This command will cause pytest to search (by default) for any files named `test_*.py` or `*_test.py` in the relevant path, and then to collect any functions within those files whose names start with the prefix `test`. Running those tests, we get an error:
```
>       assert distance((1, -2), (-4, 6)) == math.sqrt(125)
E       assert 9.433981132056603 == 11.180339887498949
E        +  where 9.433981132056603 = distance((1, -2), (-4, 6))
E        +  and   11.180339887498949 = <built-in function sqrt>(125)
E        +  where <built-in function sqrt> = math.sqrt
```
Here we see that the value returned by our function differs from the one expected by the test; in this case, the expected value generated by Copilot is incorrect. In our research, it was not uncommon for AI coding agents to generate incorrect test values, so these must always be checked by a domain expert. Once we fix the expected value for that test (the square root of 89), we can rerun the tests and see that they pass:
```
python -m pytest src/BetterCodeBetterScience/distance_testing
==================== test session starts =====================
src/BetterCodeBetterScience/distance_testing/test_distance.py ......  [100%]

===================== 6 passed in 0.06s ======================
```
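For reference, here is what the corrected test might look like. The coordinate differences are -5 and 8, so the expected value is the square root of 25 + 64 = 89 rather than 125; when comparing floating-point results it is also generally safer to use `pytest.approx` than exact equality. A sketch, assuming `distance` is imported as in the test file above:

```python
import math
import pytest

def test_distance_mixed_coordinates():
    # differences are (-4 - 1) = -5 and (6 - (-2)) = 8, so the distance is sqrt(25 + 64) = sqrt(89)
    assert distance((1, -2), (-4, 6)) == pytest.approx(math.sqrt(89))
```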
Potential problems with AI-generated tests
If we are going to rely upon AI tools to generate our tests, we need to be sure that the tests are correct. One of my early forays into AI-driven test generation uncovered an interesting example of how this can go wrong.
In our early project examining the performance of GPT-4 for coding, one of our analyses first asked GPT-4 to generate a set of functions addressing common problems in several scientific domains, and then to generate tests to make sure that each function performed correctly. One of the functions it generated was the escape velocity function shown in the previous post, for which GPT-4 generated the following test:
```python
import pytest

# note: this test sits in the same file as the escape_velocity function, so no import of the function is needed
def test_escape_velocity():
    mass_earth = 5.972e24
    radius_earth = 6.371e6
    result = escape_velocity(mass_earth, radius_earth)
    assert pytest.approx(result, rel=1e-3) == 11186.25
    mass_mars = 6.4171e23
    radius_mars = 3.3895e6
    result = escape_velocity(mass_mars, radius_mars)
    assert pytest.approx(result, rel=1e-3) == 5027.34
    mass_jupiter = 1.8982e27
    radius_jupiter = 6.9911e7
    result = escape_velocity(mass_jupiter, radius_jupiter)
    assert pytest.approx(result, rel=1e-3) == 59564.97
```
When we run this test (renaming it `test_escape_velocity_gpt4`), we see that one of its assertions fails:
```
❯ pytest src/BetterCodeBetterScience/escape_velocity.py::test_escape_velocity_gpt4
================= test session starts =================
platform darwin -- Python 3.12.0, pytest-8.4.1, pluggy-1.5.0
rootdir: /Users/poldrack/Dropbox/code/BetterCodeBetterScience
configfile: pyproject.toml
plugins: cov-5.0.0, anyio-4.6.0, hypothesis-6.115.3, mock-3.14.0
collected 1 item

src/BetterCodeBetterScience/escape_velocity.py F          [100%]

======================== FAILURES ========================
________________ test_escape_velocity_gpt4 _______________

    def test_escape_velocity_gpt4():
        mass_earth = 5.972e24
        radius_earth = 6.371e6
        result = escape_velocity(mass_earth, radius_earth)
        assert pytest.approx(result, rel=1e-3) == 11186.25
        mass_mars = 6.4171e23
        radius_mars = 3.3895e6
        result = escape_velocity(mass_mars, radius_mars)
        assert pytest.approx(result, rel=1e-3) == 5027.34
        mass_jupiter = 1.8982e27
        radius_jupiter = 6.9911e7
        result = escape_velocity(mass_jupiter, radius_jupiter)
>       assert pytest.approx(result, rel=1e-3) == 59564.97
E       assert 60202.716344497014 ± 60.2027 == 59564.97
E
E         comparison failed
E         Obtained: 59564.97
E         Expected: 60202.716344497014 ± 60.2027

src/BetterCodeBetterScience/escape_velocity.py:52: AssertionError
================= short test summary info =================
FAILED src/BetterCodeBetterScience/escape_velocity.py::test_escape_velocity_gpt4 - assert 60202.716344497014 ± 60.2027 == 59564.97
==================== 1 failed in 0.12s ====================
```
It seems that the first two assertions pass but the third one, for Jupiter, fails. This failure took a bit of digging to fully understand. In this case, the code and the test value are both correct, depending on where you stand on Jupiter! The problem is that planets are oblate, meaning that they are slightly flattened so that the radius at the equator is larger than the radius at the poles. NASA's Jupiter fact sheet lists an escape velocity of 59.5 km/s, which appears to be the source of the test value; that figure is correct when computed using the equatorial radius of 71492 km. However, the radius given for Jupiter in GPT-4's test (69911 km) is the volumetric mean radius rather than the equatorial radius, and the value returned by the code (60.2 km/s) is correct for that radius. Thus, the test failed not because of any problem with the code itself, but because of a mismatch in assumptions: the expected value was derived from the equatorial radius, while the volumetric mean radius was passed to the function. This example highlights the importance of understanding and checking the tests that are generated by AI coding tools.
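We can check this concretely by recomputing both numbers from the standard escape velocity formula, v = sqrt(2GM/r). This is a minimal sketch, assuming the implementation from the previous post follows that formula with G ≈ 6.674e-11 m^3 kg^-1 s^-2:

```python
import math

G = 6.67430e-11  # gravitational constant, m^3 kg^-1 s^-2

def escape_velocity(mass, radius):
    """Escape velocity in m/s: v = sqrt(2 * G * mass / radius)."""
    return math.sqrt(2 * G * mass / radius)

mass_jupiter = 1.8982e27           # kg
volumetric_mean_radius = 6.9911e7  # m, the radius used in GPT-4's test
equatorial_radius = 7.1492e7       # m, the radius behind NASA's 59.5 km/s figure

print(escape_velocity(mass_jupiter, volumetric_mean_radius))  # ~60203 m/s, i.e. 60.2 km/s
print(escape_velocity(mass_jupiter, equatorial_radius))       # ~59533 m/s, i.e. 59.5 km/s
```

Both numbers are "right"; they simply correspond to different radii, which is exactly the kind of assumption mismatch that a domain expert needs to catch.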
In the next post I will go through an example of AI-assisted test generation in detail.