Computational notebooks
Better Code, Better Science: Chapter 6, Part 4
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC. Thanks to Steffen Bollman for helpful suggestions on a draft of this section.
The advent of the Jupyter notebook has fundamentally changed the way that many scientists do their computational work. By allowing the mixing together of code, text, and graphics, Project Jupyter has taken Donald Knuth's vision of "literate programming" and made it available in a powerful way to users of many supported languages, including Python, R, Julia, and more. Many scientists now do the majority of their computing within these notebooks or similar literate programming frameworks (such as RMarkdown or Quarto notebooks). Given its popularity and flexibility, we will focus on Jupyter, but some of the points raised below extend to other frameworks as well.
The exploding prevalence of Jupyter notebooks is unsurprising, given their many useful features. They match the way that many scientists interactively work to explore and process their data, and provide a way to visualize results next to the code and text that generates them. They also provide an easy way to share results with other researchers. At the same time, they come with some particular software development challenges, which we discuss further below.
What is a Jupyter notebook?
Put simply, a Jupyter notebook is a structured document that allows the mixing together of code and text, stored as a JSON (JavaScript Object Notation) file. It is structured as a set of cells, each of which can be individually executed. Each cell can contain text or code, with support for a number of different languages. The user interacts with the notebook through a web browser or other interface, while the commands are executed by a kernel that runs in the background. We won't provide an introduction to using Jupyter notebooks here, since many good introductions are available online. Instead, we will focus on the specific aspects of Jupyter notebook usage that are relevant to reproducibility.
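Because a notebook is just JSON, its structure is easy to inspect directly. As a minimal sketch (assuming a notebook file named `analysis.ipynb` in the working directory), the following code loads a notebook and lists its cells:

```python
import json

# A .ipynb file is plain JSON: a dict with a list of cells, plus metadata.
with open("analysis.ipynb") as f:
    nb = json.load(f)

for cell in nb["cells"]:
    # Each cell records its type ("code" or "markdown") and its source text;
    # code cells also store an execution count and any saved outputs.
    print(cell["cell_type"], cell.get("execution_count"), len(cell["source"]))
```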
Many users of Jupyter notebooks work with them via the default Jupyter Lab interface within a web browser, and there are often good reasons to use this interface. However, other IDEs (including VSCode and PyCharm) also support editing and executing Jupyter notebooks. The main reason that I generally use a standalone editor rather than the Jupyter Lab interface is that these editors allow seamless integration of AI coding assistants. While there are tools that attempt to integrate AI assistants within the native Jupyter interface, they are at present nowhere near the level of commercial IDEs such as VSCode. In addition, these IDEs provide easy access to many other essential coding features, such as code formatting and automated linting.
Patterns for Jupyter notebook development
There are many ways to work Jupyter notebooks into a scientific computing workflow. I'll outline several patterns, which are not mutually exclusive; rather, they demonstrate the variety of ways that one might use notebooks in scientific work.
All interactive notebooks, all the time
Some researchers do all of their coding interactively within notebooks. This is the simplest pattern, since it only requires a single interface and allows full interactive access to all of the code. However, in my opinion there are often good reasons not to use this approach. Several of these are drawn from Joel Grus's famous 2018 JupyterCon talk titled "I don't like notebooks", but all of them are borne out by my own experience as a user of Jupyter notebooks for more than a decade.
Dependence on execution order
The cells in a Jupyter notebook can be executed in any order by the user, which means that the current value of every variable in the workspace depends on the exact order in which the previous cells were executed. While this can sometimes be evident from the execution numbers presented alongside each cell, for a complex notebook it can become very difficult to identify exactly what has happened. This is why most Jupyter power-users learn to reflexively restart the kernel and run all of the cells in the notebook, as this is the only way to guarantee ordered execution. This issue is also commonly confusing for new users; I once taught a statistics course using Jupyter notebooks within Google Colab, and I found that students' confusion was very often resolved by restarting the kernel and rerunning the notebook, suggesting that out-of-order execution was the cause. Out-of-order execution is exceedingly common; an analysis of 1.4 million notebooks from GitHub by Pimentel and colleagues found that, among notebooks in which the execution order was unambiguous, 36.4% had cells that were executed out of order.
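To illustrate the hazard, consider this minimal (hypothetical) three-cell notebook, sketched here as plain Python with comments marking the cell boundaries:

```python
# Cell 1: define the data and analysis parameters
data = [0.2, 0.6, 0.8, 0.95]
threshold = 0.5

# Cell 2: filter the data using the current value of `threshold`
filtered = [x for x in data if x > threshold]

# Cell 3: added later during exploration; overwrites the parameter
threshold = 0.9
```

If the user executes Cell 3 and then re-runs Cell 2 without re-running Cell 1, `filtered` now reflects the 0.9 threshold, even though the notebook read from top to bottom says 0.5; only restarting the kernel and running all cells guarantees the documented behavior.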
Global workspace
As we discussed earlier in the book, global variables have a bad reputation for making debugging difficult, since changes to a global variable can have wide-ranging effects on the code that can be difficult to identify. For this reason, we generally try to encapsulate variables so that their scope is only as wide as necessary. In a notebook, however, every variable is global unless it is contained within a function or class defined within the notebook. This means that if a function refers to a variable name that also exists in the global namespace, the global value can be silently accessed within the function. I have on more than one occasion seen tricky bugs occur when a user creates a function to encapsulate some code, but forgets to define a variable within the function that also exists in the global state. The function's behavior then changes depending on the value of the global variable, in a way that can be incredibly confusing. It is for this reason (among others) that I always suggest moving functions out of a notebook and into a module as soon as possible, to prevent these kinds of bugs from occurring; I describe this in more detail below.
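Here is a minimal (hypothetical) example of this failure mode; the function silently captures a global variable because the author forgot to compute the value locally:

```python
n_subjects = 24  # defined in an earlier cell, lives in the global namespace

def mean_response(responses):
    # BUG: the author meant to divide by len(responses), but forgot,
    # so Python silently falls back to the global n_subjects.
    return sum(responses) / n_subjects

print(mean_response([1.0] * 24))  # 1.0: happens to look correct
print(mean_response([1.0] * 10))  # 0.416...: silently wrong
```

The first call happens to return the right answer, which is exactly what makes such bugs so hard to catch.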
Notebooks play badly with version control
Because Jupyter notebooks store execution counts and cell outputs in the file, the file contents change whenever a cell is executed. This means that version control systems will register these non-functional changes as modifications, since they simply look for any change to the file. I discuss this in much more detail below.
Notebooks discourage testing
Although frameworks exist for code testing within Jupyter notebooks, it is much more straightforward to develop tests for separate functions defined outside of a notebook using standard testing approaches, as outlined in Chapter 4. This is a strong motivator for extracting important functions into modules, as discussed further below.
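As a sketch of why extraction helps (using hypothetical file and function names), a function moved from a notebook into a module can be exercised with an ordinary pytest test:

```python
# stats_utils.py: a function extracted from a notebook into a module
def zscore(values):
    """Standardize a list of numbers to mean 0 and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var**0.5 for v in values]


# test_stats_utils.py: a standard pytest test, which would be much
# harder to run against code living inside a notebook cell
from stats_utils import zscore

def test_zscore_centers_data():
    result = zscore([1.0, 2.0, 3.0])
    assert abs(sum(result)) < 1e-9
```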
Notebooks as a rapid prototyping tool
Often we want to just explore an idea without developing an entire project, and Jupyter notebooks are an ideal platform for exploring and prototyping new ideas. This is my most common use case for notebooks today. For example, let’s say that I want to try out a new Python package for data analysis on one of my existing datasets. It’s very easy to spin up a notebook and quickly try it out. If I decide that it’s something that I want to continue pursuing, I would then transition to implementing the code in a Python script or module, depending on the nature of the project.
Notebooks as a high-level workflow execution layer
Another way to use notebooks is to interactively control the execution of a workflow whose components have been implemented separately in a Python module. This approach addresses some of the concerns raised above regarding Jupyter notebooks, since the substantive code lives in a module, while still allowing the user to watch the workflow in action and examine intermediate products for quality assurance.
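A minimal sketch of this pattern might look like the following, assuming a hypothetical module `mypipeline` that implements the actual workflow steps:

```python
# Cell 1: import the workflow steps from a hypothetical project module
from mypipeline import load_data, preprocess, fit_model

# Cell 2: load and eyeball the raw data
raw = load_data("sub-01")
raw.head()  # displayed as the cell's output for a quick quality check

# Cell 3: run preprocessing and sanity-check the result
clean = preprocess(raw)
clean.describe()  # displayed as the cell's output

# Cell 4: fit the model on the cleaned data
model = fit_model(clean)
```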
Notebooks for visualization only
Notebooks shine as tools for data visualization, and one common pattern is to perform data analyses using standard Python scripts/modules, saving the results to output files, and then use notebooks to visualize the results. As long as the visualizations are standalone (for example, with the plotting code defined in a separate module), one can display them in a notebook without concern about state dependence or execution order. Notebooks are also easy to share (see below), which makes them a useful way to share visualizations with others.
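As a sketch (with hypothetical file and module names), a visualization-only notebook cell might simply load saved results and hand them to a plotting function defined in a module:

```python
# Load previously saved analysis results and plot them. `plotting` is a
# hypothetical project module; results/accuracy.csv is a hypothetical
# output file written by a separate analysis script.
import pandas as pd

from plotting import plot_accuracy_by_group

results = pd.read_csv("results/accuracy.csv")
plot_accuracy_by_group(results)
```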
Notebooks as literate programs
A final way that one might use notebooks is as a way to create standalone programs with rich annotation via the markdown support provided by notebooks. In this pattern, one would use a notebook editor to generate the code, but then run it as if it were a standard script, using `jupyter nbconvert --execute` to execute the notebook and generate a rendered version. While this is plausible, I don't think it's an optimal solution. Instead, I think that one should consider generating pure Python code using an embedded notation such as the py:percent format supported by jupytext, which we will describe in more detail below.
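To give a flavor of the format, here is a minimal sketch of a py:percent file; `# %%` marks a code cell, `# %% [markdown]` marks a text cell, and jupytext can convert such a file to and from the .ipynb format:

```python
# %% [markdown]
# # Reaction time analysis
# This file is plain Python, but jupytext renders these commented
# blocks as markdown cells when the file is opened as a notebook.

# %%
import numpy as np

rng = np.random.default_rng(42)
rts = rng.gamma(shape=2.0, scale=0.3, size=100)  # simulated reaction times

# %%
print(f"mean RT: {rts.mean():.3f} s")
```

Because the file is ordinary Python, it diffs cleanly under version control and can be run directly with the Python interpreter, while still being editable as a notebook.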
In the next post I will discuss the use of Jupyter notebooks as a way to mix together different programming languages.
