Version control and Jupyter notebooks
Better Code, Better Science: Chapter 6, Part 7
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC. Thanks to Steffen Bollman for helpful suggestions on a draft of this section.
While notebooks have understandably gained wide traction, they also have some important limitations. Foremost, the structure of the .ipynb file makes notebooks problematic for use in version control systems like git. The file itself is stored as a JSON (JavaScript Object Notation) object, which in Python translates into a dictionary. As an example, we created a very simple notebook and saved it to our computer. We can open it as a JSON file, where we see the following contents:
{'cells': [{'cell_type': 'markdown',
   'metadata': {},
   'source': ['# Example notebook']},
  {'cell_type': 'code',
   'execution_count': 3,
   'metadata': {},
   'outputs': [],
   'source': ['import numpy as np\n', '\n', 'x = np.random.randn(1)']}],
 'metadata': {'language_info': {'name': 'python'}},
 'nbformat': 4,
 'nbformat_minor': 2}

You can see that the file includes a section for cells, each of which in this case contains either Markdown or Python code. In addition, it contains various metadata elements about the file. One thing you should notice is that each code cell contains an execution_count field, which records where the cell's most recent run fell in the kernel's running count of executions. If we rerun the code in that cell without making any changes and then save the notebook, we will see that the execution count has incremented by one. We can see this by running `git diff` on this new file after having checked in the previous version:
- "execution_count": 3,
+ "execution_count": 4,

This is one of the reasons why we say that notebook files don't work well with version control: simply executing the notebook without any actual changes will still result in a difference according to git, and these differences can litter the git history, making it very difficult to discern true code differences.
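The same spurious difference can be reproduced without git. The following is a minimal sketch, not part of the original workflow: it rebuilds the example notebook as a dictionary, bumps the execution count as a rerun would, and compares the two serialized versions with the standard library's difflib. The helper name `to_lines` and the one-space JSON indentation are assumptions made for illustration.

```python
import copy
import difflib
import json

# The example notebook from above, written as a Python dictionary.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Example notebook"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": ["import numpy as np\n", "\n", "x = np.random.randn(1)"],
        },
    ],
    "metadata": {"language_info": {"name": "python"}},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# Simulate rerunning the code cell: only the counter changes, not the code.
rerun = copy.deepcopy(notebook)
rerun["cells"][1]["execution_count"] += 1


def to_lines(nb):
    """Serialize a notebook dictionary to indented JSON lines (assumed layout)."""
    return json.dumps(nb, indent=1).splitlines()


# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed = [
    line
    for line in difflib.unified_diff(to_lines(notebook), to_lines(rerun), lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("---", "+++"))
]
print("\n".join(changed))
```

The only lines reported are the two execution_count lines, even though the source of the cell is identical in both versions.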
Another challenge with using Jupyter notebooks alongside version control occurs when the notebook includes images, such as output from plotting commands. Images in Jupyter notebooks are stored in a serialized text-based format (base64 encoding, for binary formats such as PNG); you can see this by perusing the text of a notebook that includes images, where you will see large sections of seemingly random text, which represent the content of the image converted into text. If the images change, then the git diff will be littered with huge sections of this gibberish text. One could filter these out when viewing the diffs (e.g. using grep). A further problem is that the repository itself can become slow and bloated if there are many notebooks with large images that change over time, because each changed version is stored in the history.
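To give a feel for what this embedded text looks like, here is a small sketch using the standard library's base64 module. The input is an arbitrary stand-in for image data, not a real PNG, and the variable names are made up for illustration.

```python
import base64

# Stand-in for binary image data: 3072 arbitrary bytes (not a real PNG).
raw_bytes = bytes(range(256)) * 12

# Binary data is converted to plain ASCII text before being stored in JSON.
encoded = base64.b64encode(raw_bytes).decode("ascii")

# Base64 uses four text characters for every three input bytes.
print(len(raw_bytes), len(encoded))
print(encoded[:60])
```

The encoded text is about a third larger than the original bytes, and a change to the underlying image generally changes long stretches of the encoded text, which is why such diffs are unreadable.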
There are tools that one can use to address this, such as nbstripout to remove cell outputs before committing a file, or nbdime to provide "rich diffs" that make it easier to see the differences in the current state versus the last commit. There is also a library called nbdev that provides git hooks to help with the git workflow. However, converting notebooks to pure Python code prior to committing is a straightforward way to work around these issues.
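The idea behind stripping outputs can be sketched in a few lines. This is a conceptual illustration only, not nbstripout's actual implementation (which handles more cases, such as notebook metadata); the function name `strip_notebook` is hypothetical.

```python
import copy


def strip_notebook(nb):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Two saves of the same notebook that differ only in run-time state.
first_save = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": ["x = 1 + 1"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}
second_save = copy.deepcopy(first_save)
second_save["cells"][0]["execution_count"] = 4
second_save["cells"][0]["outputs"] = [
    {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
]

# After stripping, the two versions are identical, so git would see no change.
print(strip_notebook(first_save) == strip_notebook(second_save))
```

In practice one would rely on the tools named above rather than a hand-written function like this one.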
Converting notebooks to pure Python
The jupytext tool supports several formats that can encode the metadata from a notebook into comments within a Python file, allowing direct conversion in both directions between a Jupyter notebook and a pure Python file. We like the py:percent format, which places a specific marker (# %%) above each cell:
# %% [markdown]
# ### Example notebook
#
# This is just a simple example

# %%
import numpy as np
import matplotlib.pyplot as plt

These cells can then be version-controlled just as one would with any Python file. To create a linked Python version of a Jupyter notebook, use the jupytext command:
❯ jupytext --set-formats ipynb,py:percent example_notebook2.ipynb
[jupytext] Reading example_notebook2.ipynb in format ipynb
[jupytext] Updating notebook metadata with '{"jupytext": {"formats": "ipynb,py:percent"}}'
[jupytext] Updating example_notebook2.ipynb
[jupytext] Updating example_notebook2.py

This creates a new Python file that is linked to the notebook, such that edits can be synchronized between the notebook and Python versions.
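As a rough illustration of what the py:percent representation contains, the sketch below renders a notebook dictionary as percent-format text using only the standard library. It is a simplified stand-in, not jupytext's implementation: the function name `to_percent_script` is hypothetical, and jupytext additionally writes a metadata header and handles other cell types and options.

```python
def to_percent_script(nb):
    """Render a notebook dictionary as py:percent-style text (simplified)."""
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown text becomes comments under a "# %% [markdown]" marker.
            body = "\n".join(
                "# " + line if line else "#" for line in source.split("\n")
            )
            chunks.append("# %% [markdown]\n" + body)
        elif cell["cell_type"] == "code":
            # Code is written unchanged under a bare "# %%" marker.
            chunks.append("# %%\n" + source)
    return "\n\n".join(chunks) + "\n"


example = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["### Example notebook\n", "\n", "This is just a simple example"],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": ["import numpy as np\n", "import matplotlib.pyplot as plt"],
        },
    ]
}

script = to_percent_script(example)
print(script)
```

Note that the execution count and outputs do not appear anywhere in the resulting text, which is what makes this form friendly to git.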
Using jupytext as a pre-commit hook
If one wants to edit code using Jupyter notebooks while still maintaining the advantages of the pure Python format for version control (assuming one is using git), one option is to apply jupytext as part of a pre-commit hook, which is a git feature that allows commands to be executed automatically prior to the execution of a commit. To use this feature, you must have the pre-commit Python package installed. Automatic syncing of Python and notebook files can be enabled within a git repository by creating a file called .pre-commit-config.yaml within the main repository directory, with the following contents:
repos:
  -
    repo: local
    hooks:
      -
        id: jupytext
        name: jupytext
        entry: jupytext --from ipynb --to py:percent --pre-commit
        pass_filenames: false
        language: python
      -
        id: unstage-ipynb
        name: unstage-ipynb
        entry: git reset HEAD **/*.ipynb
        pass_filenames: false
        language: system

The first section will automatically run jupytext and generate a pure Python version of the notebook before the commit is completed. The second section will unstage the .ipynb files before committing, so that they will not be committed to the git repository (only the Python files will). This will keep the Python and Jupyter notebook files synced while only committing the Python files to the git repository.
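The intended net effect of the two hooks on a commit can be pictured with a short sketch. This is only an illustration of the outcome, not what pre-commit or git actually executes; the function name `split_staged_files` and the file paths are hypothetical.

```python
from pathlib import PurePosixPath


def split_staged_files(paths):
    """Separate notebook files from everything else in a list of staged paths."""
    notebooks = [p for p in paths if PurePosixPath(p).suffix == ".ipynb"]
    committed = [p for p in paths if PurePosixPath(p).suffix != ".ipynb"]
    return notebooks, committed


# Hypothetical staging area after the jupytext hook has written the .py file.
staged = [
    "analysis/example_notebook2.ipynb",
    "analysis/example_notebook2.py",
    "README.md",
]

unstaged_notebooks, committed_files = split_staged_files(staged)
print("unstaged:", unstaged_notebooks)
print("committed:", committed_files)
```

The notebook stays in the working directory for interactive use, while only its paired Python file (and any other staged files) ends up in the commit.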
This post marks the end of our chapter on project organization and structure. Watch for the next chapter coming soon!
