Discussion about this post

Neural Foundry

Superb breakdown of the discipline required to keep notebooks reproducible. The kernel restart habit is the single most underappreciated practice in data science workflows.

The global scope problem you highlighted with the function example is where most notebook bugs hide. Jupyter's execution model creates this invisible coupling between cells that violates every principle of modular code. When a function defined in a notebook cell can silently access variables from the global namespace, you lose the explicitness that makes code debuggable. The example comparing in-notebook functions versus imported functions makes this concrete—imported functions fail loudly when dependencies are missing, while in-notebook functions fail silently or produce incorrect results.
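A minimal sketch of that hidden coupling, assuming a pandas workflow (the `df`, `summarize`, and `summarize_explicit` names are illustrative, not taken from the post):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})  # created in some earlier cell

# In-notebook function: silently captures `df` from the global namespace.
# Nothing in its signature declares the dependency, so it keeps "working"
# only as long as some earlier cell happened to define `df`.
def summarize():
    return df.describe()

# The imported-module style: the dependency is a parameter, so calling it
# without the data fails loudly with a TypeError instead of silently
# picking up whatever `df` currently holds.
def summarize_explicit(data: pd.DataFrame) -> pd.DataFrame:
    return data.describe()

print(summarize())             # depends on hidden global state
print(summarize_explicit(df))  # dependency visible at the call site
```

After a kernel restart, `summarize()` only fails with a NameError at call time, while the explicit version tells you exactly what it needs from its signature alone.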

What's particularly insidious is that this problem compounds over time. A notebook that starts small and clean gradually accumulates hidden state. By the time you have 50 cells and multiple interdependent functions, you're debugging a stateful system where cell execution order matters more than the code itself. The "restart and run all" discipline is the only way to break that cycle, but most practitioners don't adopt it until they've been burned multiple times.

Pete Bachant

To expand on the first tip: Never trust any notebook output as "official" unless the notebook was run in batch mode, e.g., with nbconvert or papermill. Even better, the notebook should be incorporated into a pipeline so the entire project can be run all at once.
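A minimal sketch of batch execution with papermill (the filenames and the `run_date` parameter are placeholders, not from the comment):

```python
import papermill as pm

# Execute the notebook top to bottom in a fresh kernel and write the
# executed copy alongside the source. Parameter injection assumes the
# notebook has a cell tagged "parameters".
pm.execute_notebook(
    "analysis.ipynb",
    "analysis_out.ipynb",
    parameters={"run_date": "2024-01-01"},
)

# Roughly equivalent nbconvert CLI, without parameterization:
#   jupyter nbconvert --to notebook --execute analysis.ipynb
```

Either way, the output you publish comes from a clean top-to-bottom run rather than whatever state the interactive kernel happened to be in.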

