Scientific workflow management
Better Code, Better Science: Chapter 8, Part 1
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC-ND.
In most parts of science today, the processing and analysis of data comprise many different steps. We will refer to such a set of steps as a computational workflow; while there are certainly many types of non-computational workflows in science, we will focus here on computational ones. If you have been doing science for very long, you have very likely encountered a mega-script that implements such a workflow: a script, sometimes hundreds or even thousands of lines long, that runs a single workflow from start to end. Often these scripts are handed down to new trainees over generations, such that users become afraid to make any changes lest the entire house of cards come crashing down. I think that most of us can agree that this is not an optimal way of working, and in this chapter I will discuss in detail how to move from a mega-script to a workflow that meets the requirements for providing robust and reliable answers to our scientific questions.
What do we want from a scientific workflow?
First let’s ask: What do we want from a computational scientific workflow? Here are some of the factors that I think are important. First, we care about the correctness of the workflow, which includes the following factors:
Validity: The workflow includes validation procedures to ensure against known problems or edge cases.
Reproducibility: The workflow can be rerun from scratch on the same data and get the same answer, at least within the limits of uncontrollable factors such as floating point imprecision and operating system differences.
Robustness: When there is a problem, the workflow fails quickly with explicit error messages, or degrades gracefully when possible.
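Robustness in this sense means checking inputs up front and failing with an explicit message before any expensive computation begins. A minimal sketch of such fail-fast validation might look like the following (the function name, file layout, and thresholds are hypothetical, not from any particular library):

```python
from pathlib import Path


def validate_inputs(data_file: Path, min_rows: int = 1) -> None:
    """Fail fast with explicit messages before any expensive processing.

    A minimal sketch: real workflows would check formats, schemas, and
    known edge cases specific to their data.
    """
    if not data_file.exists():
        raise FileNotFoundError(f"Input data file not found: {data_file}")
    if data_file.stat().st_size == 0:
        raise ValueError(f"Input data file is empty: {data_file}")
    # Guard against a known edge case: a header-only CSV with no data rows
    n_lines = sum(1 for _ in data_file.open())
    if n_lines <= min_rows:
        raise ValueError(
            f"Expected more than {min_rows} data row(s) in {data_file}, "
            f"found {n_lines - 1}"
        )
```

Calling a validator like this at the very start of the workflow means that a missing or malformed file is reported immediately, rather than surfacing as a cryptic error several steps later.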
Second, we care about the usability of the workflow. Factors related to usability include:
Configurability: The workflow uses smart defaults, but allows the user to easily change the configuration.
Portability: We would like for the workflow to be easily runnable across multiple systems.
Parameterizability: Multiple runs of the workflow can be executed with different parameters, and the separate outputs can be tracked.
Standards compliance: The workflow leverages common standards to easily read in data and generates output using community standards for file formats and organization when available.
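Configurability and parameterizability can be combined by defining smart defaults in one place and overriding only what differs between runs. One lightweight way to sketch this in Python is with a frozen dataclass (the parameter names here are purely illustrative):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkflowConfig:
    # Hypothetical parameters with sensible defaults
    smoothing_mm: float = 6.0
    n_permutations: int = 1000
    output_dir: str = "outputs"


# The default run uses the smart defaults
default_cfg = WorkflowConfig()

# A parameterized run overrides only what differs; giving each run its
# own output directory keeps the separate outputs trackable
highres_cfg = replace(default_cfg, smoothing_mm=4.0,
                      output_dir="outputs/smooth-4mm")
```

Because the dataclass is frozen, a configuration cannot be silently mutated mid-run, and `replace` makes each parameterization an explicit, self-documenting object.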
Third, we care about the engineering quality of the code, which includes:
Maintainability: The workflow is structured and documented so that others (including your future self) can easily maintain, update, and extend it in the future.
Modularity: The workflow is composed of a set of independently testable modules, which can be swapped in or out relatively easily.
Idempotency: This term from computer science means that running the workflow more than once has the same effect as running it once: a re-run on unchanged inputs produces the same outputs, without unwanted side effects such as duplicated records or files that grow with each run.
Traceability: All operations are logged, and provenance information is stored for outputs.
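One simple way to support traceability is to write a small provenance record alongside each output file. The following is a minimal sketch (the function and field names are hypothetical; projects with stricter requirements may prefer an established provenance standard such as W3C PROV):

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def save_with_provenance(result: dict, out_file: Path, params: dict) -> None:
    """Write an output file plus a sidecar record describing how it was made."""
    out_file.write_text(json.dumps(result, indent=2))
    provenance = {
        "output": out_file.name,
        "output_sha256": hashlib.sha256(out_file.read_bytes()).hexdigest(),
        "parameters": params,          # the parameters used for this run
        "created": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
    }
    sidecar = out_file.with_suffix(".prov.json")
    sidecar.write_text(json.dumps(provenance, indent=2))
```

The checksum lets a later reader verify that the output file is the one the record describes, and the stored parameters make it possible to reconstruct how any given output was produced.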
Finally, we care about the efficiency of the workflow implementation. This includes:
Incremental execution: The workflow only reruns a module if necessary, such as when an input changes.
Cached computation: The workflow pre-computes and reuses results from expensive operations when possible.
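The core of incremental execution is a check of whether a step's outputs are already up to date with respect to its inputs. Build tools like make do this by comparing file modification times; a minimal sketch of the same idea in Python (the function name is hypothetical) might be:

```python
from pathlib import Path


def needs_rerun(inputs: list[Path], output: Path) -> bool:
    """Make-style incremental check: rerun only if the output is missing
    or any input is newer than it."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(p.stat().st_mtime > out_mtime for p in inputs)
```

A workflow step would call this before running and skip its work when the answer is False. Modification times are a crude signal; content hashes are more reliable at the cost of reading the files, and dedicated workflow tools handle this bookkeeping for you.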
It’s worth noting that these different desiderata will sometimes conflict with one another, and that no workflow will be perfect; for example, a highly configurable workflow will often be more difficult to maintain.
FAIR-inspired practices for workflows
In the earlier chapter on Data Management I discussed the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data. Since those principles were proposed in 2016, they have been extended to many other types of research objects, including workflows (Wilkinson et al., 2025). Unfortunately, the eyes of a reader who is not an informatician are likely to glaze over quickly when reading these articles, as they rely on concepts and jargon that will be unfamiliar to most scientists.
Realizing that most scientists are unlikely to go to the lengths of building a fully FAIR workflow, and preferring that the perfect never be the enemy of the good, I think that we can take an “80/20” approach, meaning that we can get 80% of the benefits for about 20% of the effort. We can adhere to the spirit of the FAIR Workflows principles by adopting the following practices, based in part on the “Ten Quick Tips for Building FAIR Workflows” presented by de Visser et al. (2023):
Metadata: Provide sufficient metadata in a standard machine-readable format to make the workflow findable once it is shared.
Version control: All workflow code should be kept under version control and hosted on a public repository such as GitHub.
Documentation: Workflows should be well documented. Documentation should focus primarily on the scientific motivation and technical design of the workflow, along with instructions on how to run it and description of the outputs.
Standard organization schemes: Both the workflow files (code and configuration) and data files should follow established standards for organization.
Standard file formats: The inputs and outputs to the workflow should use established standard file formats rather than inventing new formats.
Configurability: The workflow should be easily configurable, and example configuration files should be included in the repository.
Requirements: The requirements for the workflow should be clearly specified, either in a file (such as `pyproject.toml` or `requirements.txt`) or in a container configuration file (such as a `Dockerfile`).
Clear workflow structure: The workflow structure should be easily understandable.
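For a Python workflow, a minimal `pyproject.toml` can pin down both the supported Python versions and the package dependencies in one standard, machine-readable file. The project name and version bounds below are purely illustrative:

```toml
[project]
name = "my-workflow"            # hypothetical package name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "numpy>=1.26",              # illustrative version bounds
    "pandas>=2.0",
]
```

Declaring requirements this way lets tools install a consistent environment from the repository alone, which supports both the portability and reproducibility goals discussed above.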
There are certainly some contexts where a more formal structure adhering in detail to the FAIR Workflows standard may be required, as in large collaborative projects with specific compliance objectives, but these rough guidelines should get a researcher most of the way there.
In the next post I will move on to discussing workflow patterns.
