Running a simple workflow using GNU make
Better Code, Better Science: Chapter 8, Part 3
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
A simple workflow example
Most real scientific workflows are complex and can run for hours or longer; we will encounter such a workflow later in the chapter. However, we will start our discussion with a relatively simple and fast-running example that demonstrates the basic concepts of workflow execution. We will use the same data as above (from Eisenberg et al., 2019) to perform a simple workflow:
Load the demographic and meaningful variables files
Drop any non-numeric variables from each data frame
Join the data frames using their shared index
Compute the correlation matrix across all variables
Generate a clustered heatmap for the correlation matrix
I have implemented each of these components as a module here. The simplest possible workflow would be a script that imports and calls each of these functions in turn. For such a simple workflow this would be fine, but we will use the example to show how we might take advantage of more sophisticated workflow management tools.
Running a simple workflow using GNU make
One of the simplest ways to organize a workflow is using the GNU make command, which executes commands defined in a file named Makefile. make is a very handy general-purpose tool that every user of UNIX systems should become familiar with. The Makefile defines a set of targets, each with the commands (the "recipe") needed to build it, like this:
```make
.PHONY: all clean

all: step1.txt step2.txt

# this one takes no input, and outputs step1.txt
step1.txt:
	python step1.py

# this one requires step1.txt as input, and outputs step2.txt
step2.txt: step1.txt
	python step2.py -i step1.txt

clean:
	-rm step1.txt step2.txt
```

In this case, the command `make step1.txt` will run `python step1.py`, which outputs a file called step1.txt, unless that file already exists (since this target has no dependencies, existence is all that make checks). This is one of the powerful features of make: because it checks the timestamps of existing files, it can automatically rerun commands whenever any of their dependencies have changed. The command `make step2.txt` requires step1.txt, so make will first run that rule (which does nothing if the file already exists and is up to date), and then run `python step2.py -i step1.txt`, which outputs step2.txt. The command `make all` builds the all target, which depends on both output files, and `make clean` removes those files if they exist.

The targets all and clean are referred to as phony targets, since they are not meant to refer to a specific file but rather to an action. The .PHONY declaration in the Makefile denotes this, so that those commands will run even if a file called "all" or "clean" happens to exist. This should already show you why `make` is such a handy tool: any time there is a command that you run regularly in a particular directory, you can put it into a Makefile and then execute it with a single `make` call.
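The timestamp logic is easy to see in action. The following sketch drives a throwaway two-step Makefile from Python (the file names are illustrative, and it assumes GNU make is on the PATH):

```python
import os
import subprocess
import tempfile
import time

# a throwaway two-step Makefile; note the tab-indented recipe lines
MAKEFILE = (
    "step1.txt:\n"
    "\techo one > step1.txt\n"
    "step2.txt: step1.txt\n"
    "\techo two > step2.txt\n"
)

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "Makefile"), "w") as f:
    f.write(MAKEFILE)

def make(target):
    # run make for the given target and capture the commands it echoes
    result = subprocess.run(["make", target], cwd=workdir,
                            capture_output=True, text=True)
    return result.stdout

first = make("step2.txt")   # builds step1.txt, then step2.txt
second = make("step2.txt")  # nothing to do: the target is up to date
time.sleep(1)               # ensure the new timestamp is strictly newer
os.utime(os.path.join(workdir, "step1.txt"))  # like `touch step1.txt`
third = make("step2.txt")   # the dependency changed, so only step2 reruns
```

The first run executes both recipes, the second does nothing, and after the dependency's timestamp is updated, the third run rebuilds only step2.txt.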
Here is how we could build a Makefile to run our simple workflow:
```make
# if OUTPUT_DIR isn't already defined, set it to the default
OUTPUT_DIR ?= ./output

.PHONY: all clean

all: $(OUTPUT_DIR)/figures/correlation_heatmap.png

$(OUTPUT_DIR)/data/demographics.csv $(OUTPUT_DIR)/data/meaningful_variables.csv:
	@echo "Downloading data..."
	mkdir -p $(OUTPUT_DIR)/data $(OUTPUT_DIR)/results $(OUTPUT_DIR)/figures
	uv run python scripts/download_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/demographics_numerical.csv: $(OUTPUT_DIR)/data/demographics.csv
	@echo "Filtering demographics data..."
	uv run python scripts/filter_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/meaningful_variables_numerical.csv: $(OUTPUT_DIR)/data/meaningful_variables.csv
	@echo "Filtering meaningful variables data..."
	uv run python scripts/filter_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/joined_data.csv: $(OUTPUT_DIR)/data/demographics_numerical.csv $(OUTPUT_DIR)/data/meaningful_variables_numerical.csv
	@echo "Joining data..."
	uv run python scripts/join_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/results/correlation_matrix.csv: $(OUTPUT_DIR)/data/joined_data.csv
	@echo "Computing correlation..."
	uv run python scripts/compute_correlation.py $(OUTPUT_DIR)/data $(OUTPUT_DIR)/results

$(OUTPUT_DIR)/figures/correlation_heatmap.png: $(OUTPUT_DIR)/results/correlation_matrix.csv
	@echo "Generating heatmap..."
	uv run python scripts/generate_heatmap.py $(OUTPUT_DIR)/results $(OUTPUT_DIR)/figures

clean:
	rm -rf $(OUTPUT_DIR)
```

Most of the targets (except for "clean" and "all") refer to specific files that are required for the workflow. For example, the first target names the two files that need to be downloaded by the download_data.py script; this target does not depend on the outputs of any other rule, so nothing follows the colon. The other targets list their required inputs after the colon; if those inputs don't already exist, the rules that produce them will be run first. The `?=` assignment at the top sets OUTPUT_DIR only if it isn't already defined, so all outputs can be redirected by overriding it (for example, `make all OUTPUT_DIR=/tmp/myrun`). Note that make requires tabs to indent recipe commands and will fail with a "missing separator" error if spaces are used; thus, Makefile commands often need to be reformatted when copied and pasted, since copying frequently converts tabs to spaces.
We can run the entire workflow by simply running `make all`:
```
➤ make all
Downloading data...
mkdir -p ./output/data ./output/results ./output/figures
uv run python scripts/download_data.py ./output/data
Downloaded meaningful_variables.csv (522 rows)
Downloaded demographics.csv (522 rows)
Filtering demographics data...
uv run python scripts/filter_data.py ./output/data
Filtered meaningful_variables: (522, 193) -> (522, 193)
Filtered demographics: (522, 33) -> (522, 28)
Joining data...
uv run python scripts/join_data.py ./output/data
Meaningful variables: (522, 193)
Demographics: (522, 28)
Joined: (522, 221)
Computing correlation...
uv run python scripts/compute_correlation.py ./output/data ./output/results
Loaded joined data: (522, 221)
Saved correlation matrix: (221, 221)
Generating heatmap...
uv run python scripts/generate_heatmap.py ./output/results ./output/figures
Loaded correlation matrix: (221, 221)
Saved heatmap to output/figures/correlation_heatmap.png
```

The rules that refer to specific files will only be triggered if the file in question doesn't exist or is out of date, as we can see if we run the `make` command again:
```
➤ make all
make: Nothing to be done for `all'.
```

However, if we delete the heatmap file and rerun the `make` command, then the `generate_heatmap` rule will be triggered:
```
➤ make all
Generating heatmap...
uv run python scripts/generate_heatmap.py ./output/results ./output/figures
Loaded correlation matrix: (221, 221)
Saved heatmap to output/figures/correlation_heatmap.png
```

We can also take advantage of another feature of make: a rule is triggered not only when the target file doesn't exist, but also when the existing file is not newer than its dependencies. Thus, for a target like results/output.txt, the recipe runs only if the file does not exist or if it is older than its inputs. This is why we had to put the .PHONY line in the Makefile above: it tells make that those names are not meant to be interpreted as files but rather as commands, so that they will run even if files named "all" or "clean" happen to exist.
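The effect of .PHONY is easy to demonstrate with a throwaway Makefile. This sketch (illustrative names; it assumes GNU make is on the PATH) creates a file that collides with a target name and compares the behavior with and without the declaration:

```python
import os
import subprocess
import tempfile

def run_target(makefile_text, target="hello"):
    # write the Makefile, create a colliding file named after the target,
    # then ask make to build that target and capture its output
    workdir = tempfile.mkdtemp()
    with open(os.path.join(workdir, "Makefile"), "w") as f:
        f.write(makefile_text)
    open(os.path.join(workdir, target), "w").close()
    result = subprocess.run(["make", target], cwd=workdir,
                            capture_output=True, text=True)
    return result.stdout

# without .PHONY, make sees the existing "hello" file and does nothing
plain = run_target("hello:\n\techo greetings\n")
# with .PHONY, the recipe runs regardless of the colliding file
phony = run_target(".PHONY: hello\nhello:\n\techo greetings\n")
```

In the first case make reports that the target is up to date; in the second, the recipe runs despite the existing file.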
For many simple workflows, make can be a perfectly sufficient solution to workflow management, but we will see below why it is not sufficient to manage a complex workflow. For those workflows, we could either build our own more complex workflow management system, or we could use an existing software tool built to manage workflow execution, known as a workflow engine. In general I prefer to use an existing solution unless it fails to solve my problem, so I will now turn to discussing packages for workflow management.
In the next post I will introduce workflow engines in more detail.
