Streaming workflows and method chaining
Better Code, Better Science: Chapter 8, Part 2
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
One of the simplest ways to build a workflow is to stream data directly from one command to the next, so that the intermediate results are ephemeral: no information about the intermediate states is saved. Such a workflow is linear, in the sense that there is a single pathway through it. One common way to accomplish this is with pipes, a syntactic construct that feeds the results of one process directly into the next process. Some readers may be familiar with pipes from the UNIX shell, where they are represented by the vertical bar “|”. For example, let’s say that we had a log file that contains the following entries:
2024-01-15 10:23:45 ERROR: Database connection failed
2024-01-15 10:24:12 ERROR: Invalid user input
2024-01-15 10:25:33 ERROR: Database connection failed
2024-01-15 10:26:01 INFO: Request processed
2024-01-15 10:27:15 ERROR: Database connection failed

and that we wanted to generate a summary of errors. We could use the following pipeline:
grep "ERROR" app.log | sed 's/.*ERROR: //' | sort | uniq -c | sort -rn > error_summary.txt

where:
- grep "ERROR" app.log: extracts lines containing the word "ERROR"
- sed 's/.*ERROR: //': replaces everything up to the actual message with an empty string
- sort: sorts the rows alphabetically (needed because uniq only collapses adjacent duplicate lines)
- uniq -c: counts the number of appearances of each unique error message
- sort -rn: sorts the rows in reverse numerical order (largest to smallest)
- > error_summary.txt: redirects the output into a file called error_summary.txt
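To make the role of each stage concrete, here is a rough Python sketch of the same pipeline. It is only an illustration, not a replacement for the shell command: the function name summarize_errors and the in-memory list log_lines are assumptions introduced here, and the log lines are copied from the example above rather than read from app.log.

```python
from collections import Counter

# The same five log lines shown in the example above.
log_lines = [
    "2024-01-15 10:23:45 ERROR: Database connection failed",
    "2024-01-15 10:24:12 ERROR: Invalid user input",
    "2024-01-15 10:25:33 ERROR: Database connection failed",
    "2024-01-15 10:26:01 INFO: Request processed",
    "2024-01-15 10:27:15 ERROR: Database connection failed",
]


def summarize_errors(lines):
    """Hypothetical helper: return (message, count) pairs, most frequent first."""
    # grep "ERROR": keep only the lines containing the word ERROR
    error_lines = (line for line in lines if "ERROR" in line)
    # sed 's/.*ERROR: //': drop everything up to and including the last "ERROR: "
    messages = (line.rsplit("ERROR: ", 1)[-1] for line in error_lines)
    # sort | uniq -c | sort -rn: count each message and order by count
    return Counter(messages).most_common()


error_summary = summarize_errors(log_lines)
for message, count in error_summary:
    print(f"{count:7d} {message}")
```

The two generator expressions hand lines to the next stage one at a time without storing the intermediate results, which is loosely analogous to how a pipe streams output between commands. For the five lines above, the summary contains three occurrences of "Database connection failed" and one of "Invalid user input".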
Pipes are also commonly used in the R community, where they are a fundamental component of the tidyverse ecosystem of packages.
Method chaining
One way that streaming workflows can be built in Python is using method chaining, where each method returns an object on which the next method is called; this is slightly different from the operation of UNIX pipes, where it is the output of each command that is being passed through the pipe rather than an entire object. This is commonly used to perform data transformations in pandas, as it allows composing multiple transformations into a single command. As an example, we will work with the Eisenberg et al. (2019) dataset that we used in a previous chapter, to compute the probability of having ever been arrested separately for males and females in the sample. To do this we need to perform a number of operations:
- drop any observations that have missing values for the Sex or ArrestedChargedLifeCount variables
- replace the numeric values in the Sex variable with text labels
- create a new variable called EverArrested that binarizes the counts in the ArrestedChargedLifeCount variable
- group the data by the Sex variable
- select the column that we want to compute the mean of (EverArrested)
- compute the mean by group
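Written one step at a time, with each intermediate result saved to its own variable, these operations might look like the following sketch. The data frame toy_df and its values are made up for illustration and are not the Eisenberg et al. (2019) data; the intermediate variable names are likewise hypothetical.

```python
import pandas as pd

# Made-up toy data, NOT the Eisenberg et al. (2019) dataset.
toy_df = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 1, 0],
    "ArrestedChargedLifeCount": [0, 2, 0, 1, None, 3],
})

# Drop observations with missing values in either variable.
complete = toy_df.dropna(subset=["Sex", "ArrestedChargedLifeCount"])

# Replace the numeric codes in Sex with text labels.
labeled = complete.replace({"Sex": {0: "Male", 1: "Female"}})

# Create EverArrested: 1 if the count is greater than zero, else 0.
binarized = labeled.assign(
    EverArrested=(labeled["ArrestedChargedLifeCount"] > 0).astype(int)
)

# Group by Sex, select EverArrested, and compute the mean by group.
toy_stats = binarized.groupby("Sex")["EverArrested"].mean()
print(toy_stats)
```

With these toy values the result is 0.5 for Female and about 0.667 for Male. Every intermediate data frame (complete, labeled, binarized) remains available for inspection, at the cost of extra variable names.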
We can do this in a single command using method chaining in pandas. It’s useful to format the code in a way that makes the pipeline steps explicit, by putting parentheses around the whole operation; in Python, an expression enclosed in parentheses can span several lines and is still treated as a single logical line, which can be useful for making complex code more readable:
# df: data frame containing the Eisenberg et al. (2019) dataset
arrest_stats_by_sex = (df
    # keep only rows with both variables present
    .dropna(subset=['Sex', 'ArrestedChargedLifeCount'])
    # recode the numeric Sex values as text labels
    .replace({'Sex': {0: 'Male', 1: 'Female'}})
    # binarize the arrest counts (1 if ever arrested, else 0)
    .assign(EverArrested=lambda x: (
        x['ArrestedChargedLifeCount'] > 0).astype(int))
    .groupby('Sex')
    ['EverArrested']
    .mean()
)
print(arrest_stats_by_sex)

Sex
Female    0.156489
Male      0.274131
Name: EverArrested, dtype: float64

Note that pandas data frames also include an explicit .pipe method that allows using arbitrary functions within a pipeline.
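As a minimal sketch of how .pipe can be used, the example below passes a user-defined function into a chain. The helper binarize, its parameter names, and the toy data are all assumptions made for illustration; only DataFrame.pipe itself, which calls the given function with the data frame as its first argument, comes from pandas.

```python
import pandas as pd


def binarize(data, source, target):
    """Hypothetical helper: add a 0/1 column `target` that is 1 where `source` > 0."""
    return data.assign(**{target: (data[source] > 0).astype(int)})


# Made-up toy data, NOT the Eisenberg et al. (2019) dataset.
toy_df = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male", "Male"],
    "ArrestedChargedLifeCount": [0, 2, 0, 1, 4],
})

toy_stats = (
    toy_df
    # .pipe calls binarize(toy_df, source=..., target=...) and passes on the result
    .pipe(binarize, source="ArrestedChargedLifeCount", target="EverArrested")
    .groupby("Sex")
    ["EverArrested"]
    .mean()
)
print(toy_stats)
```

Because binarize is an ordinary function, it can be tested on its own outside of the chain, which is one reason to prefer a named function over a long inline lambda.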
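While these kinds of streaming workflows can be useful for simple data processing operations, they can become very difficult to debug: because the intermediate results are never saved, they cannot be inspected when something goes wrong. For this reason I would generally avoid using complex functions within a method chain.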
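One possible way to regain some visibility into a chain, sketched here under stated assumptions, is to insert a small pass-through function that records the state of the data at a given step and returns it unchanged. The helper peek, the steps list, and the toy data are hypothetical names and values introduced for this example; the sketch relies on the .pipe method of pandas data frames.

```python
import pandas as pd


def peek(data, label, log):
    """Hypothetical helper: record the shape of the data at this step,
    then return the data unchanged so that the chain continues."""
    log.append((label, data.shape))
    return data


# Made-up toy data, NOT the Eisenberg et al. (2019) dataset.
toy_df = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male", "Male"],
    "ArrestedChargedLifeCount": [0, 2, None, 1, 4],
})

steps = []  # filled in as the chain executes
toy_stats = (
    toy_df
    .pipe(peek, "raw", steps)
    .dropna(subset=["Sex", "ArrestedChargedLifeCount"])
    .pipe(peek, "after dropna", steps)
    .assign(EverArrested=lambda x: (
        x["ArrestedChargedLifeCount"] > 0).astype(int))
    .pipe(peek, "after assign", steps)
    .groupby("Sex")
    ["EverArrested"]
    .mean()
)

for label, shape in steps:
    print(label, shape)
```

Here the recorded shapes show one row being removed by dropna and one column being added by assign. When a chain needs more than a few such checks, that is probably a sign that it should be split into separate steps with named intermediate results.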
