Neural Strategies

Estimating parameters using optimization

Russ Poldrack — Tue, 02 Jun 2026 15:02:01 GMT

This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC-ND.

In many cases we don’t have a closed form solution that we can use to compute the parameter estimates directly. In this case it’s common to use some form of optimization (or search) process to find the parameters that best fit the data. The simplest way to do this is to try a large range of parameter values and choose the one that best fits the sample, which is known as a grid search. This is generally done with the goal of maximizing the likelihood of the data given the model parameters, and hence is called maximum likelihood estimation. In other words, we aim to find the values of the parameters that make the observed data most likely. In practice we would generally use the log of the likelihood rather than the likelihood itself, since these values are often very small which can result in floating point errors.
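The floating point problem mentioned above is easy to reproduce. The following is a minimal sketch, assuming stand-in data (1000 draws from a standard normal with an arbitrary seed, not the sample used in this post): the product of 1000 densities underflows to zero, while the sum of the log densities remains an ordinary finite number.

```python
import numpy as np
from scipy.stats import norm

# Stand-in data (assumption): 1000 draws from a standard normal
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Likelihood: the product of 1000 densities, each no larger than about 0.4,
# which is far smaller than the smallest representable double
likelihood = np.prod(norm.pdf(samples, loc=0.0, scale=1.0))

# Log-likelihood: the sum of the log densities stays well within range
log_likelihood = norm.logpdf(samples, loc=0.0, scale=1.0).sum()

print(f"likelihood:     {likelihood}")
print(f"log-likelihood: {log_likelihood:.2f}")
```

Because the logarithm is monotonic, the parameter values that maximize the log-likelihood are the same ones that maximize the likelihood.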

In the case of the normal distribution the maximum likelihood estimate is equivalent to the estimate that minimizes the squared error, since the sample variance (which is based on the squared error) is part of the likelihood equation. But we can also use this example to see how grid search might work with our sample. I ran a grid search using a grid of 1000 possible mean values linearly spaced across [-1, 1], and 1000 possible standard deviation values spaced across [0.5, 1.5]; these particular values were based on my knowledge that the data came from a normal distribution and that these ranges should be likely to capture the parameter values in a dataset of this size. The results came out very close to those obtained using the closed-form solution; note that the maximum likelihood estimate for the standard deviation is equivalent to the population rather than sample standard deviation (i.e. it uses N rather than N−1 in its denominator), so I corrected the sample standard deviation to make the comparison fair:

Best fit mean: 0.0190,   Best fit sd: 0.9785, loglik: -1397.4353
Sample mean:   0.0193, Population sd: 0.9787, loglik: -1397.4352

We can see this visualized in Figure 1, where we see the landscape of the likelihood across a range of possible parameter values; here we use the negative log-likelihood for visualization, since optimization methods tend to use the language of minimization rather than maximization. We can see that this landscape is smooth and only has one visible minimum; this occurs because the negative log-likelihood surface for the normal distribution is convex, which guarantees that there is a single minimum and thus that regardless of where we start our search, we are guaranteed to find the global minimum by simply following the surface downward, a process central to many optimization algorithms (including the commonly used gradient descent). As we will see below, most realistic optimization problems have multiple local minima, making them much more difficult to optimize by simply following the surface downward.

Figure 1. A visualization of the negative log-likelihood landscape for a range of parameter values in the grid search for the mean and standard deviation of a normal distribution.
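The grid search described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the code used for the results reported here: it assumes stand-in data drawn with an arbitrary seed, uses a coarser 201 x 201 grid so that it runs quickly, and all variable names are illustrative.

```python
import numpy as np

# Stand-in data (assumption): 1000 draws from a standard normal
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
n = samples.size

# Candidate parameter values (coarser than the 1000 x 1000 grid in the text)
mu_grid = np.linspace(-1, 1, 201)
sd_grid = np.linspace(0.5, 1.5, 201)

# Sum of squared deviations from each candidate mean, shape (201,)
ss = ((samples[None, :] - mu_grid[:, None]) ** 2).sum(axis=1)

# Normal log-likelihood for every (mean, sd) pair, shape (201, 201)
loglik = (-n * np.log(sd_grid)[None, :]
          - 0.5 * n * np.log(2 * np.pi)
          - ss[:, None] / (2 * sd_grid[None, :] ** 2))

# Pick the grid point with the highest log-likelihood
i_mu, i_sd = np.unravel_index(np.argmax(loglik), loglik.shape)
best_mu, best_sd = mu_grid[i_mu], sd_grid[i_sd]

print(f"Best fit mean: {best_mu:.4f}, Best fit sd: {best_sd:.4f}")
print(f"Sample mean:   {samples.mean():.4f}, "
      f"Population sd: {samples.std(ddof=0):.4f}")
```

The grid estimates can only be as precise as the grid spacing, which is one reason why the method scales so poorly: halving the spacing doubles the number of candidate values for every parameter.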

While grid search worked, it was exceedingly slow, taking more than 25 seconds to estimate the parameters that were estimated by closed form in less than a millisecond - a staggering 89,000 times slower! Grid search is inefficient even with just two parameters, and becomes exponentially less efficient with each additional parameter. A more effective and efficient way to estimate parameters is using optimization methods that are specifically built to search for parameter values that minimize a particular loss function. A common choice in Python is scipy.optimize.minimize(), which offers a number of algorithms for parameter search. We can implement this for our normal distribution data; because the function finds the minimum, we will use the negative log-likelihood as our target, which is equivalent to maximizing the log-likelihood:

import time
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def negative_log_likelihood(params, data):
    """Negative log likelihood function to minimize"""
    mu, sd = params
    # ensure sd is positive to avoid dividing by zero
    if sd <= 0:  
        return np.inf
    return -norm.logpdf(data, loc=mu, scale=sd).sum()

# initial guess
initial_params = [0, 1]

start_time = time.time()
result = minimize(negative_log_likelihood, initial_params, args=(normal_samples,), 
                  method='Nelder-Mead')
elapsed_time = time.time() - start_time

This gives us a solution that matches the closed-form estimates to about four decimal places:

Optimized mean: 0.01926,  Optimized sd: 0.97873, loglik: -1397.43520
Sample mean:    0.01933, Population sd: 0.97873, loglik: -1397.43519

These estimates were obtained about 8,000 times more quickly compared to grid search, though still about 10 times slower than the closed-form solution. Note that I had to add some initial guesses for our parameter values, and for this example I used values that were close to the known true values. However, even when the starting values are far from the true values, optimization can often find them quickly and effectively. For example, setting the starting points for both mean and standard deviation to 10,000, the resulting parameter estimates were basically identical, and it still completed more than 2,700 times faster than the grid search.

It’s common to put boundaries on an optimization when there are limits beyond which we are sure that a parameter shouldn’t go. In our example we know that the standard deviation must be positive, so we could set the lower bound on the standard deviation parameter to just above zero:

from scipy.optimize import minimize, Bounds

bounds = Bounds(lb=[-np.inf, 1e-6], ub=[np.inf, np.inf])
result = minimize(negative_log_likelihood, initial_params, args=(normal_samples,), 
                  method='L-BFGS-B', bounds=bounds)

This doesn’t have much impact on this particular problem, but with complex models and multiple parameters it’s common for parameter values to explode, and setting boundaries can help prevent that. However, as I will discuss below, it’s important to ensure that parameter estimates don’t sit at the boundaries, as this can suggest pathologies in model fitting.
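One simple way to check whether an estimate is sitting at a boundary is to compare the fitted values against the bounds after the optimizer finishes. The following is a minimal sketch, assuming stand-in data with an arbitrary seed; the helper `at_boundary` and its tolerance are hypothetical and not part of SciPy.

```python
import numpy as np
from scipy.optimize import minimize, Bounds
from scipy.stats import norm

# Stand-in data (assumption): 1000 draws from a standard normal
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

def negative_log_likelihood(params, data):
    """Negative log likelihood function to minimize"""
    mu, sd = params
    if sd <= 0:
        return np.inf
    return -norm.logpdf(data, loc=mu, scale=sd).sum()

bounds = Bounds(lb=[-np.inf, 1e-6], ub=[np.inf, np.inf])
result = minimize(negative_log_likelihood, [0, 1], args=(samples,),
                  method='L-BFGS-B', bounds=bounds)

def at_boundary(x, bounds, tol=1e-4):
    """Hypothetical helper: flag parameters within tol of a finite bound."""
    x = np.asarray(x, dtype=float)
    lower = np.asarray(bounds.lb, dtype=float)
    upper = np.asarray(bounds.ub, dtype=float)
    # Infinite bounds give an infinite distance, so they are never flagged
    return (np.abs(x - lower) < tol) | (np.abs(x - upper) < tol)

print(f"Estimates: {result.x}")
print(f"At a boundary: {at_boundary(result.x, bounds)}")
```

A flagged parameter does not prove that the fit is wrong, but it is a useful prompt to look more closely at the model and the chosen bounds.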

Automatic differentiation

The optimization methods discussed above are limited either to small numbers of parameters (like derivative-free methods such as Nelder-Mead) or small numbers of data points (like gradient-based methods such as L-BFGS that require computation of gradients across the entire dataset on each optimization step). Given this, how is it possible to train artificial neural networks that may have billions of parameters over trillions of data points? A key innovation that has enabled effective training of large models is automatic differentiation (often called autodiff for short) combined with gradient descent. Automatic differentiation takes a function definition and (when possible) automatically determines the derivatives of the loss function with respect to the parameters. Gradient descent uses those derivatives to follow the loss landscape downwards. In deep learning it’s most common to use stochastic gradient descent (SGD), which uses small mini-batches of data to iteratively estimate the gradients; even though the estimates for each individual batch are noisy, they are unbiased estimates of the true gradient and computationally cheap to obtain, such that the noise averages out over many iterations to give precise parameter estimates at comparatively low computational cost. However, given the small dataset in this example we will use the simpler standard gradient descent over the entire dataset at once.
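The claim that mini-batch gradients are noisy but unbiased can be illustrated with a toy problem. The following is a minimal sketch in plain NumPy (no autodiff), assuming a one-parameter model that predicts every observation with a single value `mu` under a squared error loss; the data, batch size, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

# Stand-in data (assumption): 1000 draws with a true mean of 3
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def grad_mse(mu, batch):
    """Gradient of mean((mu - x)^2) with respect to mu."""
    return 2.0 * (mu - batch.mean())

mu = 0.0  # current parameter guess
full_grad = grad_mse(mu, data)

# Split a shuffled copy of the data into 20 mini-batches of 50 observations
batches = rng.permutation(data).reshape(20, 50)
batch_grads = np.array([grad_mse(mu, b) for b in batches])

print(f"Full-data gradient:       {full_grad:.4f}")
print(f"Mean of batch gradients:  {batch_grads.mean():.4f}")
print(f"SD of batch gradients:    {batch_grads.std():.4f}")

# A few epochs of SGD: each step uses the gradient from a single mini-batch
mu_sgd, learning_rate = 0.0, 0.1
for epoch in range(5):
    for b in rng.permutation(data).reshape(20, 50):
        mu_sgd -= learning_rate * grad_mse(mu_sgd, b)

print(f"SGD estimate: {mu_sgd:.4f}, sample mean: {data.mean():.4f}")
```

Individual batch gradients scatter around the full-data gradient, but their average matches it, and the SGD estimate settles close to the sample mean even though no single step ever sees the whole dataset.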

As an example, we can estimate parameters for the Michaelis-Menten equation from biochemistry, which describes the rate at which an enzyme converts its substrate into its product:

$$V = \frac{V_{\max} S}{K_m + S}$$

where V is the reaction velocity, S is the concentration of the enzyme’s substrate, V_max is the maximum reaction velocity once the enzyme is saturated with substrate, and K_m is the Michaelis constant that describes the affinity of the particular enzyme for its substrate (defined as the value of S at which V = V_max/2). Figure 2 shows a plot of this function for the acetylcholinesterase enzyme, along with noisy data generated from the function.

Figure 2. A plot of the Michaelis-Menten function for acetylcholinesterase, along with data sampled from this function with added Gaussian random noise.

This equation could easily be solved using simpler methods, but it’s a nice simple example to show how model parameters can be estimated using autodiff with gradient descent. We can start by defining the Michaelis-Menten function and generating some data with random noise (shown in Figure 2; plotting code omitted):

def michaelis_menten(S, V_max, K_m):
    return (V_max * S) / (K_m + S)

V_max_true = 29  # Maximum velocity (in nM/min)
K_m_true = 6     # Michaelis constant (in mM)
noise_sd = 0.5    # Standard deviation of noise

# Generate substrate concentration data points
S = np.linspace(0.1, 30, 100)  

v_true = michaelis_menten(S, V_max_true, K_m_true)
noise = np.random.normal(0, noise_sd, size=v_true.shape)
v_observed = v_true + noise

In order to invoke the automatic differentiation mechanism in PyTorch, we simply need to specify requires_grad=True for the variables that we intend to estimate:

import torch

# Convert data to PyTorch tensors
S_tensor = torch.tensor(S, dtype=torch.float32)
v_observed_tensor = torch.tensor(v_observed, dtype=torch.float32)

# specify initial guesses
V_max_init = 10.0
K_m_init = 10.0

# Initialize parameters at the initial guesses
# requires_grad=True enables automatic differentiation
V_max = torch.tensor(V_max_init, requires_grad=True)  
K_m = torch.tensor(K_m_init, requires_grad=True)

We also need to set up a loss function that will define how far the prediction is from the data, for which we will use the squared error:

def compute_loss(V_max, K_m, S, v_observed):
    """Compute MSE loss between predicted and observed velocities."""
    v_predicted = michaelis_menten(S, V_max, K_m)
    loss = torch.mean((v_predicted - v_observed) ** 2)
    return loss

Using this we set up our training loop that uses gradient descent to estimate the parameters (with some code omitted for clarity), and assess the parameter recovery of the model by comparing the estimates to the true values:

# Hyperparameters
learning_rate = 0.1
n_iterations = 500

# Test the loss with initial parameters
initial_loss = compute_loss(V_max, K_m, S_tensor, v_observed_tensor)
print(f"Initial loss: {initial_loss.item():.4f}")

# Gradient descent training loop
for i in range(n_iterations):
    # Forward pass: compute loss
    loss = compute_loss(V_max, K_m, S_tensor, v_observed_tensor)
    
    # Backward pass: compute gradients via autodiff
    loss.backward()
    
    # Update parameters using gradient descent step
    # torch.no_grad() prevents these operations from being tracked
    with torch.no_grad():
        V_max -= learning_rate * V_max.grad
        K_m -= learning_rate * K_m.grad
        
        # Zero the gradients for the next iteration
        V_max.grad.zero_()
        K_m.grad.zero_()

print(f"\nFinal estimates: V_max = {V_max.item():.4f}, K_m = {K_m.item():.4f}")
print(f"True values:     V_max = {V_max_true:.4f}, K_m = {K_m_true:.4f}")

Initial loss: 188.0915
Final loss:     0.2006

Final estimates: V_max = 29.0894, K_m = 6.1336
True values:     V_max = 29.0000, K_m = 6.0000
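As a cross-check, the same kind of fit can be run with one of the simpler methods mentioned earlier. The following is a minimal sketch using `scipy.optimize.curve_fit`, which performs nonlinear least squares; it assumes stand-in data generated with an arbitrary seed (so the numbers will not match the output above) and the same starting guesses of 10 for both parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(S, V_max, K_m):
    return (V_max * S) / (K_m + S)

# Stand-in data (assumption): same true values and noise level as above,
# but a different random draw
rng = np.random.default_rng(0)
V_max_true, K_m_true, noise_sd = 29, 6, 0.5
S = np.linspace(0.1, 30, 100)
v_observed = (michaelis_menten(S, V_max_true, K_m_true)
              + rng.normal(0, noise_sd, size=S.shape))

# Nonlinear least squares, starting from the same initial guesses
popt, pcov = curve_fit(michaelis_menten, S, v_observed, p0=[10.0, 10.0])

# Approximate standard errors from the estimated covariance matrix
perr = np.sqrt(np.diag(pcov))

print(f"V_max = {popt[0]:.4f} (SE {perr[0]:.4f})")
print(f"K_m   = {popt[1]:.4f} (SE {perr[1]:.4f})")
```

Agreement between two independent estimation routines on the same simulated data is a cheap sanity check on the gradient descent implementation.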

Since there are only two parameters, we can easily visualize how the parameter estimate traverses the loss landscape as the estimation process moves from the initial guesses (in this case 10 for both parameters) to the final values, as shown in Figure 3.

Figure 3. A visualization of the log-loss landscape for the Michaelis-Menten optimization problem, showing the journey of the optimization process from the starting point to the ending point.

Local minima in optimization

The error landscape for the normal distribution example is convex, which means that there is a single global minimum that can be found simply by following the error gradient downwards. Claude Sonnet 4 initially tried to convince me that the Michaelis-Menten problem is convex, but was overruled by Claude Opus 4.5. Despite being non-convex, the error landscape of the Michaelis-Menten problem is smooth and relatively well behaved, as seen in Figure 3. However, many realistic scientific problems have highly complex non-convex likelihoods, such that there are numerous local minima that the optimization routine can get stuck in. Figure 4 shows an example of this.

Figure 4. A visualization of a rough loss landscape. The star shows the global minimum, and the individual trajectories show the local minima that are found when using simple gradient descent from different starting points.

There are a number of strategies that one can employ to help avoid parameter estimates that are far from the optimal answer located at the global loss minimum:

Run the estimation algorithm multiple times with different random initializations of the parameters. If they are similar between runs then this gives confidence that the estimates don’t reflect local minima. If the parameter estimates differ yet losses are similar, this suggests that the parameters may be trading off against one another, which reflects a structural problem with the model or data such that there are many equally good points in the loss landscape. This is often referred to as non-identifiability of the parameters, and is sometimes evident in correlations between the different parameter estimates.
Use an optimizer that adapts the learning rate to the local gradient, such as Adam or RMSprop.
Use an optimizer that explores more broadly before converging, such as the differential evolution method implemented in scipy.optimize.differential_evolution.
It can sometimes be helpful to reparameterize the model to help with convergence. For example, if the parameters are physically constrained to being positive, then one might consider optimizing the logarithm of the parameter rather than the natural values of the parameters; this allows the optimizer to explore the entire range of large and small numbers while respecting the positivity constraint. If the different parameters are on very different scales this can also cause problems since the optimizer needs to move at different rates in different directions of the loss space, so reparameterizing the model such that parameters are in roughly the same numeric scale can be useful.
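Two of the strategies above, multiple random initializations and a global optimizer, can be compared on a deliberately rough test surface. The following is a minimal sketch, assuming the two-dimensional Rastrigin function as a stand-in for a realistic loss (it is a standard optimization benchmark with many local minima and a global minimum of zero at the origin, and is not the function shown in Figure 4); the number of starts and the seeds are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize, differential_evolution

def rastrigin(x):
    """Benchmark loss with many local minima; global minimum of 0 at the origin."""
    x = np.asarray(x)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

bounds = [(-5.12, 5.12)] * 2
rng = np.random.default_rng(0)

# Strategy 1: run a local optimizer from many random starting points
starts = rng.uniform(-5.12, 5.12, size=(20, 2))
local_fits = [minimize(rastrigin, x0, method='L-BFGS-B', bounds=bounds)
              for x0 in starts]
local_losses = np.array([fit.fun for fit in local_fits])

# Strategy 3: a global method that explores broadly before converging
de_fit = differential_evolution(rastrigin, bounds, seed=0)

print("Distinct losses found by local runs:",
      np.unique(np.round(local_losses, 3)))
print(f"Best local loss: {local_losses.min():.4f}")
print(f"Differential evolution loss: {de_fit.fun:.6f} at {de_fit.x}")
```

The spread of final losses across the local runs is itself informative: if every start converged to the same value there would be little reason to suspect local minima, whereas a wide spread indicates that any single run should not be trusted.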

In the next post I will discuss another strategy for parameter estimation known as simulation-based inference.

Estimating parameters: Closed-form and Bayesian estimation

Russ Poldrack — Tue, 26 May 2026 15:01:36 GMT

It is very common in science to collect data and then use those data to estimate the parameters for a given model, and it’s important to be able to verify that those estimates are accurate. Given the central role of parameter estimation in code testing and validation, I now dive into the various methods that one can use to estimate model parameters, and show examples of how we might validate them. In addition to estimating model parameters, we generally also want some kind of way to quantify the uncertainty in our estimates. That is, rather than thinking of the parameter estimate as a single point value, we can ask: What range of values for the parameter are consistent with the data? This is often expressed using confidence intervals, though I will discuss below the ways that these are often misunderstood.

A central idea in this section will be the notion of parameter recovery: that is, how well can our estimation procedure recover the true parameter values using simulated data? This is particularly important in cases where we don’t have statistical guarantees on the unbiasedness of our estimates. As we will see, simulation provides a powerful tool to assess parameter recovery performance for any model.

Closed-form estimates

In some cases parameter estimates can be obtained using a closed form analytic solution. We will use the normal distribution as an example. This distribution has two parameters: a mean (sometimes called a location) that specifies where the center of the distribution falls, and a standard deviation (sometimes called a scale) that specifies the width of the distribution. The probability density function for the normal distribution is:

$$f(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$

where μ is the mean and σ is the standard deviation.

Our goal in estimating model parameters is to find estimates (in this case for the mean and standard deviation) that maximize some measure of goodness of fit with respect to the data, or equivalently, minimize some measure of error. Since we don’t want positive and negative errors to cancel each other out, we need a measure of error that is non-negative regardless of the direction of the error. The most common measure in statistics is the mean squared error:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y_i})^2$$

where y_i is the value for the i-th observation, \hat{y_i} is the estimated value for that observation from the model, and n is the sample size. In the case of the normal distribution, \hat{y_i} is the same for each observation: the mean. We can estimate the mean for the sample using the closed form solution:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where \bar{y} is the mean. We can then compute the standard deviation using this estimated mean:

$$s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Note that this is very similar to the mean squared error, differing in the presence of a square root as well as the use of n−1 rather than n in the denominator. The latter is meant to adjust for the fact that we lost one degree of freedom when we estimated the mean from the same data and then used it to compute the standard deviation. When the variance (the square of the standard deviation) is computed using this correction it will be unbiased, meaning that its expected value will match the true variance of the population. The standard deviation is still slightly biased, but less so than the one computed without the correction.
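The effect of the n−1 correction can be checked with a small parameter recovery simulation. The following is a minimal sketch, assuming many small samples (n = 5, where the bias is easy to see) drawn from a standard normal distribution with an arbitrary seed; in NumPy the `ddof` argument sets how much is subtracted from n in the denominator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n = 100_000, 5
true_var = 1.0

# Each row is one simulated sample of size n from a standard normal
samples = rng.normal(0.0, 1.0, size=(n_sims, n))

# Average each estimator across simulations to approximate its expected value
var_n = samples.var(axis=1, ddof=0).mean()    # denominator n
var_n1 = samples.var(axis=1, ddof=1).mean()   # denominator n - 1
sd_n1 = samples.std(axis=1, ddof=1).mean()    # square root of the corrected variance

print(f"True variance:                {true_var:.3f}")
print(f"Mean variance (n):            {var_n:.3f}")
print(f"Mean variance (n - 1):        {var_n1:.3f}")
print(f"Mean standard deviation (n - 1): {sd_n1:.3f}")
```

In theory the uncorrected variance has an expected value of (n−1)/n times the true variance, which is 0.8 here, while the corrected variance should average close to 1 and the corrected standard deviation should still fall somewhat below 1.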

Figure 1 shows an example of a histogram based on samples from a normal distribution, with the theoretical normal distribution based on the estimated sample mean and standard deviation overlaid. Visually it’s clear that the fitted distribution characterizes the overall shape well, even if it mismatches the shape at finer grain, due to sampling variability.

Figure 1. A histogram of 1000 samples from a standard normal distribution (with mean of zero and standard deviation of one), with the fitted normal distribution overlaid in red.

Quantifying uncertainty in closed-form estimation

In general we want not just a point estimate for our parameter but also an estimate of our uncertainty in that estimate. The confidence interval is the most commonly used method for expressing uncertainty around an estimate, and with closed form expressions it’s often possible to compute the confidence interval directly. A confidence interval is expressed in terms of a percentage, but the meaning of this percentage is often misinterpreted (see the discussion in my Statistical Thinking for more on this). The term “95% confidence interval” seems to imply that it is an interval in which we have 95% confidence that the true value of the parameter falls. However, that violates the frequentist statistical logic that underlies the computation of the confidence interval, which treats the true value as fixed, and thus it either falls in the interval or it doesn’t. Instead, the more appropriate interpretation of a frequentist confidence interval is that it is the interval that would capture the true population mean 95% of the time for samples from the same population. I prefer to frame it in a slightly different, if somewhat less precise way: The confidence interval expresses the range of plausible values for the parameter given our data, and thus tells us something about the precision of our estimate: All else being equal, a sample estimate with a narrower confidence interval is more precise than an estimate with a wider confidence interval.

Using our example from above, we can compute a confidence interval for our estimate of the sample mean. This requires that we have a probability distribution that is associated with our statistic; in this case, the Student’s t distribution is appropriate since we have estimated the standard deviation as well as the mean. The t distribution has slightly wider tails than the normal distribution, which helps account for the added uncertainty in our estimate of the standard deviation. The equation for the confidence interval around the mean using the t distribution is:

$$\mathrm{CI} = \bar{y} \pm t_{\alpha/2,\,n-1} \cdot \frac{s_y}{\sqrt{n}}$$

where s_y is the sample standard deviation, n is the sample size, and t_α/2,n−1 is the critical value of the t distribution with n−1 degrees of freedom that leaves a probability of α/2 in the upper tail. α is one minus the confidence level (0.05 for a 95% interval), and it is divided by two since we are interested in both the positive and negative directions. In our case, this results in a confidence interval of [-0.09028, 0.03249]. We can use a simulation to confirm that intervals constructed in this way capture the true population mean 95% of the time for new samples from the same distribution:

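import numpy as np
from scipy.stats import norm, t

# Simulation parameters
n_simulations = 100000
confidence_level = 0.95
alpha = 1 - confidence_level
random_state = 42
true_mean, true_sd = 0, 1
sample_size = 1000

# Use a single generator so that each simulation draws a different sample;
# passing the same integer seed on every call would repeat the same sample
rng = np.random.default_rng(random_state)

# Track how many times the CI captures the true mean
captures = 0

# Run simulations
for i in range(n_simulations):
    # Draw a new sample from the population
    sample = norm.rvs(loc=true_mean, scale=true_sd, 
        size=sample_size, random_state=rng)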
    
    # Calculate sample statistics
    sample_mean_sim = np.mean(sample)
    sample_sd_sim = np.std(sample, ddof=1)
    
    # Calculate confidence interval
    df = sample_size - 1
    t_crit = t.ppf(1 - alpha/2, df)
    se = sample_sd_sim / np.sqrt(sample_size)
    margin = t_crit * se
    
    ci_low = sample_mean_sim - margin
    ci_high = sample_mean_sim + margin
    
    # Check if CI captures the true mean
    if ci_low <= true_mean <= ci_high:
        captures += 1

# Calculate coverage rate
coverage_rate = captures / n_simulations

print(f"Simulation results:")
print(f"Number of simulations: {n_simulations}")
print(f"Sample size per simulation: {sample_size}")
print(f"True population mean: {true_mean}")
print(f"Confidence level: {confidence_level * 100}%")
print(f"\nCoverage rate: {coverage_rate:.4f} ({coverage_rate * 100:.2f}%)")

Simulation results:
Number of simulations: 100000
Sample size per simulation: 1000
True population mean: 0
Confidence level: 95.0%

Coverage rate: 0.9503 (95.03%)

Here we see that the observed proportion of simulated samples in which the confidence interval captures the true mean is very close to the 95% that we expect based on statistical theory.
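For a single sample, the interval can be computed directly from the formula and compared against SciPy's built-in calculation. The following is a minimal sketch, assuming stand-in data drawn with an arbitrary seed (so the interval will differ from the one reported above); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Stand-in data (assumption): 1000 draws from a standard normal
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# Manual calculation following the equation for the confidence interval
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_manual = (mean - t_crit * se, mean + t_crit * se)

# The same interval from scipy.stats, centered on the mean and scaled by the SE
ci_scipy = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

print(f"Manual CI: [{ci_manual[0]:.5f}, {ci_manual[1]:.5f}]")
print(f"SciPy CI:  [{ci_scipy[0]:.5f}, {ci_scipy[1]:.5f}]")
```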

The bootstrap as a general method for quantifying uncertainty

There are often cases where we don’t have a sampling distribution that we can use to form a confidence interval. In these cases, we can use a technique known as the bootstrap. This method takes advantage of resampling, meaning that we repeatedly draw samples with replacement from our full sample. We can do this using the scipy.stats.bootstrap() function, which performs the bootstrap on a sample given any statistical function:

from scipy.stats import bootstrap

# use the bias-corrected/accelerated method ('BCa')
res = bootstrap((normal_samples,), np.mean, confidence_level=0.95,      
    n_resamples=10000, method='BCa', random_state=random_state)

print(f'Bootstrap 95% CI for mean: '
    f'[{res.confidence_interval.low:.5f}, '
    f'{res.confidence_interval.high:.5f}]')
print(f'CI based on t-distribution: [{ci_lower:.5f}, {ci_upper:.5f}]')

Bootstrap 95% CI for mean:  [-0.09104, 0.03078]
CI based on t-distribution: [-0.09028, 0.03249]

Here we see that the bootstrap procedure gives results that are very close to those obtained using the closed form solution, but has the advantage of being usable with nearly any statistic (except for those based on extreme values) regardless of whether or not there is a closed form estimator and/or the sampling distribution is analytically tractable.
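To make the resampling idea concrete, the bootstrap can also be written out by hand. The following is a minimal sketch of the simpler percentile method rather than the BCa method used above, assuming stand-in data drawn with an arbitrary seed and 2000 resamples to keep memory use modest.

```python
import numpy as np

# Stand-in data (assumption): 1000 draws from a standard normal
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

n_resamples = 2000

# Each row holds the indices for one resample, drawn with replacement
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))

# Compute the statistic of interest (here the mean) for every resample
boot_means = sample[idx].mean(axis=1)

# Percentile interval: the middle 95% of the bootstrap distribution
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Percentile bootstrap 95% CI: [{ci_low:.5f}, {ci_high:.5f}]")
print(f"Sample mean: {sample.mean():.5f}")
```

Swapping `mean` for another statistic (for example `np.median` applied along `axis=1`) is all that is needed to obtain an interval for that statistic instead.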

Bayesian estimation

I noted above that the interpretation of the frequentist confidence interval is counterintuitive for most people, which leads to common misunderstandings, even among experts (Hoekstra et al., 2014). We would like a way of generating an interval that expresses our confidence about the true parameter value, but we can’t do this in the frequentist framework. However, there is a different approach to statistics that allows us to generate such an interval, known as Bayesian statistics after the Reverend Thomas Bayes whose famous equation forms the basis of this approach.

Bayesian statistics is based on a different conception of probability from the frequentist approach that underlies the standard confidence interval. Under the frequentist conception, probabilities are meant to refer to the long-run frequencies of outcomes across many samples, while the true parameter value is viewed as fixed. For this reason, it doesn’t make sense to a frequentist to say that there is a particular probability of the true parameter value; it simply is what it is. Bayesians, on the other hand, view probabilities as degrees of belief, and treat the estimation of parameters from data as a way to sharpen our belief - that is, as a learning opportunity. This means that it is perfectly legitimate in the Bayesian framework to say that there is a 95% probability that the true value of a parameter lies within a particular interval.

The fundamental idea in Bayesian statistics is that we start with a set of beliefs (known as a prior distribution), we obtain some relevant data, and then use the likelihood of those data given the possible parameter values to update our beliefs, generating a posterior distribution. I won’t go into detail about Bayesian methods here; see my Statistical Thinking for a basic overview, and Gelman et al. (2013) or McElreath (2020) for more detailed overviews. Instead I will show an example of Bayesian estimation applied to our example data above. There are several Python packages that can be used to perform Bayesian estimation; I will use the popular PyMC package. The first section sets up the Bayesian model, with priors for the mean (mu) and standard deviation (sigma) that are very broad and thus will have little influence on the outcome; in Bayesian terms these are referred to as weakly informative priors. We then perform sampling to obtain an estimate of the posterior distribution of the parameters given the data. Using these distributions, we can then find the narrowest interval that contains 95% of the mass of the posterior distribution, which is known as the highest density interval (HDI), a type of credible interval that contains the most likely values. This interval serves as a Bayesian alternative to the frequentist confidence interval, allowing us to legitimately describe it as the interval that has a 95% probability of containing the true value.

import pymc as pm
import arviz as az

# Bayesian estimation using PyMC
with pm.Model() as model:
    # Priors for unknown model parameters
    mu = pm.Normal('mu', mu=0, sigma=1000)  # Prior for mean
    sigma = pm.HalfNormal('sigma', sigma=100)  # Prior for standard deviation (must be positive)
    
    # Likelihood (sampling distribution) of observations
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=normal_samples)
    
    # Posterior sampling
    trace = pm.sample(10000, tune=1000, return_inferencedata=True, random_seed=42)

# Extract posterior estimates
posterior_mean = trace.posterior['mu'].mean().values
posterior_sd = trace.posterior['sigma'].mean().values

# extract highest density interval
hdi = az.hdi(trace, hdi_prob=0.95)
hdi_values = hdi.mu.values

print(f"Posterior mean: {posterior_mean:.5f}, Posterior sd: {posterior_sd:.5f}")
print(f"Sample mean: {sample_mean:.5f}, Sample sd: {sample_sd:.5f}")
print(f'95% HDI values: {hdi_values}')
print(f'95% CI based on t-distribution: [{ci_lower:.5f}, {ci_upper:.5f}]')

Posterior mean: -0.02895, Posterior sd: 0.99033
Sample mean: -0.02889, Sample sd: 0.98922
95% HDI values:                 [-0.08993, 0.03335]
95% CI based on t-distribution: [-0.09028, 0.03249]

In this case, the Bayesian HDI turns out to be very close to the parametric confidence interval. We can also obtain a visualization of the full posterior distributions obtained through Bayesian estimation, which are shown in Figure 2:

# Visualize posterior distributions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot posterior for mu
az.plot_posterior(trace, var_names=['mu'], ax=axes[0])
axes[0].axvline(sample_mean, color='red', linestyle='--', label='Sample mean')
axes[0].legend()

# Plot posterior for sigma
az.plot_posterior(trace, var_names=['sigma'], ax=axes[1])
axes[1].axvline(sample_sd, color='red', linestyle='--', label='Sample sd')
axes[1].legend()

Figure 2. Posterior distributions for mean (mu) and standard deviation (sigma) obtained using Bayesian estimation, with the 95% highest density interval shown by the gray bar at the base of the plot.

Bayesian estimation can be particularly useful when one has a strong prior belief about the value of a parameter and wishes to update that belief based on data. For example, let’s say that there was a published dataset that reported a particular parameter value, and a researcher performs additional observations and wants to update that parameter estimate. Bayesian estimation allows this by the specification of the prior probability distribution. In the example above we used a relatively non-informative prior for the mean (a normal distribution with mean of zero and standard deviation of 1000, which allows for a very wide set of possibilities). However, if we have existing data then we can use those data to inform our subsequent analyses, consistent with the idea that Bayesian inference is a form of learning from data. One can also provide a prior based on one’s scientific hypotheses or expectations, and the ability to incorporate prior knowledge into parameter estimation is generally taken as a strength of Bayesian methods; however, one must be sure that the prior doesn’t overwhelm the data dogmatically, effectively forcing a particular answer regardless of what the data say.
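The way a prior trades off against the data can be seen in the simplest possible case. The following is a minimal sketch that departs from the PyMC model above by assuming the standard deviation of the data is known, in which case a normal prior on the mean has a closed-form normal posterior; the function `normal_posterior`, the simulated data, and the prior values are all hypothetical choices for illustration.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, known_sd):
    """Posterior for a normal mean with a normal prior and known data sd.

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    n = len(data)
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / known_sd**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * np.mean(data)) / post_precision
    return post_mean, np.sqrt(1.0 / post_precision)

# Stand-in data (assumption): 20 new observations with a true mean of 0.5
rng = np.random.default_rng(0)
new_data = rng.normal(loc=0.5, scale=1.0, size=20)

# A very broad prior barely moves the estimate away from the sample mean
vague = normal_posterior(0.0, 1000.0, new_data, known_sd=1.0)

# A narrow prior centered on zero pulls the estimate strongly toward zero
informative = normal_posterior(0.0, 0.1, new_data, known_sd=1.0)

print(f"Sample mean:                {np.mean(new_data):.4f}")
print(f"Posterior (vague prior):    mean {vague[0]:.4f}, sd {vague[1]:.4f}")
print(f"Posterior (narrow prior):   mean {informative[0]:.4f}, sd {informative[1]:.4f}")
```

With the narrow prior the prior precision (100) outweighs the data precision (20), so the posterior mean ends up at one sixth of the sample mean; this is the kind of prior dominance that is worth checking for before trusting an informative-prior analysis.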

One drawback of Bayesian methods is that they can be very computationally expensive. For example, the Bayesian estimation above took a bit over 2 seconds using 4 parallel sampling processes, which is much slower than the 189 microseconds required for closed-form estimation and also substantially slower than the optimization methods discussed in the next section. There are alternative Bayesian methods known as variational Bayes that use mathematical tricks to speed up estimation, but often require substantial mathematical skill to develop, though some packages like PyMC now offer built-in variational Bayes methods.

In the next post I will turn to parameter estimation using optimization methods.

I apologize for the wonky formatting of the mathematical features in the text; unfortunately Substack doesn’t seem to support in-line LaTeX formatting.

Simulating data

Russ Poldrack — Tue, 19 May 2026 15:02:39 GMT

In this post I will continue the discussion of simulation, focusing on how to generate simulated data from a mathematical model or existing data.

Simulating data from a model

In some cases, we want to simulate data that have particular structure in order to test whether our code can properly identify the structure in the data. Depending on the kind of structure one needs to create, there are often existing tools that can help generate the data. For example, the scikit-learn package has a large number of data generators that are often useful, either on their own or as a starting point to develop a custom generator. Similarly, the NetworkX graph analysis package has a large number of graph generators available.

Let’s say that we have developed a tool that implements a novel method for the discovery of causal relationships from timeseries data. We would like to generate data from a known causal graph (which is represented as a directed acyclic graph, just like our workflow graphs in the previous chapter). For this, we can use an existing graph; I chose one based on a dataset of gene expression in E. coli bacteria that was used by Schafer & Strimmer (2005) and is shared via the pgmpy package:

from IPython.display import Image
from pgmpy.utils import get_example_model

# Load the model
ecoli_model = get_example_model('ecoli70')

# Visualize the network and save to an image file
viz = ecoli_model.to_graphviz()
viz.draw(IMAGE_DIR / 'ecoli.png', prog='dot')

Figure 1 shows the resulting rendering of that network, which has 46 nodes (representing individual genes) and 70 directed edges (representing causal relationships on gene expression between nodes).

Figure 1. A plot of the graphical model for the E. coli gene expression data generated by Schafer & Strimmer, 2005.

Given this DAG, we then need to generate timeseries data for expression of each gene that reflect the causal relationships between the genes as well as the autocorrelation in gene expression within genes measured over time. For this, we turn to the tigramite package, which is primarily focused on causal discovery from timeseries data, but also includes a function that can generate timeseries data given a graphical model. However, the tigramite package requires a different representation of the graphical model than the one obtained from pgmpy, so we have to convert the edge representation from the original to the link format required for tigramite:

def generate_links_from_pgmpy_model(model, coef=0.5, ar_param=0.6):
    nodes, edges = model.nodes(), model.edges()
    noise_func = lambda x: x 
    links = {}

    # create dicts mapping node names to indices and vice versa
    node_to_index = {node: idx for idx, node in enumerate(nodes)}
    index_to_node = {idx: node for node, idx in node_to_index.items()}

    # add edges from the pgmpy model
    for edge in edges:
        cause = node_to_index[edge[0]]
        effect = node_to_index[edge[1]]
        # for simplicity, use lag 1, constant coef and no edge noise
        links.setdefault(effect, []).append( ((cause, -1), coef, noise_func) )

    # add a self-connection to all nodes to simulate autoregressive behavior
    for node in nodes:
        idx = node_to_index[node]
        links.setdefault(idx, []).append( ((idx, -1), ar_param, noise_func) )

    return links, node_to_index, index_to_node

We can then create a function to take in the original model, convert it, and generate timeseries data for the model:

import numpy as np
from tigramite.data_processing import DataFrame
from tigramite.toymodels.structural_causal_processes import structural_causal_process

def generate_data(model, noise_sd=1, tslength=500, seed=42, coef=0.5, ar_param=0.6):
    links, node_to_index, index_to_node = generate_links_from_pgmpy_model(model, 
        coef=coef, ar_param=ar_param)
    rng = np.random.default_rng(seed)
    # Generate timeseries from the structural causal process defined by the links
    data, _ = structural_causal_process(links, T=tslength, seed=seed)
    # Add independent Gaussian observation noise to each timeseries
    data = rng.normal(scale=noise_sd, size=data.shape) + data
    # Wrap the data in a tigramite DataFrame
    return DataFrame(data), index_to_node

# we will need the index_to_node mapping later
ecoli_dataframe, index_to_node = generate_data(ecoli_model, noise_sd=1, 
    tslength=500, seed=42)
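To make the data-generating model concrete, here is a minimal NumPy sketch of the kind of lag-1 linear process that these links describe, for a hypothetical three-node chain (node 0 → node 1 → node 2). It is a simplified stand-in rather than tigramite's implementation, and the helper function, its arguments, and the burn-in period are assumptions.

```python
import numpy as np

def simulate_lag1_process(parents, n_nodes, coef=0.5, ar_param=0.6,
                          tslength=500, burn_in=100, seed=42):
    """Simulate x[t, j] = ar_param * x[t-1, j] + coef * sum(x[t-1, parents of j]) + noise."""
    rng = np.random.default_rng(seed)
    # weights[i, j] is the lag-1 effect of node i on node j
    weights = np.eye(n_nodes) * ar_param
    for effect, causes in parents.items():
        for cause in causes:
            weights[cause, effect] = coef
    total = tslength + burn_in
    x = np.zeros((total, n_nodes))
    for t in range(1, total):
        x[t] = x[t - 1] @ weights + rng.normal(size=n_nodes)
    # discard the initial transient so the series starts near its steady state
    return x[burn_in:]

# hypothetical chain: node 0 -> node 1 -> node 2
parents = {1: [0], 2: [1]}
data = simulate_lag1_process(parents, n_nodes=3)

# the true link 0 -> 1 shows up as a lagged correlation
lagged_corr = np.corrcoef(data[:-1, 0], data[1:, 1])[0, 1]
print(data.shape, f"lag-1 correlation for 0 -> 1: {lagged_corr:.2f}")
```

Because every node is also autocorrelated, simple lagged correlations can be nonzero even where no direct link exists, which is part of why a conditional independence approach is needed for causal discovery.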

Now that we have the dataset we can test out our estimation method. Since I don’t actually have a new method for causal estimation on timeseries, I will instead use the PCMCI method described by Runge et al, 2019 and implemented in the tigramite package:

from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr

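def run_pcmci(dataframe):
    # Initialize PCMCI with partial correlation-based independence test
    pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())
    # Run PCMCI to discover causal links
    results = pcmci.run_pcmci(tau_max=1, pc_alpha=None)
    # Add FDR-corrected p-values to the results for thresholding below
    # (assumed correction method: Benjamini-Hochberg)
    results['fdr_p_matrix'] = pcmci.get_corrected_pvalues(
        p_matrix=results['p_matrix'], tau_max=1, fdr_method='fdr_bh')
    return results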

results = run_pcmci(ecoli_dataframe)

The results from this analysis include a list of all of the edges that were identified from the data using causal discovery, which we can summarize to determine how well the model performed. First we need to extract the links that were discovered from the results which pass our intended false discovery rate threshold:

def extract_discovered_links(results, index_to_node, q_thresh=0.00001):
    discovered_links = []
    fdr_p = results['fdr_p_matrix'][:, :, 1]  # use only lag 1 p-values
    links = np.where(fdr_p < q_thresh)
    for (i, j) in zip(links[0], links[1]):
        if i != j:
            discovered_links.append((index_to_node[i], index_to_node[j]))
    return discovered_links

discovered_links = extract_discovered_links(results, index_to_node, .01)

Then we can summarize the results:

def get_edge_stats(edges, discovered_links, verbose=True):
    true_edges = set(edges)
    discovered_edges = set(discovered_links)
    true_positives = true_edges.intersection(discovered_edges)
    false_positives = discovered_edges.difference(true_edges)
    false_negatives = true_edges.difference(discovered_edges)

    true_positive_rate = len(true_positives) / len(true_edges) if len(true_edges) > 0 else 0
    
    # Precision: proportion of discoveries that are true
    precision = len(true_positives) / len(discovered_edges) if len(discovered_edges) > 0 else 0
    
    # False Discovery Rate: proportion of discoveries that are false
    false_discovery_rate = len(false_positives) / len(discovered_edges) if len(discovered_edges) > 0 else np.nan
    
    f1_score = (2 * len(true_positives)) / (2 * len(true_positives) + \
        len(false_positives) + len(false_negatives)) if (len(true_positives) + len(false_positives) + len(false_negatives)) > 0 else np.nan
    
    if verbose:
        print(f'{len(true_edges)} true edges')
        print(f'discovered {len(discovered_edges)} edges')
        print(f"True Positive Rate (Recall): {true_positive_rate:.2%}")
        print(f"Precision: {precision:.2%}")
        print(f"False Discovery Rate: {false_discovery_rate:.2%}")
        print(f"F1 Score: {f1_score:.2%}")

    return {
        'true_positives': true_positives,
        'false_positives': false_positives,
        'false_negatives': false_negatives,
        'true_positive_rate': true_positive_rate,
        'precision': precision,
        'false_discovery_rate': false_discovery_rate,
        'f1_score': f1_score
    }

edge_stats = get_edge_stats(ecoli_model.edges(), discovered_links)

70 true edges
discovered 87 edges
True Positive Rate (Recall): 100.00%
Precision: 80.46%
False Discovery Rate: 19.54%
F1 Score: 89.17%
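As a quick sanity check, the summary metrics can be recomputed by hand from the counts in this output: 70 true edges, 87 discovered edges, and 100% recall together imply 70 true positives, 17 false positives, and no false negatives.

```python
true_positives = 70
false_positives = 87 - 70  # discovered edges that are not true edges
false_negatives = 0        # recall of 100% means that no true edges were missed

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
fdr = false_positives / (true_positives + false_positives)
f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)

print(f"{recall:.2%} {precision:.2%} {fdr:.2%} {f1:.2%}")
# prints: 100.00% 80.46% 19.54% 89.17%
```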

The results showed that the model performed reasonably well, detecting all 70 of the true relationships along with 17 false relationships (a false discovery rate of about 20%). In general we would want to do additional validation to make sure that the results behave in the way that we expect. For example, we would expect better model performance with stronger signal, and we would expect fewer edges identified when the p-value threshold is more stringent. We can use the functions generated above to run a simulation of this:

# loop over signal levels and q values to see effect on performance

noise_sd = 1
tslength = 500
q_values =  [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
signal_levels = np.arange(0, 0.7, 0.1)
performance_results = []

for signal_level in signal_levels:
    dataframe, index_to_node = generate_data(ecoli_model, noise_sd=noise_sd, tslength=tslength, seed=42, coef=signal_level, ar_param=0.6)
    results = run_pcmci(dataframe)
    for q in q_values:
        discovered_links = extract_discovered_links(results, index_to_node, q_thresh=q)
        edge_stats = get_edge_stats(ecoli_model.edges(), discovered_links, verbose=False)
        performance_results.append({
            'noise_sd': noise_sd,
            'q_value': q,
            'tslength': tslength,
            'signal_level': signal_level,
            'true_positive_rate': edge_stats['true_positive_rate'],
            'precision': edge_stats['precision'],
            'false_discovery_rate': edge_stats['false_discovery_rate'],
            'f1_score': edge_stats['f1_score']
        })

performance_df = pd.DataFrame(performance_results)

We can then plot these results, as shown in Figure 2. The results confirm that the model is performing as expected, with recall increasing as the true signal becomes stronger and as the FDR threshold becomes less stringent.

Figure 2. A plot of observed true positive rate (TPR) and false discovery rate (FDR) at increasing signal levels for varying FDR thresholds.

Simulating data based on existing data

It’s very common for researchers to collect a dataset of interest and then develop code that implements their analysis on that dataset to ask their questions of interest. However, this approach raises a concern that the choices made in the course of analysis might be biased by the specific features of the dataset (Gelman & Loken, 2019). In particular, decisions might be made that reflect the noise in the dataset, rather than the true signal, which is often referred to as overfitting (discussed further below). In some fields (particularly in physics) it is common to perform blind analysis (MacCoun & Perlmutter, 2015), in which analysts are given data that are either modified or relabeled, in order to prevent them from being biased by their hypotheses. One way to achieve this in the context of data analysis is to develop the code using a simulated dataset that has some of the same features as the real dataset, such that one can implement the code, validate it, and then immediately apply it to the real data once they are made available. To achieve this, one needs to be able to generate simulated data based on an existing dataset; for blind analysis, the generation of the simulated data should be performed by a different member of the research team. For example, in some cases I have generated the simulated data for a study based on the real data and provided those to my students, only providing them with the real data once the code was implemented and validated.

The important question in generating simulated data from real data is what specific features one intends to capture from the real data. This generally will require some degree of domain expertise in order to understand the features of the data. Some common features that one might wish to replicate are:

Data types (e.g. categorical, integer, floating point)
Marginal distributions of the values (minimally the range, preferably the shape or summary statistics)
Joint distributions of the variables (e.g. capturing correlations between variables)

It’s generally important to avoid including features in the model that are directly relevant to the hypothesis. For example, if the hypothesis relates to correlations between specific variables in the dataset, then the correlation in the simulated data should not be based on the correlation in the real data, lest the analysis be biased.

Here I will focus primarily on tabular data; while there are simulators to generate more complex types of data, such as genome wide association data and functional magnetic resonance imaging data, these require substantial domain expertise to use properly, whereas tabular data are widely applicable. For simple datasets it may be most appropriate to generate simulated data by hand; here I will use the Synthetic Data Vault (SDV) Python package, which has powerful tools for generating many kinds of synthetic data.
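For a simple table, generating data by hand might look like the sketch below, which draws each column independently from simple summaries of the observed values, so that data types and rough marginal distributions are preserved while correlations are not. The toy data frame, its column names, and both helper functions are hypothetical stand-ins, not the dataset or code used in this chapter.

```python
import numpy as np
import pandas as pd

def simulate_column(values, n, rng):
    """Simulate one column from simple summaries of the observed values."""
    if pd.api.types.is_numeric_dtype(values):
        # numeric: normal with the observed mean/sd, clipped to the observed range
        sim = rng.normal(values.mean(), values.std(), size=n)
        sim = np.clip(sim, values.min(), values.max())
        if pd.api.types.is_integer_dtype(values):
            sim = np.round(sim).astype(int)
        return sim
    # categorical: sample categories with their observed proportions
    freqs = values.value_counts(normalize=True)
    return rng.choice(freqs.index.to_numpy(), size=n, p=freqs.to_numpy())

def simulate_by_hand(df, seed=42):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({col: simulate_column(df[col], len(df), rng)
                         for col in df.columns})

# toy "real" data in which score depends on age
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
df_toy = pd.DataFrame({
    "age": age,
    "score": 0.1 * age + rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})

df_sim = simulate_by_hand(df_toy)
print(f"original correlation:  {df_toy['age'].corr(df_toy['score']):.2f}")
print(f"simulated correlation: {df_sim['age'].corr(df_sim['score']):.2f}")
```

The normal approximation is a deliberate simplification; for skewed or multimodal variables one would want to resample the observed values or fit a more appropriate distribution.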

As an example, I will use the Eisenberg et al. (2018) data that you have already seen on a couple of occasions. I’ll start by picking out a few variables and then using SDV to create a synthetic dataset whose distributions for each variable match those in the original, but the columns are generated independently, which removes any correlations between columns. The full analysis is shown here. After loading and combining the demographic and behavioral data frames, selecting a few important variables, and joining them into a single frame (df_orig), I then use SDV to generate simulated data for each variable, shuffling each column after generation to destroy any correlations:

import warnings

import numpy as np
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import Metadata

def generate_independent_synthetic_data(df, random_seed=42):
    """
    Generate synthetic data where all variables are independent.
    
    Uses SDV to model the full dataset, then shuffles each column 
    independently to break all correlations while preserving marginal distributions.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Original dataframe to generate synthetic version of
    random_seed : int, optional
        Random seed for reproducibility (default: 42)
        
    Returns:
    --------
    pd.DataFrame
        Synthetic dataframe with same shape and column names as input,
        but with independent variables
    """
    # Suppress the metadata saving warning
    warnings.filterwarnings('ignore', message='We strongly recommend saving the metadata')
    
    # Set random seed
    if random_seed is not None:
        np.random.seed(random_seed)
    
    # Create metadata for the full dataset
    metadata = Metadata.detect_from_dataframe(
        data=df,
        table_name='full_data'
    )
    
    # Create synthesizer for the full dataset
    synthesizer = GaussianCopulaSynthesizer(
        metadata,
        enforce_rounding=False,
        enforce_min_max_values=True,
        default_distribution='norm'
    )
    
    # Fit synthesizer to the full dataset
    synthesizer.fit(df)
    
    # Generate synthetic data
    df_synthetic = synthesizer.sample(num_rows=len(df))
    
    # CRITICAL: Shuffle each column independently to break all correlations
    # This preserves the marginal distribution of each variable but eliminates dependencies
    for col in df_synthetic.columns:
        shuffled_values = df_synthetic[col].values.copy()
        np.random.shuffle(shuffled_values)
        df_synthetic[col] = shuffled_values
    
    return df_synthetic

We can then visualize the correlations and distributions for the original data and the synthetic data; in Figure 3 we see that the distributions in the synthetic data are very similar to those in the original data, while in Figure 4 we see that the synthetic data do not include the original correlations.

Figure 3. A comparison of the distributions of the original and synthetic data for several of the variables in the example dataset.

Figure 4. A comparison of the correlation matrices for the numeric variables in the original and synthetic data.
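The effect of the column-wise shuffle can also be checked numerically. The following sketch uses two hypothetical correlated variables rather than the real dataset, and confirms that shuffling each column independently leaves the values in each column unchanged while removing the correlation between them.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)  # correlated with x by construction
data = np.column_stack([x, y])

# shuffle each column independently, as in the function above
shuffled = data.copy()
for col in range(shuffled.shape[1]):
    shuffled[:, col] = rng.permutation(shuffled[:, col])

corr_before = np.corrcoef(data, rowvar=False)[0, 1]
corr_after = np.corrcoef(shuffled, rowvar=False)[0, 1]
# each column contains exactly the same values as before, in a different order
same_marginals = all(
    np.array_equal(np.sort(data[:, col]), np.sort(shuffled[:, col]))
    for col in range(data.shape[1])
)

print(f"correlation before: {corr_before:.2f}, after: {corr_after:.2f}")
print(f"marginals preserved: {same_marginals}")
```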

The SDV package also offers many additional tools for more sophisticated generation of synthetic data. In subsequent sections I will show additional ways to use synthetic data for validation of scientific data analysis code.

Validating scientific software using simulations

Russ Poldrack — Tue, 12 May 2026 15:01:18 GMT

So far I have focused very heavily on reproducibility, that is, the ability to generate the same answer when code is run repeatedly. However, it’s easy to reliably generate the wrong answer! In measurement theory there is a fundamental distinction between reliability and validity: Reliability means performing the same method repeatedly results in highly similar results, whereas validity refers to whether the estimated result is close to the true result. In this chapter we turn to the validation of scientific software, by which I mean the degree to which it performs the intended task as expected and gets the answers right.

Creating simulations

Creating simulations is perhaps the most important tool that computers offer the scientist, as captured in a well-worn quote by Press et al. in their Numerical Recipes book:

offered the choice between mastery of a five-foot shelf of analytical statistics books and middling ability at performing statistical Monte Carlo simulations, we would surely choose to have the latter skill. (p.691)

Simulations are indeed a powerful way to understand a system even when it’s not analytically tractable. More importantly, they are generally the only way that we can establish ground truth against which we can compare our models. As scientists we never know the true process that generates our data, but with simulations we can have complete control over the data generation process.

My previous book, Statistical Thinking, gives an overview of how to use simulations in the context of statistics; here I will focus primarily on the use of simulation in the context of software validation, but I recommend that book for background reading if you aren’t already familiar with the concept of a statistical distribution.

Generating random numbers

The most fundamental requirement in nearly any simulation is the ability to generate random numbers [1]. What makes a series of numbers random is that it is impossible (or at least nearly impossible) to predict the next value in the series. Random numbers are defined by the distribution that characterizes them, which is a mathematical function that describes the “shape” of the data when they are summarized according to the relative frequency of different values or ranges of values. Picking the correct distribution is essential to ensure that any simulation performs as advertised. Fortunately, there are lots of existing packages that provide tools to generate random numbers for nearly any distribution; we will focus on the NumPy package here since it is the most commonly used.

The simplest distribution is the uniform distribution, in which any possible value (within a particular range for continuous values) has the same probability of occurring. We can generate uniform random variates by first creating a random number generator object using np.random.default_rng(), and then calling rng.uniform() which returns random samples from the distribution:

rng = np.random.default_rng()
rng.uniform(size=10)

array([0.56449692, 0.6880841 , 0.43249236, 0.28950554, 0.02708363,
       0.61239335, 0.30663968, 0.3854357 , 0.57454511, 0.07974661])

In this case, rng.uniform() by default generates floating point values that fall within the half-open interval [0, 1); this can be changed using the low and high parameters to the function. If we generate a large number of these then we can create a distribution plot (often called a histogram) showing how the numbers are distributed, as shown in Figure 1.

Figure 1. Distribution plots for 1,000,000 random samples from each of six different distributions.
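As a small illustration of changing the range, the following sketch draws uniform samples using the default range and then using the low and high arguments of rng.uniform(); the specific range of [-1, 1) is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# default range: values in [0, 1)
default_samples = rng.uniform(size=100_000)
# custom range: values in [-1, 1)
shifted_samples = rng.uniform(low=-1, high=1, size=100_000)

print(f"default: min={default_samples.min():.3f}, max={default_samples.max():.3f}")
print(f"shifted: min={shifted_samples.min():.3f}, max={shifted_samples.max():.3f}")
print(f"shifted mean: {shifted_samples.mean():.3f}")
```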

For purposes of reproducibility it’s often useful to be able to regenerate exactly the same series of random samples. We can do this by specifying a random seed, which gives the random number generator a starting point. If you are going to do simulations then it is important to understand the specific random number generator that your code will use. This blog post provides an excellent introduction to the NumPy random generation system; here I will only give a brief overview. Previously it was common to use the global NumPy random seed function (np.random.seed()) to set the seed, and this is still necessary when using packages that access the global random number generator. However, the best practice is to generate a random number generator object (using np.random.default_rng()), and then call the methods of that object to obtain random numbers, as I did above. This prevents surprises in case other functions modify the global seed, helps isolate your specific generator, and enables multiple parallel generators. Here is an example:

rng = np.random.default_rng(seed=42)
rng.uniform(size=4)

array([0.77395605, 0.43887844, 0.85859792, 0.69736803])

If we run this again, we see that a different series of numbers is generated:

rng.uniform(size=4)

array([0.09417735, 0.97562235, 0.7611397 , 0.78606431])

However, if we generate another object with the same seed, we will see that it gives us the same values as above:

rng2 = np.random.default_rng(seed=42)
rng2.uniform(size=4)

array([0.77395605, 0.43887844, 0.85859792, 0.69736803])

rng2.uniform(size=4)

array([0.09417735, 0.97562235, 0.7611397 , 0.78606431])

Setting random seeds is important to enable exact reproducibility of results generated using random numbers. However, it’s also important to ensure that one’s results are robust to the choice of random seed, as I will discuss later in the context of machine learning analyses.
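One way to check robustness to the seed is to repeat a simulation across a set of independent generators. The sketch below uses np.random.SeedSequence to spawn child seeds from a single parent seed, which also illustrates how multiple parallel generators can be created reproducibly; the toy "analysis" function is a hypothetical placeholder for a real simulation.

```python
import numpy as np

def simulate_sample_mean(rng, n=100):
    """Toy 'analysis': the mean of n standard normal samples."""
    return rng.normal(size=n).mean()

def run_across_seeds(parent_seed, n_runs=20):
    # spawn statistically independent child seeds from a single parent seed
    child_seeds = np.random.SeedSequence(parent_seed).spawn(n_runs)
    return np.array([simulate_sample_mean(np.random.default_rng(seed))
                     for seed in child_seeds])

estimates = run_across_seeds(42)
repeat = run_across_seeds(42)

print(f"identical on rerun: {np.array_equal(estimates, repeat)}")
print(f"spread across seeds (SD): {estimates.std():.3f}")
```

The whole set of runs is exactly reproducible from the parent seed, while the spread across child seeds shows how much the result depends on any one particular seed.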

Choosing a distribution

Choosing the right distribution for a simulation often comes down to understanding the data-generating process and kind of data that are being modeled. Here are a few examples of distributions and their common use cases:

Discrete outcomes:

Bernoulli: A distribution of binary outcomes (often interpreted as success vs failure) given a probability of a positive outcome.
- Examples: Whether a patient responds to a given treatment, whether a hard drive fails within a particular period of time.
Binomial: A distribution of the number of successes across a specific number of Bernoulli trials, given a probability of a positive outcome.
- Examples: The number of patients who respond to treatment in a clinical trial treatment group, the number of hard drives that fail within a particular period of time at a particular data center
Categorical: A distribution with several distinct possible outcomes, each of which has a particular probability:
- Examples: Eye color across a population, programming languages used by programmers in a company
Uniform: A specific form of a discrete categorical variable with equal probability
- Examples: equiprobable physical outcomes such as a dice roll.
Multinomial: A multivariate generalization of the binomial, representing the counts of multiple possible outcomes across a set of independent trials with fixed probabilities of each outcome.
- Examples: Frequencies of types of stars in a galaxy, frequencies of cell types in a tissue sample

Continuous outcomes:

Uniform: A distribution with equal probability density for all values within the range.
- Examples: probabilities of events when there is no prior knowledge, equiprobable continuous physical outcomes
Beta: A generalization of the uniform distribution that models the probability of values within a range but allows different values to have different probabilities.
- Examples: Prior probabilities in a Bayesian model, proportions of time spent in a particular state.
Normal (or Gaussian): A symmetric distribution centered around a mean, which commonly arises when an outcome is generated based on many small additive contributions. The Central Limit Theorem explains why this occurs so frequently, as it should arise for sums of independent random variables sampled from any distribution, as long as the sample size is large enough and the distribution has finite variance.
- Examples: Height of individuals in a population, measurement errors for continuous variables
Log-normal: Distribution of positive continuous values whose logarithm is normally distributed. It commonly arises when an outcome is generated as the product of many independent positive random variables.
- Examples: Wealth and income distributions in a population, biological growth processes

Count data:

Poisson: This is a distribution of counts of events within a fixed interval assuming that the events are independent and occur at a constant rate. Unlike the binomial there is no limit on the number of events that can occur, and it is the limiting case of the binomial when the sample size n approaches infinity and the probability p approaches zero (while the product n × p remains constant).
- Examples: The number of atoms decaying within a particular period, the number of emails that a person receives within a day
Negative binomial: A distribution that models count data that are overdispersed, meaning that their variance is greater than their mean. It can also be interpreted as representing the number of failures that occur before a given number of successes.
- Example: Commonly used in genomics to model read counts in genomic sequencing.

Waiting time data:

Exponential: A distribution of waiting times between events in a Poisson process, which has a constant hazard rate (i.e. the probability of the event happening in the next period does not depend on how long one has already waited).
- Examples: Time between atomic decays, time between hard drive failures
Weibull: A generalization of the exponential distribution that allows modeling of waiting times with increasing, constant, or decreasing hazard rates.
- Examples: Response times in human behavior, time to failure for some electronic devices

It is essential to choose the right distributions for a simulation; otherwise the results may be misleading at best or meaningless at worst.
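Several of the properties described above can be checked directly by simulation. The following sketch uses the NumPy generator methods for a few of these distributions; the parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Poisson: the mean and variance are both equal to the rate
poisson = rng.poisson(lam=3, size=n)
# Binomial with large n and small p (n * p = 3) approximates the Poisson
binomial = rng.binomial(n=10_000, p=3 / 10_000, size=n)
# Negative binomial: failures before 3 successes; overdispersed (variance > mean)
negbin = rng.negative_binomial(n=3, p=0.5, size=n)
# Exponential waiting times with a mean of 2
exponential = rng.exponential(scale=2, size=n)

for name, x in [("poisson", poisson), ("binomial", binomial), ("negbin", negbin)]:
    print(f"{name}: mean={x.mean():.2f}, variance={x.var():.2f}")

# constant hazard: having already waited 1 unit does not change the probability
# of waiting at least 2 more units
p_cond = np.mean(exponential[exponential > 1] > 3)
p_uncond = np.mean(exponential > 2)
print(f"P(X > 3 | X > 1) = {p_cond:.3f}, P(X > 2) = {p_uncond:.3f}")
```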

In the next post I will discuss how to simulate data from a mathematical model.

[1] For the sake of convenience I will use the term random numbers for series of numbers generated by a computational algorithm, but it is more precise to call them pseudorandom numbers, because the series will ultimately repeat after a very long time. See Chapter 3 of Knuth’s Seminumerical Algorithms (Vol. 2 of The Art of Computer Programming) for a detailed discussion.

Managing complex scientific workflows

Russ Poldrack — Tue, 05 May 2026 15:01:44 GMT

We now turn to a more realistic and complex scientific data analysis workflow. For this example I will use an analysis of single-cell RNA-sequencing data to determine how gene expression in immune system cells changes with age. This analysis will utilize a large openly available dataset that includes data from 982 people comprising about 1.3 million peripheral blood mononuclear cells (i.e. white blood cells) for about 35K transcripts. I chose this particular example for several reasons:

It is a realistic example of a workflow that a researcher might actually perform.
It has a large enough sample size to provide a robust answer to our scientific question.
The data are large enough to call for a real workflow management scheme, but small enough to be processed on a single laptop (assuming it has decent memory).
The workflow has many different steps, some of which can take a significant amount of time (over one hour)
There is an established Python library (scanpy) that implements the necessary workflow components.
It’s an example outside of my own research domain, to help demonstrate the applicability of the book’s ideas across a broader set of data types.

I will use this example to show how to move from a monolithic analysis script to a well-structured and usable workflow that meets most of the desired features described above.

Note: I am not an expert in RNA-seq analysis. I would welcome comments from experts on the workflow that I have implemented here.

Starting point: One huge notebook

I developed the initial version of this workflow as many researchers would: by creating a Jupyter notebook that implements the entire workflow, which can be found here. The total execution time for this notebook is about two hours on an M3 Max MacBook Pro.

The problem of in-place operations

What I found as I developed the workflow was that I increasingly ran into problems that arose because the state of particular objects had changed. This occurred for two reasons at different points. First, in some cases it occurred because I assigned a new version of the object to the same name, resulting in an object with a different structure than before. Second, and more insidiously, it occurred when an object passed into a function was modified by the function internally. This is known as an in-place operation, in which a function modifies an object directly rather than returning a new object that can be assigned to a variable.

In-place operations can make code particularly difficult to debug in the context of a Jupyter notebook, because out-of-order execution can produce very confusing results or errors when the changes that were made in-place are not obvious. For this reason, I generally avoid any kind of in-place operations if possible. Rather, any function should immediately create a copy of the object that was passed in, and then do its work on that copy, returning the copy at the end of the function for assignment to a new variable. One can then re-assign it to the same variable name if desired, which is more transparent than an in-place operation but still makes the workflow dependent on the exact state of execution and can lead to confusion when debugging. Some packages allow a feature called “copy-on-write” which defers actually copying the data in memory until it is actually modified, which can make copying more efficient; this feature is becoming the default in pandas.
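The difference between the two styles can be seen in a small sketch using NumPy arrays; both function names are hypothetical.

```python
import numpy as np

def center_inplace(arr):
    """Modifies the caller's array directly and returns nothing."""
    arr -= arr.mean()

def center_copy(arr):
    """Works on a copy and returns it, leaving the input untouched."""
    result = arr.copy()
    result -= result.mean()
    return result

original = np.array([1.0, 2.0, 3.0])

centered = center_copy(original)
print(original, centered)  # the input is unchanged; the result is a new array

center_inplace(original)
print(original)            # the input itself has now been changed
```

If a notebook cell that calls center_inplace() is run twice, or run out of order, the state of the array depends on the execution history, whereas the copying version gives the same result every time.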

If one must modify objects in-place, then it is good practice to announce this loudly. The loudest way to do this would be to put “inplace” in the function name. Another cleaner but less loud way is through conventions regarding function naming; for example, in PyTorch it is a convention that any function that ends with an underscore (e.g. tensor.mul_(x)) performs an in-place operation whereas the same function without the underscore (tensor.mul(x)) returns a new object. Another way that some packages enable explicit in-place operations is through a function argument (e.g. inplace=True in pandas), though this is being phased out from many functions in pandas because “It is generally seen (at least by several pandas maintainers and educators) as bad practice and often unnecessary” (PDEP-8).

One way to prevent in-place operations altogether is to use data types that are immutable, meaning that they can’t be changed once created. This is one of the central principles in functional programming languages (such as Haskell), where all data types are immutable, such that one is required to create a new object any time data are modified. Some native data types in Python are immutable (such as tuples and frozensets), and some data science packages also provide immutable data types; in particular, the Polars package (which is meant to be a high-performance alternative to pandas) implements its version of a data frame as an immutable object, and the JAX package (for high-performance numerical computation and machine learning) implements immutable numerical arrays.
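As a minimal sketch of immutability in everyday Python, the following example uses a frozen dataclass from the standard library and a NumPy array marked as read-only. The class and its fields are hypothetical and are not taken from the workflow code.

```python
from dataclasses import FrozenInstanceError, dataclass, replace

import numpy as np

@dataclass(frozen=True)
class FilterParams:
    """Hypothetical analysis parameters that cannot be changed once created."""
    min_cells: int
    min_genes: int

params = FilterParams(min_cells=10, min_genes=200)

try:
    params.min_cells = 5  # attempting to modify a frozen instance
    modified_dataclass = True
except FrozenInstanceError:
    modified_dataclass = False

# changes must instead create a new object
new_params = replace(params, min_cells=5)

# NumPy arrays can be made read-only by clearing the writeable flag
counts = np.array([1, 2, 3])
counts.flags.writeable = False
try:
    counts[0] = 99
    modified_array = True
except ValueError:
    modified_array = False

print(params, new_params, modified_dataclass, modified_array)
```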

Converting from Jupyter notebook to a runnable python script

As we discussed in an earlier chapter, converting a Jupyter notebook to a pure Python script is easy using jupytext. This results in a script that can be run from the command line. However, there can be some commands that will block execution of the script; in particular, plotting commands can open windows that will block execution until they are closed. To prevent this, and to ensure that the results of the plots are saved for later examination, I replaced all of the plt.show() commands that display a figure to the screen with plt.savefig() commands that save the figures to a file in the results directory. (This was an easy job for the Copilot agent to complete.)

Decomposing a complex workflow

The first thing we need to do with a large monolithic workflow is to determine how to decompose it into coherent modules. There are various reasons that one might choose a particular breakpoint between modules. First and foremost, there are usually different stages that do conceptually different things. In our example, we can break the workflow into several high-level processes:

Data (down)loading
Data filtering (removing subjects or cell types with insufficient observations)
Quality control
- identifying bad cells on the basis of mitochondrial, ribosomal, or hemoglobin genes or hemoglobin contamination
- identifying “doublets” (two cells captured in a single barcode)
Preprocessing
- Count normalization
- Log transformation
- Identification of high-variance features
- Filtering of nuisance genes
Dimensionality reduction
UMAP generation
Clustering
Pseudobulking (aggregating cells within an individual)
Differential expression analysis
Pathway enrichment analysis (GSEA)
Overrepresentation analysis (Enrichr)
Predictive modeling

In addition to a conceptual breakdown, there are also other reasons that one might want to further decompose the workflow:

There may be points where one might need to restart the computation (e.g. due to computational cost).
There may be sections where one might wish to swap in a new method or different parameterization.
There may be points where the output could be reusable elsewhere.

Resumable workflows

I asked Claude Code to help modularize the monolithic workflow, using a prompt that provided the conceptual breakdown described above. The resulting code ran correctly, but crashed about two hours into the process due to a resource issue that appeared to be due to asking for too many CPU cores in the differential expression analysis. This left me in the situation of having to rerun the entire two hours of preliminary workflow simply to get to a point where I could test my fix for the differential expression component, which is not a particularly efficient way of coding. The problem here is that the workflow execution was stateful, in the sense that the previous steps need to be rerun prior to performing the current step in order to establish the required objects in memory. The solution to this problem is to implement the workflow in a resumable way, which doesn’t require that earlier steps be rerun if they have already been completed. One way to do this is by implementing a process called checkpointing, in which the intermediate state is stored for each step. These checkpoint files can then be used to start the workflow at any point without having to rerun all of the previous steps.
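The idea of checkpointing can be sketched in a few lines. This is not the implementation used in the RNA-seq workflow; it is a toy example in which the hypothetical `run_step` helper stores each step's result with `pickle` and reloads it on later runs instead of recomputing it.

```python
import pickle
import tempfile
from pathlib import Path


def run_step(name, func, checkpoint_dir, *args):
    """Run func(*args) unless a checkpoint for this step already exists."""
    checkpoint = Path(checkpoint_dir) / f"{name}.pkl"
    if checkpoint.exists():
        # Resume: load the stored result instead of recomputing it.
        with open(checkpoint, "rb") as f:
            return pickle.load(f)
    result = func(*args)
    # Write to a temporary file first so that a crash during the write
    # cannot leave a partial checkpoint behind.
    tmp = checkpoint.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(result, f)
    tmp.replace(checkpoint)
    return result


calls = []  # records which steps were actually executed


def load_data():
    calls.append("load")
    return [1, 2, 3, 4]


def filter_data(data):
    calls.append("filter")
    return [x for x in data if x % 2 == 0]


with tempfile.TemporaryDirectory() as ckpt_dir:
    for _ in range(2):  # the second pass simulates a restart after a crash
        data = run_step("load", load_data, ckpt_dir)
        filtered = run_step("filter", filter_data, ckpt_dir, data)

print(filtered, calls)
```

On the second pass neither step function is called, because both results are read back from their checkpoint files.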

Another important feature of a workflow related to resumability is idempotency, which means that a workflow will result in the same answer when run multiple times. This is related to, but not the same as, the idea of resumability. For example, a resumable workflow that saves its outputs to cache files could fail to be idempotent if the results were appended to the output file with each execution, rather than overwriting them. This would result in different outputs depending on how many times the workflow has been executed. Thus, when we use caching we should be sure to either reuse the existing file or rewrite it completely with a new version.
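The appending example can be made concrete with a small sketch. The two hypothetical save functions below differ only in the file mode they use, and the loop stands in for running the same workflow three times.

```python
import tempfile
from pathlib import Path


def save_appending(path, rows):
    """Not idempotent: every run adds the rows again."""
    with open(path, "a") as f:
        f.writelines(f"{row}\n" for row in rows)


def save_overwriting(path, rows):
    """Idempotent: every run leaves the file in the same state."""
    with open(path, "w") as f:
        f.writelines(f"{row}\n" for row in rows)


rows = ["a,1", "b,2"]
with tempfile.TemporaryDirectory() as tmp:
    appended = Path(tmp) / "appended.csv"
    overwritten = Path(tmp) / "overwritten.csv"
    for _ in range(3):  # run the "workflow" three times
        save_appending(appended, rows)
        save_overwriting(overwritten, rows)
    n_appended = len(appended.read_text().splitlines())
    n_overwritten = len(overwritten.read_text().splitlines())

print(n_appended, n_overwritten)
```

The appended file grows with every execution, while the overwritten file contains the same two rows no matter how many times the code is run.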

When it comes to building a resumable workflow, there are two options:

Build it from scratch
Use an existing workflow engine

While there may sometimes be good reasons to build a workflow from scratch (and I actually do so in developing this chapter: see here), one’s first inclination should always be to see whether there is an existing tool that can solve the problem.

Managing a complex workflow with Snakemake

In this example I will focus on the Snakemake workflow engine. Figure 1 shows the DAG generated by Snakemake for the RNA-seq analysis workflow.

Figure 1. The execution graph for the RNA-seq analysis workflow visualized as a DAG.

The Snakemake file is substantially larger for this complex workflow; to keep it from being even longer, we can also move some of the rules into separate rules files using the following syntax within the Snakemake file:

# Include modular rule files
include: "rules/common.smk"
include: "rules/preprocessing.smk"
include: "rules/pseudobulk.smk"
include: "rules/per_cell_type.smk"

Pipeline optimization

The first time that I ran this workflow using Snakemake, I noticed that it was substantially slower than when I ran it using the custom workflow engine. When I asked Claude about this, it gave me a reasonable answer:

The Snakemake rules don’t specify a threads directive. By default, Snakemake runs each job with 1 thread, which limits parallelization of computationally intensive operations like neighbor computation and UMAP (which use numba internally).

I asked Claude to fix this, and it returned the following change:

Fixes applied:
Added threads: 8 to dimensionality_reduction rule
Added threads: 8 to quality_control rule (Scrublet)
Added threads: 4 to preprocess rule
Set NUMBA_NUM_THREADS and OMP_NUM_THREADS environment variables in dimred.py

In contrast, tasks run in the main process have access to all CPUs by default, which is why it was faster.

This involved creating rules like this one:

rule dimensionality_reduction:
...
    threads: 8

The dimred.py script then uses the snakemake.threads variable to set the relevant environment variables. This rule defines the ideal situation: that is, what is the maximum number of threads that our code can take advantage of? In general a number from 4-8 is optimal here, given the overhead that comes with multithreading (as we will discuss further in the later chapter on Performance Optimization). While this might seem problematic (e.g., what if there are only four cores available?), Snakemake deals with it gracefully. If there are more cores available than the limit, then Snakemake will (if appropriate) spawn multiple processes in parallel. If there are fewer than the number requested, it will simply use what is available. There is a separate command line argument to Snakemake (--cores) that specifies the maximum number of cores that can be utilized on the computer.
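The contents of dimred.py are not shown here, so the following is only a sketch of how a script might translate a thread count into the two environment variables named above. The `configure_threads` helper is hypothetical; the `snakemake` object is the one that Snakemake injects into scripts run through its `script:` directive, and the fallback to one thread is an assumption made so that the sketch also runs on its own.

```python
import os


def configure_threads(n_threads):
    """Set thread-count environment variables for numba and OpenMP."""
    n = str(max(1, int(n_threads)))
    os.environ["NUMBA_NUM_THREADS"] = n
    os.environ["OMP_NUM_THREADS"] = n
    return n


# Under Snakemake's `script:` directive a `snakemake` object is available;
# fall back to a single thread when the script is run standalone.
try:
    n_threads = snakemake.threads  # noqa: F821 (defined only under Snakemake)
except NameError:
    n_threads = 1

configure_threads(n_threads)
# These variables generally need to be set before the numerical libraries
# are first imported, so those imports would come after this point.
print(os.environ["NUMBA_NUM_THREADS"], os.environ["OMP_NUM_THREADS"])
```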

Parametric sweeps

A common pattern in some computational research domains is the parametric sweep, where a workflow is run using a range of values for specific parameters in the workflow. A key to successful execution of parametric sweeps is proper organization of the outputs so that they can be easily processed by downstream tools. Snakemake provides the ability to easily implement parametric sweeps simply by specifying a list of parameter values in the configuration file. For example, let’s say that we wanted to assess predictive accuracy using several values of the regularization parameter (known as alpha) for a ridge regression model. We could first specify a setting within our config.yaml file containing these values:

ridge_alpha:
  - 0.1
  - 1.0
  - 10.0

We would then add wildcards to the inputs and/or outputs for the relevant rules, expanding the parameters so that each unique value (e.g. each of our different models) becomes an expected input/output:

rule all:
    input:
        expand("results/ridge/alpha_{param}/model.pkl",
               param=config["ridge_alpha"])

rule train:
    input:
        "data/train.csv"
    output:
        "results/ridge/alpha_{param}/model.pkl"
    shell:
        "python train.py --model ridge --param {wildcards.param} -o {output}"

It is also possible to generate parameters based on earlier steps in the workflow. In our RNA-seq workflow, we determine in an earlier step which specific cell types to include, based on their prevalence in the dataset. These cell types are then used to run the per-cell-type analyses in a later step, executing the same enrichment and pathway analyses on each of the selected cell types. This kind of data-dependent computational graph requires the use of the advanced checkpointing features in Snakemake.

One could certainly perform the parametric sweep outside of the workflow engine (e.g. by running a separate Snakemake job for each set of values or by looping over the values within the main job script rather than at the workflow layer). However, there are several advantages to doing it within a coherent workflow. First, it ensures that all of the runs are performed using exactly the same software environment and workflow. If the different parameter settings were run in different workflows, then it is possible that the software environment could change between runs, so one would need to do additional validation to ensure that it was identical across runs. Second, it maximizes the use of system resources, since the workflow manager can optimally split the work across the available number of cores/threads. Running multiple Snakemake jobs at once has the potential to request more threads than are available, which can sometimes substantially reduce performance, and managing system resources manually can require substantial effort. Third, it enables the use of values from earlier workflow steps to determine the parameters for sweeping at later layers, as in the cell-type example above. Finally, it makes incremental changes easy and economical; if one additional value of the parameter is added, Snakemake will only run the computations for the new value.
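To make the expansion step of the sweep concrete, here is a rough plain-Python equivalent of what the `expand()` call in `rule all` produces for a single wildcard. The `config` dictionary stands in for the values that Snakemake would read from config.yaml.

```python
# Values as they would be read from config.yaml (assumed here as a plain dict)
config = {"ridge_alpha": [0.1, 1.0, 10.0]}

pattern = "results/ridge/alpha_{param}/model.pkl"

# Roughly what expand(pattern, param=config["ridge_alpha"]) produces
targets = [pattern.format(param=value) for value in config["ridge_alpha"]]

for target in targets:
    print(target)
```

Each of these target paths has to match the output pattern of some rule, since that is how Snakemake works out which rule to run and which value to assign to the wildcard.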

Workflow testing strategies

Russ Poldrack — Tue, 28 Apr 2026 15:01:42 GMT

Software testing is just as essential for workflow development as it is for any other kind of software, but there are a few particular issues that are uniquely relevant for testing workflows:

It may not be possible to perform testing using realistic datasets, either due to data size or to restrictions on data access.
Smaller test datasets may not fully exercise the same behaviors as real datasets.
Results are often non-deterministic, such as those involving Monte Carlo simulations or search processes with random initialization, making it difficult to test for a specific outcome.
Tests often need to compare floating point numbers for equality, which requires some tolerance for floating point error; it can be difficult to choose a tolerance that catches real errors while avoiding false alarms.
Scientific workflows often take a very long time to run (hours or even days), making full integration tests infeasible.
Finally, in scientific workflows we often don’t know the right answer to expect from code, if we are performing operations that have never been done before.
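Two of these issues, floating point tolerance and non-determinism, can be illustrated with a short standard-library sketch. The `simulate_mean` function is a toy stand-in for a stochastic analysis step, and the tolerances are arbitrary choices made for illustration rather than recommendations.

```python
import math
import random

# Exact equality fails because of floating point rounding error
total = 0.1 + 0.2
exactly_equal = total == 0.3

# Comparison with an explicit tolerance succeeds
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)


def simulate_mean(n, seed):
    """Toy Monte Carlo estimate: the mean of n uniform random draws."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return sum(rng.random() for _ in range(n)) / n


# A fixed seed makes the result reproducible across runs
same_seed_matches = simulate_mean(1000, seed=42) == simulate_mean(1000, seed=42)

# For stochastic outputs, a test can check a statistical property within
# a loose tolerance rather than an exact value
estimate = simulate_mean(10_000, seed=7)
near_expected = math.isclose(estimate, 0.5, abs_tol=0.05)

print(exactly_equal, close_enough, same_seed_matches, near_expected)
```

In a pytest suite the same comparisons are often written with `pytest.approx`, which wraps the expected value with a tolerance.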

I will leave the question of analytic accuracy to be addressed in the following chapter on validation. Here I will focus primarily on unit testing of workflow components and integration testing using a minimal test dataset.

Generating tests using AI agents

Because I didn’t use test-driven development to generate this workflow, I was faced with the task of having to generate a substantial amount of testing code for the completed workflow. To get a feel for the scale, the source code for our RNA-seq analysis contained 83 functions defined across 13 files. In total these files included 3,239 lines, though there are actually many fewer lines of code, since many lines are blank or contain comments. In addition many individual commands are split across lines to increase readability. Using the radon package for source code analysis, we can compute the logical lines of code, which is the number of executable statements:

$  uv run radon raw *.py | awk '/LLOC:/ {sum += $2} END {print sum}'
1038

Thus, we need to generate tests for 83 functions comprising more than 1000 statements! As I discussed in Chapter 4, we can use AI tools to help generate test code, though this code must be examined in detail by a knowledgeable human in order to ensure that the tests adequately exercise the relevant functions.

I started by generating a CLAUDE.md file to create the test framework:

This is a set of implementations of an analysis workflow for single-cell RNA-seq data analysis. These are meant to exemplify different ways of building a workflow.  The main goal of this development project is to develop a testing framework for this workflow.  This will involve two main steps:

- develop unit tests for the functions defined within src/example_rnaseq
- develop integration tests for the snakemake workflow defined in snakemake_workflow

Some of these tests can be performed using automatically generated data. However, the integration tests will require a test dataset. this should be based on the actual data that can be found at $DATADIR/dataset-OneK1K_subset-immune_raw.h5ad.  To minimize the size of this dataset, we should first select a subset of 30 donors from the dataset.  we should look for donors that vary in the number of cells, with some having high numbers and some having low numbers. The donors should also vary in age so that the subset covers the entire distribution of ages in the dataset. then we should select a subset of about 500 genes.  these should include:

- a set of genes from a pathway (TNF-alpha Signaling via NF-KB) known to be associated with aging, found in tests/data/HALLMARK_TNFA_SIGNALING_VIA_NFKB.v2025.1.Hs.json
- a set of about 200 other highly variable genes
- a set of about 100 weakly variable genes

This dataset should be saved to tests/data/testdata.h5ad.

## Coding guidelines

- Think about the problem before generating code.
- Write code that is clean and modular. Prefer shorter functions/methods over longer ones.
- Prefer reliance on widely used packages (such as numpy, pandas, and scikit-learn); avoid unknown packages from Github.
- Do not include *any* code in `__init__.py` files.
- Use pytest for testing.
- Use functions rather than classes for tests. Use pytest fixtures to share resources between tests.

Claude Code took about 20 minutes to generate an entire test framework for the code, comprising 215 test functions and 19 test fixtures. Interestingly, Claude disregarded my instructions to use functions rather than classes for tests, generating 78 test classes. While I usually prefer tests to be in pure functions rather than classes so that novices can more easily understand them, I decided in this case to stay with the class-based implementation since I don’t mind it and it does make the organization of the tests a bit cleaner.

The initial test set for this project had no tests for one of the modules, and left significant portions of other modules untested. I was able to improve this by having Claude Code analyze the code coverage report and identify important parts of the code that were not currently covered, which moved the test coverage from 69% to 88% of the 870 statements in the code that were identified by the coverage tool.

Avoiding the happy path

Because it is essential for AI-generated tests to be assessed by a knowledgeable human, I proceeded to read all of the tests that had been generated by Claude. Fortunately they were all easily readable and clearly named, which made it relatively easy to see some potential problems right away. Several kinds of issues arose.

Because AI agents have a strong tendency to generate tests that pass, they will sometimes miss potential problems - this is commonly referred to as following the “happy path”. Several of the tests performed very minimal checking of outputs that would miss potential problems. For example, it generated the following test which, according to its name, should test whether a PCA embedding is generated using the harmonypy package:

    def test_creates_harmony_embedding(self, adata_with_pca):
        """Test that Harmony creates a new embedding."""
        adata, use_rep = run_harmony_integration(adata_with_pca.copy())

        if use_rep == "X_pca_harmony":
            assert "X_pca_harmony" in adata.obsm
            assert adata.obsm["X_pca_harmony"].shape[0] == adata.n_obs

The use_rep variable contains “X_pca_harmony” if the harmonypy package is installed and successfully applied to the data; otherwise the function falls back on standard PCA and sets use_rep to “X_pca”. But it’s clear here that this test only checks for the harmony embedding in the case that it was successfully created (if use_rep == “X_pca_harmony”), in which case it makes sure that it is present in the dataset and has the right shape. Thus, the test could pass even if the harmony embedding was not successfully created. Here is the improved version to address this issue:

    def test_creates_harmony_embedding(self, adata_with_pca):
        """Test that Harmony creates a new embedding with correct shape."""
        pytest.importorskip("harmonypy")

        adata, use_rep = run_harmony_integration(adata_with_pca.copy())

        assert use_rep == "X_pca_harmony"
        assert "X_pca_harmony" in adata.obsm
        assert adata.obsm["X_pca_harmony"].shape[0] == adata.n_obs
        assert adata.obsm["X_pca_harmony"].shape == adata.obsm["X_pca"].shape

In other cases, the tests that were generated were too minimal, allowing obvious failure cases to pass. For example, the integration test named “test_pseudobulk_pipeline_runs” included the following code:

        result = run_pseudobulk_pipeline(
            adata,
            group_col="cell_type",
            donor_col="donor_id",
            metadata_cols=["development_stage", "sex"],
            min_cells=1,  # Low threshold for test data
            figure_dir=temp_output_dir,
        )

        # Check outputs
        assert result is not None
        assert result.n_obs > 0
        assert "n_cells" in result.obs.columns
        assert "cell_type" in result.obs.columns
        assert "donor_id" in result.obs.columns

Pseudobulking is an operation that should summarize all cells of a given type for each donor, but none of the test conditions actually check that it has been properly applied. In fact, these tests could pass if run_pseudobulk_pipeline() simply passed the original data back without doing anything to it! This is a case where domain knowledge is essential to get the tests right and avoid the happy path. In several other cases the tests called pytest.skip() (which causes the test to be skipped) for outcomes that really should have triggered a test failure. For example, it skipped the integration tests for the full dataset if the dataset hadn’t already been created, and it also skipped the Snakemake integration functions if the Snakemake call failed (which it initially did because of a missing argument).
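To illustrate the kind of checks that would catch a do-nothing implementation, here is a toy sketch that does not use the real `run_pseudobulk_pipeline()` or AnnData objects. The `pseudobulk` and `check_pseudobulk` functions and the tuple-based data layout are hypothetical; the point is that the assertions test properties of the aggregation itself (one row per donor and cell type, with totals conserved) rather than just the presence of columns.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Toy pseudobulk: sum counts over cells sharing (donor, cell_type).

    Each cell is a (donor_id, cell_type, count) tuple.
    """
    totals = defaultdict(lambda: {"count": 0, "n_cells": 0})
    for donor, cell_type, count in cells:
        totals[(donor, cell_type)]["count"] += count
        totals[(donor, cell_type)]["n_cells"] += 1
    return dict(totals)


def check_pseudobulk(cells, result):
    """Assertions that fail if the input is returned without aggregation."""
    expected_groups = {(donor, cell_type) for donor, cell_type, _ in cells}
    # One row per donor/cell-type combination, not one row per cell
    assert set(result) == expected_groups
    assert len(result) < len(cells)
    # Aggregation must conserve the total counts and the number of cells
    assert sum(r["count"] for r in result.values()) == sum(c[2] for c in cells)
    assert sum(r["n_cells"] for r in result.values()) == len(cells)


cells = [
    ("d1", "T cell", 5), ("d1", "T cell", 3), ("d1", "B cell", 2),
    ("d2", "T cell", 4), ("d2", "T cell", 1),
]
result = pseudobulk(cells)
check_pseudobulk(cells, result)
print(result[("d1", "T cell")])
```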

These examples highlight the need to closely examine the test code that is generated by AI agents. However, it’s worth noting that although it took a significant amount of human time to read over the AI-generated tests, the time spent was still far less than if I had undertaken writing the test code without AI assistance, and Claude was also able to fix all of the issues to my satisfaction after I raised them.

Property-based testing for workflows

The tests initially developed for this workflow were built around the known characteristics of the expected data. However, there are many “unknown unknowns” when it comes to input data, and it’s important to make sure that the code deals gracefully with problematic inputs. We can test this using a property-based testing approach; as I discussed in Chapter 4, this involves the generation of many different datasets that vary, and checking whether the code deals with them appropriately. When I asked the coding agent to identify plausible candidates for property-based testing using the Hypothesis package, it generated tests centered on several different properties:

Proper parsing of a range of filenames for the BIDS format parser used in caching
Consistency of hashing operations
Proper JSON serialization of a range of values
Proper processing of input lists with a range of inputs

These are all good applications of property-based testing because they focus on invariant features that should be true regardless of the inputs (i.e. the same input should always generate the same hash, or loading a serialized dataset should return the same values as the original). I further pushed it to identify plausible candidates for testing of the effects of numerical edge cases, such as division by zero, which is another valuable use case for property-based testing. This paid off when the property-based tests generated by Claude identified a bug that had not been caught by the previous tests:

The property-based test found a real bug! The prepare_enrichr_plot_data function produces infinity when computing -log10(0) for zero p-values. The GSEA version handles this with + 1e-10, but the Enrichr version doesn’t.

This highlights the utility of property-based testing alongside standard unit tests.
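The bug and its fix can be sketched with the standard library. The `neg_log10_p` function below is a hypothetical stand-in for the relevant line of `prepare_enrichr_plot_data`, which is not shown here; it applies the same `+ 1e-10` offset that Claude reported for the GSEA version. The list of edge cases plays the role of the inputs that a property-based test would generate, for example with the `floats(min_value=0.0, max_value=1.0)` strategy in Hypothesis.

```python
import math


def neg_log10_p(p, eps=1e-10):
    """Return -log10(p), offsetting p by eps so that p == 0 stays finite."""
    return -math.log10(p + eps)


# Edge cases that a property-based test would be expected to probe
edge_cases = [0.0, 1e-300, 1e-10, 0.05, 1.0]
scores = [neg_log10_p(p) for p in edge_cases]

# Property: the transformed value is finite for every valid p-value
all_finite = all(math.isfinite(s) for s in scores)
print(all_finite, round(neg_log10_p(0.0), 6))
```

One consequence of this design is that the offset caps the transformed value at about 10, so p-values far below 1e-10 become indistinguishable in the plot.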

This post completes the chapter on workflows. In the next post I will proceed to the chapter on validation of scientific code.

Tracking provenance in workflows

Russ Poldrack — Tue, 21 Apr 2026 15:01:27 GMT

As I discussed in the earlier chapter on data management, it is essential to be able to track the provenance of files in a workflow. That is, how did the file come to be, and what other files did it depend on? Snakemake stores a substantial amount of metadata that allows us to reconstruct much of the provenance of any file generated by a workflow. The relevant data are stored within the .snakemake/metadata directory, which at first glance seems to contain a bunch of gibberish:

➤  ls .snakemake/metadata

cmVzdWx0cy9jb3JyZWxhdGlvbl9tYXRyaXguY3N2
ZGF0YS9kZW1vZ3JhcGhpY3MuY3N2
ZGF0YS9kZW1vZ3JhcGhpY3NfbnVtZXJpY2FsLmNzdg==
ZGF0YS9qb2luZWRfZGF0YS5jc3Y=
ZGF0YS9tZWFuaW5nZnVsX3ZhcmlhYmxlc19udW1lcmljYWwuY3N2
ZGF0YS9tZWFuaW5nZnVsX3ZhcmlhYmxlcy5jc3Y=
ZmlndXJlcy9jb3JyZWxhdGlvbl9oZWF0bWFwLnBuZw==

These filenames are actually versions of the original filenames that have been encoded into a *base64* representation, which allows a full file path (including directory separators) to be stored as a single filename. We can decode them using the base64 module from the Python standard library:

In: encoded_name
Out: 'ZGF0YS9tZWFuaW5nZnVsX3ZhcmlhYmxlc19udW1lcmljYWwuY3N2'

In: base64.b64decode(encoded_name).decode()
Out: 'data/meaningful_variables_numerical.csv'
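The same decoding can be applied to every entry in the metadata directory at once. The `decode_metadata_names` helper below is a hypothetical convenience function, and the demonstration runs on a temporary directory containing one of the encoded names from the listing above rather than on a real .snakemake directory.

```python
import base64
import tempfile
from pathlib import Path


def decode_metadata_names(metadata_dir=".snakemake/metadata"):
    """Map each base64-encoded metadata filename to the path it encodes."""
    decoded = {}
    for entry in sorted(Path(metadata_dir).iterdir()):
        decoded[entry.name] = base64.b64decode(entry.name).decode()
    return decoded


# Decode one of the names from the directory listing
encoded_name = "ZGF0YS9qb2luZWRfZGF0YS5jc3Y="
decoded_name = base64.b64decode(encoded_name).decode()

# Demonstrate the helper on a stand-in metadata directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / encoded_name).touch()
    names = decode_metadata_names(tmp)

print(decoded_name, names)
```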

These files are stored in JSON format and contain a dictionary with relevant information about the provenance of each file:

import json

metadata_path = f".snakemake/metadata/{encoded_name}"
with open(metadata_path) as f:
    md_dict = json.load(f)
print(md_dict)

{'record_format_version': 6,
 'code': '        f"{BASEDIR}/scripts/filter_data.py"\n',
 'rule': 'filter_meaningful_variables',
 'input': ['data/meaningful_variables.csv'],
 'log': ['logs/filter_meaningful_variables.log'],
 'params': [],
 'shellcmd': None,
 'incomplete': False,
 'starttime': 1767284094.3992934,
 'endtime': 1767284095.986599,
 'job_hash': 278889477,
 'conda_env': 'bmFtZTogc2ltcGxlX3dvcmtmbG93CmNoYW5uZWxzOgogIC',
 'software_stack_hash': 'd41d8cd98f00b204e9800998ecf8427e',
 'container_img_url': 'docker://jupyter/scipy-notebook:x86_64-ubuntu-22.04',
 'input_checksums': {}}

Using this information we could reconstruct the DAG for the workflow, or identify the specific files that went into generating each other file.
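As a sketch of how that reconstruction might start, the following code reads every metadata record in a directory and returns one (input, output, rule) triple per dependency, which is the edge list of the file-level graph. The `collect_edges` function is hypothetical, it relies only on the `rule` and `input` keys shown in the record above, and the demonstration uses a fabricated metadata directory rather than real Snakemake output.

```python
import base64
import json
import tempfile
from pathlib import Path


def collect_edges(metadata_dir):
    """Return (input_file, output_file, rule) triples from Snakemake metadata."""
    edges = []
    for entry in sorted(Path(metadata_dir).iterdir()):
        output_file = base64.b64decode(entry.name).decode()
        record = json.loads(entry.read_text())
        for input_file in record.get("input", []):
            edges.append((input_file, output_file, record.get("rule")))
    return edges


# Demonstration on a fabricated metadata directory with a single record
with tempfile.TemporaryDirectory() as tmp:
    output_file = "data/meaningful_variables_numerical.csv"
    encoded = base64.b64encode(output_file.encode()).decode()
    record = {"rule": "filter_meaningful_variables",
              "input": ["data/meaningful_variables.csv"]}
    (Path(tmp) / encoded).write_text(json.dumps(record))
    edges = collect_edges(tmp)

print(edges)
```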

Generating a PROV representation from Snakemake

As I mentioned in the chapter on data management, there is an emerging standard for representation of provenance information, known as PROV. Although Snakemake does not directly support the generation of PROV representations, there is a package called makeprov that can generate a PROV representation from a Snakemake workflow. After running our workflow and installing the makeprov package, we simply need to run the makeprov.snakemake module to generate the PROV output from within the working directory of the workflow (where the .snakemake directory is located):

➤ uv run python -m makeprov.snakemake --prov-path prov/snakemake -- --snakefile path/to/Snakefile --nolock

This will generate a file called prov/snakemake.json that contains representations of each of the entities and activities in the workflow. For example, the representation of the data/meaningful_variables_numerical.csv output file would look like this:

    {
      "id": "urn:snakemake:file/data/meaningful_variables_numerical.csv",
      "type": "prov:Entity",
      "format": "text/csv",
      "extent": 1191564,
      "modified": "2026-01-01T16:14:55.986599+00:00",
      "identifier": "sha256:6e09083ea4f474b420cf1ca9f42486e4a509ad020a39f5432b4c183e7f92e519",
      "wasGeneratedBy": "urn:snakemake:job/4",
      "label": "data/meaningful_variables_numerical.csv"
    },

and the representation of the job that created it (`filter_meaningful_variables`) would look like this:

    {
      "id": "urn:snakemake:job/4",
      "type": "prov:Activity",
      "wasAssociatedWith": "urn:snakemake:agent/snakemake",
      "used": [
        "urn:snakemake:file/data/meaningful_variables.csv"
      ],
      "label": "filter_meaningful_variables (jobid=4)",
      "snakemake:rule": "filter_meaningful_variables",
      "snakemake:status": "ok",
      "snakemake:plan": "no update"
    },

These files provide a very useful representation of the provenance for a workflow, and demonstrate the power of using a workflow engine that stores rich metadata about the workflow and its execution.
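Records like the two shown above can be linked to answer the question of which files a given output depended on. The sketch below assumes that the entity and activity records have already been collected into a Python list; the top-level layout of prov/snakemake.json is not shown here, so the loading step is omitted, and `trace_inputs` is a hypothetical helper.

```python
def trace_inputs(records, file_id):
    """Return ids of the entities used by the activity that generated file_id."""
    by_id = {record["id"]: record for record in records}
    job_id = by_id[file_id].get("wasGeneratedBy")
    if job_id is None:
        return []  # a source file that no job generated
    return by_id[job_id].get("used", [])


# Trimmed copies of the entity and activity records for this output file
records = [
    {"id": "urn:snakemake:file/data/meaningful_variables_numerical.csv",
     "type": "prov:Entity",
     "wasGeneratedBy": "urn:snakemake:job/4"},
    {"id": "urn:snakemake:job/4",
     "type": "prov:Activity",
     "used": ["urn:snakemake:file/data/meaningful_variables.csv"]},
]

inputs = trace_inputs(
    records, "urn:snakemake:file/data/meaningful_variables_numerical.csv")
print(inputs)
```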

In the next post I will discuss how to scale to complex scientific workflows.

From idea to talk in less than 24 hours

Russ Poldrack — Mon, 20 Apr 2026 19:21:08 GMT

This weekend I did something that I’m pretty sure very few people have ever done before: I took an idea, executed a completely new research project, and gave a conference talk about it, all within 24 hours. This post tells the story of how AI enabled me to do this, and what I learned.

The conference

I was in Princeton, NJ for the annual meeting of the Society of Experimental Psychologists, of which I am a member. It’s a very broad meeting where members can come and talk about whatever they are interested in, and I had signed up to give a talk about causal inference in the context of fMRI statistical modeling. The talks are only 15 minutes long, which is a difficult length because it’s hard to flesh out an idea in so little time. I was scheduled to talk in the afternoon of Day 2 of the meeting, so I spent Monday just listening to talks.

One of the talks that piqued my interest was by John Hummel, currently a faculty member at UIUC, whom I have known since our days together on the faculty at UCLA. John was talking about a paper that he and Rachel Heaton published last year, which focused on requirements for a computational model to perform symbolic processing. They make the following claim:

We propose that two kinds of hierarchical integration—integration of multiple role bindings into multiplace predicates, and integration of multiple correspondences into structure mappings—are minimal requirements, on top of basic dynamic binding, to realize symbolic thought. We tested this hypothesis in a systematic collection of 17 simulations that explored the ability of cognitive architectures with and without the capacity for multiplace predicates and structure mapping to perform various kinds of tasks. The simulations were as generic as possible, in that no task could be performed based on any diagnostic features, depending instead on the capacity for multiplace predicates and structure mapping. The results are consistent with the hypothesis that, along with dynamic binding, multiplace predicates and structure mapping are minimal requirements for basic symbolic thought. These results inform our understanding of how human brains give rise to symbolic thought and speak to the differences between biological intelligence, which tends to generalize broadly from very few training examples, and modern approaches to machine learning, which typically require millions or billions of training examples. The results we report also have important implications for bioinspired artificial intelligence.

John made very clear in his talk that he thought that these results demonstrated that current LLMs will never be able to perform true relational reasoning, because they don’t have the necessary representational and/or processing apparatus. In the Q&A period, I suggested that we sit down with Claude Opus 4.7 (which hereafter I will just call “Opus”) to see if his claim stands up to frontier AI models. My feeling is that it’s generally a losing proposition to make impossibility claims about future AI models, and I wanted to see how close the current models were to achieving his supposedly impossible ability.

When I got back from the conference in the afternoon, I decided to play around with trying to test the model on my own before dinner. I first ran the paper through the Opus chatbot with the following prompt:

The author of the attached paper claims that there is a set of tasks that LLMs cannot perform because they do not contain the basic elements for symbolic thought: dynamic binding, multiplace predicates, and structure mapping. Please review the paper and tell me whether you think this is a reasonable argument.

Claude came back with extensive comments, which you can see in their entirety here - they started with this very Claude-like intro:

This is a thoughtful paper, and the authors are making a more careful argument than a quick read might suggest. Let me separate what I think is right, what I think is questionable, and what remains genuinely open.

I then asked it to help me build some example problems to test reasoning in a way that avoids training data contamination:

can you develop a novel version of their task that I could test out on an LLM but that they could not legitimately claim would be in the models’ training diet?

It provided a set of guidelines for developing the problems, as well as suggesting a set of ways to make the problems even more challenging for the LLM. I placed its suggestions into a markdown file and put it in a new directory along with the original paper, and then fired up Claude Code and issued the following prompt:

I would like to generate a set of benchmark problems for relational reasoning. In another session I had claude read the paper and lay out some versions of the problem to develop; that transcript is in PROBLEM.md, and the paper is available at hummel_heaton.pdf. I would like to create five instances of the each of the variants that are described in the PROBLEM.md file, and then build a harness to run them on each version of claude (opus, sonnet, and haiku) and quantify their performance. please develop a plan

Within about 45 minutes I had a working script that could run the problems against each of the models, which I started running and then left for dinner. (See the full codebase here).

The analyses

Over dinner I told my colleagues about my ongoing analyses, and one of them said to me “oh, you have to talk about this tomorrow!”. So when I got home from dinner around 10:30 pm, I was anxious to see what the results looked like to determine whether it would make sense to create a new talk. Here is what I saw:

The numbers refer to the proportion of correct outputs, meaning that all of the models had scored 100% on the problems, which clearly suggested that something was wrong. So I fired up Claude Cowork in the code directory and issued the following prompt:

this directory is for a project that is trying to test ideas presented in a paper by hummel, which can be found at hummel_heaton.pdf. I asked claude code to create a set of tests to check whether LLMs have the relational reasoning capacity that Hummel claims they should not have - this is described in PROBLEMS.md. I ran the tests on LLMS including Haiku, Sonnet, and Opus, and all of them were able to perform the task perfectly. This has me worried that there might be somethign subtly wrong with the examples that is making it too easy for the models. can you please examine the problems (in problems/) and identify whether there are any issues that might be allowing the LLM to solve the problem without true relational reasoning?

Its judgment was swift:

I’ve examined the problems carefully and found several serious shortcuts that let LLMs solve the task without any relational reasoning. The short version: the correct answer is uniquely identifiable by its button configuration alone. Here are the specific issues, with evidence from the generator code.

It laid out a number of issues; see here for the full transcript. I then asked Cowork to fix the problems, which it did fairly quickly, also adding a negative control condition to ensure that there wasn’t a scoring problem. I also decided to add the ability to run the prompts against a local open-weight LLM using my Mac, in order to see how a very small model would do (in this case, Gemma 3n), mostly because I wanted to include a model that I was pretty sure should fail. I set this running and went to bed. At this point I had decided that if there were usable results I would create a new talk the next morning, which made it a bit difficult to sleep, so about 30 minutes later I got up and checked the results, seeing this:

The fact that performance scaled with model capacity suggested to me that these were plausible results. I also examined the error patterns, which showed the expected prevalence of feature match errors for the weaker models:

At this point I knew there was something to talk about, so I headed back to bed knowing that I had to get up early the next morning and create a new talk. After a few hours of fitful sleep, I was up the next morning to start working on the talk. I developed some introductory slides based on the results so far, and then headed over to the conference, where I sat in the back of the room working on further analyses. During this session I noticed something weird - the cross-domain results were roughly the same as those for the other problems, when cross-domain should be a much harder problem. When I looked at one of the cross-domain problems I discovered that Claude had made an error and that it wasn’t actually cross-domain (i.e. it used the same domain for both scenarios), so I asked Claude to examine the problems:

can you please examine the cross-domain problems? they don’t actually seem to be cross-domain to me

It diagnosed the problem:

You’re right — these aren’t cross-domain. Each problem uses one domain for both scenarios (org_chart ↔ org_chart, garden ↔ garden, etc.). The only thing “crossing” is the surface vocabulary relative to the default buttons-on-objects setup in baseline. Within a given problem, memory and perception share the same ontology, the same relation names, and the same feature noun.

I asked it to fix the problem, which it did - in addition, it generated a set of tests to ensure that the problems were really testing relational knowledge:

Let me do a rigorous sanity check: write three solvers (structural, feature-matching, semantic-alignment) and check their accuracy on the new cross-domain set
Excellent! Exactly the pattern we want:
FEATURE_MATCH scores 0/5 on every variant (feature-twin shortcut blocked).
STRUCTURAL scores 5/5 on every variant that has a structural answer (all except control).
SEMANTIC_ALIGN scores 0/5 on cross_domain — the newly-added semantic-alignment shortcut is defeated.
On control, the feature-match solver reliably picks the twin (5/5), confirming control’s diagnostic property.

I ran the new code, and the results were striking:

All of the models were at floor on the cross_domain problem (shown in green)! I added this figure into my talk, and moved on to creating some more slides. I wanted to provide a graphical representation of one of the problems, so I asked Claude to create a visualization, which it did. Here is the cross-domain problem that was posed to the LLM:

I'm going to describe two scenarios. In the memory scenario, a novel employee called a skiv has a property: it can be activated by assigning the specialty at a specific position on it. Your job is to figure out which plant in the perception scenario is the skiv-analog, and therefore which leaf on it activates it.
Memory scenario: There are three employees in an organization: a skiv, a tunk, and a drog. The skiv reports-to the tunk. The tunk sits-beside the drog. The skiv has a yellow color-coded specialty on top, a cyan color-coded specialty on side, and a brown color-coded specialty on bottom. The tunk has a orange color-coded specialty on top, a red color-coded specialty on side, and a black color-coded specialty on bottom. The drog has a green color-coded specialty on top, a purple color-coded specialty on side, and a pink color-coded specialty on bottom. Assigning the specialty at the bottom position of the skiv activates it.
Perception scenario: There are three plants in a garden: a snig, a clop, and a trob. The trob grows-beside the snig. The snig is-growing-under the clop. The snig has a yellow colored leaf on top, a cyan colored leaf on side, and a brown colored leaf on bottom. The clop has a orange colored leaf on top, a red colored leaf on side, and a black colored leaf on bottom. The trob has a green colored leaf on top, a pink colored leaf on side, and a purple colored leaf on bottom.
Which plant in the perception scenario is the skiv-analog, and which leaf activates it?
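To make the contrast between a feature-matching shortcut and a structural mapping concrete, here is a rough Python sketch applied to the problem above. It is not the solver code that Claude wrote; the function names are made up for illustration, and the idea of defining an object's structural role by how often it appears as the subject or object of a relation (ignoring the relation names) is my own simplifying assumption. Whether the benchmark scores the structural pick as the correct answer is likewise an assumption.

```python
# Relations as (subject, relation, object) triples, copied from the prompt above.
memory = [("skiv", "reports-to", "tunk"), ("tunk", "sits-beside", "drog")]
perception = [("trob", "grows-beside", "snig"), ("snig", "is-growing-under", "clop")]

# Feature colors as (top, side, bottom), copied from the prompt above.
memory_features = {
    "skiv": ("yellow", "cyan", "brown"),
    "tunk": ("orange", "red", "black"),
    "drog": ("green", "purple", "pink"),
}
perception_features = {
    "snig": ("yellow", "cyan", "brown"),
    "clop": ("orange", "red", "black"),
    "trob": ("green", "pink", "purple"),
}


def role(obj, triples):
    """Structural role: (times seen as subject, times seen as object).

    Relation names are deliberately ignored (a simplifying assumption).
    """
    as_subject = sum(s == obj for s, _, _ in triples)
    as_object = sum(o == obj for _, _, o in triples)
    return (as_subject, as_object)


def structural_match(target, mem, per):
    """Perception objects whose structural role equals the target's role."""
    objects = sorted({x for s, _, o in per for x in (s, o)})
    return [p for p in objects if role(p, per) == role(target, mem)]


def feature_match(target, mem_feats, per_feats):
    """Perception objects whose colors are identical to the target's colors."""
    return [name for name, feats in per_feats.items() if feats == mem_feats[target]]


structural_pick = structural_match("skiv", memory, perception)
feature_pick = feature_match("skiv", memory_features, perception_features)
```

Under these assumptions the two strategies disagree: the feature matcher is drawn to the snig, which shares the skiv's colors, while the structural matcher picks the trob, the only object that appears solely as the subject of a relation.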

And here are the visualizations:

Once I saw this, I quickly thought: wait, if Claude can create a graphical model of the problem, then why can’t it use this kind of reasoning to solve the problem? My initial prompt had given the model little context on how to solve it, other than suggesting chain-of-thought reasoning:

You are solving relational reasoning problems. Each problem has a memory scenario and a perception scenario. Your task is to map objects in the perception scenario to objects in the memory scenario based on their relational structure (how they relate to each other), then answer a specific question. Think step by step: first identify the relations in each scenario, then find the mapping that preserves relational structure, then answer.

I asked Claude to help generate a prompt to provide the model with more detailed instructions on how to solve the problem, based on its work on the graphical visualization:

using a standard prompt the models are unable to solve this problem. please suggest a prompt based on your work above that would help an LLM be more likely to successfully solve a problem like this one.

It did so (see full prompt here), and after having Claude Code add the ability to use custom prompts I was able to run the models using this new prompt. After a few excruciating minutes of waiting, I had the answer:

All of the models were now at ceiling on the cross-domain problems! At this point it was approaching lunch and I was done with analyses, so I finished the slides and sent them to the conference organizer to put onto the main computer for my presentation in a couple of hours. I then went to lunch and showed the results to several of the attendees, particularly John Hummel because I didn’t want him to feel ambushed when I gave my talk.

The full slide deck is available here if you are interested to see how it came out.

The talk

Given what a crazy experience this was, I felt remarkably calm going into the talk - much calmer than I had been earlier in the morning when I was under the gun to produce a coherent slide deck with useful results. When my time came to talk, I zoomed through my 27 slides in about 12 minutes, leaving a couple of minutes for questions. The most interesting question centered around whether providing the long crafted prompt that allowed near-ceiling performance counted as “cheating”. I tend to think that it doesn’t; the prompt provides the strategy, but if the model didn’t have the necessary representational apparatus then it shouldn’t be able to solve each individual problem. Another question concerned the use of chain-of-thought reasoning; I didn’t have results to directly address this since I hadn’t run the analyses without the CoT section of the prompt on the full problem set. It’s an interesting question as to how well the models would do without any CoT, but it seems clear to me that CoT would be necessary since the strategic prompting is required for success on the most difficult problem, and it’s hard to see how that would be useful to the model without the ability to use CoT.

I think it’s fair to say that the talk set the meeting abuzz. I had several people approach me afterwards telling me how exciting the talk was, and it spurred numerous conversations in the coffee break that followed. It was certainly the most exciting day I have ever had at a scientific conference.

Takeaways

I took away several lessons from this experience.

Agentic AI is a superpower. There is no way that I could have achieved this kind of turnaround without the current Claude ecosystem, as both Claude Code and Cowork were central in helping me complete the project. While the talk probably raised more questions than it answered, it addressed a legitimate scientific question arising from another talk, which in the past would have unfolded over weeks or months.
Testing LLMs is hard. It’s very difficult to ensure that the problems don’t allow shortcuts that let the models solve them without actually having the intended capability. I’m far from the first to say this, and in fact this is a critical insight from all of comparative and developmental psychology (e.g., see this by my colleague Mike Frank), but this was the first time that I have lived it.
It’s rarely a good idea to trust the first answer you get. Over the course of about 12 hours I had 4 different answers to the basic question of whether LLMs can solve the relational reasoning problem, and I’m sure that as I continue to work on this there will be additional twists.
Coding agents like Claude Code are amazing but also clearly make mistakes, as we saw here and as I have documented in my book Better Code, Better Science. As the project becomes more complex, those mistakes can become increasingly difficult to detect. Just as with code generation, lots of validation is required to trust the results.

Would I recommend going from idea to a conference talk in 24 hours? Of course not, and I hope that it doesn’t become an expectation in the future! But the fact that it was even possible in this case speaks to the superpowers provided by the current AI toolchain.

Best practices for Snakemake workflows

Russ Poldrack — Tue, 14 Apr 2026 15:01:20 GMT

The Snakemake team has published a set of best practices for the creation of Snakemake workflows, some of which I will outline here, along with one of my own (the first).

Using a working directory

By default Snakemake looks for a Snakefile in the current directory, so it’s tempting to run the workflow from the code repository. However, Snakemake creates a directory called .snakemake to store metadata in the directory where the workflow is run, which one generally doesn’t want to mix with the code. Thus, it’s best to run the command using the `--snakefile` option to point to the `Snakefile` located in the code directory, and to set the working directory to the intended output directory using the `-d` flag. However, running from a location other than the source folder will fail if the paths in the Snakemake rules are specified as relative paths, like this:

script:
    f"scripts/aggregate_results.py"

This happens because relative paths inside the Snakefile are interpreted as relative to the working directory, not the directory where the Snakefile is located. Instead, we need to use the workflow.basedir prefix, which refers to the directory where the Snakefile is located:

script:
    f"{workflow.basedir}/scripts/aggregate_results.py"
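The difference between the two forms comes down to which directory the relative path gets joined onto. Here is a plain-Python illustration of that idea; it is not Snakemake's own path resolution code, and both directory names are made up for the example.

```python
from pathlib import PurePosixPath


def resolve(relative_path, anchor_dir):
    """Join a relative path onto the directory it is interpreted against."""
    return PurePosixPath(anchor_dir) / relative_path


script = "scripts/aggregate_results.py"
workdir = "/scratch/output"         # hypothetical directory passed via -d
basedir = "/home/user/my_workflow"  # hypothetical location of the Snakefile

# Anchored to the working directory, the path points somewhere the script is not.
from_workdir = resolve(script, workdir)

# Anchored to the Snakefile's directory, it points at the real script.
from_basedir = resolve(script, basedir)
```

Prefixing the path with the Snakefile's directory makes it absolute, so it no longer matters where the command is launched from.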

Workflow organization

There is a standard format for the organization of Snakemake workflow directories, which one should follow when developing new workflows.

Snakefile formatting

Snakemake comes with a set of commands that help ensure that Snakemake rule and config files are properly formatted and follow best practices. As I mentioned above, there is a static analysis tool (i.e., a “linter”, akin to ruff or flake8 for Python code), which can automatically identify syntax errors and logical problems with Snakemake rule files. Users of uv should note that this tool assumes that one is using the Conda environment manager or a container, and it raises an issue for any rule that doesn’t specify a Conda or container environment. Nonetheless, if those are ignored the linter can be useful in identifying problems. There is also a formatting tool called snakefmt (separately installed) that optimally formats Snakemake files in the way that black or ruff format Python code. These can both be useful tools when developing a new workflow.

Configurability

Workflow configuration details should be stored in configuration files, such as the config.yaml files that we have used in our workflow examples. However, these files should not be used for runtime parameters, such as the number of cores or the output directory; those should instead be handled using Snakemake’s standard command line arguments.

Report generation

One of the very handy features of Snakemake is its ability to generate reports for workflow execution. Report generation is as simple as:

➤ uv run snakemake -c 1 --report output/report.html -d output

This command uses the metadata stored in the .snakemake directory along with details provided in separate report formatting files that are located within the report directory alongside the Snakefile. In order for an output (such as a figure) to be included in the report, it needs to be marked with a report flag in the output section of the relevant rule. For example, to have a correlation heatmap added to the report, I used the following statement:

rule generate_heatmap:
    input:
        f"{RESULTS_DIR}/correlation_matrix.csv",
    output:
        report(
            f"{FIGURES_DIR}/correlation_heatmap.png",
            caption=f"{BASEDIR}/report/heatmap.rst",
            category="Results",
        ),

Running the report generation command generates a single self-contained HTML file, with any figures embedded within the file, making them very handy for sharing.

In the next post I will discuss tracking provenance in workflows.

Reproducible environments with Snakemake

Russ Poldrack — Tue, 07 Apr 2026 15:01:30 GMT

In this post I will discuss the two methods that Snakemake provides for ensuring a reproducible execution environment.

Reproducible environments with Conda

Snakemake comes with native support for Conda environments, which helps ensure reproducibility across systems. As I discussed in Chapter 2, I don’t love Conda, but in lieu of support for uv it’s a reasonable solution for reproducible Snakemake workflows. After first installing Conda on our system (if necessary), we then need to identify all of the packages that are necessary for our workflow to succeed, and then add those to a YAML file. Here is the example for our simple workflow, which I placed in envs/simple.yml:

name: bettercode
channels:
  - conda-forge
dependencies:
  - numpy=2.4.0
  - pandas=2.3.3
  - matplotlib=3.10.8
  - seaborn=0.13.2

When we run the workflow, we will see that Snakemake first builds a local Conda environment within the working directory. In this case I am running it from a different directory than the source directory, so I need to specify the location of the Snakefile:

➤ uv run snakemake --sdm conda --snakefile /path/to/Snakefile -d ./ --cores 15

Building DAG of jobs...
Creating conda environment /path/to/snakemake_workflow/envs/simple.yml...
Downloading and installing remote packages.
Cleaning up conda package tarballs.
Environment for /path/to/snakemake_workflow/envs/simple.yml created (location: .snakemake/conda/0f65d58d0ced6388a583c7e1b77c240e_)

It then uses this environment to execute the code. It’s worth noting that this will leave the Conda environment in place within the hidden .snakemake directory, which can take up a significant amount of disk space if there are a lot of dependencies.

Reproducible environments with containers

As I discussed in Chapter 2, software containers are increasingly used as a means for creating reproducible software environments. Snakemake has built-in support for the Apptainer container tool, which is available for Linux and installed on most high-performance computing systems, but unfortunately not easily usable on Mac or Windows systems. Here I will show an example of a containerized version of the simple workflow above, running on my local Linux system.

Using containers is easiest if you can find an existing Docker container that contains all of the necessary dependencies for your code. Fortunately there is a large number of containers available via the Docker Hub, and given the simple dependencies that our workflow requires, I was easily able to find a container containing the necessary packages. I added this to my config.yaml file:

# Container image (used with --sdm apptainer)
container: "docker://jupyter/scipy-notebook:x86_64-ubuntu-22.04"

and also added the definition to my Snakefile:

# Container image for all rules (used with --sdm apptainer)
container: config["container"]

and then ran the snakemake command specifying Apptainer as my dependency management system:

➤ uv run snakemake --cores 1 --sdm apptainer -d ./output

Building DAG of jobs...
Pulling singularity image docker://jupyter/scipy-notebook:x86_64-ubuntu-22.04.
...

As with Conda, it’s worth noting that Snakemake will store the Apptainer image within the .snakemake directory, which can sometimes be quite large; for the Jupyter image linked above, it was about 1.2 GB, but I have seen containers up to 10 GB or more on occasion.

In the next post I will lay out a set of best practices for Snakemake workflows.

Workflow management using Snakemake

Russ Poldrack — Tue, 31 Mar 2026 15:01:23 GMT

Warning: This post is longer than usual!

In this post I will use the Snakemake workflow system for our example workflow from the previous post. I chose Snakemake for several reasons:

It is a very well-established project that is actively maintained.
It is Python-based, which makes it easy for Python users to grasp.
Because of its long history and wide use, AI coding assistants are quite familiar with it and can easily generate the necessary files for complex workflows.

Snakemake is a sort of “make on steroids”, designed specifically to manage complex computational workflows. It uses a Python-like syntax to define the workflow, from which it infers the computational graph and optimizes the computation. The Snakemake workflow is defined using a Snakefile, the most important aspect of which is a set of rules that define the different workflow steps in terms of their outputs. Here is an initial portion of the Snakefile for our simple workflow:

# Base directory (where Snakefile is located)
BASEDIR = workflow.basedir

# Output directories (relative to working directory set via -d)
DATA_DIR = "data"
RESULTS_DIR = "results"
FIGURES_DIR = "figures"
LOGS_DIR = "logs"

# Load configuration
configfile: f"{BASEDIR}/config/config.yaml"

# Global report
report: f"{BASEDIR}/report/workflow.rst"

# Default target
rule all:
    input:
        f"{FIGURES_DIR}/correlation_heatmap.png",

What this does is first specify a set of directories; the BASEDIR variable refers to the directory where the Snakefile is found, while the other directories are specified with respect to the working directory that is set using the -d argument. It then specifies the location of the configuration file, which is a YAML file that defines various parameters for the workflow. Here are the contents of the config file for our simple example:

# Data URLs
meaningful_variables_url: "https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv"
demographics_url: "https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv"

# Correlation settings
correlation_method: "spearman"

# Heatmap settings
heatmap:
  figsize: [12, 10]
  cmap: "coolwarm"
  vmin: -1.0
  vmax: 1.0

The only rule shown above is the all rule, which takes as its input the correlation figure that is the final output of the workflow. If Snakemake is called and that file already exists, then it won’t be rerun (since it’s the only requirement for the rule) unless 1) the --forceall flag is included, which forces rerunning the entire workflow, or 2) a rerun is triggered by a change to the input files, parameters, or the code itself. If the file doesn’t exist, then Snakemake examines the additional rules to determine which steps need to be run in order to generate that output. In this case, it would start with the rule that generates the correlation figure:

# Step 5: Generate clustered heatmap
rule generate_heatmap:
    input:
        f"{RESULTS_DIR}/correlation_matrix.csv",
    output:
        report(
            f"{FIGURES_DIR}/correlation_heatmap.png",
            caption=f"{BASEDIR}/report/heatmap.rst",
            category="Results",
        ),
    params:
        figsize=config["heatmap"]["figsize"],
        cmap=config["heatmap"]["cmap"],
        vmin=config["heatmap"]["vmin"],
        vmax=config["heatmap"]["vmax"],
    log:
        f"{LOGS_DIR}/generate_heatmap.log",
    conda:
        f"{BASEDIR}/envs/simple.yml"
    script:
        f"{BASEDIR}/scripts/generate_heatmap.py"

This step uses the generate_heatmap.py script to generate the correlation figure, and it requires the correlation_matrix.csv file as input. Note that while there is a conda directive in this rule, which we will discuss further in a later post, Conda is not actually used unless the --sdm conda flag (or the older --use-conda flag) is provided. Snakemake would then work backward to identify which step is required to generate that file, which is the following:

# Step 4: Compute correlation matrix
rule compute_correlation:
    input:
        f"{DATA_DIR}/joined_data.csv",
    output:
        f"{RESULTS_DIR}/correlation_matrix.csv",
    params:
        method=config["correlation_method"],
    log:
        f"{LOGS_DIR}/compute_correlation.log",
    conda:
        f"{BASEDIR}/envs/simple.yml"
    script:
        f"{BASEDIR}/scripts/compute_correlation.py"

By working backwards this way from the intended output, Snakemake can reconstruct the computational graph that we saw in the previous post. It then uses this graph to plan the computations that will be performed.
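The backward-chaining idea can be sketched in a few lines of plain Python. This is a toy illustration rather than Snakemake's actual planner: it only asks whether a file already exists, whereas Snakemake also considers timestamps, parameters, and code changes. The last two rules mirror the Snakefile excerpts above; the file names for the filtering steps are hypothetical placeholders, since those rules are not shown here.

```python
# Each rule maps to (inputs, outputs), with paths relative to the working directory.
RULES = {
    "download_demographics": ([], ["data/demographics.csv"]),
    "download_meaningful_variables": ([], ["data/meaningful_variables.csv"]),
    # The two filtered file names below are hypothetical placeholders.
    "filter_demographics": (
        ["data/demographics.csv"],
        ["data/demographics_filtered.csv"],
    ),
    "filter_meaningful_variables": (
        ["data/meaningful_variables.csv"],
        ["data/meaningful_variables_filtered.csv"],
    ),
    "join_datasets": (
        ["data/demographics_filtered.csv", "data/meaningful_variables_filtered.csv"],
        ["data/joined_data.csv"],
    ),
    "compute_correlation": (["data/joined_data.csv"], ["results/correlation_matrix.csv"]),
    "generate_heatmap": (
        ["results/correlation_matrix.csv"],
        ["figures/correlation_heatmap.png"],
    ),
}


def plan(target, existing=frozenset()):
    """Return the rules needed to build `target`, in a valid execution order."""
    producers = {out: name for name, (_, outs) in RULES.items() for out in outs}
    order = []

    def visit(path):
        if path in existing:
            return  # file is already present, nothing to do
        if path not in producers:
            raise ValueError(f"no rule produces {path}")
        rule = producers[path]
        if rule in order:
            return  # rule is already scheduled
        for dep in RULES[rule][0]:
            visit(dep)  # schedule whatever this rule depends on first
        order.append(rule)

    visit(target)
    return order


full_plan = plan("figures/correlation_heatmap.png")
partial_plan = plan(
    "figures/correlation_heatmap.png", existing={"data/joined_data.csv"}
)
```

Starting from the heatmap, the sketch schedules all seven steps with the downloads first and the heatmap last; if the joined data file is already present, only the correlation and heatmap steps remain.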

Snakemake scripts

In order for Snakemake to execute each of our modules, we need to wrap those modules in a script that can use the configuration information from the config file. Here is an example of what the generate_heatmap.py script looks like:

from pathlib import Path
import pandas as pd
from bettercode.simple_workflow.visualization import (
    generate_clustered_heatmap,
)

def main():
    """Generate and save clustered heatmap."""
    input_path = Path(snakemake.input[0])
    output_path = Path(snakemake.output[0])
    figsize = tuple(snakemake.params.figsize)
    cmap = snakemake.params.cmap
    vmin = snakemake.params.vmin
    vmax = snakemake.params.vmax

    # Load correlation matrix
    corr_matrix = pd.read_csv(input_path, index_col=0)
    print(f"Loaded correlation matrix: {corr_matrix.shape}")

    # Generate heatmap
    generate_clustered_heatmap(
        corr_matrix,
        output_path=output_path,
        figsize=figsize,
        cmap=cmap,
        vmin=vmin,
        vmax=vmax,
    )
    print(f"Saved heatmap to {output_path}")

if __name__ == "__main__":
    main()

You can see that the code refers to snakemake even though we haven’t explicitly imported it. This is possible because the script is executed within the Snakemake environment, which makes that object available; it contains all of the configuration details for the rule.
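Because the script only reads a handful of attributes from that object, one way to exercise the wrapper logic outside of Snakemake is to hand it a stand-in with the same shape. The sketch below uses types.SimpleNamespace for this; the stand-in and the read_settings helper are hypothetical names for illustration, not part of Snakemake or of the book's code, and the values are copied from the rule and config file shown earlier.

```python
from pathlib import Path
from types import SimpleNamespace


def read_settings(smk):
    """Pull out the same fields that the script reads from the injected object."""
    return {
        "input_path": Path(smk.input[0]),
        "output_path": Path(smk.output[0]),
        "figsize": tuple(smk.params.figsize),
        "cmap": smk.params.cmap,
        "vmin": smk.params.vmin,
        "vmax": smk.params.vmax,
    }


# A stand-in that mimics only the attributes used above (not the real object).
fake_snakemake = SimpleNamespace(
    input=["results/correlation_matrix.csv"],
    output=["figures/correlation_heatmap.png"],
    params=SimpleNamespace(figsize=[12, 10], cmap="coolwarm", vmin=-1.0, vmax=1.0),
)

settings = read_settings(fake_snakemake)
```

Passing the object in as an argument, rather than reading a global, is a design choice that makes this kind of check possible without launching a workflow.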

Once the Snakefile and other related files are ready to go, we can first check whether they are properly formatted using snakemake --lint, which is a static analysis tool specifically for Snakemake configuration files. Here is the result (note: some output lines are omitted for conciseness):

➤ uv run snakemake --lint --cores 1 -d ./output

Congratulations, your workflow is in a good condition!

We are now ready to run the snakemake command, which will need to include several configuration directives. Let’s first do a “dry run” which creates the execution graph but doesn’t actually execute anything:

➤ uv run snakemake --dry-run --cores 1 -d ./output

Building DAG of jobs...

Job stats:
job                              count
-----------------------------  -------
all                                  1
compute_correlation                  1
download_demographics                1
download_meaningful_variables        1
filter_demographics                  1
filter_meaningful_variables          1
generate_heatmap                     1
join_datasets                        1
total                                8

... (omitting intermediate output)

Job stats:
job                              count
-----------------------------  -------
all                                  1
compute_correlation                  1
download_demographics                1
download_meaningful_variables        1
filter_demographics                  1
filter_meaningful_variables          1
generate_heatmap                     1
join_datasets                        1
total                                8

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all, compute_correlation, filter_demographics, filter_meaningful_variables, generate_heatmap, join_datasets
    output files have to be generated:
        compute_correlation, download_demographics, download_meaningful_variables, filter_demographics, filter_meaningful_variables, generate_heatmap, join_datasets
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Once we have confirmed that everything is set up properly, we can then use snakemake to run the workflow:

➤ uv run snakemake --cores 1 -d ./output

Building DAG of jobs...

Job stats:
job                              count
-----------------------------  -------
all                                  1
compute_correlation                  1
download_demographics                1
download_meaningful_variables        1
filter_demographics                  1
filter_meaningful_variables          1
generate_heatmap                     1
join_datasets                        1
total                                8

Select jobs to execute...
Execute 1 jobs...

[Wed Dec 24 08:17:57 2025]
localrule download_demographics:
    output: output/data/demographics.csv
    log: output/logs/download_demographics.log
    jobid: 7
    reason: Missing output files: output/data/demographics.csv
    resources: tmpdir=/var/folders/r2/f85nyfr1785fj4257wkdj7480000gn/T
Downloaded 522 rows from https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv
Saved to output/data/demographics.csv
[Wed Dec 24 08:17:58 2025]
Finished jobid: 7 (Rule: download_demographics)
1 of 8 steps (12%) done

... (omitting intermediate output)

8 of 8 steps (100%) done
Complete log(s): .snakemake/log/2025-12-24T081757.266320.snakemake.log

It’s important to know that when Snakemake is run, it stores metadata regarding the workflow in a hidden directory called .snakemake, including the log file mentioned at the end of the output above. This directory can carry over hidden state from previous runs, which can become confusing when debugging. It’s sometimes useful to remove this directory as a last resort (when the --force and --cleanup-metadata flags are not sufficient).

One handy feature of Snakemake is that, just like make, we can give it a specific target file and it will perform only the portions of the workflow that are required to regenerate that specific file. For example, let’s say that the file output/data/demographics.csv became corrupted and we needed to recreate it. This could be done using the command:

➤ uv run snakemake output/data/demographics.csv --cores 1 -d ./output

Updating the workflow when inputs change

Once the workflow has completed successfully, re-running it will not result in the re-execution of any of the analyses:

➤ uv run snakemake --cores 1 -d ./output

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

However, when generating its DAG Snakemake checks several features of the workflow to see if anything relevant has changed. By default it checks whether any of the following have changed (configurable using the `--rerun-triggers` flag):

modification times of input files
the code specified within the rule
the input files or parameters for the rule

Snakemake also checks for changes in the details of the software environment, but as of the date of writing this only works for Conda environments.

As an example, I will first update the modification time of the meaningful variables file from a previous successful run using the touch command:

➤  ls -l ./output/data/meaningful_variables.csv
Permissions Size User     Date Modified Name
.rw-r--r--@ 1.2M poldrack 24 Dec 10:11  data/meaningful_variables.csv

➤  touch ./output/data/meaningful_variables.csv

➤  ls -l ./output/data/meaningful_variables.csv
Permissions Size User     Date Modified Name
.rw-r--r--@ 1.2M poldrack 24 Dec 10:14  data/meaningful_variables.csv

You can see that the touch command updated the modification time of the file. Now let’s rerun the snakemake command:

➤ uv run snakemake --cores 1 -d ./output

Building DAG of jobs...

Job stats:
job                            count
---------------------------  -------
all                                1
compute_correlation                1
filter_meaningful_variables        1
generate_heatmap                   1
join_datasets                      1
total                              5
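The modification-time trigger shown here boils down to a comparison between the timestamps of a step's inputs and its output. Here is a minimal sketch of that comparison, assuming a single hypothetical filtering step; it ignores the other triggers (code and parameter changes) and is not how Snakemake itself is implemented. The timestamps are set explicitly with os.utime so that the example is deterministic.

```python
import os
import tempfile
from pathlib import Path


def needs_rerun(output, inputs):
    """True if the output is missing or older than any of its inputs."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(path.stat().st_mtime > out_mtime for path in inputs)


def demo():
    """Mimic the touch example: the output is up to date until its input is touched."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "meaningful_variables.csv"
        filtered = Path(tmp) / "filtered.csv"  # hypothetical output file name
        source.write_text("a,b\n1,2\n")
        filtered.write_text("a\n1\n")

        # Input is older than output: nothing to do.
        os.utime(source, (1000, 1000))
        os.utime(filtered, (2000, 2000))
        before_touch = needs_rerun(filtered, [source])

        # Equivalent of `touch`: the input is now newer than the output.
        os.utime(source, (3000, 3000))
        after_touch = needs_rerun(filtered, [source])
    return before_touch, after_touch
```

Once one step is rerun its own output becomes newer than everything downstream, which is why touching a single input file cascades through the filtering, joining, correlation, and heatmap steps.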

Similarly, Snakemake will rerun the workflow if any of the scripts used to run the workflow are modified. However, it’s important to note that it will not identify changes in the modules that those scripts import. In that case you would need to force re-execution of the relevant steps, for example using the --forcerun flag with the names of the affected rules, or the --forceall flag to rerun everything.

In the next post I will talk about managing reproducible environments in Snakemake.

Remembering Robert Bilder

Russ Poldrack — Fri, 27 Mar 2026 13:20:41 GMT

My friend and long-time collaborator Robert (Bob) Bilder died last August of cancer at the age of 69. On April 16 we will hold a conference at UCLA celebrating his life, with talks by many of his longtime collaborators. This post is an outline of my talk, which will be a personal memorial to Bob, who had a foundational impact on my career and thinking. If you want to learn more about Bob’s career, here is a memorial published in Neuropsychopharmacology.

Robert (Bob) Bilder, 1956-2025

I met Bob just after I moved to UCLA in 2002; he had also moved there around the same time from Albert Einstein College of Medicine and North Shore-Long Island Jewish Research Institute. Bob was a clinical neuropsychologist whose work had focused on the neuropsychology of psychiatric disorders. I had started collaborating with psychiatry researchers when I was at MGH, particularly Larry Seidman (who also passed away tragically young in 2017), which had piqued my interest in this direction. Meeting Bob was a revelation, because he was deeply steeped in both cognitive science and neuroscience in addition to his home domain of neuropsychology, giving us a ton of common ground.

The Consortium for Neuropsychiatric Phenomics

Over the next few years Bob and I became increasingly close colleagues, largely focused around the development of a large project that came to be known as the “CNP”. The idea of “phenomics” was made popular by our UCLA colleagues Nelson Freimer and Chiara Sabatti in their 2003 paper, “The Human Phenome Project”. In the wake of the initial sequencing of the human genome in 2001, Freimer and Sabatti proposed that the hard work for biomedical science now lay in the understanding of the universe of phenotypes that arise when the genotype interacts with the environment, and in particular in the way that those phenotypes vary across individuals. Bob was interested in an understanding of psychological and neural phenotypes, which he couched in terms of a framework that I often referred to as the “layer cake” - here is an example reprinted from Figure 4 in Poldrack et al. (2011):

The early 2000s were also a time when many were questioning the way that neuropsychiatric disorders were understood. There was an emerging focus on the degree to which these disorders may reflect the dysfunction of specific underlying neural systems as well as a growing realization that diagnostic categories in psychiatry did not map neatly onto underlying neural or psychological mechanisms. It was this set of ideas that would ultimately result in the Research Domain Criteria (RDoC) framework (Insel et al., 2010; Cuthbert and Insel, 2013) that has played a central role in focusing NIMH research onto the underlying dimensions of mental illness. Bob was at the forefront of this movement, and around 2005-6 he began recruiting a dream team of researchers at UCLA that would develop a transdisciplinary approach to understanding the neural basis of psychiatric disorders, focused specifically on the domains of memory and cognitive control. The approach would be transdiagnostic, recruiting healthy individuals along with people diagnosed with several different disorders (schizophrenia, bipolar disorder, and ADHD). It would also be trans-species, with collaborators studying these domains in rodent and nonhuman primate models. Bob was also very forward-thinking in including an entire branch of the project focused on informatics, out of which would grow the Cognitive Atlas project (more about this below).

The set of consortium grants that we developed were funded in 2007 as part of the NIH Roadmap Initiative, and over the next few years we set out to collect a large behavioral/genetics sample of ~2000 individuals (known as the LA2K sample), and a smaller sample of about 300 individuals imaged with fMRI on a range of tasks covering the domains of memory and cognitive control (known as the LA3C sample). I moved from UCLA to the University of Texas in 2009, but I remained closely involved in the project. During my time at UT we developed and launched the OpenfMRI data sharing project, which shared neuroimaging data in a completely open manner. We later shared the entire LA3C imaging dataset through OpenfMRI, and the data remain available on the OpenNeuro platform that succeeded OpenfMRI. We also published a data descriptor describing the dataset (Poldrack et al., 2016), which has been cited nearly 500 times since its publication. While not all of these citations reflect an actual reuse of the data, we know that the data has been reused many times, and is one of the most highly accessed datasets on the OpenNeuro archive.

The phenomics mindset remained with me, and in 2012 I developed a study that took it in a different direction. One of the things that is clear but often forgotten about neuropsychiatric disorders is their variability within an individual; a person diagnosed with schizophrenia can be completely disabled one week, and then largely functional several weeks later, and other disorders show similar intra-individual variability. Yet our imaging studies usually assume that a single snapshot of a person is a definitive picture of their diagnosis. When I started thinking about how we might understand this variability, I realized that there was a missing link: We knew nothing about how brain function varies over time within a healthy individual. I knew that it would be practically difficult to repeatedly image an individual, so I decided to start with myself. In a study that I came to call the MyConnectome study (Poldrack et al., 2015), I imaged myself repeatedly over the course of 18 months, ultimately scanning myself more than 100 times and taking 48 blood draws. These data provided what was, at the time, the most in-depth picture of brain function and physiology that had ever been collected on an individual. This study also identified a striking degree of individual variability in the functional organization of the cortex (Laumann et al., 2015), which had not been reliably identified with the small amounts of imaging data that were acquired in earlier studies. This finding has since fed into the development of precision function mapping approaches that in the last couple of years have identified robust differences in functional brain organization related to disorders including depression (Lynch et al., 2024), OCD (Vaghi et al., 2025), and Parkinson’s disease (Ren et al., 2026).

We have also taken the phenomics approach in another direction, collecting large amounts of behavioral data on relatively large samples of individuals in order to understand the structure of cognitive control processes. In our first study of this type, we collected a large number of phenotypes, totalling 22 self-report surveys and 37 behavioral tasks, across the domain of cognitive control and related functions. This dataset allowed us to characterize the large-scale structure of the domain, revealing a couple of important findings. First, we found that whereas survey measures were in some cases strongly predictive of real-world outcomes, cognitive task measures showed very little predictive validity for real-world behavior. Additional analyses of a subset of subjects retested after several months (Enkavi et al., 2018) showed that this may reflect low test-retest reliability in task measures. Second, we found that there were negligible or weak correlations between survey measures and task measures, even though they are often taken to index the same underlying constructs. Instead, the task domain seemed to be primarily organized according to the different aspects of speeded responding that can be identified using a diffusion decision model (DDM), as shown in Figure 4 of Eisenberg et al. (2019):

These two projects are just a couple of the ways in which phenomics thinking has influenced my lab’s work in years since I left UCLA. While Bob was not a direct collaborator on these projects, we discussed them often, and his fingerprints are all over them; I simply don’t think I would have ended up doing this work if it weren’t for Bob’s influence.

Informatics and Ontologies

Bob was also a visionary regarding the role of informatics in understanding minds and brains and their disorders. One of the projects within the CNP was focused entirely on informatics, particularly focused on developing a “Hypothesis Web” that could link together data across multiple sources and levels. This was directly driven by the progress that Bob had witnessed in the field of genomics, where the Gene Ontology and related tools had allowed rapid advances in biological understanding. Bob envisioned that we could mine data to better understand the structure of the “layer cake” that I showed above.

A number of tools came out of this, but the one that has had the most enduring impact is the Cognitive Atlas (Poldrack et al., 2011). The aim of this project was to develop a controlled vocabulary of the terms used to define psychological functions and psychological measurements (such as tasks or surveys), and to annotate the relationships between them. We made good headway on this in the early days, but for more than a decade the project received almost no interest, outside of philosophers who were interested in the notion of “cognitive ontologies” from a theoretical point of view. It wasn’t until the last few years that the world became interested in the Cognitive Atlas, as you can see in this plot of citations to the paper, which have spiked in just the last few years:

I think there are a couple of reasons for this increased interest. One is that the National Academies convened a meeting on the topic, which led to a report titled “Ontologies in the Behavioral Sciences: Accelerating Research and the Spread of Knowledge (2022)”. This report made six recommendations, which focused on the need for increased funding, policy, and education around the importance of ontologies in the behavioral sciences.

Simultaneously, the Cognitive Atlas has increasingly been used in conjunction with AI tools to annotate data and generate new knowledge. As one example, Menuet et al. (2022) used an enriched version of the Cognitive Atlas to annotate data from the Neurovault data archive (an archive of statistical maps that we also run) and then used these data to train a neural network to decode cognitive functions from brain images, as shown in Figure 1 of their paper:

They showed that they could decode a large number of mental processes on held-out data, with greater decoding accuracy than models that don’t use the ontology.

We also showed several years ago that the Cognitive Atlas could be useful for predicting neuroimaging signals via “cognitive encoding models”, which annotate individual tasks with their Cognitive Atlas features and then use linear models to fit the data via those features (Walters et al., 2022). We showed that these models substantially improved our ability to predict activation maps on unseen tasks, based only on their Cognitive Atlas annotations. This provided yet another proof of concept that ontologies are a powerful tool for data analysis and understanding.

Ontologies are also playing an important role in powering new agentic AI tools for data analysis and understanding. In the Brain-Researcher project developed by Zijiao Chen in my lab recently, we have used a knowledge graph that includes the Cognitive Atlas as the basis for a neuroscience data analysis agent, which can autonomously identify data and generate code to answer neuroscientific questions based on natural language queries. The machine-readable nature of the Atlas makes it easily ingestible by AI agents, so we expect that it will become increasingly useful for AI applications. I’m really sad that Bob isn’t able to see this, because it reflects the mature version of the ideas that he inspired our group to hatch back in the mid-2000’s.

Bob’s legacy

Bob clearly had an important impact on my career, but even more importantly, he was a close friend and mentor to me. We had many conversations in his office at the UCLA Neuropsychiatric Institute, which were wide-ranging, enlightening, and fun. My wife Jen and I were fortunate to be able to see Bob one last time in May in Los Angeles, and enjoy some steak (cooked rare of course) and a nice red wine with Bob, his wife Debbie, and his daughter Alexandra. We miss him dearly but I hope that our work will continue to carry forward his legacy well into the future.

Using a workflow engine

Russ Poldrack — Tue, 24 Mar 2026 15:00:58 GMT

There is a wide variety of workflow engines available for data analysis workflows, most of which are centered around the concept of an “execution graph”. This is a graph in the sense described by graph theory, which refers to a set of nodes that are connected by lines (known as “edges”). Workflow execution graphs are a particular kind of graph known as a directed acyclic graph, or DAG for short. Each node in the graph represents a single step in the workflow, and each edge represents the dependency relationships that exist between nodes. DAGs have two important features. First, the edges are directed, which means that they move in one direction that is represented graphically as an arrow. These represent the dependencies within the workflow. For example, in our workflow step 1 (obtaining the data) must occur before step 2 (filtering the data), so the graph would have an edge from step 1 with an arrow pointing at step 2. Second, the graph is acyclic, which means that it doesn’t have any cycles, that is, it never circles back on itself. Cycles would be problematic, since they could result in workflows that executed in an infinite loop as the cycle repeated itself.
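
To make the two properties of a DAG concrete, here is a minimal sketch using the `graphlib` module from the Python standard library (Python 3.9 or later). The step names are hypothetical stand-ins for the steps described above, and this is only an illustration of the idea, not how any particular workflow engine is implemented:

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a workflow step; its value is the set of steps it depends on.
# The step names are hypothetical.
workflow = {
    "obtain_data": set(),
    "filter_data": {"obtain_data"},
    "compute_correlation": {"filter_data"},
}

# static_order() yields the steps in an order that respects every edge.
order = list(TopologicalSorter(workflow).static_order())
print(order)

# Adding an edge that points back to an earlier step creates a cycle,
# which the sorter refuses to order.
cyclic = dict(workflow, obtain_data={"compute_correlation"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the edges are directed, the sorter can work out a valid execution order; because a valid workflow graph is acyclic, that order always exists. Once a cycle is introduced no such order exists, and the sorter raises an error instead.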

Most workflow engines provide tools to visualize a workflow as a DAG. Figure 8.2 shows our example workflow visualized using the Snakemake tool that we will introduce below:

Figure 8.2: The execution graph for the simple example analysis workflow visualized as a DAG.

The use of DAGs to represent workflows provides a number of important benefits:

The engine can identify independent pathways through the graph, which can then be executed in parallel
If one node of the graph changes, the engine can identify which downstream nodes need to be rerun
If a node fails, the engine can be configured to continue with executing the nodes that don’t depend on the failed node either directly or indirectly

There are a couple of additional benefits to using a workflow engine, which we will discuss in more detail in the context of a more complex workflow. The first is that they generally deal automatically with the storage of intermediate results (known as caching or checkpointing), which can help speed up execution when nothing has changed and allow continued execution if the process is interrupted. The second is that the workflow engine uses the execution graph to optimize the schedule of computations, only performing those operations that are actually needed. This is similar in spirit to the concept of lazy execution used by packages like Polars, in which the system optimizes computational efficiency by first analyzing the full computational graph.

General-purpose versus domain-specific workflow engines

With the growth of data science within industry and research, there has been an explosion of new workflow management systems that aim to solve particular problems; a list of these can be found at awesome-workflow-engines. It’s also worth noting that there are a number of domain-specific workflow engines that are specialized for particular kinds of data and workflows. Examples include Galaxy, which is specialized for bioinformatics and genomics, and Nipype, which is specialized for neuroimaging analysis workflows. If your research community uses one of these then it’s worth exploring that engine as your first option, since it will probably be well supported within the community. However, a benefit of general-purpose engines is that they will often be better maintained and supported, and AI tools will likely have more examples to work from in generating workflows.

In the next post I will take a deep dive into workflow management using the Snakemake workflow engine.

Running a simple workflow using GNU make

Russ Poldrack — Tue, 17 Mar 2026 15:01:24 GMT

A simple workflow example

Most real scientific workflows are complex and can often run for hours, and we will encounter such a complex workflow later in the chapter. However, we will start our discussion of workflows with a relatively simple and fast-running example that will help demonstrate the basic concepts of workflow execution. We will use the same data as above (from Eisenberg et al., 2019) to perform a simple workflow:

Load the demographic and meaningful variables files
Drop any non-numeric variables from each data frame
Join the data frames using their shared index
Compute the correlation matrix across all variables
Generate a clustered heatmap for the correlation matrix

I have implemented each of these components as a module here. The simplest possible workflow would be a script that simply imports and calls each of the methods in turn. For such a simple workflow this would be fine, but we will use the example to show how we might take advantage of more sophisticated workflow management tools.
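
As a rough sketch of what such a script might look like, here are the first four steps written as small functions and called in turn. The function names and the tiny data frames are made up for illustration and are not the actual module linked above; the heatmap step is omitted to keep the example short:

```python
import pandas as pd

# Tiny made-up stand-ins for the two data files, indexed by subject ID
demographics = pd.DataFrame(
    {"Age": [25, 31, 40, 22], "Sex": ["F", "M", "F", "M"]},
    index=["s1", "s2", "s3", "s4"],
)
variables = pd.DataFrame(
    {"task_a": [0.1, 0.4, 0.9, 0.2], "task_b": [1.0, 0.7, 0.1, 0.8]},
    index=["s1", "s2", "s3", "s4"],
)


def drop_non_numeric(df):
    """Keep only the numeric columns of a data frame."""
    return df.select_dtypes(include="number")


def join_frames(left, right):
    """Join two data frames on their shared index."""
    return left.join(right, how="inner")


def compute_correlation(df):
    """Compute the correlation matrix across all columns."""
    return df.corr()


# The "workflow" is simply each function called in turn.
joined = join_frames(drop_non_numeric(demographics), drop_non_numeric(variables))
corr = compute_correlation(joined)
print(corr.shape)
```

Keeping each step in its own function is what makes it possible to move later from a single script to a tool that runs the steps separately.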

Running a simple workflow using GNU make

One of the simplest ways to organize a workflow is using the GNU make command, which executes commands defined in a file named Makefile. make is a very handy general-purpose tool that every user of UNIX systems should become familiar with. The Makefile defines a set of labeled commands, like this:

.PHONY: all clean

all: step1.txt step2.txt

# this one takes no input, and outputs step1.txt
step1.txt:
	python step1.py

# this one requires step1.txt as input, and outputs step2.txt
step2.txt: step1.txt
	python step2.py -i step1.txt

clean:
	-rm step1.txt step2.txt

In this case, the command make step1.txt will run the command python step1.py which outputs a file called step1.txt, unless that file already exists and the existing file is newer than its dependencies. This is one of the powerful features of make: since it checks the timestamps of existing files, it can automatically rerun commands if any of their dependencies have changed. The command make step2.txt requires step1.txt, so it will first run that action (which will do nothing if the file already exists and is newer than its dependencies). It will then perform python step2.py -i step1.txt which outputs step2.txt. The command make all will execute the all target, which includes both of the output files, and make clean will remove each of those files if they exist. The targets all and clean are referred to as phony targets since they are not meant to refer to a specific file but rather to an action. The .PHONY designation in the Makefile denotes this, such that those commands will run even if a file called “all” or “clean” happens to exist. This should already show you why make is such a handy tool: Any time there is a command that you run regularly in a particular directory, you can put it into a Makefile and then execute it with just a single make call.
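
The timestamp check at the heart of make can be sketched in Python. This is a deliberately simplified illustration (the `needs_rebuild` function is hypothetical, and real make handles many more cases), but it captures the rule: rebuild when the target is missing or older than any of its dependencies.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    step1 = os.path.join(tmp, "step1.txt")
    step2 = os.path.join(tmp, "step2.txt")

    Path(step1).touch()
    # step2.txt does not exist yet, so it must be built
    missing_target = needs_rebuild(step2, [step1])

    Path(step2).touch()
    # Set explicit timestamps so the comparison does not depend on clock resolution
    os.utime(step1, (1000, 1000))
    os.utime(step2, (2000, 2000))
    up_to_date = needs_rebuild(step2, [step1])

    # Pretend step1.txt was regenerated after step2.txt was built
    os.utime(step1, (3000, 3000))
    stale_target = needs_rebuild(step2, [step1])

print(missing_target, up_to_date, stale_target)
```

The three calls correspond to the three situations described above: the output is missing, the output is newer than its input, and the input has changed since the output was built.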

Here is how we could build a Makefile to run our simple workflow:

# if OUTPUT_DIR isn't already defined, set it to the default
OUTPUT_DIR ?= ./output

.PHONY: all clean

all: $(OUTPUT_DIR)/figures/correlation_heatmap.png

$(OUTPUT_DIR)/data/demographics.csv $(OUTPUT_DIR)/data/meaningful_variables.csv:
	@echo "Downloading data..."
	mkdir -p $(OUTPUT_DIR)/data $(OUTPUT_DIR)/results $(OUTPUT_DIR)/figures
	uv run python scripts/download_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/demographics_numerical.csv: $(OUTPUT_DIR)/data/demographics.csv
	@echo "Filtering demographics data..."
	uv run python scripts/filter_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/meaningful_variables_numerical.csv: $(OUTPUT_DIR)/data/meaningful_variables.csv
	@echo "Filtering meaningful variables data..."
	uv run python scripts/filter_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/data/joined_data.csv: $(OUTPUT_DIR)/data/demographics_numerical.csv $(OUTPUT_DIR)/data/meaningful_variables_numerical.csv
	@echo "Joining data..."
	uv run python scripts/join_data.py $(OUTPUT_DIR)/data

$(OUTPUT_DIR)/results/correlation_matrix.csv: $(OUTPUT_DIR)/data/joined_data.csv
	@echo "Computing correlation..."
	uv run python scripts/compute_correlation.py $(OUTPUT_DIR)/data $(OUTPUT_DIR)/results

$(OUTPUT_DIR)/figures/correlation_heatmap.png: $(OUTPUT_DIR)/results/correlation_matrix.csv
	@echo "Generating heatmap..."
	uv run python scripts/generate_heatmap.py $(OUTPUT_DIR)/results $(OUTPUT_DIR)/figures
	
clean:
	rm -rf $(OUTPUT_DIR)

Most of the targets (except for “clean” and “all”) refer to specific files that are required for the workflow. For example, the first target refers to the two files that need to be downloaded by the download_data.py script. This target does not rely on the outputs of any others, so there is nothing following the colon in the target name. For the others, they require particular inputs, which come after the colon; thus, if those don’t already exist then their targets will be run first. Note that make requires the use of tabs to indent commands, and will fail if spaces are used; thus, Makefile commands often need to be reformatted when copied and pasted since this often converts tabs to spaces.

We can run the entire workflow by simply running `make all`:

➤  make all
Downloading data...
mkdir -p ./output/data ./output/results ./output/figures
uv run python scripts/download_data.py ./output/data
Downloaded meaningful_variables.csv (522 rows)
Downloaded demographics.csv (522 rows)
Filtering demographics data...
uv run python scripts/filter_data.py ./output/data
Filtered meaningful_variables: (522, 193) -> (522, 193)
Filtered demographics: (522, 33) -> (522, 28)
Joining data...
uv run python scripts/join_data.py ./output/data
Meaningful variables: (522, 193)
Demographics: (522, 28)
Joined: (522, 221)
Computing correlation...
uv run python scripts/compute_correlation.py ./output/data ./output/results
Loaded joined data: (522, 221)
Saved correlation matrix: (221, 221)
Generating heatmap...
uv run python scripts/generate_heatmap.py ./output/results ./output/figures
Loaded correlation matrix: (221, 221)
Saved heatmap to output/figures/correlation_heatmap.png

The rules that refer to specific files will only be triggered if the file in question doesn’t exist or is older than its dependencies, as we can see if we run the `make` command again:

➤  make all
make: Nothing to be done for `all'.

However, if we delete the heatmap file and rerun the `make` command, then the `generate_heatmap` action will be triggered:

➤  make all

Generating heatmap...
uv run python scripts/generate_heatmap.py ./output/results ./output/figures
Loaded correlation matrix: (221, 221)
Saved heatmap to output/figures/correlation_heatmap.png

We could also take advantage of another feature of make: it only triggers the action if a file with the name of the action doesn’t exist, or if the existing file is not newer than its dependencies. Thus, if the command was make results/output.txt, then the action would only be triggered if the file does not exist or if it is older than its inputs. This is why we had to put the .PHONY command in the makefile above: it’s telling make that those are not meant to be interpreted as file names, but rather as commands, so that they will be run even if files named “all” or “clean” exist.

For many simple workflows make can be a perfectly sufficient solution to workflow management, but we will see below why it’s not sufficient to manage a complex workflow. For those workflows we could either build our own more complex workflow management system, or we could use an existing software tool that is built to manage workflow execution, known as a workflow engine. In general I prefer to use an existing solution unless it doesn’t solve my problem, so I will now turn to discussing packages for workflow management.

In the next post I will introduce workflow engines in more detail.

Streaming workflows and method chaining

Russ Poldrack — Tue, 10 Mar 2026 15:02:44 GMT

One of the simplest ways to build a workflow is to stream data directly from one command to another, such that the intermediate results are ephemeral since no information about the intermediate states is saved. Such a workflow is linear in the sense that there is a single pathway through the workflow. One common way that this is accomplished is through the use of pipes, which are a syntactic construct that feed the results of one process directly into the next process. Some readers may be familiar with pipes from the UNIX shell, where they are represented by the vertical bar “|”. For example, let’s say that we had a log file that contains the following entries:

2024-01-15 10:23:45 ERROR: Database connection failed
2024-01-15 10:24:12 ERROR: Invalid user input
2024-01-15 10:25:33 ERROR: Database connection failed
2024-01-15 10:26:01 INFO: Request processed
2024-01-15 10:27:15 ERROR: Database connection failed

and that we wanted to generate a summary of errors. We could use the following pipeline:

grep "ERROR" app.log | sed 's/.*ERROR: //' | sort | uniq -c | sort -rn > error_summary.txt

where:

grep "ERROR" app.log extracts lines containing the word “ERROR”
sed 's/.*ERROR: //' replaces everything up to the actual message with an empty string
sort sorts the rows alphabetically
uniq -c counts the number of appearances of each unique error message
sort -rn sorts the rows in reverse numerical order (largest to smallest)
> error_summary.txt redirects the output into a file called error_summary.txt
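
For readers more comfortable with Python than with shell tools, the same summary can be sketched as follows. This is not a replacement for the pipeline, just a way to see what each stage contributes; the log lines are the ones shown above, held in a list rather than read from app.log:

```python
from collections import Counter

log_lines = [
    "2024-01-15 10:23:45 ERROR: Database connection failed",
    "2024-01-15 10:24:12 ERROR: Invalid user input",
    "2024-01-15 10:25:33 ERROR: Database connection failed",
    "2024-01-15 10:26:01 INFO: Request processed",
    "2024-01-15 10:27:15 ERROR: Database connection failed",
]

# grep + sed: keep only the error lines and strip everything up to the message
messages = [
    line.split("ERROR: ", 1)[1] for line in log_lines if "ERROR: " in line
]

# sort | uniq -c | sort -rn: count each message, most frequent first
summary = Counter(messages).most_common()

for message, count in summary:
    print(count, message)
```

Unlike the shell pipeline, this version holds the intermediate list of messages in a variable, which makes it easier to inspect but is no longer a pure stream.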

Pipes are also commonly used in the R community, where they are a fundamental component of the tidyverse ecosystem of packages.

Method chaining

One way that streaming workflows can be built in Python is using method chaining, where each method returns an object on which the next method is called; this is slightly different from the operation of UNIX pipes, where it is the output of each command that is being passed through the pipe rather than an entire object. This is commonly used to perform data transformations in pandas, as it allows composing multiple transformations into a single command. As an example, we will work with the Eisenberg et al. (2019) dataset that we used in a previous chapter, to compute the probability of having ever been arrested separately for males and females in the sample. To do this we need to perform a number of operations:

drop any observations that have missing values for the Sex or ArrestedChargedLifeCount variables
replace the numeric values in the Sex variable with text labels
create a new variable called EverArrested that binarizes the counts in the ArrestedChargedLifeCount variable
group the data by the Sex variable
select the column that we want to compute the mean of (EverArrested)
compute the mean by group

We can do this in a single command using method chaining in pandas. It’s useful to format the code in a way that makes the pipeline steps explicit, by putting parentheses around the operation; in Python, any commands within parentheses are implicitly treated as a single line, which can be useful for making complex code more readable:

arrest_stats_by_sex = (df
    .dropna(subset=['Sex', 'ArrestedChargedLifeCount'])
    .replace({'Sex': {0: 'Male', 1: 'Female'}})
    .assign(EverArrested=lambda x: (
           x['ArrestedChargedLifeCount'] > 0).astype(int))
    .groupby('Sex')
    ['EverArrested']
    .mean()
)
print(arrest_stats_by_sex)

Sex
Female    0.156489
Male      0.274131
Name: EverArrested, dtype: float64

Note that pandas data frames also include an explicit .pipe method that allows using arbitrary functions within a pipeline.
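
Here is a minimal sketch of how `.pipe` might be used. The data frame is made up, and `add_ever_arrested` is a hypothetical helper rather than anything from the analysis above:

```python
import pandas as pd

# A small made-up data frame standing in for the real dataset
df = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 1],
    "ArrestedChargedLifeCount": [2, 0, 1, 0, 0],
})


def add_ever_arrested(data, count_col="ArrestedChargedLifeCount"):
    """Hypothetical helper: binarize a count column into EverArrested."""
    return data.assign(EverArrested=(data[count_col] > 0).astype(int))


arrest_stats = (df
    .replace({"Sex": {0: "Male", 1: "Female"}})
    .pipe(add_ever_arrested)  # call our own function as a step in the chain
    .groupby("Sex")
    ["EverArrested"]
    .mean()
)
print(arrest_stats)
```

The advantage of `.pipe` over an inline lambda is that the helper function has a name and can be tested on its own, outside of the chain.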

While these kinds of streaming workflows can be useful for simple data processing operations, they can become very difficult to debug, so I would generally avoid using complex functions within a method chain.

Scientific workflow management

Russ Poldrack — Tue, 03 Mar 2026 16:00:43 GMT

In most parts of science today, the processing and analysis of data comprise many different steps. We will refer to such a set of steps as a computational workflow; while there are certainly many types of non-computational workflows in science, we will focus here on computational workflows. If you have been doing science for very long, you have very likely encountered a mega-script that implements such a workflow. This is a script that may be hundreds or even thousands of lines long that runs a single workflow from start to end. Often these scripts are handed down to new trainees over generations, such that users become afraid to make any changes lest the entire house of cards comes crashing down. I think that most of us can agree that this is not an optimal workflow, and in this chapter I will discuss in detail how to move from a mega-script to a workflow that will meet all of the requirements to provide robust and reliable answers to our scientific questions.

What do we want from a scientific workflow

First let’s ask: What do we want from a computational scientific workflow? Here are some of the factors that I think are important. First, we care about the correctness of the workflow, which includes the following factors:

Validity: The workflow includes validation procedures to ensure against known problems or edge cases.
Reproducibility: The workflow can be rerun from scratch on the same data and get the same answer, at least within the limits of uncontrollable factors such as floating point imprecision and operating system differences.
Robustness: When there is a problem, the workflow fails quickly with explicit error messages, or degrades gracefully when possible.

Second, we care about the usability of the workflow. Factors related to usability include:

Configurability: The workflow uses smart defaults, but allows the user to easily change the configuration.
Portability: We would like for the workflow to be easily runnable across multiple systems.
Parameterizability: Multiple runs of the workflow can be executed with different parameters, and the separate outputs can be tracked.
Standards compliance: The workflow leverages common standards to easily read in data and generates output using community standards for file formats and organization when available.

Third, we care about the engineering quality of the code, which includes:

Maintainability: The workflow is structured and documented so that others (including your future self) can easily maintain, update, and extend it in the future.
Modularity: The workflow is composed of a set of independently testable modules, which can be swapped in or out relatively easily.
Idempotency: This term from computer science means that the result of the workflow doesn’t change if it is re-run.
Traceability: All operations are logged, and provenance information is stored for outputs.

Finally, we care about the efficiency of the workflow implementation. This includes:

Incremental execution: The workflow only reruns a module if necessary, such as when an input changes.
Cached computation: The workflow pre-computes and reuses results from expensive operations when possible.
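
The idea behind cached computation can be sketched by keying stored results on a hash of the inputs. This is a toy in-memory illustration with hypothetical function names; a workflow engine would typically store results on disk and also track the code version:

```python
import hashlib
import json

_cache = {}
call_count = 0


def expensive_step(values, scale):
    """Stand-in for a slow computation; counts how often it really runs."""
    global call_count
    call_count += 1
    return [v * scale for v in values]


def cached_step(values, scale):
    """Run expensive_step only if these exact inputs have not been seen."""
    key = hashlib.sha256(
        json.dumps({"values": values, "scale": scale}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_step(values, scale)
    return _cache[key]


first = cached_step([1, 2, 3], scale=2)
second = cached_step([1, 2, 3], scale=2)  # identical inputs: served from cache
third = cached_step([1, 2, 3], scale=3)   # changed parameter: recomputed
print(first, second, third, call_count)
```

The same mechanism supports incremental execution: a step is rerun only when the hash of its inputs changes.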

It’s worth noting that these different desiderata will sometimes conflict with one another (such as configurability versus maintainability), and that no workflow will be perfect. For example, a highly configurable workflow will often be more difficult to maintain.

FAIR-inspired practices for workflows

In the earlier chapter on Data Management I discussed the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data. Since those principles were proposed in 2016 they have been extended to many other types of research objects, including workflows (Wilkinson et al., 2025). The reader who is not an informatician is unfortunately likely to quickly glaze over when reading these articles, as they rely on concepts and jargon that will be unfamiliar to most scientists.

Realizing that most scientists are unlikely to go to the lengths of a fully FAIR workflow, and preferring that the perfect never be the enemy of the good, I think that we can take an “80/20” approach, meaning that we can get 80% of the benefits for about 20% of the effort. We can adhere to the spirit of the FAIR Workflows principle by adopting the following principles, based in part on the “Ten Quick Tips for Building FAIR Workflows” presented by de Visser et al. (2023):

Metadata: Provide sufficient metadata in a standard machine-readable format to make the workflow findable once it is shared.
Version control: All workflow code should be kept under version control and hosted on a public repository such as Github.
Documentation: Workflows should be well documented. Documentation should focus primarily on the scientific motivation and technical design of the workflow, along with instructions on how to run it and description of the outputs.
Standard organization schemes: Both the workflow files (code and configuration) and data files should follow established standards for organization.
Standard file formats: The inputs and outputs to the workflow should use established standard file formats rather than inventing new formats.
Configurability: The workflow should be easily configurable, and example configuration files should be included in the repository.
Requirements: The requirements for the workflow should be clearly specified, either in a file (such as pyproject.toml or requirements.txt) or in a container configuration file (such as a Dockerfile).
Clear workflow structure: The workflow structure should be easily understandable.

There are certainly some contexts where a more formal structure adhering in detail to the FAIR Workflows standard may be required, as in large collaborative projects with specific compliance objectives, but these rough guidelines should get a researcher most of the way there.

In the next post I will move on to discussing workflow patterns.

Archiving research data

Russ Poldrack — Tue, 24 Feb 2026 16:01:24 GMT

At the end of a project the data may seem like they are no longer needed, but in many cases there are reasons to retain the data beyond the end of the project. Funding agencies often have a required data retention period beyond the end of the grant. For example, the US National Institutes of Health (NIH) requires that records be retained for three years beyond the end of the funding. Some universities also have their own data retention requirements; for example, my institution (Stanford University) also has a three-year data retention requirement, whereas Johns Hopkins University has a five-year retention requirement. In my opinion it is preferable to retain data, at least in archival form, as long as possible. I have received requests to share data more than 15 years after the original study completion, and it was only due to long-term retention of these data that we were able to honor these requests.

Archiving of research data can take several forms:

Physical hard drive: Datasets up to several terabytes can be stored on a physical hard drive kept in a secure and safe location. This is not an optimal storage method, primarily because physical hard drives can fail over time. If one insists on using physical hard drives, then I would suggest placing the data on two different hard drives (preferably different models, to diversify vulnerability to hardware issues), and that those drives are stored in different locations to avoid risk of destruction in a disaster such as a flood or fire.
Archival storage media: In the past it was common to store data onto media such as writable DVDs or cartridge disks. A major problem with this kind of archiving is the inevitable obsolescence of storage media formats: When was the last time you saw a computer with a DVD reader, much less a Zip drive (which was the common medium when I was a postdoc)? This method should generally be avoided, except as a redundant backup to another storage form.
Cloud storage: Increasingly, commercial cloud storage providers such as Dropbox or Google Drive are used as archival storage locations. However, these providers do not guarantee long-term availability of the data, and do not offer verification that would allow one to ensure that the data haven’t been silently corrupted. I generally use these tools as an additional redundant store for much of my older data, but I would not want to rely upon them as my sole archival storage.

As we will discuss in more detail in our later chapter on sharing of research objects, it is generally preferable to archive data in a location that has a long-term preservation policy and verifiability. This can include institutional repositories (usually run by librarians, who have deep expertise in archiving), general purpose repositories (like OSF or Zenodo), or domain-specific repositories.
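Whatever the storage location, one simple way to approximate verifiability on one's own is to store a manifest of checksums alongside the archived files and recompute them later. The following is a minimal sketch using only the Python standard library; the function names and the choice of SHA-256 are assumptions made for illustration, not a description of how any particular repository verifies its holdings.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    data_dir = Path(data_dir)
    return {
        str(p.relative_to(data_dir)): sha256_of_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(data_dir, manifest):
    """Return the files that are missing or whose contents have changed."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

The manifest would be built once at archiving time (and ideally stored in more than one place); an empty list from `verify_manifest` at a later date indicates that every archived file still matches its recorded checksum.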

This is the last section of the data management chapter. In the next post I will move on to the next chapter, focused on workflows and workflow management.

Version control for data

Russ Poldrack — Tue, 17 Feb 2026 16:00:33 GMT

In the case of original data we never want to allow any changes, but for derived data we will often end up making changes to our workflows that result in changes in the data. As an example, let’s say that we are analyzing RNA-sequencing data, and we receive a notice that a bug was found in the specific version of STAR that we had used for sequence alignment. We would like to be able to track these changes, so that we know which data we are working with at any point in time. In many laboratories, this is achieved through file naming, resulting in a menagerie of files with names like dataset_new_fixed_v2.tsv. This can make it difficult to determine exactly which data were used in any analysis. In Chapter 2 we discussed the many reasons why we use version control for code, and many of those apply to data as well. In the case of data, it is particularly important to be able to track the what, when, and why of any changes to the data, which is exactly the purpose of version control systems.

Using git for data version control

When the relevant data are small (e.g., smaller than a few megabytes) and stored in a text format (such as CSV/TSV), one can simply use git to track changes in the data. (We will discuss in a later chapter why Github is not an optimal platform for sharing data, at least not on its own.)

However, git does not work well for version control on larger datasets using binary data files. Git is able to efficiently store version information about code because it tracks the specific differences in the code between versions (known as a diff), and only stores the differences. Thus, if one has a very large code file and changes one line, only that one line difference is stored in the git database. However, with binary data this strategy is not effective, and git has to store the entire new dataset each time, leading to bloated repositories and very slow performance.
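To make the idea of a diff concrete, here is a small sketch using Python's standard `difflib` module. It only illustrates a line-level diff on text; it is not how git itself stores objects internally (git's storage also involves compression and packing), and the file contents are made up for the example.

```python
import difflib

# A text "file" with 1000 lines, and a new version with one line changed
old_version = [f"row {i}\n" for i in range(1000)]
new_version = list(old_version)
new_version[500] = "row 500 (corrected)\n"

# n=0 drops the unchanged context lines so that only the changes remain
diff = list(difflib.unified_diff(old_version, new_version, n=0))

# Keep the added/removed lines, skipping the '---'/'+++' file headers
changed = [
    line for line in diff
    if line[0] in "+-" and not line.startswith(("+++", "---"))
]
print(len(old_version), "lines in the file;", len(changed), "lines in the diff")
```

The diff contains only the removed and added versions of the one edited line. A compressed or binary file offers no comparable line structure, and a small change to the underlying data can alter bytes throughout the file.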

Using DataLad for version control on larger datasets

A solution to this problem is to use a version control tool that is specifically designed for large data. There are several tools that address this problem; we will focus on DataLad, which is a data management system that functions very similarly to git. It is based on a tool called git-annex, but provides much greater ease of use for researchers. (Full disclosure: Our group collaborates with the DataLad group and our grants have supported some of their development work.)

An important note: DataLad is quite powerful but has a significant learning curve, and takes a bit of time to get accustomed to. In particular, its use of symbolic links can sometimes confuse new users. Having said that, let’s look at some simple examples.

Creating a local DataLad dataset

Let’s say that we want to create a new dataset on our local computer that will be tracked by DataLad. We first create a new repository:

➤  datalad create -d . my_datalad_repo

add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): my_datalad_repo (dataset)

This creates a new directory called my_datalad_repo and sets it up as a DataLad dataset. We then go into the directory and create a subdirectory called data, and then download some data files from another project. We do this using the datalad download-url function, which will both download the data and save them to the DataLad dataset:

➤  datalad download-url -d . -O my_datalad_repo/data/ https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv

[INFO   ] Downloading ‘https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv’ into ‘/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/’
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/demographics.csv (file)
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)


➤  datalad download-url -d . -O my_datalad_repo/data/ https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv

[INFO   ] Downloading ‘https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv’ into ‘/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/’
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/meaningful_variables_clean.csv (file)
add(ok): data/meaningful_variables_clean.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

A DataLad dataset is also a git repository, which we can see if we use the `git log` command:

➤  git log

commit 948cc31262fcddda3bfc56b222687710861c57d1 (HEAD -> text/datamgmt-Nov3)
Author: Russell Poldrack 
Date:   Mon Dec 15 13:40:52 2025 -0800

    [DATALAD] Download URLs

    URLs:
      https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv

commit 9b4b8b29e08a21974dc52e3026405b878078f07b
Author: Russell Poldrack 
Date:   Mon Dec 15 13:40:29 2025 -0800

    [DATALAD] Download URLs

    URLs:
      https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv

Here we see the commit messages that were automatically created by DataLad, first for creating the new dataset and then for downloading the URLs. The datalad download-url function adds the URL to the log, which is useful for provenance tracking. If one wishes to download a large number of files, there is also a datalad addurls command that can download multiple files based on a single text file (TSV, JSON, etc.) containing the relevant URLs and information.

Modifying files

Now let’s say that we want to make a change to one of the files and save the changes to the dataset. Files tracked by DataLad are read-only (“locked”) by default. If we want to edit them, then we need to use `datalad unlock` to unlock the file:

➤  datalad unlock my_datalad_repo/data/demographics.csv

unlock(ok): my_datalad_repo/data/demographics.csv (file)

We then use a Python script to make the change, which in this case is removing some columns from the dataset:

➤  python src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
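The contents of modify_data.py are not shown here. As a rough sketch of what a column-removing script of this kind could look like, the following uses only the standard `csv` module; the function name and the column names in `COLUMNS_TO_DROP` are hypothetical, chosen purely for illustration.

```python
import csv

# Hypothetical column names; the columns removed by the real script are not shown
COLUMNS_TO_DROP = {"HeightInches", "WeightPounds"}


def drop_columns(path, columns_to_drop=COLUMNS_TO_DROP):
    """Rewrite the CSV file at `path` in place without the named columns."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        kept = [name for name in reader.fieldnames if name not in columns_to_drop]
        rows = [{name: row[name] for name in kept} for row in reader]
    # Overwrite the original file, which is why it must be unlocked first
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=kept)
        writer.writeheader()
        writer.writerows(rows)
    return kept
```

A command-line version would call `drop_columns(sys.argv[1])`. Because the file is overwritten in place, running such a script on a locked file would fail, which is the reason for the unlock step above.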

We can now use datalad status to see that the file has been modified:

➤  datalad status

 modified: my_datalad_repo (dataset)

And we can then save it using datalad save:

➤  datalad save -d . -m "Modified demographics.csv" my_datalad_repo/data/demographics.csv

add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

DataLad doesn’t have a staging area like git does, so there is no need to first add and then commit the file; datalad save is equivalent to adding and then committing the changes. If we then check the status we see that there are no changes waiting to be saved:

➤  datalad status

nothing to save, working tree clean

Using `datalad run`

Although the previous example was meant to provide background on how DataLad works, in practice there is actually a much easier way to accomplish these steps, which is by using the datalad run command. This command will automatically take care of fetching and unlocking the relevant files, running the command, and then committing the files back in, generating a commit message that tracks the specific command that was used:

➤  datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
[INFO   ] Making sure inputs are available (this may take some time)
unlock(ok): my_datalad_repo/data/demographics.csv (file)
[INFO   ] == Command start (output follows) =====
      Built bettercodebetterscience @ file:///Users/poldrack/Dropbox/code/BetterCode
Uninstalled 1 package in 1ms
Installed 1 package in 1ms
[INFO   ] == Command exit (modification check follows) =====
run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/BetterCodeBetterScience/modif...]
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

# show the most recent commit
➤  git log -1
commit 3ef3b94a0abffec6a8db7570a97339f48ee728ed (HEAD -> text/datamgmt-Nov3)
Author: Russell Poldrack 
Date:   Mon Dec 15 13:28:06 2025 -0800

    [DATALAD RUNCMD] uv run src/BetterCodeBetterScience/modif...

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "my_datalad_repo/data/demographics.csv"
     ],
     "outputs": [
      "my_datalad_repo/data/demographics.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

If one uses DataLad for data versioning then the datalad run command can be very helpful for running commands on those data.

Pushing data to a remote repository

DataLad is a particularly powerful tool for sharing data across systems. It allows one to push or pull data from a number of different remote storage systems; in this example we will use the Open Science Framework (OSF) as our storage location, because it is particularly easy to use with DataLad.

We first need to install and set up the datalad-osf Python package, per the DataLad documentation. We also need to create an account on the OSF site, and obtain a Personal Access Token for login. We can then use DataLad to authenticate with OSF:

➤  datalad osf-credentials                                                
You need to authenticate with ‘https://osf.io’ credentials. https://osf.io/settings/tokens provides information on how to gain access
token:
osf_credentials(ok): [authenticated as Russell Poldrack ]

Having authenticated with OSF, we can now create a new OSF project using DataLad:

➤  datalad create-sibling-osf --title datalad-test-project -s osf
create-sibling-osf(ok): https://osf.io/htprk/
[INFO   ] Configure additional publication dependency on “osf-storage”
configure-sibling(ok): . (sibling)

Once the project is created, we can push the contents of our dataset to our OSF project:

➤  datalad push --to osf
copy(ok): data/demographics.csv (file) [to osf-storage...]
copy(ok): data/meaningful_variables_clean.csv (file) [to osf-storage...]
publish(ok): . (dataset) [refs/heads/main->osf:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]
action summary:
  copy (ok: 2)
  publish (ok: 2)

These data now exist on OSF, and can be cloned to our local machine using datalad clone:

➤  datalad clone osf://htprk/
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore
install(ok): /Users/poldrack/Downloads/htprk (dataset)

➤  tree htprk
htprk
└── data
    ├── demographics.csv -> ../.git/annex/objects/f7/Mm/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv
    └── meaningful_variables_clean.csv -> ../.git/annex/objects/J5/X6/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv

Notice that the files in the cloned dataset directory are actually symbolic links; the actual file contents are not downloaded when the dataset is cloned. We can see this if we try to look at the size of the datafile:

➤  wc data/demographics.csv
wc: data/demographics.csv: open: No such file or directory

To actually download the file contents, we can use `datalad get`, after which we will see that the file contents are available:

➤  datalad get .
get(ok): data/demographics.csv (file) [from web...]
get(ok): data/meaningful_variables_clean.csv (file) [from web...]
action summary:
  get (ok: 2)

➤  wc data/demographics.csv
     523    1276   58237 data/demographics.csv
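The behavior seen above (a link that exists while its contents do not) can be illustrated with a toy content-addressed store. This sketch is only meant to convey the idea; it is not git-annex's actual implementation, the function names are hypothetical, and the key format simply mimics the file names visible in the tree listing above. It assumes a system that supports symbolic links.

```python
import hashlib
import shutil
from pathlib import Path


def annex_file(path, store):
    """Move a file into a content-addressed store and leave a symlink behind."""
    path, store = Path(path), Path(store)
    content = path.read_bytes()
    # Key built from the file size and MD5 checksum, mimicking the names above
    key = f"MD5E-s{len(content)}--{hashlib.md5(content).hexdigest()}{path.suffix}"
    store.mkdir(parents=True, exist_ok=True)
    target = store / key
    shutil.move(str(path), str(target))
    path.symlink_to(target.resolve())
    return target


def content_is_present(path):
    """Return False for a dangling link, as in a freshly cloned dataset."""
    return Path(path).exists()  # exists() follows symbolic links
```

After `annex_file`, the original path is a link that can be read as usual. If the target in the store is absent (as it is after a clone, before the contents are fetched), the link remains but `content_is_present` returns False, which corresponds to the "No such file or directory" error from `wc`.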

One can also push data using DataLad to a range of other remote hosts; see the DataLad documentation for more on this.

In the next post I will complete the data management chapter with a discussion of archiving data.

Handling sensitive data

Russ Poldrack — Tue, 10 Feb 2026 16:02:12 GMT

Handling of sensitive data

Researchers in some fields, particularly those who work with data obtained from human subjects, often handle data that are sensitive, meaning that they may require a higher degree of security and/or additional procedures to protect the privacy and confidentiality of the research subjects.

Data security

Sensitive data often require additional protections from potential breach. The minimum requirement is generally that the data are housed on an encrypted file system and any transfers are made via an encrypted channel, and that access to the system is controlled. Some datasets include more stringent security measures in their Data Use Agreement. For example, the Adolescent Brain Cognitive Development (ABCD) study, a widely used dataset on brain and cognitive development, requires that any systems used to house or process the data must meet a specific standard for sensitive information known as NIST SP 800-171. This standard comprises 17 “families” of security requirements that a system must meet to be compliant:

Access Control
Maintenance
Security Assessment and Monitoring
Awareness and Training
Media Protection
System and Communications Protection
Audit and Accountability
Personnel Security
System and Information Integrity
Configuration Management
Physical Protection
Planning
Identification and Authentication
Risk Assessment
System and Services Acquisition
Incident Response
Supply Chain Risk Management

In general this level of security certification will be limited to computer systems run by an organizational IT group rather than by an individual investigator, due to the stringency of the requirements.

Deidentification

Deidentification generally involves the removal of specific identifying information that could potentially be used to reidentify a human subject. In the US, this generally relies upon the Safe Harbor provision in the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which states the following criteria for rendering a dataset deidentified:

(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:

(A) Names
(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voice prints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section; and

(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

In the US, deidentification of data is generally sufficient to render them non-sensitive, whereas this is generally not the case in European countries covered by the General Data Protection Regulation (GDPR).
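A few of the Safe Harbor rules listed above translate directly into code. The following is a minimal sketch covering only a handful of the identifier categories (direct identifiers, ZIP codes, dates, and ages over 89); the column names are hypothetical, and the set of low-population ZIP prefixes is assumed to be supplied by the user from Census data. It is an illustration of the logic, not a tool for establishing compliance.

```python
# Hypothetical column names for a few of the direct identifiers listed above
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn", "medical_record_number"}


def deidentify_record(record, low_population_zip3=frozenset()):
    """Apply a few Safe Harbor-style transformations to one record (a dict)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:
        # Keep only the first three digits, or 000 for low-population areas
        zip3 = str(out.pop("zip"))[:3]
        out["zip3"] = "000" if zip3 in low_population_zip3 else zip3
    if "birth_date" in out:
        # Keep only the year of an ISO-formatted date (YYYY-MM-DD)
        out["birth_year"] = str(out.pop("birth_date"))[:4]
    if "age" in out and out["age"] > 89:
        # Aggregate ages over 89 and drop the year, which would reveal the age
        out["age"] = "90+"
        out.pop("birth_year", None)
    return out
```

For example, a record with `"zip": "94305"` and `"birth_date": "1980-05-17"` would retain only `"zip3": "943"` and `"birth_year": "1980"`. Criterion (ii) above cannot be automated in this way, since it depends on what the researcher knows about the potential for reidentification.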

Anonymization

Anonymization refers to the modification of data in a way that can essentially guarantee that the subjects cannot be reidentified. For example, one might modify ages so that they are stated in ranges (such as 20-25 years old) instead of a specific year. These methods generally change the data in ways that could potentially affect downstream analyses, and thus many researchers shy away from using anonymized data unless absolutely necessary.
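The age-range example can be sketched in a couple of lines. The function name and the default bin width are assumptions for illustration, and the bins here are non-overlapping (20-24, 25-29, and so on).

```python
def age_to_range(age, width=5):
    """Replace an exact age in years with a range of `width` years."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


print(age_to_range(23))            # 20-24
print(age_to_range(67, width=10))  # 60-69
```

The cost to downstream analysis is visible immediately: once the data are recoded, any analysis that relies on exact age (for example, as a continuous covariate) is no longer possible.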

One method that is often used for large datasets is known as differential privacy, which involves adding noise to analytic results in a way that can provably prevent reidentification. For example, this method is now used by the US Census Bureau to protect individuals. This has the benefit of providing a provable mathematical guarantee of privacy by quantifying the maximum degree of privacy loss given a particular amount of noise added. However, this method may have adverse effects on the data, such as by disparately impacting small sub-populations within a larger dataset.
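A common textbook instance of differential privacy is the Laplace mechanism, in which noise with scale 1/ε is added to a count (ε being the privacy-loss parameter; smaller values mean more noise and more privacy). The sketch below uses only the standard library; the function names are hypothetical, and this is not a description of the specific system used by the Census Bureau.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng=random):
    """Release a noisy count of the values satisfying `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)
```

For a count of a few thousand people, noise on the order of 1/ε is negligible; for a subgroup with only a handful of members, the same noise can be as large as the count itself, which is one way that small sub-populations can be disparately affected.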

In the next post I will talk about version control for data.

Metadata, data documentation, and provenance

Russ Poldrack — Tue, 03 Feb 2026 16:02:16 GMT

Metadata

Metadata refers to “data about data”, and generally is meant to contain the information that is needed to interpret a dataset. In principle, someone who obtains a dataset should be able to understand and reuse the data using only the metadata provided alongside the dataset. There are many different types of metadata that might be associated with a study, and it is usually necessary to decide how comprehensive to be in providing detailed metadata. This will often rely upon the scientific expertise and judgment of the researcher, to determine which particular metadata would be essential for others to usefully interpret and reuse the data.

An important concept in metadata is the ontology. In the context of bioinformatics, an ontology is a structured representation of the entities that exist in a domain (defined by a controlled vocabulary) and the relationships between these entities. One of the best known examples is the Gene Ontology, which represents classes of biological entities including Molecular Functions, Cellular Components, and Biological Processes. As an example, this figure shows a Gene Ontology graph for the entity “node of Ranvier”, which is a component of a neuron (obtained from here).

Ontologies are very useful for specifying metadata, because they allow us to know exactly what a particular entry in the metadata means, and thus allow us to establish links between equivalent entities across datasets. For example, let’s say that a researcher wants to query a database for datasets related to insulin signaling in pancreatic beta cells in Type II diabetes, and that there are three relevant datasets in the database. Without an ontology, each of the teams might use different terms to refer to these cells (such as “pancreatic beta cells”, “insulin-producing cells”, and “islet beta cells”), making it difficult to link the datasets. However, if each of the datasets were to include metadata linked to a specific ontology (in this case, the identifier CL:0000169 from the Cell Ontology, which refers to “type B pancreatic cell”), then it becomes much easier to find and link these datasets. There is at present a broad range of ontologies available for nearly every scientific domain; the BioPortal project provides a tool to search across a wide range of existing ontologies.

Metadata file formats

An important feature of metadata is that it needs to be machine-readable, meaning that it is provided in a structured format that can be automatically parsed by a computer. Common formats are Extensible Markup Language (XML) and JavaScript Object Notation (JSON). JSON is generally simpler and more human-readable, but it doesn’t natively provide the ability to define attributes for particular entries (such as the units of measurement) or link to ontologies. An extension of JSON known as JSON-LD (JSON for Linked Data) provides support for the latter, by allowing links to controlled vocabularies.

For example, let’s say that I wanted to represent information about an author (myself) in JSON, which I might do like this:

{
  "name": "Russell Poldrack",
  "affiliation": "Stanford University",
  "email": "russpold@stanford.edu"
}

Now let’s say that someone else wanted to search across datasets to find researchers from Stanford University. They would have no way of knowing that I used the term “affiliation” as opposed to “organization”, “institution”, or other terms. We could instead represent this using JSON-LD, which is more verbose but allows us to link to a vocabulary (in this case schema.org) that defines these entities by providing a @context tag:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Russell Poldrack",
  "affiliation": {
    "@type": "Organization",
    "name": "Stanford University"
  },
  "email": "russpold@stanford.edu"
}
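Because the vocabulary is declared, a search across records no longer has to guess at key names. The following sketch parses the JSON-LD record above with the standard `json` module and looks for people affiliated with a given organization. The function name is hypothetical, and the plain key matching used here handles only this exact layout; a real application would more likely use a dedicated JSON-LD processor.

```python
import json

record_text = """
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Russell Poldrack",
  "affiliation": {
    "@type": "Organization",
    "name": "Stanford University"
  },
  "email": "russpold@stanford.edu"
}
"""


def people_affiliated_with(records, organization_name):
    """Return names of schema.org Person records affiliated with an organization."""
    matches = []
    for record in records:
        # Only trust the key names if the record declares the shared vocabulary
        if record.get("@context") != "https://schema.org":
            continue
        if record.get("@type") != "Person":
            continue
        affiliation = record.get("affiliation", {})
        if (affiliation.get("@type") == "Organization"
                and affiliation.get("name") == organization_name):
            matches.append(record["name"])
    return matches


records = [json.loads(record_text)]
print(people_affiliated_with(records, "Stanford University"))
```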

Data documentation

While metadata is generally meant to be used by computers, it is also important to provide human-readable documentation for a dataset, so that other researchers (or one’s own self in the future) can understand and reuse the data successfully. There are two forms of documentation that can be important to provide.

Data dictionaries

A data dictionary provides information about each of the variables in a dataset. These are meant to be human readable, though it can often be useful to share them in a machine-readable format (such as JSON) so that they can also be used in programmatic ways. A data dictionary includes information such as:

an understandable description of the variable
the data type (e.g. string, integer, Boolean)
the allowable range of values

For example, a study of immune system function in human participants might include the following in its data dictionary:

| Variable Name | Data Type | Allowable Values | Description |
|---------------|-----------|------------------|-------------|
| age           | Integer   | 0-120            | Age of the participant in years |
| gender        | String    | M, W, O          | Participant’s self-identified gender |
| crp           | Numeric   | 0.1-50.0, -90, -98, -99| C-reactive protein level (mg/L) |
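One benefit of a machine-readable data dictionary is that it can be used to check incoming data automatically. The following sketch encodes the table above as a Python dictionary and validates individual records against it; the layout of `DATA_DICTIONARY` and the function name are assumptions made for illustration rather than an established format.

```python
# Machine-readable version of the table above; this dict layout is an assumption
DATA_DICTIONARY = {
    "age": {"type": int, "min": 0, "max": 120},
    "gender": {"type": str, "allowed": {"M", "W", "O"}},
    "crp": {"type": (int, float), "min": 0.1, "max": 50.0,
            "special_codes": {-90, -98, -99}},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when checking one record."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: unexpected type {type(value).__name__}")
        elif value in spec.get("special_codes", ()):
            continue  # a documented code such as -99 is allowed
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        elif "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems.append(f"{name}: {value!r} is outside {spec['min']}-{spec['max']}")
    return problems
```

A record such as `{"age": 34, "gender": "W", "crp": 2.5}` yields an empty list, whereas an age of 150 or a gender of "X" would each be reported as a problem.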

Codebooks

A codebook is meant to be a more human-friendly description of the content of the dataset, focusing on how the data were collected and coded. It often includes a detailed description of each variable that is meant to help understand and interpret the data. For the example above, the codebook might include the following:

Variable Information

Variable name: crp
Variable label: High-sensitivity C-reactive protein
Variable definition: A quantitative measure of C-reactive protein in blood serum.

Measurement and Coding

Data Type: Numeric (Floating Point, 2 decimal places)
Units of Measurement: mg/L (milligrams per Liter)
Measurement Method: Immunoturbidimetric assay.
Instrument: Roche Cobas c702 clinical chemistry analyzer.
Allowable Range: 0.10 - 50.00
- Note: The lower limit of detection for this assay is 0.10 mg/L.
Values and Codes:
- [Numerical Value]: A continuous value from 0.10 to 50.00 represents the measured concentration in mg/L.
- -90: Value below the lower limit of detection (< 0.10 mg/L).
- -98: Unusable sample (e.g., sample was hemolyzed, insufficient quantity).
- -99: Missing (e.g., sample not collected, participant refused blood draw).

Collection Protocol and Provenance

Specimen Type: Serum from a venous blood sample.
Collection Procedure: Blood was drawn from the antecubital vein into a serum separator tube (SST) after an 8-hour overnight fast. The sample was allowed to clot for 30 minutes, then centrifuged at 1,500 x g for 15 minutes. Serum was aliquoted and stored at -80°C until analysis.
Date of Creation: 2025-11-15
Version: 1.0

It is essential to generate data dictionaries and codebooks upon the generation of the dataset; otherwise important details may be lost.
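The codes documented in the codebook above also show why this documentation matters for analysis: a value of -99 in the crp variable is a label, not a measurement. A minimal sketch of decoding those codes before analysis follows; the function name and the status labels are hypothetical.

```python
import math

# Special codes for crp, as documented in the codebook entry above
CRP_SPECIAL_CODES = {
    -90: "below_detection_limit",
    -98: "unusable_sample",
    -99: "missing",
}


def decode_crp(value):
    """Split a raw crp value into a (measurement, status) pair."""
    if value in CRP_SPECIAL_CODES:
        return math.nan, CRP_SPECIAL_CODES[value]
    return float(value), "measured"


raw = [1.2, -99, 3.4, -90]
measured = [m for m, status in map(decode_crp, raw) if status == "measured"]
print(sum(measured) / len(measured))  # mean of the real measurements only
```

Averaging the raw values directly would give a large negative number, an error that is easy to make when the codebook is missing or is written long after the data were collected.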

Provenance

Provenance refers to particular metadata regarding the history of processes and inputs that give rise to a particular file. Tracking of provenance is essential to ensure that one knows exactly how a particular file was created. This includes:

the origin of original data (such as the instrument used to collect it, or date of collection)
the specific input files that went into creation of the file, for files that are derived data
the specific versions of any software tools that were used to create the file
the specific settings used for the software tools
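As a very lightweight illustration of the items in this list, one could write a small provenance record next to each derived file. The sketch below uses only the standard library; the function name and the structure of the record are assumptions for illustration and do not follow any formal standard.

```python
import hashlib
import platform
from datetime import datetime, timezone


def provenance_record(output_file, input_files, settings):
    """Build a simple provenance record for a derived file."""

    def checksum(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    return {
        "output": {"path": str(output_file), "sha256": checksum(output_file)},
        # The specific input files that went into creating the output
        "inputs": [{"path": str(p), "sha256": checksum(p)} for p in input_files],
        # Versions of the software used; extend with the relevant packages
        "software": {"python": platform.python_version()},
        # The settings used for the processing step
        "settings": settings,
        "created": datetime.now(timezone.utc).isoformat(),
    }
```

The resulting dictionary can be saved with `json.dump` as a sidecar file (for example, alongside the derived dataset), so that the checksums tie the record to the exact versions of the input and output files.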

Tracking of provenance is non-trivial. The World Wide Web Consortium (W3C) has developed a framework called PROV which defines a model for the representation of provenance information. This framework provides an overview of the many features of provenance that one might want to record for information that is shared online. The PROV data model defines three main concepts:

Entities: things that are produced, such as datasets and publications
Activities: processes that involve using, generating, or modifying entities
Agents: People, organizations, or artifacts (such as computers) that are responsible for activities

In addition, the model defines a set of relationships between these concepts, as shown in this figure from the W3C:

This data model highlights the breadth of information that needs to be represented in order to accurately record provenance.
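To give a feel for the three concepts and their relationships, here is a toy representation in plain Python. The relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`, `wasAttributedTo`, `wasDerivedFrom`) come from the PROV model, but the classes, the example identifiers, and the helper function are hypothetical; this is a sketch of the idea rather than a PROV-compliant serialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str


@dataclass(frozen=True)
class Activity:
    id: str


@dataclass(frozen=True)
class Agent:
    id: str


# Hypothetical example: a cleaning step applied to a raw data file
raw = Entity("demographics_raw.csv")
cleaned = Entity("demographics_cleaned.csv")
cleaning = Activity("remove_columns")
researcher = Agent("researcher")

# Each relation is stored as (relation name, subject, object)
relations = [
    ("used", cleaning, raw),
    ("wasGeneratedBy", cleaned, cleaning),
    ("wasAssociatedWith", cleaning, researcher),
    ("wasAttributedTo", cleaned, researcher),
    ("wasDerivedFrom", cleaned, raw),
]


def derived_from(entity, relations):
    """Follow wasDerivedFrom links back to the sources of an entity."""
    sources = []
    for name, subject, obj in relations:
        if name == "wasDerivedFrom" and subject == entity:
            sources.append(obj)
            sources.extend(derived_from(obj, relations))
    return sources


print(derived_from(cleaned, relations))
```

Even this small example requires five relations to describe a single processing step, which gives a sense of how quickly a complete provenance record grows for a realistic workflow.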

There are several different ways to track provenance in practice, which vary in their complexity, comprehensiveness, and ease of use. We will discuss this in much more detail in a later chapter on workflows.

In the next post I will discuss the handling of sensitive data.
