This is a section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
Why better code can lead to better science
One of the main motivations for writing this book is to help make science better. In particular, we want to increase the *reproducibility* of scientific results. But what does *reproducibility* mean? And for that matter, what does it even mean for an activity to count as "science"?
This seemingly simple question turns out to be remarkably difficult to answer in a satisfying way. Although it is easy to give examples of forms of inquiry that we think are scientific (astrophysics, molecular biology) and forms that we think are not (astrology, creationism), defining an "essence" of science has eluded the philosophers of science who have examined this question (known as the "demarcation problem") over the last century. One answer is that there is a set of features that together distinguish scientific from non- or pseudo-scientific enterprises. Some of these have to do with the social characteristics of the enterprise, best described in terms of features that are often found in pseudoscience, such as an overriding belief in authority and a willingness to disregard information that contradicts the theory. But nearly everyone agrees that an essential aspect of science is the ability to reproduce results reported by others. The idea of replication as a sine qua non of science goes back to the 17th century, when Christiaan Huygens built an air pump based on the designs of Robert Boyle and demonstrated a phenomenon called "anomalous suspension" that initially could not be replicated by Boyle, leading Huygens ultimately to travel to London and demonstrate the phenomenon directly (Shapin & Schaffer, 1985). Throughout the development of modern science, the ability of researchers to replicate results reported by other scientists has been a foundational feature of the enterprise.
An example serves to show how well science can work when the stakes are high. In 1989, the chemists Martin Fleischmann and Stanley Pons reported that they had achieved nuclear fusion at temperatures much lower than those usually thought to be required. If true, this would have been a revolutionary new source of energy for the world, so scientists quickly began trying to reproduce the result. Within just a few months, the idea of "cold fusion" was fully discredited; while the New York Times labeled the cold fusion work an example of "pathological science", the entire episode showed how well science can sometimes self-correct.
Cold fusion was a case in which science worked as it should, with other researchers quickly trying (and in this case failing) to reproduce a result. However, there are other cases in which scientific claims have lingered for years, only to be discredited when examined deeply enough. Within psychology, a well-known example comes from the study of "ego depletion", a phenomenon in which exerting self-control in one domain was thought to "deplete" the ability to exert self-control in a different domain. This phenomenon was first reported by Baumeister and colleagues (1994), who found that giving people a difficult mental task to solve made them more likely to eat a cookie on their way out of the laboratory, compared to people who didn't have to solve the task. Hundreds of papers were published on the phenomenon over the subsequent decades, mostly using simpler laboratory tasks that didn't require a kitchen and oven in the laboratory to bake cookies. Nearly all of these subsequently published studies reported finding ego depletion effects. But two large-scale efforts to reproduce the finding, together including data from more than 5,000 participants (Hagger et al., 2016; Vohs et al., 2021), have shown that the effect is so small as to likely be non-existent. Below we will talk more about the reasons that we now think these kinds of irreproducible findings come about.
What does "reproducibility" mean?
There are many different senses of the term "reproducibility", which can cause confusion. A framework that we like for this concept comes from the Turing Way, an outstanding guide to open and reproducible science practices. This framework distinguishes cases according to whether the data and the analysis are the same or different between two analyses:
A schematic of the Turing Way framework for different concepts of reproducibility. Reproduced from The Turing Way under CC-BY.
People sometimes get hung up on these terminological differences, but that's often just a distraction from the central point: We want to ensure that scientific research generates answers to questions that can generalize to a broad range of situations beyond the initial study.
A reproducibility "crisis"
Starting in the early 2010s, scientists became deeply concerned about whether results in their fields were reproducible, focusing primarily on the concept of *replication*; that is, whether another researcher could collect a new dataset using the same method and achieve the same answer using the same analysis approach. While this concern spanned many different domains of science, the field of psychology was most prominent in tackling it head-on. A large consortium banded together in an attempt to replicate the findings from 100 published psychology papers, and the results were startling (Open Science Collaboration, 2015): Whereas 97% of the original studies had reported statistically significant results, only 36% of the replication attempts reported a significant finding. This finding led to a firestorm of criticism and rebuttal, but ultimately other studies have similarly shown that a substantial portion of published psychology findings cannot be replicated, leading to what was termed a "reproducibility crisis" in psychology (Nosek et al., 2022). Similar efforts subsequently uncovered problems in other areas, such as cancer biology (Errington et al., 2021), where it was only possible to complete replications of about a quarter of the intended studies, due to a lack of critical details in the published papers and a lack of cooperation from about a third of the original authors.
As this crisis unfolded, attention turned to the potential causes of such a lack of reproducibility. One major focus was the role of "questionable research practices" (QRPs) - practices that have the potential to decrease the reproducibility of research. A prominent 2011 article titled "False Positive Psychology" (Simmons et al., 2011) showed that commonly used practices within psychology have the potential to substantially inflate the false positive rates of research studies. Given the prominence of statistical hypothesis testing and the bias towards positive and statistically significant results (usually at p < .05) in the psychology literature, these practices were termed "p-hacking". Many of the efforts to improve reproducibility, such as the pre-registration of hypotheses and data analyses, have focused on reducing the prevalence of p-hacking.
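To make this concrete, here is a minimal simulation sketch of one common form of p-hacking: measuring several outcomes and reporting whichever test comes out significant. This is our own illustration (not an analysis from Simmons et al.), and the sample sizes and number of outcome measures are arbitrary choices; it simply shows that even when there is no true effect, the chance of finding at least one "significant" result grows well beyond the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies = 5000    # number of simulated studies
n_per_group = 20    # participants per group
n_outcomes = 5      # outcome measures tested in each study

false_positives = 0
for _ in range(n_studies):
    # True null: both groups are drawn from the same distribution,
    # with several independent outcome measures per participant.
    group_a = rng.normal(size=(n_per_group, n_outcomes))
    group_b = rng.normal(size=(n_per_group, n_outcomes))
    # "p-hack" by testing every outcome and keeping the smallest p-value
    pvals = [stats.ttest_ind(group_a[:, i], group_b[:, i]).pvalue
             for i in range(n_outcomes)]
    if min(pvals) < 0.05:
        false_positives += 1

print(f"Observed false positive rate: {false_positives / n_studies:.3f}")
# Nominal rate is 0.05; with 5 independent outcomes it is typically ~0.2
```

With five independent outcomes the chance of at least one significant test under the null is 1 - 0.95^5, roughly 0.23 rather than the nominal 0.05.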
Open science and reproducibility
There is a well known quote from Jonathan Buckheit and David Donoho (1995), summarizing an insight from the geophysicist Jon Claerbout, that highlights the importance of openly available research objects in science:
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
There was surprisingly little focus on code during the reproducibility crisis, but it is clear that there are problems even with what would seem like the easiest quadrant of the Turing Way framework: namely, the ability to reproduce results given the same data and the same analysis methods. Tom Hardwicke, Michael Frank, and colleagues have examined the ease of reproducing results from psychology papers where the data are openly available, and the results were not encouraging. In one analysis (Hardwicke et al., 2021), they attempted to reproduce the published results from 25 papers with open data. Their initial analyses showed major numerical discrepancies in about two-thirds of the papers; strikingly, for about a quarter of the papers they were unable to reproduce the values reported in the original publication *even with the help of the authors*!
It is increasingly common for researchers to share both code and data from their published research articles, in part due to incentives such as "badges" that are offered by some journals. However, in our experience, it can be very difficult to actually run the shared code, due to various problems that limit the portability of the code. Throughout this book we will discuss the tools and techniques that can help improve the portability of shared code and thus increase the reproducibility of published results.
Bug-hacking
A particular concern is that not all software errors are created equal. Imagine that a graduate student is comparing the performance of a new machine learning method that they developed with their implementation of a previous method. Unbeknownst to them, their implementation of the previous method contains an error. If the error results in poorer performance of the previous method (thus giving their new method the edge), then they are less likely to go looking for a bug than they might be if the error caused performance of the competing method to be inaccurately high. We have referred to this before as "bug-hacking", and this problem is nicely exemplified by a comic strip from PhD Comics.
There are numerous apparent examples of bug-hacking in the literature. One striking example was identified by Mark Styczynski and his colleagues (2008) when they examined a set of substitution matrices known as the BLOSUM family that are commonly used in bioinformatics analyses. A set of these matrices was initially created and shared in 1992 and widely used in the field for 15 years before Styczynski et al. discovered that they were in error. These errors appeared to have a significant impact on results, but interestingly the incorrect matrices actually performed *better* than the correct matrices in terms of the number of errors in biological sequence alignments. It seems highly likely that a bug that had substantially reduced performance would have been identified much earlier.
Another example comes from our own field of neuroimaging. A typical neuroimaging study collects data from hundreds of thousands of three-dimensional volumetric pixels (known as *voxels*) within the brain, and then performs statistical tests at each of those locations. This requires a correction for multiple tests to prevent the statistical error rate from skyrocketing simply due to the large number of tests. There are a number of different correction methods implemented in different software packages, some of which rely upon mathematical theory and others of which rely upon resampling or simulation. One of the commonly used open source software packages, AFNI, provided a tool called *3dClustSim* that used simulation to estimate a statistical correction for multiple comparisons. This tool was commonly used in the neuroimaging literature, even by researchers who otherwise did not use the AFNI software, and the lore developed that 3dClustSim was less conservative than other tools. When Anders Eklund and Tom Nichols (Eklund et al., 2016) analyzed the performance of several different tools for multiple test correction, they identified a bug in the way that 3dClustSim performed a particular rescaling operation, which led in some cases to inflated false positive rates. This bug had existed in the code for 15 years, and was almost certainly being leveraged by researchers to obtain "better" results (i.e. results with more seeming discoveries). Had the bug led to much more conservative results compared to other standard methods, it is likely that users would have complained and the problem would have been investigated; in the event, users did not complain about getting more apparent discoveries in their analyses.
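To see why such corrections matter (leaving aside the specifics of 3dClustSim, which this sketch does not attempt to reproduce), here is a minimal Python illustration of mass-univariate testing under the null; the numbers of voxels and subjects are arbitrary choices of ours:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels = 100_000   # one statistical test per voxel
n_subjects = 30

# Null data: no true effect at any voxel
data = rng.normal(size=(n_subjects, n_voxels))
result = stats.ttest_1samp(data, popmean=0, axis=0)
pvals = result.pvalue

alpha = 0.05
print("Uncorrected 'significant' voxels:", int(np.sum(pvals < alpha)))
# With 100,000 null tests we expect around 5,000 false positives at p < .05

# A simple (and very conservative) Bonferroni correction
print("Bonferroni-corrected voxels:", int(np.sum(pvals < alpha / n_voxels)))
```

Real neuroimaging corrections (including AFNI's simulation-based approach) are far more sophisticated than Bonferroni, because they account for the spatial structure of the data; the point here is only that some correction is unavoidable.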
How not to fool ourselves
In his 1974 commencement address at Caltech, the physicist Richard Feynman famously said "The first principle is that you must not fool yourself — and you are the easiest person to fool." One of the most powerful ways that scientists have developed to prevent us from fooling ourselves is *blinding* - that is, preventing ourselves from seeing or otherwise knowing information that could lead us to be biased towards our own hypotheses. You may be familiar, for example, with the idea of a "double-blind" randomized controlled trial in medical research, in which participants are randomly assigned to a treatment of interest or a control condition (such as a placebo); the "double-blind" aspect of the trial refers to the fact that neither the patient nor the researcher knows who has been assigned to the treatment versus the control condition. Assuming that blinding actually works (which can fail, for example, if the treatment has strong side effects), this can give results that are at much lower risk of bias compared to a trial in which the physician or patient knows which condition they are in. In physics, researchers will regularly relabel or otherwise modify the data to prevent the analyst from knowing whether they are working with the real data or some altered version. This kind of blinding helps researchers avoid fooling themselves.
A major concern in the development of software for data analysis is that the researcher will make choices that are data-dependent. Andrew Gelman and Eric Loken (2019) referred to a "garden of forking paths", in which the researcher makes seemingly innocuous data-driven decisions about the methods to apply for analysis, resulting in an injection of bias into the analysis. One commonly recommended solution for this is "pre-registration", in which the methods to be applied to the data are pre-specified before any contact is made with the data. There are several platforms (including the Open Science Framework, ClinicalTrials.gov, and AsPredicted.org, depending on the type of research) that can be used to pre-register analysis plans and code prior to their application to real data. Pre-registration has been used in medical research for more than two decades, and its introduction was associated with a substantial reduction in the prevalence of positive outcomes in clinical trials, presumably reflecting the reduction in bias (Kaplan & Irvin, 2015). However, pre-registration can be challenging when the data are complex and the analytic methods are not clear from the outset. How can a researcher avoid bias while still making sure that the analyses are optimal for their data? There are several possible solutions, whose applicability will depend upon the specific features of the data in question.
One solution is to set aside a portion of the data (which we call the "discovery" dataset) for code development, holding aside a "validation" dataset that remains locked away until the analysis code is finalized (and preferably pre-registered). This allows the researcher to use the discovery dataset to develop the analysis code, ensuring that the code is well matched to the features of the dataset. As long as contact with the validation dataset is scrupulously avoided during the discovery phase, this approach can prevent analyses of the validation dataset from being biased by the specific features of those data. The main challenge of this approach arises when the dataset is not large enough to split into two parts. One adaptation is to use pilot data, or data that were discarded in the initial phase of data cleaning (e.g. due to data quality issues), as the discovery sample, recognizing that these data will likely differ in systematic ways from the validation set.
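Here is a minimal sketch of what such a split might look like in Python; the dataset, file names, split fraction, and random seed are all arbitrary choices for illustration rather than a recommended recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; in practice this would be loaded from your own files
rng = np.random.default_rng(2024)
data = pd.DataFrame({
    "subject_id": range(200),
    "score": rng.normal(size=200),
})

# Randomly assign a fraction of participants to the discovery set and
# hold the remainder out as the validation set
discovery_frac = 0.3
shuffled = data.sample(frac=1, random_state=2024)
n_discovery = int(len(shuffled) * discovery_frac)
discovery = shuffled.iloc[:n_discovery]
validation = shuffled.iloc[n_discovery:]

# All code development happens on discovery.csv; validation.csv stays
# untouched until the analysis plan is finalized (and ideally pre-registered)
discovery.to_csv("discovery.csv", index=False)
validation.to_csv("validation.csv", index=False)
```

Fixing the random seed and saving the two sets to separate files makes the split itself reproducible and makes it easy to keep the validation file out of reach during code development.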
Beyond reproducibility: Getting valid answers
Our discussion so far has focused on reproducibility, but it is important to point out that a result can be completely reproducible yet wrong. A degenerate example is a data analysis program that always outputs zeros for any analysis regardless of the data; it will be perfectly reproducible, with the same data or different data, yet also perfectly wrong! This distinction goes by various names; we will adopt the terminology of "reliability" versus "validity" that is commonly used in many fields including psychology. A measurement is considered to be *reliable* if repeated measurements of the same type give similar answers; a measurement is perfectly reliable if it gives exactly the same answer each time it is performed, and increasing measurement error leads to lower reliability. On the other hand, a measurement is considered to be *valid* if it accurately indexes the underlying feature that it is intended to measure; that is, the measure is *unbiased* with respect to the ground truth.
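A tiny simulation (our own illustration, with arbitrary values) makes the distinction concrete: one measure is highly reliable but biased, while the other is unbiased but noisy.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 10.0
n_measurements = 10_000

# Reliable but invalid: very little noise, but consistently biased upward
reliable_biased = rng.normal(loc=true_value + 2.0, scale=0.1,
                             size=n_measurements)

# Valid but less reliable: unbiased, but noisy
valid_noisy = rng.normal(loc=true_value, scale=2.0, size=n_measurements)

print(f"Reliable but biased: mean={reliable_biased.mean():.2f}, "
      f"sd={reliable_biased.std():.2f}")
print(f"Valid but noisy:     mean={valid_noisy.mean():.2f}, "
      f"sd={valid_noisy.std():.2f}")
# The first measure gives nearly identical answers every time (high
# reliability) yet is consistently off from the true value of 10 (low
# validity); the second scatters widely but is centered on the truth.
```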
The distinction between reliability and validity implies that we can't simply focus on making our analyses reproducible; we also need to make sure that they reproducibly give a valid answer. In a later chapter we will talk in much more detail about how to validate computational analyses using simulated data.
Guiding principles for this book
The material that we will present in this book reflects a set of guiding principles:
- Scientific research increasingly relies upon code written by researchers with the help of AI agents. Improving the quality of research code is a direct way to enhance the reliability of scientific research.
- Scientists bear the ultimate responsibility for ensuring that their code provides answers that are both reliable and valid. Fortunately, there are many software development tools that can help in this endeavor.
- Scientists have a responsibility to make their work as open and reproducible as possible. Open source software and platforms for open sharing of research objects including data and code are essential to making this happen.