Version control for data
Better Code, Better Science: Chapter 7, Part 10
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC. Thanks to Yaroslav Halchenko and Chris Markiewicz for helpful comments on this section.
In the case of original data we never want to allow any changes, but for derived data we will often make changes to our workflows that result in changes to the data. As an example, say that we are analyzing RNA-sequencing data and receive notice that a bug was found in the specific version of STAR that we had used for sequence alignment; after rerunning the alignment, the derived data will change. We would like to be able to track these changes, so that we know which data we are working with at any point in time. In many laboratories this is achieved through file naming, resulting in a menagerie of files with names like dataset_new_fixed_v2.tsv, which can make it difficult to determine exactly which data were used in any analysis. In Chapter 2 we discussed the many reasons why we use version control for code, and many of those apply to data as well. With data it is particularly important to be able to track the what, when, and why of any changes, which is exactly the purpose of version control systems.
Using git for data version control
When the relevant data are small (e.g., smaller than a few megabytes) and stored in a text format (such as CSV/TSV), one can simply use git to track changes in the data. (We will discuss in a later chapter why GitHub is not an optimal platform for sharing data, at least not on its own.)
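For a small text file, the workflow is exactly the same as for code. Here is a minimal sketch in a throwaway repository; the file name and contents are hypothetical:

```shell
# create a scratch repository and track a small CSV with plain git
cd "$(mktemp -d)"
git init -q demo && cd demo
printf 'id,score\n1,10\n2,20\n' > data.csv
git add data.csv
git -c user.email=demo@example.com -c user.name=Demo commit -qm "add raw data"
# a later correction is stored as a one-line diff, not a full copy
printf 'id,score\n1,10\n2,25\n' > data.csv
git -c user.email=demo@example.com -c user.name=Demo commit -aqm "fix score for subject 2"
# the history shows exactly what changed and when
git log --oneline
git diff HEAD~1 -- data.csv
```

The diff output makes it immediately clear which values changed between versions, which is exactly the provenance information that ad hoc file naming fails to capture.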
However, git does not work well for larger datasets stored as binary files. Git can store version information about code efficiently because it tracks the specific differences between versions (known as a diff), and only stores the differences. Thus, if one has a very large code file and changes one line, only that one-line difference is stored in the git database. With binary data this strategy is not effective: git must store the entire new file each time, leading to bloated repositories and very slow performance.
Using DataLad for version control on larger datasets
A solution to this problem is to use a version control tool that is specifically designed for large data. There are several tools that address this problem; we will focus on DataLad, which is a data management system that functions very similarly to git. It is based on a tool called git-annex, but provides much greater ease of use for researchers. (Full disclosure: Our group collaborates with the DataLad group and our grants have supported some of their development work.)
An important note: DataLad is quite powerful but has a significant learning curve, and takes a bit of time to get accustomed to. In particular, its use of symbolic links can sometimes confuse new users. Having said that, let’s look at some simple examples.
Creating a local DataLad dataset
Let’s say that we want to create a new dataset on our local computer that will be tracked by DataLad. We first create a new repository:
➤ datalad create -d . my_datalad_repo
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): my_datalad_repo (dataset)
This creates a new directory called my_datalad_repo and sets it up as a DataLad dataset. We then create a subdirectory called data and download some data files from another project, using the datalad download-url command, which both downloads the data and saves them to the DataLad dataset:
➤ datalad download-url -d . -O my_datalad_repo/data/ https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv
[INFO ] Downloading 'https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv' into '/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/'
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/demographics.csv (file)
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
➤ datalad download-url -d . -O my_datalad_repo/data/ https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv
[INFO ] Downloading 'https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv' into '/Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/'
download_url(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience/my_datalad_repo/data/meaningful_variables_clean.csv (file)
add(ok): data/meaningful_variables_clean.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

A DataLad dataset is also a git repository, which we can see if we use the `git log` command:
➤ git log
commit 948cc31262fcddda3bfc56b222687710861c57d1 (HEAD -> text/datamgmt-Nov3)
Author: Russell Poldrack <poldrack@gmail.com>
Date: Mon Dec 15 13:40:52 2025 -0800
[DATALAD] Download URLs
URLs:
https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/meaningful_variables_clean.csv
commit 9b4b8b29e08a21974dc52e3026405b878078f07b
Author: Russell Poldrack <poldrack@gmail.com>
Date: Mon Dec 15 13:40:29 2025 -0800
[DATALAD] Download URLs
URLs:
https://raw.githubusercontent.com/IanEisenberg/Self_Regulation_Ontology/refs/heads/master/Data/Complete_02-16-2019/demographics.csv
Here we see the commit messages that DataLad automatically created for each download. The datalad download-url command adds the URL to the log, which is useful for provenance tracking. If one wishes to download a large number of files, there is also a datalad addurls command that can download multiple files based on a single text file (TSV, JSON, etc.) containing the relevant URLs and file names.
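As a sketch of how such a table might look, here is a hypothetical urls.tsv with one row per file; the URLs and paths are made up for illustration. The datalad addurls invocation is shown as a comment, since running it requires an existing dataset and network access:

```shell
cd "$(mktemp -d)"
# build a hypothetical urls.tsv: a header row plus one row per file,
# with columns that the URL and filename templates can refer to
printf 'url\tfilename\n' > urls.tsv
printf 'https://example.org/data/demographics.csv\tdata/demographics.csv\n' >> urls.tsv
printf 'https://example.org/data/surveys.csv\tdata/surveys.csv\n' >> urls.tsv
# inside a DataLad dataset, one would then run:
#   datalad addurls urls.tsv '{url}' '{filename}'
cat urls.tsv
```

Each `{column}` placeholder in the URL and filename templates is filled in from the corresponding column of the table, so a single command can register and download an arbitrary number of files.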
Modifying files
Now let’s say that we want to make a change to one of the files and save the changes to the dataset. Files tracked by DataLad are read-only ("locked") by default. If we want to edit a file, we need to use `datalad unlock` to unlock it:
➤ datalad unlock my_datalad_repo/data/demographics.csv
unlock(ok): my_datalad_repo/data/demographics.csv (file)

We then use a Python script to make the change, which in this case removes some columns from the dataset:
➤ python src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
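The contents of modify_data.py are not shown here; a hypothetical sketch of such a column-dropping script, using only the standard library, might look like this (the column names are assumptions for illustration):

```python
# Hypothetical sketch of a column-dropping script like modify_data.py
# (the real script is not shown in the text). Column names are
# placeholder assumptions.
import csv
import sys

def drop_columns(path, drop=("WeightPounds", "HeightInches")):
    """Rewrite the CSV at `path` in place, removing the given columns."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = [c for c in reader.fieldnames if c not in drop]
        rows = list(reader)
    with open(path, "w", newline="") as f:
        # extrasaction="ignore" silently discards the dropped columns
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__" and len(sys.argv) > 1:
    drop_columns(sys.argv[1])
```

Because the script rewrites the file in place, the file must already be unlocked (as above) or the write will fail on the read-only annexed copy.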
We can now use datalad status to see that the file has been modified:
➤ datalad status
modified: my_datalad_repo (dataset)

We can then save the change using datalad save:
➤ datalad save -d . -m "Modified demographics.csv" my_datalad_repo/data/demographics.csv
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

DataLad doesn’t have a staging area like git does, so there is no need to first add and then commit the file; datalad save is equivalent to adding and then committing the changes. If we then check the status, we see that there are no changes waiting to be saved:
➤ datalad status
nothing to save, working tree clean

Using datalad run
Although the previous example was meant to provide background on how DataLad works, in practice there is actually a much easier way to accomplish these steps, which is by using the datalad run command. This command will automatically take care of fetching and unlocking the relevant files, running the command, and then committing the files back in, generating a commit message that tracks the specific command that was used:
➤ datalad run -i my_datalad_repo/data/demographics.csv -o my_datalad_repo/data/demographics.csv -- uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv
[INFO ] Making sure inputs are available (this may take some time)
unlock(ok): my_datalad_repo/data/demographics.csv (file)
[INFO ] == Command start (output follows) =====
Built bettercodebetterscience @ file:///Users/poldrack/Dropbox/code/BetterCode
Uninstalled 1 package in 1ms
Installed 1 package in 1ms
[INFO ] == Command exit (modification check follows) =====
run(ok): /Users/poldrack/Dropbox/code/BetterCodeBetterScience (dataset) [uv run src/BetterCodeBetterScience/modif...]
add(ok): data/demographics.csv (file)
save(ok): my_datalad_repo (dataset)
add(ok): my_datalad_repo (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
# show the most recent commit
➤ git log -1
commit 3ef3b94a0abffec6a8db7570a97339f48ee728ed (HEAD -> text/datamgmt-Nov3)
Author: Russell Poldrack <poldrack@gmail.com>
Date: Mon Dec 15 13:28:06 2025 -0800
[DATALAD RUNCMD] uv run src/BetterCodeBetterScience/modif...
=== Do not change lines below ===
{
  "chain": [],
  "cmd": "uv run src/BetterCodeBetterScience/modify_data.py my_datalad_repo/data/demographics.csv",
  "exit": 0,
  "extra_inputs": [],
  "inputs": [
    "my_datalad_repo/data/demographics.csv"
  ],
  "outputs": [
    "my_datalad_repo/data/demographics.csv"
  ],
  "pwd": "."
}
^^^ Do not change lines above ^^^
If one uses DataLad for data versioning, the datalad run command can be very helpful for running commands on those data.
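The machine-readable record between the "Do not change" markers is what makes commands re-executable (e.g., via datalad rerun). As a sketch of how that record can be consumed, here is a small parser that extracts the JSON from a commit message; the message text is a shortened, hypothetical version of the output above:

```python
# Sketch: extract the machine-readable run record that datalad run
# embeds in its commit message (marker strings as shown in the log above)
import json

def parse_run_record(commit_message):
    start = "=== Do not change lines below ==="
    end = "^^^ Do not change lines above ^^^"
    # the JSON record sits between the two marker lines
    body = commit_message.split(start)[1].split(end)[0]
    return json.loads(body)

# shortened, hypothetical commit message for illustration
message = """[DATALAD RUNCMD] uv run modify_data.py demographics.csv

=== Do not change lines below ===
{
 "cmd": "uv run modify_data.py demographics.csv",
 "exit": 0,
 "inputs": ["data/demographics.csv"],
 "outputs": ["data/demographics.csv"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""
record = parse_run_record(message)
print(record["cmd"])
```

This is why the markers warn against editing those lines: the record must remain valid JSON for the command to be replayable.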
Pushing data to a remote repository
DataLad is a particularly powerful tool for sharing data across systems. It allows one to push or pull data from a number of different remote storage systems; in this example we will use the Open Science Framework (OSF) as our storage location, because it is particularly easy to use with DataLad.
We first need to install and set up the datalad-osf Python package, per the DataLad documentation. We also need to create an account on the OSF site, and obtain a Personal Access Token for login. We can then use DataLad to authenticate with OSF:
➤ datalad osf-credentials
You need to authenticate with 'https://osf.io' credentials. https://osf.io/settings/tokens provides information on how to gain access
token:
osf_credentials(ok): [authenticated as Russell Poldrack <poldrack@stanford.edu>]

Having authenticated with OSF, we can now create a new OSF project using DataLad:
➤ datalad create-sibling-osf --title datalad-test-project -s osf
create-sibling-osf(ok): https://osf.io/htprk/
[INFO ] Configure additional publication dependency on "osf-storage"
configure-sibling(ok): . (sibling)

Once the project is created, we can push the contents of our dataset to the OSF project:
➤ datalad push --to osf
copy(ok): data/demographics.csv (file) [to osf-storage...]
copy(ok): data/meaningful_variables_clean.csv (file) [to osf-storage...]
publish(ok): . (dataset) [refs/heads/main->osf:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]
action summary:
copy (ok: 2)
publish (ok: 2)

These data now exist on OSF, and can be cloned to any machine using datalad clone:
➤ datalad clone osf://htprk/
[INFO ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore
install(ok): /Users/poldrack/Downloads/htprk (dataset)
➤ tree htprk
htprk
└── data
├── demographics.csv -> ../.git/annex/objects/f7/Mm/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv/MD5E-s58237--dc5b157fb9937eae2166d73ee943c766.csv
└── meaningful_variables_clean.csv -> ../.git/annex/objects/J5/X6/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv/MD5E-s1248729--e4fbac610f1f5e25e04474e55209ef56.csv

Notice that the files in the cloned dataset directory are actually symbolic links; the actual file contents are not downloaded when the dataset is cloned. We can see this if we try to look at the size of the data file:
➤ wc data/demographics.csv
wc: data/demographics.csv: open: No such file or directory

To actually download the file contents, we can use `datalad get`, after which we will see that the file contents are available:
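This behavior can be confusing at first: the file "exists" in a directory listing but cannot be opened. The underlying mechanism is an ordinary dangling symbolic link, which we can check for directly; this is a sketch with hypothetical paths, not a DataLad API:

```python
# Sketch: detect annexed files whose content has not been fetched yet.
# In a freshly cloned dataset these are symbolic links whose targets
# do not exist until `datalad get` retrieves the content.
import os

def content_is_present(path):
    if os.path.islink(path):
        # os.path.exists follows the link, so a dangling symlink
        # (content not yet fetched) returns False
        return os.path.exists(path)
    return os.path.isfile(path)
```

After `datalad get` retrieves the content, the link target exists and the same check returns True.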
➤ datalad get .
get(ok): data/demographics.csv (file) [from web...]
get(ok): data/meaningful_variables_clean.csv (file) [from web...]
action summary:
get (ok: 2)
➤ wc data/demographics.csv
523 1276 58237 data/demographics.csv

One can also use DataLad to push data to a range of other remote hosts; see the DataLad documentation for more on this.
In the next post I will complete the data management chapter with a discussion of archiving data.
