Storing research data
Better Code, Better Science: Chapter 7, Part 2
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the Github repository is here. This material is released under CC-BY-NC.
Storing data
There are several different ways that one can store research data, which vary in their ease of use, speed, reliability, and resilience. One major distinction is between file systems (whether physical or cloud-based) and database systems.
Before discussing different options, it is useful to lay out the important considerations regarding different data storage solutions. These are the dimensions across which different options will vary:
Ease of use: How much extra work is required for the user to implement the storage solution?
Collaboration: Do multiple researchers need to access the data? Do they need to be able to modify the dataset concurrently?
Storage capacity: Is the solution sufficient to store the relevant amount of data for the study? Is it scalable over time?
Performance: Is the solution fast enough to enable the required processing and analysis steps?
Accessibility: Is the storage system accessible to the system where the compute will be performed (e.g. local computer, HPC cluster, cloud system)?
Security: Does the system meet the security and compliance requirements for the particular dataset? Does it allow appropriate access control?
Redundancy: Is the system robust to disasters, ranging from the failure of one hard drive to a catastrophic flood or fire? Does it provide the required backup capability?
Cost: Does the cost of the solution fit within the researcher’s budget? Are there hidden costs that must be taken into account?
Longevity: Will the data remain available in the long term?
It’s also important to point out that most projects end up using multiple storage solutions for different portions of the data lifecycle.
File system storage
A file system is an organized system for naming and locating computer files on a storage system such as a hard disk. Readers of this book will undoubtedly be familiar with the file systems present on Mac, Windows, or UNIX/Linux systems, which represent a hierarchical tree of folders and files. Here is an example of the file tree for the source code folder in the book project:
➤ tree -L 2 src
src
└── BetterCodeBetterScience
├── __init__.py
├── __pycache__
├── bug_driven_testing.py
├── claudecode
├── constants.py
├── data_management.ipynb
├── distance.py
├── distance_testing
├── docker-example
├── escape_velocity.py
├── formatting_example.py
├── formatting_example_ai.py
├── formatting_example_ruff.py
├── incontext_learning_example.ipynb
├── language_model_api_prompting.ipynb
├── llm_utils.py
├── modify_data.py
├── my_linear_regression.py
├── simple_testing.py
├── simpleScaler.py
├── test_independence.py
    └── textmining

We often use spatial metaphors to describe file systems; we say that a file is “inside” a folder, or that we are going to “move” a file from one folder to another. Working effectively and efficiently with data stored on a file system requires a solid knowledge of the tools available for interacting with it. In the examples throughout the book I will focus on POSIX-compliant operating systems like MacOS and Linux, but most of the same concepts also apply to other operating systems such as Windows.
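In Python, the pathlib module provides a clean interface for these kinds of file system operations. The following is a minimal sketch; the data directory and file names are hypothetical examples rather than part of the book's code.

```python
from pathlib import Path

# Hypothetical data directory used for illustration
data_dir = Path("data")

# Create the directory (and any missing parents) if it doesn't already exist
data_dir.mkdir(parents=True, exist_ok=True)

# Walk the tree and report the size of every CSV file under the directory
for csv_file in sorted(data_dir.rglob("*.csv")):
    size_mb = csv_file.stat().st_size / 1e6
    print(f"{csv_file} ({size_mb:.2f} MB)")

# Build paths with the / operator rather than string concatenation,
# which keeps the code portable across operating systems
results_file = data_dir / "derivatives" / "results.csv"
```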
Storage on a PC/laptop hard drive
The simplest way to store data is on the hard drive of a researcher’s personal workstation or laptop. While this is easy and relatively cheap for smaller datasets, it is problematic for several reasons:
It is risky, as the failure of the hard drive or loss of the system to damage or theft can result in total loss of the data.
Unless the system is encrypted, theft may expose the data in ways that violate confidentiality.
Most PC/laptop systems do not have automatic backup systems, so they are less likely to have a viable backup for recovery if a problem occurs.
It is difficult or impossible to allow collaborators to access the data.
For many research domains, the size of the data (often in terabytes) will quickly outstrip the capacity of local hard drives.
For these reasons, I generally recommend that researchers in my lab never rely solely on their own computer as the storage solution for research data.
Storage on a network drive
Research data is often stored on network drives. These can vary from a network-attached storage system dedicated to one or more users within a research group, to large-scale network drives managed by an institutional computing center. One common feature of network storage is the use of redundant drive systems, such as RAID (Redundant Array of Independent Disks). These systems combine multiple individual hard drives in ways that provide some degree of redundancy, such that the system can withstand the loss of one or more individual disks (depending on the setup) with no data loss. However, it is critically important to remember that while RAID does provide some degree of fault-tolerance, it does not provide the disaster recovery benefits of a true backup.
Many researchers run and manage their own RAID systems for their group’s use, either attached to a single workstation or to a network. This can be a cost-effective solution for large data storage, especially in situations where institutional data storage resources are not available. However, I think that the apparent robustness of RAID systems can provide a false sense of security to their users. Take the most common RAID setup for redundant storage, RAID 5, in which the system is robust to the failure of one of its hard drives. When a drive fails, the system enters a “degraded” mode, often providing a notice to the user such as a flashing red light or beeping sounds. If this failure goes unnoticed, or the system administrator puts off fixing it, the failure of a second drive during degraded mode or during the rebuilding of the array after replacing the first failed drive can lead to complete data loss. Similarly, if the rebuilding of the array fails (for example, due to power loss during the rebuild or an unrecoverable error in reading from the other drives), this can compromise the data. Safe use of RAID arrays requires consistent attention (including email notifications of failure if possible) and a strong backup strategy.
Most research institutions now offer large network-attached storage systems for research data, often connected directly to high-performance computing systems. We have used systems like these for our research data for more than 15 years, and I personally would never go back to running my group’s own RAID system (which we did for years beforehand). Most importantly, the system administration and hardware management resources of an institutional computing center will almost always outstrip those of a research group. These large systems have monitoring and repair procedures in place to protect against data loss, and in the 15 years that we have used them (at Stanford and the University of Texas), we have never experienced data loss due to hardware failure, though they remain vulnerable to large-scale disasters. These systems are also highly performant, providing parallel access to the data through high-speed interconnections with the compute system.
Backing up one’s data from a large network drive is a great idea in theory, but in our experience it has often been either impossible or too costly, given the many terabytes of research data that we store on these systems. Given the relatively low likelihood of failure, we have adopted a more risk-tolerant strategy for large-scale data storage:
Original data, and any data that cannot reasonably be recreated from original data, are stored on at least two independent systems (such as the network drive and a cloud storage system).
Software code is stored on a separate partition that is backed up by the computing center, as well as being pushed to GitHub.
In this way, we have redundant copies of the code and original data that could be used to recreate the processed data if necessary. This is a risky strategy, but the more risk-averse alternative of continuously backing up our entire 250TB partition would be cost-prohibitive.
Cloud drives
Cloud drives, such as Dropbox or Google Drive, have become very popular for storage and sharing of data. I personally keep all of my code and documents synced to Dropbox from my laptop, and the file recovery capabilities of Dropbox have saved me from myself more than once after accidentally deleting the wrong files. I also regularly share files with other researchers using the Dropbox file sharing features. Because of the potential impact of loss of my laptop, I also keep a “hot spare” laptop that is constantly kept in sync with my primary laptop via Dropbox. Thus, cloud drives are essential for my own research and productivity workflow. However, cloud drives on their own are unlikely to be a primary solution for data storage with large datasets, for several reasons including:
Their cost increases dramatically as the datasets move into the terabyte range.
With these systems you can’t bring the compute to the data; you have to bring the data to the compute. This means that the data need to be fully downloaded to each synced system, resulting in a large number of copies of the dataset.
These systems are also not optimized for large files, and network speed may result in long synchronization times.
In addition, many institutions have specific restrictions regarding the use of specific cloud drives, especially with regard to private or protected information.
Cloud object storage
An increasingly common storage option, especially for very large datasets, is the use of cloud-based object stores, such as Amazon’s Simple Storage Service (S3) or Google Cloud Storage. In some ways object storage is similar to a standard file system, in that it allows the storage of arbitrary types of files, which can be retrieved using a key that functions like a file path on a file system. However, there are also important differences between object storage and file systems. Most importantly, cloud object stores are accessed via web API calls rather than by operations on a local storage system. Cloud object stores have several features that can make them very attractive for research data storage:
They offer scalability in terms of data size that is limited only by one’s budget
They provide robustness through redundant storage across multiple systems
They are often much less expensive than standard file system (“block”) storage
These systems are most effective when they are accessed directly using computing resources hosted by the same cloud provider. If they are located within the same datacenter, then the network connectivity can be substantially faster. It rarely makes sense to access data directly on a cloud object store from a local computing system, both because of the potentially high cost of reading and writing data from these systems and because of the relatively slow network connectivity between a local system and a cloud provider.
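As a concrete illustration, here is a minimal sketch of writing and reading objects in S3 using the boto3 Python library. It assumes that AWS credentials are already configured on the system; the bucket name and object keys are hypothetical placeholders.

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Upload a local file; the key acts like a file path within the bucket
s3.upload_file(
    Filename="sub-01_task-rest_bold.nii.gz",
    Bucket="my-lab-research-data",
    Key="raw/sub-01/sub-01_task-rest_bold.nii.gz",
)

# Download the same object back to a local path
s3.download_file(
    Bucket="my-lab-research-data",
    Key="raw/sub-01/sub-01_task-rest_bold.nii.gz",
    Filename="downloaded_bold.nii.gz",
)
```

Because every read and write is a network API call, access patterns that repeatedly touch many objects from outside the cloud provider can be both slow and costly.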
Database storage
In some areas of science, such as genomics, it is common to store data using database systems rather than files on a filesystem. A database system is a software system that stores data records and allows the user to query the records based on specific features and to add, modify, or delete records. A database system can run locally on one’s own computer, or can be accessed remotely via the Internet; most cloud computing providers offer database systems that can be hosted virtually, providing access to storage space that is limited only by one’s budget.
There are many potential benefits to the use of database storage that will be outlined below. However, one important factor in the choice of database versus flat file storage is what software tools will be used to analyze the data. If the analyses are primarily being performed using custom code in Python or R, then it is relatively easy to either retrieve information from a database or load data from a flat file. However, in some fields (including the field of neuroimaging where I work) it is common to use software packages that are built to process flat files, which strongly drives researchers in the field towards that approach.
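For analyses written in custom Python code, the two approaches end up looking quite similar. The following sketch shows both; the file names, table, and column names are hypothetical.

```python
import sqlite3
import pandas as pd

# Option 1: load a flat file directly into a DataFrame
df_flat = pd.read_csv("measurements.csv")

# Option 2: query a database and get back the same kind of DataFrame
with sqlite3.connect("study.db") as conn:
    df_db = pd.read_sql_query(
        "SELECT subject_id, value FROM measurements WHERE session = 1",
        conn,
    )
```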
I will first briefly outline several of the most common forms of database systems, and then show an example that employs each of them.
Relational databases
The best-known form of database is the relational database, which organizes data into a set of tables with well-defined relationships between them. They also enable queries using a query language, of which Structured Query Language (SQL) is the most widely used example. For me, SQL has always been one of those things that I use just infrequently enough that I never actually learn it. Fortunately, LLMs are very good at translating natural language into SQL queries, lowering the barrier to entry for researchers who want to try out database storage.
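Here is a minimal sketch of a relational design using Python's built-in sqlite3 module, with two tables linked by a shared key; the schema and values are hypothetical examples.

```python
import sqlite3

# An in-memory database, so the example runs without creating any files
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables with a well-defined relationship between them
cur.execute("""CREATE TABLE subjects (
    subject_id INTEGER PRIMARY KEY,
    age INTEGER)""")
cur.execute("""CREATE TABLE measurements (
    measurement_id INTEGER PRIMARY KEY,
    subject_id INTEGER REFERENCES subjects(subject_id),
    value REAL)""")

cur.execute("INSERT INTO subjects (subject_id, age) VALUES (1, 34)")
cur.execute("INSERT INTO measurements (subject_id, value) VALUES (1, 0.73)")
conn.commit()

# A query that joins the two tables on their shared key
rows = cur.execute("""
    SELECT s.subject_id, s.age, m.value
    FROM measurements m JOIN subjects s ON m.subject_id = s.subject_id
""").fetchall()
print(rows)  # [(1, 34, 0.73)]
conn.close()
```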
ACID
One important feature of relational databases is that they generally implement mechanisms to ensure data integrity and reliability. These are often referred to as the ACID properties:
Atomicity: Transactions are atomic, meaning that they either succeed or they don’t: there are no partial transactions. If a transaction fails then the database remains in the state it was in prior to the failed transaction.
Consistency: A transaction is required to leave the database in a valid state. Any transaction that attempts to violate any constraints or rules (such as the requirement that every measurement includes a valid device key) will be rejected.
Isolation: Individual transactions do not interfere with one another, such that they would never see any partial changes due to another transaction. Thus, one can submit many transactions at once and be sure that they will each be processed correctly without interference from others.
Durability: Transactions are durable, such that once they are written they will be permanent despite failures such as power outages or system crashes (as long as the server is not damaged).
The adherence of relational database systems to these principles helps ensure the integrity of scientific data, in comparison to flat files, which do not necessarily provide such guarantees.
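Atomicity is easy to see in practice. In the following sketch (using sqlite3 with a hypothetical table), the second insert violates a constraint, so the entire transaction is rolled back and the first insert leaves no trace either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (subject_id INTEGER, value REAL NOT NULL)")

try:
    with conn:  # the connection acts as a transaction context manager
        conn.execute("INSERT INTO measurements VALUES (1, 0.5)")
        # This violates the NOT NULL constraint, so the whole transaction
        # fails and is rolled back, including the first insert
        conn.execute("INSERT INTO measurements VALUES (2, NULL)")
except sqlite3.IntegrityError:
    pass

# The table is still empty: there is no partial transaction
print(conn.execute("SELECT COUNT(*) FROM measurements").fetchone())  # (0,)
```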
Analytic databases
There is a particular kind of relational database known as an analytic database that is specialized for operations across many rows at once, rather than the individual-record operations that standard relational databases focus on. One widely used analytic database in the Python ecosystem is DuckDB, which supports very fast operations on large datasets and integrates well with Pandas and other tools. Unlike traditional relational database systems, it doesn’t require any specialized server setup.
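The following sketch shows DuckDB querying a Pandas DataFrame in place, with no server or schema setup; the data are made up for illustration.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "trial_type": ["congruent", "incongruent", "congruent", "incongruent"],
    "rt": [0.52, 0.61, 0.47, 0.58],
})

# DuckDB can run SQL directly against the local DataFrame
result = duckdb.sql(
    "SELECT trial_type, AVG(rt) AS mean_rt FROM df GROUP BY trial_type"
).df()
print(result)
```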
NoSQL databases
While relational databases were the only game in town for many years, there are now a number of other kinds of database, collectively referred to as NoSQL databases because they use non-relational data models (like documents, graphs, or key-value pairs) rather than the tables with fixed schemas that define a standard relational database. Each of these can be very useful for specific problems that match the database’s strengths. Some, but not all, NoSQL databases are ACID compliant. It’s important to ensure that one has the right safeguards in place when using a non-compliant database system.
Document stores
A document store is basically what it sounds like: a system into which one can dump documents and then query them. I think of this as in some ways the opposite of a SQL database. In the SQL database, most of the work comes in designing the database schema, which will determine up front how the data are represented; after that, querying is fairly straightforward. In a document store, one can insert documents with varying structure into the database without the need for a predefined schema. The hard work in a document store comes in figuring out how to structure queries and indexes effectively, especially when the structure of the data varies. For most of the tasks where I have used databases I have chosen document stores (particularly MongoDB) over relational databases because of the flexibility that they offer.
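Here is a minimal sketch of a document store workflow using MongoDB's Python driver (pymongo). It assumes a MongoDB server running on the default local port; the database, collection, and document contents are hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["study_db"]["sessions"]

# Documents with different structures can live in the same collection,
# with no schema defined in advance
collection.insert_one({"subject_id": 1, "task": "rest", "duration": 600})
collection.insert_one({"subject_id": 2, "task": "nback", "levels": [1, 2, 3]})

# Query by field values across whatever documents contain them
for doc in collection.find({"task": "rest"}):
    print(doc)
```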
Graph databases
A graph database is built to efficiently store and query graph-structured data. These are data where the primary features of interest for querying are the relationships between entities, rather than the entities themselves. Scientific examples could include social network relationships, protein-protein interactions, or connections between neurons or brain areas. Graph databases are particularly good at finding multi-step relationships within the graph, which are much more difficult to find using a relational database or document store. A commonly used graph database is Neo4j, which has its own query language called Cypher that is specifically designed for queries on graph structure.
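The following sketch uses the Neo4j Python driver to store a few connections and then run a multi-step Cypher query. It assumes a local Neo4j server; the node labels, relationship type, and credentials are hypothetical examples.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two brain regions and a directed connection between them
    session.run(
        "MERGE (a:Region {name: $a}) MERGE (b:Region {name: $b}) "
        "MERGE (a)-[:CONNECTS_TO]->(b)",
        a="V1", b="V2",
    )
    # Find every region reachable from V1 within three connection steps,
    # the kind of multi-step relationship query that graph databases excel at
    result = session.run(
        "MATCH (a:Region {name: 'V1'})-[:CONNECTS_TO*1..3]->(b:Region) "
        "RETURN DISTINCT b.name AS name"
    )
    print([record["name"] for record in result])

driver.close()
```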
Vector databases
A relatively recent entry into the database field is the vector database, which is optimized for finding similar numerical vectors. These have become essential in the context of AI, because they can be used to quickly find similar items that are embedded in a vector space, typically using neural networks. These items can include text documents, images, molecular structures, or any other kind of data that can be embedded in a vector space. Vector databases differ from the other systems described above in that they can return ranked similarity scores in addition to a discrete set of matches, which makes them well suited to analyses that involve similarity-based search.
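To make the idea concrete, here is a conceptual sketch of the core operation a vector database performs: ranking stored vectors by their similarity to a query vector. Real vector databases use specialized indexes to do this efficiently at scale; the embeddings below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 128))   # embeddings of 1000 stored items
query = rng.normal(size=128)            # embedding of the query item

# Cosine similarity between the query and every stored vector
sims = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

# The five most similar items, with their similarity scores
top5 = np.argsort(sims)[::-1][:5]
print(list(zip(top5, np.round(sims[top5], 3))))
```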
In the next post I will turn to practical issues around managing research data.
