Data organization schemes
Better Code, Better Science: Chapter 7, Part 7
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
It is essential to use a consistent data organization scheme for one’s research data. This is obvious when the data are shared with other researchers, but even if the data will never be shared with anyone else, good organization is essential when one looks back at one’s own data in the future. Thus, good data organization is a gift to your future self!
In this section we discuss data organization. The most important principle of data organization is that the scheme should be easily understood and consistently applied. If a standard scheme exists in one’s field of research, then I would strongly suggest using that scheme, or at least adapting it to one’s local requirements. A second important principle is that file and folder names should be machine readable. Increasingly we want to use automated tools to parse large datasets, and a machine-readable organization scheme (as I discuss below) is essential to doing this effectively.
File granularity
One common decision that we need to make when managing data is whether to save data in many smaller files or in fewer larger files. The right answer to this question depends in part on how we will need to access the data. If we only need to access a small portion of the data and we can easily determine which file to open to obtain those data, then it probably makes sense to save many small files. However, if we need to combine data across many small files, then it likely makes sense to save the data as one large file. For example, in the [data management notebook] there is an example where we create a large (10000 x 100000) matrix of random numbers, and save it either to a single file or to a separate file for each row. When loading these data, loading the single file is about five times faster than loading the individual files.
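Here is a minimal sketch of this kind of comparison, using a much smaller matrix than the notebook’s example so that it runs quickly; the file names and sizes here are illustrative, not taken from the notebook itself:

import tempfile
import time
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 1000))  # much smaller than the notebook's matrix, for speed

with tempfile.TemporaryDirectory() as tmp:
    tmpdir = Path(tmp)

    # save the matrix as a single file, and also as one file per row
    np.save(tmpdir / 'all_rows.npy', data)
    for i, row in enumerate(data):
        np.save(tmpdir / f'row-{i:04d}.npy', row)

    # time loading the single file
    start = time.perf_counter()
    _ = np.load(tmpdir / 'all_rows.npy')
    single_time = time.perf_counter() - start

    # time loading and stacking the per-row files
    start = time.perf_counter()
    _ = np.vstack([np.load(f) for f in sorted(tmpdir.glob('row-*.npy'))])
    multi_time = time.perf_counter() - start

print(f'single file: {single_time:.4f}s, many files: {multi_time:.4f}s')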
Another consideration about the number of files has to do with storage systems that are commonly used on high-performance computing systems. On these systems, it is common to have separate quotas for total space used (e.g. in terabytes) as well as for the number of inodes, which are structures that store information about files and folders on a UNIX filesystem. Thus, generating many small files (e.g. millions) can sometimes cause problems on these systems. For this reason, we generally err on the side of generating fewer, larger files rather than many smaller ones when working on high-performance computing systems.
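On Unix-like systems one can get a rough sense of inode usage with os.statvfs, as sketched below. Note that this reports filesystem-wide counts; HPC quota systems typically enforce per-user limits through their own tools, so check your system’s documentation for those.

import os

# os.statvfs is available on Unix-like systems; replace the path with your
# scratch or project directory on the HPC system
st = os.statvfs(os.path.expanduser('~'))
print(f'total inodes on this filesystem: {st.f_files:,}')
print(f'free inodes on this filesystem:  {st.f_ffree:,}')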
Data file/folder naming conventions
From my standpoint, the most important consideration for naming of files and folders is the ability to automatically parse the file/folder names. While there are many possible ways to do this, I prefer the approach used in the Brain Imaging Data Structure (BIDS), which our group was involved in developing. This is a standard for organizing a wide range of brain imaging data types, but the strategy behind the standard is applicable to almost any scientific data type. The basic idea is to embed a set of key-value pairs in the name, along with a suffix that defines the data type and a file extension. The schema looks like this:
<key>-<value>_<key>-<value>_suffix.extension

This is useful because it is very easy to automatically parse such a file name. For example, let’s say we have a file called sub-001_sess-1A_desc-Diffusion_fa.nii.gz. We can easily parse file names like this as follows:

def split_filename(filename):
    # everything after the first dot is treated as the extension (e.g. 'nii.gz')
    extension = '.'.join(filename.split('.')[1:])
    name = filename.split('.')[0]
    # all underscore-delimited elements except the last are key-value pairs
    key_values = {k: v for k, v in (item.split('-') for item in name.split('_')[:-1])}
    # the final underscore-delimited element is the suffix
    key_values['suffix'] = name.split('_')[-1]
    return extension, key_values

filename = 'sub-001_sess-1A_desc-Diffusion_fa.nii.gz'
extension, key_values = split_filename(filename)
print(key_values)

{'sub': '001', 'sess': '1A', 'desc': 'Diffusion', 'suffix': 'fa'}

This is very useful because it allows one to easily query a large set of files for particular key-value pairs, and also allows one to easily parse the key-value pairs for a particular file.
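For example, here is a minimal sketch of such a query, using the split_filename function defined above on a hypothetical data directory:

from pathlib import Path

# hypothetical dataset location; assumes split_filename from above is in scope
datadir = Path('data')
for f in sorted(datadir.glob('*.nii.gz')):
    _, key_values = split_filename(f.name)
    # find all fractional anisotropy files from session 1A
    if key_values.get('sess') == '1A' and key_values.get('suffix') == 'fa':
        print(f)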
It’s worth noting that using a naming scheme like this requires strict attention to naming hygiene. In particular, it’s essential to ensure that the delimiter characters ("-" and "_") don’t accidentally get used within the values. For example, if one were using an analysis called “IS-RSA”, using this for the description (e.g. sub-001_sess-1A_desc-IS-RSA_corr.zarr) would cause file parsing to fail.
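One way to guard against this is to validate values before building a file name. Here is a minimal sketch of such a check (the helper function is my own, not part of the BIDS standard):

import re

def check_value(value):
    # the key-value delimiters '-' and '_' must not appear within a value
    if re.search(r'[-_]', value):
        raise ValueError(f"value contains a reserved delimiter: '{value}'")
    return value

check_value('Diffusion')  # returns 'Diffusion'
check_value('ISRSA')      # a safe alternative to 'IS-RSA'
check_value('IS-RSA')     # raises ValueError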
In the next post I will discuss metadata.
