Data formats and file types: Tabular data
Better Code, Better Science: Chapter 7, Part 4
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
There is a wide range of different data formats and file types used across science. In this and coming posts I will cover some of the most widely used formats and file types, realizing that there are many areas of science that use highly specialized formats/file types that I can’t cover in depth here. In this first post I focus on tabular data.
Tabular data
Tabular data are loosely defined as data stored in rows and columns, as in a spreadsheet. A data frame is a particular representation of tabular data, in which each column in the dataset has a label and each row has an index value that refers to that row. The packages supporting data frames (such as pandas in Python) generally provide a set of operations that can be performed on the data frame, such as filtering, sorting, merging, or pivoting.
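To make this concrete, here is a minimal sketch of a couple of these operations in pandas (the values are made up for illustration):

import pandas as pd
# a small data frame: labeled columns plus an integer row index
df = pd.DataFrame({"id": [1, 2, 3],
                   "height": [170, 180, 175],
                   "weight": [70, 80, 75]})
# filtering: keep only the rows where height exceeds a threshold
tall = df[df["height"] > 172]
# sorting: order the rows by weight, heaviest first
by_weight = df.sort_values("weight", ascending=False)

Merging and pivoting come up in the tidying examples later in this post.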
Long, wide, and tidy tabular data
There are multiple ways to organize data within a tabular dataset, reflecting what each column and row refer to. This distinction is often referred to as long versus wide data, though in practice there is a spectrum of organization between these extremes. Wide data generally refers to data where each row refers to a single observational unit (such as a single site, person, or planet), and each column refers to a different variable measured for that unit. For example, let's say that we had measurements of height, weight, and blood pressure from three individuals. A wide representation of these data would have one row per individual, with an identifier column identifying the individual and separate columns for each of the measurements:
| id | height | weight | blood_pressure |
|-----:|---------:|---------:|-----------------:|
| 1 | 170 | 70 | 120 |
| 2 | 180 | 80 | 130 |
| 3 | 175 | 75 | 125 |

Long data generally refers to data where the different variables measured for a unit are spread across rows. In this case, separate columns specify the identity of the unit being measured, the variable being measured, and the value of that variable for that unit:
| id | measurement | value |
|-----:|:---------------|--------:|
| 1 | height | 170 |
| 2 | height | 180 |
| 3 | height | 175 |
| 1 | weight | 70 |
| 2 | weight | 80 |
| 3 | weight | 75 |
| 1 | blood_pressure | 120 |
| 2 | blood_pressure | 130 |
| 3 | blood_pressure | 125 |

A related concept is the notion of tidy data, popularized by Hadley Wickham, a leader in the R language community. While it is often associated with long data, the concept of tidy data actually refers to the meaning of the data rather than its shape. Tidy data has three features, as specified by Wickham (2014):
Each variable is a column.
Each observation is a row.
Each type of observational unit is a table.
It’s easiest to understand these concepts by looking at some examples of datasets that are not tidy, following the examples laid out by Wickham:
Column headers are values, not variable names
Sometimes data are spread across columns where each column refers to a different value of the variable. For example, the following table shows cancer pathology data from three hospitals, with different columns quantifying the number of samples falling into each of four different tumor stages:
| site | Stage1 | Stage2 | Stage3 | Stage4 |
|:-------|---------:|---------:|---------:|---------:|
| H1 | 46 | 27 | 38 | 32 |
| H2 | 37 | 48 | 31 | 27 |
| H3 | 25 | 37 | 33 | 23 |

There are really three different variables represented in this table: site, stage, and frequency. What we really want is three columns representing those three variables. We can achieve this using the melt function from pandas:
df_tidy = pd.melt(df, id_vars=["site"],
                  var_name="Stage", value_name="Frequency")
# make stage an integer
df_tidy.Stage = df_tidy.Stage.str.replace("Stage", "").astype(int)
print(df_tidy.to_markdown(index=False))

| site | Stage | Frequency |
|:-------|--------:|------------:|
| H1 | 1 | 46 |
| H2 | 1 | 37 |
| H3 | 1 | 25 |
| H1 | 2 | 27 |
| H2 | 2 | 48 |
| H3 | 2 | 37 |
| H1 | 3 | 38 |
| H2 | 3 | 31 |
| H3 | 3 | 33 |
| H1 | 4 | 32 |
| H2 | 4 | 27 |
| H3 | 4 | 23 |
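The transformation is also reversible: if we later need the wide layout back, we can pivot. Here is a minimal sketch using the pivot method on the df_tidy frame from above:

# pivot back to wide form: one row per site, one column per stage
df_wide = df_tidy.pivot(index="site", columns="Stage",
                        values="Frequency").reset_index()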
Multiple variables are stored in one column
This pattern takes the previous one a step further, by defining columns based on the values of more than one variable. For example, let's say that there are data for both lung and prostate cancer:
|site|Stg1Lng|Stg2Lng|Stg3Lng|Stg4Lng|Stg1Prs|Stg2Prs|Stg3Prs|Stg4Prs|
|:---|------:|------:|------:|------:|------:|------:|------:|------:|
| H1 | 44 | 32 | 21 | 28 | 48 | 24 | 44 | 34 |
| H2 | 22 | 30 | 22 | 45 | 26 | 49 | 31 | 32 |
| H3 | 48 | 40 | 26 | 33 | 46 | 33 | 24 | 25 |

In this example, each value column represents a combination of stage and type of cancer. We can tidy this by first melting the data frame, and then splitting the combined column names into separate variables for Stage and Cancer type:
# tidy this, first by melting
df_tidy = pd.melt(df, id_vars=["site"],
                  var_name="Stage_Cancer", value_name="Freq")
# then split Stage_Cancer into two columns
df_tidy[["Stage", "Cancer"]] = df_tidy.Stage_Cancer.str.extract(r"Stg(\d)(\w{3})")
del df_tidy["Stage_Cancer"]
# make Stage an integer
df_tidy.Stage = df_tidy.Stage.astype(int)
# reorder columns
df_tidy = df_tidy[["site", "Stage", "Cancer", "Freq"]]
print(df_tidy.to_markdown(index=False))

| site | Stage | Cancer | Freq |
|:-------|--------:|:---------|-------:|
| H1 | 1 | Lng | 44 |
| H2 | 1 | Lng | 22 |
| H3 | 1 | Lng | 48 |
| H1 | 2 | Lng | 32 |
| H2 | 2 | Lng | 30 |
| H3 | 2 | Lng | 40 |
| H1 | 3 | Lng | 21 |
| H2 | 3 | Lng | 22 |
| H3 | 3 | Lng | 26 |
| H1 | 4 | Lng | 28 |
| H2 | 4 | Lng | 45 |
| H3 | 4 | Lng | 33 |
| H1 | 1 | Prs | 48 |
| H2 | 1 | Prs | 26 |
| H3 | 1 | Prs | 46 |
| H1 | 2 | Prs | 24 |
| H2 | 2 | Prs | 49 |
| H3 | 2 | Prs | 33 |
| H1 | 3 | Prs | 44 |
| H2 | 3 | Prs | 31 |
| H3 | 3 | Prs | 24 |
| H1 | 4 | Prs | 34 |
| H2 | 4 | Prs | 32 |
| H3 | 4 | Prs | 25 |

Variables are stored in both rows and columns
We could also have some variables denoted by their own column with others split across columns:
| site | Cancer | Stage1 | Stage2 | Stage3 | Stage4 |
|:-------|:---------|---------:|---------:|---------:|---------:|
| H1 | Lng | 44 | 32 | 21 | 28 |
| H1 | Prs | 48 | 24 | 44 | 34 |
| H2 | Lng | 22 | 30 | 22 | 45 |
| H2 | Prs | 26 | 49 | 31 | 32 |
| H3 | Lng | 48 | 40 | 26 | 33 |
| H3 | Prs | 46 | 33 | 24 | 25 |

Here we can melt the data frame to collect the Stage columns:
# gather to make it tidy
df_both_tidy = pd.melt(df_both, id_vars=["site", "Cancer"],
                       var_name="Stage", value_name="Frequency")
# make Stage an integer
df_both_tidy.Stage = df_both_tidy.Stage.str.replace("Stage", "").astype(int)
print(df_both_tidy.to_markdown(index=False))

| site | Cancer | Stage | Frequency |
|:-------|:---------|--------:|------------:|
| H1 | Lng | 1 | 44 |
| H1 | Prs | 1 | 48 |
| H2 | Lng | 1 | 22 |
| H2 | Prs | 1 | 26 |
| H3 | Lng | 1 | 48 |
| H3 | Prs | 1 | 46 |
| H1 | Lng | 2 | 32 |
| H1 | Prs | 2 | 24 |
| H2 | Lng | 2 | 30 |
| H2 | Prs | 2 | 49 |
| H3 | Lng | 2 | 40 |
| H3 | Prs | 2 | 33 |
| H1 | Lng | 3 | 21 |
| H1 | Prs | 3 | 44 |
| H2 | Lng | 3 | 22 |
| H2 | Prs | 3 | 31 |
| H3 | Lng | 3 | 26 |
| H3 | Prs | 3 | 24 |
| H1 | Lng | 4 | 28 |
| H1 | Prs | 4 | 34 |
| H2 | Lng | 4 | 45 |
| H2 | Prs | 4 | 32 |
| H3 | Lng | 4 | 33 |
| H3 | Prs | 4 | 25 |

Single type of observational unit spread across multiple tables
Sometimes we might have a separate data frame for each observational unit, such as a different data frame for each hospital in our example. To fix this we can simply merge the data frames by concatenating them:
df_merged = pd.concat([df1, df2, df3], ignore_index=True)

Tabular file formats
The most common file formats are comma-separated value (CSV) and tab-separated value (TSV) files. Both have the benefit of being stored as plain text, so their contents can be easily examined without any special software. I generally prefer to use tabs rather than commas as the separator (or delimiter), primarily because tabs more naturally accommodate longer pieces of text that may include commas. Such text can also be represented using CSV, but it requires additional processing to escape or quote the commas within the text so that they are not interpreted as delimiters.
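As a sketch of how this looks in pandas (the file name here is just illustrative), writing and reading a TSV file simply requires specifying the tab delimiter:

# write tab-separated values; commas inside text fields
# need no special quoting here
df.to_csv("measurements.tsv", sep="\t", index=False)
# read it back, again specifying the tab delimiter
df_tsv = pd.read_csv("measurements.tsv", sep="\t")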
Text file formats like CSV and TSV are nice for their ease of interpretability, but they are highly inefficient for large data compared to optimized file formats, such as the Parquet format. To see this in action, I loaded a brain image and saved all of the non-zero data points (857,785 to be exact) to a data frame, which I then saved to CSV and Parquet formats; see the data management notebook for details. Looking at the resulting files, we can see that the Parquet file is only about 20% the size of the CSV file:
➤ du -sk /tmp/brain_tabular.*
19464 /tmp/brain_tabular.csv
3804 /tmp/brain_tabular.parquet
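The two files were generated along these lines (a minimal sketch; df_brain here stands in for the data frame of non-zero values, and the actual code is in the data management notebook):

# save the same data frame in both formats
# (to_parquet requires the pyarrow or fastparquet package)
df_brain.to_csv('/tmp/brain_tabular.csv', index=False)
df_brain.to_parquet('/tmp/brain_tabular.parquet')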
When we look at the amount of time needed to load these files, we see an even stronger edge for the Parquet format. Because loading times can vary due to other activity on the system, we load each file 100 times to get an average load time:

import time
# time loading of each format
# load 100 times to get the average loading time of each format
nreps = 100
start = time.time()
for _ in range(nreps):
    df_csv = pd.read_csv('/tmp/brain_tabular.csv')
end = time.time()
csv_time = (end - start) / nreps
print(f"CSV load time: {csv_time:.4f} seconds")
start = time.time()
for _ in range(nreps):
    df_parquet = pd.read_parquet('/tmp/brain_tabular.parquet')
end = time.time()
parquet_time = (end - start) / nreps
print(f"Parquet load time: {parquet_time:.4f} seconds")
print(f'ratio {csv_time/parquet_time:.2f}')

CSV load time: 0.0719 seconds
Parquet load time: 0.0082 seconds
ratio 8.77

Here we see that loading the CSV file takes almost nine times as long as loading the Parquet file. For a single file of this size the difference is not meaningful, but for projects involving many files or much larger files the difference in loading times becomes substantial.
Russ’s First Law of Tabular Data Management
“Don’t use spreadsheets to manage scientific data.”
Spreadsheet software such as Microsoft Excel is commonly used by researchers for all sorts of data management and processing operations. Why are spreadsheets problematic?
They encourage manual manipulation of the data, which makes the operations non-reproducible by definition.
Spreadsheet tools will often automatically format data, sometimes changing things in important but unwanted ways. For example, gene names such as “SEPT2” and “MARCH1” are converted to dates by Microsoft Excel, and some accession numbers (e.g. “2310009E13”) are converted to floating point numbers. An analysis of published genomics papers found that roughly twenty percent of supplementary gene lists created using Excel contained errors in gene names due to these conversions.
It is very easy to make errors when performing operations on a spreadsheet, and these errors can often go unnoticed. A well-known example occurred in the paper "Growth in a Time of Debt" by the prominent economists Carmen Reinhart and Kenneth Rogoff. This paper claimed to have found that high levels of national debt led to decreased economic growth, and was used as a basis for promoting austerity programs after the 2008 financial crisis. However, researchers subsequently discovered that the authors had made an error in their Excel spreadsheet, excluding data from several countries; when the full data were used, the relationship between growth and debt became much weaker.
Spreadsheet software can have limitations that cause serious problems. For example, the use of an outdated Microsoft Excel file format (.xls) caused underreporting of COVID-19 cases, due to the limit on the number of rows in that file format and the lack of any warning when additional rows in the imported data files were ignored.
Spreadsheets do not easily lend themselves to version control and change tracking, although some spreadsheet tools (such as Google Sheets) do provide the ability to clearly label versions of the data.
I will occasionally use Microsoft Excel to examine a data file, but I think that spreadsheet tools should *almost never* be used as part of a scientific data workflow.
In the next post I will talk about storage of multidimensional array data.
