Handling sensitive data
Better Code, Better Science: Chapter 7, Part 9
This is a possible section from the open-source living textbook Better Code, Better Science, which is being released in sections on Substack. The entire book can be accessed here and the GitHub repository is here. This material is released under CC-BY-NC.
Handling of sensitive data
Researchers in some fields, particularly those who work with data obtained from human subjects, often handle data that are sensitive, meaning that the data may require a higher degree of security and/or additional procedures to protect the privacy and confidentiality of research subjects.
Data security
Sensitive data often require additional protections against potential breaches. The minimum requirement is generally that the data are housed on an encrypted file system, that any transfers are made over an encrypted channel, and that access to the system is controlled (a minimal encryption sketch appears at the end of this section). Some datasets impose more stringent security requirements through their Data Use Agreement. For example, the Adolescent Brain Cognitive Development (ABCD) study, a widely used dataset on brain and cognitive development, requires that any systems used to house or process the data meet a specific standard for sensitive information known as NIST SP 800-171. This standard comprises 17 “families” of security requirements that a system must meet to be compliant:
Access Control
Maintenance
Security Assessment and Monitoring
Awareness and Training
Media Protection
System and Communications Protection
Audit and Accountability
Personnel Security
System and Information Integrity
Configuration Management
Physical Protection
Planning
Identification and Authentication
Risk Assessment
System and Services Acquisition
Incident Response
Supply Chain Risk Management
In general, this level of security certification will be limited to computer systems run by an organizational IT group rather than by an individual investigator, due to the stringency of the requirements.
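As a concrete illustration of the encryption-at-rest requirement mentioned above, here is a minimal sketch that uses the Python cryptography package's Fernet interface to encrypt a data file with a symmetric key. The file names are hypothetical, and in practice the key would need to be stored securely (for example, in an institutional key management system) rather than generated alongside the data.

```python
from cryptography.fernet import Fernet

# Generate a symmetric key; in practice the key must be stored securely,
# separate from the encrypted data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Hypothetical file containing sensitive subject data
with open("subject_data.csv", "rb") as infile:
    plaintext = infile.read()

ciphertext = fernet.encrypt(plaintext)
with open("subject_data.csv.enc", "wb") as outfile:
    outfile.write(ciphertext)

# Decryption with the same key recovers the original bytes
assert fernet.decrypt(ciphertext) == plaintext
```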
Deidentification
Deidentification generally involves the removal of specific identifying information that could potentially be used to reidentify a human subject. In the US, this generally relies upon the Safe Harbor provision of the Privacy Rule implementing the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which specifies the following criteria for rendering a dataset deidentified:
(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
(A) Names
(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voice prints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section; and
(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
In the US, deidentification is generally sufficient to render data non-sensitive, but this is usually not the case in European countries covered by the General Data Protection Regulation (GDPR).
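To make this concrete, here is a minimal sketch of applying a few of the Safe Harbor transformations listed above to a pandas DataFrame. The column names are hypothetical, and a real deidentification pipeline would need to address all eighteen identifier categories (and, for ZIP codes, check the Census population counts for each three-digit prefix).

```python
import pandas as pd

def safe_harbor_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a subset of Safe Harbor transformations (illustrative only)."""
    # Drop direct identifiers (hypothetical column names)
    df = df.drop(columns=["name", "email", "phone", "mrn"])
    # Truncate ZIP codes to their first three digits; prefixes covering
    # 20,000 or fewer people would additionally be replaced with "000"
    df["zip3"] = df["zip_code"].astype(str).str[:3]
    df = df.drop(columns=["zip_code"])
    # Aggregate all ages over 89 into a single 90-or-older category
    df["age"] = df["age"].clip(upper=90)
    # Reduce event dates to year only
    df["admission_year"] = pd.to_datetime(df["admission_date"]).dt.year
    return df.drop(columns=["admission_date"])
```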
Anonymization
Anonymization refers to the modification of data in a way that essentially guarantees that subjects cannot be reidentified. For example, one might recode ages into ranges (such as 20-25 years old) rather than exact values, as shown in the sketch below. These methods change the data in ways that could affect downstream analyses, and thus many researchers shy away from using anonymized data unless absolutely necessary.
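Here is a minimal sketch of this kind of coarsening using pandas; the ages and bin edges are illustrative.

```python
import pandas as pd

# Hypothetical exact ages
ages = pd.Series([21, 24, 33, 47, 52], name="age")

# Recode into five-year ranges such as [20, 25); right=False makes the
# bins closed on the left and open on the right
age_range = pd.cut(ages, bins=range(20, 61, 5), right=False)
print(age_range)
```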
One method that is often used for large datasets is known as differential privacy, which involves adding noise to analytic results in a way that provably limits reidentification. For example, this method is now used by the US Census Bureau to protect individual respondents. It has the benefit of providing a mathematical guarantee of privacy by quantifying the maximum degree of privacy loss for a given amount of added noise. However, it may have adverse effects on the data, such as by disparately impacting small sub-populations within a larger dataset.
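As a sketch of the core idea, the classic Laplace mechanism makes a counting query differentially private by adding noise whose scale is the query's sensitivity divided by the privacy parameter epsilon; the epsilon values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(records, epsilon=1.0):
    """Return a differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return len(records) + noise

# Smaller epsilon means stronger privacy but noisier results
print(dp_count(range(10_000), epsilon=1.0))
print(dp_count(range(10_000), epsilon=0.1))
```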
In the next post I will talk about version control for data.
