Last week I was trying to get access to some single-cell RNA sequencing (scRNAseq) data hosted at the European Genome-Phenome Archive. Unlike the American databases (SRA and GEO, both hosted at NCBI), where scRNAseq data is pretty easy to download, this European database has a bunch of extra privacy forms that need to be completed. The forms will be reviewed by a committee which might decide to grant permission to access the data, if they’re feeling generous. I’ve been waiting 9 days so far, and the database administrators still haven’t even sent my forms to the committee.
Now, a recent paper in Cell may make it even more cumbersome to share scRNAseq data1, by showing that a research participant’s genetic information can be inferred from scRNAseq gene expression information. The basic concept is as follows:
For a particular cell type,2 find what genetic variants (called eQTLs) are associated with variation in gene expression. eQTLs are already publicly known for most cell types, and can be computed for new cell types if you have data containing both gene expression and genetic variation.
For a scRNAseq sample, compute the average gene expression (called “pseudobulk”) for that cell type.
Then, determine which set of genetic variants is most likely to correspond with the observed gene expression in that sample.
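The three steps above can be sketched in a few lines of Python. This is a toy illustration and not the paper's actual method: it assumes a simple additive eQTL model in which expected normalized expression is `baseline + effect * dosage` with Gaussian noise, and a common counts-per-10,000 log normalization. The function names (`pseudobulk`, `infer_genotype`) and all numbers are made up for the example.

```python
import math


def pseudobulk(counts, cell_types, target_type):
    """Average log-normalized expression per gene over cells of one type.

    counts: list of per-cell dicts {gene: read count}
    cell_types: list of cell type labels, one per cell
    """
    cells = [c for c, t in zip(counts, cell_types) if t == target_type]
    if not cells:
        raise ValueError(f"no cells of type {target_type!r}")
    genes = set().union(*cells)
    totals = [sum(cell.values()) for cell in cells]
    profile = {}
    for gene in genes:
        values = [
            # counts per 10,000 reads, then log1p (an assumed normalization)
            math.log1p(1e4 * cell.get(gene, 0) / total) if total else 0.0
            for cell, total in zip(cells, totals)
        ]
        profile[gene] = sum(values) / len(values)
    return profile


def infer_genotype(expression, baseline, effect, noise_sd):
    """Pick the dosage (0, 1, or 2 alternate alleles) with the highest
    Gaussian likelihood under the assumed additive eQTL model."""
    def log_likelihood(dosage):
        mean = baseline + effect * dosage
        # Constant terms are dropped because noise_sd is shared by all dosages.
        return -0.5 * ((expression - mean) / noise_sd) ** 2
    return max((0, 1, 2), key=log_likelihood)


# Hypothetical eQTL: each alternate allele raises expression by 0.8 units.
# Expected means are 2.0, 2.8, and 3.6, so 3.5 is closest to dosage 2.
example_dosage = infer_genotype(3.5, baseline=2.0, effect=0.8, noise_sd=0.3)
```

A single eQTL gives only a noisy guess at one variant. Any realistic attack would have to combine evidence across many eQTLs and cell types, which is where the noisy-data challenges come in.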
In cases where a person’s genetic information is public (for example, if it was leaked from 23andme) this could de-anonymize a research participant.
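To make the linking step concrete, here is a minimal sketch of how inferred genotypes might be compared against a set of known genotypes. It assumes genotypes are stored as dosages (0, 1, or 2) keyed by variant ID and simply picks the person with the highest fraction of matching variants. The function name `best_match`, the person IDs, and the variant IDs are all hypothetical, and a real attack would need a proper statistical test rather than raw concordance.

```python
def best_match(inferred, database):
    """Return (person_id, concordance) for the database entry that agrees
    with the inferred genotypes at the largest fraction of shared variants.

    inferred: {variant_id: dosage} inferred from expression data
    database: {person_id: {variant_id: dosage}} of known genotypes
    """
    def concordance(known):
        shared = [v for v in inferred if v in known]
        if not shared:
            return 0.0
        matches = sum(inferred[v] == known[v] for v in shared)
        return matches / len(shared)

    scored = ((person, concordance(known)) for person, known in database.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: three variants inferred from one scRNAseq sample.
inferred = {"v1": 0, "v2": 2, "v3": 1}
database = {
    "person_a": {"v1": 0, "v2": 2, "v3": 1},
    "person_b": {"v1": 1, "v2": 2, "v3": 0},
}
match = best_match(inferred, database)
```

With only a handful of variants, many unrelated people would match equally well, so the number of reliably inferred variants determines whether a match actually identifies anyone.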
None of this is really new from a theoretical perspective,3 but the authors were the first ones to actually put it into practice with scRNAseq data, which required overcoming some technical challenges related to noisy data. By making it common knowledge that it’s possible to infer genotypes from scRNAseq counts matrices, this paper could lead to new restrictions on data sharing, especially for data from sensitive samples like fetal tissue.
Things like this paper are why the Europeans won’t let me use their data!
Let’s consider the following scenario to give a plausible example of how data sharing could lead to a bad outcome:
A married man has an affair. His partner gets pregnant, has an abortion, and donates the fetal tissue for research. Later, a bad actor gets the sequencing data, figures out the identity of both parents of the fetus, and tries to blackmail them. When they don’t pay, he tips off the man’s wife, who murders both of them out of jealousy.
In this hypothetical case, data sharing led to two deaths, so it’s clear that the potential for bad consequences exists.4 However, a couple of factors reduce my concern about data sharing:
Most people aren’t blackmailers, and for those people who are blackmailers, there are far easier ways to get people’s personal health information than scRNAseq data de-anonymization.
De-anonymization has been possible for at least 12 years with bulk5 gene expression data, but over this time, I’m not aware of any cases where this has actually happened.
Over here in the USA, NCBI still lets me download not only gene expression counts, but raw sequencing reads, from human fetal tissue. This data sharing has been a huge help for my research, since I didn’t have to collect fetal tissue and sequence it myself. Over the last 12 years, sharing gene expression data has had clear benefits for science, while the harms have remained hypothetical. I am confident that this will be true for the future as well.
In the end, the only people who have a real incentive to perform de-anonymization are researchers trying to publish papers in high-impact journals. If I were a research participant, I would want to be informed of the theoretical possibility of me being identified, but I wouldn’t worry about it.
scRNAseq data can be shared as either raw sequencing reads or counts matrices. Of course, raw sequencing reads are trivial to infer genotypes from, because they have the actual sequences. Counts matrices represent gene expression by listing the number of sequencing reads detected for each gene in each cell, but they don’t contain the actual sequences. Typically, researchers share scRNAseq data as counts matrices because those are easier to work with (and are much smaller files).
The study used various kinds of white blood cells because they’re the most common in scRNAseq databases.
Similar methods have been known since at least 2012.
More speculatively, de-anonymization and blackmail could be performed at scale using generative AI tools.
i.e., not single-cell.
I'm most disturbed by the possibility that profiling of a newborn or child could identify variation(s) that society uses to limit the life or happiness of 'unfits'. In Stephen Jay Gould's essay, "Carrie Buck's Daughter", he examines the infamous Supreme Court case of Buck v. Bell (1927) that forced the sterilization of tens of thousands of mentally "enfeebled" people in the US over the subsequent decades. I worry that our capacity to discriminate between individuals is vastly outpacing the collective intelligence of the institutions created to safeguard our liberty.