Imagine you’re a Marvel supervillain, developing a cool new gene therapy to allow your sharks to shoot laser beams from their eyes. For this to work, you need to make their retinal amacrine cells express the Super-Luciferase enzyme. However, if any Super-Luciferase gets expressed in other cell types, the effects will be catastrophic: it will create a laser overload, explode your sharks, and maybe even destroy your secret underground lab.
So, how can you avoid this?
One way would be to design your gene therapy vector, such as a virus or lipid nanoparticle, so that it delivers the Super-Luciferase gene only to the correct cells. However, even if you inject your gene therapy directly into the eye,1 it will be taken up by most of the cells in the retina, not just the amacrine cells. You would need to engineer the vector to block it from entering all the off-target cell types while maintaining on-target activity. This is quite challenging. Also, if you’re making transgenic sharks instead of doing gene therapy, then all the cells in the shark’s body would have the Super-Luciferase gene, so targeted delivery can’t help at all.
A better approach is to express the Super-Luciferase gene under the control of a cell-type-specific regulatory sequence. Often, you can use the promoter of a gene that you know is expressed only in your cell type of interest. For example, the GRK1 promoter drives expression in photoreceptor cells. However, it can be tricky to find these regulatory elements, since some cell types have no known specific marker genes, and even established markers can be less specific than they seem. The gene FOXL2, for instance, is commonly used as a marker for ovarian granulosa cells, but it is also expressed in pituitary cells and the embryonic eyelids. Clearly, this shark laser project will take a lot of work.
Synthetic Regulatory Elements
But what if there were an easier way?
Recently, a few exciting papers were published describing methods for de novo design of cell-type-specific regulatory sequences.
First, a paper from Stein Aerts’s lab2: Cell type directed design of synthetic enhancers
Second, from Alexander Stark’s lab: Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo
Third, a preprint from Ryan Tewhey’s lab: Machine-guided design of synthetic cell type-specific cis-regulatory elements
Cell type directed design of synthetic enhancers
This paper started out by creating random 500-base sequences of DNA, and “evolving” them using a machine learning model of chromatin accessibility to make sequences that were likely to be active in a cell type of interest.
When genes are turned off in a particular cell type, the chromatin containing those genes is tightly packed, but if the genes are being expressed, then the chromatin is more accessible.3 A recently developed method called ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) can map chromatin accessibility across the whole genome at single-cell resolution. Using this method, researchers analyzed Drosophila (fruit fly) brains to determine which regions of chromatin were accessible in which cell types. They could also determine the gene expression of each cell by single-cell RNA sequencing.
Taking this data, the researchers trained a deep learning model, DeepFlyBrain, that predicts the chromatin accessibility of a given DNA sequence in each cell type of the fly brain. With the model, they could perform in silico evolution:
predict the accessibility of random DNA sequences
take the highest-scoring sequences
generate a new pool of random mutations based on those sequences4
repeat
After 15 cycles of mutation and selection for sequences that would be accessible specifically in Kenyon cells (a certain insect brain cell type), the prediction scores had increased to nearly the maximum possible value. The researchers then took a closer look at the evolved sequences, and found that they included binding motifs for transcription factors that are present specifically in Kenyon cells.
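To make the loop above concrete, here is a minimal sketch of in silico evolution in Python. It is not the authors’ code: toy_score is a made-up stand-in for a trained model like DeepFlyBrain (it just counts a motif I invented), and the pool sizes and the single-base mutation scheme are my own assumptions for illustration.

```python
import random

BASES = "ACGT"
# Hypothetical motif, invented for this example. A real run would score
# sequences with a trained accessibility model instead.
TOY_MOTIF = "GATAAG"


def toy_score(seq):
    """Toy scorer: count (possibly overlapping) occurrences of TOY_MOTIF."""
    return float(sum(seq.startswith(TOY_MOTIF, i) for i in range(len(seq))))


def mutate(seq, rng):
    """Return a copy of seq with one random position changed to a different base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def evolve(score, length=500, pool_size=20, keep=5,
           mutants_per_parent=10, cycles=15, seed=0):
    """Run mutate-and-select cycles; return the best sequence and score history."""
    rng = random.Random(seed)
    # Step 1: start from completely random sequences.
    pool = ["".join(rng.choice(BASES) for _ in range(length))
            for _ in range(pool_size)]
    history = []
    for _ in range(cycles):
        # Step 2: keep the highest-scoring sequences.
        parents = sorted(pool, key=score, reverse=True)[:keep]
        history.append(score(parents[0]))
        # Step 3: build a new pool of random mutants. The parents are kept
        # too, so the best score can never go down between cycles.
        pool = parents + [mutate(p, rng)
                          for p in parents
                          for _ in range(mutants_per_parent)]
    best = max(pool, key=score)
    history.append(score(best))
    return best, history


best, history = evolve(toy_score)
print("best score at start:", history[0], "and at end:", history[-1])
```

In principle, the only change needed to use a real model is to pass its prediction function as score, although a real model would also need the sequences encoded in whatever format it expects.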
To test the synthetic enhancers, the researchers constructed expression vectors that would express green fluorescent protein (GFP) if the enhancers were active. They then created transgenic Drosophila lines with each expression vector integrated at the same site in the genome. Out of thirteen synthetic enhancers, ten were active specifically in Kenyon cells and three were inactive. The researchers also applied their method to perineurial glial cells, and found that four out of six candidates were specifically active in that cell type. Using their method, they were also able to alter the cell type specificity of naturally occurring enhancer sequences.5
Finally, the researchers tested their method using three different human melanoma cell lines. Using a machine learning model trained on different kinds of melanoma, they were able to design enhancers that drove luciferase expression in melanocyte-like melanoma, but not other kinds.
This method is super cool, but it requires a lot of high-quality ATAC-seq data to train the prediction model. For rare cell types, this can be quite difficult to generate. Also, since the model only knows about the cell types you give it, if you want to ensure that your gene expression is specific, you need to include data from every cell type at every developmental stage in the organism.6 This is not too bad for Drosophila, but much more difficult for humans.
Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo
This paper used a similar method to the previous paper (and was published in the same issue of Nature). The researchers started out with a deep learning model to predict chromatin accessibility in different Drosophila cell types. However, they then went a step further and used transfer learning to fine-tune their model on smaller datasets from particular cell types.
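As a rough illustration of the fine-tuning idea, here is a toy sketch in plain Python. Everything in it is an assumption on my part: the “model” is just a linear regression on base frequencies, and both datasets are simulated. The real work used a deep neural network, so this only shows the general pattern of pretraining on a large dataset and then continuing training from those weights on a small one.

```python
import random

BASES = "ACGT"


def features(seq):
    """Base frequencies plus a constant bias term (a deliberately tiny model)."""
    return [seq.count(b) / len(seq) for b in BASES] + [1.0]


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights, data):
    """Mean squared error of the model on (sequence, label) pairs."""
    return sum((predict(weights, features(s)) - y) ** 2 for s, y in data) / len(data)


def train(weights, data, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error, starting from weights."""
    weights = list(weights)
    rows = [(features(s), y) for s, y in data]
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in rows:
            err = predict(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def random_seq(rng, n=200):
    return "".join(rng.choice(BASES) for _ in range(n))


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


rng = random.Random(0)
# Simulated "big" dataset: the label is simply GC content.
source = [(s, gc_content(s)) for s in (random_seq(rng) for _ in range(500))]
# Simulated "small" dataset for a related task with a slightly different label.
target = [(s, gc_content(s) + 0.5 * s.count("A") / len(s))
          for s in (random_seq(rng) for _ in range(20))]

pretrained = train([0.0] * 5, source, steps=200)   # learn from the big dataset
fine_tuned = train(pretrained, target, steps=200)  # continue on the small one

print("target error before fine-tuning:", mse(pretrained, target))
print("target error after fine-tuning: ", mse(fine_tuned, target))
```

The point of the design is that the small dataset only has to adjust a model that already knows something, rather than teach it everything from scratch.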
To test their method, the researchers generated synthetic enhancers for different Drosophila tissues: central nervous system, muscle, intestine, and epidermis.7 Their enhancers generally performed well, although some of them were only active in a subset of the cell types in the tissue, and others had activity in off-target cell types.
In my opinion, the main advance of this paper was the use of transfer learning. This could be quite useful for rare cell types, where there is insufficient data to make a full model.
Machine-guided design of synthetic cell type-specific cis-regulatory elements
These researchers took a different approach: instead of training a model on chromatin accessibility, they trained directly on gene expression data. Previous MPRA (massively parallel reporter assay) experiments had taken 200-base fragments of the human genome and measured their ability to activate gene expression in three different human cancer cell lines (neuroblastoma, hepatoma, and leukemia). Aggregating data from various MPRA projects, the researchers obtained expression data for 776,474 of these fragments and trained a deep learning model to predict the expression of arbitrary 200-base sequences.
As in the other papers, the researchers next performed in silico evolution to design 200-base DNA sequences that could activate gene expression specifically in one of the three cancer cell lines. They also searched the human genome for 200-base sequences predicted to have high expression and specificity. Overall, the researchers tested 77,157 natural and synthetic DNA sequences in the three cancer cell lines. The synthetic sequences performed quite well: 94.1% of them were expressed specifically in the intended cell type.
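One simple way to put a number on “expressed specifically” is to compare a sequence’s activity in the intended cell line against its highest activity anywhere else. The sketch below does this in Python. The function names, the threshold, and the example measurements are all hypothetical, and I’m not claiming this is the exact criterion the paper used.

```python
def specificity_gap(activity, target):
    """On-target activity minus the highest off-target activity.

    activity maps cell line name -> measured reporter activity.
    """
    off_target = [value for cell, value in activity.items() if cell != target]
    return activity[target] - max(off_target)


def fraction_specific(measurements, threshold=1.0):
    """Fraction of sequences whose gap exceeds threshold.

    The threshold of 1.0 is an arbitrary choice for this example.
    """
    hits = sum(specificity_gap(activity, target) > threshold
               for activity, target in measurements)
    return hits / len(measurements)


# Made-up measurements for three designed sequences: (activity, intended cell line).
example = [
    ({"neuroblastoma": 3.0, "hepatoma": 0.5, "leukemia": 1.0}, "neuroblastoma"),
    ({"neuroblastoma": 0.2, "hepatoma": 2.5, "leukemia": 0.5}, "hepatoma"),
    ({"neuroblastoma": 1.5, "hepatoma": 1.0, "leukemia": 1.2}, "leukemia"),
]

for activity, target in example:
    print(target, specificity_gap(activity, target))
print("fraction specific:", fraction_specific(example))
```

Using the worst off-target cell line, rather than the average, means a sequence only counts as specific if it is quiet everywhere it shouldn’t be active.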
This is pretty cool, but it’s not very useful if you can only apply this method to three types of cancer cells. To see if their sequences could work in whole organisms, the researchers took the top three sequences for neuroblastoma cells and for hepatoma cells, and made transgenic zebrafish with these sequences used to drive GFP expression. Two of the three neuroblastoma sequences were active in neurons, and two of the three hepatoma sequences were active in the liver. In transgenic mice, one of the neuroblastoma sequences was active in neurons.
This paper is notable for its different approach and high success rate in the cell types that it was trained on. However, it is much harder to do MPRA experiments in a whole organism than in cell culture,8 which will make it hard to generate training data that covers a whole organism (and basically impossible for rare cell types).
Conclusions
Synthetic regulatory sequences might be the key breakthrough you need to make sharks with laser eyes. The two requirements will be a big ATAC-seq dataset for sharks, and good machine learning skills.9 And of course, a Super-Luciferase enzyme that isn’t entirely fictional.
Oh, and you might want to test your regulatory sequence with GFP before you put in Super-Luciferase. Just in case.
1. Which is definitely possible, and has even been tested in a clinical trial which used a photoreceptor-specific GRK1 promoter to express Cas9.
2. I’m a big fan of the Aerts lab, since they also developed pySCENIC, which I’ve used for my meiosis project.
3. For more background on chromatin, see my epigenetics post.
4. The first method the researchers used was to make completely random mutations. After showing that this worked, the researchers developed another mutation method that inserted known binding motifs for transcription factors. Finally, the researchers tested a third method based on generative adversarial networks. They said this also worked but didn’t give many details.
5. For example, taking an enhancer active in one cell type and making it also active in another, or taking an enhancer active in two cell types and restricting it to one type.
6. If a cell type is missing from the model, the model won’t be able to ensure that your enhancer is inactive in that cell type. You might be able to get away with assuming it’s inactive, but this is definitely not guaranteed.
7. I’m not sure why they went for tissues instead of cell types, as each tissue contains multiple cell types that can have very different gene expression. Maybe they wanted to differentiate their paper from the Aerts lab paper, which was posted as a preprint before theirs?
8. One preprint used AAVs to do this, but it’s hard.
9. Speaking of machine learning skills, I learned how to run Stable Diffusion with PyTorch to make the images in this post. I couldn’t get it to understand that the lasers need to come out of the shark’s eyes, and it generated a bunch of incorrect images which I showed at the beginning of this post. I eventually resorted to drawing some green lines on a picture of a shark and using img2img to make it look nicer (for the final image in this post).