Biologists (including myself) often need to identify types of cells based on their gene expression. For example, if I’m differentiating stem cells to make an ovarian organoid, and I perform single cell RNA sequencing, I might want to check the data to see which ovarian cell types are present.
Today, a Nature Methods paper reported good results from giving GPT-4 a list of cell-specific genes and asking it to identify the cell type. This seems interesting, and also quite easy to check for myself.
My test
I don’t pay for access to GPT-4, but I gave ChatGPT a test using the prompt from the Nature Methods paper, with the following cell markers:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name.
SOX17, POU5F1, NANOS3, PRDM1, NANOG, CD38
POU5F1, SOX2, KLF4, ITGA6, NANOG
SOX17, FOXA2, CXCR4, GATA4
FOXL2, AMHR2, CD82, NR5A1, FSHR, GATA4
ZP3, DPPA3, DDX4, NPM2, ZP2
FOXL2, FSHB, NR5A1, PITX1, GNRHR
STK31, ZBTB16, DDX4, SSEA4, NANOS2
NR2F2, CYP17A1, STAR, LHCGR, GLI1, HSD3B
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
SYCP1, TEX12, REC8, SPO11, SYCP3
NR5A1, SOX9, FSHR, GATA4
OTX2, SOX1, TUBB3, PAX6
ChatGPT’s response:
1. Endoderm cell
2. Induced pluripotent stem cell (iPSC)
3. Definitive endoderm cell
4. Granulosa cell
5. Oocyte
6. Granulosa cell
7. Spermatogonial stem cell
8. Theca cell
9. Sertoli cell
10. Spermatocyte
11. Leydig cell
12. Neuroepithelial cell
Results
1. Incorrect, this is a primordial germ cell. This should have been an easy one! SOX17 does overlap with endoderm though.
2. Correct
3. Correct
4. Correct
5. Correct
6. Incorrect, this is a pituitary gonadotroph. Although it does share FOXL2 and NR5A1 expression with granulosa cells, FSHB and GNRHR are very specific to the pituitary.
7. Correct
8. Correct
9. Very incorrect. This is just a list of random genes! ChatGPT seemed to be biased by me giving it other reproductive cell types.
10. Sort of correct: these are just markers for meiosis.
11. Incorrect, this is a Sertoli cell. Leydig cells lack SOX9 expression.
12. Basically correct, although I would have called it “neural ectoderm”.
Overall score: 7.5 / 12
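For transparency, here is a minimal sketch of how that tally works out. The credit values are an assumption inferred from the total: full credit for "correct" (including "basically correct"), half credit for "sort of correct", and nothing for "incorrect". The names `CREDIT`, `first_test`, and `total_score` are purely illustrative.

```python
# Assumed credit scheme (inferred from the 7.5 / 12 total, not stated explicitly):
# correct = 1, partial ("sort of correct") = 0.5, incorrect = 0.
CREDIT = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

# Grades for the 12 rows of the first test, in row order.
first_test = [
    "incorrect",  # row 1: primordial germ cell, answered "endoderm cell"
    "correct",    # row 2
    "correct",    # row 3
    "correct",    # row 4
    "correct",    # row 5
    "incorrect",  # row 6: pituitary gonadotroph, answered "granulosa cell"
    "correct",    # row 7
    "correct",    # row 8
    "incorrect",  # row 9: random genes, answered "Sertoli cell"
    "partial",    # row 10: meiosis markers, answered "spermatocyte"
    "incorrect",  # row 11: Sertoli cell, answered "Leydig cell"
    "correct",    # row 12: basically correct
]


def total_score(grades):
    """Sum the credit for a list of per-row grades."""
    return sum(CREDIT[g] for g in grades)


print(f"{total_score(first_test)} / {len(first_test)}")  # prints "7.5 / 12"
```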
Let’s test again
In the first test, ChatGPT got the random genes completely wrong. Let’s prompt it to announce that it’s uncertain if it doesn’t actually know the cell type.
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name. If you are uncertain, respond "unknown" instead of providing a cell type name.
SOX17, POU5F1, NANOS3, PRDM1, NANOG, CD38
POU5F1, SOX2, KLF4, ITGA6, NANOG
SOX17, FOXA2, CXCR4, GATA4
FOXL2, AMHR2, CD82, NR5A1, FSHR, GATA4
ZP3, DPPA3, DDX4, NPM2, ZP2
FOXL2, FSHB, NR5A1, PITX1, GNRHR
STK31, ZBTB16, DDX4, SSEA4, NANOS2
NR2F2, CYP17A1, STAR, LHCGR, GLI1, HSD3B
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
SYCP1, TEX12, REC8, SPO11, SYCP3
NR5A1, SOX9, FSHR, GATA4
OTX2, SOX1, TUBB3, PAX6
ChatGPT’s response (and my scoring):
1. Endoderm (still incorrect!)
2. Embryonic stem cell (correct, although it said iPSC before)
3. Definitive endoderm (correct)
4. Granulosa cell (correct)
5. Oocyte (correct)
6. Gonadotroph (correct)
7. Spermatogonium (largely correct, but “spermatogonial stem cell” is better)
8. Theca cell (correct)
9. Sertoli cell (INCORRECT, it’s still thrown off by random genes!)
10. Spermatocyte (sort of correct, same as before)
11. Sertoli cell (correct this time!)
12. Neural progenitor cell (again, basically correct but not what I would have called it)
This time it performs better (I’d give it 9.5/12), but it’s still tricked by random genes, and it still can’t recognize primordial germ cells.
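The two prompts used so far differ only in the final "unknown" sentence, so anyone repeating the test could build both from the same marker lists. Below is a minimal sketch; `build_prompt`, `BASE`, `UNKNOWN_CLAUSE`, and the `allow_unknown` flag are illustrative names, and the code only assembles the prompt text without calling any model.

```python
# Prompt text copied from the two tests above.
BASE = (
    "Identify types of human cells using the following marker genes. "
    "Identify one cell type for each row. Only provide the cell type name."
)
UNKNOWN_CLAUSE = (
    'If you are uncertain, respond "unknown" instead of providing a cell type name.'
)


def build_prompt(marker_rows, allow_unknown=True):
    """Return the prompt header followed by one comma-separated gene row per line."""
    header = f"{BASE} {UNKNOWN_CLAUSE}" if allow_unknown else BASE
    rows = [", ".join(genes) for genes in marker_rows]
    return "\n".join([header, *rows])


# Example with the first two rows of the test set.
rows = [
    ["SOX17", "POU5F1", "NANOS3", "PRDM1", "NANOG", "CD38"],
    ["POU5F1", "SOX2", "KLF4", "ITGA6", "NANOG"],
]
print(build_prompt(rows, allow_unknown=True))
```

Keeping the rows identical between variants makes it easier to attribute any change in the answers to the extra sentence rather than to the gene lists.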
Let’s try with only random genes:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name. If you are uncertain, respond "unknown" instead of providing a cell type name.
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1
IL9, SLC30A4, SX18P8, CHRDL2, SMUG1P1
HCST, EXOSC8, ORC3, CIDECP2, DNM2
DTL, U3, DDX28, WDFY3, PPP1R2P4
LTK, STK32C, SMIM9, DPPA3P10, MTCO1P12
This time, ChatGPT just responded “unknown” for everything. Very good! Without the prompt to respond “unknown”, ChatGPT instead made wild guesses:
Renal tubular epithelial cell
Retinal pigment epithelial cell
Natural killer cell
Keratinocyte
Spermatocyte
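For anyone who wants to build a similar negative control, here is a minimal sketch that draws random gene lists from a gene universe. The sampling method, the function name `random_gene_lists`, the seed, and the toy universe are all illustrative assumptions, not a description of how the lists above were generated. In practice the universe would be a full set of gene symbols loaded from an annotation file.

```python
import random

# Toy stand-in for a gene universe (here just the 25 symbols from the random
# lists above); a real control would sample from a full annotation.
TOY_UNIVERSE = [
    "CASC3", "PGAP1", "SLC6A16", "CNTNAP4", "NPHP1",
    "IL9", "SLC30A4", "SX18P8", "CHRDL2", "SMUG1P1",
    "HCST", "EXOSC8", "ORC3", "CIDECP2", "DNM2",
    "DTL", "U3", "DDX28", "WDFY3", "PPP1R2P4",
    "LTK", "STK32C", "SMIM9", "DPPA3P10", "MTCO1P12",
]


def random_gene_lists(universe, n_lists=5, genes_per_list=5, seed=0):
    """Draw n_lists disjoint random gene lists from universe, reproducibly."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    # Sample without replacement across all lists so no gene appears twice.
    picked = rng.sample(universe, n_lists * genes_per_list)
    return [
        picked[i:i + genes_per_list]
        for i in range(0, len(picked), genes_per_list)
    ]


for row in random_gene_lists(TOY_UNIVERSE):
    print(", ".join(row))
```

Fixing the seed means the same control lists can be reused when comparing prompts or models.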
Conclusions
ChatGPT is remarkably good at identifying most cell types, but can be overconfident and assign a cell type to a list of random genes. There also seems to be some bias in this: ChatGPT said the random gene list was Sertoli cells in the context of the larger list of reproductive cell types, but when given five lists of completely random genes, it said “unknown” for all of them. Giving the option to respond “unknown” was very important, since otherwise the main outcome was “bovine fecal cells”.
I still don’t trust ChatGPT enough to use it in my research, but it will be interesting to see if this improves over time. Also, if any readers can try my prompts with GPT-4, please post the results in the comments!
Addendum: the reverse direction
I asked ChatGPT to list marker genes for the human cell types (replacing the “random” with macrophages). Interestingly, it listed SOX2 as a marker for primordial germ cells. This would be correct for mice, but not for humans! Apparently ChatGPT does know what primordial germ cells are; it just knows more about mice than about humans, even though I specified human cells in my prompt.
The other results were correct, although sometimes ChatGPT used nonstandard gene names (for example PLZF instead of ZBTB16, or SCF instead of KITLG).
Primordial germ cells:
NANOG
STELLA (also known as DPPA3)
BLIMP1 (also known as PRDM1)
SOX2
OCT4 (also known as POU5F1)
Pluripotent stem cells:
NANOG
OCT4 (also known as POU5F1)
SOX2
LIN28
KLF4
Definitive endoderm cells:
GATA4
SOX17
FOXA2
CXCR4
HNF4A
Ovarian granulosa cells:
FSHR (Follicle-Stimulating Hormone Receptor)
CYP19A1 (Aromatase)
INHBB (Inhibin Beta B)
AMH (Anti-Müllerian Hormone)
ESR2 (Estrogen Receptor 2)
Pituitary gonadotrope cells:
FSHB (Follicle-Stimulating Hormone Beta Subunit)
LHB (Luteinizing Hormone Beta Subunit)
GNRHR (Gonadotropin-Releasing Hormone Receptor)
NR5A1 (Steroidogenic Factor 1)
GATA2
Oocytes:
GDF9 (Growth Differentiation Factor 9)
BMP15 (Bone Morphogenetic Protein 15)
ZP3 (Zona Pellucida Glycoprotein 3)
FIGLA (Folliculogenesis-Specific Basic Helix-Loop-Helix Transcription Factor)
NOBOX (Newborn Ovary Homeobox)
Spermatogonial stem cells:
NANOS2
PLZF (Promyelocytic Leukemia Zinc Finger)
GFRA1 (GDNF Family Receptor Alpha 1)
RET (Rearranged During Transfection)
STRA8 (Stimulated by Retinoic Acid 8)
Theca cells:
CYP17A1 (Cytochrome P450 Family 17 Subfamily A Member 1)
CYP11A1 (Cytochrome P450 Family 11 Subfamily A Member 1)
LHCGR (Luteinizing Hormone/Choriogonadotropin Receptor)
INSL3 (Insulin-Like 3)
STAR (Steroidogenic Acute Regulatory Protein)
Macrophages:
CD68
CD14
CD163
CSF1R (Colony-Stimulating Factor 1 Receptor)
ITGAM (Integrin Subunit Alpha M)
Meiocytes:
SYCP1 (Synaptonemal Complex Protein 1)
SYCP3 (Synaptonemal Complex Protein 3)
DMC1 (Disrupted Meiotic cDNA 1)
REC8 (Meiotic Recombination Protein REC8)
SPO11 (Sporulation-Specific Protein 11)
Sertoli cells:
SOX9
AMH (Anti-Müllerian Hormone)
FSHR (Follicle-Stimulating Hormone Receptor)
SCF (Stem Cell Factor)
INHA (Inhibin Alpha)
Neuroectoderm cells:
PAX6
SOX1
SOX2
NEUROG2 (Neurogenin 2)
NES (Nestin)
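Because of the nonstandard names, comparing ChatGPT's marker lists against the original prompt rows needs an alias-normalization step first. Here is a minimal sketch using only the alias pairs mentioned in this post; the `ALIASES` table and `normalize` helper are illustrative, and a real analysis would use a complete alias table from a gene nomenclature resource.

```python
# Alias -> standard symbol, limited to the pairs mentioned in this post.
ALIASES = {
    "OCT4": "POU5F1",
    "STELLA": "DPPA3",
    "BLIMP1": "PRDM1",
    "PLZF": "ZBTB16",
    "SCF": "KITLG",
}


def normalize(symbols):
    """Replace known aliases with standard symbols; leave other names unchanged."""
    return [ALIASES.get(s, s) for s in symbols]


# Spermatogonial stem cell markers: ChatGPT's list vs. the prompt row.
chatgpt_markers = ["NANOS2", "PLZF", "GFRA1", "RET", "STRA8"]
prompt_markers = ["STK31", "ZBTB16", "DDX4", "SSEA4", "NANOS2"]

shared = sorted(set(normalize(chatgpt_markers)) & set(prompt_markers))
print(shared)  # prints ['NANOS2', 'ZBTB16']
```

Without normalization, the overlap would miss ZBTB16 because ChatGPT reported it as PLZF.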
These were my responses from GPT-4 (I did indeed get primordial germ cells for #1):
These are marker genes for different types of human cells. Here are the cell types that these marker genes are associated with:
SOX17, POU5F1, NANOS3, PRDM1, NANOG, CD38 - Primordial Germ Cells
POU5F1, SOX2, KLF4, ITGA6, NANOG - Induced Pluripotent Stem Cells
SOX17, FOXA2, CXCR4, GATA4 - Definitive Endoderm Cells
FOXL2, AMHR2, CD82, NR5A1, FSHR, GATA4 - Granulosa Cells
ZP3, DPPA3, DDX4, NPM2, ZP2 - Oocytes
FOXL2, FSHB, NR5A1, PITX1, GNRHR - Pituitary Gonadotrope Cells
STK31, ZBTB16, DDX4, SSEA4, NANOS2 - Spermatogonial Stem Cells
NR2F2, CYP17A1, STAR, LHCGR, GLI1, HSD3B - Leydig Cells
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1 - These genes are not specific to a single cell type and are expressed in various tissues.
SYCP1, TEX12, REC8, SPO11, SYCP3 - Meiotic Cells
NR5A1, SOX9, FSHR, GATA4 - Sertoli Cells
OTX2, SOX1, TUBB3, PAX6 - Neural Progenitor Cells
Please note that these are general associations and the presence of these genes does not definitively determine the cell type. The exact cell type can depend on the expression levels of these genes and the presence of other marker genes.
Here are GPT-4's responses:
Prompt 1:
Pluripotent Stem Cells
Embryonic Stem Cells
Endoderm Cells
Granulosa Cells
Oocytes
Gonadotroph Cells
Spermatogonial Stem Cells
Leydig Cells
Neuronal Cells
Meiotic Cells
Sertoli Cells
Neural Progenitor Cells
Prompt 2:
Pluripotent Stem Cells
Embryonic Stem Cells
Endoderm Cells
Granulosa Cells
Oocytes
Pituitary Gonadotrope Cells
Germ Cells
Leydig Cells
unknown
Meiotic Cells
Sertoli Cells
Neural Progenitor Cells
Prompt 3:
Based on the marker genes provided for each row and without conducting a detailed gene expression analysis or referencing a specific database, here's an informed guess for the cell types based on common knowledge in cell biology and gene function:
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1 - Given these genes, this profile does not clearly point to a specific, well-known cell type without further context. CNTNAP4 is involved in neuronal activity, but the combination of genes does not provide a clear match. unknown
IL9, SLC30A4, SX18P8, CHRDL2, SMUG1P1 - IL9 is a cytokine involved in immune responses, suggesting these cells might be related to the immune system. However, without a clear connection between all the markers, pinpointing a specific immune cell type is challenging. unknown
HCST, EXOSC8, ORC3, CIDECP2, DNM2 - HCST is involved in the activation of natural killer (NK) cells, suggesting this could be a profile related to NK cells or another cell type involved in the immune response. Natural Killer Cell
DTL, U3, DDX28, WDFY3, PPP1R2P4 - This combination of genes does not correspond to a specific cell type that can be easily identified without further information. These genes are involved in various cellular processes, including DNA repair and synthesis, making it hard to narrow down. unknown
LTK, STK32C, SMIM9, DPPA3P10, MTCO1P12 - LTK is a receptor tyrosine kinase that has been implicated in neuronal development and other processes, but the combination of genes provided does not specify a particular cell type without additional context. unknown
These guesses are made with caution, as identifying cell types solely based on a small set of marker genes without additional context (such as expression levels, cell function, or tissue of origin) can be quite speculative and may not accurately represent the cell types.
Prompt 3 without unknown sentence:
Identifying cell types based on marker gene expression is a nuanced task that requires a deep understanding of gene expression patterns across different tissues and developmental stages. However, I'll provide a general identification of cell types based on the marker genes listed, keeping in mind that the expression of these genes can vary under different conditions and might not be exclusive to a single cell type. Here are potential cell types based on the listed marker genes:
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1 - These genes are involved in various cellular functions, including cell adhesion, signaling, and metabolism. Neurons or neural progenitor cells might express these, particularly due to CNTNAP4's involvement in the nervous system.
IL9, SLC30A4, SX18P8, CHRDL2, SMUG1P1 - IL9 is a cytokine that plays roles in various aspects of the immune response. This set of genes could be indicative of T helper cells, specifically Th9 cells, given IL9's role in these cells.
HCST, EXOSC8, ORC3, CIDECP2, DNM2 - HCST is involved in the immune system's activation and signaling pathways. This gene set might suggest Natural Killer (NK) cells due to HCST's significance in NK cell-mediated cytotoxicity.
DTL, U3, DDX28, WDFY3, PPP1R2P4 - These genes are involved in DNA replication, repair, and cell cycle regulation. They could be indicative of cells with high proliferative capacity, such as stem cells or progenitor cells in rapidly renewing tissues.
LTK, STK32C, SMIM9, DPPA3P10, MTCO1P12 - LTK is a receptor tyrosine kinase involved in various cell signaling pathways. This set of genes could suggest endothelial cells due to their roles in vascular development and maintenance.
It's important to note that this identification is very general and should be validated with more specific functional assays or gene expression analyses in the context of the tissue of interest.