|
--- |
|
tags: |
|
- model_hub_mixin |
|
- pytorch_model_hub_mixin |
|
--- |
|
|
|
# sCellTransformer |
|
|
|
sCellTransformer (sCT) is a long-range foundation model designed for zero-shot
prediction tasks on single-cell RNA-seq and spatial transcriptomics data. It processes
raw gene expression profiles across multiple cells to predict discretized gene
expression levels for unseen cells without retraining. The model handles up to 20,000
protein-coding genes over a bag of 50 cells from the same sample, i.e. a context of
around one million gene-expression tokens. This long context allows it to learn
cross-cell relationships, capture long-range dependencies in gene expression data,
and mitigate the sparsity typical of single-cell datasets.
|
|
|
sCT is pre-trained on a large single-cell RNA-seq dataset and fine-tuned on spatial
transcriptomics data. Evaluation tasks include zero-shot imputation of masked gene
expression and zero-shot prediction of cell types.
|
|
|
**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) |
|
- **Paper:** [A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data](https://openreview.net/pdf?id=VdX9tL3VXH) |
|
|
|
### How to use |
|
|
|
Until its next release, the `transformers` library needs to be installed from source
with the following command in order to use the model. PyTorch is also required.
|
|
|
```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```
|
|
|
The snippet below shows how to run inference with the model on random input.
|
|
|
```
import torch
from transformers import AutoModel

# Load the model; trust_remote_code is required because the architecture is
# defined in the repository's custom modeling code.
model = AutoModel.from_pretrained(
    "InstaDeepAI/sCellTransformer",
    trust_remote_code=True,
)

# The input is a flat sequence of binned expression levels: 19,968
# protein-coding genes for each of the `num_cells` cells in the bag,
# i.e. roughly one million tokens in total.
num_cells = model.config.num_cells
dummy_gene_expressions = torch.randint(0, 5, (1, 19968 * num_cells))
torch_output = model(dummy_gene_expressions)
```
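
The structure of the returned object is defined by the model's remote code. As a
minimal sketch, assuming the output exposes logits over the discretized expression
bins with shape `(batch, sequence_length, num_bins)` (an assumption to verify against
the actual remote code), the predicted bin per gene position could be recovered like
this:

```
# Sketch only: assumes the forward pass yields logits over expression bins
# (shape: batch x sequence_length x num_bins). Check the remote code for the
# actual output structure before relying on this.
logits = torch_output.logits if hasattr(torch_output, "logits") else torch_output
predicted_bins = logits.argmax(dim=-1)  # most likely bin index per gene position
```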
|
|
|
A more complete example is provided in the example notebook, which covers one of the
downstream evaluation datasets.
|
|
|
#### Training data |
|
|
|
The model was trained following a two-step procedure:
pre-training on single-cell data, then fine-tuning on spatial transcriptomics data.
The single-cell data used for pre-training comes from the
[Cellxgene Census collection datasets](https://cellxgene.cziscience.com/)
used to train the scGPT models. It consists of around 50 million cells and
approximately 60,000 genes. The spatial data comes from both the
[human breast cell atlas](https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731)
and the [human heart atlas](https://www.heartcellatlas.org/).
|
|
|
#### Training procedure |
|
|
|
As detailed in the paper, the gene expressions are first discretized into a
pre-defined number of bins. Binning helps the model learn the distribution of gene
expressions by mitigating sparsity, reducing noise, and handling extreme values. The
training objective is then to predict the masked gene expression bins within a cell,
following a BERT-style masked-prediction setup.
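
To make the preprocessing concrete, here is a minimal sketch of equal-frequency
(quantile) binning in PyTorch. The function name, the five-bin count, and the
zeros-to-bin-0 convention are illustrative placeholders; the exact scheme used by sCT
is the one described in the paper.

```
import torch

def bin_expressions(values: torch.Tensor, num_bins: int = 5) -> torch.Tensor:
    """Illustrative quantile binning: zeros map to bin 0, positive values
    to bins 1..num_bins-1 (placeholder scheme, not sCT's exact one)."""
    positive = values[values > 0]
    # Interior quantiles of the positive values define the bin edges, so the
    # abundant zeros (dropout) do not dominate every bin.
    edges = torch.quantile(positive, torch.linspace(0, 1, num_bins)[1:-1])
    bins = torch.zeros_like(values, dtype=torch.long)
    bins[values > 0] = torch.bucketize(values[values > 0], edges) + 1
    return bins

counts = torch.tensor([0.0, 0.0, 1.0, 3.0, 7.0, 20.0])
print(bin_expressions(counts))  # tensor([0, 0, 1, 2, 3, 4])
```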
|
|
|
### BibTeX entry and citation info |
|
|
|
```
@misc{joshi2025a,
  title={A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data},
  author={Ameya Joshi and Raphael Boige and Lee Zamparo and Ugo Tanielian and Juan Jose Garau-Luis and Michail Chatzianastasis and Priyanka Pandey and Janik Sielemann and Alexander Seifert and Martin Brand and Maren Lang and Karim Beguir and Thomas Pierrot},
  year={2025},
  url={https://openreview.net/forum?id=VdX9tL3VXH}
}
```