Yanisadel and utanielian committed on commit 2c76881 (verified) · Parent(s): 67368f1

Update README.md (#1)

Co-authored-by: Tanielian <utanielian@users.noreply.huggingface.co>

Files changed (1): README.md (+58 −11)
README.md (after change):
---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# sCellTransformer

sCellTransformer (sCT) is a long-range foundation model designed for zero-shot prediction tasks in single-cell RNA-seq and spatial transcriptomics data. It processes raw gene expression profiles across multiple cells to predict discretized gene expression levels for unseen cells without retraining. The model handles up to 20,000 protein-coding genes and a bag of 50 cells from the same sample at once. This long context (around one million gene expression tokens) allows it to learn cross-cell relationships, capture long-range dependencies in gene expression data, and mitigate the sparsity typical of single-cell datasets.

sCT is pre-trained on a large single-cell RNA-seq dataset and fine-tuned on spatial transcriptomics data. Evaluation tasks include zero-shot imputation of masked gene expression and zero-shot prediction of cell types.
**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

<!-- Provide the basic links for the model. -->

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data](https://openreview.net/pdf?id=VdX9tL3VXH)
### How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the model. PyTorch should also be installed.

```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```

A small code snippet is given below to run inference with the model on random input.
```
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    ...
)

num_cells = model.config.num_cells
dummy_gene_expressions = torch.randint(0, 5, (1, 19968 * num_cells))
torch_output = model(dummy_gene_expressions)
```

A more concrete example is provided in the example notebook, which runs on one of the downstream evaluation datasets.
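To make the input layout concrete, here is a minimal sketch of how per-cell binned expressions could be flattened into the single token sequence the model consumes. The `(num_cells, num_genes)` matrix and the bag size of 50 are assumptions based on the model description above, not an official preprocessing recipe:

```python
import torch

num_cells = 50      # assumed bag size, per the model description above
num_genes = 19968   # protein-coding genes per cell, as in the snippet above

# Hypothetical binned expression matrix: one row per cell, one column per
# gene, with expression levels already discretized into bins 0..4.
binned = torch.randint(0, 5, (num_cells, num_genes))

# The model consumes one flat token sequence per sample.
tokens = binned.reshape(1, num_cells * num_genes)
print(tokens.shape)  # torch.Size([1, 998400])
```

This is where the "around one million gene expression tokens" figure comes from: 19,968 genes × 50 cells = 998,400 tokens per sample.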
#### Training data

The model was trained following a two-step procedure: pre-training on single-cell data, then fine-tuning on spatial transcriptomics data. The single-cell data used for pre-training comes from the [Cellxgene Census collection datasets](https://cellxgene.cziscience.com/) used to train the scGPT models; it consists of around 50 million cells and approximately 60,000 genes. The spatial data comes from both the [human breast cell atlas](https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731) and [the human heart atlas](https://www.heartcellatlas.org/).
#### Training procedure

As detailed in the paper, the gene expressions are first binned into a pre-defined number of bins. This helps the model learn the distribution of gene expressions by mitigating sparsity, reducing noise, and handling extreme values. The training objective is then to predict the masked gene expression values in a cell, following a BERT-style masked-prediction scheme.
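The exact binning scheme is not spelled out in this card. As an illustrative sketch only, the step could look like equal-width binning over log1p-transformed counts, normalized per cell; the procedure actually used in the paper may differ:

```python
import torch

def bin_expressions(counts: torch.Tensor, num_bins: int = 5) -> torch.Tensor:
    """Illustrative sketch only: discretize raw counts into integer bins
    via equal-width binning of log1p-transformed values, per cell."""
    log_counts = torch.log1p(counts.float())
    # Normalize each cell's values to [0, 1] before quantizing.
    max_per_cell = log_counts.max(dim=-1, keepdim=True).values.clamp(min=1e-8)
    normed = log_counts / max_per_cell
    return (normed * (num_bins - 1)).round().long()

raw = torch.tensor([[0.0, 1.0, 10.0, 100.0]])
print(bin_expressions(raw))  # tensor([[0, 1, 2, 4]])
```

Note how the log transform compresses the extreme value (100) into the top bin rather than letting it dominate the scale, which is one way binning handles the long-tailed count distributions typical of single-cell data.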
### BibTeX entry and citation info

```
@misc{joshi2025a,
  title={A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data},
  author={Ameya Joshi and Raphael Boige and Lee Zamparo and Ugo Tanielian and Juan Jose Garau-Luis and Michail Chatzianastasis and Priyanka Pandey and Janik Sielemann and Alexander Seifert and Martin Brand and Maren Lang and Karim Beguir and Thomas PIERROT},
  year={2025},
  url={https://openreview.net/forum?id=VdX9tL3VXH}
}
```