---
license: mit
pipeline_tag: tabular-regression
tags:
- chemistry
- microbiology
- antibiotics
library_name: duvida
datasets:
- scbirlab/thomas-2018-spark-wt
---

# Predictor of _Staphylococcus aureus_ MICs

_Updated:_ Tue 1 Apr 08:02:52 BST 2025

Trained on the _Staphylococcus aureus_ subset, with wild-type (WT) accumulation phenotype, of the [human-curated SPARK dataset](https://doi.org/10.1021/acsinfecdis.8b00193) (2115 rows in total for _Staphylococcus aureus_).

## Model details

This model was trained using [our Duvida framework](https://github.com/scbirlab/duvida).
It is the best regression model from a hyperparameter search, selected by Pearson's $$r$$
on a held-out test set (from a scaffold split) that was not used in training or early stopping.

Duvida also saves the training data in this checkpoint, which allows uncertainty metrics
based on that training data to be calculated.

### Model architecture

- **Regression**

```json
{
    "dropout": 0.0,
    "ensemble_size": 3,
    "extra_featurizers": null,
    "learning_rate": 1e-05,
    "model_class": "ChempropModelBox",
    "n_hidden": 5,
    "n_units": 8,
    "use_2d": true,
    "use_fp": true
}
```

### Model usage

You can use this model with:

```python
from duvida.autoclasses import AutoModelBox

modelbox = AutoModelBox.from_pretrained("hf://scbirlab/spark-dv-2503-saur")
modelbox.predict(filename=..., inputs=[...], columns=[...])  # make predictions on your own data
```

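
Predictions are on the scale of the training output column, `pmic` (see Training details). As a minimal sketch, and assuming that `pmic` is the negative base-10 logarithm of the MIC in mol/L (this definition is an assumption and is not stated in this card, so check the dataset card first), predicted values could be converted back to concentrations as shown here. The helper names `pmic_to_molar` and `pmic_to_ug_per_ml` are hypothetical and are not part of Duvida.

```python
def pmic_to_molar(pmic: float) -> float:
    """Convert a pMIC value to a MIC in mol/L.

    ASSUMPTION: pMIC = -log10(MIC in mol/L). This definition is not
    stated in the model card; verify it against the dataset card.
    """
    return 10.0 ** (-pmic)


def pmic_to_ug_per_ml(pmic: float, mol_weight: float) -> float:
    """Convert a pMIC value to a MIC in micrograms per mL.

    mol_weight is the molecular weight in g/mol.
    mol/L * g/mol = g/L, and 1 g/L = 1000 ug/mL.
    """
    return pmic_to_molar(pmic) * mol_weight * 1000.0


if __name__ == "__main__":
    # Illustrative inputs only, not model output.
    print(pmic_to_molar(6.0))             # 1 micromolar, expressed in mol/L
    print(pmic_to_ug_per_ml(6.0, 350.0))  # the same MIC in ug/mL for a 350 g/mol compound
```

Under this assumption, a higher predicted `pmic` corresponds to a lower MIC, that is, a more potent compound.
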
## Training details

- **Dataset:** [SPARK, WT accumulator, _Staphylococcus aureus_ subset](https://huggingface.co/datasets/scbirlab/thomas-2018-spark-wt) (2115 rows in total for _Staphylococcus aureus_)
- **Input column:** smiles
- **Output column:** pmic
- **Split type:** Murcko scaffold
- **Split proportions:**
    - 70% training (1424 rows)
    - 15% validation, used for early stopping (309 rows)
    - 15% test, used for selecting hyperparameters (316 rows)

Here is the training log:

<img src="training-log.png" width=450>

And these are the evaluation scores.

Train (1424 rows):

```json
{
    "Pearson r": 0.9141987685996613,
    "RMSE": 0.238382488489151,
    "Spearman rho": 0.8198319253295027
}
```

<img src="predictions_train.png" width=450>

Validation (309 rows):

```json
{
    "Pearson r": 0.9432814998253994,
    "RMSE": 0.3496144711971283,
    "Spearman rho": 0.8553478966171193
}
```

<img src="predictions_validation.png" width=450>

Test (316 rows):

```json
{
    "Pearson r": 0.7588797018977873,
    "RMSE": 0.7793745398521423,
    "Spearman rho": 0.8158068476381244
}
```

<img src="predictions_test.png" width=450>

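
This card does not state how the three scores were computed. For readers who want to compare their own predictions against them, the sketch below implements the textbook definitions of Pearson's r, Spearman's rho (Pearson's r on average ranks), and RMSE using only the standard library. The function names are illustrative and are not Duvida's API.

```python
import math
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson's r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the inputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


if __name__ == "__main__":
    # Toy values for illustration only, not taken from the model.
    observed = [4.0, 5.0, 6.0, 7.0]
    predicted = [4.5, 5.0, 5.5, 8.0]
    print(pearson_r(observed, predicted))
    print(spearman_rho(observed, predicted))
    print(rmse(observed, predicted))
```

Because RMSE is expressed in the units of the output column, the values above are in `pmic` units, while the two correlation coefficients are unitless.
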
## Training data details

The training data were collated by the authors of:

> Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell
> Shared Platform for Antibiotic Research and Knowledge: A Collaborative Tool to SPARK Antibiotic Discovery
> ACS Infectious Diseases 2018 4 (11), 1536-1539
> DOI: 10.1021/acsinfecdis.8b00193

We cleaned the original SPARK dataset by keeping the most relevant columns, removing empty values,
giving the columns succinct titles, and splitting the data by species.

This particular dataset retains only measurements on bacteria with wild-type accumulation phenotypes.

### Dataset Sources

- **Repository:** https://www.collaborativedrug.com/spark-data-downloads
- **Paper:** https://doi.org/10.1021/acsinfecdis.8b00193

### Data Collection and Processing

Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.

The SMILES strings have been canonicalized. For each species with more than 1000 entries, the data have been
split by Murcko scaffold into training (70%), validation (15%), and test (15%) sets. Additional features such as
molecular weight and topological polar surface area have also been calculated.

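
A scaffold split keeps every molecule that shares a Murcko scaffold in the same subset, so the validation and test sets contain scaffolds that were not seen in training. The sketch below illustrates the idea with a greedy, largest-group-first assignment. It is not schemist's implementation: the function name `scaffold_split` and the greedy strategy are assumptions made for illustration, and the scaffold strings are assumed to be precomputed (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def scaffold_split(
    scaffolds: Sequence[str],
    fractions: Dict[str, float],
) -> Dict[str, List[int]]:
    """Assign row indices to subsets so that no scaffold spans two subsets.

    scaffolds: one precomputed scaffold string per row.
    fractions: target fraction per subset, e.g. {"train": 0.7, ...}.
    """
    # Group row indices by their scaffold string.
    groups = defaultdict(list)
    for i, scaffold in enumerate(scaffolds):
        groups[scaffold].append(i)

    n_rows = len(scaffolds)
    targets = {name: frac * n_rows for name, frac in fractions.items()}
    split: Dict[str, List[int]] = {name: [] for name in fractions}

    # Greedy assignment: place the largest groups first, each into the
    # subset that is currently furthest below its target size.
    for members in sorted(groups.values(), key=len, reverse=True):
        name = max(split, key=lambda s: targets[s] - len(split[s]))
        split[name].extend(members)
    return split


if __name__ == "__main__":
    # Toy scaffold strings for illustration only.
    toy = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 2 + ["C1CCCCC1"] * 2 + ["c1ccoc1"]
    result = scaffold_split(toy, {"train": 0.7, "validation": 0.15, "test": 0.15})
    for name, rows in result.items():
        print(name, len(rows))
```

Because whole scaffold groups are assigned at once, the resulting subset sizes only approximate the requested fractions, which is consistent with the row counts reported under Training details not being exact multiples of 70% and 15%.
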
### Who are the source data producers?

Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell