	---
	license: mit
	pipeline_tag: tabular-regression
	tags:
	- chemistry
	- microbiology
	- antibiotics
	library_name: duvida
	datasets:
	- scbirlab/thomas-2018-spark-wt
	---

	# Predictor of _Staphylococcus aureus_ MICs

	_Updated:_ Tue 1 Apr 08:02:52 BST 2025

	Trained on the _Staphylococcus aureus_, WT accumulator phenotype subset of the [human-curated SPARK dataset](https://doi.org/10.1021/acsinfecdis.8b00193) (2115 rows in total for _Staphylococcus aureus_).

	## Model details

	This model was trained using [our Duvida framework](https://github.com/scbirlab/duvida).
	It is the best regression model from a hyperparameter search, selected by Pearson's $$r$$
	on a held-out test set (from a scaffold split) that was not used in training or early stopping.

	Duvida also saves the training data in this checkpoint to allow the calculation of uncertainty metrics
	based on that training data.

	### Model architecture

	- Regression

	```json
	{
	  "dropout": 0.0,
	  "ensemble_size": 3,
	  "extra_featurizers": null,
	  "learning_rate": 1e-05,
	  "model_class": "ChempropModelBox",
	  "n_hidden": 5,
	  "n_units": 8,
	  "use_2d": true,
	  "use_fp": true
	}
	```
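
The configuration above sets `ensemble_size` to 3. This card does not state how Duvida combines the ensemble members, so the following is only a generic sketch of one common approach: average the members' predictions for each molecule and report their standard deviation as a rough indicator of spread. The function name and the example numbers are hypothetical, and this is separate from the training-data-based uncertainty metrics mentioned earlier.

```python
from statistics import mean, stdev

def aggregate_ensemble(member_predictions):
    """Combine per-member predictions into a mean and a spread per molecule.

    member_predictions: one list of predictions per ensemble member,
    all of the same length (one value per molecule).
    Assumption: simple averaging; Duvida's actual aggregation may differ.
    """
    per_molecule = list(zip(*member_predictions))  # regroup values by molecule
    means = [mean(values) for values in per_molecule]
    spreads = [stdev(values) for values in per_molecule]  # sample standard deviation
    return means, spreads

# Hypothetical predictions from three members for two molecules.
members = [
    [5.1, 6.0],
    [4.9, 6.4],
    [5.0, 5.6],
]
means, spreads = aggregate_ensemble(members)
print(means, spreads)
```

A larger spread means the members disagree more about that molecule.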

	### Model usage

	You can use this model with:

	```python
	from duvida.autoclasses import AutoModelBox
	modelbox = AutoModelBox.from_pretrained("hf://scbirlab/spark-dv-2503-saur")
	modelbox.predict(filename=..., inputs=[...], columns=[...]) # make predictions on your own data
	```

	## Training details

	- Dataset: [SPARK, WT accumulator, _Staphylococcus aureus_ subset](https://huggingface.co/datasets/scbirlab/thomas-2018-spark-wt) (2115 rows in total for _Staphylococcus aureus_)
	- Input column: smiles
	- Output column: pmic
	- Split type: Murcko scaffold
	- Split proportions:
	  - 70% training (1424 rows)
	  - 15% validation, used for early stopping (309 rows)
	  - 15% test, used for selecting hyperparameters (316 rows)
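
A scaffold split keeps every molecule that shares a scaffold in the same partition, so validation and test sets contain scaffolds absent from training. The sketch below shows one common greedy way to do this. It is not necessarily the algorithm used to produce the split above, and it assumes each molecule's Murcko scaffold has already been computed (for example with RDKit). The function name and the toy scaffold labels are illustrative.

```python
from collections import defaultdict

def scaffold_split(scaffolds, fractions=(0.70, 0.15, 0.15)):
    """Greedy group split returning (train, validation, test) index lists.

    scaffolds: one precomputed scaffold string per molecule.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n = len(scaffolds)
    train_cap = round(fractions[0] * n)
    valid_cap = round((fractions[0] + fractions[1]) * n)

    train, valid, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cap:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test

# Toy data: 20 molecules across five hypothetical scaffolds.
scaffolds = ["A"] * 8 + ["B"] * 6 + ["C"] * 3 + ["D"] * 2 + ["E"]
train, valid, test = scaffold_split(scaffolds)
print(len(train), len(valid), len(test))  # 14 3 3
```

Because whole groups are assigned at once, the realised proportions only approximate the requested 70/15/15.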

	Here is the training log:

	<img src="training-log.png" width=450>

	And these are the evaluation scores.

	Train (1424 rows):

	```json
	{
	  "Pearson r": 0.9141987685996613,
	  "RMSE": 0.238382488489151,
	  "Spearman rho": 0.8198319253295027
	}
	```

	<img src="predictions_train.png" width=450>

	Validation (309 rows):

	```json
	{
	  "Pearson r": 0.9432814998253994,
	  "RMSE": 0.3496144711971283,
	  "Spearman rho": 0.8553478966171193
	}
	```

	<img src="predictions_validation.png" width=450>

	Test (316 rows):

	```json
	{
	  "Pearson r": 0.7588797018977873,
	  "RMSE": 0.7793745398521423,
	  "Spearman rho": 0.8158068476381244
	}
	```

	<img src="predictions_test.png" width=450>
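
The three scores reported above can be computed from paired observed and predicted values. The following is a minimal pure-Python sketch for checking predictions on your own data. It is not Duvida's evaluation code, and the function names and example values are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def rmse(observed, predicted):
    """Root-mean-square error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed and predicted values.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.2, 4.9, 6.5, 6.8]
print({
    "Pearson r": pearson_r(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "Spearman rho": spearman_rho(observed, predicted),
})
```

Pearson's r measures linear agreement, while Spearman's rho compares only rankings, so the two can diverge, as they do on the test set.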

	## Training data details

	The training data were collated by the authors of:

	> Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell
	> Shared Platform for Antibiotic Research and Knowledge: A Collaborative Tool to SPARK Antibiotic Discovery
	> ACS Infectious Diseases 2018 4 (11), 1536-1539
	> DOI: 10.1021/acsinfecdis.8b00193

	We cleaned the original SPARK dataset to subset the most relevant columns, remove empty values,
	give succinct column titles, and split by species.

	This particular dataset retains only measurements on bacteria with wild-type accumulation phenotypes.

	### Dataset Sources

	- Repository: https://www.collaborativedrug.com/spark-data-downloads
	- Paper: https://doi.org/10.1021/acsinfecdis.8b00193

	### Data Collection and Processing

	Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.

	The SMILES strings have been canonicalized, and split into training (70%), validation (15%), and test (15%) sets
	by Murcko scaffold for each species with more than 1000 entries. Additional features like molecular weight and
	topological polar surface area have also been calculated.

	### Who are the source data producers?

	Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell