Showcase official/verified results
We can probably mark/colorize the official/verified results, and note their count as well like this:
- Total Models: 30 (24 official + 6 self-reported)
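A minimal sketch of how that count line could be computed, assuming the leaderboard table is a pandas DataFrame with a hypothetical `is_official` flag (the column name is made up for illustration):

```python
import pandas as pd

# Toy leaderboard data; `is_official` is a hypothetical flag,
# not a column that exists on the real leaderboard.
df = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "is_official": [True, True, False],
})

official = int(df["is_official"].sum())
self_reported = len(df) - official
print(f"Total Models: {len(df)} ({official} official + {self_reported} self-reported)")
```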
Wdyt @Muennighoff ?
I think this is a good idea, but would wait for a few actual non-official results to come in
What does "official" mean in this context?
All new submissions (we receive at least a couple a week) typically go through the following:
- Add an implementation to `mteb` (typically they will just use the wrapper for sentence transformers, so they only have to supply the metadata)
- Using the implementation, run the evaluation (sketched below), though they could change the implementation or the results afterwards
- Submit the results to `embeddings-benchmark/results`, where we do a review that checks for outliers (we have discovered a few cases where providers forgot to tell us that they trained on one of the datasets)
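As a rough sketch of the evaluation step, assuming a recent `mteb` version; the model and task names are illustrative, not a prescribed submission script:

```python
from sentence_transformers import SentenceTransformer

import mteb

# Load the model through sentence-transformers (the common path for
# submissions that only supply metadata); model name is illustrative.
model = SentenceTransformer("intfloat/e5-small-v2")

# Task selection is illustrative; a real submission runs the full benchmark.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Writes per-task result JSON files, which are then submitted to
# embeddings-benchmark/results for review.
evaluation.run(model, output_folder="results/e5-small-v2")
```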
We do not track who evaluated the model.
We have a few historic data points that are completely self-reported without validation (submitted by pushing results to the model card); this submission process is no longer possible.
We could add a symbol for "Reproducible"
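A tiny sketch of one way to render such a flag (a hypothetical helper, not existing leaderboard code):

```python
def display_name(model: str, reproducible: bool) -> str:
    # The check mark is just one possible symbol choice.
    return f"✓ {model}" if reproducible else model

print(display_name("model-a", True))   # ✓ model-a
print(display_name("model-b", False))  # model-b
```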
I think since implementations are now required, it is probably fine to close this! I guess things that turn out to be non-reproducible might be flagged by users and then removed unless fixed anyway.
Yeah, if people find stuff that doesn't reproduce, we either rerun it or remove it (after giving the authors a chance to fix it).