from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    # Task(benchmark key, metric key, column name shown on the leaderboard)
    task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS")
    task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
    task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
    task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
    task4 = Task("icelandic_inflection_hard", "json_metric,get-answer", "Inflection (rare)")
    task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele (IS)")
    task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC-Challenge-IS")


NUM_FEWSHOT = 0


TITLE = """<h1 align="center" id="space-title">Icelandic LLM leaderboard</h1>"""

INTRODUCTION_TEXT = """
"""

LLM_BENCHMARKS_TEXT = f"""
## Benchmark tasks
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
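
As a rough illustration of how a generation task can be scored, the sketch below compares a generated answer to the expected answer after light normalization. The function names and the normalization steps are assumptions made for this example; the leaderboard's actual `exact_match` metric and `get-answer` filter may behave differently.

```python
def exact_match(generated: str, expected: str) -> bool:
    # Hypothetical normalization: trim surrounding whitespace and ignore case.
    return generated.strip().casefold() == expected.strip().casefold()


def accuracy(pairs) -> float:
    # pairs: iterable of (generated, expected) string tuples.
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(g, e) for g, e in pairs) / len(pairs)


examples = [("  Jón ", "Jón"), ("María", "Jón")]
print(accuracy(examples))  # 0.5
```

A model that answers correctly but wraps the answer in extra explanation would score zero under a check like this, which is one reason instruction-following matters here.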

The following tasks are evaluated:

### WinoGrande-IS
The Icelandic WinoGrande task is a human-translated and localized version of the roughly 1,000 test-set examples from the English WinoGrande task.
Each example consists of a sentence with a blank and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
The benchmark is designed to test the model's ability to use knowledge and common-sense reasoning in Icelandic.
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande

### GED
This benchmark for binary sentence-level Icelandic grammatical error detection is adapted from the Icelandic Error Corpus (IEC) and contains 200 examples.
Each example consists of a sentence that may contain one or more grammatical errors, and the task is to predict whether the sentence contains an error.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-sentences-gec

### Inflection benchmarks
The inflection benchmarks test the model's ability to generate inflected forms of Icelandic adjective-noun pairs. They are divided into three levels of difficulty by
commonness: common (100 examples), uncommon (100 examples), and rare (100 examples). The model gets a point for an example if it generates valid JSON with the
correct inflected forms in all cases, singular and plural.
- Link to dataset (common): https://huggingface.co/datasets/mideind/icelandic-inflection-easy
- Link to dataset (uncommon): https://huggingface.co/datasets/mideind/icelandic-inflection-medium
- Link to dataset (rare): https://huggingface.co/datasets/mideind/icelandic-inflection-hard
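
The all-or-nothing scoring rule for the inflection tasks can be sketched roughly as follows. This is an illustration only: the function name and the key names (`nom_sg`, `acc_sg`) are hypothetical, and the real datasets and the `json_metric` implementation may use a different schema.

```python
import json


def score_inflection(generated: str, expected: dict) -> int:
    # Returns 1 only if the output parses as JSON and every expected form matches.
    try:
        parsed = json.loads(generated)
    except json.JSONDecodeError:
        return 0  # malformed JSON earns no point
    if not isinstance(parsed, dict):
        return 0
    return int(all(parsed.get(key) == form for key, form in expected.items()))


# Hypothetical keys; only two of the forms are shown for brevity.
expected = dict(nom_sg="góður hestur", acc_sg="góðan hest")
good = json.dumps(expected, ensure_ascii=False)
bad = json.dumps(dict(expected, acc_sg="góður hestur"), ensure_ascii=False)

print(score_inflection(good, expected))        # 1
print(score_inflection(bad, expected))         # 0
print(score_inflection("not json", expected))  # 0
```

A single wrong form, or output that is not parseable JSON, costs the whole point for that example.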

### Belebele (IS)
This is the Icelandic subset (900 examples) of the Belebele benchmark, a multiple-choice reading comprehension task. The task is to answer questions about a given passage.
- Link to dataset: https://huggingface.co/datasets/facebook/belebele

### ARC-Challenge-IS
A machine-translated version of the ARC-Challenge multiple-choice question-answering dataset. For this benchmark, we use the test set which contains 1.23k examples.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-arc-challenge

"""

EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch, tag, or commit hash of the model repo
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay tuned!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
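
To illustrate why safetensors makes parameter counts easy to obtain, here is a minimal sketch that reads only the file header, without loading any weights. It relies on the documented file layout (an 8-byte little-endian header length followed by a JSON header); the function name and the file path are placeholders for this example, and this is not necessarily how the leaderboard computes the number.

```python
import json
import struct


def count_parameters(path: str) -> int:
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header that lists each tensor's dtype and shape.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total


# "model.safetensors" is a placeholder path for this example.
# print(count_parameters("model.safetensors"))
```

For a sharded checkpoint, the same function would need to be called on each shard and the results summed.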

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill out your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If you have, check that you can run the EleutherAI LM Evaluation Harness on your model locally (you can add `--limit` to limit the number of examples per task).
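
A local run could be assembled along the following lines. This is a sketch under assumptions: it builds an `lm_eval` command line without executing it, the helper name and the model name are placeholders, and it assumes the Icelandic tasks are registered under these names in your harness installation.

```python
import shlex


def build_lm_eval_command(model_name, tasks, revision="main", limit=10):
    # Assemble an lm-evaluation-harness CLI call; nothing is executed here.
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name},revision={revision}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", "0",
        "--limit", str(limit),
    ]


cmd = build_lm_eval_command("your-org/your-model", ["icelandic_belebele"])
print(shlex.join(cmd))
```

Copy the printed command into a shell, or pass `cmd` to `subprocess.run`, once the harness is installed. Drop `--limit` for a full evaluation.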
"""