from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # (task key in the results JSON, metric key in the results JSON, column name shown on the leaderboard)
    task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS")
    task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
    task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
    task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
    task4 = Task("icelandic_inflection_hard", "json_metric,get-answer", "Inflection (rare)")
    task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele (IS)")
    task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC-Challenge-IS")

NUM_FEWSHOT = 0  # Number of few-shot examples; 0 means zero-shot evaluation
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Icelandic LLM leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
"""

# Which evaluations are you running? How can people reproduce your results?
LLM_BENCHMARKS_TEXT = """
## Benchmark tasks
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
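
As a rough illustration of this kind of comparison, the sketch below scores a generated answer by normalized string equality. The function names and the normalization (trimming whitespace, ignoring case) are assumptions made for illustration, not the leaderboard's actual scoring code.

```python
def normalize(text: str) -> str:
    # Assumed normalization: trim surrounding whitespace and ignore case.
    return text.strip().casefold()


def exact_match(prediction: str, expected: str) -> bool:
    # A prediction scores only if it equals the expected answer after normalization.
    return normalize(prediction) == normalize(expected)


def accuracy(predictions: list[str], references: list[str]) -> float:
    # Fraction of examples whose prediction matches its reference.
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this sketch, any extra text around the answer makes the comparison fail, which is one way a model that does not follow the requested output format can lose points.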

The following tasks are evaluated:

### WinoGrande-IS
The Icelandic WinoGrande task is a human-translated and localized version of the ~1,000 test-set examples from the English WinoGrande task.
Each example consists of a sentence with a blank and two candidate answers for the blank. The task is to choose the correct candidate, which requires coreference resolution.
The benchmark is designed to test the model's ability to use knowledge and common-sense reasoning in Icelandic.
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande

### GED
This benchmark for binary sentence-level Icelandic grammatical error detection is adapted from the Icelandic Error Corpus (IEC) and contains 200 examples.
Each example is a sentence that may contain one or more grammatical errors, and the task is to predict whether it does.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-sentences-gec

### Inflection benchmarks
The inflection benchmarks test the model's ability to generate inflected forms of Icelandic adjective-noun pairs. They are divided into three difficulty levels according to
how common the words are: common (100 examples), uncommon (100 examples), and rare (100 examples). The model gets a point for an example only if it generates valid JSON with the
correct inflected forms for every case, in both singular and plural.
- Link to dataset (common): https://huggingface.co/datasets/mideind/icelandic-inflection-easy
- Link to dataset (uncommon): https://huggingface.co/datasets/mideind/icelandic-inflection-medium
- Link to dataset (rare): https://huggingface.co/datasets/mideind/icelandic-inflection-hard
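
The all-or-nothing scoring described above could look roughly like the following sketch. The JSON layout (a flat mapping from hypothetical form labels such as `nom_sg` to strings) is an assumption for illustration; the actual datasets and metric may use a different schema.

```python
import json


def score_inflection(model_output: str, expected: dict) -> int:
    # Return 1 only if the output parses as JSON and every form matches.
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0  # malformed JSON earns no point
    if not isinstance(predicted, dict):
        return 0
    return int(predicted == expected)


# Hypothetical labels, for illustration only; a real example would cover
# all cases in both singular and plural.
expected = dict(nom_sg="góður hestur", acc_sg="góðan hest")
output = json.dumps(expected, ensure_ascii=False)
print(score_inflection(output, expected))      # prints 1
print(score_inflection("not json", expected))  # prints 0
```

A single wrong or missing form makes the two mappings unequal, so the example scores zero, just like unparseable output.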

### Belebele (IS)
This is the Icelandic subset (900 examples) of the Belebele benchmark, a multiple-choice reading comprehension task. The task is to answer questions about a given passage.
- Link to dataset: https://huggingface.co/datasets/facebook/belebele

### ARC-Challenge-IS
A machine-translated version of the ARC-Challenge multiple-choice question-answering dataset. For this benchmark, we use the test set, which contains 1.23k examples.
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-arc-challenge

"""

EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch name, tag, or commit hash of the model repo
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!

Note: models that require `trust_remote_code=True` are not supported yet. We are working on adding support, so stay tuned!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
Safetensors is a format for storing weights that is safer and faster to load than pickle-based checkpoints. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
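
One possible way to do the conversion is to load the model and save it again with safetensors serialization enabled. This is a minimal sketch: the function name `convert_to_safetensors` and its arguments are made up for illustration, and it assumes the model loads with `AutoModel` as in step 1.

```python
def convert_to_safetensors(model_name: str, output_dir: str) -> None:
    # Imported lazily so the function can be defined without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # safe_serialization=True writes the weights as .safetensors files.
    model.save_pretrained(output_dir, safe_serialization=True)
    tokenizer.save_pretrained(output_dir)
```

The resulting directory can then be uploaded to the model repository in place of the original weight files.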

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill in your model card
When we add extra information about models to the leaderboard, it will be taken automatically from the model card.

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If you have done all of that, check that you can run the EleutherAI LM Evaluation Harness on your model locally (you can add `--limit` to limit the number of examples per task).
"""