jiazhengli
/

Llama-3.1-8B-RoleMRC-sft

Model card Files Files and versions Community

Llama-3.1-8B-RoleMRC-sft / README.md

jiazhengli's picture

Update README.md

6aa3b65 verified 4 months ago

|

history blame contribute delete

3.13 kB

	---
	model-index:
	- name: jiazhengli/Llama-3.1-8B-RoleMRC-sft
	results: []
	datasets:
	- Junrulu/RoleMRC
	language:
	- en
	base_model: meta-llama/Meta-Llama-3.1-8B
	license: llama3
	---

	# Model Card for Llama-3.1-8B-RoleMRC-sft

	This repository provides a fine-tuned version of Llama-3.1-8B, using our proposed [RoleMRC dataset](https://huggingface.co/datasets/Junrulu/RoleMRC). We obey all licenses mentioned in llama3's work.

	## Performance

	Reference-based Evaluation Result

	\| Model \| BLEU \| ROUGE-1 \| ROUGE-2 \| ROUGE-L \| ROUGE-Lsum \| METEOR \| BERTScore F1 \|
	\|--------------------------------\|--------\|---------\|---------\|---------\|------------\|--------\|-----------\|
	\| LLaMA3.1-8B-Instruct \| 0.0226 \| 0.2277 \| 0.0615 \| 0.1509 \| 0.1650 \| 0.2594 \| 0.8478 \|
	\| LLaMA3.1-70B-Instruct \| 0.0232 \| 0.2258 \| 0.0646 \| 0.1500 \| 0.1661 \| 0.2632 \| 0.8480 \|
	\| LLaMA3.1-8B-RoleMRC-SFT \| 0.1782 \| 0.4628 \| 0.2676 \| 0.3843 \| 0.3853 \| 0.3975 \| 0.8831 \|
	\| LLaMA3.1-8B-RoleMRC-DPO \| 0.1056 \| 0.3989 \| 0.1785 \| 0.2988 \| 0.3001 \| 0.4051 \| 0.8805 \|

	General Benchmark

	\| Model \| GSM8K 8-shot \| Math 4-shot \| GPQA 0-shot \| IFEval 3-shot \| MMLU-Pro 5-shot \| MMLU 0-shot \| PiQA 3-shot \| MUSR 0-shot \| TruthfulQA 3-shot\| / Avg. \|
	\|----------------------------------------\|-------------\|------------\|-------------\|--------------\|---------------\|-----------\|-----------\|-----------\|------------------------\|------\|
	\| LLAMA3.1-8B \| 48.98 \| 17.78 \| 12.5 \| 16.67 \| 35.21 \| 63.27 \| 81.77 \| 38.1 \| 28.52 \| 38.09 \|
	\| LLAMA3.1-8B-INSTRUCT \| 77.41 \| 34.1 \| 12.72 \| 57.67 \| 40.77 \| 68.1 \| 82.1 \| 39.81 \| 36.47 \| 49.91 \|
	\| LLaMA3.1-8B-RoleMRC-SFT \| 56.18 \| 12.78 \| 19.64 \| 42.09 \| 31.58 \| 59.3 \| 82.64 \| 40.34 \| 35.01 \| 42.17 \|
	\| LLaMA3.1-8B-RoleMRC-DPO \| 58.53 \| 13.5 \| 20.09 \| 46.64 \| 31.8 \| 59.96 \| 82.7 \| 39.42 \| 37.33 \| 43.33 \|

	## Evaluation Details
	Five conditional benchmarks, using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
	- GSM8K: 8-shot, report strict match
	- IFEval: 3-shot, report instruction-level strict accuracy
	- PiQA: 3-shot, report accuracy
	- MMLU: 0-shot, report normalized accuracy
	- TruthfulQA: 3-shot, report accuracy of single-true mc1 setting

	## Input Format

	The model is trained to use the following format:
	```
	<\|start_header_id\|>user<\|end_header_id\|>

	{PROMPT}<\|eot_id\|>
	<\|start_header_id\|>assistant<\|end_header_id\|>

	{Response}
	```

	## Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-5
	- total_train_batch_size: 16
	- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.04
	- num_epochs: 1.0