Poro 2 8B Base Model Card

Poro 2 8B Base is an 8B parameter decoder-only transformer created through continued pretraining of Llama 3.1 8B to add Finnish language capabilities. It was trained on 165B tokens using a carefully balanced mix of Finnish, English, code, and math data. Poro 2 is a fully open source model and is made available under the Llama 3.1 Community License.

Poro 2 was created in a collaboration between AMD Silo AI, the TurkuNLP group of the University of Turku, and High Performance Language Technologies (HPLT). Training was conducted on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland.

This model demonstrates how continued pretraining can efficiently add new language capabilities to existing models while maintaining performance in the original domains. Through the combination of English and Finnish training data, we achieve a model that substantially outperforms the base Llama 3.1 8B model in Finnish while maintaining solid English proficiency.

For more details on our training and data curation process, check out our Continued Pretraining Playbook.

Poro 2 Model Family

The Poro 2 model family includes 8B and 70B models, each released in three versions: a base model, a post-training SFT-only checkpoint, and a final instruct model, which is the SFT model plus a round of DPO.

| Model | Based on | Base Model | SFT | Instruct |
|---|---|---|---|---|
| Poro 2 8B | Llama 3.1 8B | Poro 2 8B Base | Poro 2 8B SFT | Poro 2 8B Instruct |
| Poro 2 70B | Llama 3.1 70B | Poro 2 70B Base | Poro 2 70B SFT | Poro 2 70B Instruct |

What does Poro mean? Poro is the Finnish word for reindeer! 🦌 Reindeer are native to Finland and play a significant role in Finnish culture and history.

Model Overview

NOTE: This is a base model, which needs further fine-tuning for most use cases.

Poro 2 8B is based on the Llama 3.1 8B architecture and uses continued pretraining to add Finnish language capabilities.

| Hyperparameter | Value |
|---|---|
| n_parameters | 8.03B |
| n_layers | 32 |
| n_heads | 32 |
| n_kv_heads | 8 |
| d_model | 4096 |
| vocab_size | 128256 |
| max_sequence_length | 8192 |
| base_model | Llama-3.1-8B |
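
These values can be cross-checked without downloading the weights; below is a minimal sketch using the transformers AutoConfig API (attribute names follow the standard Llama configuration in transformers).

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and print the architecture values
# listed in the table above.
config = AutoConfig.from_pretrained("LumiOpen/Llama-Poro-2-8B-base")

print(config.num_hidden_layers)    # n_layers: 32
print(config.num_attention_heads)  # n_heads: 32
print(config.num_key_value_heads)  # n_kv_heads: 8
print(config.hidden_size)          # d_model: 4096
print(config.vocab_size)           # vocab_size: 128256
```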

Training

Poro 2 8B was created through continued pretraining on the LUMI supercomputer, using AMD MI250X GPUs. Training used a 3D parallelism strategy with TP=2, PP=1.

Training was conducted using a custom version of the Megatron-LM framework. Our code is available at https://github.com/LumiOpen/Megatron-LM-lumi.

Training Hyperparameters

| Hyperparameter | Value | Comment |
|---|---|---|
| Precision | bfloat16 | |
| Optimizer | AdamW | |
| Learning rate | 3e-4 | |
| LR scheduler | cosine | Warmup ratio 0.05, min LR 1e-8 |
| Weight decay | 1e-1 | |
| Global batch size | 512 | |
| Micro batch size | 1 | |
| Max sequence length | 8192 | |
| Total tokens | 165B | 1 epoch |
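
For illustration, here is a minimal sketch of a linear-warmup cosine schedule using the values above (peak LR 3e-4, warmup ratio 0.05, minimum LR 1e-8). This is a generic reconstruction of the schedule shape, not the exact Megatron-LM implementation, and the step count is only an estimate derived from the batch size and sequence length.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 1e-8
WARMUP_RATIO = 0.05

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup followed by cosine decay to MIN_LR (generic sketch)."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Rough step count: 165B tokens / (512 sequences x 8192 tokens) ≈ 39,000 steps
total_steps = 165_000_000_000 // (512 * 8192)
for step in (0, total_steps // 2, total_steps - 1):
    print(step, f"{learning_rate(step, total_steps):.2e}")
```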

Dataset

Poro 2 8B was trained on a balanced 165B token dataset designed to maintain English, code, and math capabilities while adding Finnish proficiency.

| Dataset | Source | Percentage | Tokens |
|---|---|---|---|
| Finnish | FineWeb2 | 30% | 50B |
| English | FineWeb-Edu | 30% | 50B |
| Code | StarCoder | 30% | 50B |
| Math | FineMath | 10% | 16B |
| Total | | 100% | 165B |
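
As a quick sanity check, the per-source shares can be recomputed from the token counts above (the per-source counts, percentages, and 165B total in the table are rounded, so the sums do not match exactly):

```python
# Token counts in billions, taken from the table above.
mix = {
    "Finnish (FineWeb2)": 50,
    "English (FineWeb-Edu)": 50,
    "Code (StarCoder)": 50,
    "Math (FineMath)": 16,
}

total = sum(mix.values())  # 166B before rounding; reported as 165B
for name, tokens in mix.items():
    print(f"{name}: {tokens}B tokens ({100 * tokens / total:.0f}%)")
```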

Evaluation Results

Poro 2 8B shows substantial improvements in Finnish capabilities over Llama 3.1 8B, while maintaining English performance:

Finnish Performance

| Benchmark | Poro 2 8B | Llama 3.1 8B |
|---|---|---|
| ARC Challenge | 48.90 | 38.82 |
| HellaSwag | 50.49 | 30.97 |
| MMLU | 56.25 | 49.64 |
| TruthfulQA | 49.78 | 45.54 |

English Performance

| Benchmark | Poro 2 8B | Llama 3.1 8B |
|---|---|---|
| ARC Challenge | 60.75 | 57.94 |
| HellaSwag | 80.55 | 80.05 |
| MMLU | 63.48 | 65.08 |
| TruthfulQA | 48.06 | 54.02 |

Translation Performance

| Direction | Poro 2 8B | Llama 3.1 8B |
|---|---|---|
| EN→FI BLEU | 36.48 | 23.92 |
| FI→EN BLEU | 40.71 | 37.42 |

Overall: ~10 percentage point average improvement in Finnish benchmarks with only ~1 percentage point decrease in English performance.
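
The averages behind these two figures can be reproduced directly from the tables above; a minimal sketch of the arithmetic:

```python
# Scores from the tables above as (Poro 2 8B, Llama 3.1 8B) pairs.
finnish = {"ARC Challenge": (48.90, 38.82), "HellaSwag": (50.49, 30.97),
           "MMLU": (56.25, 49.64), "TruthfulQA": (49.78, 45.54)}
english = {"ARC Challenge": (60.75, 57.94), "HellaSwag": (80.55, 80.05),
           "MMLU": (63.48, 65.08), "TruthfulQA": (48.06, 54.02)}

def mean_delta(scores: dict) -> float:
    """Average score difference (Poro 2 minus Llama 3.1) in percentage points."""
    return sum(poro - llama for poro, llama in scores.values()) / len(scores)

print(f"Finnish: {mean_delta(finnish):+.1f} pp")  # ≈ +10.1 pp
print(f"English: {mean_delta(english):+.1f} pp")  # ≈ -1.1 pp
```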

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "LumiOpen/Llama-Poro-2-8B-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example usage
prompt = "Kerro minulle Suomesta."  # "Tell me about Finland" in Finnish
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,   # enable sampling so the temperature setting takes effect
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
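
For quick experiments, the same model can also be run through the transformers pipeline helper; a minimal sketch with illustrative (not tuned) sampling settings:

```python
from transformers import pipeline
import torch

generator = pipeline(
    "text-generation",
    model="LumiOpen/Llama-Poro-2-8B-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Poro 2 8B Base has no chat template, so prompts are plain text completions.
result = generator("Kerro minulle Suomesta.",  # "Tell me about Finland" in Finnish
                   max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```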

Ethical Considerations and Limitations

Poro 2 8B is an advanced language model optimized for English and Finnish, with additional capabilities in code and mathematics. As with most AI-driven systems, Poro 2 is a product of the vast data it has been trained on, which may reflect the imperfections, biases, and idiosyncrasies of the wider web. The model may, at times, produce outputs that can be considered inaccurate, prejudiced, or controversial.

Key limitations:

  • Limited proficiency in languages other than English and Finnish
  • Potential for generating biased or inappropriate content
  • May produce factually incorrect information

Users and developers engaging with Poro 2 should exercise discretion and consider additional evaluation and customization to ensure the model's responses align with their specific needs.

License

Built with Llama

Poro 2 8B is released under the Llama 3.1 Community License. Please review the license terms before use.

Citation

@misc{poro2_2025,
    title={Poro 2: Continued Pretraining for Language Acquisition},
    author={Elaine Zosa and Jouni Luoma and Kai Hakala and Antti Virtanen and Mika Koistinen and Risto Luukkonen and Akseli Reunamo and Sampo Pyysalo and Jonathan Burdge},
    year={2025},
    howpublished={LumiOpen}
}

Acknowledgments

We thank CSC - IT Center for Science, Finland for providing access to the LUMI supercomputer. This work was supported by the High Performance Language Technologies (HPLT) project and conducted in collaboration with TurkuNLP from the University of Turku. This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350.
