Shisa V2

Shisa V2 is a family of bilingual Japanese and English (JA/EN) general-purpose chat models trained by Shisa.AI. These models aim to excel in Japanese language tasks while retaining robust English capabilities.

Since our initial Shisa 7B releases, the baseline Japanese capabilities of open-weight language models have significantly improved. New models have more Japanese pre-training tokens, higher JA tokenizer efficiency, and better-quality Japanese outputs overall. As such, for Shisa V2 we've eschewed both tokenizer extension and costly continued pre-training and have focused entirely on optimizing post-training. We've significantly expanded and refined the synthetic-data-driven approach pioneered with our original Shisa 7B models, and have achieved substantial performance gains.

Model Family Overview

The Shisa V2 family comprises a range of models from 7B to 70B parameters in size:

| License | Model Name | Parameters | Context Length | JA AVG | EN AVG |
|---|---|---|---|---|---|
| Apache 2.0 | shisa-v2-qwen2.5-7b | 7B | 128K/8K | 71.06 | 54.86 |
| Llama 3.1 | shisa-v2-llama3.1-8b¹ | 8B | 128K | 70.83 | 54.75 |
| Apache 2.0 | shisa-v2-mistral-nemo-12b | 12B | 128K | 72.83 | 53.33 |
| MIT | shisa-v2-unphi4-14b | 14B | 16K | 75.89 | 60.10 |
| Apache 2.0 | shisa-v2-qwen2.5-32b | 32B | 128K/8K | 76.97 | 67.41 |
| Llama 3.3 | shisa-v2-llama3.3-70b¹ | 70B | 128K | 79.72 | 67.71 |

These Shisa V2 models were all trained using the same datasets and training recipes, except for scaling the learning rate based on model size and modifying the global batch size for the 70B model.

While most of our development and tuning was done on the Llama 3.1 8B model, we did some cross-validation during this process and we're pleased that our final recipe has shown robust scaling, improving Japanese language performance across all model sizes evaluated. We've prioritized releasing the highest-quality openly-licensed (Apache 2.0 and MIT) models in each class size.

Performance

All Shisa V2 models demonstrate improved Japanese output quality compared to their respective base models:

| Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |

The Shisa V2 models perform well against other models in their respective class sizes.

Included for reference are our recently published shisa-v2-llama3.1-8b-preview release, as well as the still-popular but long-since-superseded shisa-gamma-7b-v1 model.

| License | Model Name | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | llm-jp-eval | shisa-jp-ifeval | shisa-jp-rp-bench | shisa-jp-tl-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apache 2.0 | shisa-ai/shisa-v2-qwen2.5-7b | 71.06 | 54.86 | 8.21 | 7.81 | 8.49 | 8.91 | 7.62 | 0.59 | 0.32 | 4.49 | 5.98 | 0.44 | 32.9 | 0.70 | 0.73 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b | 70.83 | 54.75 | 8.20 | 7.67 | 8.32 | 9.24 | 7.56 | 0.57 | 0.31 | 4.61 | 5.91 | 0.45 | 31.7 | 0.82 | 0.61 |
| Llama 3.1 | shisa-ai/shisa-v2-llama3.1-8b-preview | 68.03 | 54.56 | 8.12 | 7.55 | 8.57 | 9.03 | 7.33 | 0.56 | 0.19 | 4.67 | 5.18 | 0.46 | 32.0 | 0.79 | 0.62 |
| Llama 3.1 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 67.44 | 42.20 | 8.22 | 8.01 | 8.40 | 9.10 | 7.37 | 0.56 | 0.25 | 4.36 | 4.22 | 0.30 | 26.4 | 0.64 | 0.48 |
| Apache 2.0 | Qwen/Qwen2.5-7B-Instruct | 65.30 | 58.11 | 8.03 | 7.81 | 8.09 | 8.68 | 7.53 | 0.57 | 0.29 | 4.15 | 3.29 | 0.44 | 33.9 | 0.76 | 0.79 |
| Llama 3.1 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 63.80 | 53.94 | 7.93 | 7.57 | 8.26 | 8.61 | 7.28 | 0.39 | 0.22 | 4.53 | 4.17 | 0.46 | 30.4 | 0.77 | 0.62 |
| Llama 3 | elyza/Llama-3-ELYZA-JP-8B | 60.92 | 39.09 | 7.91 | 7.61 | 8.08 | 8.92 | 7.04 | 0.41 | 0.24 | 4.39 | 1.75 | 0.34 | 17.5 | 0.62 | 0.43 |
| Llama 3.1 | allenai/Llama-3.1-Tulu-3.1-8B | 60.86 | 54.21 | 7.42 | 6.84 | 7.69 | 8.61 | 6.52 | 0.51 | 0.22 | 4.39 | 2.90 | 0.40 | 31.3 | 0.82 | 0.63 |
| Apache 2.0 | llm-jp/llm-jp-3-7.2b-instruct3 | 56.05 | 23.46 | 7.66 | 6.99 | 7.70 | 9.16 | 6.79 | 0.47 | 0.20 | 3.03 | 1.49 | 0.22 | 5.2 | 0.49 | 0.18 |
| Llama 3.1 | meta-llama/Llama-3.1-8B-Instruct | 53.43 | 53.43 | 7.34 | 6.95 | 7.67 | 8.36 | 6.40 | 0.25 | 0.16 | 4.13 | 1.03 | 0.44 | 27.7 | 0.80 | 0.63 |
| Llama 3 | shisa-ai/shisa-v1-llama3-8b | 53.08 | 42.80 | 7.17 | 6.40 | 7.50 | 8.31 | 6.48 | 0.23 | 0.09 | 4.20 | 2.24 | 0.36 | 20.2 | 0.63 | 0.52 |
| Apache 2.0 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 52.25 | 27.04 | 7.10 | 6.97 | 6.58 | 8.40 | 6.46 | 0.23 | 0.17 | 3.67 | 2.02 | 0.24 | 14.4 | 0.38 | 0.32 |
| Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 48.88 | 20.88 | 6.20 | 5.74 | 5.93 | 7.28 | 5.87 | 0.52 | 0.13 | 3.20 | 1.43 | 0.26 | 2.2 | 0.37 | 0.18 |

Testing Notes

Japanese functional tests were conducted using the shisa-ai/shaberi fork of the LightBlue Shaberi evaluation harness. Shaberi ratings were performed with a PoLL (LLM Jury) consisting of:

The results were statistically validated to be comparable to both gpt-4-1106-preview and human-reviewed "gold standard" ratings.

Dynamic RoPE extension was utilized when necessary for testing models with context windows smaller than 8K tokens. All tests were performed using recent versions of vLLM or SGLang.
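As a concrete illustration, dynamic RoPE extension can be enabled through a transformers-style `rope_scaling` config override. This is a hedged sketch, not the exact harness configuration: the native window value here is an assumption, and the real base model's `max_position_embeddings` should be checked before applying.

```python
# Hypothetical sketch: extend a model's usable context from its assumed
# native window (4K here) to the 8K needed for evaluation by enabling
# dynamic NTK-aware RoPE scaling via a transformers-style config override.
native_window = 4096          # assumed native context of the base model
needed_context = 8192         # evaluation requirement described above

overrides = {}
if needed_context > native_window:
    overrides["rope_scaling"] = {
        "type": "dynamic",                         # dynamic NTK scaling
        "factor": needed_context / native_window,  # 8192 / 4096 = 2.0
    }
# `overrides` would then be passed at load time, e.g.
# AutoModelForCausalLM.from_pretrained(model_id, **overrides)
```

The `factor` is simply the ratio of the desired context to the native window; dynamic scaling only changes behavior when prompts actually exceed the native length.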

We developed a custom "multieval" harness to automate our model evaluations. Standard benchmarks include:

New Japanese Benchmarks

Over the course of model development, we also created several new evaluations to help us measure performance on important Japanese downstream tasks:

  • shisa-jp-ifeval: Inspired by IFEval, but evaluating instruction-following abilities specific to Japanese grammar and linguistics (closed form)
  • shisa-jp-rp-bench: Assessing performance on Japanese role-play and character/persona-based multi-turn conversations based on Aratako's Japanese-RP-Bench (LLM judge)
  • shisa-jp-tl-bench: Testing Japanese-English translation proficiency (LLM judge, BTL pairwise comparison with logistic transformation scoring)

We believe these benchmarks will be generally useful and plan to open-source them in the near future to support the Japanese LLM research community.
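For readers unfamiliar with the scoring scheme mentioned for shisa-jp-tl-bench, here is a minimal sketch of Bradley-Terry-Luce (BTL) fitting followed by a logistic transformation onto a bounded score. The toy data, learning rate, and score scale are assumptions for illustration, not the actual benchmark implementation.

```python
import math

def fit_btl(wins, n_models, lr=0.05, iters=2000):
    """Fit BTL strengths by gradient ascent on the pairwise log-likelihood.
    wins[(i, j)] = number of times model i beat model j."""
    theta = [0.0] * n_models
    for _ in range(iters):
        grad = [0.0] * n_models
        for (i, j), w in wins.items():
            # P(i beats j) under the BTL model
            p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
            grad[i] += w * (1.0 - p)
            grad[j] -= w * (1.0 - p)
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n_models      # anchor: strengths sum to zero
        theta = [t - mean for t in theta]
    return theta

def logistic_score(theta_i, scale=10.0):
    # Map an unbounded strength onto a bounded 0..scale score
    return scale / (1.0 + math.exp(-theta_i))

# Toy pairwise judge results: model 0 > model 1 > model 2
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3}
theta = fit_btl(wins, 3)
scores = [logistic_score(t) for t in theta]
```

The logistic transformation keeps scores comparable across runs even though raw BTL strengths are only identified up to an additive constant.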

Usage

All Shisa V2 models inherit the chat templates of their respective base models and have been tested and validated for proper inference with both vLLM and SGLang.
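As an illustrative sketch, the model can be queried through an OpenAI-compatible endpoint such as the ones vLLM and SGLang expose; the chat template is applied server-side, so plain `messages` suffice. The endpoint URL and server launch flags are assumptions and not specified by this card.

```python
import json

# Hypothetical request body for a vLLM/SGLang OpenAI-compatible endpoint
# (e.g. POST http://localhost:8000/v1/chat/completions). Sampling values
# follow the recommendations in this card.
payload = {
    "model": "shisa-ai/shisa-v2-qwen2.5-7b",
    "messages": [
        {"role": "user", "content": "日本の四国地方について教えてください。"}
    ],
    "temperature": 1.0,   # higher temperature for creative tasks
    "top_p": 0.9,         # limits cross-lingual token leakage
}
body = json.dumps(payload, ensure_ascii=False)
```

Because the server applies the base model's own chat template, no client-side prompt formatting is needed.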

Running sampler sweeps, we found that the models operate well across a variety of temperatures in most settings. For translation tasks specifically, we recommend a lower temperature (0.2) to increase accuracy. For role-play and creative tasks, a higher temperature (e.g., 1.0) gives good results. To prevent cross-lingual token leakage, we recommend a top_p of 0.9 or a min_p of 0.1.
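To illustrate what the min_p setting does at the sampler level, here is a minimal reimplementation of min-p filtering. This is an assumption-level sketch for clarity; inference engines such as vLLM and SGLang implement this natively.

```python
import math

def min_p_keep(logits, min_p=0.1):
    """Return indices of tokens that survive min-p filtering: a token is
    kept only if its probability is at least min_p times the probability
    of the single most likely token."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# The low-probability stray token (index 2) is filtered out, which is how
# min_p helps suppress cross-lingual token leakage at higher temperatures.
kept = min_p_keep([5.0, 4.0, 0.0], min_p=0.1)
```

Unlike a fixed top_p cutoff, the min-p threshold adapts to the model's confidence: when one token dominates, almost everything else is pruned; when the distribution is flat, more candidates survive.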

No additional safety alignment has been done on these models, so they will largely inherit the base models' biases and safety profiles.

Datasets

Our supervised fine-tuning (SFT) stage dataset consists of approximately 360K samples totaling roughly 420M Llama 3 tokens:

  • shisa-ai/shisa-v2-sharegpt
    • This is a filtered, regenerated, and resampled version of the original Shisa V1 augmxnt/ultra-orca-boros-en-ja-v1 dataset
    • This was the backbone of our Shisa V2 training, and it proved to be an extremely robust dataset, outperforming all existing mixes/additions we tested (Tulu, Olmo, Rewild, various Magpie sets, etc.). If you need a JA/EN dataset, we believe this new version is among the best currently available
  • shisa-ai/rewild-set-deepseek-subset
  • shisa-ai/magpie-ultra-set
  • shisa-ai/magpie-advanced-questions-set
    • Magpie-generated questions about advanced college-level topics across a variety of academic fields
  • shisa-ai/japan-magpie-set
    • Magpie-generated questions about Japan's economy and history as well as cultural and business practices
  • shisa-ai/shisa-v2-roleplaying-sft
    • Synthetically-generated roleplaying data featuring a wide variety of characters, situations, and genres
  • shisa-ai/translation_expanded_master_set_filtered
    • A synthetic dataset involving a wide range of translation tasks, including essays, conversations, and fiction
  • shisa-ai/shisa-v2-instruction-following-sft

Our final DPO mix is 113K samples totaling approximately 115M Llama 3 tokens:

  • shisa-ai/deepseekv3-ultrafeedback-armorm-dpo
  • shisa-ai/shisa-v2-roleplaying-dpo
    • A DPO variant of the roleplaying-sft set that uses an UltraFeedback-style rating system
  • shisa-ai/translation-no-extra-text-dpo-dataset
    • A DPO set that aims to reduce the tendency of models to output extraneous explanatory text for translations when not wanted
  • shisa-ai/shisa-v2-instruction-following-dpo
    • A DPO variant of the instruction-following-sft set to further enhance instruction-following performance
  • shisa-ai/politeness-dpo-set
    • A set to allow for greater controllability of speaking style for Japanese responses
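For context on how these preference pairs are consumed, DPO optimizes a simple loss over chosen/rejected completions. The following is a minimal sketch of the standard per-pair DPO loss, assuming summed token log-probabilities from the trained policy and a frozen reference model; it is not this project's training code.

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given summed token
    log-probs under the trained policy (pi_*) and the frozen reference
    (ref_*). beta controls how far the policy may drift from the reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    logits = beta * margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# When the policy favors the chosen completion more than the reference
# does, the loss is low; when it favors the rejected one, the loss is high.
loss_good = dpo_pair_loss(-10.0, -12.0, -11.0, -11.0)
loss_bad = dpo_pair_loss(-12.0, -10.0, -11.0, -11.0)
```

At a zero margin the loss equals log 2, i.e. the model is indifferent between the pair.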

Training

We trained over 200 models to empirically test a wide range of variables. Beyond hyper-parameter and data-mix testing, we also ran numerous tests on data ordering, multilingual-specific ordering, curriculum learning, multi-stage training, various forms of self-play, preference tuning, and some of the latest RL/verifiable reward techniques.

A full discussion of these learnings is out of scope here, but we will be updating the shisa-v2 wiki and the Shisa.AI website with forthcoming writeups.

Most of our training was done on a small AWS SageMaker-deployed 4-node H100 Slurm cluster. Training was mostly done with Axolotl using DeepSpeed and Liger Kernel. The Phi 4 and Llama 3.3 70B versions of Shisa V2 were trained with OpenRLHF. Our training logs are publicly available on Weights and Biases.

Credits

The Shisa V2 models were developed by Leonard Lin and Adam Lensenmayer (Shisa.AI).

Compute was provided by Ubitus K.K. and METI GENIAC.

Thanks to Meta Llama, Microsoft Research, Mistral AI, and the Qwen Team for providing their models to the open-source community; to Unsloth for their llamafied conversion of Phi-4; to the Tulu team, whose detailed writeups and fast responses to our questions were very helpful; and to Chanvichet Vong of the Axolotl team for his tireless work in the Axolotl Discord.

We also extend our thanks to all open source AI developers and researchers - without their publicly shared research, tooling, and datasets, none of our work would be possible. We hope that our own contributions will further support the broader community.

A special thanks to Jon Durbin for his work on Shisa V1.

For more details on our development and insights, please visit the Shisa V2 Github repository and the Shisa.AI website.


1: Per the Llama Community License Agreements, the official names of the Llama-based models are "Llama 3.1 shisa-v2-llama3.1-8b" and "Llama 3.3 shisa-v2-llama3.3-70b"
