---
language:
- en
tags:
- role-playing
- character simulation
- llama
- llama-3.1
- persona
license: mit
datasets:
- Neph0s/CoSER
---
# CoSER Models
CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). These models are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.
CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.
### Model Variants
- **CoSER-8B**: Fine-tuned from LLaMA-3.1-8B
- **CoSER-70B**: Fine-tuned from LLaMA-3.1-70B
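Below is a minimal usage sketch with the `transformers` library. The repository ID, the assumption that the repo ships a chat template, and the generation settings are illustrative, not fixed by this card.

```python
# Minimal usage sketch for CoSER-8B with transformers.
# The repo ID below is an assumption; adjust it to the actual model page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Neph0s/CoSER-Llama-3.1-8B"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Role-play prompt: a system message establishes the persona, then user turns follow.
messages = [
    {"role": "system", "content": "You are Elizabeth Bennet from Pride and Prejudice."},
    {"role": "user", "content": "What do you think of Mr. Darcy?"},
]
# Assumes the repository provides a chat template for the tokenizer.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```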
## Training Data
The models are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which differs from existing RPLA datasets in two fundamental ways:
1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while offering greater quality and complexity than prior datasets.
2. It incorporates comprehensive data types (a hypothetical record sketch follows this list):
   - Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds.
   - Conversations that capture characters' internal thoughts and physical actions beyond surface-level speech.
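The sketch below illustrates what a single conversation record might look like. The field names and the bracket conventions for thoughts and actions are assumptions for illustration, not the dataset's exact schema; consult the dataset card for the real format.

```python
# Hypothetical illustration of a CoSER-style conversation record.
# Field names and the [thought]/(action) conventions are assumptions,
# not the dataset's exact schema.
record = {
    "book": "Pride and Prejudice",
    "scenario": "Elizabeth and Darcy meet at the Netherfield ball.",
    "characters": ["Elizabeth Bennet", "Mr. Darcy"],
    "dialogue": [
        {
            "character": "Mr. Darcy",
            "message": "[Pride wars with curiosity] (bows stiffly) "
                       "May I have the next dance, Miss Bennet?",
        },
        {
            "character": "Elizabeth Bennet",
            "message": "[She recalls his earlier slight] "
                       "You may, sir, though I wonder at the request.",
        },
    ],
}
```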
## Training Methodology
Our training approach is based on "given-circumstance acting" (GCA):
Given a conversation with messages M, characters C, and setting S, the actor LLM sequentially portrays each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language-modeling loss on that character's messages.
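A minimal sketch of this per-character loss masking is shown below, assuming a standard causal-LM setup. The label convention (-100 to ignore positions), the span bookkeeping, and the helper names are illustrative, not the paper's exact implementation.

```python
# Sketch: GCA-style loss masking, assuming a standard causal-LM setup.
# Tokens belonging to the target character's messages keep their labels;
# all other positions are set to -100 so the loss ignores them.
import torch

IGNORE_INDEX = -100

def build_gca_labels(input_ids, token_spans, target_character):
    """token_spans: list of (start, end, character) tuples covering the
    tokenized conversation; this bookkeeping is an assumed preprocessing step."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end, character in token_spans:
        if character == target_character:
            labels[start:end] = input_ids[start:end]
    return labels

# During training, one conversation yields one example per character:
# for character in conversation.characters:
#     labels = build_gca_labels(input_ids, token_spans, character)
#     loss = model(input_ids=input_ids.unsqueeze(0),
#                  labels=labels.unsqueeze(0)).loss
```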
## Performance and Evaluation
We evaluate our models via GCA evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:
1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model that manages turn-taking (a simulation-loop sketch follows this list).
2. We assess the generated conversations using penalty-based LLM judges, which are given detailed rubrics and the original conversations for reference.
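The following is a minimal sketch of the simulation loop, assuming a generic chat-completion interface; `predict_next_actor`, `actor_reply`, and the end-of-scene signal are hypothetical stand-ins for the paper's components.

```python
# Sketch of GCA multi-agent simulation, assuming hypothetical helpers:
# `predict_next_actor` stands in for the next-actor-prediction model and
# `actor_reply` for the actor LLM portraying a single character.
def simulate_conversation(setting, characters, max_turns=20):
    history = []  # list of (character, message) pairs
    for _ in range(max_turns):
        speaker = predict_next_actor(setting, characters, history)
        if speaker == "<END>":  # assumed end-of-scene signal
            break
        message = actor_reply(speaker, setting, history)
        history.append((speaker, message))
    return history

# The resulting transcript is then scored by penalty-based LLM judges,
# which receive detailed rubrics and the original conversation as reference.
```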
### Performance on Given-Circumstance Acting
CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models like GPT-4o.
| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|-------|----------------------|------------------|-------------------|------------------|--------------|------|---------|
| **Closed-source Models** | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| **Open-source Models** | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |
*Note: Bold values indicate best performance across all models.*
### Performance on Existing RPLA Benchmarks
| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|-------|----------------|------------------|-------------|----------|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| *CoSER-8B trained w/o I.T.* | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| *CoSER-70B trained w/o I.T.* | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |
*Note: Bold values indicate best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy on individual dimensions (Dim) and the full personality profile (Full) of the BFI.*
## Ethical Considerations
We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:
- The models may still generate content that reflects biases present in the literary works they were trained on.
- Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
- Users should implement appropriate safeguards when deploying these models in applications.
## Citation
If you use CoSER models in your research, please cite our paper:
```
@misc{wang2025cosercoordinatingllmbasedpersona,
      title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles},
      author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
      year={2025},
      eprint={2502.09082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09082},
}
```