# On-Device LLM Throughput Calculator

A Gradio web application for visualizing LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve on a device whose decoding speed is limited by memory bandwidth. It supports different attention mechanisms:

- Grouped-Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding-window attention affects throughput at different context lengths; the sketch below shows how the window caps KV-cache growth.
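
What separates these mechanisms is the KV-cache footprint per token: GQA stores keys and values for a reduced number of KV heads, MQA is the special case with a single KV head, and MLA caches a small compressed latent per token instead of full keys and values. A minimal sketch of that accounting, with hypothetical parameter names rather than the app's actual API (and ignoring details such as MLA's extra decoupled RoPE key dimensions):

```python
def kv_bytes_per_token(num_layers, head_dim, num_kv_heads=None,
                       latent_dim=None, bytes_per_element=2):
    """Approximate KV-cache bytes stored per token (hypothetical helper).

    GQA: 2 (K and V) * num_kv_heads * head_dim per layer.
    MQA: the same with num_kv_heads=1.
    MLA: a single compressed latent of latent_dim per layer.
    """
    if latent_dim is not None:            # MLA: cache the latent, not K/V
        per_layer = latent_dim
    else:                                 # GQA (MQA when num_kv_heads == 1)
        per_layer = 2 * num_kv_heads * head_dim
    return num_layers * per_layer * bytes_per_element


def kv_cache_bytes(context_len, sliding_window=None, **mechanism):
    """Total KV-cache bytes per sequence at a given context length.

    With sliding-window attention only the last `sliding_window` tokens
    are cached, so the KV term stops growing beyond the window.
    """
    cached = min(context_len, sliding_window) if sliding_window else context_len
    return cached * kv_bytes_per_token(**mechanism)
```

With illustrative numbers, a 32-layer model with 8 KV heads of dimension 128 caches 128 KiB per token in fp16, while an MLA latent of size 512 would cache only 32 KiB.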

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding-window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput (sketched below)
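
Steps 1–3 choose the inputs to the formula in the Theory section below; step 4 sweeps the context length and plots the resulting curve. A rough sketch of that sweep, reusing the hypothetical `kv_cache_bytes` helper above with illustrative numbers (not the app's actual code):

```python
import matplotlib.pyplot as plt

memory_bandwidth = 100e9        # bytes/s (illustrative device)
parameter_size = 8e9 * 2        # 8B parameters at 2 bytes each (fp16/bf16)
batch_size = 1

context_lengths = [2**n for n in range(8, 18)]     # 256 ... 128k tokens
tokens_per_second = []
for ctx in context_lengths:
    kv = kv_cache_bytes(ctx, num_layers=32, head_dim=128, num_kv_heads=8)
    tokens_per_second.append(
        batch_size * memory_bandwidth / (batch_size * kv + parameter_size))

plt.plot(context_lengths, tokens_per_second)
plt.xscale("log")
plt.xlabel("context length (tokens)")
plt.ylabel("theoretical tokens/s")
plt.show()
```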

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on the memory-bandwidth bottleneck described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput).

The basic formula for tokens per second is:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
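
The intuition: each decoding step must stream every parameter plus every sequence's KV cache through memory once, so step time is the total bytes moved divided by the bandwidth, and each step yields one token per sequence in the batch. Plugging in illustrative numbers (not tied to any particular device or model):

```python
memory_bandwidth = 100e9       # bytes/s (illustrative device)
parameter_size = 8e9 * 2       # 8B parameters at 2 bytes each (fp16/bf16)
total_kv_size = 4096 * 131072  # 4k-token context at 128 KiB of KV per token
                               # (e.g. 32 layers * 2 * 8 KV heads * 128 dim * 2 bytes)
batch_size = 4

step_time = (batch_size * total_kv_size + parameter_size) / memory_bandwidth
tokens_per_second = batch_size / step_time    # same as the formula above
print(f"{tokens_per_second:.1f} tok/s")       # ~22 tok/s for these numbers
```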

## License

MIT