# On-Device LLM Throughput Calculator
A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.
## Overview
This tool calculates and visualizes the theoretical throughput (tokens per second) that can be achieved by a Large Language Model (LLM) running on devices with memory bandwidth constraints. It supports different attention mechanisms:
- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)
It also visualizes how sliding window attention impacts throughput at different context lengths.
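For intuition, here is a minimal sketch (not the app's source code) of how the per-token KV cache size differs across these mechanisms. The layer/head counts, and the DeepSeek-V2-style MLA dimensions, are assumptions chosen for illustration:

```python
# Illustrative per-token KV cache sizes for the attention variants the app
# compares. All model dimensions below are assumptions, not real app values.

BYTES_PER_ELEMENT = 2  # assuming an fp16/bf16 KV cache

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int) -> int:
    """One K vector and one V vector per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT

# Full multi-head attention baseline vs. GQA vs. MQA, for a 32-layer model:
mha = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)  # 512 KiB
gqa = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)   # 128 KiB
mqa = kv_bytes_per_token(num_layers=32, num_kv_heads=1, head_dim=128)   #  16 KiB

# MLA caches one compressed latent (plus a small decoupled RoPE key) per
# layer instead of full K/V heads; 512 + 64 follows DeepSeek-V2's dimensions.
mla = 32 * (512 + 64) * BYTES_PER_ELEMENT                               #  36 KiB

# Sliding window attention does not shrink the per-token size; it caps how
# many tokens are cached, so total KV size stops growing past the window.
```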
## Features
- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support
## Usage
1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
## Installation
```bash
pip install -r requirements.txt
```
## Running Locally
```bash
cd src
python app.py
```
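If `app.py` launches Gradio with its defaults (an assumption about the source), the interface is served at http://127.0.0.1:7860.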
## Theory
The calculations are based on memory bandwidth bottlenecks as described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput).
The basic formula for tokens per second:
```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
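As a worked example, plugging illustrative numbers into this formula (the device and model figures below are assumptions, not measurements):

```python
# Worked example of the throughput formula above; all figures are assumptions.
memory_bandwidth = 400e9   # bytes/s, roughly an Apple M2 Max-class device
parameter_size = 8e9 * 2   # 8B parameters stored in fp16 -> bytes
kv_per_token = 128 * 1024  # bytes/token, the GQA figure from the sketch above
context_length = 8192
total_kv_size = context_length * kv_per_token  # bytes of KV cache per sequence
batch_size = 1

tokens_per_second = (batch_size * memory_bandwidth) / (
    batch_size * total_kv_size + parameter_size
)
print(f"{tokens_per_second:.1f} tok/s")  # ~23.4 tok/s under these assumptions
```

With sliding window attention, `total_kv_size` is capped at `window_size * kv_per_token`, which is why throughput flattens rather than continuing to decay at long context lengths.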
## License
MIT