---
title: On-Device LLM Throughput Calculator
emoji: π
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---
# On-Device LLM Throughput Calculator
A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.
## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) achievable by a Large Language Model (LLM) running on memory-bandwidth-constrained devices. It supports several attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)
It also visualizes how sliding window attention impacts throughput at different context lengths.
## Features
- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support
## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
## Installation

```bash
pip install -r requirements.txt
```
## Running Locally

```bash
cd src
python app.py
```
## Theory

The calculations are based on memory bandwidth bottlenecks as described in the JAX ML Scaling Book.

The basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
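As a rough sketch, the formula (including the sliding-window behavior mentioned in the overview) can be written in a few lines of Python. The variable names and example numbers below are illustrative assumptions, not values taken from the app:

```python
def tokens_per_second(batch_size, memory_bandwidth, kv_bytes_per_token,
                      context_length, parameter_bytes, window=None):
    """Theoretical decode throughput on a memory-bandwidth-bound device.

    Each decode step must stream the full parameter set once, plus the
    KV cache for every sequence in the batch. With sliding window
    attention, the cached length is capped at the window size.
    """
    kv_tokens = context_length if window is None else min(context_length, window)
    total_kv_size = kv_bytes_per_token * kv_tokens
    bytes_per_step = batch_size * total_kv_size + parameter_bytes
    return (batch_size * memory_bandwidth) / bytes_per_step


# Illustrative numbers (assumed, not from the app): a ~7B model in fp16
# (~14 GB of weights) on a device with 100 GB/s of memory bandwidth.
full = tokens_per_second(batch_size=1, memory_bandwidth=100e9,
                         kv_bytes_per_token=0.5e6, context_length=32768,
                         parameter_bytes=14e9)
windowed = tokens_per_second(batch_size=1, memory_bandwidth=100e9,
                             kv_bytes_per_token=0.5e6, context_length=32768,
                             parameter_bytes=14e9, window=1024)
print(f"full attention: {full:.1f} tok/s, sliding window: {windowed:.1f} tok/s")
```

Because the KV-cache read grows with context length while the weight read is fixed, the windowed variant recovers throughput at long contexts, which is exactly the effect the app plots.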
## License

MIT