---
title: On-Device LLM Throughput Calculator
emoji: 🚀
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---

# On-Device LLM Throughput Calculator

A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that can be achieved by a Large Language Model (LLM) running on devices with memory bandwidth constraints. It supports different attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention impacts throughput at different context lengths.
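
These mechanisms differ mainly in how much KV cache each decode step has to read, and sliding window attention caps that cost at the window size. Below is a minimal sketch of the arithmetic with hypothetical parameter names, assuming fp16 cache entries (it is not the app's code):

```python
def kv_cache_bytes(context_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   window: int | None = None) -> int:
    """Bytes of KV cache read per decode step for one sequence."""
    # Sliding window attention only keeps the most recent `window` tokens.
    effective_len = min(context_len, window) if window else context_len
    # Keys and values (2x) are cached at every layer.
    # MQA uses num_kv_heads=1; GQA uses a small group (e.g. 8).
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * effective_len

# Full attention grows linearly with context; a 4096-token window caps it.
full = kv_cache_bytes(131072, num_layers=32, num_kv_heads=8, head_dim=128)
capped = kv_cache_bytes(131072, num_layers=32, num_kv_heads=8, head_dim=128,
                        window=4096)
print(full / 1e9, capped / 1e9)  # ~17.2 GB vs ~0.54 GB
```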

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
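
Under the hood, these steps amount to feeding a few numbers into the throughput formula (see Theory below). Here is a minimal Gradio sketch with hypothetical labels and an fp16-weights assumption, just to show the shape of such an interface; the actual app in src/app.py is more complete:

```python
import gradio as gr

def estimate(bandwidth_gbs: float, params_b: float, kv_gb: float, batch: float) -> str:
    # fp16 weights: 2 bytes per parameter (an assumption, not the app's setting).
    param_bytes = params_b * 2e9
    tps = batch * bandwidth_gbs * 1e9 / (batch * kv_gb * 1e9 + param_bytes)
    return f"{tps:.1f} tokens/s"

demo = gr.Interface(
    fn=estimate,
    inputs=[
        gr.Number(label="Memory bandwidth (GB/s)", value=100),
        gr.Number(label="Parameters (billions)", value=8),
        gr.Number(label="KV cache per sequence (GB)", value=1),
        gr.Number(label="Batch size", value=1),
    ],
    outputs=gr.Textbox(label="Theoretical throughput"),
)

if __name__ == "__main__":
    demo.launch()
```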

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on memory bandwidth bottlenecks as described in the JAX ML Scaling Book.

Decoding is memory-bandwidth bound: each decode step must stream every model weight once, plus the KV cache of each sequence in the batch, and produces one token per sequence. The basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```

where `total_kv_size` and `parameter_size` are in bytes and `memory_bandwidth` is in bytes per second.
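
A direct Python rendering of this formula, with illustrative numbers rather than measurements:

```python
def tokens_per_second(batch_size: int, memory_bandwidth: float,
                      total_kv_size: float, parameter_size: float) -> float:
    """Theoretical decode throughput when memory bandwidth is the bottleneck."""
    # Bytes streamed per decode step: all weights once, plus one KV cache
    # per sequence in the batch; the step yields `batch_size` tokens.
    bytes_per_step = batch_size * total_kv_size + parameter_size
    return batch_size * memory_bandwidth / bytes_per_step

# Example: ~16 GB of fp16 weights (8B params), a 1 GB KV cache per
# sequence, and 100 GB/s of memory bandwidth.
print(tokens_per_second(1, 100e9, 1e9, 16e9))  # ~5.9 tokens/s
```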

## License

MIT