---
title: On-Device LLM Throughput Calculator 
emoji: πŸš€
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit 
---


# On-Device LLM Throughput Calculator

A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve when decoding is limited by the device's memory bandwidth. It supports several attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention impacts throughput at different context lengths.
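
For intuition, here is a minimal sketch (independent of the app's own code) of how the KV cache, and therefore the memory traffic per decoded token, shrinks as the number of KV heads drops or a sliding window caps the cached context. All model dimensions below are illustrative assumptions.

```python
# Minimal sketch, not the app's implementation: KV-cache size for one
# sequence, for attention variants that differ only in the number of KV
# heads, with an optional sliding window capping the cached positions.
# All dimensions below are illustrative assumptions.

def kv_cache_bytes(context_len, num_layers, head_dim, num_kv_heads,
                   bytes_per_value=2, window=None):
    """KV-cache bytes for one sequence at a given context length."""
    cached_positions = min(context_len, window) if window else context_len
    # Factor of 2 covers keys and values; one pair per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * cached_positions

# Hypothetical 8B-class model: 32 layers, head_dim 128, fp16 cache.
gqa = kv_cache_bytes(8192, num_layers=32, head_dim=128, num_kv_heads=8)  # GQA, 8 KV heads
mqa = kv_cache_bytes(8192, num_layers=32, head_dim=128, num_kv_heads=1)  # MQA, 1 KV head
swa = kv_cache_bytes(8192, num_layers=32, head_dim=128, num_kv_heads=8,
                     window=4096)                                        # GQA + sliding window
print(f"GQA: {gqa/1e9:.2f} GB  MQA: {mqa/1e9:.2f} GB  GQA+SWA: {swa/1e9:.2f} GB")
```

MLA is not covered by this sketch because it caches a compressed latent per token rather than full per-head keys and values, which reduces the cache further.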

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on memory bandwidth bottlenecks as described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput).

The basic formula for decode tokens per second (bandwidth in bytes per second, sizes in bytes):
```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
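
As a hedged sketch of how that formula can be evaluated (the device and model numbers below are illustrative assumptions, not values from the app):

```python
# Minimal sketch of the formula above; all inputs are in bytes or bytes/s.

def tokens_per_second(batch_size, memory_bandwidth, total_kv_size, parameter_size):
    """Theoretical decode throughput when limited purely by memory bandwidth.

    memory_bandwidth: bytes/s the device can stream from memory
    total_kv_size:    KV-cache bytes read per decode step for one sequence
    parameter_size:   bytes of model weights read per decode step
    """
    bytes_per_step = batch_size * total_kv_size + parameter_size
    return batch_size * memory_bandwidth / bytes_per_step

# Illustrative numbers: 100 GB/s device, 8B parameters in fp16 (~16 GB),
# ~1 GB of KV cache per sequence.
print(tokens_per_second(batch_size=1, memory_bandwidth=100e9,
                        total_kv_size=1e9, parameter_size=16e9))  # ~5.9 tokens/s
```

Larger batches amortize the weight reads across more tokens, which is why throughput grows with batch size until the KV-cache term dominates.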

## License

MIT