# On-Device LLM Throughput Calculator

A Gradio web application for visualizing LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve on a device whose decoding speed is limited by memory bandwidth. It supports different attention mechanisms:

- Grouped-Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding-window attention affects throughput at different context lengths; the sketch below shows how the window caps KV-cache growth.
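
What separates these mechanisms is the KV-cache footprint per token: GQA stores keys and values for a reduced number of KV heads, MQA is the special case with a single KV head, and MLA caches a small compressed latent per token instead of full keys and values. A minimal sketch of that accounting, with hypothetical parameter names rather than the app's actual API (and ignoring details such as MLA's extra decoupled RoPE key dimensions):

```python
def kv_bytes_per_token(num_layers, head_dim, num_kv_heads=None,
                       latent_dim=None, bytes_per_element=2):
    """Approximate KV-cache bytes stored per token (hypothetical helper).

    GQA: 2 (K and V) * num_kv_heads * head_dim per layer.
    MQA: the same with num_kv_heads=1.
    MLA: a single compressed latent of latent_dim per layer.
    """
    if latent_dim is not None:            # MLA: cache the latent, not K/V
        per_layer = latent_dim
    else:                                 # GQA (MQA when num_kv_heads == 1)
        per_layer = 2 * num_kv_heads * head_dim
    return num_layers * per_layer * bytes_per_element


def kv_cache_bytes(context_len, sliding_window=None, **mechanism):
    """Total KV-cache bytes per sequence at a given context length.

    With sliding-window attention only the last `sliding_window` tokens
    are cached, so the KV term stops growing beyond the window.
    """
    cached = min(context_len, sliding_window) if sliding_window else context_len
    return cached * kv_bytes_per_token(**mechanism)
```

With illustrative numbers, a 32-layer model with 8 KV heads of dimension 128 caches 128 KiB per token in fp16, while an MLA latent of size 512 would cache only 32 KiB.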

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding-window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput (sketched below)
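
Steps 1–3 choose the inputs to the formula in the Theory section below; step 4 sweeps the context length and plots the resulting curve. A rough sketch of that sweep, reusing the hypothetical `kv_cache_bytes` helper above with illustrative numbers (not the app's actual code):

```python
import matplotlib.pyplot as plt

memory_bandwidth = 100e9        # bytes/s (illustrative device)
parameter_size = 8e9 * 2        # 8B parameters at 2 bytes each (fp16/bf16)
batch_size = 1

context_lengths = [2**n for n in range(8, 18)]     # 256 ... 128k tokens
tokens_per_second = []
for ctx in context_lengths:
    kv = kv_cache_bytes(ctx, num_layers=32, head_dim=128, num_kv_heads=8)
    tokens_per_second.append(
        batch_size * memory_bandwidth / (batch_size * kv + parameter_size))

plt.plot(context_lengths, tokens_per_second)
plt.xscale("log")
plt.xlabel("context length (tokens)")
plt.ylabel("theoretical tokens/s")
plt.show()
```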

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on the memory-bandwidth bottleneck described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput).

The basic formula for tokens per second is:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
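
The intuition: each decoding step must stream every parameter plus every sequence's KV cache through memory once, so step time is the total bytes moved divided by the bandwidth, and each step yields one token per sequence in the batch. Plugging in illustrative numbers (not tied to any particular device or model):

```python
memory_bandwidth = 100e9       # bytes/s (illustrative device)
parameter_size = 8e9 * 2       # 8B parameters at 2 bytes each (fp16/bf16)
total_kv_size = 4096 * 131072  # 4k-token context at 128 KiB of KV per token
                               # (e.g. 32 layers * 2 * 8 KV heads * 128 dim * 2 bytes)
batch_size = 4

step_time = (batch_size * total_kv_size + parameter_size) / memory_bandwidth
tokens_per_second = batch_size / step_time    # same as the formula above
print(f"{tokens_per_second:.1f} tok/s")       # ~22 tok/s for these numbers
```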

## License

MIT