---
title: On-Device LLM Throughput Calculator
emoji: 🚀
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---

# On-Device LLM Throughput Calculator

A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that can be achieved by a Large Language Model (LLM) running on devices with memory bandwidth constraints. It supports different attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention impacts throughput at different context lengths.
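
These mechanisms differ mainly in how much KV cache each decode step has to read, and sliding window attention caps that cost at the window size. Below is a minimal sketch of the arithmetic with hypothetical parameter names, assuming fp16 cache entries (it is not the app's code):

```python
def kv_cache_bytes(context_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   window: int | None = None) -> int:
    """Bytes of KV cache read per decode step for one sequence."""
    # Sliding window attention only keeps the most recent `window` tokens.
    effective_len = min(context_len, window) if window else context_len
    # Keys and values (2x) are cached at every layer.
    # MQA uses num_kv_heads=1; GQA uses a small group (e.g. 8).
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * effective_len

# Full attention grows linearly with context; a 4096-token window caps it.
full = kv_cache_bytes(131072, num_layers=32, num_kv_heads=8, head_dim=128)
capped = kv_cache_bytes(131072, num_layers=32, num_kv_heads=8, head_dim=128,
                        window=4096)
print(full / 1e9, capped / 1e9)  # ~17.2 GB vs ~0.54 GB
```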

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
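
Under the hood, these steps amount to feeding a few numbers into the throughput formula (see Theory below). Here is a minimal Gradio sketch with hypothetical labels and an fp16-weights assumption, just to show the shape of such an interface; the actual app in src/app.py is more complete:

```python
import gradio as gr

def estimate(bandwidth_gbs: float, params_b: float, kv_gb: float, batch: float) -> str:
    # fp16 weights: 2 bytes per parameter (an assumption, not the app's setting).
    param_bytes = params_b * 2e9
    tps = batch * bandwidth_gbs * 1e9 / (batch * kv_gb * 1e9 + param_bytes)
    return f"{tps:.1f} tokens/s"

demo = gr.Interface(
    fn=estimate,
    inputs=[
        gr.Number(label="Memory bandwidth (GB/s)", value=100),
        gr.Number(label="Parameters (billions)", value=8),
        gr.Number(label="KV cache per sequence (GB)", value=1),
        gr.Number(label="Batch size", value=1),
    ],
    outputs=gr.Textbox(label="Theoretical throughput"),
)

if __name__ == "__main__":
    demo.launch()
```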

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on memory bandwidth bottlenecks as described in the JAX ML Scaling Book.

Decoding is memory-bandwidth bound: each decode step must stream every model weight once, plus the KV cache of each sequence in the batch, and produces one token per sequence. The basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```

where `total_kv_size` and `parameter_size` are in bytes and `memory_bandwidth` is in bytes per second.
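
A direct Python rendering of this formula, with illustrative numbers rather than measurements:

```python
def tokens_per_second(batch_size: int, memory_bandwidth: float,
                      total_kv_size: float, parameter_size: float) -> float:
    """Theoretical decode throughput when memory bandwidth is the bottleneck."""
    # Bytes streamed per decode step: all weights once, plus one KV cache
    # per sequence in the batch; the step yields `batch_size` tokens.
    bytes_per_step = batch_size * total_kv_size + parameter_size
    return batch_size * memory_bandwidth / bytes_per_step

# Example: ~16 GB of fp16 weights (8B params), a 1 GB KV cache per
# sequence, and 100 GB/s of memory bandwidth.
print(tokens_per_second(1, 100e9, 1e9, 16e9))  # ~5.9 tokens/s
```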

## License

MIT