Abstract
FLARE, a linear-complexity self-attention mechanism, improves scalability and accuracy for neural PDE surrogates on large unstructured meshes.
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among N tokens by projecting the input sequence onto a fixed-length latent sequence of M ≪ N tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at O(NM) cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
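A minimal single-head PyTorch sketch of this routing, assuming one plausible key/value layout (module and layer names are ours; the released implementation is at the repository above):

```python
import torch
import torch.nn as nn


class FlareSketch(nn.Module):
    """Single-head sketch: route attention through M learnable latent query tokens.

    Illustrative only -- layer names and the exact key/value layout are assumptions,
    not the released implementation.
    """

    def __init__(self, dim: int, num_latents: int):
        super().__init__()
        # M learnable query tokens that define the latent (bottleneck) sequence
        self.latent_q = nn.Parameter(torch.randn(num_latents, dim) * dim**-0.5)
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) features at N mesh points
        bsz, n, d = x.shape
        scale = d ** -0.5
        q, k, v = self.q(x), self.k(x), self.v(x)
        lq = self.latent_q.expand(bsz, -1, -1)                            # (batch, M, dim)
        # Project: M latent queries attend to the N tokens, cost O(N*M)
        proj = torch.softmax(lq @ k.transpose(-2, -1) * scale, dim=-1)    # (batch, M, N)
        z = proj @ v                                                      # latent sequence (batch, M, dim)
        # Unproject: the N tokens attend back to the M latent tokens, cost O(N*M)
        unproj = torch.softmax(q @ z.transpose(-2, -1) * scale, dim=-1)   # (batch, N, M)
        return unproj @ z                                                 # (batch, N, dim)
```

Note that no N-by-N attention matrix is ever formed; the two softmax factors are exactly the projection and unprojection discussed in the community post below.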
Community
FLARE is a novel token mixing layer that bypasses the quadratic cost of self-attention by leveraging low-rankness. FLARE is built on the argument that projecting input sequences onto shorter latent sequences, and then unprojecting to the original sequence length, is equivalent to constructing a low-rank form of attention with rank at most equal to the number of latent tokens (see figure below).
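In symbols (our notation; the exact normalization and whether the latent sequence is re-projected into separate keys and values are assumptions), the project-then-unproject pipeline composes two thin attention matrices, so the implied N-by-N mixing matrix has rank at most M:

```latex
% Projection: M learnable latent queries Q_L attend to the N input tokens.
A = \operatorname{softmax}\!\left(\frac{Q_L K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{M \times N},
\qquad Z = A V \in \mathbb{R}^{M \times d}
% Unprojection: the N input tokens attend back to the M latent tokens.
B = \operatorname{softmax}\!\left(\frac{Q Z^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times M},
\qquad Y = B Z = (B A)\, V
% The effective token-mixing matrix BA is N x N but has rank at most M << N,
% is never formed explicitly, and each stage costs O(NMd).
```

Any feature-space map applied to Z between the two stages acts on the d dimension, not the token dimension, so the rank bound on the token-mixing matrix is unchanged.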
Furthermore, we argue that multiple simultaneous low-rank projections could collectively capture a full attention pattern. Our design allocates a distinct slice of the latent tokens to each head, yielding distinct projection matrices per head. This allows each head to learn independent attention relationships, opening up a key direction of scaling and exploration, wherein each head may specialize in distinct routing patterns.
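A hedged multi-head sketch of this per-head allocation, assuming the latent tokens are stored as one tensor and sliced per head (shapes and names are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerHeadFlareSketch(nn.Module):
    """Each head owns a distinct slice of the latent tokens, hence distinct projections."""

    def __init__(self, dim: int, num_heads: int, latents_per_head: int):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        # (H, M, dh): head h only ever uses its own slice of learnable latent queries
        self.latent_q = nn.Parameter(
            torch.randn(num_heads, latents_per_head, self.dh) * self.dh**-0.5
        )
        self.q = nn.Linear(dim, dim, bias=False)
        self.kv = nn.Linear(dim, 2 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, n, d = x.shape
        heads = lambda t: t.view(bsz, n, self.h, self.dh).transpose(1, 2)  # (B, H, N, dh)
        q = heads(self.q(x))
        k, v = map(heads, self.kv(x).chunk(2, dim=-1))
        lq = self.latent_q.expand(bsz, -1, -1, -1)                         # (B, H, M, dh)
        # Both stages are standard fused attention calls with distinct per-head projections.
        z = F.scaled_dot_product_attention(lq, k, v)                       # project:   (B, H, M, dh)
        y = F.scaled_dot_product_attention(q, z, z)                        # unproject: (B, H, N, dh)
        return self.out(y.transpose(1, 2).reshape(bsz, n, d))
```

Because every head draws on its own slice of `latent_q`, its projection and unprojection matrices are independent of the other heads', which is what lets heads specialize in distinct routing patterns.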
FLARE is built entirely from standard fused attention primitives, ensuring high GPU utilization and ease of integration into existing transformer architectures. By replacing full self-attention with low-rank projections and reconstructions, FLARE achieves linear complexity in the number of points (see plot below). As such, FLARE enables end-to-end training on unstructured meshes with one million points on a single GPU – the largest scale demonstrated for transformer-based PDE surrogates.
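As a rough, hardware-dependent illustration of the scaling claim (the sizes below are assumptions, not the paper's benchmark configuration), the two routing stages map directly onto PyTorch's fused `scaled_dot_product_attention` and can be timed at increasing N:

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, M, dh = 1, 8, 64, 32                           # assumed sizes, for illustration only
latent_q = torch.randn(B, H, M, dh, device=device)

with torch.no_grad():
    # The largest N may need a GPU with several GB of memory; shrink N on CPU.
    for N in (10_000, 100_000, 1_000_000):           # time/memory should grow ~linearly in N
        x = torch.randn(B, H, N, dh, device=device)
        t0 = time.perf_counter()
        z = F.scaled_dot_product_attention(latent_q, x, x)   # project:   (B, H, M, dh)
        y = F.scaled_dot_product_attention(x, z, z)          # unproject: (B, H, N, dh)
        if device == "cuda":
            torch.cuda.synchronize()
        print(f"N={N:>9,}  output {tuple(y.shape)}  {time.perf_counter() - t0:.3f}s")
```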