Abstract
FLARE, a linear-complexity self-attention mechanism, improves scalability and accuracy for neural PDE surrogates on large unstructured meshes.
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among N tokens by projecting the input sequence onto a fixed-length latent sequence of M ≪ N tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at O(NM) cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
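A minimal single-head PyTorch sketch of this routing, assuming one plausible key/value layout (module and layer names are ours; the released implementation is at the repository above):

```python
import torch
import torch.nn as nn


class FlareSketch(nn.Module):
    """Single-head sketch: route attention through M learnable latent query tokens.

    Illustrative only -- layer names and the exact key/value layout are assumptions,
    not the released implementation.
    """

    def __init__(self, dim: int, num_latents: int):
        super().__init__()
        # M learnable query tokens that define the latent (bottleneck) sequence
        self.latent_q = nn.Parameter(torch.randn(num_latents, dim) * dim**-0.5)
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) features at N mesh points
        bsz, n, d = x.shape
        scale = d ** -0.5
        q, k, v = self.q(x), self.k(x), self.v(x)
        lq = self.latent_q.expand(bsz, -1, -1)                            # (batch, M, dim)
        # Project: M latent queries attend to the N tokens, cost O(N*M)
        proj = torch.softmax(lq @ k.transpose(-2, -1) * scale, dim=-1)    # (batch, M, N)
        z = proj @ v                                                      # latent sequence (batch, M, dim)
        # Unproject: the N tokens attend back to the M latent tokens, cost O(N*M)
        unproj = torch.softmax(q @ z.transpose(-2, -1) * scale, dim=-1)   # (batch, N, M)
        return unproj @ z                                                 # (batch, N, dim)
```

Note that no N-by-N attention matrix is ever formed; the two softmax factors are exactly the projection and unprojection discussed in the community post below.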
Community
FLARE is a novel token mixing layer that bypasses the quadratic cost of self-attention by leveraging low-rankness. FLARE is built on the argument that projecting input sequences onto shorter latent sequences, and then unprojecting to the original sequence length, is equivalent to constructing a low-rank form of attention with rank at most equal to the number of latent tokens (see figure below).
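In symbols (our notation; the exact normalization and whether the latent sequence is re-projected into separate keys and values are assumptions), the project-then-unproject pipeline composes two thin attention matrices, so the implied N-by-N mixing matrix has rank at most M:

```latex
% Projection: M learnable latent queries Q_L attend to the N input tokens.
A = \operatorname{softmax}\!\left(\frac{Q_L K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{M \times N},
\qquad Z = A V \in \mathbb{R}^{M \times d}
% Unprojection: the N input tokens attend back to the M latent tokens.
B = \operatorname{softmax}\!\left(\frac{Q Z^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times M},
\qquad Y = B Z = (B A)\, V
% The effective token-mixing matrix BA is N x N but has rank at most M << N,
% is never formed explicitly, and each stage costs O(NMd).
```

Any feature-space map applied to Z between the two stages acts on the d dimension, not the token dimension, so the rank bound on the token-mixing matrix is unchanged.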
Furthermore, we argue that multiple simultaneous low-rank projections could collectively capture a full attention pattern. Our design allocates a distinct slice of the latent tokens to each head, yielding distinct projection matrices per head. This allows each head to learn independent attention relationships, opening up a key direction of scaling and exploration, wherein each head may specialize in distinct routing patterns.
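A hedged multi-head sketch of this per-head allocation, assuming the latent tokens are stored as one tensor and sliced per head (shapes and names are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerHeadFlareSketch(nn.Module):
    """Each head owns a distinct slice of the latent tokens, hence distinct projections."""

    def __init__(self, dim: int, num_heads: int, latents_per_head: int):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        # (H, M, dh): head h only ever uses its own slice of learnable latent queries
        self.latent_q = nn.Parameter(
            torch.randn(num_heads, latents_per_head, self.dh) * self.dh**-0.5
        )
        self.q = nn.Linear(dim, dim, bias=False)
        self.kv = nn.Linear(dim, 2 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, n, d = x.shape
        heads = lambda t: t.view(bsz, n, self.h, self.dh).transpose(1, 2)  # (B, H, N, dh)
        q = heads(self.q(x))
        k, v = map(heads, self.kv(x).chunk(2, dim=-1))
        lq = self.latent_q.expand(bsz, -1, -1, -1)                         # (B, H, M, dh)
        # Both stages are standard fused attention calls with distinct per-head projections.
        z = F.scaled_dot_product_attention(lq, k, v)                       # project:   (B, H, M, dh)
        y = F.scaled_dot_product_attention(q, z, z)                        # unproject: (B, H, N, dh)
        return self.out(y.transpose(1, 2).reshape(bsz, n, d))
```

Because every head draws on its own slice of `latent_q`, its projection and unprojection matrices are independent of the other heads', which is what lets heads specialize in distinct routing patterns.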
FLARE is built entirely from standard fused attention primitives, ensuring high GPU utilization and ease of integration into existing transformer architectures. By replacing full self-attention with low-rank projections and reconstructions, FLARE achieves linear complexity in the number of points (see plot below). As such, FLARE enables end-to-end training on unstructured meshes with one million points on a single GPU – the largest scale demonstrated for transformer-based PDE surrogates.
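As a rough, hardware-dependent illustration of the scaling claim (the sizes below are assumptions, not the paper's benchmark configuration), the two routing stages map directly onto PyTorch's fused `scaled_dot_product_attention` and can be timed at increasing N:

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, M, dh = 1, 8, 64, 32                           # assumed sizes, for illustration only
latent_q = torch.randn(B, H, M, dh, device=device)

with torch.no_grad():
    # The largest N may need a GPU with several GB of memory; shrink N on CPU.
    for N in (10_000, 100_000, 1_000_000):           # time/memory should grow ~linearly in N
        x = torch.randn(B, H, N, dh, device=device)
        t0 = time.perf_counter()
        z = F.scaled_dot_product_attention(latent_q, x, x)   # project:   (B, H, M, dh)
        y = F.scaled_dot_product_attention(x, z, z)          # unproject: (B, H, N, dh)
        if device == "cuda":
            torch.cuda.synchronize()
        print(f"N={N:>9,}  output {tuple(y.shape)}  {time.perf_counter() - t0:.3f}s")
```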