Efficient Sparse Mixture-of-Experts for Sub-Quadratic Inference on Long Contexts
NeurIPS 2025 (Accepted)
N. Meters, A. Chen, R. Kapoor
We present a sparse mixture-of-experts architecture that achieves sub-quadratic inference cost on sequences exceeding 128k tokens. By introducing a locality-sensitive routing mechanism that exploits the low-rank structure of attention patterns, our method reduces peak memory by 3.8x while maintaining 98.2% of dense model quality across standard long-context benchmarks. We provide theoretical guarantees on routing stability and demonstrate wall-clock speedups on commodity hardware.
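The abstract names the core mechanism (locality-sensitive routing over a low-rank projection of token representations) without specifying it. Below is a minimal sketch of that idea, assuming sign-based locality-sensitive hashing with random hyperplanes; all names, dimensions, and the NumPy implementation are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: route each token to an expert by hashing a low-rank
# projection of its representation. Nearby tokens land in the same bucket
# (and hence the same expert) with high probability.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_low, n_bits = 512, 32, 3            # 2**n_bits = 8 experts (assumed sizes)
proj = rng.standard_normal((d_model, d_low))   # assumed low-rank projection
planes = rng.standard_normal((d_low, n_bits))  # random hyperplanes for LSH

def route(tokens: np.ndarray) -> np.ndarray:
    """Map tokens of shape [seq, d_model] to expert ids in [0, 2**n_bits)."""
    low = tokens @ proj                        # exploit low-rank structure
    bits = (low @ planes) > 0                  # sign pattern = LSH bucket
    return bits @ (1 << np.arange(n_bits))     # pack sign bits into an integer id

tokens = rng.standard_normal((4096, d_model)).astype(np.float32)
expert_ids = route(tokens)                     # one expert id per token
```

Because each token touches only its hashed expert, per-token routing cost is independent of sequence length, which is consistent with the sub-quadratic inference claim; the paper's actual routing rule and stability guarantees are not reproduced here.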
transformers · mixture-of-experts · efficient-inference · long-context