Geometry-Aware Post-Training for Mixture-of-Experts Routers
Authors
Vijayakumaar Palanisamy, Visnu Ritesh ; Dabrowski, Hubert ; Tregers, Karlis
Term
4th semester
Education
Publication year
2026
Submitted on
2026-05-29
Abstract
Many existing methods for controlling Mixture-of-Experts (MoE) routers target aggregate objectives such as overall expert usage, capacity limits, and numerical stability. They do not, however, require that similar hidden states (the model's internal representations) be routed to similar experts. This thesis introduces Geometry-Aware Router Post-training (GARP), a regularization method applied after training that updates only the router. GARP builds local k-nearest-neighbour graphs from pre-router hidden states, which are held fixed and receive no gradient updates, and penalizes disagreement between the router's probability distributions over experts at neighbouring points. With experts frozen and the standard Top-K dispatch left unchanged, GARP reduces the coefficient of variation of expert load by 12.9%, the router-neighbour logit distance by 9.2%, and the Prompt Sensitivity Score by 1.6%, while keeping average downstream performance nearly unchanged. These results suggest that the geometric structure of hidden states is a useful signal for organizing sparse routing in MoE models. At the tested scale, the benefits are primarily structural rather than accuracy-driven, and some trade-offs remain.
Many current methods for controlling Mixture-of-Experts (MoE) routers focus on overall goals such as total expert usage, capacity limits, and numerical stability. They do not, however, require that similar hidden states (the model's internal representations) be sent to similar experts. This thesis introduces Geometry-Aware Router Post-training (GARP), a method that modifies only the router after training. GARP constructs local k-nearest-neighbour graphs from hidden states that are taken before the router and held fixed, and then penalizes cases where the router's probability distributions over experts disagree for neighbouring points. The results indicate that the geometric structure of the hidden states is useful information for organizing sparse routing in MoE models. When the experts are kept frozen and the normal Top-K selection is preserved, GARP reduces the variation in load between experts (the coefficient of variation) by 12.9%, the distance between the router's logits for neighbours by 9.2%, and the Prompt Sensitivity Score by 1.6%, while average performance on downstream tasks is largely unchanged. At the scale studied, the improvements are primarily structural and not driven by higher accuracy, and certain trade-offs remain.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
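The abstract describes the method only in words. The following is a minimal illustrative sketch, not the thesis implementation, of the two quantities it mentions: a neighbour-disagreement penalty over a k-nearest-neighbour graph built from fixed hidden states, and the coefficient of variation of expert load under Top-K selection. The function names, the Euclidean metric for the graph, and the symmetric KL divergence used as the disagreement measure are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def knn_indices(hidden, k):
    """For each point, the indices of its k nearest other points.

    Assumption: Euclidean distance on the pre-router hidden states.
    """
    neighbours = []
    for i, h in enumerate(hidden):
        dists = [(math.dist(h, other), j)
                 for j, other in enumerate(hidden) if j != i]
        dists.sort()
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p); assumes strictly positive probabilities."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def neighbour_disagreement(hidden, router_logits, k):
    """Mean divergence between router distributions of kNN neighbours.

    The graph depends only on `hidden`, which is treated as a constant,
    mirroring the abstract's "held fixed" hidden states.
    """
    probs = [softmax(row) for row in router_logits]
    graph = knn_indices(hidden, k)
    total, edges = 0.0, 0
    for i, nbrs in enumerate(graph):
        for j in nbrs:
            total += symmetric_kl(probs[i], probs[j])
            edges += 1
    return total / edges


def expert_load_cv(router_logits, top_k):
    """Coefficient of variation (std / mean) of per-expert token counts."""
    n_experts = len(router_logits[0])
    counts = [0] * n_experts
    for logits in router_logits:
        chosen = sorted(range(n_experts),
                        key=lambda e: logits[e], reverse=True)[:top_k]
        for e in chosen:
            counts[e] += 1
    mean = sum(counts) / n_experts
    var = sum((c - mean) ** 2 for c in counts) / n_experts
    return math.sqrt(var) / mean


# Hypothetical toy data: two tight clusters of hidden states, two experts.
hidden = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
aligned = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
mixed = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]

print(neighbour_disagreement(hidden, aligned, k=1))
print(neighbour_disagreement(hidden, mixed, k=1))
print(expert_load_cv(aligned, top_k=1))
```

In this toy setup the penalty is zero when neighbouring hidden states receive identical router distributions and positive when they do not. In an actual post-training run, a term of this kind would presumably be added to the router's training loss with experts frozen, but the exact loss form and weighting are not given in the abstract.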
Keywords
LLM ; MOE ; Mixture of Experts ; Router ; Routing of experts ; Post-training ; MoE Router
