An executive master's programme thesis from Aalborg University


Pipeline Parallelism Optimization for Distributed Foundation Models Under Unstable Wireless Links

Author

Term

4. semester

Publication year

2026

Submitted on

Abstract

This thesis examines how unreliable wireless connections affect pipeline-parallel training of large language models (foundation models). In the study, a GPT-2 model is split into four stages that distribute the computation across multiple devices. The system is evaluated under different wireless conditions: stable links, mobility, packet loss, and periods of strong signal obstruction. The thesis proposes a method called HeCoPipe, which extends an existing FTPipeHD-style baseline. HeCoPipe monitors queues in the system and automatically adjusts both how strongly gradients are compressed and how coarsely activations (intermediate model computations) are quantized. In addition, it uses a packet-loss recovery mechanism inspired by HARQ (Hybrid Automatic Repeat reQuest) to resend lost data more effectively. The results show that packet loss and time-varying wireless link capacity can significantly reduce throughput, that is, the speed at which training progresses in pipeline-parallel setups. Across all evaluated scenarios, HeCoPipe achieves higher throughput and lower queue depth than FTPipeHD. The largest benefit appears under mobility with packet loss, where HeCoPipe improves average throughput by up to a factor of 5.17. Overall, the findings indicate that wireless-aware adaptation can make pipeline-parallel training more robust, even when wireless links are unstable.

This thesis investigates how unstable wireless connections affect so-called pipeline-parallel training of large language models (Foundation Models). In the work, a GPT-2 model is divided into four steps (stages) that distribute the computations among several devices. The system is tested under different wireless conditions: stable connection, movement (mobility), packet loss, and periods of strong interference. The thesis proposes the method HeCoPipe, which builds on an existing FTPipeHD-type solution. HeCoPipe monitors queues in the system and can automatically adapt both how much the gradients exchanged during training are compressed and how coarsely activations (intermediate computations in the model) are quantized. In addition, a mechanism inspired by HARQ (Hybrid Automatic Repeat reQuest) is used to handle packet loss better by resending data in an intelligent way. The results show that packet loss and varying wireless capacity can markedly lower the throughput (how fast training progresses) in pipeline-parallel training. In all the scenarios examined, HeCoPipe achieves higher throughput and less queue build-up than FTPipeHD. The largest improvement is seen when both mobility and packet loss are present, where HeCoPipe can increase the average throughput by up to 5.17 times. Taken together, the results indicate that wireless-adapted methods can make pipeline-parallel training more robust, even when the wireless connections are unstable.

[This abstract has been rewritten with the help of AI based on the project's original abstract]
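
The queue-driven adaptation described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not say which compression or quantization schemes HeCoPipe uses, so top-k sparsification and uniform min-max quantization are stand-ins chosen here, and every name and threshold (`LinkPolicy`, `POLICIES`, `choose_policy`, the depth limits 4 and 8) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkPolicy:
    keep_ratio: float  # fraction of gradient entries kept (assumed top-k scheme)
    act_bits: int      # bit width for activation quantization (assumed uniform)


# Hypothetical thresholds: (maximum queue depth, policy used up to that depth).
POLICIES = [
    (4, LinkPolicy(keep_ratio=1.0, act_bits=16)),   # short queue: no sparsification
    (8, LinkPolicy(keep_ratio=0.25, act_bits=8)),   # growing queue: compress more
]
FALLBACK = LinkPolicy(keep_ratio=0.05, act_bits=4)  # congested link: compress hardest


def choose_policy(queue_depth: int) -> LinkPolicy:
    """Pick compression settings from the observed send-queue depth."""
    for max_depth, policy in POLICIES:
        if queue_depth <= max_depth:
            return policy
    return FALLBACK


def top_k_sparsify(grad: list[float], keep_ratio: float) -> list[float]:
    """Keep the largest-magnitude entries and zero out the rest."""
    k = max(1, round(len(grad) * keep_ratio))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def quantize(acts: list[float], bits: int) -> list[float]:
    """Snap activations to 2**bits evenly spaced levels between min and max."""
    lo, hi = min(acts), max(acts)
    if hi == lo:
        return list(acts)  # constant input: nothing to quantize
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((a - lo) / step) * step for a in acts]


if __name__ == "__main__":
    for depth in (2, 6, 20):
        print(depth, choose_policy(depth))
```

In this sketch a deeper queue is treated as a sign that the link cannot keep up, so the sender trades precision for smaller messages. A real controller would likely also need hysteresis so that settings do not oscillate when the queue depth hovers around a threshold.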