Revisiting Bilevel Optimization for Aligning Self-Supervised Pretraining with Downstream Fine-Tuning: Advancing BiSSL Through Systematic Variations, Novel Design Modifications, and Adaptation to New Data Domains
Author
Term
4th semester
Education
Publication year
2025
Submitted on
2025-07-28
Pages
99
Abstract
The BiSSL framework models the pipeline of self-supervised pretraining followed by downstream fine-tuning as the lower and upper levels of a bilevel optimization problem. The lower-level parameters are additionally regularized to resemble those of the upper level, which collectively yields a pretrained model initialization that is better aligned with the downstream task. This project extends the study of BiSSL by first evaluating its sensitivity to hyperparameter variations. Design modifications, including adaptive lower-level regularization scaling and generalized upper-level gradient expressions, are then proposed and tested. Lastly, BiSSL is adapted to natural language processing tasks using a generative pretrained transformer pretext task and evaluated on a range of diverse downstream tasks. Results show that BiSSL is robust to variations in most of its hyperparameters, provided that the training duration is sufficiently long. The proposed design modifications yield no consistent improvements and may even degrade performance. On natural language processing tasks, BiSSL achieves occasional gains and otherwise matches the baseline. Overall, the findings suggest that the original BiSSL design is robust, effective, and able to improve downstream accuracy across input domains.
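To make the structure concrete, the pipeline described in the abstract can be sketched as a bilevel problem in which the upper level fine-tunes on the downstream task and the lower level performs self-supervised pretraining while its parameters are pulled toward the upper-level ones. The notation below (the losses L_FT and L_SSL, the weight lambda, and the squared-norm proximity term) is an illustrative assumption, not the exact formulation used in BiSSL.

\[
\begin{aligned}
\min_{\theta_{\mathrm{U}}} \quad & \mathcal{L}_{\mathrm{FT}}\big(\theta_{\mathrm{U}};\, \theta_{\mathrm{L}}^{*}(\theta_{\mathrm{U}})\big) \\
\text{s.t.} \quad & \theta_{\mathrm{L}}^{*}(\theta_{\mathrm{U}}) \in \arg\min_{\theta_{\mathrm{L}}} \; \mathcal{L}_{\mathrm{SSL}}(\theta_{\mathrm{L}}) + \frac{\lambda}{2}\,\big\lVert \theta_{\mathrm{L}} - \theta_{\mathrm{U}} \big\rVert_{2}^{2}
\end{aligned}
\]

In this reading, the lower-level solution supplies the pretrained initialization handed to downstream fine-tuning, and the weight on the proximity term controls how strongly the pretext parameters are regularized toward the downstream ones; adaptively scaling this weight and generalizing the upper-level gradient expression are among the design modifications examined in the project.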