A master's thesis from Aalborg University


Goal-conditioned Reinforcement Learning with Transformer Architectures and Intrinsic Reward Design

Author

Term

4th term

Publication year

2023

Submitted on

Pages

24

Abstract

This thesis explores goal-conditioned reinforcement learning (GCRL) by applying transformer architectures in both online and offline settings and by proposing a simple intrinsic reward mechanism. First, a decoder-only transformer is integrated into Goal-conditioned Supervised Learning (GCSL); successful training requires careful balancing of optimization steps and data collection to prevent model collapse, with additional gains over the baseline achieved via variance and covariance regularization and a gated fusion mechanism that merges goal information with sequences of state and action tokens. Next, the study tests whether these modifications transfer to the offline supervised setting through the Decision Transformer: in practice, only the gated fusion mechanism consistently provides stable learning and slight performance improvements, while also simplifying goal specification by using a single “reward” token rather than returns-to-go at every timestep. Finally, a GCRL method is proposed that trains a goal-conditioned dynamics model alongside a DRL agent, where the agent receives an intrinsic reward by maximizing the negative squared L2 distance between the actual next-state representation and the model’s predicted representation. This approach outperforms a discrete PPO baseline in a simple 10x10 gridworld but does not carry over to more complex environments such as CarRacing. The work suggests future directions including evaluating the modified Decision Transformer on larger offline datasets, extending the intrinsic-reward approach to more complex tasks, and exploring consciousness-inspired frameworks to improve intent and alignment in RL.
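The intrinsic reward described above can be sketched in a few lines. This is a minimal, illustrative implementation assuming the dynamics model and the encoder each output a fixed-length vector representation; the function name and the example vectors are hypothetical, not taken from the thesis.

```python
def intrinsic_reward(next_state_repr, predicted_repr):
    """Negative squared L2 distance between the actual next-state
    representation and the dynamics model's predicted representation.
    The reward is at most 0, reached only when the prediction is exact,
    so maximizing it drives the two representations together."""
    diff = [a - b for a, b in zip(next_state_repr, predicted_repr)]
    return -sum(d * d for d in diff)

# Example with hypothetical 3-dimensional representations:
actual = [0.5, -1.0, 2.0]
predicted = [0.5, -1.0, 1.0]
print(intrinsic_reward(actual, predicted))  # -1.0
print(intrinsic_reward(actual, actual))     # 0.0 for a perfect prediction
```

Note that the reward is dense and bounded above by zero, which is one plausible reason the approach works in a small 10x10 gridworld: every step gives a learning signal even before the goal is reached.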


[This abstract has been generated with the help of AI directly from the project full text]