Navigating in a Simulated Environment with Curriculum-based Reinforcement Learning
Author
Martinkevic, Jevgenij
Term
4th term
Education
Publication year
2018
Submitted on
2018-08-31
Pages
46
Abstract
Reinforcement Learning (RL) has recently achieved strong results in locomotion, navigation, robotics, and games of varying complexity. A key challenge, however, is reward design: translating the task goal into a reward signal that an agent can learn from. This is often difficult and time-consuming, especially with sparse rewards, where the agent receives feedback only rarely. This thesis investigates a curriculum-based RL approach to solve a navigation task using only sparse rewards. Training is carried out in a simulated environment built in the Unity Engine, and agents are trained with Proximal Policy Optimization (PPO) using the Unity Machine Learning Agents Toolkit. The results show that a typical RL agent relying only on sparse rewards could not solve the task within the available time. In contrast, an environment-centered curriculum—decomposing the task into lessons of increasing difficulty—enabled the agent to solve the target task and achieve a 99% success rate. The thesis also examines combining curriculum learning with reward shaping (adding more detailed intermediate rewards). The observations indicate that this combination can negatively affect training when the two approaches overlap and encourage the same type of behavior, which can confuse learning.
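The environment-centered curriculum described above can be illustrated with a minimal, generic sketch: the environment's difficulty advances to the next "lesson" once the agent's recent success rate crosses a threshold. This is illustrative only and does not reproduce the thesis's actual Unity ML-Agents configuration; the class, parameter names, and thresholds here are hypothetical.

```python
# Generic sketch of environment-centered curriculum logic (illustrative;
# not the thesis's actual setup). Difficulty advances to the next lesson
# once the agent's recent success rate crosses a threshold.
from collections import deque


class Curriculum:
    def __init__(self, lessons, threshold=0.8, window=100):
        self.lessons = lessons        # e.g. increasing distances to the goal
        self.threshold = threshold    # success rate required to advance
        self.results = deque(maxlen=window)  # rolling window of outcomes
        self.index = 0                # current lesson index

    @property
    def current_lesson(self):
        return self.lessons[self.index]

    def record_episode(self, success: bool):
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        if window_full and rate >= self.threshold and self.index < len(self.lessons) - 1:
            self.index += 1           # move to the harder lesson
            self.results.clear()      # re-measure success on the new lesson


# Usage: four hypothetical difficulty levels; simulate 150 successful episodes.
curriculum = Curriculum(lessons=[1, 2, 3, 4])
for _ in range(150):
    curriculum.record_episode(True)
print(curriculum.current_lesson)  # → 2 (advanced once after 100 successes)
```

In the thesis's setting the lesson parameter would instead be fed to the simulated Unity environment (for example as a reset parameter controlling task difficulty), while the sparse reward itself stays unchanged across lessons.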
[This abstract was generated with the help of AI]