A master's thesis from Aalborg University


A Multimodal Large Language Model for Music Captioning

Term

4th semester

Publication year

2024

Pages

40

Abstract

The goal of this project was to implement a multimodal model that combines an audio encoder with a large language model to generate a text description of a given song. A multimodal converter model was developed for captioning 10-second music clips. The model consistently generated descriptions, but it struggled with hallucinations and inaccuracies. Its performance was measured using BERTScore and a qualitative evaluation. Future work should prioritize fine-tuning the large language model together with the audio projection layer to address these issues. Beyond that, further research should explore other language models, improve the prompt, and try different audio encoders.
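
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the greedy matching at the core of BERTScore. It is an illustration only, not the evaluation code used in the thesis. Real BERTScore takes contextual token embeddings from a pretrained BERT-family model and can apply IDF weighting and baseline rescaling; here the embeddings are hand-written toy vectors, and the names `cosine` and `bertscore_like` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_like(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings.

    Assumption: each argument is a non-empty list of token embedding
    vectors. No IDF weighting or baseline rescaling is applied.
    """
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy 2-dimensional "embeddings" standing in for real model outputs.
reference = [[1.0, 0.0], [0.0, 1.0]]   # two distinct reference tokens
candidate = [[1.0, 0.0], [1.0, 0.0]]   # caption covers only the first one

p, r, f1 = bertscore_like(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case every candidate token has a perfect match in the reference, so precision is high, while one reference token is never covered, so recall is lower. This is the pattern one would expect for a caption that is accurate but incomplete.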