A Multimodal Large Language Model for Music Captioning
Authors
Term
4th semester
Education
Publication year
2024
Submitted on
2024-06-03
Pages
40
Abstract
The goal of this project was to implement a multimodal model, combining an audio encoder with a large language model, capable of generating music descriptions for a given song. A multimodal converter model was developed for captioning 10-second music clips. The model was consistently able to generate descriptions; however, it struggled with hallucinations and inaccuracies. Its performance was measured using BERTScore and a qualitative evaluation. Future work should prioritize fine-tuning the large language model together with the audio projection layer to address these issues. Beyond that, further research should explore other language models, improve the prompt, and try different audio encoders.
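The abstract describes connecting an audio encoder to a language model through a projection layer. A minimal NumPy sketch of such a projection is shown below; all dimensions, the frame count, and the initialization are illustrative assumptions and do not come from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not taken from the report):
AUDIO_DIM = 512   # output dimension of the audio encoder
LLM_DIM = 2048    # token-embedding dimension of the language model

# A linear projection mapping audio-frame embeddings into the LLM's
# embedding space, so they can be prepended to the text prompt as
# "soft" tokens before decoding the caption.
W = rng.normal(0.0, 0.02, size=(AUDIO_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def project_audio(frames: np.ndarray) -> np.ndarray:
    """Map (n_frames, AUDIO_DIM) audio embeddings to (n_frames, LLM_DIM)."""
    return frames @ W + b

# A 10-second clip encoded into, e.g., 25 frames (assumed frame rate).
audio_embeddings = rng.normal(size=(25, AUDIO_DIM))
soft_prompt = project_audio(audio_embeddings)
print(soft_prompt.shape)  # (25, 2048)
```

In practice such a layer would be trained jointly with (or ahead of) the language model, which is exactly the fine-tuning direction the abstract proposes.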
Documents
