A Multimodal Large Language Model for Music Captioning
Authors
Term
4th semester
Education
Publication year
2024
Submitted on
2024-06-03
Pages
40
Abstract
The goal of this project was to implement a multimodal model, combining an audio encoder with a large language model, capable of generating music descriptions for a given song. A multimodal converter model was developed for captioning 10-second music clips. The model was consistently able to generate descriptions; however, it struggled with hallucinations and inaccuracies. Its performance was measured using BERTScore and a qualitative evaluation. Future work should prioritize fine-tuning the large language model together with the audio projection layer to address these issues. Beyond that, further research should explore other language models, improve the prompt, and try different audio encoders.
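The abstract describes connecting an audio encoder to a language model through a projection layer. A minimal NumPy sketch of such a projection is shown below; all dimensions, the frame count, and the initialization are illustrative assumptions and do not come from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not taken from the report):
AUDIO_DIM = 512   # output dimension of the audio encoder
LLM_DIM = 2048    # token-embedding dimension of the language model

# A linear projection mapping audio-frame embeddings into the LLM's
# embedding space, so they can be prepended to the text prompt as
# "soft" tokens before decoding the caption.
W = rng.normal(0.0, 0.02, size=(AUDIO_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def project_audio(frames: np.ndarray) -> np.ndarray:
    """Map (n_frames, AUDIO_DIM) audio embeddings to (n_frames, LLM_DIM)."""
    return frames @ W + b

# A 10-second clip encoded into, e.g., 25 frames (assumed frame rate).
audio_embeddings = rng.normal(size=(25, AUDIO_DIM))
soft_prompt = project_audio(audio_embeddings)
print(soft_prompt.shape)  # (25, 2048)
```

In practice such a layer would be trained jointly with (or ahead of) the language model, which is exactly the fine-tuning direction the abstract proposes.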
Documents
