Term: 4th semester
Publication year: 2024
Submitted on: 2024-06-03
Pages: 40
Abstract
The goal of this project was to implement a multimodal model that combines an audio encoder with a large language model to generate text descriptions of music. A multimodal converter model was developed for captioning 10-second music clips. The model consistently generated descriptions, but it struggled with hallucinations and inaccuracies. Its performance was measured using BERTScore and a qualitative evaluation. Future work should prioritize fine-tuning the large language model together with the audio projection layer to address these issues. Beyond that, further research should explore other language models, improve the prompt, and try different audio encoders.
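As a rough illustration of the greedy matching idea behind BERTScore, the sketch below scores a candidate caption against a reference caption using cosine similarity between token embeddings. It is a minimal sketch and not the project's evaluation code: the two-dimensional vectors in `TOY_EMBEDDINGS` and the function names are hypothetical, whereas the real metric uses contextual embeddings from a BERT-style model and offers optional extras such as idf weighting, which are omitted here.

```python
import math

# Hypothetical toy embeddings; the real metric uses contextual
# embeddings from a pretrained BERT-style model instead.
TOY_EMBEDDINGS = {
    "calm": [1.0, 0.0],
    "piano": [0.0, 1.0],
    "guitar": [1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference):
    """Return (precision, recall, f1) from greedy cosine matching.

    candidate and reference are non-empty lists of token vectors.
    """
    # Precision: each candidate token is matched to its most
    # similar reference token, then the similarities are averaged.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: the same matching, seen from the reference side.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def embed(tokens):
    """Look up the toy vector for each token."""
    return [TOY_EMBEDDINGS[t] for t in tokens]


if __name__ == "__main__":
    generated = embed(["calm", "guitar"])
    ground_truth = embed(["calm", "piano"])
    print(greedy_match_score(generated, ground_truth))
```

Because every token is matched to its closest counterpart rather than to an identical string, a caption that uses different but related words can still score well. That also means a high score does not rule out hallucinated details, which is one reason to pair such a metric with a qualitative evaluation.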