A master's thesis from Aalborg University


A Multimodal Large Language Model for Music Captioning

Authors

; ;

Term

4th semester

Publication year

2024

Submitted on

Pages

40

Abstract

This project develops a multimodal model that listens to short music excerpts and writes a text description of them. The system combines an audio encoder (which converts sound into numerical representations) with a large language model that generates natural language. Specifically, we built a converter to caption 10-second music clips. The model consistently produced descriptions, but it suffered from hallucinations (details not supported by the audio) and other inaccuracies. We evaluated performance using BERTScore, which assesses semantic similarity between generated and reference descriptions, along with a qualitative evaluation. Future work should prioritize jointly fine-tuning the language model and the audio projection layer to reduce these errors, followed by exploring alternative language models, improved prompts, and different audio encoders.

Dette projekt udvikler en multimodal model, der kan lytte til korte musikudsnit og skrive en tekstbeskrivelse af dem. Modellen kombinerer en lyd-encoder (som omsætter lyd til numeriske repræsentationer) med en stor sprogmodel, der genererer naturligt sprog. Konkret byggede vi en konverter, der laver tekstbeskrivelser af 10-sekunders musikklip. Systemet kunne konsekvent producere beskrivelser, men det led af hallucinationer (opdigtede detaljer) og andre unøjagtigheder. Ydelsen blev målt med BERTScore, som vurderer den semantiske lighed mellem genererede og referencebeskrivelser, samt med en kvalitativ evaluering. Fremadrettet bør arbejdet prioritere fælles finetuning af sprogmodellen og audio-projektionslaget for at reducere fejl. Derefter bør der undersøges alternative sprogmodeller, forbedringer af prompten og andre lyd-encodere.
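
The abstract names BERTScore as the quantitative metric. As a rough illustration of the idea behind it, the sketch below greedily matches each token embedding in one caption to its most similar token in the other by cosine similarity, then averages to get precision, recall, and F1. This is a simplified sketch under stated assumptions, not the thesis's evaluation code: the function names and the two-dimensional toy vectors are hypothetical, real BERTScore takes its embeddings from a pretrained contextual language model, and optional steps such as IDF weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate, reference):
    """Return (precision, recall, f1) for two lists of token embedding vectors.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings standing in for contextual token embeddings.
reference = [[1.0, 0.0], [0.0, 1.0]]   # reference caption: two dissimilar tokens
candidate = [[1.0, 0.0], [1.0, 0.0]]   # generated caption: covers only the first one

precision, recall, f1 = greedy_match_scores(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every generated token has an exact match in the reference, so precision is high, while half of the reference content goes unmatched, so recall is lower. Note the limit of the metric for the problem the abstract reports: a hallucinated detail lowers precision only to the extent that its embedding differs from everything in the reference, which is one reason a qualitative evaluation is a useful complement.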

[This abstract has been rewritten with the help of AI based on the project's original abstract]