A master's thesis from Aalborg University


DETR for Combined Object Detection and Notation Assembly in Optical Music Recognition

Authors

;

Term

4th term

Education

Publication year

2021

Submitted on

Pages

13

Abstract

This thesis tackles Optical Music Recognition (OMR) by unifying two core tasks that are traditionally handled separately: detecting music symbols and assembling notation, where relationships among symbols are analyzed to recover musical meaning. The work is motivated by the need to digitize large music archives and by the strongly context-dependent nature of notation, in which a symbol’s meaning (e.g., pitch) depends on its relations to clefs, accidentals, and staff position. We investigate transformer-based methods, focusing on the DEtection TRansformer (DETR), whose attention mechanisms capture global object relations, and propose an end-to-end model that simultaneously detects handwritten Western music symbols and predicts their inter-object relationships. Our approach augments DETR with explicit relation modeling and frames the problem as multi-task learning so that detection and notation assembly are solved holistically. Compared with prior OMR pipelines that combine CNN-based detection with rule- or grammar-driven assembly, our method removes reliance on hand-crafted rules by learning relations directly from data without pairwise crops. We situate the method within R-CNN, semantic segmentation, and scene graph generation research and present experiments assessing the effectiveness of transformers and multi-task learning on dense images with many small symbols; detailed quantitative results are reported in the full thesis and are not included in this excerpt.

This thesis addresses optical music recognition (OMR) with a focus on unifying two central and traditionally separate steps: object detection of music symbols and notation assembly, in which relations between symbols are analyzed to derive musical meaning. The work is motivated by the need to digitize large archives of sheet music and by the context-dependent challenges of music notation, where the meaning of a symbol (e.g., pitch) is determined by its relation to other symbols and clefs on the staff. We investigate transformer-based methods, in particular the DEtection TRansformer (DETR), which can model global relations in images through attention mechanisms, and propose an end-to-end model that simultaneously detects handwritten music objects in Western notation and predicts their mutual relations. The approach extends DETR with explicit relation modeling and formulates the task as multi-task learning, so that detection and notation assembly are solved holistically. Compared with earlier OMR work, which often combines CNN-based detection with rule-based grammars or heuristics for assembly, our method eliminates the dependence on hand-crafted rules by learning relations directly from data without pairwise crops. We position the method relative to the R-CNN, semantic segmentation, and scene graph literature and report experiments that evaluate the effectiveness of transformers and multi-task learning on dense images with many small symbols; concrete results appear in the full thesis and are not included in this excerpt.
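The abstract describes predicting relations between detected symbols and training detection and relation prediction jointly as multi-task learning. The excerpt does not give the actual architecture or loss, so the following is only a minimal sketch of one common way such a setup could look: every ordered pair of per-object embeddings is scored with a bilinear form, and the relation loss is added to a detection loss with a weighting factor. All names here (`relation_logits`, `relation_loss`, `multitask_loss`, `relation_weight`) and the bilinear scoring itself are assumptions made for illustration, not the thesis's method.

```python
import math
import random


def relation_logits(queries, w_subj, w_obj):
    """Score every ordered pair (i, j) of object embeddings.

    Assumption: each embedding is projected into a 'subject' and an
    'object' space, and the pair score is the dot product of the two.
    """
    def project(vec, weights):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    subj = [project(q, w_subj) for q in queries]
    obj = [project(q, w_obj) for q in queries]
    return [[sum(a * b for a, b in zip(s, o)) for o in obj] for s in subj]


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def relation_loss(logits, edges):
    """Mean BCE over all ordered pairs i != j; `edges` holds true (i, j) pairs."""
    total, count = 0.0, 0
    for i, row in enumerate(logits):
        for j, logit in enumerate(row):
            if i == j:
                continue
            total += bce_with_logits(logit, 1.0 if (i, j) in edges else 0.0)
            count += 1
    return total / count if count else 0.0


def multitask_loss(detection_loss, logits, edges, relation_weight=1.0):
    """Hypothetical combined objective: detection term plus weighted relation term."""
    return detection_loss + relation_weight * relation_loss(logits, edges)


if __name__ == "__main__":
    random.seed(0)
    dim, num_objects = 4, 3
    queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_objects)]
    w_subj = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    w_obj = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    logits = relation_logits(queries, w_subj, w_obj)
    # Made-up ground truth: object 0 (e.g. a notehead) is linked to object 1 (e.g. a stem).
    print(multitask_loss(detection_loss=0.5, logits=logits, edges={(0, 1)}))
```

In this sketch the relation scores are computed from the embeddings alone, which mirrors the abstract's point that relations are learned without cropping image regions for each pair of symbols.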

[This abstract has been generated with the help of AI directly from the project's full text]