A Job Manager for the NorduGrid ARC
Authors
Jensen, Henrik Thostrup ; Leth, Jesper Ryge
Term
4th term
Education
Publication year
2004
Abstract
This report presents the development of a system that manages compute jobs in NorduGrid. The project builds on earlier work that produced a background program (daemon) capable of resubmitting failed jobs. The report begins with an introduction to grids (distributed computing systems in which multiple organizations share resources) and explains how resource sharing is coordinated through virtual organizations (VOs). It then outlines a general grid model and gives examples of three different grid architectures. The NorduGrid project is described, with a focus on its organizational setup and the architecture of its toolkit. The report then covers the project's initial considerations: the need for a Job Manager, the design philosophy, and the key features. An overview of the Job Manager is followed by descriptions of its smaller modules. A central design choice is to separate job bookkeeping (status and history) from the actions that operate on jobs, placing them in different modules. This makes it possible to change how jobs are handled without changing the whole system. The report also explains how a single point of failure is avoided by letting other Job Managers take over automatically (failover) if one instance fails. A final chapter outlines future work. The conclusion is that the Job Manager provides a robust framework for production environments and a simple interface that applications can use to access the grid.
[This abstract was generated with the help of AI]