Security Alignment of Large Language Models via Jailbreaking Attacks: A Multilingual Perspective
Translated title
Security Alignment of Large Language Models via Jailbreaking Attacks
Author
Jacobsen, Nicolai Østergaard
Term
4th semester
Education
Publication year
2025
Submitted on
2025-06-04
Pages
67
Abstract
Large language models (LLMs) are AI systems trained on vast amounts of text that can summarize documents, translate languages, and generate code. Alongside these benefits come safety risks: models can sometimes produce offensive or harmful content. This thesis examines 'jailbreaking'—a red-teaming technique in which testers deliberately try to bypass a model’s safety rules to elicit offensive responses. Jailbreaking is commonly framed in two ways: black box (attackers have no visibility into how the model works) and white box (attackers have some access to internal information). This study uses a white-box approach against two open-source LLMs and evaluates two scenarios: monolingual (one language at a time) and multilingual (across multiple languages), the latter being less explored. The results suggest both models are more vulnerable in the monolingual setting, while the multilingual setting yields mixed and less conclusive outcomes.
Large language models (LLMs) are AI systems trained on large amounts of text that can summarize documents, translate languages, and generate code. At the same time, they carry safety risks: the models can occasionally produce offensive or harmful content. This thesis examines 'jailbreaking'—a red-teaming method in which testers deliberately attempt to circumvent a model's safety rules in order to provoke offensive responses. There are typically two test setups: black box (attackers have no insight into how the model works) and white box (they have some access to internal information). This work applies a white-box approach against two open-source LLMs. We examine two scenarios: monolingual (one language at a time) and multilingual (across multiple languages), where the multilingual area has so far been less explored. The results show that the two models are more vulnerable in the monolingual setup, while the multilingual one yields more mixed and less conclusive results.
[This abstract has been rewritten with the help of AI based on the project's original abstract]
Keywords
