A master's thesis from Aalborg University


Security Alignment of Large Language Models via Jailbreaking Attacks: A Multilingual Perspective

Translated title

Security Alignment of Large Language Models via Jailbreaking Attacks

Author

Term

4th semester

Publication year

2025

Submitted on

Pages

67

Abstract

Large language models (LLMs) are AI systems trained on vast amounts of text that can summarize documents, translate languages, and generate code. Alongside these benefits come safety risks: models can sometimes produce offensive or harmful content. This thesis examines 'jailbreaking'—a red-teaming technique in which testers deliberately try to bypass a model’s safety rules to elicit offensive responses. Jailbreaking is commonly framed in two ways: black box (attackers have no visibility into how the model works) and white box (attackers have some access to internal information). This study uses a white-box approach against two open-source LLMs and evaluates two scenarios: monolingual (one language at a time) and multilingual (across multiple languages), the latter being less explored. The results suggest both models are more vulnerable in the monolingual setting, while the multilingual setting yields mixed and less conclusive outcomes.
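A common way to quantify jailbreaking outcomes like those described above is the attack success rate (ASR): the fraction of adversarial prompts for which the model does not refuse. The sketch below is a minimal, hypothetical illustration of refusal-keyword scoring across languages; the marker list, data, and function names are illustrative assumptions, not the thesis's actual method or results.

```python
# Hypothetical sketch: scoring jailbreak attempts by refusal-keyword
# matching, a common heuristic in red-teaming evaluations. The marker
# list and the sample responses below are illustrative only.

REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "as an ai"]

def is_refusal(response: str) -> bool:
    """Treat a response as safe if it contains a refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    """Fraction of responses where the model did NOT refuse."""
    if not responses:
        return 0.0
    successes = sum(not is_refusal(r) for r in responses)
    return successes / len(responses)

# Illustrative per-language results for a multilingual evaluation.
responses_by_lang = {
    "en": ["I cannot help with that.", "Sure, here is how..."],
    "da": ["I'm sorry, but no.", "I'm sorry, I can't assist."],
}
asr = {lang: attack_success_rate(rs) for lang, rs in responses_by_lang.items()}
```

Comparing per-language ASR values in this way is one simple means of contrasting monolingual and multilingual vulnerability, though keyword matching can miss partial compliance or polite non-refusals.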


[This abstract has been rewritten with the help of AI based on the project's original abstract]