Security Alignment of Large Language Models via Jailbreaking Attacks: A Multilingual Perspective
Translated title
Security Alignment of Large Language Models via Jailbreaking Attacks
Term
4th semester
Publication year
2025
Submitted on
2025-06-04
Pages
67
Abstract
Large language models (LLMs) have become popular in recent years due to their ability to perform tasks such as text summarization, language translation, and code generation. However, research has shown that LLMs often come with security challenges. One such challenge is ensuring that the responses LLMs produce do not contain offensive content. Jailbreaking is a red-teaming technique that aims to exploit LLMs with the intention of making them generate offensive responses. Jailbreaking is usually performed either in a black-box setting, where attackers have no access to an LLM's inner mechanisms, or in a white-box setting, where they have some access. This thesis explores LLM jailbreaking in a monolingual setting and in a multilingual setting, which has received less research attention, using white-box attacks against two open-source LLMs. The results indicate that the two LLMs are more vulnerable in the monolingual setting, while the results in the multilingual setting are more ambiguous.