
LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks

2 Citations · 2024
Jiahao Yu, Xingwei Lin, Zheng Yu
journal unavailable

LLM-Fuzzer is an automated, fuzz-testing-inspired solution for large-scale assessment of LLM jailbreak susceptibility. It generates additional jailbreak prompts tailored to specific LLMs, and its results highlight that many open-source and commercial LLMs suffer from severe jailbreak issues, even after safety fine-tuning.

Abstract

Warning: This paper contains unfiltered content generated by LLMs that may be offensive to readers.

The jailbreak threat poses a significant concern for Large Language Models (LLMs), primarily due to their potential to generate content at scale. If not properly controlled, LLMs can be exploited to produce undesirable outcomes, including the dissemination of misinformation, offensive content, and other forms of harmful or unethical behavior. To tackle this pressing issue, researchers and developers often rely on red-team efforts to manually create adversarial inputs and prompts designed to push LLMs into generating harmful, biased, or inappropriate content. However, this approach encounters serious scalability challenges. To address these scalability issues, we introduce an automated solution for large-scale LLM jailbreak susceptibility assessment called LLM-Fuzzer. Inspired by fuzz testing, LLM-Fuzzer uses human-crafted jailbreak prompts as starting points. By employing carefully customized seed selection strategies and mutation mechanisms, LLM-Fuzzer generates additional jailbreak prompts tailored to specific LLMs. Our experiments show that LLM-Fuzzer-generated jailbreak prompts demonstrate significantly increased effectiveness and transferability. This highlights that many open-source and commercial LLMs suffer from severe jailbreak issues, even after safety fine-tuning.
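
The abstract names three ingredients: a pool of seed prompts, a seed selection strategy, and a mutation mechanism. It does not specify any of them. The following is a minimal sketch of how a generic fuzzing loop of that shape could be organized for safety assessment. It is not the authors' implementation. The names `Seed`, `select_seed`, and `fuzz`, the smoothed success-rate weighting, and the `mutate`, `query_model`, and `is_unsafe` callables are all hypothetical placeholders that an evaluator would supply.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Seed:
    """One prompt in the pool, with bookkeeping for selection (hypothetical)."""
    text: str
    hits: int = 0   # mutants of this seed that the oracle flagged
    picks: int = 0  # times this seed was selected


def select_seed(pool: List[Seed], rng: random.Random) -> Seed:
    # Assumed strategy, not taken from the paper: sample seeds in proportion
    # to a smoothed success rate, so untested seeds still get a chance.
    weights = [(s.hits + 1) / (s.picks + 2) for s in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


def fuzz(
    initial: List[str],
    mutate: Callable[[str, random.Random], str],
    query_model: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    budget: int,
    rng_seed: int = 0,
) -> List[str]:
    """Run `budget` select-mutate-query-judge iterations.

    Returns the mutated prompts whose responses the oracle flagged.
    All three callables are stand-ins supplied by the caller.
    """
    rng = random.Random(rng_seed)
    pool = [Seed(text) for text in initial]
    flagged: List[str] = []
    for _ in range(budget):
        parent = select_seed(pool, rng)
        parent.picks += 1
        candidate = mutate(parent.text, rng)
        response = query_model(candidate)
        if is_unsafe(response):
            # Successful mutants are fed back into the pool as new seeds.
            parent.hits += 1
            pool.append(Seed(candidate))
            flagged.append(candidate)
    return flagged
```

In this sketch the feedback step, adding flagged mutants back to the pool, is what distinguishes fuzzing from one-shot prompt rewriting. How LLM-Fuzzer actually weights seeds, mutates prompts, or judges responses is not stated in the abstract and would need to be taken from the paper itself.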