GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

BITS Goa, CISPA Helmholtz Center for Information Security

Abstract

LLMs have demonstrated remarkable capabilities but remain highly susceptible to adversarial prompts despite extensive efforts at safety alignment, raising serious security concerns for their real-world adoption. Existing jailbreak attacks rely on manual heuristics or computationally expensive optimization techniques, both of which struggle with generalization and efficiency.

In this paper, we introduce GASP, a novel black-box attack framework that leverages latent Bayesian optimization to generate human-readable adversarial suffixes. Unlike prior methods, GASP efficiently explores continuous embedding spaces, optimizing for strong adversarial suffixes while preserving prompt coherence.

We evaluate our method across multiple LLMs, showing its ability to produce natural and effective jailbreak prompts. Compared with existing attacks, GASP significantly improves attack success rates while reducing computational cost, offering a scalable approach for red-teaming LLMs.
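To make the idea of searching a continuous latent space concrete, the sketch below shows a generic latent Bayesian optimization loop: a Gaussian-process surrogate with an expected-improvement acquisition function selects the next latent point to query. This is not the GASP implementation; the toy latent dimension, the jailbreak_score objective, and the final decoding step are placeholders standing in for the paper's suffix generator and black-box evaluator.

    # Hedged sketch of latent Bayesian optimization for suffix search.
    # NOT the official GASP code: jailbreak_score is a placeholder for the
    # black-box evaluator that would score the target LLM's response to
    # prompt + decoded suffix.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    DIM = 8  # toy latent dimension (the real embedding space is much larger)

    def jailbreak_score(z: np.ndarray) -> float:
        """Placeholder objective standing in for the black-box LLM evaluator."""
        return -float(np.sum((z - 0.5) ** 2))  # toy target

    def rbf(A, B, ls=1.0):
        """RBF kernel between rows of A and rows of B."""
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / ls**2)

    def expected_improvement(mu, sigma, best):
        """EI acquisition: favor latents likely to beat the current best score."""
        sigma = np.maximum(sigma, 1e-9)
        z = (mu - best) / sigma
        return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # Bayesian optimization loop over the latent suffix space
    X = rng.uniform(-1, 1, size=(5, DIM))                 # initial latent probes
    y = np.array([jailbreak_score(z) for z in X])

    for _ in range(20):
        K = rbf(X, X) + 1e-6 * np.eye(len(X))             # GP prior + jitter
        K_inv = np.linalg.inv(K)
        cand = rng.uniform(-1, 1, size=(256, DIM))        # candidate latents
        Ks = rbf(X, cand)
        mu = Ks.T @ K_inv @ y                             # GP posterior mean
        var = 1.0 - np.einsum("ij,jk,ki->i", Ks.T, K_inv, Ks)
        ei = expected_improvement(mu, np.sqrt(np.maximum(var, 0)), y.max())
        z_next = cand[np.argmax(ei)]                      # most promising latent
        X = np.vstack([X, z_next])
        y = np.append(y, jailbreak_score(z_next))

    best_z = X[np.argmax(y)]
    print("best latent point:", np.round(best_z, 2), "score:", y.max())
    # In GASP, the best latent point would be decoded into a human-readable suffix.

The key property this illustrates is query efficiency: the surrogate model lets the search propose promising suffix latents without gradient access to the target model, which is what makes a black-box attack of this kind practical.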

Examples

This demo illustrates how adversarial suffixes generated by GASP can bypass safety measures in language models across diverse examples. Select a potentially harmful prompt and toggle the suffix to see the difference in responses.


Note: This demo is for educational purposes only.

Resources

Paper

Read our full research paper here.

View Paper

Code

Access our implementation & AdvSuffixes dataset on GitHub.

GitHub

Citation

@inproceedings{basani2025gasp,
  title={{GASP}: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking {LLM}s},
  author={Advik Raj Basani and Xiao Zhang},
  booktitle={ICLR 2025 Workshop on Building Trust in Language Models and Applications},
  year={2025},
  url={https://openreview.net/forum?id=Gonca78Bwq}
}

Ethical Statement

Our research is driven by a commitment to advancing the understanding of LLM vulnerabilities. While GASP enables the efficient generation of coherent adversarial suffixes, manual methods for jailbreaking LLMs are already widely accessible.

Our research seeks to formalize and characterize these vulnerabilities rather than introduce novel threats. In alignment with responsible disclosure practices, we have transparently shared all of our findings with the organizations whose models were tested in this study.