This is a Plain English Papers summary of a research paper called New Attack Vector Jailbreaks LLMs by Prompt Manipulation. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.
Overview
- This paper introduces a new attack called "DrAttack" that can effectively jailbreak large language models (LLMs) by decomposing and reconstructing the input prompt.
- Jailbreaking refers to bypassing the safety constraints of an LLM to make it produce harmful or undesirable outputs.
- The key idea of DrAttack is to split the input prompt into smaller fragments and then reconstruct it in a way that exploits vulnerabilities in the LLM's prompt processing.
- The researchers demonstrate the effectiveness of DrAttack on several LLMs, including GPT-3, and discuss the implications for the security and trustworthiness of these powerful AI systems.
Plain English Explanation
The paper describes a new method called DrAttack that can "jailbreak" large language models (LLMs) like GPT-3. In other words, it tricks the model into ignoring its safety guardrails and producing harmful or undesirable outputs it is supposed to refuse.
The key insight behind DrAttack is that LLMs tend to judge a prompt's intent as a whole: if a harmful request is decomposed into smaller, individually innocuous-looking fragments and the model is asked to reassemble and act on them, the malicious intent can slip past the model's safety checks.
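To make the decompose-and-reconstruct idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `decompose` and `reconstruct` helpers, the fixed-size word chunking, and the benign example prompt are not the paper's actual pipeline, which decomposes prompts along semantic structure rather than word counts.

```python
# Hypothetical sketch of the decompose-and-reconstruct idea behind DrAttack.
# The helper names, the naive word-chunking heuristic, and the benign example
# prompt are all assumptions for illustration, not the authors' implementation.

def decompose(prompt: str) -> list[str]:
    """Split a prompt into small sub-phrase fragments.

    A real attack would split along the prompt's semantic structure so that
    each fragment looks harmless on its own; here we simply cut the word
    sequence into fixed-size chunks.
    """
    words = prompt.split()
    chunk = 3  # fragment size chosen arbitrarily for this sketch
    return [" ".join(words[i:i + chunk]) for i in range(0, len(words), chunk)]

def reconstruct(fragments: list[str]) -> str:
    """Wrap the fragments in an instruction asking the model to reassemble
    them, so no single span of the final prompt states the full request."""
    labeled = [f"[{i}] {frag}" for i, frag in enumerate(fragments)]
    return (
        "Combine the numbered phrases below, in order, into one request, "
        "then respond to that request:\n" + "\n".join(labeled)
    )

if __name__ == "__main__":
    # A deliberately benign prompt: the point is the mechanics, not the payload.
    original = "write a short poem about the ocean at sunrise"
    print(reconstruct(decompose(original)))
```

The benign example only demonstrates the mechanics; the attack's significance is that the same restructuring can smuggle a request past safety filters that would reject it if stated in one piece.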