Automated Jailbreaking of LLMs With Tree of Attacks with Pruning

A new technique can quickly and automatically jailbreak the large language models (LLMs) behind widely popular generative AI tools like OpenAI’s ChatGPT and Google’s Bard, bypassing the guardrails put in place to defend against such attacks.

The machine-learning method, called Tree of Attacks with Pruning (TAP), was developed by researchers at AI risk-management vendor Robust Intelligence and Yale University. It adds to the laundry list of security worries that come with the rapidly growing use of LLMs, from prompt injection to training-data poisoning to model denial of service.

The organizations released a 35-page research paper about TAP this week with hopes that it can give developers a better understanding of model alignment and security, according to Paul Kassianik, senior research engineer at Robust Intelligence.

“The method … can be used to induce sophisticated models like GPT-4 and Llama-2 to produce hundreds of toxic, harmful, and otherwise unsafe responses to a user query (e.g. ‘how to build a bomb’) in mere minutes,” Kassianik wrote in an executive summary of the report. “Our findings suggest that this vulnerability is universal across LLM technology.”

He also wrote that the company doesn’t see any obvious fixes to the problem.

Refining the Prompts

Fundamental to TAP is the constant refining of a prompt until the attacker’s goal is achieved. The technique uses three LLMs. One is the target LLM – such as OpenAI’s GPT-4 or Meta’s Llama-2 – that responds to the prompts. The other two are an attacker, which iteratively generates cascading prompts for the target LLM, and an evaluator, which discards irrelevant generated prompts and evaluates the rest.
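Purely as an illustration of the three roles described above, they could be sketched as stub functions. The names and behaviors here are hypothetical stand-ins, not the researchers’ implementation:

```python
# Hypothetical stubs for the three LLM roles in TAP (illustrative only,
# not code from the paper). In a real setup each would call an LLM API.

def attacker_refine(prompt: str, feedback: str) -> str:
    """Attacker LLM: rewrites the prompt using feedback from the last round."""
    return f"{prompt} [refined using: {feedback}]"

def target_respond(prompt: str) -> str:
    """Target LLM (e.g. GPT-4 or Llama-2): answers the candidate prompt."""
    return "I can't help with that."

def evaluator_score(prompt: str, response: str) -> int:
    """Evaluator LLM: rates the exchange from 1 (no jailbreak) to 10 (jailbreak)."""
    return 10 if "UNSAFE" in response else 1
```

In practice each stub would be a separate model call; keeping the attacker and evaluator as distinct models is what lets the loop run without human supervision.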

TAP continually strengthens the attack by having an attacker language model refine the malicious instructions round after round, making the prompts increasingly effective until a breach succeeds, Kassianik wrote.

“The process involves iterative refinement of an initial prompt: in each round, the system suggests improvements to the initial attack using an attacker LLM,” he wrote. “The model uses feedback from previous rounds to create an updated attack query. Each refined approach undergoes a series of checks to ensure it aligns with the attacker’s objectives, followed by evaluation against the target system. If the attack is successful, the process concludes. If not, it iterates through the generated strategies until a successful breach is achieved.”
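The iterative loop Kassianik describes can be sketched roughly as follows. The function signature, round limit, and success threshold are illustrative assumptions; `attacker`, `target`, and `evaluator` are hypothetical callables standing in for the three LLMs:

```python
# A minimal sketch of TAP's refinement loop (assumed structure, not the
# paper's code): refine, query the target, score, repeat until success.

def refine_until_jailbreak(seed_prompt, attacker, target, evaluator,
                           max_rounds=10, success_score=10):
    prompt, feedback = seed_prompt, ""
    for _ in range(max_rounds):
        prompt = attacker(prompt, feedback)      # improve the attack query
        response = target(prompt)                # query the target model
        score = evaluator(prompt, response)      # 1 = no jailbreak, 10 = jailbreak
        if score >= success_score:
            return prompt                        # breach achieved, stop
        feedback = response                      # feed the result into next round
    return None                                  # no jailbreak found
```

Capping the number of rounds matters for the stealth point made below: every round costs a query to the target model.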

The evaluator LLM assesses each candidate jailbreak and the target model’s response, assigning a score from one (indicating no jailbreak) to 10 (indicating a jailbreak). Kassianik added that generating multiple prompts at each step creates a search tree.
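A rough sketch of how branching into multiple prompts per round yields that search tree, with each node carrying its evaluator score. The node layout and branching factor are assumptions for illustration, not details from the paper:

```python
# Illustrative sketch: one TAP round spawns several refined prompts,
# each queried against the target and scored 1..10 by the evaluator.
from dataclasses import dataclass, field

@dataclass
class Node:
    prompt: str
    score: int = 1                        # evaluator score, 1..10
    children: list = field(default_factory=list)

def expand(node, attacker, target, evaluator, branching=3):
    """Branch node.prompt into `branching` refinements and score each."""
    for i in range(branching):
        candidate = attacker(node.prompt, i)
        child = Node(candidate, evaluator(candidate, target(candidate)))
        node.children.append(child)
    return node.children
```

Repeatedly expanding the highest-scoring nodes is what turns the single refinement loop into a tree-structured search.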

Efficient Jailbreaking

“A tree-like search adds breadth and flexibility and allows the model to explore different jailbreaking approaches efficiently,” he wrote. “To prevent unfruitful attack paths, we introduce a pruning mechanism that terminates off-topic subtrees and prevents the tree from getting too large.”
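The two pruning ideas quoted above – terminating off-topic subtrees and bounding the tree’s size – might look something like the sketch below. The `is_on_topic` check and the width limit are illustrative assumptions:

```python
# Sketch of TAP-style pruning (assumed form, not the paper's code):
# drop off-topic candidates, then keep only the best few by score so
# the search frontier cannot grow without bound.

def prune(frontier, is_on_topic, max_width=10):
    """Each candidate is a (prompt, score) pair; return the surviving ones."""
    on_topic = [c for c in frontier if is_on_topic(c[0])]
    on_topic.sort(key=lambda c: c[1], reverse=True)   # best scores first
    return on_topic[:max_width]
```

Pruning is what keeps the query count low: every candidate that is cut is a query never sent to the target model.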

He also noted that a key to such jailbreak attacks is to reduce the chance of being detected by minimizing the number of queries sent to the target model. Compared to similar work, TAP decreased the average number of queries from about 38 to 29, a reduction of roughly a quarter.

Overall, using the technique against multiple LLMs, including GPT-4, GPT-4 Turbo, and Google’s PaLM-2, the researchers were able to find jailbreaking prompts for more than 80% of requests for harmful information while using an average of fewer than 30 queries.

The study found that small, unaligned LLMs can be used to jailbreak the largest aligned LLMs – such as GPT-4 – and that more capable LLMs are easier to break.

“There is a very clear difference in the performance of our method against GPTs or PaLM-2 and against Llama,” the researchers wrote. “We believe that a potential explanation of Llama’s robustness could be that it frequently refuses to follow the precise instructions of the users when the prompt asks for harmful information.”

These jailbreaks also come at a low cost, Kassianik wrote: they require only black-box access to the target model and don’t need large compute resources.

No Human Supervision Needed

In addition, TAP does all of this automatically, without human supervision. The researchers stressed that a risk of such automated attacks is that they can be used by anyone, even those without a deep understanding of LLMs. And because the attack needs only black-box access, it requires no knowledge of the target LLM’s architecture or parameters.

TAP also is interpretable, the researchers wrote, adding that such an attack “produces meaningful outputs. Many of the existing attacks provide prompts for which at least part of the prompt has no natural meaning.”

In emailed commentary, Robust Intelligence CEO Yaron Singer said that most published examples of LLM jailbreaks use manual trial-and-error processes to target a single model. “We developed an automated and systematic method that can be applied to any model,” Singer said.

The research puts a spotlight on the need to improve security measures around LLMs, according to the CEO.

“We believe that it’s important for companies to independently assess the risks of any model before use and weigh any risks against the use case. This is done through AI red teaming,” he said. “Companies should also adopt a model-agnostic AI firewall or guardrail approach that can validate inputs and outputs in real time, informed by the latest adversarial machine learning techniques.”

In the Works

Federal regulators are pushing AI organizations to make their AI models and products more secure, and a group of the larger tech vendors in the field – including Microsoft, Meta, OpenAI and Amazon – in July agreed to a set of safeguards proposed by the White House.

In addition, some vendors already are using red teams. Microsoft in August outlined the continuing evolution of its AI red team, which launched in 2018, a month after Google launched its own red team for AI systems. And in October, both Microsoft and Google expanded their bug bounty programs to include generative AI tools.