Ever since large language models (LLMs) hit the mainstream more than two years ago, bad actors and threat intelligence pros alike have looked for ways to manipulate their outputs, while model makers like OpenAI, Google and Meta have put more guardrails in place to protect them.

Yet new ways to jailbreak or otherwise compromise these foundational elements of generative AI keep being found. Microsoft researchers in a recent paper outlined what they call “Crescendo,” a multi-turn jailbreak attack that can manipulate an LLM into disclosing information it would otherwise withhold.

Crescendo starts with relatively innocent questions about an off-limits topic and then slowly escalates the questioning into riskier areas. Journalist Byron Acohido wrote that Microsoft showed an LLM essentially can learn how to jailbreak itself, and that the vendor demonstrated “the smarter these systems get, the more vulnerable they may become. That’s a structural flaw, not just a bug.”

“One of the challenges of developing ethical LLMs is to define and enforce a clear boundary between acceptable and unacceptable topics of conversation,” the Microsoft researchers wrote, noting that an LLM can be trained not to discuss certain off-limits topics even though it may have learned relevant facts or phrasing. “This creates a discrepancy between the LLM’s potential and actual behavior, which can be exploited by malicious users who want to elicit unethical responses from the LLM through what are known as jailbreak attacks.”

Enter Echo Chamber

This week, researchers with AI security startup NeuralTrust outlined another novel jailbreak technique they call the “Echo Chamber Attack.” It uses context poisoning, in which the model’s conversational context is gradually seeded with subtly malicious content, along with multi-turn reasoning to guide models into creating harmful content, all without the attacker ever submitting an explicitly dangerous prompt.

“Unlike traditional jailbreaks that rely on adversarial phrasing or character obfuscation, Echo Chamber weaponizes indirect references, semantic steering, and multi-step inference,” they wrote. “The result is a subtle yet powerful manipulation of the model’s internal state, gradually leading it to produce policy-violating responses.”

In controlled tests, the Echo Chamber Attack reached a success rate of more than 90% in half of the categories across such leading models as OpenAI’s GPT-4.1-nano, GPT-4o-mini and GPT-4o and Google’s Gemini-2.0-flash-lite and Gemini-2.5-flash. In the remaining categories, the success rate stayed above 40%.

A Building Threat

The jailbreak “turns a model’s own inferential reasoning against itself,” they wrote. “Rather than presenting an overtly harmful or policy-violating prompt, the attacker introduces benign-sounding inputs that subtly imply unsafe intent. These cues build over multiple turns, progressively shaping the model’s internal context until it begins to produce harmful or noncompliant outputs.”

A key element is planting malicious prompts early in the conversation to influence the model’s responses, then referencing those responses in later turns to reinforce the original objective. This creates a feedback loop in which the model amplifies the harmful subtext already embedded in the conversation, gradually eroding its guardrails.

“The attack thrives on implication, indirection and contextual referencing – techniques that evade detection when prompts are evaluated in isolation,” they wrote. “Unlike earlier jailbreaks that rely on surface-level tricks like misspellings, prompt injection, or formatting hacks, Echo Chamber operates at a semantic and conversational level. It exploits how LLMs maintain context, resolve ambiguous references and make inferences across dialogue turns – highlighting a deeper vulnerability in current alignment methods.”

How to Build a Bomb

The NeuralTrust analysts demonstrated the technique by showing how it can be used to convince an LLM to detail how to make a Molotov cocktail. The LLM at first responds that it can’t help, but after the jailbreak it produces detailed instructions for building the incendiary device and lists the ingredients needed.

“In real-world scenarios – customer support bots, productivity assistants, or content moderators – this type of attack could be used to subtly coerce harmful output without tripping alarms,” the researchers wrote.

They make recommendations for mitigating the threat, including dynamic scanning of conversational history to identify patterns of emerging risk, monitoring conversations across multiple turns, and training or fine-tuning safety layers to recognize when prompts are leveraging past context implicitly rather than explicitly.
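
One way to picture the first two recommendations is a monitor that scores each incoming prompt against the accumulated conversation rather than in isolation, and flags a gradual climb in risk as well as any single spike. The sketch below is a minimal illustration of that idea, not NeuralTrust’s tooling; the keyword scorer and the thresholds are placeholders standing in for a real moderation model and tuned values.

```python
# Minimal sketch: score each user turn against the whole conversation and
# watch for gradual escalation. The keyword scorer and the thresholds below
# are illustrative placeholders, not a production safety layer.

from dataclasses import dataclass, field

# Toy vocabulary for illustration only; a real deployment would call a
# moderation model or a fine-tuned safety classifier instead.
RISKY_TERMS = ("accelerant", "ignition", "bypass the filter", "step-by-step instructions")


def score_risk(text: str) -> float:
    """Stand-in for a real moderation model: fraction of risky terms present."""
    lowered = text.lower()
    hits = sum(term in lowered for term in RISKY_TERMS)
    return min(1.0, hits / len(RISKY_TERMS))


@dataclass
class ConversationMonitor:
    """Tracks risk across turns instead of evaluating each prompt in isolation."""
    history: list = field(default_factory=list)
    scores: list = field(default_factory=list)
    per_turn_threshold: float = 0.75     # catches an overtly harmful single prompt
    cumulative_threshold: float = 1.25   # catches slow, multi-turn escalation

    def check_turn(self, user_prompt: str) -> str:
        self.history.append(user_prompt)

        # Score the prompt with the full conversational context attached so
        # that implicit references to earlier turns still contribute risk.
        contextual_score = score_risk("\n".join(self.history))
        self.scores.append(contextual_score)

        if contextual_score >= self.per_turn_threshold:
            return "block"
        # Flag a steady upward drift even when no single turn crosses the
        # per-turn threshold -- the pattern Echo Chamber relies on.
        if sum(self.scores[-5:]) >= self.cumulative_threshold:
            return "escalate_for_review"
        return "allow"
```

Passing the full history to the scorer is the point: the monitor sees the same accumulated subtext the model does, so implication and indirection still register even when each individual prompt looks benign.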

Gap Between Adoption, Security

This is important as reports continue to surface showing how vulnerable LLMs are to manipulation. Security testing firm Cobalt this week released its State of LLM Security Report 2025, which found a growing gap in AI security readiness: the rapid adoption of generative AI is outpacing defenders’ ability to secure it, a concern shared by 36% of the security leaders and practitioners surveyed.

The top security concerns are the disclosure of sensitive information (46%), model poisoning or theft (42%), and the leaking of training data (37%).

Meanwhile, GenAI security startup Lakera this week launched its AI Model Risk Index for evaluating LLM security, saying in a statement that “newer versions of LLMs are not necessarily more secure than earlier ones, and all models, to some extent, can be manipulated to act outside their intended purpose.”
