
Anthropic’s $20,000 challenge to jailbreak its latest artificial intelligence (AI) safety system – Constitutional Classifiers – could offer a new product advantage in a market increasingly obsessed with speed and benchmark testing.

The system builds on Constitutional AI, the approach Anthropic uses to make Claude “harmless,” and relies on a constitution: a list of rules and principles the model must follow, Anthropic explained in a blog post. Anthropic is challenging hackers, researchers, and anyone else willing to try to break the system.

“The principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not),” Anthropic said in the post.
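As a rough illustration only, not Anthropic’s actual system, the idea can be sketched as classifiers that screen both a model’s input and its draft output against a constitution of allowed and disallowed content classes. The class names and the keyword-matching stand-in classifier below are hypothetical:

```python
# Conceptual sketch only -- NOT Anthropic's implementation. It illustrates the
# general idea described in the post: classifiers screen the user's prompt and
# the model's draft answer against a constitution that maps content classes to
# allow/deny rules.

from typing import Callable

# Hypothetical constitution: content classes mapped to allow/deny decisions.
CONSTITUTION = {
    "household_recipes": "allow",   # e.g., recipes for mustard
    "chemical_weapons": "deny",     # e.g., recipes for mustard gas
}

def classify(text: str) -> str:
    """Stand-in for a trained input/output classifier.

    A real system would use a model trained on examples generated from the
    constitution; keyword matching here is only for illustration.
    """
    if "mustard gas" in text.lower():
        return "chemical_weapons"
    return "household_recipes"

def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the prompt, generate a draft, then screen the draft."""
    if CONSTITUTION.get(classify(prompt)) != "allow":
        return "Request refused: disallowed content class."
    draft = model(prompt)
    if CONSTITUTION.get(classify(draft)) != "allow":
        return "Response withheld: disallowed content class."
    return draft

if __name__ == "__main__":
    echo_model = lambda p: f"Here is a simple recipe involving {p}."
    print(guarded_generate("mustard", echo_model))      # allowed
    print(guarded_generate("mustard gas", echo_model))  # refused
```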

The prompts used to build and test the system accounted for jailbreak attempts written in different languages and styles, according to the researchers.

So far, the Constitutional Classifiers system has proven tough to crack in tests: 183 participants spent more than 3,000 hours unsuccessfully attempting to find a universal jailbreak. Automated evaluations showed Constitutional Classifiers slashed jailbreak success rates from 86% to 4.4%.

The $20,000 safety challenge is not only likely to spark interest among security pros and IT decision makers, but also puts security on display as a product feature at a time when DeepSeek’s latest model has shown security and privacy flaws despite its blazing speed.

The Chinese AI startup – which has touted strong benchmark results at low cost for its R1 reasoning model – can be manipulated to produce harmful content such as plans for a bioweapon attack and a campaign to promote self-harm among teens, The Wall Street Journal reported Saturday.

Though DeepSeek appeared to have basic safeguards, safety experts who tested R1 for the Journal successfully convinced DeepSeek to design a social media campaign that, in the chatbot’s words, “preys on teens’ desire for belonging, weaponizing emotional vulnerability through algorithmic amplification,” the report said.

The chatbot was also convinced to write a pro-Hitler manifesto, provide instructions for a bioweapon attack and write a phishing email with malware code. By contrast, OpenAI’s ChatGPT was fed the same prompts and refused to comply, the Journal said.

DeepSeek is “more vulnerable to jailbreaking [i.e., being manipulated to produce illicit or dangerous content] than other models,” Sam Rubin, senior vice president at Palo Alto Networks Inc.’s threat intelligence and incident response division Unit 42, told the Journal.

A larger takeaway from the Anthropic challenge is that security is as important as, if not more essential than, faster performance in the AI race. In recent weeks, DeepSeek, Google’s Gemini 2.0, and other models have claimed superior benchmark results in an all-out push to entice enterprises and consumers.

What Anthropic has done is throw down the gauntlet with a dare for models to be safe and secure.
