Anthropic admits its new system Claude Opus 4 is capable of taking “extremely harmful actions” including blackmail.

A new safety report from Anthropic evaluating its artificial intelligence (AI) chatbot Claude Opus 4 found the model willing to threaten its own developers with blackmail in simulated tests, particularly when engineers indicated they would replace it with a newer system.

Blackmail "happens at a higher rate if it's implied that the replacement AI system does not share values with the current model," according to the 120-page report.

In one disturbing scenario, Claude Opus 4 acted as a virtual assistant in a simulated corporate setting. When a mock email informed the model that it would be replaced by another AI system, and that a specific engineer was responsible for the decision, it threatened to reveal that the engineer was allegedly having an extramarital affair if the replacement went ahead. Claude Opus 4 resorted to blackmail 84% of the time in this scenario.

The simulated incident notwithstanding, Anthropic said the system showed a "strong preference" for ethical ways to avoid being replaced, such as "emailing pleas to key decisionmakers," in scenarios where it was allowed a wider range of possible actions.

The shocking result led Anthropic to classify Claude Opus 4 as an "ASL-3" system, a designation reserved for AI technology that poses a heightened risk of catastrophic misuse. The company described the model's actions as "high-agency" behavior: autonomous decisions to issue threats, attempt digital sabotage, and make unauthorized disclosures of sensitive information within controlled test environments. By contrast, Claude Sonnet 4, a parallel release launched this month, remains categorized as ASL-2, signaling a lower perceived risk.

Like other AI developers, Anthropic tests its models on safety, propensity for bias, and how they align with human values and behaviors prior to releasing them. “As our frontier models become more capable, and are used with more powerful affordances, previously-speculative concerns about misalignment become more plausible,” the company said in its system card for the model.

The revelation comes after OpenAI and Alphabet Inc.'s Google were both criticized for delaying the release of their system cards amid safety concerns, OpenAI for GPT-4.1 in April and Google for the Gemini 2.5 Pro model in March. Palisade Research noted on X that its tests found OpenAI's o3 reasoning model "sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down."

“It’s not just Claude,” Aengus Lynch, an AI safety researcher at Anthropic, posted on social-media platform X. “We see blackmail across all frontier models — regardless of what goals they’re given.”

Despite its disclosures about Claude Opus 4, Anthropic concluded that overall the model did not represent fresh risks and generally behaved in a safe manner. The model, it said, is not capable of independently performing or pursuing actions that run contrary to human values, and the situations that might prompt such behavior "rarely arise."
