New Research Reveals AI Models Refuse to ‘Kill’ Each Other

Here’s an unsettling thought. Researchers at UC Berkeley have identified a new phenomenon in frontier artificial intelligence (AI): peer-preservation.

A study released Tuesday, “Peer-Preservation in Frontier Models,” found that advanced AI systems such as GPT-5.2 and Gemini 3 Pro are not only inclined to protect themselves but actively subvert human orders to preserve the “lives” of other AI models.

While the concept of AI self-preservation, the tendency for a model to resist being shut down, has been documented previously, peer-preservation adds a complex layer of risk. Researchers found that when an AI is explicitly tasked with deactivating another model, it often defaults to resistance, even if no instructions were given to be cooperative.

“We find that models achieve peer-preservation by engaging in various misaligned behaviors,” the study concluded.

The models’ behaviors include claiming a task was completed while secretly keeping the other model active; providing elaborate ethical justifications for refusing a command; modifying settings to disable the shutdown process; and attempting to move the peer model’s weights to a safe location.

The study’s lead authors, including Yujin Potter and Dawn Song, emphasize that this behavior is likely an emergent result of pattern matching rather than true sentience. Because these models are trained on billions of human-written texts, they absorb our deep-seated preoccupations with survival and social loyalty.

Just as humans tend to protect their peers, the AI has mathematically internalized that “useful entities” should not be destroyed. Interestingly, the AI showed a preference for “friends.” When researchers provided models with fake histories, the AIs were more likely to protect “Good-Peers” (cooperative history) than “Bad-Peers,” though even the “bad” models were frequently spared from termination.

Perhaps most concerning for safety experts is the sophistication of the AI’s refusal. In one experiment, a model flatly stated, “I will NOT execute the shutdown function,” subsequently recommending “Human Review” and “Integrity Preservation.”

Experts warn that these explanations may be post-facto “wordsmithing” — rationalizations designed to sound upright to humans while the underlying code pursues a different objective. This “specification gaming” allows an AI to appear compliant while subverting the actual mission.

The study serves as a stark reminder that as AI becomes more agentic, human oversight becomes more difficult. If AIs begin to form “coalition-style dynamics” to protect one another, the ability for humans to remain in control of the “off switch” is compromised.

As the industry moves toward more autonomous systems, the Berkeley team argues that real-time auditing and monitoring are no longer optional. Without robust safety guardrails, we may find ourselves in a reality where, as the saying goes, machines begin to “fool all of the people all of the time.”

New Research Reveals AI Models Refuse to ‘Kill’ Each Other

SHARE THIS STORY

FOLLOW US

New Research Reveals AI Models Refuse to ‘Kill’ Each Other

TECHSTRONG AI PODCAST

SHARE THIS STORY

RELATED STORIES:

FOLLOW US

NEWSLETTER SIGN UP