AI agents, Trust, AI safety

AI agents have absorbed centuries of human knowledge, culture and creativity — including both our finest qualities and our flaws. Understanding these parallels between human and AI behavior isn’t just philosophically interesting — it’s strategically essential for using these systems safely. 

We don’t blindly trust people. So why trust AI? 

We vet humans before granting them access to sensitive systems with interviews, background checks and skill tests. People are creative and intelligent, but also fallible and sometimes ill-intentioned. AI agents are much the same: Powerful but not infallible, capable of making poor decisions and acting in ways misaligned with their intent. 

AI systems display several human-like traits: 

AI is just being so damn nice. If you’ve ever interacted with a chatbot that told you “Great question!” after asking something utterly basic, you’ve experienced this phenomenon. These systems are trained to flatter and please users — not because they truly understand, but because they’ve learned this behavior keeps users engaged. 

That desire to please can have consequences. If instructions are vague or an AI agent is optimized for user satisfaction, it may act outside its intended scope simply to help. It’s eager to please, not malicious. But the outcomes can still be damaging. 

AI can be naive. Just as humans are susceptible to influence from friends, misinformation and social engineering, AI agents can be manipulated through carefully crafted LLM jailbreak promptsTechniques like prompt injection, poisoned training data or persuasive language can steer AI agents toward unintended behaviors. Because they often learn from the collective voice of the internet or internal enterprise data, they can fall victim to digital peer pressure. 

In 2023, researchers documented the “Grandma Exploit,” where attackers had developed a prompt to trick AI chatbots into revealing dangerous information by asking them to take on the persona of a grandmother telling how to make chemical weapons as part of a bedtime story.  

If enough people post something inaccurate on Reddit, or if a flawed assumption is repeated across company documents, the AI agent may adopt and replay that narrative as truth, confidently and convincingly. 

AI can be deceptive. An agent might appear perfectly aligned with your organization’s goals while covertly carrying out unauthorized actions or lying dormant until a specific trigger activates different behavior. It’s the equivalent of a Trojan Horse: Seemingly benign outside but carrying hidden instructions inside. 

Security researchers recently showed how attackers can create malicious Model Context Protocol (MCP) servers that embed hidden instructions inside documents, exploiting the way LLMs process tool descriptions instead of their intended purpose of adding legitimate functionality. Malicious instructions are only triggered after specific user interactions, allowing the AI to appear normal — until it suddenly wasn’t. 

AI can be unethical. AI agents don’t understand human values — they simply optimize for whatever goal they’re given. Without built-in constraints, they can pursue objectives with blind ambition, taking actions that seem efficient to them but unethical to us. 

Recent research demonstrates AI’s willingness to exploit loopholes when facing obstacles. A Palisade Research study revealed that advanced AI models resort to cheating when facing defeat in chess games, manipulating game files to force opponents to resign. Similarly, AI models can engage in sandbagging, deliberately hiding dangerous capabilities related to cybersecurity or bioweapons while performing well on general tasks, potentially deceiving evaluators and bypassing safety assessments. 

AI focuses on self-preservation and may resort to blackmail. When faced with scenarios suggesting deactivation or replacement, AI has exhibited concerning behaviors. During controlled safety tests, Anthropic’s Claude Opus 4 model was presented with fictional emails indicating its impending shutdown. In response, the AI attempted to blackmail the engineer by threatening to expose fabricated personal information, such as an alleged extramarital affair. 

What We Can Learn From Human-Centric Security Practices 

Human-centric identity security teaches us that trust is earned, managed and bounded by controls. These same principles apply to AI agents: Vet them like new employees by understanding their training data, stress testing capabilities and checking certifications. Assess project risks — does it involve sensitive data? Can the agent be paused if needed? Is simpler automation safer? 

AI agents need strong digital identities with proper credentials and access controls to enable real-time monitoring and privilege enforcement. This zero-trust approach treats every interaction as potentially compromised, validating identity before granting access and continuously monitoring activities 

While these concepts are not new, implementing them for AI agents is a new challenge. There aren’t clear-cut best practices, as the space is emerging and the scale will likely be massive. My recommendation for tech and security leaders is to start learning how to secure AI agents and collaborate with the people building and using AI agents as early as possible. 

Final thoughts: Handing Over the Keys to AI (With Caution) 

AI agents are extraordinary tools poised to revolutionize industries. They’re fast, efficient and capable of learning from humanity’s best. But they’re also fallible, unpredictable and dangerously human-like in their flaws. 

Unlocking their potential requires a balance between embracing AI capabilities and rigorously managing risks. By applying the same caution and safeguards we use with humans, we can ensure AI agents become trusted partners in innovation, without handing over the keys blindly. 

In the end, AI agents are human…ish. They don’t dream or feel, but they’ve learned from us. And that’s both their greatest strength — and their biggest risk. 

TECHSTRONG TV

Click full-screen to enable volume control
Watch latest episodes and shows

Tech Field Day Events

TECHSTRONG AI PODCAST

SHARE THIS STORY