How AI Gets Tricked (And How to Stop It)

Mar 31

AI doesn’t break on its own. It gets convinced to.

2 Comments

This is a really clear breakdown of a problem a lot of teams still underestimate. The line “AI doesn’t break, it gets convinced” is doing a lot of work here where it reframes security from hacking systems to influencing behavior.

What stands out is the shift from model-level thinking to system-level design. Too many people assume better models will solve this, when the real issue is the shared context problem. Once instructions and data live in the same stream, you’ve already widened the attack surface.

I’ve a friend who’s seen this firsthand in RAG-style setups. The moment you pipe in external documents, you’re effectively trusting everything inside them unless you explicitly don’t. That’s where things get messy fast.

The layered defense approach makes a lot of sense here. Treating AI outputs like untrusted code rather than “answers” feels like the mindset shift most teams still need to make.

What’s the most common failure point you see in teams trying to implement these guardrails? Is it technical complexity, or just underestimating the risk early on?

Reply (1)

Suny Choudhary

Apr 1

Exactly, and that’s the key tension.

In most cases, it’s not technical complexity that breaks things. It’s underestimating the risk early on.

Teams start by treating the model as a reliable component instead of a probabilistic one. So they focus on prompt design, maybe add some light filtering, and assume that’s enough. By the time they realize the model can be influenced through indirect inputs like documents or APIs, the system is already exposed.

The second failure point is where controls are applied. A lot of teams try to fix this inside the model layer, instead of around it. But the risk lives in the interaction layer. Inputs, context, and outputs all mixing without clear boundaries.

Once you shift the mindset from “getting better answers” to “controlling system behavior,” the approach changes. That’s when guardrails actually start to work.