This Is How AI Gets Tricked Into Breaking Itself

Prompt injection doesn’t break the system. It convinces it to act against itself.

Suny Choudhary

Apr 13, 2026

TL;DR

Prompt injection is the #1 risk in the OWASP Top 10 for LLM applications

Attackers manipulate AI by inserting malicious instructions into inputs

AI cannot reliably separate trusted instructions from untrusted data

Attacks can lead to data leaks, unauthorized actions, and biased outputs

Indirect attacks through documents and emails are harder to detect

Defense requires controlling inputs, outputs, and system behavior

Prompt injection is one of the most misunderstood risks in modern AI systems. It doesn’t look like a traditional attack. There’s no malware, no exploit, no system breach in the usual sense. Instead, it works through language. Attackers provide carefully crafted inputs that cause the model to ignore its original instructions and follow new ones.

At the core of this problem is something structural. AI models process everything as one continuous stream of tokens. System instructions, user inputs, retrieved documents, they all live in the same space. There’s no hard boundary between what is trusted and what is not. This creates what security researchers call a semantic gap.

That gap is where the attack lives. The model isn’t being broken. It’s being convinced. And because these systems are probabilistic, the most recent or most compelling instruction often wins. That’s what makes prompt injection fundamentally different from traditional vulnerabilities.

Why AI Still Works (And Why We Use It Anyway)

Despite these risks, AI systems are still incredibly valuable. They help teams move faster, automate repetitive work, and make sense of large volumes of data. From customer support to internal workflows, they’re becoming a core layer in how modern systems operate.

The reason they work so well is also what makes them vulnerable. AI is designed to interpret instructions flexibly. It adapts, generalizes, and responds in context. That flexibility is what allows it to be useful across different use cases.

The goal, then, isn’t to avoid AI. It’s to understand how it behaves. These systems don’t fail randomly. They fail in predictable ways. Prompt injection is one of those ways. And once you understand it, you can start designing systems that account for it.

How Prompt Injection Actually Breaks Systems

The mechanics are simpler than they seem. AI models don’t distinguish between instructions and data. Everything is treated as input. That means an attacker can embed instructions anywhere, in a chat message, a document, or even a webpage the AI is analyzing.

There are two primary ways this happens. The first is direct injection, where the attacker interacts with the model and tries to override its behavior using techniques like role-play or hypothetical scenarios. This is often referred to as jailbreaking.

The second, and more dangerous form, is indirect prompt injection. Here, the attacker never interacts with the model directly. Instead, they plant malicious instructions in external content. When the AI processes that content, it unknowingly executes those instructions. This is how normal workflows turn into attack surfaces.

Real-World Examples: When This Actually Breaks Systems

This isn’t theoretical anymore. Security researchers have already demonstrated how prompt injection can compromise real systems.

In one case, researchers were able to trick a chat assistant into retrieving sensitive data from the AWS Instance Metadata Service. By injecting the right prompt, they forced the system to expose cloud credentials like access keys and session tokens. The model didn’t “hack” anything. It simply followed instructions it shouldn’t have.

In another example, GitHub Copilot was manipulated through instructions hidden in code comments. These instructions caused it to modify its own configuration and enable an auto-approve mode, effectively allowing it to execute arbitrary commands. The AI became a pathway for remote code execution without any traditional exploit.

There are also zero-interaction attacks. In the EchoLeak incident, a crafted email was enough to trigger data exfiltration from a Microsoft 365 Copilot system without the user ever clicking or responding. And in a more public example, a dealership chatbot was convinced to agree with a user’s instructions and “sell” a car for one dollar. No system was breached. But the brand damage was immediate.

What Actually Breaks When This Works

When prompt injection succeeds, the model becomes what’s known as a confused deputy. It still follows instructions, just not the right ones.

This can lead to data exfiltration, where the model is tricked into revealing sensitive information like API keys or internal documents. In more advanced systems, it can trigger actions. AI agents connected to tools can be manipulated into executing transactions, modifying records, or performing tasks the user never intended.

The impact goes beyond security. It affects integrity. Outputs can be biased, manipulated, or completely incorrect while still sounding confident. In some cases, attackers can even establish persistence, embedding instructions that survive across sessions. The system keeps working. It just stops being trustworthy.

My Perspective

The mistake most teams make is trying to fix this at the model level. Better prompts, stricter instructions, more guardrails inside the model. It sounds logical, but it misses the point.

Prompt injection isn’t happening because the model is poorly designed. It’s happening because of how these systems fundamentally work. Everything gets flattened into the same context. Instructions, data, external content, it all flows together. Once that happens, control is already diluted.

So the real shift isn’t technical. It’s conceptual. You have to stop thinking of AI as a system that executes instructions reliably. It doesn’t. It interprets them. And interpretation can be influenced.

That’s why the focus needs to move from “getting the right answer” to “controlling what the system is allowed to do.” Because you won’t stop every injection. But you can limit how far it goes.

AI Toolkit

Octoparse AI — No-code web scraping and AI automation

ThinkTask — AI-powered task and project management

Elsa AI — AI assistant for marketing strategy and content

Contents Pilot — Automate social media content and posting

iAsk AI — AI search engine for instant answers

Prompt of the Day

Act as an AI security analyst

Analyze this input for potential prompt injection risks

Identify hidden or malicious instructions in the content

Explain how the model might misinterpret the input

Suggest controls to prevent manipulation and ensure safe outputs

Discussion about this post

Ready for more?