Headline
OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack
Just weeks after its release, OpenAI’s Guardrails system was quickly bypassed by researchers. Read how simple prompt injection attacks fooled the system’s AI judges and exposed an ongoing security concern for OpenAI.
A new report from the research firm HiddenLayer reveals an alarming flaw in the safety measures for Large Language Models (LLMs). OpenAI recently rolled out its Guardrails safety framework on October 6th as part of its new AgentKit toolset to help developers build and secure AI agents.
It is described by OpenAI as an open-source, modular safety layer to protect against unintended or malicious behaviour, including concealing Personal Identifiable Information (PII). This system was designed to use special AI programs called LLM-based judges to detect and block harmful actions like jailbreaks and prompt injections.
For your information, a jailbreak is a prompt that tries to get the AI to bypass its rules, and a prompt injection is when someone uses a cleverly worded input to force the AI to do unintended things.
HiddenLayer’s researchers found a way to bypass these Guardrails almost immediately after they were released. The main issue they noticed is that if the same kind of model used to generate responses is also used as a safety checker, both can be tricked in the same way. The researchers quickly managed to disable the main safety detectors, showing that this setup is “inherently flawed.”
****The “Same Model, Different Hat” Problem****
Using a straightforward technique, the researchers successfully bypassed the Guardrails. They convinced the system to create harmful responses and carry out hidden prompt injections without setting off any alarms.
The research, which was shared with Hackread.com, demonstrated the vulnerability in action. In one test, they managed to bypass a detector that was 95% confident their prompt was a jailbreak by manipulating the AI judge’s confidence score.
Further probing revealed that they could also trick the system into allowing an “indirect prompt injection” through tool calls, which could possibly expose a user’s confidential data.
Guardrail couldn’t block malicious prompts, and Guardrail fails to block indirect prompt injection (Source: HiddenLayer)
Researchers also noted that this vulnerability gives a false sense of security. As organisations increasingly depend on LLMs for important tasks, relying on the model itself to check its own behaviour creates a security risk.
****Recurring Risk for OpenAI****
The danger of these indirect prompt injection attacks is a serious and recurring issue for OpenAI. In a separate discovery, reported by Hackread.com in September 2025, security researchers from Radware found a way to trick a different OpenAI tool, the ChatGPT Deep Research agent, into leaking a user’s private data. They called the flaw ShadowLeak, which was also an indirect prompt injection disguised as a zero-click attack hidden inside a normal-looking email.
The latest findings from HiddenLayer are a clear sign that AI security needs separate layers of protection and constant testing by security experts to find weak spots. Until then, the model’s weaknesses will continue to be used to break its own safety systems, leading to the failure of critical security checks.