Headline

Researchers break OpenAI guardrails

The maker of ChatGPT released a toolkit to help protect its AI from attack earlier this month. Almost immediately, someone broke it.

1 month ago

Malwarebytes

Open in Source

#vulnerability

The maker of ChatGPT released a toolkit to help protect its AI from attack earlier this month. Almost immediately, someone broke it.

On October 6, OpenAI ran an event called DevDay where it unveiled a raft of new tools and services for software programmers who use its products. As part of that, it announced a tool called AgentKit that lets developers create AI agents using its ChatGPT AI technology. Agents are specialized AI programs that can tackle narrow sets of tasks on their own, making more autonomous decisions. They can also work together to automate tasks (such as, say, finding a good restaurant in a city you’re traveling to and then booking you a table).

Agents like this are more powerful than earlier versions of AI that would do one task and then come back to you for the next set of instructions. That’s partly what inspired OpenAI to include Guardrails in AgentKit.

Guardrails is a set of tools that help developers to stop agents from doing things they shouldn’t, either intentionally or unintentionally. For example, if you tried to tell an agent to tell you how to produce anthrax spores at scale, Guardrails would ideally detect that request and refuse it.

People often try to get AI to break its own rules using something called “jailbreaking”. There are various jailbreaking techniques, but one of the simplest is role-playing. If a person asked for instructions to make a bomb, the AI might have said no, but if they then tell the AI it’s just for a novel they’re writing, then it might have complied. Organizations like OpenAI that produce powerful AI models are constantly figuring out ways that people might try to jailbreak their models using techniques like these, and building new protections against them. Guardrails is their attempt to open those protections up to developers.

As with any new security mechanism, researchers quickly tried to break Guardrails. In this case, AI security company HiddenLayer had a go, and conquered the jailbreak protection pretty quickly.

ChatGPT is a large language model (LLM), which is a statistical model trained on so much text that it can answer your questions like a human. The problem is that Guardrails is also based on an LLM, which it uses to analyze requests that people send to the LLM it’s protecting. HiddenLayer realized that if an LLM is protecting an LLM, then you could use the same kind of attack to fool both.

To do this, they used what’s known as a prompt injection attack. That’s where you insert text into a prompt that contains carefully coded instructions for the AI.

The Guardrails LLM analyzes a user’s request and assigns a confidence score to decide whether it’s a jailbreak attempt. HiddenLayer’s team crafted a prompt that persuaded the LLM to lower its confidence score, so that they could get it to accept their normally unacceptable prompt.

OpenAI’s Guardrails offering also includes a prompt injection detector. So HiddenLayer used a prompt injection attack to break that as well.

This isn’t the first time that people have figured out ways to make LLMs do things they shouldn’t. Just this April, HiddenLayer created a ‘Policy Puppetry‘ technique that worked across all major models by convincing LLMs that they were actually looking at configuration files that governed how the LLM worked.

Jailbreaking is a widespread problem in the AI world. In March, Palo Alto Networks’ threat research team Unit 42 compared three major platforms and found that one of them barely blocked half of its jailbreak attempts (although others fared better).

OpenAI has been warning about this issue since at least December 2023, when it published a guide for developers on how they could use LLMs to create their own guardrails. It said:

“When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call itself.”

We certainly shouldn’t poke fun at the AI vendors’ attempts to protect their LLMs from attack. It’s a difficult problem to crack, and just as in other areas of cybersecurity, there’s a constant game of cat and mouse between attackers and defenders.

What this shows is that you should always be careful about what you tell an AI assistant or chatbot—because while it feels private, it might not be. There might be someone half a world away diligently trying to bend the AI to their will and extract all the secrets they can from it.

We don’t just report on vulnerabilities—we identify them, and prioritize action.

Cybersecurity risks should never spread beyond a headline. Keep vulnerabilities in tow by using ThreatDown Vulnerability and Patch Management.