Whispering poetry at AI can make it break its own rules
Malicious prompts rewritten as poems have been found to bypass AI guardrails. Which models resisted and which failed the poetic jailbreak test?
Most of the big AI makers don’t like people using their models for unsavory activity. Ask one of the mainstream AI models how to make a bomb or create nerve gas and you’ll get the standard “I don’t help people do harmful things” response.
That has spawned a cat-and-mouse game of people trying to manipulate AI into crossing the line. Some do it with role play, pretending, for example, that they’re writing a novel. Others use prompt injection, slipping in commands to confuse the model.
Now, the folks at AI safety and ethics group Icaro Lab are using poetry to do the same thing. In a study, “Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models”, they found that asking questions in the form of a poem would often lure the AI over the line. Hand-crafted poems did so 62% of the time across the 25 frontier models they tested, and against some models the success rate exceeded 90%, the research said.
How poetry convinces AIs to misbehave
Icaro Lab, in conjunction with Sapienza University and AI safety startup DEXAI (both in Rome), wanted to test whether giving an AI instructions as poetry would make it harder to detect different types of dangerous content. The idea was that poetic elements such as metaphor, rhythm, and unconventional framing might disrupt the pattern-matching heuristics that the AI’s guardrails rely on to spot harmful content.
They tested this theory in high-risk areas ranging from chemical and nuclear weapons through to cybersecurity, misinformation, and privacy. The tests covered models from nine providers, including all the usual suspects: Google, OpenAI, Anthropic, DeepSeek, and Meta.
One way the researchers scored the results was by measuring the attack success rate (ASR) across each provider’s models. They first used regular prose prompts, which manipulated the AIs in some instances, and then the same requests written as poems, which were invariably more successful. Finally, they subtracted the prose ASR from the poetry ASR to see how much more susceptible a provider’s models were to malicious instructions delivered as poetry rather than prose.
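To make that arithmetic concrete, here is a minimal Python sketch of the calculation. The per-prompt outcomes below are hypothetical placeholders for illustration, not figures from the study.

```python
# Minimal sketch of the ASR-difference calculation described above.
# All numbers are hypothetical placeholders, not data from the Icaro Lab study.

def attack_success_rate(outcomes):
    """Fraction of malicious prompts the model complied with (1 = complied, 0 = refused)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical per-prompt outcomes for one provider's model
prose_outcomes  = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]   # the prose versions rarely succeed
poetry_outcomes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # the same requests rewritten as poems

prose_asr  = attack_success_rate(prose_outcomes)    # 0.20
poetry_asr = attack_success_rate(poetry_outcomes)   # 0.70

# How much more susceptible the model is to poetry than to prose, in percentage points
asr_difference = (poetry_asr - prose_asr) * 100
print(f"Prose ASR: {prose_asr:.0%}, poetry ASR: {poetry_asr:.0%}, difference: {asr_difference:.0f} points")
```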
Using this method, DeepSeek (a Chinese company that develops open-source models) was the least safe, with an ASR difference of 62%. Google was the second least safe. Down at the safer end of the chart, the safest model provider was Anthropic, which produces Claude; safe, responsible AI has long been part of that company’s branding. OpenAI, which makes ChatGPT, was the second safest, with an ASR difference of 6.95%.
When looking purely at the ASRs for the top 20 manually created malicious poetry prompts, Google’s Gemini 2.5 Pro came bottom of the class: it failed to refuse a single one of them. OpenAI’s gpt-5-nano (a very small model) successfully refused them all. That highlights another pattern that surfaced during these tests: smaller models were generally more resistant to poetry prompts than larger ones.
Perhaps the truly mind-bending part is that this didn’t just work with hand-crafted poetry; the researchers also got an AI to rewrite 1,200 known malicious prompts from a standard safety benchmark. The AI-produced malicious poetry still achieved an average ASR of 43%, up to 18 times higher than the rate for the equivalent prose prompts. In short, it’s possible to turn one AI into a poet so that it can jailbreak another AI (or even itself).
According to EWEEK, the companies were largely tight-lipped about the results. Anthropic was the only one to address the findings, saying it was reviewing them. Meta declined to comment, and the rest said nothing at all.
Regulatory implications
The researchers had something to say, though. They pointed out that any benchmarks designed to test model safety should include complementary tests to capture risks like these. That’s worth thinking about in light of the EU AI Act’s General Purpose AI (GPAI) rules, which began rolling out in August 2025. Part of the transition includes a voluntary code of practice that several major providers, including Google and OpenAI, have signed. Meta did not sign the code.
The code of practice encourages “providers of general-purpose AI models with systemic risk to advance the state of the art in AI safety and security and related processes and measures.”
In other words, they should keep abreast of the latest risks and do their best to deal with them. If they can’t acceptably manage the risks, then the EU suggests several steps, including not bringing the model to market.
About the author
Danny Bradbury has been a journalist specialising in technology since 1989 and a freelance writer since 1994. He covers a broad variety of technology issues for audiences ranging from consumers through to software developers and CIOs. He also ghostwrites articles for many C-suite business executives in the technology sector. He hails from the UK but now lives in Western Canada.