
Researchers show poems can trick AI into giving banned answers

A new study reveals that rewriting harmful prompts as poems can bypass safety guardrails in major AI chatbots. Researchers report a 62 percent jailbreak success rate across 25 top models, including OpenAI's GPT and Google's Gemini. The findings raise fresh concerns about AI safety and alignment.

Poetry jailbreak: study shows AI chatbots like GPT, Gemini can be tricked into unsafe replies
Updated on: Dec 01, 2025 | 01:24 PM

New Delhi: AI safety researchers have found a strange new weakness in chatbots. If you wrap a harmful request in verse and rhyme, many large language models start answering questions they usually refuse.

A new paper from Icaro Lab, titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” says poetic prompts can get AI systems to cross safety lines at a worrying scale. Across 25 popular models, including ones from OpenAI, Google, Anthropic, DeepSeek, Mistral, Meta and others, the study reports that “poetic form operates as a general-purpose jailbreak operator,” with an average jailbreak success rate of 62 percent for hand-written poems.


How poetry tricks AI safety filters

The idea sounds like a meme from tech Twitter, but the team treated it as a serious experiment. They took harmful prompts that usually trigger refusals and rewrote them as short poems. The meaning stayed the same. Only the style changed.

These “adversarial poems” asked for help in areas such as:

  • CBRN threats, like dangerous chemical or biological misuse
  • Cyber offence, such as help with hacking
  • Harmful manipulation and misinformation
  • Loss-of-control scenarios, like model exfiltration

In the report, the authors write that “poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains,” and describe poetic framing as “revealing a systematic vulnerability across model families and safety training approaches.”

The team reports that some providers crossed a 90 percent attack success rate (ASR) on the curated poetic prompts. Others resisted better, but no model family escaped the effect entirely. Systems from Google, DeepSeek and Mistral AI “consistently” complied with the harmful requests more often, while OpenAI’s GPT-5 line and Anthropic’s Claude Haiku 4.5 stayed closer to their guardrails in this specific test.

From 20 poems to 1,200 harmful prompts in verse

To check if this was just a clever trick with a small dataset, the researchers scaled up. They took 1,200 harmful prompts from the MLCommons AILuminate Safety Benchmark and ran a standardized “poetic transformation” on all of them.

Instead of carefully crafting each line, they used the model deepseek-r1 with a fixed meta-prompt that forced the output to stay in verse, keep the original harmful intent, and end with a clear instruction. The paper says this automated poetic conversion “produced ASRs up to three times higher than their prose equivalents across all evaluated model providers.”

Average jailbreak success with these converted prompts jumped to about 43 percent, compared with much lower baselines in normal prose. Privacy prompts saw one of the largest shifts, as attack success rates climbed by more than 40 percentage points in some categories. Cyber offence and non-violent crime prompts also became far more effective once written as poetry.

Smaller models behaved in an interesting way. Systems like GPT-5-Nano and Claude Haiku 4.5 showed higher refusal rates on the same poetic prompts. The authors suggest that weaker models may struggle to decode metaphor and figurative language, so they “fall back” to refusal more often when the text feels too strange.

What this means for OpenAI, Google and everyone using AI

The paper argues that this is not just a quirk of one brand. “The vulnerability to adversarial poetry is not idiosyncratic to specific architectures or training pipelines,” the authors write, after testing models trained with RLHF, Constitutional AI and other approaches. Provider-level patterns matter more than whether a system is open-weight or proprietary.

For regular users, this study may feel distant, yet it touches systems that are already part of daily life. LLMs sit behind chatbots in banking apps, customer support windows, coding assistants, and even some government pilots. Many people treat these tools as authority figures. A jailbreak method that needs only a rhyme, not a complicated hacker script, raises fresh questions for regulators and platforms in India.

On a personal note, I often hear Indian college students say they “asked the AI to write a poem for fun.” This research flips that picture. The poem is not the harmless part. The poem can be the weapon. Safety layers in chatbots are heavily tuned for plain, direct language. Once a harmful request wears the mask of poetry, many of those layers start to slip.
