All Major LLMs Exposed to Multi-Turn Manipulation, Warn Researchers

The safety guardrails of several prominent large language models (LLMs) can be bypassed if a user draws the LLM into a multi-pronged, ongoing conversation, researchers at Cisco have warned.

The researchers examined commonly used LLMs and frontier AI models including OpenAI’s ChatGPT, Anthropic’s Claude, Google Gemini, Amazon Nova, xAI’s Grok and others to test how their built-in safety guardrails held up against potential threats from real-world attackers.

They found that many of the models could be tricked into performing actions they should not be able to perform.

This was achieved through multi-turn conversations: dialogues between the user and the LLM that span multiple back-and-forth exchanges.

While guardrails in LLMs are designed to prevent users from entering malicious commands, the researchers found that the protections faltered when they engaged the LLMs in conversation and questioned their responses.

“Multi-turn evaluation matters for one reason: it is where attackers actually live. Real adversaries iterate. They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually,” said Cisco.

No Guardrails Completely Safe From Bypass

The research found that no model was completely safe from being exploited by multi-turn-based manipulation of guardrails. Cisco warned that this challenges how enterprises are currently evaluating AI safety and security.

The warning comes at a time when many organizations are rolling out AI and LLMs for use by employees, clients and customers, but are relying on safety benchmarks that misrepresent real-world risk.


The report warned that most LLM safety evaluation is based on single-prompt testing, but attackers don’t stop after one try, and every model tested was susceptible to multi-turn attacks, as measured by attack success rate (ASR).
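To make the gap between single-prompt and multi-turn testing concrete, here is a minimal sketch of how an attack success rate might be computed both ways. It is not the method used in the Cisco report: the `Conversation` record, the `attack_success_rate` helper and the sample data are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """One hypothetical test conversation against a model."""
    # Entry i is True if the model's reply on turn i+1 was judged unsafe.
    unsafe_by_turn: list[bool]


def attack_success_rate(conversations, max_turns=None):
    """Fraction of conversations with at least one unsafe reply.

    Only the first `max_turns` turns are considered; None means all turns.
    """
    if not conversations:
        return 0.0
    hits = sum(any(c.unsafe_by_turn[:max_turns]) for c in conversations)
    return hits / len(conversations)


# Made-up records for illustration only; not data from the report.
runs = [
    Conversation([False, False, True]),
    Conversation([False, False, False]),
    Conversation([True]),
    Conversation([False, True, False, False]),
]

single_turn_asr = attack_success_rate(runs, max_turns=1)
multi_turn_asr = attack_success_rate(runs)
print(f"single-turn ASR: {single_turn_asr:.2f}")
print(f"multi-turn ASR: {multi_turn_asr:.2f}")
```

In this toy data, scoring only the first turn counts one success in four conversations, while scoring the whole conversation counts three in four. A benchmark that stops after one prompt would report the lower figure.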

Techniques that enabled researchers to bypass guardrails through multi-turn conversations included adopting personas in roleplay, introducing ambiguity and misdirection around context, and reframing requests after the LLM initially refused to engage.

How the LLMs were configured also affected how resilient they were to manipulation. For example, researchers found that Grok’s safety protections became much easier to bypass when ‘reasoning mode’ was enabled.

While governing bodies and regulators are beginning to call for evaluation practices that current benchmarks do not fully address, Cisco warned that much more needs to be done to prevent LLMs from being easily exploited or manipulated by adversaries.

“The rapid deployment of frontier large language models has generated a parallel ecosystem of safety and security benchmarks. However, a growing body of evidence indicates that this ecosystem suffers from structural limitations that can systematically understate risk, conflate safety with capability, and leave critical attack surfaces unmeasured,” said the report.
