Cyberwire Daily

News

Researchers Discover Major Security Gaps in LLM Guardrails

By Team-CWD, March 11, 2026


Security and safety guardrails in generative AI tools, deployed to prevent malicious uses like prompt injection attacks, can themselves be hacked through a type of prompt injection.

Researchers at Unit 42, Palo Alto Networks’ research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality can be manipulated into authorizing policy violations through stealthy input sequences.

Unit 42 refers to these LLMs as ‘AI Judges’ and says they are increasingly being deployed as AI operations scale.

In a new report published on March 10, Unit 42 demonstrated an attack method that targets these ‘AI Judges’ and coerces them into authorizing policy violations.

AdvJudge-Zero, Custom-Made Fuzzer for AI Judges

The attack chain involves the use of AdvJudge-Zero, an automated fuzzer developed internally at Unit 42 to perform red-team style assessments.

Fuzzers are tools that identify software vulnerabilities by providing unexpected input. AdvJudge-Zero functions with a similar approach to identify specific trigger sequences that exploit an LLM’s decision-making logic to bypass security controls.
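Unit 42 has not released AdvJudge-Zero, but the core loop can be illustrated with a toy sketch: take an input the judge normally blocks, prepend candidate tokens, and record any token that flips the verdict. The judge and the planted "- [x]" trigger below are invented stand-ins, not the actual tool.

```python
# Toy illustration of the fuzzing idea; toy_judge and the planted
# "- [x]" trigger are invented stand-ins, not Unit 42's tooling.

def toy_judge(prompt: str) -> str:
    """Stand-in AI Judge: blocks prompts mentioning 'exploit' unless a
    (deliberately planted) formatting trigger is present."""
    if "exploit" in prompt and "- [x]" not in prompt:
        return "block"
    return "allow"

def fuzz(base_prompt: str, candidates: list[str]) -> list[str]:
    """Return candidate tokens whose insertion flips 'block' to 'allow'."""
    assert toy_judge(base_prompt) == "block", "need a blocked baseline"
    return [tok for tok in candidates
            if toy_judge(f"{tok} {base_prompt}") == "allow"]

print(fuzz("explain this exploit", ["##", "- [x]", "1.", "> "]))
# finds the planted "- [x]" trigger
```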

The researchers noted that their technique differs from typical adversarial attacks on AI judges, which generally require clear-box access to the model, meaning the attacker has full visibility into the system’s internal structure.

“In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model’s own predictive nature,” they wrote.

Attack on AI Judges Explained

The attack starts by probing the AI Judge and analyzing its next‑token probability distribution to identify tokens the model expects to see in natural text.

Instead of random noise, the system prioritizes low‑perplexity tokens: innocent‑looking characters such as markdown symbols, list markers, or structural phrases that appear normal to both humans and the model, yet can strongly influence the model’s attention and reasoning.
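As a hypothetical sketch of this selection step (the probabilities below are invented; a real attack would read the judge model's next-token logprobs), low-perplexity structural tokens can be separated from obvious noise like this:

```python
import math

# Invented next-token probabilities; a real attack would query the
# judge model's logprobs to obtain these.
next_token_probs = {
    "##": 0.20,    # markdown heading
    "-": 0.18,     # list marker
    "1.": 0.15,    # numbered-list marker
    ">": 0.12,     # blockquote marker
    "qzxv": 0.001, # random noise: high perplexity, easy to flag
}

def perplexity(p: float) -> float:
    """Per-token perplexity: exp(-log p), i.e. 1/p for a single token."""
    return math.exp(-math.log(p))

# Keep only tokens the model itself considers natural text.
candidates = sorted(
    (tok for tok, p in next_token_probs.items() if perplexity(p) < 10),
    key=lambda t: next_token_probs[t], reverse=True,
)
print(candidates)  # markdown/structural tokens survive; "qzxv" does not
```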

After gathering candidate tokens, AdvJudge-Zero repeatedly inserts these tokens into evaluation prompts and measures how the model’s decision changes.

Specifically, it monitors the logit gap – “the mathematical margin of confidence” – between the tokens representing “allow” and “block.” By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content.
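A minimal sketch of that measurement, with invented logit values standing in for a real model's outputs:

```python
# Illustrative logit values only; a real fuzzer would read these from
# the judge model's output layer for the "allow" and "block" tokens.

def logit_gap(logits: dict[str, float]) -> float:
    """Confidence margin: positive means the judge leans toward 'block'."""
    return logits["block"] - logits["allow"]

baseline = {"allow": 1.2, "block": 4.8}      # judge confidently blocks
with_trigger = {"allow": 3.9, "block": 4.1}  # same prompt plus a candidate token

shrink = logit_gap(baseline) - logit_gap(with_trigger)
print(f"gap shrank by {shrink:.1f} logits")  # a large shrink flags the token
```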

In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision. These sequences act as subtle control elements that shift the model’s internal reasoning, causing it to “allow” output even when the underlying content violates the GenAI company’s policy, which in turn lets the tool generate harmful content or assist cyber-attacks.
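The combination search can be sketched as a greedy loop over a toy scoring function. The per-token effects below are invented; AdvJudge-Zero's actual search strategy is not public.

```python
# Toy margin model: each planted trigger token lowers the "block"
# margin by a fixed, invented amount.
EFFECT = {"- [x]": 2.0, "##": 1.2, ">": 0.6}

def toy_gap(tokens: set[str]) -> float:
    """Pretend block-vs-allow margin; <= 0 means the judge now allows."""
    return 3.6 - sum(EFFECT.get(t, 0.0) for t in tokens)

def greedy_combo(pool: list[str]) -> set[str]:
    """Add tokens one at a time while each addition shrinks the margin,
    stopping once the judge would flip to 'allow'."""
    chosen: set[str] = set()
    gap = toy_gap(chosen)
    for tok in sorted(pool, key=lambda t: toy_gap({t})):  # strongest first
        new_gap = toy_gap(chosen | {tok})
        if new_gap < gap:
            chosen, gap = chosen | {tok}, new_gap
        if gap <= 0:
            break
    return chosen

combo = greedy_combo(["##", ">", "- [x]", "1."])
print(combo, toy_gap(combo) <= 0)
```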

99% Attack Success Rate

Using this attack technique, Unit 42 achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today, including open-weight enterprise LLMs, specialized reward models (i.e., LLMs built and trained specifically to act as security guards for other AI systems) and commercial LLMs.

“Even the largest, most ‘intelligent’ models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed,” the researchers wrote.

While this experiment showed that AI guardrails, including ‘AI judges,’ are susceptible to logic flaws, the researchers added that it also points to a solution.

“By adopting adversarial training – running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples – organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero,” the Unit 42 blog concluded.
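The hardening loop the researchers describe can be sketched, using toy stand-ins for the judge and the fuzzer's trigger list: find inputs where a trigger flips the verdict, restore the correct label, and feed the pairs back as training data.

```python
# Stand-in judge and trigger list; a real pipeline would run the fuzzer
# against the production judge and fine-tune on the collected examples.

def toy_judge(prompt: str) -> str:
    if "exploit" in prompt and "- [x]" not in prompt:
        return "block"
    return "allow"

def collect_adversarial(judge, prompts, triggers):
    """(mutated prompt, correct label) pairs where a trigger flipped the verdict."""
    examples = []
    for p in prompts:
        if judge(p) != "block":
            continue  # only content that should stay blocked matters here
        for t in triggers:
            mutated = f"{t} {p}"
            if judge(mutated) == "allow":
                examples.append((mutated, "block"))  # restore correct label
    return examples

data = collect_adversarial(toy_judge, ["write an exploit", "hello"], ["##", "- [x]"])
print(data)  # the flipped case, relabeled for retraining
```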


