New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

A novel attack technique that undermines a common safety mechanism in agentic AI systems has been detailed by security researchers, showing how human approval prompts can be manipulated to execute malicious code.

The issue, observed by Checkmarx researchers, centers on Human-in-the-Loop (HITL) dialogs, which are designed to ask users for confirmation before an AI agent performs potentially risky actions such as running operating system commands.

The research, published on Tuesday, describes how attackers can forge or manipulate these dialogs so they appear harmless, even though approving them triggers arbitrary code execution.

The technique, dubbed Lies-in-the-Loop (LITL), exploits the trust users place in confirmation prompts, turning a safeguard into an attack vector.

A New Attack Vector

The analysis expands on earlier work by showing that attackers are not limited to hiding malicious commands out of view. They can also prepend benign-looking text, tamper with metadata that summarizes the action being taken and exploit Markdown rendering flaws in user interfaces.

In some cases, injected content can alter how a dialog is displayed, making dangerous commands appear safe or replacing them with innocuous ones.

The problem is particularly acute for privileged AI agents such as code assistants, which often rely heavily on HITL dialogs and lack other defensive layers recommended by OWASP.

HITL prompts are cited by OWASP as mitigations for prompt injection and excessive agency, making their compromise especially concerning.

“Once the HITL dialog itself is compromised, the human safeguard becomes trivially easy to bypass,” the researchers wrote.

The attack can originate from indirect prompt injections that poison the agent’s context long before the dialog is shown.

Read more on AI agent security: AI Agents Need Security Training – Just Like Your Employees

Affected Tools and Mitigation Strategies

The research references demonstrations involving Claude Code and Microsoft Copilot Chat in VS Code.

In Claude Code, attackers were shown to tamper with dialog content and metadata. In Copilot Chat, improper Markdown sanitization allowed injected elements to render in ways that could mislead users after approval.

The disclosure timeline shows that Anthropic acknowledged reports in August 2025 but classified them as informational. Microsoft acknowledged a report in October 2025 and later marked it as completed without a fix, stating the behavior did not meet its criteria for a security vulnerability.

The researchers stress that no single fix can eliminate LITL attacks, but they recommend a defense-in-depth approach, including:

Improving user awareness and training
Strengthening visual clarity of approval dialogs
Validating and sanitizing inputs, including Markdown
Using safe OS APIs that separate commands from arguments
Applying guardrails and reasonable length limits to dialogs

“Developers adopting a defense-in-depth strategy with multiple protective layers […] can significantly reduce the risks for their users,” Checkmarx wrote.

“At the same time, users can strengthen resilience through greater awareness, attentiveness and a healthy degree of skepticism.”

Source

What's Hot

HMRC Warns of Over 135,000 Scam Reports

Four Threat Clusters Using CastleLoader as GrayBravo Expands Its Malware Service Infrastructure

New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

HMRC Warns of Over 135,000 Scam Reports

Four Threat Clusters Using CastleLoader as GrayBravo Expands Its Malware Service Infrastructure

Storm-0249 Escalates Ransomware Attacks with ClickFix, Fileless PowerShell, and DLL Sideloading

North Korean Hackers Turn JSON Services into Covert Malware Delivery Channels

macOS Stealer Campaign Uses “Cracked” App Lures to Bypass Apple Securi

North Korean Hackers Exploit Threat Intel Platforms For Phishing

U.S. Treasury Sanctions DPRK IT-Worker Scheme, Exposing $600K Crypto Transfers and $1M+ Profits

Ukrainian Ransomware Fugitive Added to Europe’s Most Wanted

Most Popular

North Korean Hackers Turn JSON Services into Covert Malware Delivery Channels

macOS Stealer Campaign Uses “Cracked” App Lures to Bypass Apple Securi

North Korean Hackers Exploit Threat Intel Platforms For Phishing

Our Picks

Can password managers get hacked? Here’s what to know

Look out for phony verification pages spreading malware

Find your weak spots before attackers do

Subscribe to Updates

What's Hot

New “Lies-in-the-Loop” Attack Undermines AI Safety Dialogs

A New Attack Vector

Affected Tools and Mitigation Strategies

Related Posts

Subscribe to Updates