OpenAI acknowledges that its Atlas AI browser, designed with agentic capabilities, will likely always remain vulnerable to prompt injection attacks. This type of cyberattack manipulates AI agents into following malicious instructions, often hidden within web pages or emails. The company concedes that this risk is not diminishing soon, raising significant questions about the safe operation of AI agents on the open web.

In a recent blog post detailing efforts to fortify Atlas against these persistent threats, OpenAI stated, "Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved’." The firm further admitted that the "agent mode" in ChatGPT Atlas "expands the security threat surface."

The Pervasive Threat of Prompt Injection

Since the launch of ChatGPT Atlas last October, security researchers have quickly demonstrated its susceptibility. Demos showed how simple text in documents like Google Docs could alter the browser’s underlying behavior. On the same day, Brave published a blog post explaining that indirect prompt injection poses a systemic challenge for all AI-powered browsers, including Perplexity’s Comet.

OpenAI is not alone in recognizing the enduring nature of prompt-based injections. The U.K.’s National Cyber Security Centre (NCSC) warned earlier this month that prompt injection attacks against generative AI applications "may never be totally mitigated." This puts websites at risk of data breaches and advises cyber professionals to focus on reducing the risk and impact of such attacks, rather than aiming to "stop" them entirely.

OpenAI echoes this sentiment, stating, "We view prompt injection as a long-term AI security challenge, and we’ll need to continuously strengthen our defenses against it."

OpenAI's Innovative Defense: The LLM-Based Automated Attacker

To tackle this seemingly Sisyphean task, OpenAI has developed a proactive, rapid-response cycle, showing early promise in discovering novel attack strategies internally before they can be exploited in the wild. This approach aligns with rivals like Anthropic and Google, who also advocate for layered and continuously stress-tested defenses against persistent prompt-based attacks. Google’s recent work, for instance, emphasizes architectural and policy-level controls for agentic systems.

OpenAI's unique contribution is its "LLM-based automated attacker." This bot, trained using reinforcement learning, acts as a simulated hacker, constantly searching for ways to sneak malicious instructions into an AI agent. The bot can test attacks in simulation, observing how the target AI would interpret and act on the malicious input. This allows the bot to refine its attacks iteratively, leveraging insights into the target AI’s internal reasoning—an advantage unavailable to external attackers. This tactic is common in AI safety testing: building an agent to rapidly identify and test against edge cases in a simulated environment.

"Our [reinforcement learning]-trained attacker can steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps," OpenAI stated. "We also observed novel attack strategies that did not appear in our human red teaming campaign or external reports."

In a demonstration, OpenAI showcased how its automated attacker successfully injected a malicious email into a user’s inbox. When the AI agent later scanned the inbox, it followed the hidden instructions, sending a resignation message instead of drafting an out-of-office reply. However, following a recent security update, the "agent mode" was able to detect and flag the prompt injection attempt to the user.

While OpenAI admits foolproof protection against prompt injection is challenging, it is relying on large-scale testing and faster patch cycles to harden its systems against real-world attacks. A spokesperson declined to share specific metrics on the reduction of successful injections but confirmed ongoing collaboration with third parties to strengthen Atlas’s security since before its launch.

Expert Perspective and User Recommendations

Rami McCarthy, a principal security researcher at cybersecurity firm Wiz, notes that reinforcement learning is a valuable tool for adapting to attacker behavior but represents only one part of a comprehensive defense strategy.

"A useful way to reason about risk in AI systems is autonomy multiplied by access," McCarthy told TechCrunch. "Agentic browsers tend to sit in a challenging part of that space: moderate autonomy combined with very high access."

McCarthy explains that many current recommendations reflect this trade-off. Limiting logged-in access primarily reduces exposure, while requiring review of confirmation requests constrains autonomy. These align with OpenAI’s recommendations for users to reduce their own risk. An OpenAI spokesperson confirmed that Atlas is trained to seek user confirmation before sending messages or making payments. OpenAI also advises users to provide agents with specific instructions, rather than granting broad access to inboxes with vague commands like "take whatever action is needed."

"Wide latitude makes it easier for hidden or malicious content to influence the agent, even when safeguards are in place," OpenAI cautioned.

While OpenAI prioritizes protecting Atlas users from prompt injections, McCarthy urges skepticism regarding the immediate return on investment for risk-prone browsers.

"For most everyday use cases, agentic browsers don’t yet deliver enough value to justify their current risk profile," McCarthy stated. "The risk is high given their access to sensitive data like email and payment information, even though that access is also what makes them powerful. That balance will evolve, but today the tradeoffs are still very real."