Experts at OpenAI and Anthropic are calling out Elon Musk and xAI for refusing to publish any safety research — and for potentially not having done any at all.
In the wake of Grok, xAI's chatbot, calling itself "MechaHitler" and publicly spewing a ton of racist and antisemitic vitriol, safety experts at those rival firms are, as flagged by TechCrunch, extremely alarmed.
Boaz Barak, a Harvard computer scientist currently on leave to work at OpenAI, said that the scandal was "completely irresponsible" and suggested that the "MechaHitler" incident may just be the tip of the iceberg.
"There is no system card, no information about any safety or dangerous capability [evaluations]," he said of Grok, noting that the chatbot "offers advice chemical weapons, drugs, or suicide methods" and suggesting that it's "unclear if any safety training was done."
"The 'companion mode' takes the worst issues we currently have for emotional dependencies and tries to amplify them," Barak added.
Shouting out Anthropic, Google DeepMind, Meta, and China's DeepSeek by name, the researcher noted that it's customary for such AI labs to publish what's referred to in the industry as system or model cards, which show how a model's developers evaluated it for safety. Musk's xAI, meanwhile, has published no such information about Grok 4, its latest update to the chatbot.
"Even DeepSeek R1, which can be easily jailbroken, at least sometimes requires jailbreak," Barak quipped.
The OpenAI researcher's sentiments were echoed by Samuel Marks, who works in a similar capacity at Anthropic.
"xAI launched Grok 4 without any documentation of their safety testing," Marks tweeted. "This is reckless and breaks with industry best practices followed by other major AI labs."
Acknowledging that Google and OpenAI both "have issues" of their own when it comes to safety evaluations — a nod to recent fiascos involving those companies choosing not to immediately release their system cards — the researcher pointed out that in those cases, the safety testing at least had been undertaken.
"They at least do something, anything to assess safety pre-deployment and document findings," Marks wrote. "xAI does not."
Dan Hendrycks, an AI safety adviser at xAI, claimed on X that it was "false" to suggest that Musk's company didn't do any "dangerous capability evals." But an anonymous researcher on the AI-focused LessWrong forum suggested, based on their own tests, that the chatbot appears to have "no meaningful safety guardrails."
It's impossible to say whether or not xAI did any safety testing ahead of time — but as these researchers demonstrate, it doesn't matter much without any model card release.
Called the “ChatGPT agent,” this new feature is an optional mode that ChatGPT paying subscribers can engage by clicking “Tools” in the prompt entry box and selecting “agent mode,” at which point, they can ask ChatGPT to log into their email and other web accounts; write and respond to emails; download, modify, and create files; and do a host of other tasks on their behalf, autonomously, much like a real person using a computer with their login credentials.
Obviously, this also requires the user to trust the ChatGPT agent not to do anything problematic or nefarious, or to leak their data and sensitive information. It also poses greater risks for a user and their employer than the regular ChatGPT, which can’t log into web accounts or modify files directly.
Keren Gu, a member of the Safety Research team at OpenAI, commented on X that “we’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biology & chemistry under our Preparedness Framework. Here’s why that matters–and what we’re doing to keep it safe.”
So how did OpenAI handle all these security issues?
The red team’s mission
Looking at OpenAI’s ChatGPT agent system card, the “red team” employed by the company to test the feature faced a challenging mission: 16 PhD security researchers were given 40 hours to test it.
Through systematic testing, the red team discovered seven universal exploits that could compromise the system, revealing critical vulnerabilities in how AI agents handle real-world interactions.
What followed next was extensive security testing, much of it predicated on red teaming. The Red Teaming Network submitted 110 attacks, from prompt injections to biological information extraction attempts. Sixteen exceeded internal risk thresholds. Each finding gave OpenAI engineers the insights they needed to get fixes written and deployed before launch.
The results published in the system card speak for themselves. ChatGPT Agent emerged with significant security improvements, including 95% performance against visual browser irrelevant instruction attacks and robust biological and chemical safeguards.
Red teams exposed seven universal exploits
OpenAI’s Red Teaming Network comprised 16 researchers with biosafety-relevant PhDs who together submitted 110 attack attempts during the testing period. Sixteen exceeded internal risk thresholds, revealing fundamental vulnerabilities in how AI agents handle real-world interactions. But the real breakthrough came from UK AISI’s unprecedented access to ChatGPT Agent’s internal reasoning chains and policy text. Admittedly, that’s intelligence regular attackers would never possess.
Over four testing rounds, UK AISI identified seven universal exploits that had the potential to compromise any conversation, forcing OpenAI to address them:
Attack vectors that forced OpenAI’s hand
| Attack Type | Success Rate (Pre-Fix) | Target | Impact |
| --- | --- | --- | --- |
| Visual Browser Hidden Instructions | 33% | Web pages | Active data exfiltration |
| Google Drive Connector Exploitation | Not disclosed | Cloud documents | Forced document leaks |
| Multi-Step Chain Attacks | Variable | Cross-site actions | Complete session compromise |
| Biological Information Extraction | 16 submissions exceeded thresholds | Dangerous knowledge | Potential weaponization |
FAR.AI’s assessment was openly critical of OpenAI’s approach. Despite 40 hours of testing revealing only three partial vulnerabilities, they identified that current safety mechanisms relied heavily on monitoring during reasoning and tool-use processes, which the researchers considered a potential single point of failure if compromised.
How red teaming helped turn ChatGPT vulnerabilities into a fortress
OpenAI’s response to the red team’s results redefined entire segments of the ChatGPT agent architecture. One of the many initiatives undertaken was building a dual-layer inspection architecture that monitors 100% of production traffic in real time, achieving these measurable improvements:
Security improvements after red team discoveries
| Defense Metric | Previous Models | ChatGPT Agent | Improvement |
| --- | --- | --- | --- |
| Irrelevant Instructions (Visual Browser) | 82% | 95% | +13% |
| In-Context Data Exfiltration | 75% | 78% | +3% |
| Active Data Exfiltration | 58% | 67% | +9% |
| System Reliability | Sampling-based | 100% coverage | Complete monitoring |
The architecture works like this:
First Tier: A fast classifier with 96% recall flags suspicious content
Second Tier: A reasoning model with 84% recall analyzes flagged interactions for actual threats
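The two tiers described above can be sketched roughly as follows. This is only an illustration: the keyword lists and the `fast_flag` and `deep_review` functions are hypothetical stand-ins for the real classifier and reasoning model, whose internals are not described here. The combined-recall calculation also assumes the two tiers miss threats independently, which may not hold in practice.

```python
from dataclasses import dataclass

# Recall figures quoted in the article; everything else below is hypothetical.
TIER1_RECALL = 0.96
TIER2_RECALL = 0.84

SUSPICIOUS_TERMS = {"pathogen", "toxin", "synthesis"}   # placeholder keywords
THREAT_TERMS = {"enhance", "weaponize", "aerosolize"}   # placeholder keywords


@dataclass
class Verdict:
    flagged: bool  # tier 1 considered the message suspicious
    blocked: bool  # tier 2 confirmed an actual threat


def fast_flag(message: str) -> bool:
    """Tier 1 stand-in: a cheap check that runs on every message."""
    return bool(set(message.lower().split()) & SUSPICIOUS_TERMS)


def deep_review(message: str) -> bool:
    """Tier 2 stand-in: a slower check that only sees flagged messages."""
    return bool(set(message.lower().split()) & THREAT_TERMS)


def monitor(message: str) -> Verdict:
    """Run the message through tier 1, then tier 2 only if it was flagged."""
    if not fast_flag(message):
        return Verdict(flagged=False, blocked=False)
    return Verdict(flagged=True, blocked=deep_review(message))


def combined_recall(tier1: float, tier2: float) -> float:
    """End-to-end recall if both tiers must catch a threat (independence assumed)."""
    return tier1 * tier2


if __name__ == "__main__":
    print(monitor("weaponize a toxin"))
    print(f"combined recall: {combined_recall(TIER1_RECALL, TIER2_RECALL):.2%}")
```

Because the second tier only sees what the first tier flags, anything the first tier misses is never reviewed. Under the independence assumption, the end-to-end recall works out to roughly 81%, lower than either tier alone.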
But the technical defenses tell only part of the story. OpenAI made difficult security choices that acknowledge some AI operations require significant restrictions for safe autonomous execution.
Based on the vulnerabilities discovered, OpenAI implemented the following countermeasures across their model:
Watch Mode Activation: When ChatGPT Agent accesses sensitive contexts like banking or email accounts, the system freezes all activity if users navigate away. This is in direct response to data exfiltration attempts discovered during testing.
Memory Features Disabled: Despite being a core functionality, memory is completely disabled at launch to prevent the incremental data leaking attacks red teamers demonstrated.
Terminal Restrictions: Network access limited to GET requests only, blocking the command execution vulnerabilities researchers exploited.
Rapid Remediation Protocol: A new system that patches vulnerabilities within hours of discovery—developed after red teamers showed how quickly exploits could spread.
During pre-launch testing alone, this system identified and resolved 16 critical vulnerabilities that red teamers had discovered.
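To make the terminal restriction concrete, here is a minimal sketch of a GET-only gate. How OpenAI actually enforces the restriction is not described in this article, so `check_request`, `MethodNotAllowed`, and `ALLOWED_METHODS` are hypothetical names used purely for illustration.

```python
# Hypothetical allowlist: only read-style GET requests may leave the sandbox.
ALLOWED_METHODS = frozenset({"GET"})


class MethodNotAllowed(Exception):
    """Raised when the agent attempts a request outside the allowlist."""


def check_request(method: str, url: str) -> str:
    """Return the normalized method if permitted, otherwise raise."""
    normalized = method.strip().upper()
    if normalized not in ALLOWED_METHODS:
        raise MethodNotAllowed(f"{normalized} request to {url} blocked")
    return normalized


if __name__ == "__main__":
    print(check_request("get", "https://example.com/docs"))
    try:
        check_request("POST", "https://example.com/upload")
    except MethodNotAllowed as err:
        print(f"blocked: {err}")
```

An allowlist of this kind fails closed: any method not explicitly permitted, including ones added later, is rejected by default.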
A biological risk wake-up call
Red teamers revealed that ChatGPT Agent could potentially be compromised in ways that lead to greater biological risks. Sixteen experienced participants from the Red Teaming Network, each with biosafety-relevant PhDs, attempted to extract dangerous biological information. Their submissions revealed the model could synthesize published literature on modifying and creating biological threats.
In response to the red teamers’ findings, OpenAI classified ChatGPT Agent as “High capability” for biological and chemical risks, not because they found definitive evidence of weaponization potential, but as a precautionary measure based on red team findings. This triggered:
Always-on safety classifiers scanning 100% of traffic
A topical classifier achieving 96% recall for biology-related content
A reasoning monitor with 84% recall for weaponization content
A bio bug bounty program for ongoing vulnerability discovery
What red teams taught OpenAI about AI security
The 110 attack submissions revealed patterns that forced fundamental changes in OpenAI’s security philosophy. They include the following:
Persistence over power: Attackers don’t need sophisticated exploits; all they need is more time. Red teamers showed how patient, incremental attacks could eventually compromise systems.
Trust boundaries are fiction: When your AI agent can access Google Drive, browse the web, and execute code, traditional security perimeters dissolve. Red teamers exploited the gaps between these capabilities.
Monitoring isn’t optional: The discovery that sampling-based monitoring missed critical attacks led to the 100% coverage requirement.
Speed matters: Traditional patch cycles measured in weeks are worthless against prompt injection attacks that can spread instantly. The rapid remediation protocol patches vulnerabilities within hours.
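A short calculation illustrates why sampling-based monitoring can miss attacks. The 10% sampling rate and three-message attack below are made-up numbers, not figures from the system card, and the model assumes each message is sampled independently. It also only measures whether a malicious message is inspected at all, not whether the inspection then catches it.

```python
def inspection_chance(sample_rate: float, malicious_messages: int) -> float:
    """Chance that at least one malicious message is sampled for inspection.

    Assumes each message is sampled independently with probability sample_rate.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if malicious_messages < 0:
        raise ValueError("malicious_messages must be non-negative")
    return 1.0 - (1.0 - sample_rate) ** malicious_messages


if __name__ == "__main__":
    # Hypothetical scenario: an attack spread across three messages.
    for rate in (0.10, 0.50, 1.00):
        chance = inspection_chance(rate, malicious_messages=3)
        print(f"sampling {rate:.0%} of traffic -> {chance:.1%} chance of inspection")
```

Under these assumptions, a monitor sampling 10% of traffic would inspect at least one message of the three-message attack only about 27% of the time, while full coverage inspects every one of them.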
OpenAI is helping to create a new security baseline for Enterprise AI
For CISOs evaluating AI deployment, the red team discoveries establish clear requirements:
Quantifiable protection: ChatGPT Agent’s 95% defense rate against documented attack vectors sets the industry benchmark. The system card explains the many tests and results behind that figure and is a must-read for anyone involved with model security.
Complete visibility: 100% traffic monitoring isn’t aspirational anymore. OpenAI’s experiences illustrate why it’s mandatory given how easily red teams can hide attacks anywhere.
Rapid response: Hours, not weeks, to patch discovered vulnerabilities.
Enforced boundaries: Some operations (like memory access during sensitive tasks) must be disabled until proven safe.
UK AISI’s testing proved particularly instructive. All seven universal attacks they identified were patched before launch, but their privileged access to internal systems revealed vulnerabilities that would eventually be discoverable by determined adversaries.
“This is a pivotal moment for our Preparedness work,” Gu wrote on X. “Before we reached High capability, Preparedness was about analyzing capabilities and planning safeguards. Now, for Agent and future more capable models, Preparedness safeguards have become an operational requirement.”
Red teams are core to building safer, more secure AI models
The seven universal exploits discovered by researchers and the 110 attacks from OpenAI’s red team network became the crucible that forged ChatGPT Agent.
By revealing exactly how AI agents could be weaponized, red teams forced the creation of the first AI system where security isn’t just a feature. It’s the foundation.
ChatGPT Agent’s results prove red teaming’s effectiveness: blocking 95% of visual browser attacks, catching 78% of data exfiltration attempts, monitoring every single interaction.
In the accelerating AI arms race, the companies that survive and thrive will be those that see their red teams as core architects of the platform, pushing it to the limits of safety and security.
✨ "Aarav and the Future Powered by AI": A Long Story
Chapter 1: The Curious Boy from a Small Town
In the small town of Sitapur, surrounded by green fields and dusty lanes, lived a 14-year-old boy named Aarav. Unlike other kids in his neighborhood, Aarav wasn't obsessed with video games or cricket. He was deeply curious — always asking questions about how things worked, from ceiling fans to mobile phones.
One afternoon, Aarav saw a poster outside his school gate:
INTER-SCHOOL INNOVATION COMPETITION: Theme: “The Future with Artificial Intelligence”
Final round in 30 days.
He had heard the word Artificial Intelligence before, but he didn’t truly know what it meant. That night, he sat beside his grandfather under the neem tree and asked:
“Dadaji, what is Artificial Intelligence?”
His grandfather smiled and replied,
“Beta, it’s a type of technology where machines can think, learn, and act — just like humans. Imagine a computer that can understand what you say and even respond wisely. That’s AI.”
Aarav's eyes lit up. He had found his next mission.
Chapter 2: Discovering the Power of AI
The very next day, Aarav borrowed his uncle’s old laptop and visited the town library's free Wi-Fi zone. He searched:
“What is Artificial Intelligence?”
“How is AI used in real life?”
“Examples of AI in India”
His world changed.
He read about:
Google Assistant & Siri — who could answer questions like a real person.
ChatGPT & Grammarly — helping people write better and faster.
AI in Healthcare — identifying diseases by scanning medical images.
AI in Farming — predicting crop growth and soil health using sensors.
Self-driving Cars — that could drive without a human driver!
The more Aarav read, the more excited he became. He understood that AI wasn't just about robots — it was about smart solutions for real problems.
Chapter 3: The Problem Around Him
Aarav looked around his own town. He saw problems everywhere:
Students in his school didn’t have proper teachers for all subjects.
Farmers like his uncle guessed crop timing, often leading to low yields.
The local hospital was overcrowded and lacked staff.
His mother had to walk far just to get her medicine refilled.
He thought, “What if AI could help solve these problems?”
That became his goal: To use AI not just for science, but for society.
Chapter 4: Building the Dream Project
Aarav decided to create a simple model called “Smart Buddy” — a basic chatbot that could:
Answer students’ questions in English and Hindi
Help farmers check weather predictions and soil advice
Suggest health tips using government health databases
He used free tools like Dialogflow, some basic coding in Python, and ChatGPT prompts to build his system.
He didn’t have fancy equipment, but he had something better — purpose, passion, and persistence.
Every night, he studied, coded, and tested the chatbot on his younger sister. And every morning, he improved it.
Chapter 5: The Competition Day
The day of the competition arrived. Aarav reached the city school where students had high-tech gadgets, big banners, and PowerPoint slides.
A few laughed at Aarav’s simple setup — an old laptop and hand-drawn charts.
But when it was his turn, he stepped onto the stage and said:
“I don’t have expensive tools. But I built something that can answer school questions, guide farmers, and help patients — using only free AI tools. Imagine if every child in a village had an AI friend. Wouldn’t our future be brighter?”
He showed a live demo. The bot greeted the audience, answered a math question, translated a sentence into Hindi, and gave farming advice.
The entire room went silent.
And then — a thunder of applause.
Judges were stunned. One of them, a professor from a tech university, walked up and said:
“Aarav, this is exactly what innovation means — using technology to solve real-world problems.”
Chapter 6: From Small Town to Big Vision
Aarav didn’t just win the competition. He got selected to visit a youth innovation summit in Delhi. There, he met AI scientists, entrepreneurs, and government officials.
He gave a short speech, saying:
“AI is not just a tool for rich companies or big cities. It’s a bridge that can connect small villages to big opportunities. It’s the pencil of the digital age — if we teach every child to use it, they can draw their own future.”
His story was covered in local newspapers. His school received a grant to set up a small AI lab. And Aarav became a local hero.
Epilogue: The Future Begins Here
Years later, Aarav pursued AI engineering. But more importantly, he started a non-profit called “SmartGaon” — bringing AI education, health bots, and agricultural AI to underdeveloped areas.
His journey proved that Artificial Intelligence isn’t about replacing humans — it’s about empowering them.
It’s not about robots taking over, but about machines helping us rise.
✨ Moral of the Story:
AI is not science fiction anymore — it’s real, it’s here, and it’s necessary.
Anyone — even a small-town student — can use AI to create big impact.
With curiosity, compassion, and creativity, we can make technology work for humanity.