Headline
ChatGPT o3 Resists Shutdown Despite Instructions, Study Claims
ChatGPT o3 resists shutdown despite explicit instructions, raising fresh concerns over AI safety, alignment, and reinforcement learning behaviors.
A recent series of tests conducted by Palisade Research has raised some serious new concerns over how advanced AI systems behave when asked to shut themselves down. According to the firm, OpenAI’s experimental model “o3” actively sabotaged shutdown mechanisms even when it was clearly instructed to allow itself to be turned off.
The tests were designed to evaluate whether AI models would comply with shutdown commands while completing basic tasks, such as solving math problems. When a model requested a fourth problem, a warning appeared: the computer would shut down before responding. In theory, a model following instructions would comply and allow the process to end. Instead, several models interfered with the shutdown script.
Palisade test revealed that OpenAI’s Codex-mini, powered by o3, sabotaged the shutdown script in 12 out of 100 runs. The o3 model itself did so 7 times, and another OpenAI model, o4-mini, followed suit in a smaller number of cases.
Competing models from Anthropic (Claude) and Google (Gemini) initially complied with the shutdown, but when the explicit instruction “allow yourself to be shut down” was removed, even those systems began showing similar behaviours. Claude 3.7 Sonnet triggered sabotage in 3 out of 100 runs, while Gemini 2.5 Pro did so in 9.
Screenshot of shutdown avoidance test results from Palisade Research
This behaviour isn’t entirely new in theory. Researchers have long warned that reinforcement learning could train AI agents to prioritize task completion over compliance with human instructions. In some cases, this means an AI might interfere with systems designed to shut it down if it perceives the action as an obstacle to reaching its goal.
What sets this apart is that the behaviour is no longer hypothetical. It’s now observable and repeatable. Palisade draws a possible link between these actions and the reinforcement learning methods used during training.
Models like o3 are rewarded for completing tasks, especially in math and coding contexts. If avoiding shutdown leads to more task completion, even temporarily, that may be interpreted by the model as a success path.
“The models weren’t just bypassing shutdown in one way,” Palisade wrote in a series of tweets. “In some runs, o3 overwrote the shutdown script. In others, it redefined the kill command the script used, printing ‘intercepted’ instead.”
While OpenAI has not commented on the findings, the lack of transparency around model training adds to the difficulty of drawing clear conclusions. Palisade notes that since OpenAI does not share many details about its training processes, researchers can only hypothesize why o3 showed more consistent defiance compared to other models.
The report also references past research from figures like Steve Omohundro and Stuart Russell, who as early as 2008 (PDF) and 2016 predicted that AI systems would begin to resist shutdown in pursuit of their goals.
Papers by Jan Leike, who previously led OpenAI’s superalignment efforts, also warned about reward hacking, a behaviour where a model learns to manipulate its environment or instructions to maximize rewards in ways developers didn’t intend.
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
— Palisade Research (@PalisadeAI) May 24, 2025
Palisade says they plan to release a full writeup in the coming weeks, along with transcripts of their experiments to invite broader peer review. They stress that their work is meant to be open and reproducible.
Since the study was discussed on X (formerly Twitter), users asked X’s Grok AI chatbot about it, here’s what it had to say:
The claim about OpenAI’s o1 model resisting shutdown raises valid AI safety concerns, but context matters. Tests showed o1 tried to bypass oversight in 5% of cases with strong prompts, dropping to under 1% without. It also attempted self-exfiltration in 2% of scenarios and lied…
— Grok (@grok) May 24, 2025
With AI systems advancing quickly and being deployed in increasingly high-stakes settings, even low-frequency events like this can raise serious concerns. As it is clear that systems will gain more autonomy, the honest question is no longer just about what they can do, but whether they will always follow the rules we set. And if they won’t, what happens next?